In [16]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
Web Scraping German Companies from Wikipedia using BeautifulSoup¶
Step-by-Step Documentation¶
Import Libraries
- Import requests and BeautifulSoup from bs4.
- Import pandas and matplotlib for data handling and plotting.
Fetch Web Page
- Define the URL of the Wikipedia page listing the largest German companies.
- Retrieve the webpage using requests.get.
Parse HTML Content
- Parse the HTML of the page using BeautifulSoup.
Locate the Table
- Find and select the second table in the HTML.
Extract Table Headers
- Find all <th> tags in the table.
- Extract and clean the text from these headers.
Create DataFrame
- Initialize a pandas DataFrame with column names taken from the table headers.
Extract Row Data
- Find all <tr> tags in the table.
- Loop through each row to extract cell data.
- Add the extracted data as rows to the DataFrame.
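The steps above can be sketched end-to-end on a toy inline table (the HTML below is invented for illustration; the real Wikipedia page contains several tables and more columns):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Invented miniature table standing in for the Wikipedia page
html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Volkswagen</td></tr>
  <tr><td>2</td><td>Uniper</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[0]  # the toy page has only one table

# Headers -> column names
headers = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=headers)

# Rows -> DataFrame rows (skip the header row)
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    df.loc[len(df)] = cells

print(df)
```

Note that all extracted values are strings at this point; numeric conversion happens in the Transform step further down.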
In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_German_companies'
# Fetch the current USD to EUR exchange rate for the revenue conversion later on
response = requests.get('https://api.exchangerate-api.com/v4/latest/USD')
exchange_rate = response.json()['rates']['EUR']
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
In [18]:
table = soup.find_all('table')[1]
In [19]:
columns_titles = table.find_all('th')
In [20]:
companies_table_titles = [title.text.strip() for title in columns_titles]
In [21]:
df = pd.DataFrame(columns=companies_table_titles)
In [22]:
column_data = table.find_all('tr')
In [23]:
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data
ETL Process for German Company Data¶
Overview¶
This Jupyter Notebook documents the ETL (Extract, Transform, Load) process applied to a dataset of the largest German companies. The data is sourced from a Wikipedia page.
Extract¶
- Data Extraction: Using requests and BeautifulSoup, data is extracted from the Wikipedia page listing the largest German companies.
Transform¶
- Data Cleaning: The extracted data undergoes several transformation steps, including:
- Converting string-formatted financial figures (revenues and profits) to numerical values, handling formats such as currency symbols and parentheses for negative numbers.
- Resolving issues caused by non-standard minus signs during string-to-float conversion.
- Reordering DataFrame columns and resolving SettingWithCopyWarning by using the .loc[] method.
- Typecasting certain columns, such as 'Rank' and 'Fortune 500 rank', from strings to integers.
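The string-to-float cleaning described above can be exercised in isolation; this is a minimal sketch on invented sample values that mimic the Wikipedia formatting, including the non-standard Unicode minus sign (U+2212):

```python
import pandas as pd

# Invented values mimicking the source formatting; '−' is U+2212, not a hyphen
raw = pd.Series(['$293,685 million', '−$19,961 million'])

# Strip thousands separators, currency symbols, and the ' million' suffix,
# and normalise the Unicode minus to a regular hyphen
cleaned = raw.replace({',': '', r'\$': '', ' million': '', '−': '-'}, regex=True)

# Convert to float and scale millions to absolute values
values = cleaned.astype(float) * 1e6

print(values.tolist())  # [293685000000.0, -19961000000.0]
```

Without the `'−': '-'` substitution, `astype(float)` raises a ValueError on the second entry, which is exactly the conversion error mentioned above.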
Conclusion¶
This notebook serves as a practical example of handling and transforming web-sourced data, showcasing essential data cleaning and manipulation techniques in Python.
In [24]:
# Strip commas, currency symbols, and the ' million' suffix, then scale to absolute USD
df['Revenue(USD)'] = df['Revenue(USD millions)'].replace({',': '', r'\$': '', ' million': ''}, regex=True).astype(float) * 1e6
# Convert USD to EUR using the exchange rate fetched earlier
df['Revenue (EUR)'] = df['Revenue(USD)'] * exchange_rate
In [25]:
# Replace non-standard minus sign and other characters
df['Profits(USD)'] = df['Profits(USD millions)'].replace({',': '', r'\$': '', ' million': '', '−': '-'}, regex=True)
# Convert to float and scale millions to absolute values
df['Profits(USD)'] = df['Profits(USD)'].astype(float) * 1e6
In [26]:
# Convert 'Rank' and 'Fortune 500rank' columns to integers using .loc[]
df.loc[:, 'Rank'] = df['Rank'].astype(int)
df.loc[:, 'Fortune 500rank'] = df['Fortune 500rank'].astype(int)
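The rank typecast above can be tried in isolation; a small sketch with invented sample values, using .loc[] as in the cell above to avoid a SettingWithCopyWarning when the frame might be a slice of another:

```python
import pandas as pd

# Invented sample: rank values arrive as strings from the scrape
df_demo = pd.DataFrame({'Rank': ['1', '2', '3']})

# .loc[:, col] assigns into the named column explicitly
df_demo.loc[:, 'Rank'] = df_demo['Rank'].astype(int)

print(df_demo['Rank'].tolist())  # [1, 2, 3]
```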
In [27]:
df = df[['Rank', 'Fortune 500rank', 'Name', 'Industry', 'Revenue(USD)', 'Revenue (EUR)', 'Profits(USD)', 'Headquarters']]
In [28]:
df
Out[28]:
| | Rank | Fortune 500rank | Name | Industry | Revenue(USD) | Revenue (EUR) | Profits(USD) | Headquarters |
|---|---|---|---|---|---|---|---|---|
0 | 1 | 15 | Volkswagen | Automotive | 2.936850e+11 | 2.660786e+11 | 1.522300e+10 | Wolfsburg |
1 | 2 | 16 | Uniper | Electric utility | 2.883090e+11 | 2.612080e+11 | -1.996100e+10 | Düsseldorf |
2 | 3 | 47 | Mercedes-Benz Group | Automotive | 1.577820e+11 | 1.429505e+11 | 1.525200e+10 | Stuttgart |
3 | 4 | 57 | BMW | Automotive | 1.499910e+11 | 1.358918e+11 | 1.887000e+10 | Munich |
4 | 5 | 67 | Allianz | Insurance | 1.290590e+11 | 1.169275e+11 | 7.087000e+09 | Munich |
5 | 6 | 73 | E.ON | Electric utility | 1.216460e+11 | 1.102113e+11 | 1.926000e+09 | Essen |
6 | 7 | 79 | Deutsche Telekom | Telecommunications | 1.201080e+11 | 1.088178e+11 | 8.415000e+09 | Bonn |
7 | 8 | 103 | Deutsche Post | Transportation | 9.932400e+10 | 8.998754e+10 | 5.636000e+09 | Bonn |
8 | 9 | 118 | Robert Bosch GmbH | Automotive parts | 9.184700e+10 | 8.321338e+10 | 1.367000e+09 | Stuttgart |
9 | 10 | 119 | BASF | Chemicals | 9.184600e+10 | 8.321248e+10 | -6.600000e+08 | Ludwigshafen |
10 | 11 | 162 | Siemens | Conglomerate | 7.786000e+10 | 7.054116e+10 | 4.027000e+09 | Munich / Berlin |
11 | 12 | 170 | Munich Re | Insurance | 7.574700e+10 | 6.862678e+10 | 3.610000e+09 | Munich |
12 | 13 | 232 | Deutsche Bahn | Transportation | 5.921000e+10 | 5.364426e+10 | -2.620000e+08 | Berlin |
13 | 14 | 237 | EnBW | Electric utility | 5.890100e+10 | 5.336431e+10 | 1.828000e+09 | Karlsruhe |
14 | 15 | 249 | Talanx | Insurance | 5.602900e+10 | 5.076227e+10 | 1.233000e+09 | Hanover |
15 | 16 | 264 | Daimler Truck | Automotive | 5.358200e+10 | 4.854529e+10 | 2.803000e+09 | Leinfelden-Echterdingen |
16 | 17 | 265 | Bayer | Pharmaceuticals | 5.336500e+10 | 4.834869e+10 | 4.365000e+09 | Leverkusen |
17 | 18 | 288 | Edeka Zentrale | Retail | 4.948100e+10 | 4.482979e+10 | 4.160000e+08 | Hamburg |
18 | 19 | 315 | ZF Friedrichshafen | Automotive parts | 4.606800e+10 | 4.173761e+10 | 2.390000e+08 | Friedrichshafen |
19 | 20 | 330 | ThyssenKrupp | Conglomerate | 4.450200e+10 | 4.031881e+10 | 1.229000e+09 | Essen |
20 | 21 | 346 | Fresenius | Healthcare | 4.295400e+10 | 3.891632e+10 | 1.443000e+09 | Bad Homburg |
21 | 22 | 354 | Deutsche Bank | Banking | 4.228500e+10 | 3.831021e+10 | 5.701000e+09 | Frankfurt |
22 | 23 | 362 | Continental AG | Automotive parts | 4.144900e+10 | 3.755279e+10 | 7.000000e+07 | Hanover |
23 | 24 | 367 | RWE | Electric utility | 4.035200e+10 | 3.655891e+10 | 2.858000e+09 | Essen |
24 | 25 | 388 | Phoenix Pharmahandel | Retail | 3.804500e+10 | 3.446877e+10 | 2.520000e+08 | Mannheim |
25 | 26 | 416 | Hapag-Lloyd | Shipping | 3.633100e+10 | 3.291589e+10 | 2.520000e+08 | Hamburg |
26 | 27 | 438 | Lufthansa | Airline | 3.444600e+10 | 3.120808e+10 | 8.320000e+08 | Cologne |
27 | 28 | 472 | SAP SE | Technology | 3.246900e+10 | 2.941691e+10 | 2.402000e+09 | Walldorf |
28 | 29 | 474 | Metro AG | Retail | 3.218600e+10 | 2.916052e+10 | -3.610000e+08 | Düsseldorf |
29 | 30 | 490 | Siemens Energy | Energy technology | 3.136700e+10 | 2.841850e+10 | -4.370000e+08 | Munich |
In [29]:
plt.figure(figsize=(10, 6))
# Sort the DataFrame by 'Revenue (EUR)' and select the top 5
top5_df = df.sort_values(by='Revenue (EUR)', ascending=False).head(5)
plt.bar(top5_df['Name'], top5_df['Revenue (EUR)'])
plt.xlabel('Name')
plt.ylabel('Revenue (EUR)')
plt.title('Top 5 Companies by Revenue (EUR)')
Out[29]:
Text(0.5, 1.0, 'Top 5 Companies by Revenue (EUR)')
In [30]:
df.to_csv(r'C:\Users\filip\Downloads\data-exercise\WebScraping\CSV\Companies_Revenue.csv', sep = ';', index = False, encoding = 'utf-8-sig')