In [16]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt

Web Scraping German Companies from Wikipedia using BeautifulSoup¶

Step-by-Step Documentation¶

  1. Import Libraries

    • Import requests and BeautifulSoup from bs4.
  2. Fetch Web Page

    • Define URL of Wikipedia page for largest German companies.
    • Retrieve the webpage using requests.get.
  3. Parse HTML Content

    • Parse the HTML of the page using BeautifulSoup.
  4. Locate the Table

    • Find and select the second table in the HTML.
  5. Extract Table Headers

    • Find all <th> tags in the table.
    • Extract and clean the text from these headers.
  6. Create DataFrame

    • Initialize a pandas DataFrame with column names from table headers.
  7. Extract Row Data

    • Find all <tr> tags in the table.
    • Loop through each row to extract cell data.
    • Add the extracted data as rows in the DataFrame. ge.
In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_German_companies'

response = requests.get('https://api.exchangerate-api.com/v4/latest/USD')

exchange_rate = response.json()['rates']['EUR']

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html')
In [18]:
table = soup.find_all('table')[1]
In [19]:
columns_titles = table.find_all('th')
In [20]:
companies_table_titles = [title.text.strip() for title in columns_titles]
In [21]:
df = pd.DataFrame(columns = companies_table_titles)
In [22]:
column_data = table.find_all('tr')
In [23]:
for row in column_data[1:] : 
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data

ETL Process for German Company Data¶

Overview¶

This Jupyter Notebook documents the ETL (Extract, Transform, Load) process applied to a dataset of the largest German companies. The data is sourced from a Wikipedia page.

Extract¶

  • Data Extraction: Utilizing requests and BeautifulSoup, data is extracted from the Wikipedia page listing the largest German companies.

Transform¶

  • Data Cleaning: The extracted data undergoes various transformation steps, including:
    • Converting string-formatted financial figures (revenues and profits) to numerical values, handling formats like currency symbols and parentheses for negative numbers.
    • Resolving issues related to non-standard minus signs and string-to-float conversion errors.
    • Reordering DataFrame columns and resolving SettingWithCopyWarning by using the .loc[] method.
    • Typecasting certain columns, like 'Rank' and 'Fortune 500 rank', from strings to integers.

Conclusion¶

This notebook serves as a practical example of handling and transforming web-sourced data, showcasing essential data cleaning and manipulation techniques in Python.

In [24]:
df['Revenue(USD)'] = df['Revenue(USD millions)'].replace({',': '', r'\$': '', ' million': ''}, regex=True).astype(float) * 1e6
df['Revenue (EUR)'] = df['Revenue(USD)'] * exchange_rate
In [25]:
# Replace non-standard minus sign and other characters
df['Profits(USD)'] = df['Profits(USD millions)'].replace({',': '', r'\$': '', ' million': '', '−': '-'}, regex=True)

# Convert to float and add numbers to make ir a million
df['Profits(USD)'] = df['Profits(USD)'].astype(float) * 1e6
In [26]:
# Convert 'Rank' and 'Fortune 500rank' columns to integers using .loc[]
df.loc[:, 'Rank'] = df['Rank'].astype(int)
df.loc[:, 'Fortune 500rank'] = df['Fortune 500rank'].astype(int)
In [27]:
df = df[['Rank', 'Fortune 500rank', 'Name', 'Industry', 'Revenue(USD)', 'Revenue (EUR)', 'Profits(USD)', 'Headquarters']]
In [28]:
df
Out[28]:
Rank Fortune 500rank Name Industry Revenue(USD) Revenue (EUR) Profits(USD) Headquarters
0 1 15 Volkswagen Automotive 2.936850e+11 2.660786e+11 1.522300e+10 Wolfsburg
1 2 16 Uniper Electric utility 2.883090e+11 2.612080e+11 -1.996100e+10 Düsseldorf
2 3 47 Mercedes-Benz Group Automotive 1.577820e+11 1.429505e+11 1.525200e+10 Stuttgart
3 4 57 BMW Automotive 1.499910e+11 1.358918e+11 1.887000e+10 Munich
4 5 67 Allianz Insurance 1.290590e+11 1.169275e+11 7.087000e+09 Munich
5 6 73 E.ON Electric utility 1.216460e+11 1.102113e+11 1.926000e+09 Essen
6 7 79 Deutsche Telekom Telecommunications 1.201080e+11 1.088178e+11 8.415000e+09 Bonn
7 8 103 Deutsche Post Transportation 9.932400e+10 8.998754e+10 5.636000e+09 Bonn
8 9 118 Robert Bosch GmbH Automotive parts 9.184700e+10 8.321338e+10 1.367000e+09 Stuttgart
9 10 119 BASF Chemicals 9.184600e+10 8.321248e+10 -6.600000e+08 Ludwigshafen
10 11 162 Siemens Conglomerate 7.786000e+10 7.054116e+10 4.027000e+09 Munich / Berlin
11 12 170 Munich Re Insurance 7.574700e+10 6.862678e+10 3.610000e+09 Munich
12 13 232 Deutsche Bahn Transportation 5.921000e+10 5.364426e+10 -2.620000e+08 Berlin
13 14 237 EnBW Electric utility 5.890100e+10 5.336431e+10 1.828000e+09 Karlsruhe
14 15 249 Talanx Insurance 5.602900e+10 5.076227e+10 1.233000e+09 Hanover
15 16 264 Daimler Truck Automotive 5.358200e+10 4.854529e+10 2.803000e+09 Leinfelden-Echterdingen
16 17 265 Bayer Pharmaceuticals 5.336500e+10 4.834869e+10 4.365000e+09 Leverkusen
17 18 288 Edeka Zentrale Retail 4.948100e+10 4.482979e+10 4.160000e+08 Hamburg
18 19 315 ZF Friedrichshafen Automotive parts 4.606800e+10 4.173761e+10 2.390000e+08 Friedrichshafen
19 20 330 ThyssenKrupp Conglomerate 4.450200e+10 4.031881e+10 1.229000e+09 Essen
20 21 346 Fresenius Healthcare 4.295400e+10 3.891632e+10 1.443000e+09 Bad Homburg
21 22 354 Deutsche Bank Banking 4.228500e+10 3.831021e+10 5.701000e+09 Frankfurt
22 23 362 Continental AG Automotive parts 4.144900e+10 3.755279e+10 7.000000e+07 Hanover
23 24 367 RWE Electric utility 4.035200e+10 3.655891e+10 2.858000e+09 Essen
24 25 388 Phoenix Pharmahandel Retail 3.804500e+10 3.446877e+10 2.520000e+08 Mannheim
25 26 416 Hapag-Lloyd Shipping 3.633100e+10 3.291589e+10 2.520000e+08 Hamburg
26 27 438 Lufthansa Airline 3.444600e+10 3.120808e+10 8.320000e+08 Cologne
27 28 472 SAP SE Technology 3.246900e+10 2.941691e+10 2.402000e+09 Walldorf
28 29 474 Metro AG Retail 3.218600e+10 2.916052e+10 -3.610000e+08 Düsseldorf
29 30 490 Siemens Energy Energy technology 3.136700e+10 2.841850e+10 -4.370000e+08 Munich
In [29]:
plt.figure(figsize=(10,6))

# Sort the DataFrame based on 'Revenue' and select the top 5
top5_df = df.sort_values(by='Revenue (EUR)', ascending=False).head(5)

plt.bar(top5_df['Name'], top5_df['Revenue (EUR)'])

plt.xlabel('Name')
plt.ylabel('Revenue')
plt.title('Top 5 Company')
Out[29]:
Text(0.5, 1.0, 'Top 5 Company')
No description has been provided for this image
In [30]:
df.to_csv(r'C:\Users\filip\Downloads\data-exercise\WebScraping\CSV\Companies_Revenue.csv', sep = ';', index = False, encoding = 'utf-8-sig')
In [ ]:
 
In [ ]: