In [16]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
Web Scraping German Companies from Wikipedia using BeautifulSoup¶
Step-by-Step Documentation¶
Import Libraries
- Import requests and BeautifulSoup from bs4.
- Import pandas and matplotlib for data handling and plotting.
Fetch Web Page
- Define the URL of the Wikipedia page listing the largest German companies.
- Retrieve the webpage using requests.get.
Parse HTML Content
- Parse the HTML of the page using BeautifulSoup.
Locate the Table
- Find and select the second table in the HTML.
Extract Table Headers
- Find all <th> tags in the table.
- Extract and clean the text from these headers.
Create DataFrame
- Initialize a pandas DataFrame with column names taken from the table headers.
Extract Row Data
- Find all <tr> tags in the table.
- Loop through each row to extract cell data.
- Add the extracted data as rows to the DataFrame.
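The steps above can be sketched end-to-end on a toy inline table (the HTML below is invented for illustration; the real Wikipedia page contains several tables and more columns):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Invented miniature table standing in for the Wikipedia page
html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Volkswagen</td></tr>
  <tr><td>2</td><td>Uniper</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[0]  # the toy page has only one table

# Headers -> column names
headers = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=headers)

# Rows -> DataFrame rows (skip the header row)
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    df.loc[len(df)] = cells

print(df)
```

Note that all extracted values are strings at this point; numeric conversion happens in the Transform step further down.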
In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_German_companies'
# Fetch the current USD to EUR exchange rate for the revenue conversion later on
response = requests.get('https://api.exchangerate-api.com/v4/latest/USD')
exchange_rate = response.json()['rates']['EUR']
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
In [18]:
table = soup.find_all('table')[1]
In [19]:
columns_titles = table.find_all('th')
In [20]:
companies_table_titles = [title.text.strip() for title in columns_titles]
In [21]:
df = pd.DataFrame(columns=companies_table_titles)
In [22]:
column_data = table.find_all('tr')
In [23]:
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data
ETL Process for German Company Data¶
Overview¶
This Jupyter Notebook documents the ETL (Extract, Transform, Load) process applied to a dataset of the largest German companies. The data is sourced from a Wikipedia page.
Extract¶
- Data Extraction: Using requests and BeautifulSoup, data is extracted from the Wikipedia page listing the largest German companies.
Transform¶
- Data Cleaning: The extracted data undergoes several transformation steps, including:
- Converting string-formatted financial figures (revenues and profits) to numerical values, handling formats such as currency symbols and parentheses for negative numbers.
- Resolving issues caused by non-standard minus signs during string-to-float conversion.
- Reordering DataFrame columns and resolving SettingWithCopyWarning by using the .loc[] method.
- Typecasting certain columns, such as 'Rank' and 'Fortune 500 rank', from strings to integers.
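The string-to-float cleaning described above can be exercised in isolation; this is a minimal sketch on invented sample values that mimic the Wikipedia formatting, including the non-standard Unicode minus sign (U+2212):

```python
import pandas as pd

# Invented values mimicking the source formatting; '−' is U+2212, not a hyphen
raw = pd.Series(['$293,685 million', '−$19,961 million'])

# Strip thousands separators, currency symbols, and the ' million' suffix,
# and normalise the Unicode minus to a regular hyphen
cleaned = raw.replace({',': '', r'\$': '', ' million': '', '−': '-'}, regex=True)

# Convert to float and scale millions to absolute values
values = cleaned.astype(float) * 1e6

print(values.tolist())  # [293685000000.0, -19961000000.0]
```

Without the `'−': '-'` substitution, `astype(float)` raises a ValueError on the second entry, which is exactly the conversion error mentioned above.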
Conclusion¶
This notebook serves as a practical example of handling and transforming web-sourced data, showcasing essential data cleaning and manipulation techniques in Python.
In [24]:
# Strip commas, currency symbols, and the ' million' suffix, then scale to absolute USD
df['Revenue(USD)'] = df['Revenue(USD millions)'].replace({',': '', r'\$': '', ' million': ''}, regex=True).astype(float) * 1e6
# Convert USD to EUR using the exchange rate fetched earlier
df['Revenue (EUR)'] = df['Revenue(USD)'] * exchange_rate
In [25]:
# Replace non-standard minus sign and other characters
df['Profits(USD)'] = df['Profits(USD millions)'].replace({',': '', r'\$': '', ' million': '', '−': '-'}, regex=True)
# Convert to float and scale millions to absolute values
df['Profits(USD)'] = df['Profits(USD)'].astype(float) * 1e6
In [26]:
# Convert 'Rank' and 'Fortune 500rank' columns to integers using .loc[]
df.loc[:, 'Rank'] = df['Rank'].astype(int)
df.loc[:, 'Fortune 500rank'] = df['Fortune 500rank'].astype(int)
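The rank typecast above can be tried in isolation; a small sketch with invented sample values, using .loc[] as in the cell above to avoid a SettingWithCopyWarning when the frame might be a slice of another:

```python
import pandas as pd

# Invented sample: rank values arrive as strings from the scrape
df_demo = pd.DataFrame({'Rank': ['1', '2', '3']})

# .loc[:, col] assigns into the named column explicitly
df_demo.loc[:, 'Rank'] = df_demo['Rank'].astype(int)

print(df_demo['Rank'].tolist())  # [1, 2, 3]
```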
In [27]:
df = df[['Rank', 'Fortune 500rank', 'Name', 'Industry', 'Revenue(USD)', 'Revenue (EUR)', 'Profits(USD)', 'Headquarters']]
In [28]:
df
Out[28]:
| | Rank | Fortune 500rank | Name | Industry | Revenue(USD) | Revenue (EUR) | Profits(USD) | Headquarters |
|---|---|---|---|---|---|---|---|---|
0 | 1 | 15 | Volkswagen | Automotive | 2.936850e+11 | 2.660786e+11 | 1.522300e+10 | Wolfsburg |
1 | 2 | 16 | Uniper | Electric utility | 2.883090e+11 | 2.612080e+11 | -1.996100e+10 | Düsseldorf |
2 | 3 | 47 | Mercedes-Benz Group | Automotive | 1.577820e+11 | 1.429505e+11 | 1.525200e+10 | Stuttgart |
3 | 4 | 57 | BMW | Automotive | 1.499910e+11 | 1.358918e+11 | 1.887000e+10 | Munich |
4 | 5 | 67 | Allianz | Insurance | 1.290590e+11 | 1.169275e+11 | 7.087000e+09 | Munich |
5 | 6 | 73 | E.ON | Electric utility | 1.216460e+11 | 1.102113e+11 | 1.926000e+09 | Essen |
6 | 7 | 79 | Deutsche Telekom | Telecommunications | 1.201080e+11 | 1.088178e+11 | 8.415000e+09 | Bonn |
7 | 8 | 103 | Deutsche Post | Transportation | 9.932400e+10 | 8.998754e+10 | 5.636000e+09 | Bonn |
8 | 9 | 118 | Robert Bosch GmbH | Automotive parts | 9.184700e+10 | 8.321338e+10 | 1.367000e+09 | Stuttgart |
9 | 10 | 119 | BASF | Chemicals | 9.184600e+10 | 8.321248e+10 | -6.600000e+08 | Ludwigshafen |
10 | 11 | 162 | Siemens | Conglomerate | 7.786000e+10 | 7.054116e+10 | 4.027000e+09 | Munich / Berlin |
11 | 12 | 170 | Munich Re | Insurance | 7.574700e+10 | 6.862678e+10 | 3.610000e+09 | Munich |
12 | 13 | 232 | Deutsche Bahn | Transportation | 5.921000e+10 | 5.364426e+10 | -2.620000e+08 | Berlin |
13 | 14 | 237 | EnBW | Electric utility | 5.890100e+10 | 5.336431e+10 | 1.828000e+09 | Karlsruhe |
14 | 15 | 249 | Talanx | Insurance | 5.602900e+10 | 5.076227e+10 | 1.233000e+09 | Hanover |
15 | 16 | 264 | Daimler Truck | Automotive | 5.358200e+10 | 4.854529e+10 | 2.803000e+09 | Leinfelden-Echterdingen |
16 | 17 | 265 | Bayer | Pharmaceuticals | 5.336500e+10 | 4.834869e+10 | 4.365000e+09 | Leverkusen |
17 | 18 | 288 | Edeka Zentrale | Retail | 4.948100e+10 | 4.482979e+10 | 4.160000e+08 | Hamburg |
18 | 19 | 315 | ZF Friedrichshafen | Automotive parts | 4.606800e+10 | 4.173761e+10 | 2.390000e+08 | Friedrichshafen |
19 | 20 | 330 | ThyssenKrupp | Conglomerate | 4.450200e+10 | 4.031881e+10 | 1.229000e+09 | Essen |
20 | 21 | 346 | Fresenius | Healthcare | 4.295400e+10 | 3.891632e+10 | 1.443000e+09 | Bad Homburg |
21 | 22 | 354 | Deutsche Bank | Banking | 4.228500e+10 | 3.831021e+10 | 5.701000e+09 | Frankfurt |
22 | 23 | 362 | Continental AG | Automotive parts | 4.144900e+10 | 3.755279e+10 | 7.000000e+07 | Hanover |
23 | 24 | 367 | RWE | Electric utility | 4.035200e+10 | 3.655891e+10 | 2.858000e+09 | Essen |
24 | 25 | 388 | Phoenix Pharmahandel | Retail | 3.804500e+10 | 3.446877e+10 | 2.520000e+08 | Mannheim |
25 | 26 | 416 | Hapag-Lloyd | Shipping | 3.633100e+10 | 3.291589e+10 | 2.520000e+08 | Hamburg |
26 | 27 | 438 | Lufthansa | Airline | 3.444600e+10 | 3.120808e+10 | 8.320000e+08 | Cologne |
27 | 28 | 472 | SAP SE | Technology | 3.246900e+10 | 2.941691e+10 | 2.402000e+09 | Walldorf |
28 | 29 | 474 | Metro AG | Retail | 3.218600e+10 | 2.916052e+10 | -3.610000e+08 | Düsseldorf |
29 | 30 | 490 | Siemens Energy | Energy technology | 3.136700e+10 | 2.841850e+10 | -4.370000e+08 | Munich |
In [29]:
plt.figure(figsize=(10, 6))
# Sort the DataFrame by 'Revenue (EUR)' and select the top 5
top5_df = df.sort_values(by='Revenue (EUR)', ascending=False).head(5)
plt.bar(top5_df['Name'], top5_df['Revenue (EUR)'])
plt.xlabel('Name')
plt.ylabel('Revenue (EUR)')
plt.title('Top 5 Companies by Revenue (EUR)')
Out[29]:
Text(0.5, 1.0, 'Top 5 Companies by Revenue (EUR)')
In [30]:
df.to_csv(r'C:\Users\filip\Downloads\data-exercise\WebScraping\CSV\Companies_Revenue.csv', sep = ';', index = False, encoding = 'utf-8-sig')