In [28]:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt

Web Scraping German Companies Market Cap using BeautifulSoup¶

In [29]:

url = 'https://companiesmarketcap.com/germany/largest-companies-in-germany-by-market-cap/'

response = requests.get('https://api.exchangerate-api.com/v4/latest/USD')

exchange_rate = response.json()['rates']['EUR']

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html')

In [30]:

table = soup.find('table')

In [31]:

columns_titles = table.find_all('th')

In [32]:

companies_table_titles = [title.text.strip() for title in columns_titles]

In [33]:

df = pd.DataFrame(columns = companies_table_titles)

In [34]:

for ad_tr in table.find_all('tr', class_='ad-tr no-sort'):
    ad_tr.decompose()
    
column_td = table.find_all('td')

for td in column_td:
    for company_code in td.find_all('div', class_='company-code'):
        company_code.decompose()
    for img_tag in td.find_all('img'):
        img_tag.decompose()

ETL Process¶

In this code snippet, we are using Python with BeautifulSoup to manipulate an HTML table. The code performs the following tasks:

Removing Table Rows:
- We iterate through all the <tr> elements within the HTML table that have the class attribute 'ad-tr no-sort'.
- For each matching row, we remove it from the HTML using the decompose() method.
Collecting Table Cells:
- We collect all the <td> elements from the HTML table and store them in the column_td variable.
Cleaning Table Cells:
- For each <td> element in column_td, we perform the following operations:
  - We iterate through all the <div> elements within the <td> that have the class attribute 'company-code'.
  - For each matching <div>, we remove it from the HTML using the decompose() method.
  - We also iterate through all the <img> elements within the <td> and remove them using the decompose() method.

This code is useful for filtering and cleaning HTML tables by removing specific rows and elements that match the specified criteria.

In [35]:

for i in range(0, len(column_td), 8):
    individual_row_data = [td.text.strip() if td.text.strip() != '' else None for td in column_td[i:i+8]]
    if len(individual_row_data) == len(df.columns):
        df.loc[len(df)] = individual_row_data
    else:
        print(f"Skipping row due to mismatched columns: {individual_row_data}")

This Python code defines a function convert_market_cap(value) and applies it to two columns, 'Market Cap' and 'Price', in a DataFrame named 'df'. The code performs the following tasks:

Function `convert_market_cap(value)`:¶

The function takes a single argument 'value', which is a string representing a market capitalization or price value.
It cleans the 'value' by removing commas (','), dollar signs ('$'), and euro signs ('€') using the replace() method.
It checks if the cleaned 'value' contains 'B' (indicating billion) or 'M' (indicating million).
If 'B' is present, it converts the value to a float and multiplies it by 1e9 (1 billion) to convert it to the appropriate market capitalization value.
If 'M' is present, it converts the value to a float and multiplies it by 1e6 (1 million) to convert it to the appropriate market capitalization value.
If neither 'B' nor 'M' is present, it converts the value to a float without any scaling and returns it.

Applying the Function to DataFrame Columns:¶

The code applies the convert_market_cap function to the 'Market Cap' column of the DataFrame 'df' using the apply() method. The result is stored in the 'market_cap' variable.
Similarly, it applies the convert_market_cap function to the 'Price' column of the DataFrame 'df' and stores the result in the 'price' variable.

This code is useful for converting market capitalization and price values from string representations with different formats into consistent numerical values that can be used for analysis.

In [36]:

def convert_market_cap(value):
    value = value.replace(',', '').replace('$', '').replace('€', '')
    if 'B' in value:
        return float(value.replace('B', '')) * 1e9
    elif 'M' in value:
        return float(value.replace('M', '')) * 1e6
    else:
        return float(value)
df['Market Cap USD'] = df['Market Cap']
market_cap = df['Market Cap USD'].apply(convert_market_cap)
price = df['Price'].apply(convert_market_cap)

Function `convert_to_eur(usd_value)`:¶

The function takes a single argument 'usd_value', which represents a monetary value in US dollars.
It checks if 'usd_value' is NaN (Not-a-Number) or not a numeric type using pd.isna() and isinstance() functions. If 'usd_value' is not valid, the function returns None to handle such cases.
If 'usd_value' is valid, it multiplies it by an 'exchange_rate' (which is assumed to be defined elsewhere in the code) to convert the value to euros ('€').
The function then checks the magnitude of the converted value:
- If it's greater than or equal to 1 billion ('1e9'), it formats and returns the value in billions with two decimal places and 'B' (e.g., '€10.50 B').
- If it's greater than or equal to 1 million ('1e6') but less than 1 billion, it formats and returns the value in millions with two decimal places and 'M' (e.g., '€5.75 M').
- If it's greater than or equal to 1 thousand ('1e3') but less than 1 million, it formats and returns the value in thousands with two decimal places and 'K' (e.g., '€3.25 K').
- If it's less than 1 thousand, it formats and returns the value in euros with two decimal places and the euro symbol (e.g., '€500.75').

Applying the Function to DataFrame Columns:¶

The code applies the convert_to_eur function to the 'Market Cap' column of the DataFrame 'df' and stores the result in a new column 'Market Cap EUR'.
Similarly, it applies the convert_to_eur function to the 'Price' column of the DataFrame 'df' and stores the result in a new column 'Price EUR'.

This code is useful for converting monetary values from US dollars to euros, formatting them appropriately based on their magnitude, and storing the converted values in new columns for further analysis.

In [37]:

def convert_to_eur(usd_value):
    if pd.isna(usd_value) or not isinstance(usd_value, (int, float)):
        return None  # Handle NaN and non-numeric values

    value = usd_value * exchange_rate

    if value >= 1e9:  # Billions
        return f'€{value / 1e9:.2f} B'
    elif value >= 1e6:  # Millions
        return f'{value / 1e6:.2f} M'
    elif value >= 1e3:  # Thousands
        return f'{value / 1e3:.2f} K'
    else:  # Euros
        return f'{value:.2f} €'

df['Market Cap EUR'] = market_cap.apply(convert_to_eur)
df['Price EUR'] = price.apply(convert_to_eur)

In [38]:

df = df[['Rank', 'Name', 'Market Cap EUR', 'Market Cap USD', 'Price EUR', 'Today']]
df

Out[38]:

	Rank	Name	Market Cap EUR	Market Cap USD	Price EUR	Today
0	1	SAP	€164.38 B	$181.43 B	140.06 €	0.17%
1	2	Siemens	€134.25 B	$148.18 B	169.96 €	0.85%
2	3	Deutsche Telekom	€108.18 B	$119.40 B	21.75 €	0.62%
3	4	Allianz	€94.79 B	$104.63 B	242.00 €	0.54%
4	5	Porsche	€72.80 B	$80.35 B	79.92 €	0.00%
...	...	...	...	...	...	...
95	96	Encavis	€2.51 B	$2.77 B	15.39 €	0.49%
96	97	HelloFresh	€2.47 B	$2.73 B	14.31 €	0.28%
97	98	Nordex	€2.46 B	$2.72 B	10.43 €	2.61%
98	99	Lanxess	€2.45 B	$2.70 B	28.25 €	1.53%
99	100	IONOS Group	€2.44 B	$2.69 B	17.47 €	2.22%

100 rows × 6 columns

In [39]:

df['Market Cap EUR'] = df['Market Cap EUR'].apply(convert_market_cap)
df['Market Cap USD'] = df['Market Cap USD'].apply(convert_market_cap)
df['Price EUR'] = df['Price EUR'].apply(convert_market_cap)
df

Out[39]:

	Rank	Name	Market Cap EUR	Market Cap USD	Price EUR	Today
0	1	SAP	1.643800e+11	1.814300e+11	140.06	0.17%
1	2	Siemens	1.342500e+11	1.481800e+11	169.96	0.85%
2	3	Deutsche Telekom	1.081800e+11	1.194000e+11	21.75	0.62%
3	4	Allianz	9.479000e+10	1.046300e+11	242.00	0.54%
4	5	Porsche	7.280000e+10	8.035000e+10	79.92	0.00%
...	...	...	...	...	...	...
95	96	Encavis	2.510000e+09	2.770000e+09	15.39	0.49%
96	97	HelloFresh	2.470000e+09	2.730000e+09	14.31	0.28%
97	98	Nordex	2.460000e+09	2.720000e+09	10.43	2.61%
98	99	Lanxess	2.450000e+09	2.700000e+09	28.25	1.53%
99	100	IONOS Group	2.440000e+09	2.690000e+09	17.47	2.22%

100 rows × 6 columns

In [41]:

plt.figure(figsize=(10,6))

# Sort the DataFrame based on 'Revenue' and select the top 5
top5_df = df.sort_values(by='Market Cap USD', ascending=False).head(5)

plt.bar(top5_df['Name'], top5_df['Market Cap USD'])

plt.xlabel('Name')
plt.ylabel('Market Cap USD')
plt.title('Top 5 Company')

Out[41]:

Text(0.5, 1.0, 'Top 5 Company')

No description has been provided for this image

In [42]:

df.to_csv(r'C:\Users\filip\Downloads\data-exercise\WebScraping\CSV\Companies_MarketCap.csv', sep = ';', index = False, encoding = 'utf-8-sig')

In [ ]:

Web Scraping German Companies Market Cap using BeautifulSoup¶

ETL Process¶

Function convert_market_cap(value):¶

Applying the Function to DataFrame Columns:¶

Function convert_to_eur(usd_value):¶

Applying the Function to DataFrame Columns:¶

Function `convert_market_cap(value)`:¶

Function `convert_to_eur(usd_value)`:¶