from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
Web Scraping German Companies Market Cap using BeautifulSoup¶
url = 'https://companiesmarketcap.com/germany/largest-companies-in-germany-by-market-cap/'
response = requests.get('https://api.exchangerate-api.com/v4/latest/USD')
exchange_rate = response.json()['rates']['EUR']
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
table = soup.find('table')
columns_titles = table.find_all('th')
companies_table_titles = [title.text.strip() for title in columns_titles]
df = pd.DataFrame(columns = companies_table_titles)
for ad_tr in table.find_all('tr', class_='ad-tr no-sort'):
ad_tr.decompose()
column_td = table.find_all('td')
for td in column_td:
for company_code in td.find_all('div', class_='company-code'):
company_code.decompose()
for img_tag in td.find_all('img'):
img_tag.decompose()
ETL Process¶
In this code snippet, we are using Python with BeautifulSoup to manipulate an HTML table. The code performs the following tasks:
Removing Table Rows:
- We iterate through all the
<tr>
elements within the HTML table that have the class attribute 'ad-tr no-sort'. - For each matching row, we remove it from the HTML using the
decompose()
method.
- We iterate through all the
Collecting Table Cells:
- We collect all the
<td>
elements from the HTML table and store them in thecolumn_td
variable.
- We collect all the
Cleaning Table Cells:
- For each
<td>
element incolumn_td
, we perform the following operations:- We iterate through all the
<div>
elements within the<td>
that have the class attribute 'company-code'. - For each matching
<div>
, we remove it from the HTML using thedecompose()
method. - We also iterate through all the
<img>
elements within the<td>
and remove them using thedecompose()
method.
- We iterate through all the
- For each
This code is useful for filtering and cleaning HTML tables by removing specific rows and elements that match the specified criteria.
for i in range(0, len(column_td), 8):
individual_row_data = [td.text.strip() if td.text.strip() != '' else None for td in column_td[i:i+8]]
if len(individual_row_data) == len(df.columns):
df.loc[len(df)] = individual_row_data
else:
print(f"Skipping row due to mismatched columns: {individual_row_data}")
This Python code defines a function convert_market_cap(value)
and applies it to two columns, 'Market Cap' and 'Price', in a DataFrame named 'df'. The code performs the following tasks:
Function convert_market_cap(value)
:¶
- The function takes a single argument 'value', which is a string representing a market capitalization or price value.
- It cleans the 'value' by removing commas (','), dollar signs ('$'), and euro signs ('€') using the
replace()
method. - It checks if the cleaned 'value' contains 'B' (indicating billion) or 'M' (indicating million).
- If 'B' is present, it converts the value to a float and multiplies it by 1e9 (1 billion) to convert it to the appropriate market capitalization value.
- If 'M' is present, it converts the value to a float and multiplies it by 1e6 (1 million) to convert it to the appropriate market capitalization value.
- If neither 'B' nor 'M' is present, it converts the value to a float without any scaling and returns it.
Applying the Function to DataFrame Columns:¶
- The code applies the
convert_market_cap
function to the 'Market Cap' column of the DataFrame 'df' using theapply()
method. The result is stored in the 'market_cap' variable. - Similarly, it applies the
convert_market_cap
function to the 'Price' column of the DataFrame 'df' and stores the result in the 'price' variable.
This code is useful for converting market capitalization and price values from string representations with different formats into consistent numerical values that can be used for analysis.
def convert_market_cap(value):
value = value.replace(',', '').replace('$', '').replace('€', '')
if 'B' in value:
return float(value.replace('B', '')) * 1e9
elif 'M' in value:
return float(value.replace('M', '')) * 1e6
else:
return float(value)
df['Market Cap USD'] = df['Market Cap']
market_cap = df['Market Cap USD'].apply(convert_market_cap)
price = df['Price'].apply(convert_market_cap)
Function convert_to_eur(usd_value)
:¶
- The function takes a single argument 'usd_value', which represents a monetary value in US dollars.
- It checks if 'usd_value' is NaN (Not-a-Number) or not a numeric type using
pd.isna()
andisinstance()
functions. If 'usd_value' is not valid, the function returnsNone
to handle such cases. - If 'usd_value' is valid, it multiplies it by an 'exchange_rate' (which is assumed to be defined elsewhere in the code) to convert the value to euros ('€').
- The function then checks the magnitude of the converted value:
- If it's greater than or equal to 1 billion ('1e9'), it formats and returns the value in billions with two decimal places and 'B' (e.g., '€10.50 B').
- If it's greater than or equal to 1 million ('1e6') but less than 1 billion, it formats and returns the value in millions with two decimal places and 'M' (e.g., '€5.75 M').
- If it's greater than or equal to 1 thousand ('1e3') but less than 1 million, it formats and returns the value in thousands with two decimal places and 'K' (e.g., '€3.25 K').
- If it's less than 1 thousand, it formats and returns the value in euros with two decimal places and the euro symbol (e.g., '€500.75').
Applying the Function to DataFrame Columns:¶
- The code applies the
convert_to_eur
function to the 'Market Cap' column of the DataFrame 'df' and stores the result in a new column 'Market Cap EUR'. - Similarly, it applies the
convert_to_eur
function to the 'Price' column of the DataFrame 'df' and stores the result in a new column 'Price EUR'.
This code is useful for converting monetary values from US dollars to euros, formatting them appropriately based on their magnitude, and storing the converted values in new columns for further analysis.
def convert_to_eur(usd_value):
if pd.isna(usd_value) or not isinstance(usd_value, (int, float)):
return None # Handle NaN and non-numeric values
value = usd_value * exchange_rate
if value >= 1e9: # Billions
return f'€{value / 1e9:.2f} B'
elif value >= 1e6: # Millions
return f'{value / 1e6:.2f} M'
elif value >= 1e3: # Thousands
return f'{value / 1e3:.2f} K'
else: # Euros
return f'{value:.2f} €'
df['Market Cap EUR'] = market_cap.apply(convert_to_eur)
df['Price EUR'] = price.apply(convert_to_eur)
df = df[['Rank', 'Name', 'Market Cap EUR', 'Market Cap USD', 'Price EUR', 'Today']]
df
Rank | Name | Market Cap EUR | Market Cap USD | Price EUR | Today | |
---|---|---|---|---|---|---|
0 | 1 | SAP | €164.38 B | $181.43 B | 140.06 € | 0.17% |
1 | 2 | Siemens | €134.25 B | $148.18 B | 169.96 € | 0.85% |
2 | 3 | Deutsche Telekom | €108.18 B | $119.40 B | 21.75 € | 0.62% |
3 | 4 | Allianz | €94.79 B | $104.63 B | 242.00 € | 0.54% |
4 | 5 | Porsche | €72.80 B | $80.35 B | 79.92 € | 0.00% |
... | ... | ... | ... | ... | ... | ... |
95 | 96 | Encavis | €2.51 B | $2.77 B | 15.39 € | 0.49% |
96 | 97 | HelloFresh | €2.47 B | $2.73 B | 14.31 € | 0.28% |
97 | 98 | Nordex | €2.46 B | $2.72 B | 10.43 € | 2.61% |
98 | 99 | Lanxess | €2.45 B | $2.70 B | 28.25 € | 1.53% |
99 | 100 | IONOS Group | €2.44 B | $2.69 B | 17.47 € | 2.22% |
100 rows × 6 columns
df['Market Cap EUR'] = df['Market Cap EUR'].apply(convert_market_cap)
df['Market Cap USD'] = df['Market Cap USD'].apply(convert_market_cap)
df['Price EUR'] = df['Price EUR'].apply(convert_market_cap)
df
Rank | Name | Market Cap EUR | Market Cap USD | Price EUR | Today | |
---|---|---|---|---|---|---|
0 | 1 | SAP | 1.643800e+11 | 1.814300e+11 | 140.06 | 0.17% |
1 | 2 | Siemens | 1.342500e+11 | 1.481800e+11 | 169.96 | 0.85% |
2 | 3 | Deutsche Telekom | 1.081800e+11 | 1.194000e+11 | 21.75 | 0.62% |
3 | 4 | Allianz | 9.479000e+10 | 1.046300e+11 | 242.00 | 0.54% |
4 | 5 | Porsche | 7.280000e+10 | 8.035000e+10 | 79.92 | 0.00% |
... | ... | ... | ... | ... | ... | ... |
95 | 96 | Encavis | 2.510000e+09 | 2.770000e+09 | 15.39 | 0.49% |
96 | 97 | HelloFresh | 2.470000e+09 | 2.730000e+09 | 14.31 | 0.28% |
97 | 98 | Nordex | 2.460000e+09 | 2.720000e+09 | 10.43 | 2.61% |
98 | 99 | Lanxess | 2.450000e+09 | 2.700000e+09 | 28.25 | 1.53% |
99 | 100 | IONOS Group | 2.440000e+09 | 2.690000e+09 | 17.47 | 2.22% |
100 rows × 6 columns
plt.figure(figsize=(10,6))
# Sort the DataFrame based on 'Revenue' and select the top 5
top5_df = df.sort_values(by='Market Cap USD', ascending=False).head(5)
plt.bar(top5_df['Name'], top5_df['Market Cap USD'])
plt.xlabel('Name')
plt.ylabel('Market Cap USD')
plt.title('Top 5 Company')
Text(0.5, 1.0, 'Top 5 Company')
df.to_csv(r'C:\Users\filip\Downloads\data-exercise\WebScraping\CSV\Companies_MarketCap.csv', sep = ';', index = False, encoding = 'utf-8-sig')