Extract Transform Load (ETL) for Apple Products Specifications
This article is based on a project we worked on during an introduction to data engineering course. We used Python as our programming language and a Malaysian website (https://phone.mesramobile.com/category/apple/) for data scraping. We hope this article helps you learn a bit more about ETL and Python.
According to Wikipedia, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system that represents the data differently from, or in a different context than, the source(s). In this article, we will walk through an ETL pipeline step by step using Python code. Let’s get started!
First, we need to import the necessary libraries. We are using BeautifulSoup and the requests module for our scraper. requests allows us to send HTTP requests and returns a Response object containing the response data.
!pip install beautifulsoup4
!pip install requests

from bs4 import BeautifulSoup
import requests
url = "https://phone.mesramobile.com/category/apple/"
response = requests.get(url)
response
data = response.text
data
soup = BeautifulSoup(data, 'html.parser')
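Some sites respond differently to the default requests User-Agent or hang on slow connections, so it can help to send a browser-like header, set a timeout, and fail fast on a bad status code. This is an optional variant of the request above, not part of the original notebook (the header value is just an example):
headers = {"User-Agent": "Mozilla/5.0"}   # example value only; any browser-like UA works
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()               # raise early if the request failed
soup = BeautifulSoup(response.text, "html.parser")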
STEP 1: EXTRACT
Before we start scraping, we need to identify which data to scrape. The following information is available on the website (see the inspection sketch after the list):
- phone model
- price
- display
- camera
- operating system
- system
- memory
- battery
- wifi
- charging
- network
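The page does not label each spec row with a readable class name. Instead, every row carries a small sprite icon, a div whose inline background-position offset differs per attribute (for example 0px -91px for price), and the values live in a neighbouring div. A quick way to confirm this before writing selectors is to print one product card and read the offsets off the raw HTML. A minimal inspection sketch, assuming the title div's parent is the whole card (that assumption may need adjusting for the actual markup):
# Peek at the first product card to see how each spec line is marked up.
# The "post-title" class comes from the scraper below; treating its parent
# div as the whole card is an assumption about this theme's layout.
title = soup.find("div", {"class": "post-title"})
if title is not None:
    card = title.find_parent("div")
    print(card.prettify()[:2000])   # the first ~2000 characters are enough to read the offsets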
#name
model_tag = []

# each product title sits in an <a> inside a div with class "post-title"
for div in soup.findAll("div", {"class": "post-title"}):
    model = div.findAll("a")
    for mod in model:
        model_tag.append(mod.text)

a = model_tag[0::]
for q in a:
    print(q)

#price
price_tag = []
price_size_tag = []
price_sep = []

# the div just before each price sprite icon holds alternating label/value
# children: even positions are labels, odd positions are the values we keep
for div in soup.findAll("div", {"style": "background-position: 0px -91px;"}):
    price = div.find_previous("div")
    n = 0
    for pri in price:
        if n % 2 == 0:
            price_size_tag.append(pri)
        else:
            price_tag.append(pri)
        n = n + 1

b = price_tag[0::]
for r in b:
    print(r)

#display
display_tag = []
display_size_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px 0px;"}):
    display = div.find_previous("div")
    n = 0
    for disp in display:
        if n % 2 == 0:
            display_size_tag.append(disp)
        else:
            display_tag.append(disp)
        n = n + 1

c = display_tag[0::]
for s in c:
    print(s)

#camera
cam_tag = []
cam_size_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -13px;"}):
    camera = div.find_previous("div")
    n = 0
    for cam in camera:
        if n % 2 == 0:
            cam_size_tag.append(cam)
        else:
            cam_tag.append(cam)
        n = n + 1

d = cam_tag[0::]
for t in d:
    print(t)

#os
os_tag = []
os_type_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -26px;"}):
    oss = div.find_previous("div")
    n = 0
    for os in oss:
        if n % 2 == 0:
            os_type_tag.append(os)
        else:
            os_tag.append(os)
        n = n + 1

e = os_tag[0::]
for u in e:
    print(u)

#system
system_tag = []
system_waste_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -39px;"}):
    system = div.find_previous("div")
    n = 0
    for sys in system:
        if n % 2 == 0:
            system_waste_tag.append(sys)
        else:
            system_tag.append(sys)
        n = n + 1

f = system_tag[0::]
for v in f:
    print(v)

#memory
memory_tag = []
memory_waste_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -52px;"}):
    memory = div.find_previous("div")
    n = 0
    for mm in memory:
        if n % 2 == 0:
            memory_waste_tag.append(mm)
        else:
            memory_tag.append(mm)
        n = n + 1

g = memory_tag[0::]
for w in g:
    print(w)

#battery
battery_tag = []
battery_size_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -78px;"}):
    battery = div.find_previous("div")
    n = 0
    for bat in battery:
        if n % 2 == 0:
            battery_size_tag.append(bat)
        else:
            battery_tag.append(bat)
        n = n + 1

h = battery_tag[0::]
for x in h:
    print(x)

#wifi
wifi_tag = []
wifi_speed_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -104px;"}):
    connection = div.find_previous("div")
    n = 0
    for wifi in connection:
        if n % 2 == 0:
            wifi_speed_tag.append(wifi)
        else:
            wifi_tag.append(wifi)
        n = n + 1

q = wifi_tag[0::]
for l in q:
    print(l)

#charging
charging_tag = []
charging_type_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -117px;"}):
    charging = div.find_previous("div")
    n = 0
    for charge in charging:
        if n % 2 == 0:
            charging_type_tag.append(charge)
        else:
            charging_tag.append(charge)
        n = n + 1

i = charging_tag[0::]
for y in i:
    print(y)

#network
network_tag = []
network_type_tag = []

for div in soup.findAll("div", {"style": "background-position: 0px -130px;"}):
    network = div.find_previous("div")
    n = 0
    for net in network:
        if n % 2 == 0:
            network_type_tag.append(net)
        else:
            network_tag.append(net)
        n = n + 1

j = network_tag[0::]
for z in j:
    print(z)
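Each of the ten blocks above repeats the same pattern: find the sprite div for an attribute, step back to the previous div, and keep every second child (the values). The notebook keeps the blocks separate for clarity, but if you want to shorten it, the pattern can be wrapped in one helper; this is only a refactoring sketch, not part of the original code:
def extract_spec(soup, offset):
    """Collect the value (odd-indexed) children for the sprite icon at `offset`."""
    values = []
    for div in soup.findAll("div", {"style": "background-position: 0px " + offset + ";"}):
        children = list(div.find_previous("div"))
        values.extend(children[1::2])  # odd positions hold the values, even ones the labels
    return values

# Same offsets as used in the blocks above.
prices = extract_spec(soup, "-91px")    # equivalent to b
cameras = extract_spec(soup, "-13px")   # equivalent to d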
The scraped attributes are then combined and displayed as a dataframe.
#display all attributes
import pandas as pd
df = pd.DataFrame({"Phone Model": a, "Price": b, "Display": c, "Camera": d, "Operating System": e, "System": f, "Memory": g, "Battery": h, "WiFi": q, "Charging": i, "Network": j})
df
Here is an example of the output.
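One thing to watch: pd.DataFrame raises a ValueError when the column lists have unequal lengths, which happens easily if one spec is missing for a phone on the page. If that occurs, a quick check like the sketch below shows which attribute is off:
# Print the length of every scraped list; all of them must match
# for the DataFrame constructor to succeed.
for name, values in [("model", a), ("price", b), ("display", c), ("camera", d),
                     ("os", e), ("system", f), ("memory", g), ("battery", h),
                     ("wifi", q), ("charging", i), ("network", j)]:
    print(name, len(values))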
STEP 2: TRANSFORM
DATA PROCESSING
The attributes scraped from the website are already fairly clean, so not much data cleaning is needed. We checked for null/missing data, removed remaining whitespace, dropped "RM" from the price column and "MP" from the camera column, renamed Price to Price(RM) and Camera to Camera(MP), and dropped the WiFi attribute as it is not meaningful.
df_raw = df.copy() #keep an untouched copy for the "before cleaning" plot later
df = df.replace({'': None}) #replace empty strings with None
df.isnull() #check whether any value is null

#remove whitespace from the column names
df.columns = df.columns.str.strip()
df

#removing redundancy from columns
df["Price"] = df["Price"].replace({'RM': ''}, regex=True)
df["Camera"] = df["Camera"].replace({'MP': ''}, regex=True)
df["Battery"] = df["Battery"].replace({'Battery': ''}, regex=True)
df

#renaming columns
df = df.rename(columns={"Price": "Price(RM)"})
df = df.rename(columns={"Camera": "Camera(MP)"})
df

#dropping unrelated column
df = df.drop("WiFi", axis=1)
df
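After stripping "RM" and "MP", the Price(RM) and Camera(MP) columns are still strings, so charts and aggregations treat each distinct value as a category. If numeric columns are wanted, here is a minimal conversion sketch, assuming the renamed columns above (values that cannot be parsed, such as multi-lens camera strings, become NaN):
# Strip thousands separators and whitespace, then convert to numbers.
# errors="coerce" turns anything unparsable into NaN instead of raising.
df["Price(RM)"] = pd.to_numeric(
    df["Price(RM)"].str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)
df["Camera(MP)"] = pd.to_numeric(df["Camera(MP)"].str.strip(), errors="coerce")
df.dtypes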
DATA VISUALIZATION
The data from the website is already presented nicely, but we can visualize some attributes on their own, as shown below:
import numpy as np
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import math

#example of data visualization for an attribute without any transformation/cleaning
df.dtypes
count_each_tag = df['Charging'].value_counts()
plt.figure()
plt.bar(count_each_tag.index, count_each_tag.values)
plt.xlabel('Type of charging')
plt.ylabel('Number of Apple products')
plt.show()

#example of data visualization for price before cleaning (using the raw copy)
count_each_tag = df_raw['Price'].value_counts()
plt.figure(figsize=(15, 15))
plt.bar(count_each_tag.index, count_each_tag.values)
plt.xlabel('Price')
plt.ylabel('Number of Apple products')
plt.show()

#example of data visualization for price after cleaning
count_each_tag = df['Price(RM)'].value_counts()
plt.figure(figsize=(8, 8))
plt.bar(count_each_tag.index, count_each_tag.values)
plt.xlabel('Price(RM)')
plt.ylabel('Number of Apple products')
plt.show()
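Once Price(RM) has been converted to numbers (see the conversion sketch at the end of the Transform step), the price distribution can also be drawn as a histogram instead of one bar per distinct price:
# Histogram of numeric prices; NaN values from the coercion step are dropped.
plt.figure(figsize=(8, 5))
plt.hist(df["Price(RM)"].dropna(), bins=15)
plt.xlabel("Price (RM)")
plt.ylabel("Number of Apple products")
plt.show()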
STEP 3: LOAD
Since the check for missing data shows that every row and column is filled and there is no missing data, we can proceed to save the dataframe to a CSV file.
df.to_csv("AppleMalaysia_data.csv", index=False, encoding="utf-8")
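Saving to CSV is enough for this project, but the Load step could just as easily target a database. A minimal sketch using Python's built-in sqlite3 together with pandas' to_sql (the file and table names are arbitrary examples, not part of the original project):
import sqlite3

# Write the cleaned dataframe into a local SQLite database.
conn = sqlite3.connect("apple_malaysia.db")   # hypothetical file name
df.to_sql("apple_products", conn, if_exists="replace", index=False)
conn.close()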
I hope you enjoyed reading the article!
You can find the full code here: