Data Engineering Project — Online Retail Store
<p>In this project, I will step into the shoes of a data engineer/BI developer working for an online retail service. The service lets users browse its website and order items online.</p>
<p>The managers at the company want us to provide a simple analysis of the data. To complete the assignment, we gathered the transaction data from the data warehouse.</p>
<h1>Article Steps</h1>
<ol>
<li>Data Quality Check and Cleaning</li>
<li>Creating a Tableau Dashboard</li>
</ol>
<h1>Step #1 — Data Quality Check and Cleaning</h1>
<p>Before jumping to the analysis step, we need to check the quality of the data and fix any issues we can identify. Let’s print some general information about the data:</p>
<pre>
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the transactions file, reading it as ISO-8859-1 (Latin-1)
df = pd.read_csv("Online Retail Data Set.csv", encoding="ISO-8859-1")
# Preview the first five rows
df.head()</pre>
<p><iframe frameborder="0" height="234" scrolling="no" src="https://towardsdev.com/media/6ddc87b486398154247d89adf1c27a92" title="a" width="680"></iframe></p>
<p>Sample data</p>
<pre>
df.info()</pre>
<pre>
Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    536641 non-null  object
 1   StockCode    536641 non-null  object
 2   Description  535187 non-null  object
 3   Quantity     536641 non-null  int64
 4   InvoiceDate  536641 non-null  datetime64[ns]
 5   UnitPrice    536641 non-null  float64
 6   CustomerID   401604 non-null  float64
 7   Country      536641 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.8+ MB</pre>
<pre>
df[["Quantity", "UnitPrice", "CustomerID"]].describe()</pre>
<pre>
            Quantity      UnitPrice     CustomerID
count  531285.000000  531285.000000  397924.000000
mean       10.655262       3.857296   15294.315171
std       156.830323      41.810047    1713.169877
min         1.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13969.000000
50%         3.000000       2.080000   15159.000000
75%        10.000000       4.130000   16795.000000
max     80995.000000   13541.330000   18287.000000</pre>
<h2>Removing Transactions with Negative Quantity or UnitPrice</h2>
<p>Looks like the <code>Quantity</code> column has some issues. Transactions with a negative quantity are illogical and should be dropped. The same applies to the <code>UnitPrice</code> column, whose minimum in the summary above is -11062.06.</p>
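<p>One way to drop these rows is with a boolean mask. The snippet below is a minimal sketch, not necessarily the exact code used for this project: the helper name <code>drop_negative_transactions</code> and the tiny <code>sample</code> frame are made up for illustration, and only the column names <code>Quantity</code> and <code>UnitPrice</code> come from the data set. It keeps zero values, since only negative values are treated as invalid here.</p>

```python
import pandas as pd


def drop_negative_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose Quantity or UnitPrice is negative."""
    # Hypothetical helper: a row is kept only if both columns are non-negative.
    valid = (df["Quantity"] >= 0) & (df["UnitPrice"] >= 0)
    return df.loc[valid].copy()


# Made-up sample rows (not taken from the real file) to illustrate the filter.
sample = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A2", "A3", "A4"],
        "Quantity": [6, -2, 3, 1],
        "UnitPrice": [2.55, 1.25, -11062.06, 4.13],
    }
)

cleaned = drop_negative_transactions(sample)
print(f"Dropped {len(sample) - len(cleaned)} of {len(sample)} rows")
print(cleaned)
```

<p>On the real data, the same call would be <code>df = drop_negative_transactions(df)</code>. Returning a copy avoids pandas chained-assignment warnings if the cleaned frame is modified later.</p>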
<p><a href="https://towardsdev.com/data-engineering-project-online-retail-store-c6bcce764861">Website</a></p>