Data Engineering Project — Online Retail Store

<p>In this project, I will step into the shoes of a data engineer/BI developer working for an online retail service. The service lets users browse its website and order items online.</p>
<p>The managers at the company want us to provide a simple analysis of the data. To complete the assignment, we gathered the transaction data from the data warehouse.</p>
<h1>Article Steps</h1>
<ol>
<li>Data Quality Check and Cleaning</li>
<li>Creating a Tableau Dashboard</li>
</ol>
<h1>Step #1 &mdash; Data Quality Check and Cleaning</h1>
<p>Before jumping to the analysis step, we need to check the quality of the data and clean up the issues we can identify. Let&rsquo;s print some general information about the data:</p>
<pre>
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the raw transactions; the file is not UTF-8 encoded
df = pd.read_csv(&quot;Online Retail Data Set.csv&quot;, encoding=&quot;ISO-8859-1&quot;)
df.head()</pre>
<p><iframe frameborder="0" height="234" scrolling="no" src="https://towardsdev.com/media/6ddc87b486398154247d89adf1c27a92" title="a" width="680"></iframe></p>
<p>Sample data</p>
<pre>
df.info()</pre>
<pre>
Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    536641 non-null  object
 1   StockCode    536641 non-null  object
 2   Description  535187 non-null  object
 3   Quantity     536641 non-null  int64
 4   InvoiceDate  536641 non-null  datetime64[ns]
 5   UnitPrice    536641 non-null  float64
 6   CustomerID   401604 non-null  float64
 7   Country      536641 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.8+ MB</pre>
<pre>
df[[&quot;Quantity&quot;, &quot;UnitPrice&quot;, &quot;CustomerID&quot;]].describe()</pre>
<pre>
            Quantity      UnitPrice     CustomerID
count  531285.000000  531285.000000  397924.000000
mean       10.655262       3.857296   15294.315171
std       156.830323      41.810047    1713.169877
min         1.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13969.000000
50%         3.000000       2.080000   15159.000000
75%        10.000000       4.130000   16795.000000
max     80995.000000   13541.330000   18287.000000</pre>
<h2>Removing Transactions with Negative Quantity or UnitPrice</h2>
<p>It looks like the <code>Quantity</code> column has some issues: transactions with a negative quantity are illogical and should be dropped. The <code>UnitPrice</code> column has the same issue; the summary above shows a minimum unit price of -11062.06.</p>
<p><a href="https://towardsdev.com/data-engineering-project-online-retail-store-c6bcce764861">Website</a></p>
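<p>The cleaning rule described above could be sketched as follows. This is a minimal illustration, not necessarily the exact code used in the project: the helper name <code>drop_negative_transactions</code> and the sample rows are hypothetical, and it assumes that "negative" means strictly below zero, so rows with a value of exactly 0 are kept.</p>

```python
import pandas as pd

# Hypothetical sample rows that mimic three columns of the retail data set;
# the values are made up for illustration only.
sample = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "C536379", "A563186", "536366"],
        "Quantity": [6, -1, 1, 3],
        "UnitPrice": [2.55, 27.50, -11062.06, 0.0],
    }
)


def drop_negative_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose Quantity or UnitPrice is negative.

    Assumption: zero values are kept, because the rule only mentions
    negative values.
    """
    valid = (df["Quantity"] >= 0) & (df["UnitPrice"] >= 0)
    return df.loc[valid].copy()


cleaned = drop_negative_transactions(sample)
print(cleaned)
```

<p>On the real data, the same helper would be called as <code>df = drop_negative_transactions(df)</code>. Building a boolean mask first keeps the rule readable and makes it easy to count how many rows would be dropped before actually removing them.</p>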