Have you ever been given a few different spreadsheets and had to combine them into one?
- Use clunky VLOOKUPs in Excel
- Copy and paste (hoping you don’t break something)
- Give up and ask a more technical coworker for help
It’s a waste of time.
And it doesn’t take you any closer to becoming a six-figure data analyst. Never mind what your boss thinks of you not being able to work with data. Embarrassing.
Blending data is a breeze in Python.
People see Python and run for the hills.
I get it. And I used to be the same way. But this is expertise you can build up over time.
Start small with this tutorial. Master the fundamentals. Go from being clueless to solving real problems and turning messy data into something useful.
Introduction to Data Blending in Python
This step-by-step tutorial covers the basics of blending different data sets in Python.
Here’s why:
- simplify your data management
- become the go-to Python data person
- level up your Python data analytics skills
- hands-on experience with a real-world example
- increased confidence in your data career journey
Prerequisites
- A Google account to access Google Colab.
- Basic understanding of Python and pandas library.
- Here’s a copy of the notebook for you to follow along
Alright, let’s create this thing.
Step 1: Setup and Data Loading
Start by importing necessary libraries and loading our data.
Paste this code into a new Google Collab notebook cell and run it.
# Import necessary libraries import pandas as pd import matplotlib.pyplot as plt # Load data from CSV files sales_data = pd.read_csv('sales_data.csv') customer_details = pd.read_csv('customer_details.csv') # Quick view of the datasets print(sales_data.head()) print(customer_details.head())
Step 2: Data Blending with Different Joins
Next, let’s explore various join operations: left, right, and inner joins.
Left join keeps all rows from the left table, right join from the right table, and inner join only the rows that match in both tables.
# Left join left_joined_data = pd.merge(sales_data, customer_details, on='customer_id', how='left') print(left_joined_data.head()) # Right join right_joined_data = pd.merge(sales_data, customer_details, on='customer_id', how='right') print(right_joined_data.head()) # Inner join inner_joined_data = pd.merge(sales_data, customer_details, on='customer_id', how='inner') print(inner_joined_data.head())
Step 3: Data Analysis and Visualization
Now that our joins are created, let’s create a visualization to understand the impact of different join types.
- Left Join: Includes all records from the left table (sales data) and matches with the right table (customer details) wherever possible.
- Right Join: Includes all records from the right table (customer details) and matches with the left table (sales data) wherever possible.
- Inner Join: Includes only the records that have matching entries in both tables.
# Counting records for each join type left_count = left_joined_data['customer_id'].count() right_count = right_joined_data['customer_id'].count() inner_count = inner_joined_data['customer_id'].count() # Creating a bar plot with descriptive labels join_descriptions = ['Left Join \n(All Sales, Matched Customers)', 'Right Join \n(All Customers, Matched Sales)', 'Inner Join \n(Only Matched Sales and Customers)'] counts = [left_count, right_count, inner_count] colors = ['#FF9999', '#99CC99', '#9999FF'] # Soft red, green, and blue plt.figure(figsize=(10, 6)) bars = plt.bar(join_descriptions, counts, color=colors) # Adding data labels for bar in bars: yval = bar.get_height() plt.text(bar.get_x() + bar.get_width()/2, yval, int(yval), verticalalignment='bottom', horizontalalignment='center') plt.xlabel('Type of Join Operation') plt.ylabel('Record Count') plt.title('Comparative Analysis of Data Blending Techniques') plt.show()
Looks great!