Mastering Data Cleaning in Python: A Practical Guide for Growth Marketers

Nicola Rubino
July 2, 2023
No Comments
in Data Science

What is Data Cleaning?

Table of Contents

What is Data Cleaning?

Are you a Growth Marketer seeking a practical guide on data manipulation using pandas? This blog post will walk you through the process of data preparation and cleaning a sales marketing CSV file in pandas.

As a marketer, it is crucial to perform data transformations before conducting analysis, such as adding, removing, or reordering columns, renaming columns, and handling missing values. When done correctly, data manipulation can help marketing teams improve data quality, gain valuable insights from their campaigns, and enhance their ability to predict which campaigns will yield a higher return on investment (ROI).

In the next section, I will delve into the details of marketing data cleaning and provide a comprehensive guide on how to perform it using pandas.

Let’s get started!

What is Data Cleaning?

First and foremost, it is crucial to gain a deeper understanding of the concept of data cleaning and preparation in pandas.

Data cleaning refers to the process of identifying and rectifying any inaccuracies or inconsistencies within datasets. When combining data from various sources, there is a potential risk of encountering duplicated or mislabelled data, as well as data that may end up in the wrong dataset altogether.

Incomplete or incorrect data often arises due to variations in data collection methods across different tools or systems. Data cleansing plays a pivotal role in marketing analytics, as it consumes a significant portion of project efforts—approximately 60%—devoted to data processing and exploration (Baig, 2021). However, when executed correctly, data processing can unveil substantial value and insights.

It is important to highlight that utilizing a dataset containing erroneous information can lead to misleading outcomes in subsequent analyses, distorting the true reality. Marketers may unknowingly make adjustments to their strategies based on inaccurate aggregated data and metrics.

Data Cleaning Challenges for Startups

Data cleaning challenges can be divided into two groups: single sources and multi-source. When loading data from sources such as files, we may encounter issues because they typically lack a predefined data schema that would regulate the quality of the data being inputted and stored. However, these constraints are usually not sufficient to prevent errors and inconsistencies.

Below, you will find potential data issues you might encounter when working with DataFrames

Missing values
Misspellings
Wrongly split columns
Multiple values entered in a single data row
Duplicates: A single record was entered more than once.
Incorrect dependencies (ie. Email + User)
The same name is used for different objects in more DataFrames

At this point, you may be wondering, how to resolve these errors? Let’s see it right away in the following steps.

1) Loading the CSV File

When using pandas, you should use the following command to import pandas and load a new CSV file. Let’s imagine you work as a Growth Marketing Manager or Analyst for a Startup. The CSV file may contain revenue records of customers, their acquisition costs, product types sold, and information about the advertising campaigns and channels that brought in new customers.

Open Jupyter and insert the following lines of code:

import pandas as pd
marketing = pd.read_csv ('marketing.csv')
marketing.head()

	Year	Campaign	name.v2	Revenue	Cost	Leads
₀	2022	Lead Gen	Mar	€956	€956	579
1	2022	Lead Gen	May	€456	€556	262
2	2022	Lead Gen	Jul	€856	€756	356
3	2023	Lead Gen	NaN	NaN	€5	2

Take a look at the table and you will notice a few issues with this dataset. The column named ‘campaign name’ has been split into two different columns. Additionally, the value at index 3 of the ‘revenue’ column is recorded as ‘Not a Number’.

If you are a beginner and wish to learn more about creating your first DataFrames in pandas, I recommend reading my previous blog post. It is a practical blog that provides step-by-step examples, allowing you to acquire new skills in pandas for marketing.

2) Handling missing value

Once we have loaded the CSV file, we can address the first issue. We will now convert the NaN values to blank values so that we can combine the ‘Campaigns’ columns.

marketing.fillna(' ')

Here you can see the NaN errors resolved.

	Year	Campaign	Name	rev.v2	Cost	Leads
₀	2022	Lead Gen	Mar	€956	€656	579
1	2022	Lead Gen	May	€456	€556	262
2	2022	Lead Gen	Jun	€856	€756	356
3	2023	Lead Gen			€5	2

3) Rename & Merge Columns

From the table above, you can see that the column ‘campaign name’ has been incorrectly divided into two parts, and there is also a misspelling in the ‘Revenue’ column.

Let’s resolve these issues!

marketing = marketing.rename(columns = {'revenue.v2': 'Revenue'}, axis = 1)

marketing ['Campaign Name'] = marketing['Campaign]+ ' ' +marketing['Name] 
marketing = marketing.drop(['Campaign', 'Name'], axis=1) 
marketing.head()

marketing [‘Campaign Name’] -> ‘Campaign name’ is the name of the new column
marketing[‘Campaign]+ ‘ ‘ -> We’re concatenating 2 new columns by adding a ‘ ‘ blank space between the two columns

4) Delete Useless Columns

We have just created two new columns by combining two already existing columns. Now we need to delete the previous columns using the following command:

marketing = marketing.drop(['Campaign', 'Name'], axis=1)  
marketing.head()

‘Campaign’, ‘Name’ are the two columns you will want to delete
axis = 1: refers to horizontal axis or rows
axis = 0: refers to vertical axis or columns.

By calling the function marking.head() you should obtain a new DataFrame with the following outputs:

	Year	Campaign Name	Revenue	Cost	Leads
₀	2022	Lead Gen Mar	€956	€656	798
1	2022	Lead Gen May	€456	€556	262
2	2022	Lead Gen Jul	€856	€756	306
3	2023	Lead Gen		€5	2

5) Create A New KPI ‘Cost-per-Lead’

The formula for calculating the cost per lead is quite simple. All you need to do is take your total marketing cost and divide it by the total number of new leads generated. By performing this calculation, you will obtain the cost per lead (CPL), which is a valuable metric for evaluating the effectiveness of your marketing efforts.

marketing['CPL'] = marketing['Costs'] / marketing['Leads']
marketing.head()

	Year	Campaign Name	Revenue	Cost	Leads	CPL
₀	2022	Lead Gen March	€956	€656	798	€0.83
1	2022	Lead Gen May	€456	€556	262	€2.12
2	2022	Lead Gen July	€856	€756	306	€2.47
3	2023	Lead Gen		€5	2	€2.5

Conclusion

In conclusion, this blog post provided a practical guide on data manipulation using pandas for growth marketers. We covered essential concepts such as data manipulation, preparing and cleaning a sales marketing CSV file for python.

Data manipulation plays a crucial role in marketing analytics, as it allows marketers to improve data quality, gain valuable insights from campaigns, and make informed decisions. By understanding how to add, remove, rename columns, handle missing values, and perform calculations, marketers can create predictive campaigns evaluating and choosing the best segmentation approach.

Remember, data manipulation is an ongoing process, and continuous improvement and exploration of data are essential for staying ahead in the dynamic marketing landscape. With pandas as a powerful tool in your arsenal, you are well-equipped to tackle data challenges and drive growth in your marketing initiatives.

Happy data manipulation and marketing analysis! Reach out to me if you have any questions!

Bibliography

Govindan, G., Baig, M. R., & Shrimali, V. R. (2021). Data Science for Marketing Analytics: A Practical Guide to Forming a Killer Marketing Strategy Through Data Analysis with Python. Packt.

Hadley, W., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.

McKinney, W. (2018). Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media.

Paskhaver, B. (2021). Pandas in Action. Manning Publications.

VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O’Reilly Media.

Wickham, H., & Grolemund, G. (2017). R packages: Organize, test, document, and share your code. O’Reilly Media.

Zaki, M. J., & Meira Jr, W. (2014). Data mining and analysis: Fundamental concepts and algorithms. Cambridge University Press.

https://nicorubino.com/wp-content/uploads/2023/07/clean-data.jpg 2048 768 Nicola Rubino Nicola Rubino https://secure.gravatar.com/avatar/67217918279b1a30fcad305e38dfdc51?s=96&d=mm&r=g July 2, 2023 February 25, 2024