How to Clean and Filter Data efficiently in Python Like a Pro

Introduction

Data manipulation is the cornerstone of effective data analysis. Whether you're a data scientist, analyst, or enthusiast, mastering these skills transforms raw, messy data into actionable insights. In this comprehensive article, we'll explore key data manipulation concepts through a real-world weather analysis project.

Prerequisities

To effectively follow along with this article and get the most out of it, it is essential that you:

Have a Python 3.7+ installed on your computer
A Jupyter environment (Notebook or Lab) for running the code
Basic knowledge of Python programming — variables, data types, operations, control structures, and functions
Some basic familiarity with pandas and matplotlib for data analysis and visualization

Getting the data

We're continuing our analysis with the London 2014 dataset from last week, and have supplemented it with new data from Weather Underground which provides historical weather data from major airports worldwide. This second dataset contains daily weather records for London Heathrow (LHR) throughout 2023, including comprehensive metrics such as temperature, humidity, dew point, precipitation, and wind speed. If you’d like to explore the data yourself, I’ve shared the links below:

Understanding Data Manipulation

Data manipulation refers to the process of modifying, organizing, and transforming data to make it more suitable for analysis. It's the crucial bridge between raw data collection and meaningful insights. Think of it as preparing ingredients before cooking a complex dish - without proper preparation, your final results will be disappointing.

Let's explore the essential data manipulation techniques through our weather analysis case study.

Part 1: Laying the Groundwork with Basic Data Cleaning

Before any analysis can begin, we need to ensure our data is clean and structured properly. Let’s take a look at the steps taken to achieve this.

Taming Messy Data

The raw data often comes with inconsistencies. In this case, the wind direction column had an HTML tag (<br />) stuck to its values. To clean this we use the code below

# Correcting column names and cleaning values
london = london.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})
london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')

This code renames the column, strips the unwanted characters from each value, and converts the data type to a number, making it ready for analysis.

Filtering and Querying Data

With clean data, we can start asking questions. For example, we can filter the dataset to find all days with northerly winds (greater than or equal to 350 degrees and less than or equal to 10 degrees ).

northerly_wind = london[(london['WindDirDegrees'] >= 350) | (london['WindDirDegrees'] <= 10)]
print(northerly_wind[['GMT', 'WindDirDegrees']])

We also convert the date column (GMT) to a proper datetime format, which unlocks powerful time-based filtering.

london['GMT'] = pd.to_datetime(london['GMT'])
dates = london['GMT']
start = datetime(2014, 4, 1)
end = datetime(2014, 4, 11)
london[(dates >= start) & (dates <= end)]

Handling Missing Data

Real-world data is rarely perfect. It's common to have missing values. We can easily identify these gaps to understand the limitations of our dataset.

missing_gustdata = london[london['Max Gust SpeedKm/h'].isnull()]
missing_gustdata.head()

Part 2: Introduction to Data Visualization

A key part of data analysis is visualization. It helps us see trends and patterns that are difficult to spot in a table of numbers.

Visualizing Temperature Trends

A simple line plot of the daily temperatures immediately shows us the seasonal flow of the year.

london[['Mean TemperatureC', 'Max TemperatureC', 'Min TemperatureC']].plot(grid=True)

This would generate a chart showing the peaks of summer and the troughs of winter, with the max, min, and mean temperatures clearly delineated.

Working with Time Series Indices

To make time-based analysis smoother, we set the date column as the index of our DataFrame.

london['GMT'] = pd.to_datetime(london['GMT'], errors='coerce')
london = london.set_index('GMT')
print(london.index) # Confirm it's a DatetimeIndex

This allows us to filter data by time attributes effortlessly. For instance, we can isolate the spring months (March, April, May) and plot the humidity.

spring = london[london.index.month.isin([3,4,5])]
spring['Mean Humidity'].plot(grid=True, figsize=(10,5))

Part 3: The Main Project - Finding the Ideal Holiday

Now, we apply all we've learned to the main task: finding the best summer break in the 2023 Weather Underground data. It is worth noting that precipitation levels remained at 0 inches throughout the year, indicating an absence of recorded rainfall during the period under review.

Step 1: Acquiring and Preprocessing the Data

We load the Weather Underground dataset, inspect the dataset and clean the column names by stripping spaces and replacing them with underscores, which is a best practice for coding.

df = pd.read_csv("LHR_Dataset.csv")
df.head()
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')

Step 2: Format and Isolate the Summer Months

Next, we format the date column to a datetime64 format, set the date as index and since our goal is a summer holiday, we filter the data. For the Northern Hemisphere, summer is June, July, and August.

df['Date'] = pd.to_datetime(df['Date'])
df.index = df['Date']
summer_north = df[df.index.month.isin([6, 7, 8])]

We can then plot the average temperature throughout the summer to get a preliminary feel for the data.

ax = summer_north['Avg_Temperature_(°F)'].plot(grid=True, figsize=(10,5))
ax.set_xlabel('Months')
ax.set_ylabel('Avg Temperature (°F)')

After visualizing the average temperature trend for the Northern Hemisphere, we can repeat the same process for the Southern Hemisphere to observe its summer temperature pattern (December–February).

Step 3: Engineering a "Comfort Score"

This is the core of our analysis. Instead of relying on just temperature, we create a holistic score that considers multiple weather factors important for a pleasant holiday.

The comfort_score function assigns points based on:

Temperature: Ideal range (70-85°F) gets the most points.
Dew Point: A lower dew point (<65°F) indicates less muggy, more comfortable air.
Humidity: The ideal range for comfort is 40-60%.
Wind Speed: A gentle breeze (5-15 mph) is pleasant.
Pressure: Stable pressure (29.9-30.2 inHg) suggests settled weather.
Precipitation: Dry days (Precipitation == 0) are preferred.

# --- Define comfort scoring function ---
def comfort_score(row):
    score = 0

    # Temperature: ideal 70–85°F
    if 70 <= row['Avg_Temperature_(°F)'] <= 85:
        score += 2
    elif 60 <= row['Avg_Temperature_(°F)'] < 70 or 85 < row['Avg_Temperature_(°F)'] <= 90:
        score += 1

    # Dew Point: ideal ≤ 65°F
    if row['Avg_Dew_Point_(°F)'] <= 65:
        score += 2
    elif 65 < row['Avg_Dew_Point_(°F)'] <= 70:
        score += 1

    # Humidity: ideal 40–60%
    if 40 <= row['Avg_Humidity_(%)'] <= 60:
        score += 2
    elif 30 <= row['Avg_Humidity_(%)'] < 40 or 60 < row['Avg_Humidity_(%)'] <= 70:
        score += 1

    # Wind Speed: ideal 5–15 mph
    if 5 <= row['Avg_Wind_Speed_(mph)'] <= 15:
        score += 1

   # Pressure: stable weather 29.9–30.2 inHg
    if 29.9 <= row['Avg_Pressure_(in)'] <= 30.2:
        score += 2

    # Precipitation: dry days are better
    if row['Precipitation_(in)'] == 0:
        score += 1

    return score

df['Score'] = df.apply(comfort_score, axis=1)

Step 4: Finding the Best Two-Week Window

A single perfect day isn't enough for a holiday; we need a sustained period of good weather. We use a rolling average to find the best 14-day stretch.

# Compute 14-day rolling average of the comfort score
df['RollingScore'] = df['Score'].rolling(window=14).mean()

# Find the single highest 14-day average
best_idx = df['RollingScore'].idxmax()
best_start = best_idx - pd.Timedelta(days=13)
best_end = best_idx

print(f"Best 2-week vacation window: {best_start.date()} to {best_end.date()}")
print(f"Average comfort score: {df.loc[best_idx, 'RollingScore']:.2f}")

The Result: Based on the 2023 data, the algorithm determined that June 13th to June 26th was the ideal time for a holiday in London, with an average comfort score of 8.21.

Step 5: Visualizing the Result

Finally, we create a plot to visualize our finding, comparing the daily comfort scores to the 14-day rolling average and highlighting the best period.

import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(df.index, df['Score'], color='skyblue', alpha=0.6, label='Daily Comfort Score')
# ... (code to plot the rolling average and highlight the best period)
plt.legend()
plt.show()

Challenges Encountered

During this analysis, I faced two main challenges:

1. Data Accessibility: The Weather Underground website repeatedly blocked web scraping attempts, making it difficult to automate data extraction. To overcome this, I manually copied the data from the website into Microsoft Excel, then converted it into a CSV file for analysis. I’ve already completed this step and included the link to the prepared dataset , so there’s no need for you to repeat it when following along.

2. Incomplete Precipitation Data: The precipitation values were recorded as zero throughout the entire year, which is unrealistic and likely indicates missing or unrecorded data rather than actual dry conditions. This was taken into account during the analysis and interpretation of the results.

Conclusion

This project demonstrates the full lifecycle of a data analysis problem:

Cleaning messy real-world data.
Exploring it through filtering and visualization.
Engineering new features (like the comfort score) to answer a specific question.
Applying advanced techniques (rolling averages) to find a optimal solution.

While past weather doesn't guarantee future conditions, this data-driven approach gives you a much stronger foundation for planning than a simple guess. So the next time you plan a trip, why not let a little Python code point you in the right direction? Happy holidays😂

Connect with me on :

Linkedin

Github

Mastering Data Manipulation with Pandas: How to Clean, Filter, and Analyze Data Efficiently

Introduction

Prerequisities

Getting the data

Understanding Data Manipulation

Part 1: Laying the Groundwork with Basic Data Cleaning

Taming Messy Data

Filtering and Querying Data

Handling Missing Data

Part 2: Introduction to Data Visualization

Visualizing Temperature Trends

Working with Time Series Indices

Part 3: The Main Project - Finding the Ideal Holiday

Step 1: Acquiring and Preprocessing the Data

Step 2: Format and Isolate the Summer Months

Step 3: Engineering a "Comfort Score"

Step 4: Finding the Best Two-Week Window

Step 5: Visualizing the Result

Challenges Encountered

Conclusion

Comments

More from this blog

From Regression Basics to Churn Prediction: A Data Storytelling Journey

🚗 From Guesswork to Goldmine: How Machine Learning Predicts Car Prices With 93.5% Accuracy

Data Science in Real Estate: Building an Interpretable Machine Learning Pipeline for Housing Price Prediction

From Global Energy to Lagos Real Estate: A Data Scraping & Analysis Journey.

From Concrete to Code: A Civil Engineer's Data-Driven Look at Nigeria's Construction Trade

Command Palette

Introduction

Prerequisities

Getting the data

Understanding Data Manipulation

Part 1: Laying the Groundwork with Basic Data Cleaning

Taming Messy Data

Filtering and Querying Data

Handling Missing Data

Part 2: Introduction to Data Visualization

Visualizing Temperature Trends

Working with Time Series Indices

Part 3: The Main Project - Finding the Ideal Holiday

Step 1: Acquiring and Preprocessing the Data

Step 2: Format and Isolate the Summer Months

Step 3: Engineering a "Comfort Score"

Step 4: Finding the Best Two-Week Window

Step 5: Visualizing the Result

Challenges Encountered

Conclusion

Comments

More from this blog