Exploring Data Analysis Tools: NumPy and pandas as Core Libraries
It’s been a while since I last posted here, and I believe I owe you an explanation for the brief hiatus. I was away from blogging for a bit because I had to focus on a project titled “Predicting electricity access in Sub-Saharan Africa using Machine learning models”. This project stressed the hell out of me, especially because I am still new to this and had to do tons of research, but I’m back and eager to write again.
From the title above, I believe you already know what I want to write about. If you’re learning data science or machine learning, two libraries you’ll use constantly are NumPy and Pandas. These tools make it easier to handle large datasets, perform numerical operations, and explore your data efficiently, all within Python.
In this post, we’ll walk through practical exercises that cover essential NumPy and Pandas operations. Each example is short, hands-on, and beginner-friendly, perfect for reinforcing your core skills.
What is NumPy?
NumPy, short for "Numerical Python," is a free, open-source, well-optimized library for the Python programming language. It adds support for large, multi-dimensional arrays (matrices and tensors are common examples) and is widely used in scientific computing and data analysis.
Now let’s look at a few examples.
Example 1
Write a NumPy program to test a given array element-wise for NaN values.
import numpy as np
# Sample array
arr = np.array([1, np.nan, 3, np.nan, 5])
# Test element-wise for NaN
result = np.isnan(arr)
print("NaN check:\n", result)
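Once you have the boolean mask from np.isnan, you can do more than print it. The short sketch below reuses the same sample array to count the NaN values, filter them out, and compute a mean that ignores them. The variable names (mask, nan_count, clean, mean_ignoring_nan) are just my own picks for illustration.

```python
import numpy as np

arr = np.array([1, np.nan, 3, np.nan, 5])

# Boolean mask: True where the value is NaN
mask = np.isnan(arr)

# True counts as 1, so summing the mask counts the NaN values
nan_count = int(mask.sum())
print("Number of NaN values:", nan_count)

# Keep only the positions where the mask is False
clean = arr[~mask]
print("Array without NaN:", clean)

# np.nanmean computes the mean while skipping NaN values
mean_ignoring_nan = np.nanmean(arr)
print("Mean ignoring NaN:", mean_ignoring_nan)
```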
Example 2
Write a NumPy program to test whether two arrays are element-wise equal within a tolerance.
# Sample arrays
a = np.array([1.0, 1.5, 3.0])
b = np.array([1.0, 1.51, 2.99])
# Element-wise equal within tolerance
equal_within_tol = np.allclose(a, b, atol=0.02)
print("Equal within tolerance:", equal_within_tol)
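Note that np.allclose gives back a single True or False for the whole comparison. If you want to see which positions pass, np.isclose returns one boolean per element. Here is a small sketch with the same arrays; the tighter tolerance of 0.001 is only an assumption I picked to show a failing case.

```python
import numpy as np

a = np.array([1.0, 1.5, 3.0])
b = np.array([1.0, 1.51, 2.99])

# One boolean per position: |a - b| <= atol + rtol * |b|
per_element = np.isclose(a, b, atol=0.02)
print("Per-element check:", per_element)

# Checking that every position passed gives the same answer as np.allclose
print("All close:", per_element.all())

# A tighter tolerance shows which positions fail
strict = np.isclose(a, b, atol=0.001)
print("Strict check:", strict)
```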
I also got to work on a dataset, WHO POP TB all.csv, during this period, so here is one example from it to give you better context. Note that the selection itself is done on a pandas DataFrame (df) rather than a NumPy array.
Example 3
In the code cell below, select and display the first eight rows from the 'Country' and 'TB deaths' columns.
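# Assumes df was loaded with pandas, e.g. pd.read_csv('WHO POP TB all.csv'); .loc slices include the end label, so :7 gives rows 0 to 7
df.loc[:7, ['Country', 'TB deaths']]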
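If you don't have the file at hand, the sketch below shows the same selection on a small stand-in DataFrame. The values are made-up placeholders, not figures from the WHO dataset, and the 'Population' column is only an assumption to give the selection something to leave out. It also shows why :7 returns eight rows: .loc slices by label and includes the end label, while .iloc slices by position and excludes it.

```python
import pandas as pd

# Made-up placeholder values, NOT figures from the WHO dataset
df = pd.DataFrame({
    "Country": [f"Country {i}" for i in range(10)],
    "TB deaths": [10 * i for i in range(10)],
    "Population": [1000 * (i + 1) for i in range(10)],
})

# Label-based: rows labelled 0 through 7, end label included
by_label = df.loc[:7, ["Country", "TB deaths"]]

# Position-based: positions 0 through 7, end position excluded
by_position = df.iloc[:8][["Country", "TB deaths"]]

print(by_label)
print("Same result:", by_label.equals(by_position))
```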
What is Pandas?
Pandas is an open-source Python library for data manipulation and analysis. It provides data structures (mainly Series and DataFrame) and functions for working with data efficiently.
Here are a few examples
Example 4
Write a Python program to add, subtract, multiply and divide two Pandas Series. Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
import pandas as pd
a = pd.Series([2, 4, 6, 8, 10])
b = pd.Series([1, 3, 5, 7, 9])
print("Addition:\n", a + b)
print("Subtraction:\n", a - b)
print("Multiplication:\n", a * b)
print("Division:\n", (a / b).round(2))
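One thing worth knowing about Series arithmetic: Pandas matches values by index label, not by position. In the example above both Series share the default index 0 to 4, so you don't notice it. The sketch below uses made-up labels (x, y, z, w) to show what happens when the indexes differ, and how Series.add with fill_value can treat a missing side as 0.

```python
import pandas as pd

# Hypothetical labels chosen only to make the alignment visible
a = pd.Series([2, 4, 6], index=["x", "y", "z"])
b = pd.Series([1, 3, 5], index=["y", "z", "w"])

# Labels present in only one Series produce NaN
aligned_sum = a + b
print("Aligned sum:\n", aligned_sum)

# fill_value=0 substitutes 0 for the missing side before adding
filled_sum = a.add(b, fill_value=0)
print("Sum with fill_value=0:\n", filled_sum)
```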
Example 5
Write a Python program to convert a NumPy array to a Pandas Series.
Sample NumPy array: [10 20 30 40 50]
Converted Pandas Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64
import numpy as np
import pandas as pd
arr = np.array([10, 20, 30, 40, 50])
s = pd.Series(arr)
print("Converted Pandas Series:")
print(s)
Example 6
Write a Pandas program to add some data to an existing Series.
Sample Output:
Original Data Series:
0       100
1       200
2    python
3    300.12
4       400
dtype: object
Data Series after adding some data:
0       100
1       200
2    python
3    300.12
4       400
0       500
1       php
dtype: object
import pandas as pd
s1 = pd.Series([100, 200, 'python', 300.12, 400])
print("Original Data Series:")
print(s1)
s2 = pd.Series([500, 'php'])
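# Concatenate the two Series; ignore_index=True renumbers the result 0 to 6 (the sample output above keeps the original labels instead)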
combined = pd.concat([s1, s2], ignore_index=True)
print("\nData Series after adding some data:")
print(combined)
The Project: Exploratory Climate Data Analysis across 6 Cities
Problem Statement
The goal of this project is to analyse climate data from 6 different cities (Beijing, Delhi, Moscow, London, Cape Town and Brasilia) to understand variations in weather patterns such as temperature, rainfall, and humidity.
Data Methodology
Step 1: Load each CSV (comma-separated values) file from your computer into Python as a DataFrame and inspect it. I attached links to all the data files so you can follow along (Beijing_PEK_2014.csv, Brasilia_BSB_2014.csv, CapeTown_CPT_2014.csv, Delhi_DEL_2014.csv, London_2014.csv, Moscow_SVO_2014.csv).
df1 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\Beijing_PEK_2014.csv")
df2 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\Brasilia_BSB_2014.csv")
df3 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\CapeTown_CPT_2014.csv")
df4 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\Delhi_DEL_2014.csv")
df5 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\London_2014.csv")
df6 = pd.read_csv(r"c:\Users\PC\dataraflow-cohort-1\Week-6 (Numpy & Pandas-1)\3-Data Analysis-Pandas-1\Moscow_SVO_2014.csv")
I used df1.head() to inspect the first few rows of the df1 DataFrame.
Step 2: Data cleaning and exploration: count the null values in each column and the duplicate rows
df1.isnull().sum()
df1.duplicated().sum()
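To make it clearer what these two calls return, here is a minimal sketch on a tiny stand-in DataFrame. The rows are invented for illustration and are not taken from the city files.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Date": ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-03"],
    "Mean TemperatureC": [-2.0, 0.5, 0.5, np.nan],
})

# One count per column: how many values are missing
null_counts = toy.isnull().sum()
print(null_counts)

# A single number: how many rows repeat an earlier row exactly
dup_count = int(toy.duplicated().sum())
print("Duplicate rows:", dup_count)

# drop_duplicates keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates()
print("Rows after dropping duplicates:", len(deduplicated))
```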
Step 3: Generate summary statistics of the DataFrame
df1.describe()
Step 4: Analysis
# Convert 'Date' to datetime; errors='coerce' turns unparseable values into NaT instead of raising an error
df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')
# Now extract the month name
df1['Month'] = df1['Date'].dt.month_name()
# Group by month and compute average temperature
monthly_avg_temp = df1.groupby('Month')['Mean TemperatureC'].mean().sort_values(ascending=False)
print("Average Temperature per Month:")
print(monthly_avg_temp)
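Sorting with sort_values ranks the months from warmest to coldest. If you would rather read the result in calendar order, one option is to reindex by the month names. The sketch below runs on invented values, not the Beijing data, and also shows what errors='coerce' does to a bad date.

```python
import pandas as pd

# Invented daily values for illustration; real data would come from the CSV
toy = pd.DataFrame({
    "Date": ["2014-01-10", "2014-01-20", "2014-07-05", "2014-07-15", "not a date"],
    "Mean TemperatureC": [-4.0, -2.0, 26.0, 28.0, 10.0],
})

# The unparseable string becomes NaT rather than raising an error
toy["Date"] = pd.to_datetime(toy["Date"], errors="coerce")
toy["Month"] = toy["Date"].dt.month_name()

# Rows whose Month is missing are left out of the groups by default
monthly = toy.groupby("Month")["Mean TemperatureC"].mean()

# English month names in calendar order, January to December
month_order = pd.date_range("2014-01-01", periods=12, freq="MS").month_name()

# Keep only the months that actually appear, in calendar order
monthly_calendar = monthly.reindex([m for m in month_order if m in monthly.index])
print(monthly_calendar)
```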
Here are my key findings:
Beijing has the highest maximum temperature, 42 degrees, which occurred in May, and also the highest mean temperature, 31 degrees.
Moscow has the lowest minimum temperature, -26 degrees, and also the lowest mean temperature, -21 degrees.
July had the highest mean temperature in every city except Brasilia and Cape Town.
Beijing also recorded the highest precipitation of all the cities, about 75.95 mm.
There was no record of rainfall in Moscow.
London has the highest mean humidity at 96%, followed by Delhi at 94%, while Beijing has the lowest humidity at 8%.
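Findings like the ones above can be pulled out with max/min together with idxmax/idxmin, which return the row label where the extreme value sits. The sketch below is only an illustration: the values are invented, and the column names 'Max TemperatureC' and 'Min TemperatureC' are my assumption modelled on 'Mean TemperatureC', so check them against the headers in your own files.

```python
import pandas as pd

# Invented values, NOT the 2014 measurements; column names are assumed
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-05-01", "2014-05-02", "2014-12-01"]),
    "Max TemperatureC": [30, 42, 5],
    "Min TemperatureC": [15, 20, -8],
})

# idxmax / idxmin give the index label of the extreme value
hottest_row = toy.loc[toy["Max TemperatureC"].idxmax()]
coldest_row = toy.loc[toy["Min TemperatureC"].idxmin()]

print("Highest maximum:", hottest_row["Max TemperatureC"],
      "in", hottest_row["Date"].month_name())
print("Lowest minimum:", coldest_row["Min TemperatureC"],
      "in", coldest_row["Date"].month_name())
```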
Challenges
One of the biggest hurdles came right at the start: loading all the data files into my notebook and combining them so that I wouldn’t need an individual DataFrame for each city. I tried all I could but still ended up with individual DataFrames. The solution I found later would have had me move everything to Google Colab, and to be honest, time wasn’t on my side.
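For anyone facing the same hurdle, one approach that may work in a local notebook is to read each file in a loop, tag the rows with a city name, and stack everything with pd.concat. This is a sketch under assumptions, not what I actually ran: the in-memory strings stand in for the CSV files so the code runs anywhere, the values are invented, and the 'City' column is a name I made up.

```python
import io
import pandas as pd

# Stand-ins for the CSV files so the sketch runs anywhere. With real files,
# the dictionary values would be file paths such as "Beijing_PEK_2014.csv".
sources = {
    "Beijing": io.StringIO("Date,Mean TemperatureC\n2014-01-01,-2\n2014-01-02,0\n"),
    "London": io.StringIO("Date,Mean TemperatureC\n2014-01-01,7\n2014-01-02,8\n"),
}

frames = []
for city, source in sources.items():
    frame = pd.read_csv(source)
    frame["City"] = city  # hypothetical column so rows stay identifiable
    frames.append(frame)

# Stack all cities into one DataFrame with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)

# With one DataFrame, per-city statistics need a single groupby
city_means = combined.groupby("City")["Mean TemperatureC"].mean()
print(city_means)
```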
The next challenge was getting accustomed to the terms in the dataset (I mean the column headers) so I could understand them better, though this turned out to be a minor one.
Another challenge was missing values, especially in the Max Gust, Cloud cover and Events columns. If not handled properly, they could distort the analysis and reduce data quality.
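A first step with missing values is to measure how much of each column is missing before deciding what to do. The sketch below uses invented rows, and the column spellings are assumed from the names mentioned above. Filling a blank Events entry with "No event" is an assumption too (that blank means nothing was recorded), so verify it against the data source before relying on it.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; column spellings are assumed
toy = pd.DataFrame({
    "Max Gust": [40.0, np.nan, np.nan, 55.0],
    "Cloud cover": [3.0, 5.0, np.nan, 7.0],
    "Events": ["Rain", None, None, "Snow"],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = toy.isnull().mean()
print(missing_share)

# Assumption: a blank event means no event was recorded that day
toy["Events"] = toy["Events"].fillna("No event")

# Numeric summaries such as mean() skip NaN by default
mean_cloud = toy["Cloud cover"].mean()
print("Mean cloud cover:", mean_cloud)
```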
Conclusion
This analysis not only sharpened my technical skills but also deepened my appreciation of how data can tell powerful environmental stories. To be honest, there was still so much more to uncover, but having a clear scope helped me narrow my analysis down to what I should focus on.
I’m looking forward to discovering what the next set of tasks will bring and how they’ll help me grow further.
