Pandas from basic to advanced for Data Scientists

Pandas makes life easier for any data scientist

sampath kumar gajawada
Towards Data Science

A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning — Hilary Mason

Where there is smoke, there is fire. Similarly, where there is data, pandas is never far away. It helps you explore and interpret data quickly.

Pandas is one of the most commonly used Python libraries for data manipulation and analysis.

In this article, I will cover the pandas concepts and tricks that make our lives easier, starting with the basics and moving on to more advanced topics. I will walk through pandas with the help of weather data.

Import Pandas and create a data frame

There are many ways to create a data frame: from files, lists, dictionaries, and so on. Here I create a data frame by reading data from a CSV file.

import pandas as pd
df = pd.read_csv("weather.csv")
df

Output

Weather data
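If you do not have a CSV file at hand, here is a minimal sketch of two of the other ways mentioned above: building a data frame from a dictionary and from a list of rows. The column names follow the weather data used in this article, but the values are made up for illustration and will not match the real weather.csv.

```python
import pandas as pd

# Hypothetical sample values; the real weather.csv may look different.
data = {
    "day": ["1/1/2017", "1/2/2017", "1/1/2017", "1/2/2017"],
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
}

# Option 1: a dictionary that maps column names to lists of values.
df_from_dict = pd.DataFrame(data)

# Option 2: a list of row tuples plus an explicit list of column names.
rows = list(zip(*data.values()))
df_from_rows = pd.DataFrame(rows, columns=list(data.keys()))

print(df_from_dict)
print(df_from_rows)
```

Both calls should give the same table; which one you pick mostly depends on whether your raw data is organized by column or by row.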

Select the specific column(s)

Sometimes you need to work with only specific columns. Suppose you would like to analyze how the temperature changes from day to day. In this case, we select the temperature and day columns.

df[['temperature', 'day']]
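One detail worth seeing in action is the difference between single and double brackets. The sketch below uses a small made-up data frame (the values are assumptions, not the real weather data): single brackets return one column as a Series, while a list of names returns a data frame.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/1/2017", "1/2/2017"],
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
})

# Single brackets with one name: the result is a Series.
temps = df["temperature"]

# Double brackets (a list of names): the result is a DataFrame,
# with the columns in the order you listed them.
subset = df[["temperature", "day"]]

print(type(temps).__name__)
print(type(subset).__name__)
print(subset)
```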

Rename the column

Pandas provides a simple function, rename, to change the name of a column or a set of columns.

df.rename(columns={'temperature': 'temp', 'event': 'eventtype'})
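A common surprise is that rename returns a new data frame and does not change the original unless you assign the result back. Here is a small sketch with made-up values (the data is an assumption for illustration) that shows this behavior.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017"],
    "city": ["new york", "mumbai"],
    "temperature": [32, 90],
    "event": ["Rain", "Sunny"],
})

# rename returns a new data frame with the new column names.
renamed = df.rename(columns={"temperature": "temp", "event": "eventtype"})

print(list(df.columns))       # the original keeps its column names
print(list(renamed.columns))  # the copy carries the new names

# To keep the change, assign the result back to a variable.
df = renamed
```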

Filtering a data frame

Suppose you would like to see which cities had sunny weather, along with the dates.

df.loc[df.event == 'Sunny', ['day', 'city']]

Output

Filtered data
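Filters can also be combined. The sketch below uses made-up values (an assumption, not the real weather data) to show a single condition and then two conditions joined with &. Note that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/1/2017", "1/2/2017"],
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
})

# One condition: rows where the event is Sunny, keeping only day and city.
sunny = df.loc[df["event"] == "Sunny", ["day", "city"]]

# Two conditions: sunny AND hotter than 80 (the threshold is arbitrary).
hot_and_sunny = df[(df["event"] == "Sunny") & (df["temperature"] > 80)]

print(sunny)
print(hot_and_sunny)
```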

So far we have seen some basics. Let us dive deeper; the real pandas starts here. As I said, pandas will not let you down even when you need complex queries.

Grouping

Suppose you want to work on a particular group of data. In this case, let us get only the rows that belong to new york. With a group object, you can also get a summary (sum, mean, median) of all groups at once.

Group by City

city_group = df.groupby('city')

This creates a group object. To see the data for a specific group, call get_group.

city_group.get_group('new york')

Output

new york group
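To make the group object a little less abstract, here is a minimal sketch on made-up values (the data is an assumption for illustration). It loops over the groups, pulls out one group, and computes a per-city mean of the numeric columns.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/1/2017", "1/2/2017"],
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
})

city_group = df.groupby("city")

# Each group is a (key, sub data frame) pair.
for city, group in city_group:
    print(city, len(group))

# Pull out the rows of a single group.
new_york = city_group.get_group("new york")

# Summarize all groups at once; select the numeric columns first,
# since a mean of text columns such as day or event is not meaningful.
mean_by_city = city_group[["temperature", "windspeed"]].mean()
print(mean_by_city)
```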

Aggregations

In the section above, we only grouped the data by city. But what if we would like to see the average temperature and average wind speed for each city? We will use aggregations here.

Group by and aggregate

df.groupby('city').agg({'temperature': 'mean', 'windspeed': 'mean'})

Output

Mean temperature and wind speed
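If you need more than one statistic per column, one option is named aggregation, where each output column is given as name=(column, function). This is a sketch on made-up values (the data and the output column names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
})

# Named aggregation: output_name=(input_column, aggregation_function).
summary = df.groupby("city").agg(
    mean_temperature=("temperature", "mean"),
    max_temperature=("temperature", "max"),
    mean_windspeed=("windspeed", "mean"),
)

print(summary)
```

Compared with the dictionary form, this keeps the result columns flat and lets you choose readable names up front.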

Merging

In the sections above, we worked with a single data frame. What if there are two data frames and you would like to analyze them together? In this scenario, merge plays a key role and simplifies joining two data frames.

Create two data frames

df1 = pd.DataFrame({
    "city": ["new york", "florida", "mumbai"],
    "temperature": [22, 37, 35],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "florida"],
    "humidity": [65, 68, 75],
})

Simple merge: this gives you only the rows whose city appears in both data frames (an inner join, which is the default).

pd.merge(df1, df2, on='city')

Output

Matching rows of two data frames

Outer: get all rows from both data frames by adding the how parameter.

pd.merge(df1, df2, on="city", how="outer")

Output

Outer join

Similarly, we can get all rows of the left data frame along with the matching rows (left join), or all rows of the right data frame along with the matching rows (right join), by setting the how parameter to left or right.
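The left and right joins described above can be sketched with the same two data frames. The last call also passes indicator=True, which adds a _merge column showing where each row came from; that parameter is not used elsewhere in this article and is included here only as an optional extra.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "florida", "mumbai"],
    "temperature": [22, 37, 35],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "florida"],
    "humidity": [65, 68, 75],
})

# Left join: every row of df1; humidity is NaN where df2 has no match.
left = pd.merge(df1, df2, on="city", how="left")

# Right join: every row of df2; temperature is NaN where df1 has no match.
right = pd.merge(df1, df2, on="city", how="right")

# Outer join with an extra _merge column: left_only, right_only or both.
outer = pd.merge(df1, df2, on="city", how="outer", indicator=True)

print(left)
print(right)
print(outer)
```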

Crosstab

Suppose you want to see the frequency count of each event type (rainy/sunny) in each city. Crosstab makes this easy.

pd.crosstab(df.city, df.event)

Output

Frequency count of the event by city

Note: crosstab can also compute other aggregations such as mean or median. You just need to pass the values and aggfunc parameters to the function.
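Here is a minimal sketch of the note above on made-up values (the data is an assumption for illustration). The first call counts events per city, and the second uses values and aggfunc to show the mean temperature for each city and event combination instead.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
})

# Frequency count of each event per city (0 where a pair never occurs).
counts = pd.crosstab(df["city"], df["event"])

# Mean temperature per city and event; pairs that never occur become NaN.
mean_temp = pd.crosstab(
    df["city"], df["event"], values=df["temperature"], aggfunc="mean"
)

print(counts)
print(mean_temp)
```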

Reshape with melt

Suppose you want to turn columns into rows along with their values. For example, for each city I would like temperature and wind speed to share a single value column. In this case, the names temperature and windspeed go into one column and their values go into another.

pd.melt(df, id_vars=['day', 'city', 'event'], var_name='attribute')

Output

Reshaped data
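To see what melt does to the shape of the data, here is a small sketch on made-up values (the data is an assumption for illustration). Every column that is not listed in id_vars is stacked into a pair of columns: one holding the old column name and one holding the value.

```python
import pandas as pd

# Hypothetical sample values standing in for weather.csv.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/1/2017", "1/2/2017"],
    "city": ["new york", "new york", "mumbai", "mumbai"],
    "temperature": [32, 36, 90, 85],
    "windspeed": [6, 7, 5, 12],
    "event": ["Rain", "Sunny", "Sunny", "Fog"],
})

# temperature and windspeed are not in id_vars, so they get stacked.
# value_name is set explicitly here; "value" is also the default name.
melted = pd.melt(
    df,
    id_vars=["day", "city", "event"],
    var_name="attribute",
    value_name="value",
)

print(df.shape)      # wide form: one row per day and city
print(melted.shape)  # long form: one row per day, city and attribute
print(melted)
```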

References

code basics, https://www.youtube.com/channel/UCh9nVJoWXmFb7sLApWGcLPQ

Hope you enjoyed it! You can also check out my article on pandas tricks, Pandas tricks for Data Scientists, which you may find even more interesting. Stay tuned! Please leave a comment with any questions or suggestions.
