A Gentle Introduction to Dataframes — Part 2of 3

Becoming a Data Alchemist — Dataframes

…Learning My First Trick

Russell Pihlstrom

--

Transforming Data To Gold

Introduction

In my previous post I introduced you to some of the basics when viewing, cleaning and transforming your data using Dataframes (see post). In this post I go a step further by showing you how to summarize your data using .groupby(). Like my previous posts, this post is for the beginner, perhaps an old Excel pro looking to make the jump from Excel to Python and needing a gentle introduction to Dataframes. Making the transition from Excel to Python or incorporating Python into your analytic repertoire can be daunting. Mastering Python Dataframes is the right first step in this journey. In addition to showing you how to summarize your data using Dataframes, I will also show you a few ways to optimize the viewing of your Dataframes within your jupyter notebook. During my first few weeks of using jupyter notebooks, I became frustrated by the default display behavior of jupyter. Below are the specifics. Nevertheless, this frustration led me to researched how I could modify the default behavior to better suit my needs. I share my findings with you along with demonstrating how to summarize your data. Without further ado….Data Alchemists… “Start Your Engines!”

Step1. Loading and Displaying Your Data?

# Code snippets shown in grey for easy cutting and pasting
import pandas as pd
# Loading dataframe
df_MovieStarsSummary = pd.read_excel(“./df_MovieStarsSummary.xlsx”)

Default behavior/ output jupyter provides when displaying a larger sets of data

Step 2. Changing jupyter display behavior

I Need To See My Data!!!!

Before diving into summarizing our data, notice the highlights from the above example. As a new to jupyter user and a fan of seeing all the data before beginning analysis this view is frustrating. Normally before I start my analysis, I like to scroll through the data quickly to get an understanding of the details. With the given output, seeing the rows beyond those shown is not possible. Furthermore, if I had several columns, juypter would not allow me to see all the columns and instead would only display a subset of columns. How do I see the entire dataset? In Excel when you open your dataset, in most cases, you can see all the data. How can I see all my data?

Thankfully with the following 5 lines of code you can change the defula display behavior in your jupyter notebook

pd.set_option(‘display.max_rows’, None)
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.expand_frame_repr’, False)
pd.set_option(‘max_colwidth’, None)

Now that’s better! Notice the scroll bar (highlighted). Also a vertical scroll bar would be available if my data had more columns. Also notice the “link” column is now wrapped vs. shortened as shown in the initial display.

Step 3. Summarizing and formatting your data with. groupby()

Now that I can see and scroll through the entire dataset, lets summarize our data. Python provides multiple easy to follow techniques for summarizing data; however, under more complex scenarios, formatting this output can be challenging to the beginner. Below I will show you one of my favorite ways to aggregate data using .groupby(). and then apply a format to that output.

Lets start simple and build to a more complex scenarios where formatting with .groupyby() gets more challenging.

1. Simple Scenario — One aggregation
df.groupby().sum() and styling with .style.format().
df.groupby().sum() and styling with .style.format().
df_MovieStarsSummary_groupby = df_MovieStarsSummary[[“Function”, “Domestic Box Office”]].groupby([“Function”]).sum()
df_MovieStarsSummary_groupby.style.format({‘Domestic Box Office’: ‘${:,.0f}’})

Fairly straight forward, just sum “Domestic Box Office” by function, then apply “$” and “,” to improve comprehension.

2. More Complex Scenario — Multiple Aggregation
What if you are seeking to perform multiple aggregate functions? For example sum and mean to the column “Domestic Box Office”. In order to do this you append the .agg() function to your .groupby() vs. appending the .sum() as shown in the earlier example. Within the “( )” of your .agg() function you then indicate the column(s) you are looking to aggregate in dictionary form. Python provides several aggregation function that you can use with .agg() (see list here).

df_MovieStarsSummary_groupby = df_MovieStarsSummary.groupby([“Function”]).agg({‘Domestic Box Office’:[‘sum’, ‘mean’]})
df_MovieStarsSummary_groupby.style.format({‘Domestic Box Office’: ‘${:,.0f}’})

Uh-Oh… what happened? My formatting is no longer working!

Problem: 2 things changed

1. Notice the gap “layer” between our first row and the column names (highlighted).

2. Notice the names of the columns changes (highlighted)

When using the .agg() technique, combined with multiple aggregate functions, python adds this layer and inserts new column names to aid in comprehending the output. Unfortunately, this additional layer also adds an additional layer between our .style.format() function resulting in our function no longer “seeing” our column “Box Office”.

NOW WHAT!?

Solution:

  • 1. Use df.to_flat_index() +
  • 2 .Then df.reset_index() +
  • 3. Lastly, update the names of the referenced columns in .style.format()

df_MovieStarsSummary_groupby = df_MovieStarsSummary.groupby([“Function”]).agg({‘Domestic Box Office’:[‘sum’, ‘mean’]})
df_MovieStarsSummary_groupby.columns = [f”{x}_{y}” for x, y in df_MovieStarsSummary_groupby.columns.to_flat_index()]
df_MovieStarsSummary_groupby = df_MovieStarsSummary_groupby.reset_index()
df_MovieStarsSummary_groupby.style.format({‘Domestic Box Office_sum’: ‘${:,.0f}’, ‘Domestic Box Office_mean’: ‘${:,.0f}’})

Tada… Output back to being presentation ready! Whats going on here?

1. .flat_index() — Removes the layer between “Box Office” and the aggregate functions “sum” and “mean” and creates a new name in the form of a tuple. Essentially combing the original name of the column with the aggregate function being performed. We then loop through these list of tuples (two in our case), join the tuple together and you those combined tuples to rename our columns.

2. .reset_index() — Eliminates the layer between the column name and the first row — typical structure created with a dataframe. Technically this step is not required to format our columns assuming we still do step 3. However, I like to have a “flat” header row

3. .style.format() — Changed the names being used by the function to the new names discussed in step 1.

Conclusion

Above I covered how to overcome some of the difficulties with using jupyters default display behavior and how to use dataframes to view, aggregate and create presentation ready formatting to your data. My aim was to give the novice a few palatable solutions to obstacles that I have incurred while learning to use Python. In future blogs I will get more advanced with my material, including my next post which is related to this post in addressing formatting issues while leveraging one of the most useful descriptive functions in Python “df.describe()”. Checkout that post here. I look forward to seeing you in our next adventure in becoming Data Alchemists!

Next Stop — Data Alchemy

--

--