Formatting Frustrations — Dataframes

Getting Presentation Ready Formats with Aggregate Functions

Introduction

In a previous post I showed how to use .groupby() with .agg()to summarize large amounts of data (see here). Furthermore, I provided a solution to formatting the output of .agg() functions which can be tricky for the Python beginner given the additional “layer” aggregate functions create when displaying the output. In this post I will provide a solution to a related problem, formatting the output of an aggregate function when you are looking to add different formats to outputs that resides in the same column or row. For the old excel pro this seems like nothing to write about; however, given Python lacks the point and click functionality offered in Excel, accessing individual elements within a dataframe can be challenging. I will demonstrate this dilemma by introducing you to the function that was giving me the formatting fits: df.describe(). Before I start I will remind you of the target audience for this blog. This blog along with my previous posts are targeted toward the beginner. Furthermore, the solution proposed below, to me seems like genius, but may seem like painting by the numbers to more advance data alchemists. As a side not during the process of developing my work around I felt like doing what the guy above is doing 😊. Now lets get started

What is a Df.describe()?

df.describe() is one of my favorite aggregate functions within Python. With one word and some parenthesis “()” python generates 8 key statistics providing information on the distribution of your data. Below is an example, using a small set of movie data (# of movies, box office gross lifetime sales for the Top 10 movie actors).

Step 1. Loading And df.describing() Your Data
Code snippets shown in grey for easy cutting and pasting
import pandas as pd

# Create data for analysis
d ={‘Function’:[‘Actor’,’Actor’,’Actor’,’Director’,’Director’,’Director’,’Producer’,’Producer’,’Producer’,’Writer’,’Writer’,’Writer’],
‘Name’:[‘Will Smith’,’Samuel L. Jackson’,’Robert Downey, Jr.’,’Joe Russo’,’Steven Spielberg’,’Jennifer Lee’,’Kevin Feige’,’Kathleen Kennedy’,’Jerry Bruckheimer’,’Jennifer Lee’,’Stephen McFeely’,’Christopher Markus’],
'Box Office’:[7052831352,5655231264,5474166329,4561643598,4542164603,2634334761,8545426433,5999104450,4306736018,3504008616,3175631122,3175631122],
‘MovieCount’:[68,64,43,16,36,6,30,35,36,15,12,12]}

# Load your data into a DataFrame
df_Movie_Data = pd.DataFrame(data=d)
# Aggregate your data using df.describe()
df_Movie_Data.describe()

Wow look at that output! We were able to generate 8 descriptive statistics (count, mean, std, min, max top 25%, 50%, 75%), on our data…across both columns! Very easy and pretty awesome! However, there is a problem. The formatting for our “Box Office” column is in scientific notation and it should be in $’s and our “Movie Counts” column which is just the number of movies added together have several trailing 0’s. I really can’t understand the results myself nor could I present these numbers to folks in the business without appearing lazy, and we don’t want that! These numbers definitely are not what I would call “presentation ready”. No problem Python has a solution for formatting, as we saw in my previous post, we can use df.style.format().

Step 2. Formatting Output of df.Describe() using .style.format()

The .format() function take in as arguments the name of the column(s) and the format you want. In my case {‘Box Office’:’${0:,.0f}’, ‘MovieCount’: ‘{0:,.0f}’}
# Use .style.format() to format and create a dictionary for each column defining the format
format_dict = {‘Box Office’:’${0:,.0f}’, ‘MovieCount’: ‘{0:,.0f}’}

df_Movie_Data.describe().style.format(format_dict)

Great! My formats are working! But wait… “count” in column “Box Office” is formatted as a $ (see highlight), and that makes no sense. How do I format “count” to be a number and not currency? No problem, I will simply use the index of the dataframe and reset that row to format without the “$” sign…Right?

Wrong! I get an error. Once you apply a format to your entire dataframe using .style.format(), the return object is of type Styler, not dataframe and Styler has no way to target its elements by row or column. You are not able to use lambda on a Styler either…. ARG….. So what should you do? Cut and paste the dataframe into Excel, reformate the cell, and drop into your presentation? Sure, that is a solution but becoming a Python users requires fortitude so we pressed on!

Proposed Solution/ Workaround

After several frustrating hours of searching and reaching out to my Python network at Flatiron bootcamp I found the following work around
# Use Apply Map for formatting with Criteria by element instead of applying a format to entire column
df_Movie_Data_Desc.applymap(lambda x: ‘{:.0f}’.format(x) if int(x) <100 else ‘${:,.0f}’.format(x))

Explanation: Essentially what is happening is I am using lambda and .applymap() NOT .apply() (see differences here) to access and format each individual cell/ element and looking at its value to determine if I should apply a format. Given I know the data and that smaller numbers in my data set are actually just counts vs currency I apply my formats to the entire dataframe element by element using the value to drive the format I apply.

Final Note

Certainly, there are limitations to my work around. Thankfully in my case the value of the data was an indicator of the formatting I wanted to apply at the element level. Something to note here using the .applymap() vs. apply was necessary as .apply() tries to access the .format() function using the entire Series and Series does not have a function called .format(). However, applymap feeds the lambda function element wise — meaning cell by cell in Excel terms. By feeding the lambda function elementwise vs. Series wise you are able to access the .format() method which is available on an int object level, essentially the value of the specific cell.

Hopefully this example helps others looking to apply different formats to the same column/ row within a dataframe. As I learned during this journey having two or more formats in one column or row is not as easy as just using the df.stlye().format() technique. In hindsight this task was especially frustrating knowing how easily this could be done using Excel. However, becoming a data alchemist requires mastery of Python and mastery requires commitment. Overcoming small mountains today, fuels larger victories tomorrow! Hopefully you enjoyed this writeup and it is helpful to you? See you in the next journey 😊

Next Stop — DATA ALCHEMY!

Innovation Leader and Insight Enthusiast !