My favorite pandas pattern
I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line, and this is the method I use almost daily to achieve that result. Typically the data I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily or weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.
import pandas as pd
import numpy as np
%matplotlib inline
Let's fake some data
Here I am trying to simulate a subset of a large transactional dataset. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data, we know that it is clean; I am still going to write the pattern as if it contains some nulls and an irregular date range.
n = 365 * 5  # five years of daily rows

cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty'}

data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='d'),
                     )
        .stack()        # wide -> long: one row per (date, item) pair
        .to_frame()
        .reset_index()  # date and item come back as columns
        .rename(columns=cols))
data.head()
| | date | item | qty |
|---|---|---|---|
| 0 | 2017-01-01 | paper | 1 |
| 1 | 2017-01-01 | pencils | 4 |
| 2 | 2017-01-01 | note cards | 5 |
| 3 | 2017-01-01 | markers | 9 |
| 4 | 2017-01-02 | paper | 3 |
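The generated frame is actually perfectly regular, so here is an optional sketch that roughs it up to match the nulls and irregular dates I just promised. The seed and sizes are arbitrary choices of mine, not part of the original pattern.

rng = np.random.default_rng(0)                     # arbitrary seed
data['qty'] = data['qty'].astype(float)            # make room for NaNs
data = data.drop(rng.choice(data.index, size=50, replace=False))  # punch gaps in the dates
data.loc[rng.choice(data.index, size=25, replace=False), 'qty'] = np.nan  # sprinkle in nulls

Skip this cell if you want your tables to match the ones below; the pattern handles both versions.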
The pattern
Here I am going to group by date and item, which takes care of duplicate entries with the same timestamp. Then I select the value I want to sum on, unstack the item index level into columns, and resample the data by month (any of the available resampling rules would work here). I fill any missing months with 0, since there wasn't a transaction during that month, then apply a rolling window to get the annual sum. I find that annual numbers ground the chart in values my stakeholders are used to seeing on a regular basis and reduce the need for them to recalculate in their heads. Finally, I drop the nulls that the rolling window creates in the first 11 rows.
plot_data = (data
             .groupby(['date', 'item'])  # collapse duplicate timestamps
             .sum()
             ['qty']         # single column -> flat columns after unstack
             .unstack()      # items become columns
             .resample('m')  # regularize to month-end bins
             .sum()
             .fillna(0)      # months with no transactions count as 0
             .rolling(12)    # trailing 12-month window
             .sum()
             .dropna()       # drop the 11 partial-window rows
             )
plot_data.head()
| date | markers | note cards | paper | pencils |
|---|---|---|---|---|
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
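pandas hands back a matplotlib Axes from .plot, so the usual polish applies if you want it. An optional sketch; the figure size and labels are my own choices:

ax = plot_data.plot(figsize=(10, 5),
                    title='Rolling annual sum of Categorical Random Data')
ax.set_ylabel('qty (trailing 12-month sum)')  # assumed label, adjust to taste
ax.legend(title='item')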
For the Visual Learners
Groupby
plot_data = (data
.groupby(['date', 'item'])
.sum()
)
plot_data.head()
| date | item | qty |
|---|---|---|
| 2017-01-01 | markers | 9 |
| | note cards | 5 |
| | paper | 1 |
| | pencils | 4 |
| 2017-01-02 | markers | 4 |
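The earlier claim that the groupby takes care of duplicate timestamps is easy to check with a throwaway frame. A quick sketch, not part of the chain:

# two rows for the same (date, item) pair
dupes = pd.DataFrame({'date': ['2017-01-01', '2017-01-01'],
                      'item': ['paper', 'paper'],
                      'qty': [2, 3]})
dupes.groupby(['date', 'item']).sum()  # one row: paper on 2017-01-01, qty == 5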
Select Values
In this case I select the single qty column to avoid working with the multiple column levels that unstack() would otherwise create in the next step.
plot_data = plot_data['qty']
plot_data.head()
date item
2017-01-01 markers 9
note cards 5
paper 1
pencils 4
2017-01-02 markers 4
Name: qty, dtype: int32
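For contrast, here is a sketch of what happens without the selection: unstacking the full frame leaves a MultiIndex on the columns, which droplevel can flatten after the fact.

wide = data.groupby(['date', 'item']).sum().unstack()
list(wide.columns[:2])            # [('qty', 'markers'), ('qty', 'note cards')] -- two levels
wide = wide.droplevel(0, axis=1)  # equivalent to selecting 'qty' up front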
unstack
Transform the last level of the index ('item') into columns.
plot_data = plot_data.unstack()
plot_data.head()
| date | markers | note cards | paper | pencils |
|---|---|---|---|---|
| 2017-01-01 | 9 | 5 | 1 | 4 |
| 2017-01-02 | 4 | 2 | 3 | 7 |
| 2017-01-03 | 9 | 5 | 2 | 3 |
| 2017-01-04 | 2 | 0 | 0 | 5 |
| 2017-01-05 | 0 | 1 | 6 | 2 |
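As an aside, pivot_table can collapse the groupby, select, and unstack steps into a single call. A sketch of the equivalent route:

# same wide daily frame in one step
wide = data.pivot_table(index='date', columns='item',
                        values='qty', aggfunc='sum')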
resample
This step is important for irregular data: it forces the data into regular intervals. Here 'm' resamples to month-end bins, but any of the available rules would work.
plot_data = plot_data.resample('m').sum()
plot_data.head()
| date | markers | note cards | paper | pencils |
|---|---|---|---|---|
| 2017-01-31 | 145 | 128 | 117 | 146 |
| 2017-02-28 | 136 | 140 | 133 | 135 |
| 2017-03-31 | 112 | 145 | 125 | 163 |
| 2017-04-30 | 143 | 148 | 112 | 147 |
| 2017-05-31 | 86 | 134 | 139 | 141 |
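The rule is the only moving part. For example (a sketch), the already-monthly frame rolls up to quarters the same way:

quarterly = plot_data.resample('q').sum()  # three month-end bins per quarter-end bin
quarterly.head()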
rolling
I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that play out over more than a year; that is just due to the industry I am in. Play with the resample and rolling rules to fit the needs of your own data.
plot_data = plot_data.rolling(12).sum()
plot_data.head(20)
| date | markers | note cards | paper | pencils |
|---|---|---|---|---|
| 2017-01-31 | NaN | NaN | NaN | NaN |
| 2017-02-28 | NaN | NaN | NaN | NaN |
| 2017-03-31 | NaN | NaN | NaN | NaN |
| 2017-04-30 | NaN | NaN | NaN | NaN |
| 2017-05-31 | NaN | NaN | NaN | NaN |
| 2017-06-30 | NaN | NaN | NaN | NaN |
| 2017-07-31 | NaN | NaN | NaN | NaN |
| 2017-08-31 | NaN | NaN | NaN | NaN |
| 2017-09-30 | NaN | NaN | NaN | NaN |
| 2017-10-31 | NaN | NaN | NaN | NaN |
| 2017-11-30 | NaN | NaN | NaN | NaN |
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
| 2018-05-31 | 1624.0 | 1674.0 | 1632.0 | 1592.0 |
| 2018-06-30 | 1582.0 | 1645.0 | 1657.0 | 1593.0 |
| 2018-07-31 | 1662.0 | 1654.0 | 1680.0 | 1613.0 |
| 2018-08-31 | 1654.0 | 1617.0 | 1650.0 | 1616.0 |
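If the leading NaN rows bother you, min_periods is an alternative to dropping them. A sketch, where monthly stands in for the resampled frame from the previous step, before rolling:

# 'monthly' is a stand-in for the resampled (pre-rolling) frame
partial = monthly.rolling(12, min_periods=1).sum()
# The first 11 values then cover less than a full year, which can read
# as a false ramp-up -- I prefer to drop them instead (next step).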
dropna
Get rid of the 11 leading null rows created by the rolling window.
plot_data = plot_data.dropna()
plot_data.head(10)
| date | markers | note cards | paper | pencils |
|---|---|---|---|---|
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
| 2018-05-31 | 1624.0 | 1674.0 | 1632.0 | 1592.0 |
| 2018-06-30 | 1582.0 | 1645.0 | 1657.0 | 1593.0 |
| 2018-07-31 | 1662.0 | 1654.0 | 1680.0 | 1613.0 |
| 2018-08-31 | 1654.0 | 1617.0 | 1650.0 | 1616.0 |
| 2018-09-30 | 1669.0 | 1648.0 | 1638.0 | 1634.0 |
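To wrap up, the whole chain fits in a small helper. This is my own sketch using the column names from this post, not part of the original pattern; swap the defaults to match your data.

def rolling_annual(df, date_col='date', cat_col='item', val_col='qty',
                   rule='m', window=12):
    # groupby collapses duplicates, unstack widens, resample regularizes,
    # rolling smooths into trailing-annual sums
    return (df
            .groupby([date_col, cat_col])
            .sum()
            [val_col]
            .unstack()
            .resample(rule)
            .sum()
            .fillna(0)
            .rolling(window)
            .sum()
            .dropna())

rolling_annual(data).plot(title='Rolling annual sum of Categorical Random Data');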