My favorite pandas pattern

I work with a lot of transactional time series data that includes categories, and I often want to create time series plots with each category as its own line. This is the method I use almost daily to achieve that result. The data I work with typically changes very slowly, and trends play out over years rather than days or weeks, so plotting daily or weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.


import pandas as pd
import numpy as np
%matplotlib inline

Let's fake some data

Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data we know it is clean, but I am still going to treat it as though it contains some nulls and an irregular date range.


n = 365*5
# after stack().reset_index() the two index levels come back named
# 'level_0'/'level_1' and the value column comes back as 0,
# so rename them to something meaningful
cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty', }
data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='d'),
                     )
        .stack()        # wide -> long: one row per (date, item) pair
        .to_frame()
        .reset_index()
        .rename(columns=cols))
data.head()
        date        item  qty
0 2017-01-01       paper    1
1 2017-01-01     pencils    4
2 2017-01-01  note cards    5
3 2017-01-01     markers    9
4 2017-01-02       paper    3
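
To make this a bit closer to a real extract, here is a small sketch (my own addition, not part of the original pattern) that drops a random slice of rows and nulls out a few quantities. The rest of the walkthrough keeps using the clean frame, but the pattern below copes with this messier version just as well.

messy = data.sample(frac=0.9, random_state=0).sort_values('date')  # irregular dates
messy.loc[messy.sample(frac=0.01, random_state=1).index, 'qty'] = np.nan  # sprinkle some nulls
messy.head()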

The pattern

Here I am going to group by date and item, which takes care of duplicate entries with the same timestamp. Then I select the value I want to sum on, unstack the item index level into columns, and resample the data by month (I could just as easily use any of the available resample rules; a quarterly variant follows the plot below). I fill any missing months with 0, since there wasn't a transaction during that month, and apply a 12-period rolling window to get an annual sum. I find that annual numbers ground the chart in values my stakeholders are used to seeing on a regular basis and reduce the need for them to recalculate in their heads. Finally, I drop the nulls that the rolling window creates in the first 11 rows.


plot_data = (data
             .groupby(['date', 'item'])   # combine duplicate timestamps
             .sum()
             ['qty']                      # keep a single column (a Series)
             .unstack()                   # items become columns
             .resample('m')               # regular monthly buckets
             .sum()
             .fillna(0)                   # months with no transactions
             .rolling(12)                 # trailing 12-month window
             .sum()
             .dropna()                    # drop the 11 partial windows
             )
plot_data.head()
item        markers  note cards   paper  pencils
date
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0

plot_data.plot(title='Rolling annual sum of Categorical Random Data');
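
The resample and rolling rules are the main knobs here. As a sketch of swapping them (my variation, not what the chart above uses), the same chain with quarterly buckets and a 4-period window still produces a rolling annual sum, just with fewer, chunkier points.

quarterly = (data
             .groupby(['date', 'item'])
             .sum()
             ['qty']
             .unstack()
             .resample('q')    # quarterly buckets instead of monthly
             .sum()
             .fillna(0)
             .rolling(4)       # 4 quarters = one year
             .sum()
             .dropna()
             )
quarterly.plot(title='Rolling annual sum, quarterly buckets');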

For the Visual Learners

Groupby


plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             )
plot_data.head()
                       qty
date       item
2017-01-01 markers       9
           note cards    5
           paper         1
           pencils       4
2017-01-02 markers       4

Select Values

In this case I chose to select the single value column to avoid working with multiple levels in the columns that would otherwise be created in the unstack() step.


plot_data = plot_data['qty']

plot_data.head()
date        item
2017-01-01  markers       9
            note cards    5
            paper         1
            pencils       4
2017-01-02  markers       4
Name: qty, dtype: int32
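
For comparison, here is what happens if the column select is skipped (a quick illustration of my own): unstacking the whole frame leaves two column levels, ('qty', 'markers'), ('qty', 'paper'), and so on, which is exactly the extra layer the select avoids.

(data
 .groupby(['date', 'item'])
 .sum()
 .unstack()        # no column select -> MultiIndex columns
 ).head()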

unstack

Transform the last level of the index ('item') into columns.


plot_data = plot_data.unstack()

plot_data.head()
item        markers  note cards  paper  pencils
date
2017-01-01        9           5      1        4
2017-01-02        4           2      3        7
2017-01-03        9           5      2        3
2017-01-04        2           0      0        5
2017-01-05        0           1      6        2
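
If it helps to see the two directions side by side, stack() is the inverse here (my note, not a step in the pattern): calling it on the unstacked frame puts 'item' back into the index and returns the same long Series we selected above.

plot_data.stack().head()    # back to a (date, item)-indexed Series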

resample

This step is important for irregular data: it gets the observations into regular, evenly spaced intervals.


plot_data = plot_data.resample('m').sum()

plot_data.head()
item        markers  note cards  paper  pencils
date
2017-01-31      145         128    117      146
2017-02-28      136         140    133      135
2017-03-31      112         145    125      163
2017-04-30      143         148    112      147
2017-05-31       86         134    139      141
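
A tiny sketch (my own illustration) of why this matters for irregular data: a series with gaps between observations still comes out as one row per calendar month, and months with no observations simply sum to 0.

gappy = pd.Series([5, 3, 7],
                  index=pd.to_datetime(['2017-01-03', '2017-01-20', '2017-03-02']))
gappy.resample('m').sum()    # 2017-01-31: 8, 2017-02-28: 0, 2017-03-31: 7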

rolling

I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that play out over more than a year; that is just due to the industry I am in. Play with the resample and rolling rules to fit the needs of your own data.


plot_data = plot_data.rolling(12).sum()

plot_data.head(20)
item        markers  note cards   paper  pencils
date
2017-01-31      NaN         NaN     NaN      NaN
2017-02-28      NaN         NaN     NaN      NaN
2017-03-31      NaN         NaN     NaN      NaN
2017-04-30      NaN         NaN     NaN      NaN
2017-05-31      NaN         NaN     NaN      NaN
2017-06-30      NaN         NaN     NaN      NaN
2017-07-31      NaN         NaN     NaN      NaN
2017-08-31      NaN         NaN     NaN      NaN
2017-09-30      NaN         NaN     NaN      NaN
2017-10-31      NaN         NaN     NaN      NaN
2017-11-30      NaN         NaN     NaN      NaN
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0
2018-05-31   1624.0      1674.0  1632.0   1592.0
2018-06-30   1582.0      1645.0  1657.0   1593.0
2018-07-31   1662.0      1654.0  1680.0   1613.0
2018-08-31   1654.0      1617.0  1650.0   1616.0

dropna

Get rid of the first 11 null rows created by the rolling window.


plot_data = plot_data.dropna()

plot_data.head(10)
item        markers  note cards   paper  pencils
date
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0
2018-05-31   1624.0      1674.0  1632.0   1592.0
2018-06-30   1582.0      1645.0  1657.0   1593.0
2018-07-31   1662.0      1654.0  1680.0   1613.0
2018-08-31   1654.0      1617.0  1650.0   1616.0
2018-09-30   1669.0      1648.0  1638.0   1634.0
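
An alternative I sometimes consider instead of dropping rows (my variation, not the original pattern): passing min_periods=1 to the rolling window keeps the first 11 months as partial sums rather than NaN, so no dropna is needed. The trade-off is that those early points are not true annual totals.

monthly = (data
           .groupby(['date', 'item'])
           .sum()
           ['qty']
           .unstack()
           .resample('m')
           .sum()
           .fillna(0))
monthly.rolling(12, min_periods=1).sum().head()    # partial sums instead of NaN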