# My favorite pandas pattern

Date: March 1, 2018

I work with a lot of transactional timeseries data that includes categories, and I often want to create timeseries plots with each category as its own line. This is the method that I use almost daily to achieve that result. Typically the data that I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily or weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.

```python
import pandas as pd
import numpy as np
%matplotlib inline
```

## Let's fake some data

Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data we know that it is clean, but I am still going to treat it as if it contains some nulls and an irregular date range.

```python
n = 365 * 5
cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty',
        }
data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='d'),
                     )
        .stack()
        .to_frame()
        .reset_index()
        .rename(columns=cols))
data.head()
```

+---+------------+------------+-----+
|   | date       | item       | qty |
+---+------------+------------+-----+
| 0 | 2017-01-01 | paper      | 1   |
| 1 | 2017-01-01 | pencils    | 4   |
| 2 | 2017-01-01 | note cards | 5   |
| 3 | 2017-01-01 | markers    | 9   |
| 4 | 2017-01-02 | paper      | 3   |
+---+------------+------------+-----+
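As generated, the frame is actually complete and regular. If you want the fake data to really exercise the null and irregular-date handling below, a minimal roughing-up sketch (my addition, not part of the original recipe; the outputs shown below come from the clean version) could look like this:

```python
# Hypothetical roughing-up step: drop a random 10% of rows so the
# dates become irregular, and null out ~1% of the qty values.
drop_idx = np.random.choice(data.index, size=len(data) // 10, replace=False)
data = data.drop(index=drop_idx)
null_idx = data.sample(frac=0.01, random_state=42).index
data.loc[null_idx, 'qty'] = np.nan
```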
## The pattern

Here is the whole pipeline. I am going to:

1. Group by date and item; this takes care of duplicate entries with the same timestamp.
2. Select the value I want to sum on.
3. unstack the item index level into columns.
4. Resample the data by month. I could just as easily use any of the [available rules](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
5. Fill any missing months with 0, since there wasn't a transaction during that month.
6. Apply a 12-month rolling window to get the annual sum. I find that this grounds the numbers in values that my stakeholders are used to seeing on a regular basis and reduces the need for them to recalculate in their heads.
7. Drop the nulls that the rolling window creates in the first 11 rows.

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             ['qty']
             .unstack()
             .resample('m')
             .sum()
             .fillna(0)
             .rolling(12)
             .sum()
             .dropna()
             )
plot_data.head()
```

+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
+------------+---------+------------+--------+---------+
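The same pipeline transfers directly to other granularities. As a sketch of step 4's "available rules" (my variation, not from the original post), here is the weekly equivalent, resampling to weekly bins and rolling over 52 of them:

```python
# Same pattern at weekly granularity: a rolling 52-week sum.
weekly = (data
          .groupby(['date', 'item'])
          .sum()
          ['qty']
          .unstack()
          .resample('w')   # weekly bins instead of monthly
          .sum()
          .fillna(0)
          .rolling(52)     # 52 weeks is roughly one year
          .sum()
          .dropna()
          )
```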
```python
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
```

## For the Visual Learners

### Groupby

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             )
plot_data.head()
```

+------------+------------+-----+
|            |            | qty |
| date       | item       |     |
+------------+------------+-----+
| 2017-01-01 | markers    | 9   |
|            | note cards | 5   |
|            | paper      | 1   |
|            | pencils    | 4   |
| 2017-01-02 | markers    | 4   |
+------------+------------+-----+
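The groupby is what makes duplicate timestamps safe. A toy illustration (mine, not from the post): two rows for the same item on the same date collapse into one summed row.

```python
# Two 'paper' entries with the same timestamp are summed into a
# single (date, item) row by the groupby.
dupes = pd.DataFrame({'date': ['2017-01-01', '2017-01-01'],
                      'item': ['paper', 'paper'],
                      'qty': [1, 2]})
dupes.groupby(['date', 'item']).sum()
#                        qty
# date       item
# 2017-01-01 paper         3
```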
### Select Values

In this case I chose to select the qty column here to avoid working with the multiple column levels that would otherwise be created in the unstack() step.

```python
plot_data = plot_data['qty']
plot_data.head()
```

```
date        item
2017-01-01  markers       9
            note cards    5
            paper         1
            pencils       4
2017-01-02  markers       4
Name: qty, dtype: int32
```

### unstack

Transform the last level of the index ('item') into columns.

```python
plot_data = plot_data.unstack()
plot_data.head()
```

+------------+---------+------------+-------+---------+
| item       | markers | note cards | paper | pencils |
| date       |         |            |       |         |
+------------+---------+------------+-------+---------+
| 2017-01-01 | 9       | 5          | 1     | 4       |
| 2017-01-02 | 4       | 2          | 3     | 7       |
| 2017-01-03 | 9       | 5          | 2     | 3       |
| 2017-01-04 | 2       | 0          | 0     | 5       |
| 2017-01-05 | 0       | 1          | 6     | 2       |
+------------+---------+------------+-------+---------+
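To see what the Select Values step avoids, here is a quick aside (mine) showing the two-level columns you get when you unstack without selecting first, and one way to flatten them after the fact:

```python
# Without selecting 'qty' first, unstack() produces MultiIndex
# columns like ('qty', 'markers'), ('qty', 'note cards'), ...
wide = data.groupby(['date', 'item']).sum().unstack()

# One way to flatten them back to plain item names:
wide.columns = wide.columns.droplevel(0)
```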
### resample

This step is important for irregular data: it gets the data into regular intervals.

```python
plot_data = plot_data.resample('m').sum()
plot_data.head()
```

+------------+---------+------------+-------+---------+
| item       | markers | note cards | paper | pencils |
| date       |         |            |       |         |
+------------+---------+------------+-------+---------+
| 2017-01-31 | 145     | 128        | 117   | 146     |
| 2017-02-28 | 136     | 140        | 133   | 135     |
| 2017-03-31 | 112     | 145        | 125   | 163     |
| 2017-04-30 | 143     | 148        | 112   | 147     |
| 2017-05-31 | 86      | 134        | 139   | 141     |
+------------+---------+------------+-------+---------+
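A tiny demonstration (my aside) of what resample does with an irregular index:

```python
# Irregular dates: nothing at all happens in February.
s = pd.Series([5, 3, 2],
              index=pd.to_datetime(['2017-01-03', '2017-01-20', '2017-03-15']))
s.resample('m').sum()
# 2017-01-31    8
# 2017-02-28    0    <- empty bins show up (0, or NaN on older pandas,
# 2017-03-31    2       which is why the pattern follows with fillna(0))
```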
### rolling

I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that play out over more than a year; that is just due to the industry that I am in. Play with the resample and rolling rules to fit the needs of your own data.

```python
plot_data = plot_data.rolling(12).sum()
plot_data.head(20)
```

+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-01-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-02-28 | NaN     | NaN        | NaN    | NaN     |
| 2017-03-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-04-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-05-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-06-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-07-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-08-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-09-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-10-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-11-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
| 2018-05-31 | 1624.0  | 1674.0     | 1632.0 | 1592.0  |
| 2018-06-30 | 1582.0  | 1645.0     | 1657.0 | 1593.0  |
| 2018-07-31 | 1662.0  | 1654.0     | 1680.0 | 1613.0  |
| 2018-08-31 | 1654.0  | 1617.0     | 1650.0 | 1616.0  |
+------------+---------+------------+--------+---------+
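If the leading NaN rows bother you, rolling also accepts a min_periods argument as an alternative to the dropna that follows. A sketch (my variation; `monthly` is a hypothetical name standing in for the output of the resample step):

```python
# 'monthly' stands in for the result of .resample('m').sum() above.
# min_periods=1 emits partial sums for the first 11 months instead of
# NaN. Those early values are not true annual sums, so label them
# carefully if you show them to stakeholders.
partial = monthly.rolling(12, min_periods=1).sum()
```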
### dropna

Get rid of the first 11 null rows left behind by the rolling window.

```python
plot_data = plot_data.dropna()
plot_data.head(10)
```
+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
| 2018-05-31 | 1624.0  | 1674.0     | 1632.0 | 1592.0  |
| 2018-06-30 | 1582.0  | 1645.0     | 1657.0 | 1593.0  |
| 2018-07-31 | 1662.0  | 1654.0     | 1680.0 | 1613.0  |
| 2018-08-31 | 1654.0  | 1617.0     | 1650.0 | 1616.0  |
| 2018-09-30 | 1669.0  | 1648.0     | 1638.0 | 1634.0  |
+------------+---------+------------+--------+---------+