My favorite pandas pattern
━━━━━━━━━━━━━━━━━━━━━━━━━━

Date: March 1, 2018

I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line. This is the method that I use almost daily to achieve that result. Typically the data that I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily or weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.

```
import pandas as pd
import numpy as np
%matplotlib inline
```

Let's fake some data
────────────────────

Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data, we know that it is clean. I am still going to assume that it contains some nulls and an irregular date range.

```
n = 365*5
cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty',
        }
data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='d'),
                     )
        .stack()
        .to_frame()
        .reset_index()
        .rename(columns=cols))
data.head()
```
+---+------------+------------+-----+
| | date | item | qty |
+---+------------+------------+-----+
| 0 | 2017-01-01 | paper | 1 |
| 1 | 2017-01-01 | pencils | 4 |
| 2 | 2017-01-01 | note cards | 5 |
| 3 | 2017-01-01 | markers | 9 |
| 4 | 2017-01-02 | paper | 3 |
+---+------------+------------+-----+
The pattern
───────────
Here I am going to group by date and item (this takes care of duplicate entries with the same timestamp), select the value I want to sum on, unstack the item index level into columns, and resample the data by month. I could easily use any of the available resampling rules. Finally, I take a 12-month rolling sum and drop the incomplete leading rows.
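Put end to end, the steps chain into a single expression. This is a sketch assembled from the individual steps walked through below; seeding the random generator is my addition, for reproducibility (the values in the tables here came from an unseeded run):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # my addition: makes the fake data reproducible
n = 365 * 5
data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='D'))
        .stack()
        .to_frame()
        .reset_index()
        .rename(columns={'level_0': 'date', 'level_1': 'item', 0: 'qty'}))

plot_data = (data
             .groupby(['date', 'item'])
             .sum()                 # collapse duplicate date/item rows
             ['qty']                # single column, so unstack stays flat
             .unstack()             # one column per item
             .resample('M').sum()   # month-end totals ('ME' in pandas >= 2.2)
             .rolling(12).sum()     # trailing 12-month sum
             .dropna())             # drop the 11 incomplete leading windows
```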
+------------+---------+------------+--------+---------+
| item | markers | note cards | paper | pencils |
| date | | | | |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
+------------+---------+------------+--------+---------+
```
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
```
For the Visual Learners
───────────────────────
### Groupby
```
plot_data = (data
.groupby(['date', 'item'])
.sum()
)
plot_data.head()
```
+------------+------------+-----+
|            |            | qty |
| date       | item       |     |
+------------+------------+-----+
| 2017-01-01 | markers    | 9   |
|            | note cards | 5   |
|            | paper      | 1   |
|            | pencils    | 4   |
| 2017-01-02 | markers    | 4   |
+------------+------------+-----+
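To see the deduplication at work, here is a minimal sketch with made-up rows where the same (date, item) pair appears twice:

```python
import pandas as pd

# Hypothetical rows with a duplicated (date, item) pair.
dup = pd.DataFrame({'date': pd.to_datetime(['2017-01-01', '2017-01-01',
                                            '2017-01-02']),
                    'item': ['paper', 'paper', 'paper'],
                    'qty': [1, 2, 5]})

collapsed = dup.groupby(['date', 'item']).sum()
# The two 2017-01-01/paper rows merge into a single row with qty 3.
```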
### Select Values
In this case I chose to select the single 'qty' column to avoid working with the multiple column levels that would otherwise be created in the unstack() step.
```
plot_data = plot_data['qty']
plot_data.head()
```
```
date item
2017-01-01 markers 9
note cards 5
paper 1
pencils 4
2017-01-02 markers 4
Name: qty, dtype: int32
```
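To see what the selection avoids, here is a small sketch (values are made up) comparing unstacking the whole frame with unstacking the selected Series:

```python
import pandas as pd

# A small grouped frame shaped like the one above (hypothetical values).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-01', '2017-01-02']), ['markers', 'paper']],
    names=['date', 'item'])
grouped = pd.DataFrame({'qty': [9, 1, 4, 3]}, index=idx)

wide = grouped.unstack()         # keeps 'qty' as an outer column level
flat = grouped['qty'].unstack()  # plain item columns, no extra level
```

The columns of `wide` are a MultiIndex like `('qty', 'markers')`, which makes every later selection and plot label clunkier; `flat` has plain item columns.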
### unstack
Transform the last level of the index ('item') into columns.
```
plot_data = plot_data.unstack()
plot_data.head()
```
+------------+---------+------------+-------+---------+
| item | markers | note cards | paper | pencils |
| date | | | | |
+------------+---------+------------+-------+---------+
| 2017-01-01 | 9 | 5 | 1 | 4 |
| 2017-01-02 | 4 | 2 | 3 | 7 |
| 2017-01-03 | 9 | 5 | 2 | 3 |
| 2017-01-04 | 2 | 0 | 0 | 5 |
| 2017-01-05 | 0 | 1 | 6 | 2 |
+------------+---------+------------+-------+---------+
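One behavior worth knowing (a sketch with made-up values, not data from the post): if a (date, item) pair never occurred, unstack fills that cell with NaN rather than dropping the row.

```python
import pandas as pd

# Three observations; there is no (2017-01-02, pencils) entry.
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2017-01-01'), 'paper'),
     (pd.Timestamp('2017-01-01'), 'pencils'),
     (pd.Timestamp('2017-01-02'), 'paper')],
    names=['date', 'item'])
qty = pd.Series([1, 4, 3], index=idx, name='qty')

wide = qty.unstack()
# The missing pair shows up as NaN in the widened frame.
missing = wide.loc[pd.Timestamp('2017-01-02'), 'pencils']
```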
### resample
This step is important for irregular data: it puts the observations onto regular intervals.
```
plot_data = plot_data.resample('m').sum()
plot_data.head()
```
+------------+---------+------------+-------+---------+
| item | markers | note cards | paper | pencils |
| date | | | | |
+------------+---------+------------+-------+---------+
| 2017-01-31 | 145 | 128 | 117 | 146 |
| 2017-02-28 | 136 | 140 | 133 | 135 |
| 2017-03-31 | 112 | 145 | 125 | 163 |
| 2017-04-30 | 143 | 148 | 112 | 147 |
| 2017-05-31 | 86 | 134 | 139 | 141 |
+------------+---------+------------+-------+---------+
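A minimal sketch (made-up data) of why the resample matters for irregular dates: a month with no observations at all still gets a row on the regular grid.

```python
import pandas as pd

# Irregular observations with nothing at all in February.
s = pd.Series([10, 20, 30],
              index=pd.to_datetime(['2017-01-05', '2017-01-20', '2017-03-03']))

monthly = s.resample('M').sum()  # 'ME' in pandas >= 2.2
# February now exists on the grid, with an empty-bin sum of 0.
```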
### rolling
I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that span more than a year; this is just due to the industry that I am in. Play with the resample and rolling rules to fit the needs of your own data.
```
plot_data = plot_data.rolling(12).sum()
plot_data.head(20)
```
+------------+---------+------------+--------+---------+
| item | markers | note cards | paper | pencils |
| date | | | | |
+------------+---------+------------+--------+---------+
| 2017-01-31 | NaN | NaN | NaN | NaN |
| 2017-02-28 | NaN | NaN | NaN | NaN |
| 2017-03-31 | NaN | NaN | NaN | NaN |
| 2017-04-30 | NaN | NaN | NaN | NaN |
| 2017-05-31 | NaN | NaN | NaN | NaN |
| 2017-06-30 | NaN | NaN | NaN | NaN |
| 2017-07-31 | NaN | NaN | NaN | NaN |
| 2017-08-31 | NaN | NaN | NaN | NaN |
| 2017-09-30 | NaN | NaN | NaN | NaN |
| 2017-10-31 | NaN | NaN | NaN | NaN |
| 2017-11-30 | NaN | NaN | NaN | NaN |
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
| 2018-05-31 | 1624.0 | 1674.0 | 1632.0 | 1592.0 |
| 2018-06-30 | 1582.0 | 1645.0 | 1657.0 | 1593.0 |
| 2018-07-31 | 1662.0 | 1654.0 | 1680.0 | 1613.0 |
| 2018-08-31 | 1654.0 | 1617.0 | 1650.0 | 1616.0 |
+------------+---------+------------+--------+---------+
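As one example of playing with the rules, the same idea works at a finer grain (the weekly choice here is my assumption, not the post's): weekly totals smoothed over a trailing 52-week window.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # my addition, for reproducibility
daily = pd.Series(np.random.randint(0, 10, 730),
                  index=pd.date_range('2017-01-01', periods=730, freq='D'))

# Weekly bins instead of monthly, 52-week window instead of 12 months.
weekly_annual = daily.resample('W').sum().rolling(52).sum()
# The first 51 weekly rows are NaN until a full window accumulates.
```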
### dropna
Get rid of the first 11 null rows created by the 12-month rolling window.
```
plot_data = plot_data.dropna()
plot_data.head(10)
```
+------------+---------+------------+--------+---------+
| item | markers | note cards | paper | pencils |
| date | | | | |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0 | 1739.0 | 1613.0 | 1657.0 |
| 2018-01-31 | 1572.0 | 1744.0 | 1635.0 | 1635.0 |
| 2018-02-28 | 1563.0 | 1717.0 | 1645.0 | 1645.0 |
| 2018-03-31 | 1596.0 | 1703.0 | 1629.0 | 1600.0 |
| 2018-04-30 | 1557.0 | 1693.0 | 1648.0 | 1581.0 |
| 2018-05-31 | 1624.0 | 1674.0 | 1632.0 | 1592.0 |
| 2018-06-30 | 1582.0 | 1645.0 | 1657.0 | 1593.0 |
| 2018-07-31 | 1662.0 | 1654.0 | 1680.0 | 1613.0 |
| 2018-08-31 | 1654.0 | 1617.0 | 1650.0 | 1616.0 |
| 2018-09-30 | 1669.0 | 1648.0 | 1638.0 | 1634.0 |
+------------+---------+------------+--------+---------+
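An alternative to dropna (a sketch of a different design choice, not what this post does): `min_periods` keeps the early, partial windows instead of producing NaN rows.

```python
import pandas as pd

s = pd.Series(range(1, 15))  # 14 made-up monthly values

full = s.rolling(12).sum()                    # first 11 entries are NaN
partial = s.rolling(12, min_periods=1).sum()  # partial sums from the start
```

The tradeoff is that those early values are not true annual sums, which can mislead stakeholders, so dropping them is usually the safer choice.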