--- title: "My favorite pandas pattern" description: "I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line. This..." date: 2018-03-01 published: false tags: - python template: blog-post --- # My favorite pandas pattern I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line. This is the method that I use almost data to achieve this result. Typically the data that am working with changes very slowly and trends happen over years not days or weeks. Plotting daily/weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders. ```python import pandas as pd import numpy as np % matplotlib inline ``` ## Lets Fake some data Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, hourly billing, anything that has a date, category, and value. Since we generated this data we know that it is clean. I am still going to assume that it contains some nulls, and an irregular date range. ```python n = 365*5 cols = {'level_0': 'date', 'level_1': 'item', 0: 'qty', } data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)), columns=['paper', 'pencils', 'note cards', 'markers'], index=pd.date_range('1/1/2017', periods=n, freq='d'), ) .stack() .to_frame() .reset_index() .rename(columns=cols)) data.head() ```

	date	item	qty
0	2017-01-01	paper	1
1	2017-01-01	pencils	4
2	2017-01-01	note cards	5
3	2017-01-01	markers	9
4	2017-01-02	paper	3

## The pattern Here I am going to take my groupby date and item, this will take care of duplicate entries with the same time stamp. Select the value I want to sum on. unstack the items index into columns. Resample the data by month. I could easily use any of the [available rules](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases). Fill any missing months with 0, since there wasnt a transaction during that month. Apply a rolling window to get the annual sum. I find that this helps to ground values in values that my stakeholders are used to seeing on a regular basis and reduces the need for them to recalculate in their head. Then I am going to drop the nulls created by the rolling window for the first 11 rows. ```python plot_data = (data .groupby(['date', 'item']) .sum() ['qty'] .unstack() .resample('m') .sum() .fillna(0) .rolling(12) .sum() .dropna() ) plot_data.head() ```

item	markers	note cards	paper	pencils
date
2017-12-31	1543.0	1739.0	1613.0	1657.0
2018-01-31	1572.0	1744.0	1635.0	1635.0
2018-02-28	1563.0	1717.0	1645.0	1645.0
2018-03-31	1596.0	1703.0	1629.0	1600.0
2018-04-30	1557.0	1693.0	1648.0	1581.0

```python plot_data.plot(title='Rolling annual sum of Categorical Random Data'); ``` ## For the Visual Learners ### Groupby ```python plot_data = (data .groupby(['date', 'item']) .sum() ) plot_data.head() ```

		qty
date	item
2017-01-01	markers	9
	note cards	5
	paper	1
	pencils	4
2017-01-02	markers	4

### Select Values In this case I chose to do this to avoid working with a multiple levels in the columns that would be created in the unstack() step. ```python plot_data = plot_data['qty'] plot_data.head() ``` date item 2017-01-01 markers 9 note cards 5 paper 1 pencils 4 2017-01-02 markers 4 Name: qty, dtype: int32 ### unstack transform the last column in the index ('item') into rows. ```python plot_data = plot_data.unstack() plot_data.head() ```

item	markers	note cards	paper	pencils
date
2017-01-01	9	5	1	4
2017-01-02	4	2	3	7
2017-01-03	9	5	2	3
2017-01-04	2	0	0	5
2017-01-05	0	1	6	2

### resample This step is important for irregular data in order to get the data into regular intervals. ```python plot_data = plot_data.resample('m').sum() plot_data.head() ```

item	markers	note cards	paper	pencils
date
2017-01-31	145	128	117	146
2017-02-28	136	140	133	135
2017-03-31	112	145	125	163
2017-04-30	143	148	112	147
2017-05-31	86	134	139	141

### rolling I like to use rolling because it get the data into annual numbers, and reduces noise. I have found that most of my datasets have patterns and trends that are greater than 1y. This is just due to the industry that I am in. Play with the resample and rolling rules to fit the need of your own data. ```python plot_data = plot_data.rolling(12).sum() plot_data.head(20) ```

item	markers	note cards	paper	pencils
date
2017-01-31	NaN	NaN	NaN	NaN
2017-02-28	NaN	NaN	NaN	NaN
2017-03-31	NaN	NaN	NaN	NaN
2017-04-30	NaN	NaN	NaN	NaN
2017-05-31	NaN	NaN	NaN	NaN
2017-06-30	NaN	NaN	NaN	NaN
2017-07-31	NaN	NaN	NaN	NaN
2017-08-31	NaN	NaN	NaN	NaN
2017-09-30	NaN	NaN	NaN	NaN
2017-10-31	NaN	NaN	NaN	NaN
2017-11-30	NaN	NaN	NaN	NaN
2017-12-31	1543.0	1739.0	1613.0	1657.0
2018-01-31	1572.0	1744.0	1635.0	1635.0
2018-02-28	1563.0	1717.0	1645.0	1645.0
2018-03-31	1596.0	1703.0	1629.0	1600.0
2018-04-30	1557.0	1693.0	1648.0	1581.0
2018-05-31	1624.0	1674.0	1632.0	1592.0
2018-06-30	1582.0	1645.0	1657.0	1593.0
2018-07-31	1662.0	1654.0	1680.0	1613.0
2018-08-31	1654.0	1617.0	1650.0	1616.0

### dropna get rid of the first 11 null rows ```python plot_data = plot_data.dropna() plot_data.head(10) ```

item	markers	note cards	paper	pencils
date
2017-12-31	1543.0	1739.0	1613.0	1657.0
2018-01-31	1572.0	1744.0	1635.0	1635.0
2018-02-28	1563.0	1717.0	1645.0	1645.0
2018-03-31	1596.0	1703.0	1629.0	1600.0
2018-04-30	1557.0	1693.0	1648.0	1581.0
2018-05-31	1624.0	1674.0	1632.0	1592.0
2018-06-30	1582.0	1645.0	1657.0	1593.0
2018-07-31	1662.0	1654.0	1680.0	1613.0
2018-08-31	1654.0	1617.0	1650.0	1616.0
2018-09-30	1669.0	1648.0	1638.0	1634.0