# My favorite pandas pattern

Date: March 1, 2018

I work with a lot of transactional timeseries data that includes categories, and I often want to create timeseries plots with each category as its own line. This is the method that I use almost daily to achieve that result. Typically the data that I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily or weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.

```python
import pandas as pd
import numpy as np
%matplotlib inline
```

## Let's fake some data

Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data we know that it is clean, but I am still going to treat it as if it contains some nulls and an irregular date range.

```python
n = 365 * 5
cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty',
        }
data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='d'),
                     )
        .stack()
        .to_frame()
        .reset_index()
        .rename(columns=cols))
data.head()
```

+---+------------+------------+-----+
|   | date       | item       | qty |
+---+------------+------------+-----+
| 0 | 2017-01-01 | paper      | 1   |
| 1 | 2017-01-01 | pencils    | 4   |
| 2 | 2017-01-01 | note cards | 5   |
| 3 | 2017-01-01 | markers    | 9   |
| 4 | 2017-01-02 | paper      | 3   |
+---+------------+------------+-----+
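As generated, the frame is actually complete and regular. If you want the fake data to really exercise the null and irregular-date handling below, a minimal roughing-up sketch (my addition, not part of the original recipe; the outputs shown below come from the clean version) could look like this:

```python
# Hypothetical roughing-up step: drop a random 10% of rows so the
# dates become irregular, and null out ~1% of the qty values.
drop_idx = np.random.choice(data.index, size=len(data) // 10, replace=False)
data = data.drop(index=drop_idx)
null_idx = data.sample(frac=0.01, random_state=42).index
data.loc[null_idx, 'qty'] = np.nan
```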
## The pattern

Here is the whole pipeline. I am going to:

1. Group by date and item; this takes care of duplicate entries with the same timestamp.
2. Select the value I want to sum on.
3. unstack the item index level into columns.
4. Resample the data by month. I could just as easily use any of the [available rules](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
5. Fill any missing months with 0, since there wasn't a transaction during that month.
6. Apply a 12-month rolling window to get the annual sum. I find that this grounds the numbers in values that my stakeholders are used to seeing on a regular basis and reduces the need for them to recalculate in their heads.
7. Drop the nulls that the rolling window creates in the first 11 rows.

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             ['qty']
             .unstack()
             .resample('m')
             .sum()
             .fillna(0)
             .rolling(12)
             .sum()
             .dropna()
             )
plot_data.head()
```

+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
+------------+---------+------------+--------+---------+
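The same pipeline transfers directly to other granularities. As a sketch of step 4's "available rules" (my variation, not from the original post), here is the weekly equivalent, resampling to weekly bins and rolling over 52 of them:

```python
# Same pattern at weekly granularity: a rolling 52-week sum.
weekly = (data
          .groupby(['date', 'item'])
          .sum()
          ['qty']
          .unstack()
          .resample('w')   # weekly bins instead of monthly
          .sum()
          .fillna(0)
          .rolling(52)     # 52 weeks is roughly one year
          .sum()
          .dropna()
          )
```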
```python
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
```

## For the Visual Learners

### Groupby

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             )
plot_data.head()
```

+------------+------------+-----+
|            |            | qty |
| date       | item       |     |
+------------+------------+-----+
| 2017-01-01 | markers    | 9   |
|            | note cards | 5   |
|            | paper      | 1   |
|            | pencils    | 4   |
| 2017-01-02 | markers    | 4   |
+------------+------------+-----+
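The groupby is what makes duplicate timestamps safe. A toy illustration (mine, not from the post): two rows for the same item on the same date collapse into one summed row.

```python
# Two 'paper' entries with the same timestamp are summed into a
# single (date, item) row by the groupby.
dupes = pd.DataFrame({'date': ['2017-01-01', '2017-01-01'],
                      'item': ['paper', 'paper'],
                      'qty': [1, 2]})
dupes.groupby(['date', 'item']).sum()
#                        qty
# date       item
# 2017-01-01 paper         3
```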
### Select Values

In this case I chose to select the qty column here to avoid working with the multiple column levels that would otherwise be created in the unstack() step.

```python
plot_data = plot_data['qty']
plot_data.head()
```

```
date        item
2017-01-01  markers       9
            note cards    5
            paper         1
            pencils       4
2017-01-02  markers       4
Name: qty, dtype: int32
```

### unstack

Transform the last level of the index ('item') into columns.

```python
plot_data = plot_data.unstack()
plot_data.head()
```

+------------+---------+------------+-------+---------+
| item       | markers | note cards | paper | pencils |
| date       |         |            |       |         |
+------------+---------+------------+-------+---------+
| 2017-01-01 | 9       | 5          | 1     | 4       |
| 2017-01-02 | 4       | 2          | 3     | 7       |
| 2017-01-03 | 9       | 5          | 2     | 3       |
| 2017-01-04 | 2       | 0          | 0     | 5       |
| 2017-01-05 | 0       | 1          | 6     | 2       |
+------------+---------+------------+-------+---------+
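To see what the Select Values step avoids, here is a quick aside (mine) showing the two-level columns you get when you unstack without selecting first, and one way to flatten them after the fact:

```python
# Without selecting 'qty' first, unstack() produces MultiIndex
# columns like ('qty', 'markers'), ('qty', 'note cards'), ...
wide = data.groupby(['date', 'item']).sum().unstack()

# One way to flatten them back to plain item names:
wide.columns = wide.columns.droplevel(0)
```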
### resample

This step is important for irregular data: it gets the data into regular intervals.

```python
plot_data = plot_data.resample('m').sum()
plot_data.head()
```

+------------+---------+------------+-------+---------+
| item       | markers | note cards | paper | pencils |
| date       |         |            |       |         |
+------------+---------+------------+-------+---------+
| 2017-01-31 | 145     | 128        | 117   | 146     |
| 2017-02-28 | 136     | 140        | 133   | 135     |
| 2017-03-31 | 112     | 145        | 125   | 163     |
| 2017-04-30 | 143     | 148        | 112   | 147     |
| 2017-05-31 | 86      | 134        | 139   | 141     |
+------------+---------+------------+-------+---------+
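A tiny demonstration (my aside) of what resample does with an irregular index:

```python
# Irregular dates: nothing at all happens in February.
s = pd.Series([5, 3, 2],
              index=pd.to_datetime(['2017-01-03', '2017-01-20', '2017-03-15']))
s.resample('m').sum()
# 2017-01-31    8
# 2017-02-28    0    <- empty bins show up (0, or NaN on older pandas,
# 2017-03-31    2       which is why the pattern follows with fillna(0))
```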
### rolling

I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that play out over more than a year; that is just due to the industry that I am in. Play with the resample and rolling rules to fit the needs of your own data.

```python
plot_data = plot_data.rolling(12).sum()
plot_data.head(20)
```

+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-01-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-02-28 | NaN     | NaN        | NaN    | NaN     |
| 2017-03-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-04-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-05-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-06-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-07-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-08-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-09-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-10-31 | NaN     | NaN        | NaN    | NaN     |
| 2017-11-30 | NaN     | NaN        | NaN    | NaN     |
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
| 2018-05-31 | 1624.0  | 1674.0     | 1632.0 | 1592.0  |
| 2018-06-30 | 1582.0  | 1645.0     | 1657.0 | 1593.0  |
| 2018-07-31 | 1662.0  | 1654.0     | 1680.0 | 1613.0  |
| 2018-08-31 | 1654.0  | 1617.0     | 1650.0 | 1616.0  |
+------------+---------+------------+--------+---------+
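If the leading NaN rows bother you, rolling also accepts a min_periods argument as an alternative to the dropna that follows. A sketch (my variation; `monthly` is a hypothetical name standing in for the output of the resample step):

```python
# 'monthly' stands in for the result of .resample('m').sum() above.
# min_periods=1 emits partial sums for the first 11 months instead of
# NaN. Those early values are not true annual sums, so label them
# carefully if you show them to stakeholders.
partial = monthly.rolling(12, min_periods=1).sum()
```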
### dropna

Get rid of the first 11 null rows left behind by the rolling window.

```python
plot_data = plot_data.dropna()
plot_data.head(10)
```
+------------+---------+------------+--------+---------+
| item       | markers | note cards | paper  | pencils |
| date       |         |            |        |         |
+------------+---------+------------+--------+---------+
| 2017-12-31 | 1543.0  | 1739.0     | 1613.0 | 1657.0  |
| 2018-01-31 | 1572.0  | 1744.0     | 1635.0 | 1635.0  |
| 2018-02-28 | 1563.0  | 1717.0     | 1645.0 | 1645.0  |
| 2018-03-31 | 1596.0  | 1703.0     | 1629.0 | 1600.0  |
| 2018-04-30 | 1557.0  | 1693.0     | 1648.0 | 1581.0  |
| 2018-05-31 | 1624.0  | 1674.0     | 1632.0 | 1592.0  |
| 2018-06-30 | 1582.0  | 1645.0     | 1657.0 | 1593.0  |
| 2018-07-31 | 1662.0  | 1654.0     | 1680.0 | 1613.0  |
| 2018-08-31 | 1654.0  | 1617.0     | 1650.0 | 1616.0  |
| 2018-09-30 | 1669.0  | 1648.0     | 1638.0 | 1634.0  |
+------------+---------+------------+--------+---------+