Background Tasks in Python for Data Science
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Date: September 10, 2019

This post is intended as an extension/update of background tasks in python
<https://waylonwalker.com/background-1/>. I started using background the week
that Kenneth Reitz released it. It takes away so much boilerplate from running
background tasks that I use it in more places than I probably should. After
taking a look at that post today, I wanted to put a better data science
example in here to help folks get started.

│ I use it in more places than I probably should

Before we get into it, I want to give a shout-out to Kenneth Reitz for making
this so easy. Kenneth is a Python god for all that he has given to the
community in so many ways, especially his knack for building stupid-simple
APIs for very complicated things.

Installation
────────────

### install via pip

[code]
pip install background

### install via GitHub

I believe one of the later PRs to the project fixes the way arguments are
passed in, so I generally clone the repo or copy the module directly into my
project.
clone it

[code]
git clone https://github.com/ParthS007/background.git
cd background
python setup.py install

copy the module

[code]
curl https://raw.githubusercontent.com/ParthS007/background/master/background.py > background.py

🐌 The Slow Function
───────────────────

Imagine that this function is a big one! It is fairly realistic in that it
takes in some input and returns a DataFrame, which is what a good half of my
functions do in data science. The internals of such a function will generally
include a SQL query, a load from s3 or a data catalog, or an aggregation from
another DataFrame. In general it should do one simple thing. Feel free to
copy this "boilerplate".

[code]
import background
from time import sleep
import pandas as pd

@background.task
def long_func(i):
    """
    Simulates fetching data from a service and returning a pandas DataFrame.
    """
    sleep(10)
    return pd.DataFrame({'number_squared': [i**2]})

Calling the Slow Function
─────────────────────────

│ it's the future calling 🤙

If we were to call this function 10 times serially it would take 100s. Not
bad for a dumb example, but detrimental when this gets scaled up 💥. We want
to utilize all of our available resources to reduce our development time and
get moving on our project.

Calling long_func will return a future object. This object has a number of
methods that you can read about in the cpython docs
<https://docs.python.org/3/library/concurrent.futures.html#future-objects>.
The main one we are interested in is result. I typically call these functions
many times and put the futures into a list so that I can track their progress
and get their results. If you need to map inputs back to their results, use a
dictionary instead.

[code]
%time futures = [long_func(i) for i in range(10)]

CPU times: user 319 µs, sys: 197 µs, total: 516 µs
Wall time: 212 µs

Do something with those results()
─────────────────────────────────

Simply running the function completes in no time!
This is because the future objects that are returned are non-blocking, and
the work runs in a background task using the ProcessPoolExecutor. To get the
result back out we need to call the result method on the future object.
result is a blocking call that will not release until the function has
completed.

[code]
%%time
futures = [long_func(i) for i in range(10)]
pd.concat([future.result() for future in futures])

CPU times: user 5.38 ms, sys: 3.53 ms, total: 8.9 ms
Wall time: 10 s

Note that this example completed in 10s, the time it took for only one run,
not all 10! 😎

n
─

│ 😫 crank it up

By default the number of parallel workers will be equal to the number of cpu
threads on your machine. To increase the number of parallel workers
(max_workers), set background.n.

[code]
background.n = 100

Is it possible to overuse @background.task?
───────────────────────────────────────────

I use it essentially anywhere that an operation takes more than a few minutes
and I cannot vectorize it in python and push the compute down into those fast
💨 C-extension libraries like numpy. Nearly every big network request I make
gets broken down into chunks and multithreaded. Let me know… is it possible
to overuse @background.task? Let me know your thoughts @_WaylonWalker
<https://twitter.com/_WaylonWalker>.

Repl.It
───────

Play with the code here! Try different values of background.n and n_runs.
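If you want to play with the input-to-future dictionary pattern mentioned above without installing background, here is a minimal sketch using the standard library's concurrent.futures directly, which is what background wraps. The function name, pool size, and plain-number return value (instead of a DataFrame, to keep the sketch dependency-free) are my own choices, and the sleep is shortened so it runs quickly.

[code]
from concurrent.futures import ThreadPoolExecutor
from time import sleep

def long_func(i):
    """Stand-in for the slow DataFrame-returning function above;
    returns a plain number to keep the sketch dependency-free."""
    sleep(0.1)  # shortened from 10s so the sketch finishes quickly
    return i ** 2

# The executor plays the role background fills behind its decorator;
# max_workers corresponds to background.n.
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each input to its future so every result traces back to its input.
    futures = {i: executor.submit(long_func, i) for i in range(10)}
    # .result() blocks until that future's work has completed.
    results = {i: f.result() for i, f in futures.items()}

print(results[3])  # → 9

Because futures is a dictionary keyed by the input, you can look up any single run's result later instead of relying on list order.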