This post is intended as an extension/update of my earlier post on background tasks in Python. I started using background the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started.

I use it in more places than I probably should

Before we get into it, I want to give a shout-out to Kenneth Reitz for making this so easy. Kenneth is a Python god for all that he has given to the community in so many ways, especially his knack for building stupid-simple APIs for very complicated things.

Installation

install via pip

pip install background

install via github

I believe one of the later PRs to the project fixes the way arguments are passed in, so I generally clone the repo or copy the module directly into my project.

clone it

git clone https://github.com/ParthS007/background.git
cd background
python setup.py install

copy the module

curl https://raw.githubusercontent.com/ParthS007/background/master/background.py > background.py

🐌 The Slow Function

Imagine that this function is a big one! It is fairly realistic: it takes some input and returns a DataFrame, which is what a good half of my functions in data science do. The internals will generally be a SQL query, a load from S3 or a data catalog, or an aggregation of another DataFrame. In general it should do one simple thing.

Feel free to copy this "boilerplate"


import background
from time import sleep
import pandas as pd

@background.task
def long_func(i):
    """
    Simulates fetching data from a service
    and returning a pandas DataFrame.

    """
    sleep(10)
    return pd.DataFrame({'number_squared': [i**2]})

Calling the Slow Function

it's the future calling 🤙

If we were to call this function 10 times in a row, it would take 100 seconds. Not bad for a dumb example, but detrimental when this gets scaled up 💥. We want to utilize all of our available resources to reduce our development time and get moving on our project.

Calling long_func will return a future object. This object has a number of methods that you can read about in the Python concurrent.futures docs. The main one we are interested in is result. I typically call these functions many times and put the futures into a list so that I can track their progress and get their results. If you need to map inputs back to their results, use a dictionary, as sketched after the timing example below.


%time futures = [long_func(i) for i in range(10)]

CPU times: user 319 µs, sys: 197 µs, total: 516 µs
Wall time: 212 µs
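
Here is a minimal sketch of that dictionary approach, assuming the same long_func from above (calling result on any entry blocks until that run finishes):

futures = {i: long_func(i) for i in range(10)}
results = {i: future.result() for i, future in futures.items()}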

Do something with those results()

Simply running the function completes in no time! This is because the future objects that are returned are non-blocking, and the work runs in a background task using a ThreadPoolExecutor. To get the result back out we need to call the result method on the future object. result is a blocking call that will not release until the function has completed.


%%time
futures = [long_func(i) for i in range(10)]
pd.concat([future.result() for future in futures])

CPU times: user 5.38 ms, sys: 3.53 ms, total: 8.9 ms
Wall time: 10 s

Note that this example completed in 10s, the time it took for only one run, not all 10! 😎
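
If you would rather handle results as they finish instead of in submission order, the futures returned here work with the standard library's as_completed helper. A minimal sketch, again assuming the long_func from above:

from concurrent.futures import as_completed

futures = [long_func(i) for i in range(10)]
for future in as_completed(futures):
    # handle each DataFrame as soon as its run finishes
    print(future.result())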

😫 crank it up

By default the number of parallel workers will be equal to the number of CPU threads on your machine. To increase the number of parallel workers (max_workers), increase background.n.


background.n = 100
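
For context, a quick sketch. My understanding is that the default matches multiprocessing.cpu_count(), per the behavior described above; bumping it mostly pays off for I/O-bound work:

import multiprocessing
import background

print(multiprocessing.cpu_count())  # background.n defaults to this value
background.n = 100                  # raise it for I/O-bound work like network requests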

Is it possible to overuse @background.task?

I use this essentially anywhere that I cannot vectorize a Python operation and push the compute down into those fast 💨 C-extension libraries like numpy, and the operation takes more than a few minutes. Nearly every big network request I make gets broken down into chunks and multithreaded, as in the sketch below. So... is it possible to overuse @background.task? Let me know your thoughts @_WaylonWalker.
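
To illustrate, here is a minimal sketch of that chunked, multithreaded request pattern. The requests library and the example URLs are stand-ins I picked, not anything from the original workflow:

import background
import requests

@background.task
def fetch(url):
    # runs in a background thread; a real version would add retries/error handling
    return requests.get(url, timeout=10).text

urls = [f"https://example.com/page/{i}" for i in range(20)]  # placeholder URLs
futures = [fetch(url) for url in urls]
pages = [future.result() for future in futures]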

Repl.It

Play with the code here! Try different values of background.n and n_runs.