Background Tasks in Python for Data Science
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Date: September 10, 2019

This post is intended as an extension/update of background tasks in python
<https://waylonwalker.com/background-1/>. I started using background the week
that Kenneth Reitz released it. It takes away so much boilerplate from running
background tasks that I use it in more places than I probably should. After
taking a look at that post today, I wanted to put a better data science
example in here to help folks get started.

│ I use it in more places than I probably should

Before we get into it, I want to give a shout-out to Kenneth Reitz for making
this so easy. Kenneth is a Python god for all that he has given to the
community in so many ways, especially his knack for building stupid-simple
APIs for very complicated things.

Installation
────────────

### install via pip

[code]
pip install background

### install via GitHub

I believe one of the later PRs to the project fixes the way arguments are
passed in, so I generally clone the repo or copy the module directly into my
project.
clone it

[code]
git clone https://github.com/ParthS007/background.git
cd background
python setup.py install

copy the module

[code]
curl https://raw.githubusercontent.com/ParthS007/background/master/background.py > background.py

🐌 The Slow Function
───────────────────

Imagine that this function is a big one! It is fairly realistic in that it
takes in some input and returns a DataFrame, which is what a good half of my
functions do in data science. The internals of such a function will generally
include a SQL query, a load from s3 or a data catalog, or an aggregation from
another DataFrame. In general it should do one simple thing. Feel free to
copy this "boilerplate".

[code]
import background
from time import sleep
import pandas as pd

@background.task
def long_func(i):
    """
    Simulates fetching data from a service and returning a pandas DataFrame.
    """
    sleep(10)
    return pd.DataFrame({'number_squared': [i**2]})

Calling the Slow Function
─────────────────────────

│ it's the future calling 🤙

If we were to call this function 10 times serially it would take 100s. Not
bad for a dumb example, but detrimental when this gets scaled up 💥. We want
to utilize all of our available resources to reduce our development time and
get moving on our project.

Calling long_func will return a future object. This object has a number of
methods that you can read about in the cpython docs
<https://docs.python.org/3/library/concurrent.futures.html#future-objects>.
The main one we are interested in is result. I typically call these functions
many times and put the futures into a list so that I can track their progress
and get their results. If you need to map inputs back to their results, use a
dictionary instead.

[code]
%time futures = [long_func(i) for i in range(10)]

CPU times: user 319 µs, sys: 197 µs, total: 516 µs
Wall time: 212 µs

Do something with those results()
─────────────────────────────────

Simply running the function completes in no time!
This is because the future objects that are returned are non-blocking, and
the work runs in a background task using the ProcessPoolExecutor. To get the
result back out we need to call the result method on the future object.
result is a blocking call that will not release until the function has
completed.

[code]
%%time
futures = [long_func(i) for i in range(10)]
pd.concat([future.result() for future in futures])

CPU times: user 5.38 ms, sys: 3.53 ms, total: 8.9 ms
Wall time: 10 s

Note that this example completed in 10s, the time it took for only one run,
not all 10! 😎

n
─

│ 😫 crank it up

By default the number of parallel workers will be equal to the number of cpu
threads on your machine. To increase the number of parallel workers
(max_workers), set background.n.

[code]
background.n = 100

Is it possible to overuse @background.task?
───────────────────────────────────────────

I use it essentially anywhere that an operation takes more than a few minutes
and I cannot vectorize it in python and push the compute down into those fast
💨 C-extension libraries like numpy. Nearly every big network request I make
gets broken down into chunks and multithreaded. Let me know… is it possible
to overuse @background.task? Let me know your thoughts @_WaylonWalker
<https://twitter.com/_WaylonWalker>.

Repl.It
───────

Play with the code here! Try different values of background.n and n_runs.
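If you want to play with the input-to-future dictionary pattern mentioned above without installing background, here is a minimal sketch using the standard library's concurrent.futures directly, which is what background wraps. The function name, pool size, and plain-number return value (instead of a DataFrame, to keep the sketch dependency-free) are my own choices, and the sleep is shortened so it runs quickly.

[code]
from concurrent.futures import ThreadPoolExecutor
from time import sleep

def long_func(i):
    """Stand-in for the slow DataFrame-returning function above;
    returns a plain number to keep the sketch dependency-free."""
    sleep(0.1)  # shortened from 10s so the sketch finishes quickly
    return i ** 2

# The executor plays the role background fills behind its decorator;
# max_workers corresponds to background.n.
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each input to its future so every result traces back to its input.
    futures = {i: executor.submit(long_func, i) for i in range(10)}
    # .result() blocks until that future's work has completed.
    results = {i: f.result() for i, f in futures.items()}

print(results[3])  # → 9

Because futures is a dictionary keyed by the input, you can look up any single run's result later instead of relying on list order.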