Python Bits — Moving from threads to async-await
This is the third post in the series of Python blog posts I’m writing; you can find the first one here. In this one we’ll move from threads to the async-await style introduced in Python 3.5.
Let’s start by clearing up some terms. async-await is just another approach to asynchronous programming. The other two approaches that I know of are callbacks and promises (both quite famous in the JavaScript world).
One more thing we need to understand is that asynchronous just means that tasks don’t block. For example, if I try to fetch a web page from a slow website, my program doesn’t have to wait for the download to finish; it can carry on doing other things while the page loads. This isn’t the same as doing multiple things in parallel. The following pseudocode should make things a bit easier to understand:
# Blocking code example
page = get_page_sync('some_page')
# The call to print is blocked
# If the get_page_sync is a slow function
# Our whole program would get stalled
print(page)
There are two approaches to mitigate the above scenario. First, let’s use threads (which we have done until now in our Imgur album downloader). By using threads, we can offload the get_page_sync call to a separate thread and do something else in the main program thread.
import threading

### offload the slow code to a thread
t = threading.Thread(target=get_page_sync, args=('some_page',))
### start() runs the target in a new thread; run() would execute it in the current one
t.start()
### do something else while the thread runs
do_something_else()
### wait for the thread to complete
t.join()
There are advantages as well as disadvantages to threading. The main disadvantages are having to lock shared data structures before mutating them, and having to pass exceptions raised within a thread back to the main thread through message passing. At the same time, threads are a bit more convenient to work with compared to async-await (or I’m just not yet used to this paradigm!).
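To make those two disadvantages concrete, here is a minimal sketch using only the standard library. The get_page_sync below is a hypothetical stand-in for a slow download (it fails for any URL containing "bad"), and the names counter, errors, worker and fetch_all are made up for this illustration.

```python
import queue
import threading

counter = 0                      # shared state mutated by several threads
counter_lock = threading.Lock()  # guards every update to `counter`
errors = queue.Queue()           # worker threads report exceptions here


def get_page_sync(url):
    """Hypothetical stand-in for a slow, blocking download."""
    if "bad" in url:
        raise ValueError(f"could not fetch {url}")
    return f"<html>{url}</html>"


def worker(url):
    global counter
    try:
        get_page_sync(url)
        with counter_lock:       # lock before mutating shared data
            counter += 1
    except Exception as exc:
        # An exception raised here does not propagate to the main thread,
        # so pass it along as a message instead.
        errors.put((url, exc))


def fetch_all(urls):
    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers are done, so draining the queue here is safe.
    failed = []
    while not errors.empty():
        failed.append(errors.get())
    return failed


failed = fetch_all(["page_1", "page_2", "bad_page"])
print(counter, [url for url, _ in failed])
```

The main thread only learns about the failed download because the worker put the exception on the queue.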
Now let’s turn to the second approach to mitigating slow blocking functions: asynchronous programming, and specifically the async-await syntax introduced in Python 3.5. The main difference I see between async-await and multi-threaded programming is that in the latter the kernel does the context switch, while in the former we ourselves are responsible for yielding control to a different part of the code. By a context switch I mean that when multiple threads are running, it is up to the kernel to stop executing the current thread, put it to sleep, pick up a different thread, and continue execution. This is known as preemptive multitasking.
When we ourselves yield control, it is known as non-preemptive or cooperative multitasking. Since with non-preemptive multitasking it is our own code that takes care of context switches, we need a scheduler, also called an “event loop”. This event loop just loops through all the tasks that are waiting to be scheduled and runs them. Whenever we yield control, the current task is put back on the queue, and the next task that is ready to run is popped from the queue and starts executing. For example, the above pseudocode can be changed in the following way to allow for yielding control:
async def print_page():
    page = await get_page_async('some_page')
    print(page)
When we hit the await statement above, the get_page_async function does a non-blocking fetch of 'some_page' and yields control. That in turn means our print_page function yields control to the event loop, which can carry on doing something else until we get a response back.
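Here is a runnable sketch of that hand-off, assuming Python 3.7+ for asyncio.run. The network fetch is simulated with asyncio.sleep, and do_something_else, demo and the events list are hypothetical names added only to record the order in which things happen.

```python
import asyncio

events = []


async def get_page_async(url):
    """Hypothetical stand-in for a non-blocking fetch."""
    events.append("fetch started")
    await asyncio.sleep(0.05)      # simulates waiting on the network
    events.append("fetch finished")
    return f"<html>{url}</html>"


async def print_page():
    page = await get_page_async("some_page")
    print(page)


async def do_something_else():
    events.append("doing something else")


async def demo():
    # Run both coroutines on the same event loop.
    await asyncio.gather(print_page(), do_something_else())


asyncio.run(demo())
print(events)
```

The "doing something else" entry lands between the start and the end of the fetch, even though everything runs in a single thread.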
Let’s start by changing our threaded code to this syntax. We’ll use the asyncio library that Python itself provides for the event loop, and the aiohttp package for making asynchronous HTTP requests.
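Before reaching for asyncio, it may help to see how small the core idea of an event loop is. The following is a toy round-robin scheduler built on plain generators. It is only a sketch of cooperative multitasking, not how asyncio is implemented, and task and run_event_loop are names invented for this example.

```python
from collections import deque


def task(name, steps, log):
    """A cooperative task: does a bit of work, then yields control."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield  # hand control back to the loop


def run_event_loop(tasks):
    """Toy round-robin scheduler over generator-based tasks."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()   # take the next task that is ready
        try:
            next(current)           # run it until its next yield
        except StopIteration:
            continue                # task finished, drop it
        ready.append(current)       # it yielded: back of the queue


log = []
run_event_loop([task("a", 2, log), task("b", 3, log)])
print(log)
```

The two tasks take turns because each one gives up control voluntarily at its yield; nothing interrupts them from the outside.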
We’ll create a function called main (defined later), which will be the entry point of our asynchronous code. We’ll then create an event loop and a “future object”. This future object is an abstraction over an asynchronous function that stores some extra attributes, such as its current state (just like promises). We then tell our event loop to keep running until this future completes.
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(main())
loop.run_until_complete(future)
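As a side note, on Python 3.7 and later the three lines above can be collapsed into a single asyncio.run call. The main below is only a placeholder so the snippet runs on its own; it is not the main function of the downloader.

```python
import asyncio


async def main():
    """Placeholder for the real entry point defined in the post."""
    await asyncio.sleep(0)
    return "done"


# Creates a new event loop, runs main() to completion, then closes the loop.
result = asyncio.run(main())
print(result)
```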
Now in our main function, we’ll create another list of future tasks, each one responsible for downloading a different image from the Imgur album. We do this because each download leads to a network call, and for each network call we can yield control to another piece of code. After creating the list of tasks, we can wait for the whole list to complete by calling asyncio.gather. This is how it’s done:
async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for img in img_lst:
            task = asyncio.ensure_future(download_img(img, session))
            tasks.append(task)
        # gather inside the block so the session stays open until all downloads finish
        await asyncio.gather(*tasks)
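To see how asyncio.gather behaves without touching the network, here is a small sketch where the downloads are simulated with asyncio.sleep. This download_img takes a hypothetical name and delay instead of an image and a session, and asyncio.run assumes Python 3.7+.

```python
import asyncio


async def download_img(name, delay):
    """Hypothetical stand-in for a download taking `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def main():
    tasks = [
        asyncio.ensure_future(download_img("slow.png", 0.05)),
        asyncio.ensure_future(download_img("fast.png", 0.01)),
    ]
    # Results follow the order of `tasks`, not the order of completion.
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
print(results)
```

Both simulated downloads wait at the same time, and gather still returns the results in the order the tasks were passed in.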
The last function we need to modify is download_img, which until now was using threads. We just replace the requests.get call with await session.get, and read the response body asynchronously, chunk by chunk, as we save it to disk:
i = 1

async def download_img(img, session):
    global i, bar
    # get the file extension
    file_ext = get_extension(img.link)
    # create unique name by combining file id with its extension
    file_name = img.id + file_ext
    resp = await session.get(img.link)
    with open(file_name, 'wb') as f:
        # read the body in 1 KB chunks, yielding control while waiting for data
        async for chunk in resp.content.iter_chunked(1024):
            f.write(chunk)
    # update the progress bar once the image is fully written
    bar.update(i)
    i += 1
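One thing you’ll notice is that we no longer need the lock before updating the value of i. This is because, as I said earlier, only one piece of code is running at a time, and control only changes hands at an await. There is no await between bar.update(i) and i += 1, so no other task can run in between and there is no race condition here.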
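The following sketch shows where that guarantee holds and where it stops holding. It is a made-up example (safe_increment, unsafe_increment and run_many are not part of the downloader) and assumes Python 3.7+ for asyncio.run.

```python
import asyncio

counter = 0


async def safe_increment():
    global counter
    await asyncio.sleep(0)   # control may change hands here...
    counter += 1             # ...but not in the middle of this line


async def unsafe_increment():
    global counter
    current = counter        # read
    await asyncio.sleep(0)   # other tasks run and read the same value
    counter = current + 1    # write back a stale result


async def run_many(func, n=100):
    global counter
    counter = 0
    await asyncio.gather(*(func() for _ in range(n)))
    return counter


safe_total = asyncio.run(run_many(safe_increment))
unsafe_total = asyncio.run(run_many(unsafe_increment))
print(safe_total, unsafe_total)
```

The safe version always counts all 100 increments. The unsafe one loses updates because an await sits between the read and the write, so a lock (or restructuring the code) is still needed in that situation.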
The asynchronous version of our previously implemented threaded code should be slightly faster, since it has the overhead of neither locks nor threads.
Here’s the entire code:
Some very good sources for more info: