Python Bits — Using Threads
This is the second in the series of Python blog posts I’m writing; you can find the first one here. In this one, we’ll add threads to our Imgur album downloader, hopefully making it a bit faster than before.
First off, you might be wondering whether threads are useful in Python at all, given the infamous GIL (in CPython, of course). Luckily, in our case the threads spend most of their time waiting on network activity, and a thread that is blocked on I/O releases the GIL so that other threads can run instead. So threads really do help here, since most of the time we’ll be doing either network I/O (while downloading the images) or file I/O (while writing the images to disk).
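To make the point about waiting concrete, here is a minimal sketch that compares running a few blocking calls one after another with running them in threads. It assumes `time.sleep` is a fair stand-in for a network request (both release the GIL while they wait); `fake_download`, `run_sequential` and `run_threaded` are made-up names for this illustration, not part of the downloader.

```python
import threading
import time


def fake_download(seconds):
    # time.sleep stands in for a blocking network call (an assumption
    # for this sketch); while it waits, other threads are free to run.
    time.sleep(seconds)


def run_sequential(n, seconds):
    # Run the n "downloads" one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        fake_download(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    # Run the n "downloads" in n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [
        threading.Thread(target=fake_download, args=(seconds,))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", run_sequential(4, 0.2))
    print("threaded:  ", run_threaded(4, 0.2))
```

On a typical machine the threaded run should take roughly as long as a single wait, while the sequential run takes about four times that.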
Since this is Python, using threads is quite simple: we just need to import the
threading module, create a Thread instance, and then tell it to start. While
creating the instance we tell it what function to run and, if needed, pass
arguments to that function through the thread. We then call the thread’s join
function from the main thread so that it waits for our thread to finish:
import threading

# we'll use a Python thread to call this function
def foo(arg1):
    print(arg1)

# using the below syntax you can create a Python thread
# Note that args must be a tuple, so I've added a comma (,)
# after the "3"; don't forget it
thread = threading.Thread(target=foo, args=(3,))
# starting the thread is a no-brainer
thread.start()
# wait for the thread to finish
thread.join()
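The same start-then-join pattern extends to several threads. The sketch below is only an illustration: `run_squares` and `square` are made-up names, and it relies on each thread writing to its own dictionary key, which is safe in CPython without a lock. It also shows that a thread's return value is thrown away, so results have to be stored somewhere shared.

```python
import threading


def run_squares(numbers):
    results = {}

    def square(n):
        # The return value of a thread's target is discarded, so we
        # store the result in a shared dict, one distinct key per thread.
        results[n] = n * n

    threads = [threading.Thread(target=square, args=(n,)) for n in numbers]
    # start every thread first, so they all run concurrently...
    for t in threads:
        t.start()
    # ...and only then wait for each of them to finish
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_squares(range(5)))
```

Joining inside the first loop, right after each start, would make the threads run one at a time, which defeats the purpose.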
We’ll just call our download_img function from each thread, telling each one to
download a different picture. One problem we now face is the progress bar:
since the threads run concurrently with the main thread, our for loop finishes
as soon as it has launched all the threads, and so the progress bar would reach
100% before all the images have finished downloading.
To counter this, we’ll update the progress bar manually as each thread completes its download.
This is how we do it:
bar = progressbar.ProgressBar(max_value=len(img_lst))
with lock:
    bar.update(i)
    i += 1
max_value tells the progress bar how many items we have; when the count reaches
that number, the bar should be at 100%. To update the progress bar, we take a
lock, use a shared counter to update the bar, and then increment the counter.
The lock is necessary to prevent multiple threads from updating the same global
variable simultaneously and messing up the count.
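Here is a small sketch of the locking idea on its own, without the progress bar. The names `worker`, `run` and `count` are made up for this illustration. An increment is a read-modify-write, so two threads can read the same old value and one update gets lost; holding the lock around the increment rules that out.

```python
import threading

lock = threading.Lock()
count = 0


def worker(n_increments):
    global count
    for _ in range(n_increments):
        # "count += 1" reads, adds and writes back; the lock makes
        # sure no other thread sneaks in between those steps.
        with lock:
            count += 1


def run(n_threads, n_increments):
    # Reset the counter, run the workers, and return the final count.
    global count
    count = 0
    threads = [
        threading.Thread(target=worker, args=(n_increments,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count


if __name__ == "__main__":
    print(run(8, 10000))
```

With the lock in place the final count always equals the number of threads times the number of increments per thread.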
#!/usr/bin/env python
import os
import re
import sys
import threading

import progressbar
import requests
from imgurpython import ImgurClient

# matches the file extension at the end of a link, dot included (e.g. ".jpg")
regex = re.compile(r'\.(\w+)$')


def get_extension(link):
    ext = regex.search(link).group()
    return ext


lock = threading.Lock()
i = 1


def download_img(img):
    # We assign to "i" in this function, so without the global
    # declaration Python would treat it as a local variable and
    # complain that we're using it before initialization.
    # "bar" is only read here, so listing it is optional but harmless.
    global i, bar
    file_ext = get_extension(img.link)
    resp = requests.get(img.link, stream=True)
    # create unique name by combining file id with its extension
    file_name = img.id + file_ext
    with open(file_name, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=1024):
            f.write(chunk)
    # only one thread at a time may update the shared counter and the bar
    with lock:
        bar.update(i)
        i += 1


try:
    album_id = sys.argv[1]
except IndexError:
    raise Exception('Please specify an album id')

client_id = os.getenv('IMGUR_CLIENT_ID')
client_secret = os.getenv('IMGUR_CLIENT_SECRET')
client = ImgurClient(client_id, client_secret)
img_lst = client.get_album_images(album_id)
bar = progressbar.ProgressBar(max_value=len(img_lst))

threads = []
for img in img_lst:
    t = threading.Thread(target=download_img, args=(img,))
    threads.append(t)
    t.start()

# this is for the main thread to wait for all threads to finish
for t in threads:
    t.join()
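The script above starts one thread per image, which is fine for small albums. For a very large album you may want to cap the number of threads. The following is a rough sketch of one way to do that with the standard library's `concurrent.futures.ThreadPoolExecutor`. It is not what the script above does, and `download_all`, `download_one` and the default of 8 workers are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(items, download_one, max_workers=8):
    """Call download_one(item) for every item, on at most max_workers threads."""
    done = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, item) for item in items]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker
            future.result()
            # only the main thread touches "done", so no lock is needed
            done += 1
    return done


if __name__ == "__main__":
    # print stands in for a real download function here
    print(download_all(["a", "b", "c"], print, max_workers=2))
```

Because the counting happens in the main thread as each future completes, a progress bar could be updated at that point without taking a lock.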
Phew, that was quite some work with locks and all. In the next post, we’ll move from these messy threads to the new and shiny async-await style of writing asynchronous code.