Using ThreadPool for concurrent downloads

This is a simple illustration of using ThreadPool to parallelize downloads. It assumes that bandwidth is not the limiting factor; if bandwidth is the bottleneck, concurrency doesn't help.

In [1]:

import time  # used for the wall-clock timing below

import requests
from multiprocessing.pool import ThreadPool

Test a simple request to my slow server. It replies to any request for /NUMBER with the number requested, but it is artificially slow in handling each request.

In [2]:

%time r = requests.get("http://localhost:8888/10")
r.content

CPU times: user 18.1 ms, sys: 4.54 ms, total: 22.6 ms
Wall time: 224 ms

Out[2]:

'10'
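The server itself is not shown in this notebook. For readers who want to reproduce the experiment, here is a minimal sketch of a server with the behavior described above, written for Python 3 with only the standard library. The names `make_slow_server` and `start_in_background` are hypothetical, and the 0.2 second delay is an assumption chosen to roughly match the wall time measured above; the original server may work quite differently.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_server(port=8888, delay=0.2):
    """Build a server that answers GET /NUMBER with NUMBER after `delay` seconds.

    Hypothetical stand-in for the notebook's server; `delay` is an assumed value.
    """

    class SlowHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(delay)  # artificial slowness, applied to every request
            body = self.path.lstrip("/").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging

    # ThreadingHTTPServer handles each request in its own thread, so
    # concurrent clients wait in parallel instead of queueing.
    return ThreadingHTTPServer(("localhost", port), SlowHandler)


def start_in_background(server):
    """Serve on a daemon thread so the caller keeps control."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

Calling `start_in_background(make_slow_server())` would serve on port 8888, the port used in the requests here. A threaded server matters for this experiment: one that handled requests strictly one at a time would hide any benefit from client-side concurrency.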

Our test function downloads the URL for a given ID and parses the result (the response body is a string of digits, which is converted to an int).

In [3]:

def get_data(ID):
    """function for getting data from our slow server"""
    r = requests.get("http://localhost:8888/%i" % ID)
    return int(r.content)

Now fetch the same 128 IDs with a ThreadPool, varying the number of concurrent threads and timing each run:

In [4]:

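IDs = range(128)
for nthreads in [1, 2, 4, 8, 16, 32]:
    pool = ThreadPool(nthreads)
    tic = time.time()
    result = pool.map(get_data, IDs)  # blocks until every download finishes
    toc = time.time()
    pool.close()  # release the worker threads before starting the next pool
    print("%i threads: %3.1f seconds" % (nthreads, toc-tic))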

1 threads: 26.2 seconds
2 threads: 13.3 seconds
4 threads: 6.7 seconds
8 threads: 3.4 seconds
16 threads: 1.8 seconds
32 threads: 1.1 seconds
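To make the scaling easier to read, the timings above can be converted to a speedup relative to the single-thread run and to a fraction of ideal linear scaling. This is a small sketch using only the numbers printed above; the names `timings`, `speedup`, and `efficiency` are illustrative and not part of the original notebook.

```python
# Wall-clock timings copied from the run above: threads -> seconds.
timings = {1: 26.2, 2: 13.3, 4: 6.7, 8: 3.4, 16: 1.8, 32: 1.1}

baseline = timings[1]
speedup = {}
efficiency = {}
for nthreads in sorted(timings):
    # Speedup: how many times faster than the single-thread run.
    speedup[nthreads] = baseline / timings[nthreads]
    # Efficiency: fraction of ideal scaling, where ideal speedup == nthreads.
    efficiency[nthreads] = speedup[nthreads] / nthreads
    print("%2i threads: %4.1fx speedup, %3.0f%% of ideal"
          % (nthreads, speedup[nthreads], 100 * efficiency[nthreads]))
```

On these numbers, scaling stays close to ideal through 8 threads and falls to roughly three quarters of ideal at 32. That pattern would be consistent with fixed per-request overhead or the server starting to become the bottleneck, though this notebook does not measure which.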