Python is an extremely popular programming language in the scientific community - and not just there: Python is the fourth most popular programming language in use overall.
Part of my work at CEDA involves developing data extraction software in Python that runs on high-performance processing nodes (see JASMIN). These high-spec machines are deployed with several CPUs and a large amount of RAM, which in theory makes them ideal for large-scale numerical processing tasks.
The number of CPU cores available in these virtual machines is an important feature - it means they are well-suited to performing their data processing tasks in parallel, i.e. doing more than one thing at once. To fully exploit the power available in these machines, it's necessary to use some clever Python libraries. Before that, though, let's examine why parallel processing is so useful.
Newer computers are often built with several CPU cores, which enables them to perform more than one task at once - but for most CPU-bound tasks, a single-threaded Python script will only ever use one of those cores. This means that even if you're running your Python code on a high-speed multi-core behemoth... it won't necessarily run any faster than on a similar machine with only one processing core.
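As a quick check, you can ask Python how many cores it can actually see. This is a one-line sketch using the standard library's `multiprocessing.cpu_count()` (available in both Python 2 and 3):

```python
import multiprocessing

# How many CPU cores does this machine have? A single-threaded,
# CPU-bound Python script will still only keep one of them busy.
print(multiprocessing.cpu_count())
```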
For example, consider the following function (it's rather a contrived example, but it should be illustrative):
```python
def sum_range(a, b):
    """
    Sums all the numbers between 'a' and 'b'
    See: http://stackoverflow.com/a/2846695
    """
    total = 0
    for i in xrange(a, b):
        total += i
    return total

sum_range(0, 10000000)
```
The code above simply adds up each number in a given range and returns the result. But what if you wanted to sum a much larger range of integers - between 0 and 10^8, for example?
This would take a noticeably long time in a standard single-threaded Python script, but because integer addition is associative and commutative, the work can be split into smaller, more manageable chunks and the partial results combined at the end. So how about using threads to make the addition faster, by splitting it into two tasks and running them both at once?
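To see why the split is safe, here's a minimal sketch (Python 3 syntax, using `range` in place of `xrange`): summing two half-ranges and combining them gives exactly the same answer as summing the whole range in one go.

```python
# Because integer addition is associative and commutative, summing
# two half-ranges and combining the partial totals gives the same
# result as summing the whole range in a single pass.
whole = sum(range(0, 10000000))
parts = sum(range(0, 5000000)) + sum(range(5000000, 10000000))
assert whole == parts  # both are 49999995000000
```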
```python
import threading
import Queue

class SumThing:
    q = Queue.Queue()  # Holds the partial totals

    def sum_range(self, q, a, b):
        """
        Sums all the numbers between 'a' and 'b'
        See: http://stackoverflow.com/a/2846695
        """
        total = 0
        for i in xrange(a, b):
            total += i
        q.put(total)

s = SumThing()

# Make threads
t1 = threading.Thread(target=s.sum_range, args=(s.q, 0, 50000000))
t2 = threading.Thread(target=s.sum_range, args=(s.q, 50000000, 100000000))

t1.start(); t2.start()  # Start threads running
t1.join(); t2.join()    # Wait for them to finish...

# Add up all the items in the queue...
total = 0
for i in xrange(s.q.qsize()):
    total += s.q.get()
```
So, if we were to measure how long these scripts took to execute...
```
$ time python sum_range_serial.py

real    0m7.242s
user    0m7.226s
sys     0m0.003s

$ time python sum_range_threaded.py

real    0m8.840s
user    0m12.086s
sys     0m2.140s
```
The reason for the strange behaviour we just observed (a multi-threaded script taking longer to run) lies in the way that Python itself is designed. Inside the CPython interpreter lurks a mechanism known as the Global Interpreter Lock, or the GIL. The GIL is a mutex that helps Python keep track of what's running, and grants the currently running thread exclusive access to the internals of the Python interpreter.
That means that only one thread can execute Python bytecode at a time - which seriously limits the performance of threaded CPU-bound tasks. It causes further problems on multi-core computers: threads scheduled across several cores end up competing for the GIL, so threading can actively hurt the performance of your code - in fact, the way that Python uses the GIL is such that it can sometimes be faster to disable the extra CPU cores.
Quoted from the Python Wiki:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
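The point about I/O is worth demonstrating: for I/O-bound work, Python threads really do run concurrently, because blocking calls release the GIL. Here's a minimal sketch (Python 3 syntax) that uses `time.sleep` as a stand-in for a blocking I/O call:

```python
import threading
import time

def fake_io_task():
    # time.sleep blocks outside the GIL, just as real I/O would
    time.sleep(0.5)

start = time.time()
threads = [threading.Thread(target=fake_io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# The four half-second waits overlap, so the whole run takes
# roughly 0.5 seconds rather than 2 seconds
print("%.1f seconds" % elapsed)
```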
So... How do you take advantage of the full capabilities of your multi-core system with Python?
The Python multiprocessing library is a great alternative to the threading module, and it doesn't suffer from the GIL problem. Instead of spawning a number of threads that all contend for a single interpreter lock, multiprocessing spawns separate Python processes (each with its own interpreter and its own GIL). This means that there's no blocking or waiting for the GIL to become free, and there are no CPU-bound threads stealing processor time from I/O-bound ones.
What's more, multiprocessing has an API very similar to the threading module's, and it provides a number of useful data structures for sharing memory between processes and storing results from executions.
For example, here is the threading code from above, adapted to use multiprocessing:
```python
import multiprocessing

class SumThing:
    q = multiprocessing.Queue()  # Holds the partial totals

    def sum_range(self, q, a, b):
        """
        Sums all the numbers between 'a' and 'b'
        See: http://stackoverflow.com/a/2846695
        """
        total = 0
        for i in xrange(a, b):
            total += i
        q.put(total)

s = SumThing()

# Make processes
p1 = multiprocessing.Process(target=s.sum_range, args=(s.q, 0, 50000000))
p2 = multiprocessing.Process(target=s.sum_range, args=(s.q, 50000000, 100000000))

p1.start(); p2.start()  # Start processes running
p1.join(); p2.join()    # Wait for them to finish...

# Add up all the items in the queue...
total = 0
for i in xrange(s.q.qsize()):
    total += s.q.get()
```
and for the difference in execution time...
```
$ time python sum_range_serial.py

real    0m7.242s
user    0m7.226s
sys     0m0.003s

$ time python sum_range_threaded.py

real    0m8.840s
user    0m12.086s
sys     0m2.140s

$ time python sum_range_multiprocess.py

real    0m3.810s
user    0m7.361s
sys     0m0.023s
```
Much better! (Note that the multiprocessing version doesn't quite achieve a perfect 2x speedup, and for smaller workloads a serial implementation can still come out ahead - this is due to the overhead of starting up two Python processes, not a flaw in multiprocessing.)