High-Performance Python: Why?
Python is a programming language that first appeared in 1991; soon, it will celebrate its 27th birthday. Python was created not as a fast scientific language, but rather as a general-purpose language. You can use Python as a simple scripting language, as an object-oriented language, or as a functional language…and beyond; it is very flexible. Today, it is used across an extremely wide range of disciplines and by many companies. As such, it has an enormous number of libraries and conferences that attract thousands of people every year.
But, Python is an interpreted language, so it is very slow. Just how slow? It depends, but you can count on about 10–100 times slower than, say, C/C++. If you want fast code, the general rule is: don’t use Python. However, a few more moments of thought lead to a more nuanced perspective. What if you spend most of your time coding, and little time actually running the code? Perhaps your familiarity with the (slow) language, or its vast set of libraries, actually saves you time overall? And, what if you learned a few tricks that made your Python code itself a bit faster? Maybe that is enough for your needs? In the end, for true high-performance computing applications, you will want to explore fast languages like C++; but, not all of our needs fall into that category.
As another example, consider the fact that many applications use two languages, one for the core code and one for the wrapper code; this allows for a smoother interface between the user and the core code. A common use case is C or C++ wrapped by, of course, Python. As a user, you may not even know that the code you are using is in another language! Such a situation is referred to as the “two-language problem”. This situation is great provided you don’t need to work in the core code, or you don’t mind working in two languages – some people don’t mind, but some do. The question then arises: if you are one of those people who would like to work only in the wrapper language, because it was chosen for its user friendliness, what options are available to make that language (Python in this example) fast enough that it can also be used for the core code?
We wanted to explore these ideas a bit further by writing a code in both Python and C++. Our past experience suggested that while Python is very slow, it could be made about as fast as C using the crazily-simple-to-use library Numba. Our basic comparisons here are: basic Python, Numba and C++. Because we are not religious about Python, and you shouldn’t be either, we invited expert C++ programmers to have the chance to speed up the C++ as much as they could (and, boy could they!).
Luckily for those people who would like to use Python at all levels, there are many ways to increase the speed of Python. There is, in fact, a detailed book about this. Our interest here is specifically Numba. You can get it here. Why Numba? Numba allows for speedups comparable to most compiled languages with almost no effort: you use your Python code almost as you would have written it natively, adding only a couple of extra lines of code. It seems almost too good to be true. Is it…?
In a nutshell, Numba employs the LLVM compiler infrastructure to compile Python. Numba also has GPU capabilities, but we will not explore those in this post. Once you have installed Numba, you import it as you would any other library (e.g., NumPy). The difference is that you use decorators to give instructions to Numba; often, this is just placing `@jit` before the function you want compiled.
Below we will compare several codes, including bare Python, Python with Numba, C++ and various forms of optimized C++. We wrote this post for three reasons:
- It’s always useful to see speed comparisons with the code examples given.
- Numba yielded code much faster (relative to C++) than we expected.
- An interesting lesson appears about human time.
Before we get to all of that, here is the background story that led to this study. If you have searched around this website, you might know that we are developing a pure Python molecular dynamics code (Sarkas), which Gautham presented at SciPy 2017. Prof. Murillo was teaching an independent study course on agent-based modeling to David, for which David wrote some simple cellular automata (CA) models; we applied Numba to these simple CA models to see what we would get. Moreover, at the same time, David was taking a C++ class from Prof. Punch. A comparison study was begging for us to complete it!
We used a Wolfram model as our test case – what is that? Wolfram models are a type of one-dimensional cellular automaton. These models consist of a grid of cells; each cell can be in one of a finite number of states, and an update rule is applied to the grid to find the next state. A Wolfram model has N cells in a one-dimensional array, each in an “on” or “off” state. The next state of each cell is determined from the current state of the cell and its two nearest neighbors. In a scheme like this the number of possible state combinations for α cells is 2^α, so our Wolfram model has 2^3 = 8 possible neighborhood combinations when 3 cells are used to calculate the next state. Since each of those combinations can map to either of the two states, the number of possible update rules for α comparison cells is 2^(2^α); our Wolfram models with 3 comparison cells therefore have 2^8 = 256 possible update rules. Since the update rules form a binary system, they map onto binary numbers. The picture below shows a few examples of how update rules work and the mapping to a binary number.
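The rule-to-binary mapping described above can be sketched in a few lines of Python (a hedged illustration; the function name `rule_table` is ours, not from the original code):

```python
def rule_table(rule):
    """Map each of the 8 three-cell neighborhoods to its next state.

    A neighborhood (left, center, right) is read as a 3-bit number n,
    and the next state of the center cell is bit n of the rule number.
    """
    return {(n >> 2 & 1, n >> 1 & 1, n & 1): (rule >> n) & 1 for n in range(8)}

# Rule 30 is binary 00011110: neighborhoods 100, 011, 010 and 001 turn "on".
table = rule_table(30)
print(table[(1, 0, 0)])  # prints 1
print(table[(1, 1, 1)])  # prints 0
```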
Below are a few examples of some Wolfram models written in Python (code is given below).
Why Wolfram models? Because David coincidentally wrote Wolfram models for two separate classes, in Python and C++, at around the same time. Also, the code for the models is straightforward and the solutions are well known. Wolfram models and other cellular automata like them are deterministic: a given update rule and initial condition produce the same solution every time, which makes for an easy comparison between the codes.
The algorithm used in this model iterates through the array from one end to the other while comparing each cell’s state and its two nearest neighbors. Periodic boundary conditions are used. While there are different rules for each Wolfram model, we used Rule 30 here.
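A minimal sketch of that update in Python (using NumPy's `roll` for the periodic boundaries; the function and variable names are our own, not the original code):

```python
import numpy as np

def step(cells, rule=30):
    """One update of a 1D Wolfram model with periodic boundary conditions."""
    left = np.roll(cells, 1)    # left neighbor, wrapping around the ends
    right = np.roll(cells, -1)  # right neighbor, wrapping around the ends
    # Encode each 3-cell neighborhood as a 3-bit integer (0-7), then
    # look up the corresponding bit of the rule number.
    idx = 4 * left + 2 * cells + right
    return (rule >> idx) & 1

cells = np.zeros(11, dtype=np.int64)
cells[5] = 1            # a single "on" cell in the middle
print(step(cells))      # the familiar first step of Rule 30
```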
Optimized C++ Code
What machine were these tested on? Desktop with:
- Xeon® Processor E5-1660 v4 (20M Cache, 3.2-3.6 GHz) 8C/16T 140W
- 4*32GB 2Rx4 4G x 72-Bit PC4-2400 CL17 Registered w/Parity 288-Pin DIMM (128GB Total)
- 2*GeForce GTX 1080 Ti Founders Edition (PNY), 11GB GDDR5X
- 960GB PM863a SATA 6Gb/s 2.5″ SSD, 1,366 TBW (OS and scratch); 1.92TB PM863a SATA 6Gb/s 2.5″ SSD, 2,773 TBW
- RAVE Proprietary Liquid Cooling Kit
Timing was measured in the codes using internal clocks. A time was recorded right before the Wolfram code began running and right after it finished. The code was run ten times at each of several model sizes, and the average time over the ten runs at each size was used.
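The timing procedure can be sketched as a generic helper like the following (our own sketch, not the original timing scripts):

```python
import time

def average_time(func, *args, repeats=10):
    """Average wall-clock time of func(*args) over `repeats` runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()           # clock read right before the run
        func(*args)
        total += time.perf_counter() - start  # and right after it finishes
    return total / repeats

# Example: time a toy workload.
t = average_time(lambda n: sum(range(n)), 100_000)
print(f"average over 10 runs: {t:.6f} s")
```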
And, here is the final result:
In summary, we have compared timings for a Wolfram model code in basic Python, Numba and several versions of C++. We find that Numba is more than 100 times as fast as basic Python for this application. In fact, a straight conversion of the basic Python code to C++ is slower than Numba. With further optimization within C++, the Numba version could be beaten. While this was only one test case, it illustrates some clear points:
- Python is slow.
- Numba speeds up basic Python by a lot with almost no effort.
- Prototyping in Python and converting to C++ can generate code slower than adding Numba.
- If you want a truly fast C++ code, you can write one, and it will beat Numba.
Our biggest concern is that the Wolfram model does not fully capture floating-point operations. We are currently repeating this study with another test case (Julia set) and hope to have that here for you soon.
In the meantime, please comment below with your thoughts, perspectives and experiences.
Following some of the comments we have received, and because the CA model used above might bias the conclusions, we have performed another set of speed comparisons using a Julia set calculation and exploring the parallel options within Numba. Go here to see that!
Average run times (s) at the seven increasing model sizes, smallest to largest:

| Code | | | | | | | |
|---|---|---|---|---|---|---|---|
| C++ Naive `-O2` | 0.0100 | 0.0220 | 0.0590 | 0.1590 | 0.5180 | 1.5270 | 5.1140 |
| C++ Optimized `-O2` | 0.0001 | 0.0003 | 0.0009 | 0.0023 | 0.0067 | 0.0149 | 0.0365 |