### High-Performance Python: Why?

Python is a programming language that first appeared in 1991; soon it will celebrate its 27th birthday. Python was created not as a fast scientific language, but rather as a general-purpose language. You can use Python as a simple scripting language, as an object-oriented language, as a functional language…and beyond; it is *very* flexible. Today it is used across an extremely wide range of disciplines and by many companies. As such, it has an enormous number of libraries and conferences that attract thousands of people every year.

But Python is an interpreted language, so it is very slow. Just how slow? It depends, but you can count on roughly 10-100 times slower than, say, C/C++. If you want fast code, the general rule is: don't use Python. A few more moments of thought, however, lead to a more nuanced perspective. What if you spend most of your time writing code, and comparatively little time actually running it? Perhaps your familiarity with the (slow) language, or its vast set of libraries, actually saves you time *overall*. And what if you learned a few tricks that made your Python code itself a bit faster? Maybe that is enough for your needs. In the end, for true high-performance computing applications you will want to explore fast languages like C++; but not all of our needs fall into that category.

As another example, consider the fact that many applications use two languages, one for the core code and one for the wrapper code; this allows for a smoother interface between the user and the core code. A common use case is C or C++ wrapped by, of course, Python. As a user, you may not even know that the code you are using is in another language! Such a situation is referred to as the “two-language problem”. This situation is great provided you don’t need to work in the core code, or you don’t mind working in two languages – some people don’t mind, but some do. The question then arises: if you are one of those people who would like to work *only* in the wrapper language, because it was chosen for its user friendliness, what options are available to make that language (Python in this example) fast enough that it can also be used for the core code?

We wanted to explore these ideas a bit further by writing the same code in both Python and C++. Our past experience suggested that while Python is very slow, it could be made about as fast as C using the crazily-simple-to-use library Numba. Our basic comparisons here are: basic Python, Numba and C++. Because we are not religious about Python, and you shouldn't be either, we invited expert C++ programmers to speed up the C++ as much as they could (and, boy, could they!).

## Numba

Luckily for those people who would like to use Python at all levels, there are many ways to increase the speed of Python. There is, in fact, a detailed book about this. Our interest here is specifically **Numba**. You can get it here. Why Numba? Numba allows for speedups comparable to most compiled languages with almost no effort: you use your Python code almost as you would have written it natively, adding only a couple of extra lines of code. It seems almost too good to be true. Is it…?

In a nutshell, Numba employs the LLVM compiler infrastructure to compile Python. Numba also has GPU capabilities, but we will not explore those in this post. Once you have installed Numba, you import it as you would any other library (e.g., NumPy). The difference is that you use decorators to give instructions to Numba; often, this is just placing `@jit` before the function you want compiled.
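As a quick illustration of how little is involved, here is a minimal sketch (the function and array are our own invention, not from the benchmark below; we also fall back to a no-op decorator so the sketch runs even where Numba is not installed):

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback: a no-op decorator so the sketch still runs without Numba.
    def jit(*args, **kwargs):
        return lambda f: f

@jit(nopython=True)  # ask Numba to compile this function to machine code
def sum_of_squares(arr):
    total = 0.0
    for x in arr:  # an explicit loop: slow in plain Python, fast once compiled
        total += x * x
    return total

data = np.arange(1_000_000, dtype=np.float64)
sum_of_squares(data)  # the first call triggers compilation; time later calls
```

The first call compiles the function for the argument types it sees; subsequent calls with the same types run the compiled version directly.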

Below we will compare several codes, including bare Python, Python with Numba, C++ and various forms of optimized C++. We wrote this post for three reasons:

- It’s always useful to see speed comparisons with the code examples given.
- Numba yielded code much faster (relative to C++) than we expected.
- An interesting lesson appears about human time.

Before we get to all of that, here is the background story that led to this study. If you have searched around this website, you might know that we are developing a pure Python molecular dynamics code (Sarkas), which Gautham presented at SciPy 2017. Prof. Murillo was teaching an independent study course on agent-based modeling to David, for which he wrote some simple cellular automata (CA) models; we applied Numba to these simple CA models to see what we would get. Moreover, at the same time, David was taking a C++ class from Prof. Punch. A comparison study was begging to be done!

## Wolfram Models

We used a Wolfram model as our test case – what is that? Wolfram models are a type of one-dimensional cellular automaton. Models of this type consist of a grid of cells, each of which can be in one of a finite number of states, and an update rule is applied to the grid to find the next state. A Wolfram model has N cells in a one-dimensional array, each of which can be in an “on” or “off” state. The next state of each cell is determined by the state of the current cell and its two nearest neighbors. In a scheme like this, the number of possible state combinations for α cells is 2^{α}, so our Wolfram model has 2^{3} = 8 possible neighborhoods when 3 cells are used to calculate the next state. Since an update rule assigns one of two output states to each of those neighborhoods, the number of possible update rules for α comparison cells is 2^{2^{α}}, so our Wolfram models with 3 comparison cells have 2^{8} = 256 possible update rules. Because each update rule is a binary assignment over the 8 neighborhoods, every rule maps onto an 8-bit binary number. The picture below shows a few examples of how update rules work and the mapping to a binary number.
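The rule-number-to-binary mapping can be made concrete in a few lines of Python (the function name `rule_table` is our own): bit k of the rule number gives the next state for the neighborhood whose three cells, read as a binary number, equal k.

```python
def rule_table(rule_number):
    # Bit k of the rule number is the next state for the neighborhood
    # (left, center, right) whose three bits encode the integer k.
    return {(k >> 2 & 1, k >> 1 & 1, k & 1): (rule_number >> k) & 1
            for k in range(8)}

table = rule_table(30)   # Rule 30 = 0b00011110
print(table[(1, 0, 0)])  # neighborhood 100 -> 1
print(table[(1, 1, 1)])  # neighborhood 111 -> 0
```

For Rule 30, the neighborhoods 100, 011, 010 and 001 produce an “on” cell; the other four produce “off”.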

Below are a few examples of some Wolfram models written in Python (code is given below).

Why Wolfram models? Because David coincidentally wrote Wolfram models for two separate classes, in Python and in C++, at around the same time. Also, the code for the models is straightforward and the solutions are well known. Wolfram models, like other cellular automata, are deterministic: a given update rule and initial condition produce the same solution every time, which makes for an easy comparison between the codes.

The algorithm used in this model iterates through the array from one end to the other, comparing each cell’s state with those of its two nearest neighbors. Periodic boundary conditions are used. While there are 256 possible rules, we used Rule 30 here.
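The update loop described above can be sketched as follows (a toy version with a hard-coded Rule 30 lookup table; the modulo indices implement the periodic boundary):

```python
RULE30 = {  # (left, center, right) -> next state
    (1, 1, 1): 0, (1, 1, 0): 0, (1, 0, 1): 0, (1, 0, 0): 1,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def step(cells):
    n = len(cells)
    # Indices taken modulo n wrap around at both ends: periodic boundaries.
    return [RULE30[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

row = [0, 0, 0, 1, 0, 0, 0]
print(step(row))  # the single "on" cell spreads: [0, 0, 1, 1, 1, 0, 0]
```

Iterating `step` on successive rows produces the familiar triangular Rule 30 pattern.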

## Base Codes

#### Python Code

```python
# Written by David Butts
# This code is an implementation of a Rule 30 Wolfram model written in Python.
import numpy as np
import time

def Rule30_code():
    Rule30 = np.zeros((1000, 100000))  # initialize an array to run on (timesteps, width)
    Rule30[0, 50] = 1
    for y in range(Rule30.shape[0] - 1):  # iterate through grid
        for x in range(Rule30.shape[1]):
            # update the next row's values according to neighbor & self values
            right = x + 1
            down = y + 1
            left = x - 1  # -1 wraps around via Python's negative indexing
            if right >= Rule30.shape[1]:
                right = 0
            if Rule30[y, right] == 1 and Rule30[y, left] == 0:
                Rule30[down, x] = 1
            elif Rule30[y, x] == 0 and Rule30[y, right] == 0 and Rule30[y, left] == 1:
                Rule30[down, x] = 1
            elif Rule30[y, x] == 1 and Rule30[y, right] == 0 and Rule30[y, left] == 0:
                Rule30[down, x] = 1
            else:
                Rule30[down, x] = 0

# average run time for 10 runs
av_time = 0
for run in range(10):
    start = time.time()
    Rule30_code()
    end = time.time()
    av_time += (end - start)
print(av_time / 10.0, 'seconds for 10 runs')
```

#### Numba Code

```python
# Written by David Butts
# This code is an implementation of a Rule 30 Wolfram model written in Python with Numba.
import numpy as np
import time
import numba as nb

@nb.jit  # compile the function with Numba
def Rule30_code():
    Rule30 = np.zeros((1000, 100000))  # initialize an array to run on (timesteps, width)
    Rule30[0, 50] = 1
    for y in range(Rule30.shape[0] - 1):  # iterate through grid
        for x in range(Rule30.shape[1]):
            # update the next row's values according to neighbor & self values
            right = x + 1
            down = y + 1
            left = x - 1  # -1 wraps around via negative indexing
            if right >= Rule30.shape[1]:
                right = 0
            if Rule30[y, right] == 1 and Rule30[y, left] == 0:
                Rule30[down, x] = 1
            elif Rule30[y, x] == 0 and Rule30[y, right] == 0 and Rule30[y, left] == 1:
                Rule30[down, x] = 1
            elif Rule30[y, x] == 1 and Rule30[y, right] == 0 and Rule30[y, left] == 0:
                Rule30[down, x] = 1
            else:
                Rule30[down, x] = 0

# average run time for 10 runs
av_time = 0
for run in range(10):
    start = time.time()
    Rule30_code()
    end = time.time()
    av_time += (end - start)
print(av_time / 10.0, 'seconds for 10 runs')
```

#### C++ Code

```cpp
// Written by David Butts
#include<iostream>
#include<vector>
#include<chrono>
#include<iomanip>
using std::cout; using std::endl;
using std::vector;

int next_val(const vector<int> &v){
    if (v == vector<int>{0,0,0})
        return 0;
    else if (v == vector<int>{0,0,1})
        return 0;
    else if (v == vector<int>{0,1,0})
        return 0;
    else if (v == vector<int>{0,1,1})
        return 1;
    else if (v == vector<int>{1,0,0})
        return 1;
    else if (v == vector<int>{1,0,1})
        return 1;
    else if (v == vector<int>{1,1,0})
        return 1;
    else
        return 0;
}

vector<int> one_iteration(const vector<int> &v){
    vector<int> vec = v;
    for(size_t i = 1; i < v.size() - 1; i++)
        vec[i] = next_val({v[i-1], v[i], v[i+1]});
    // periodic boundary conditions at both ends
    vec[0] = next_val({v[v.size()-1], v[0], v[1]});
    vec[v.size()-1] = next_val({v[v.size()-2], v[v.size()-1], v[0]});
    return vec;
}

int main() {
    cout << std::fixed << std::setprecision(5);
    int iter = 1000;
    long av_time = 0;
    vector<int> v(100000, 0);
    v[50] = 1;
    for(int run = 0; run < 10; run++){
        auto t_start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < iter; ++i){
            v = one_iteration(v);
        }
        auto t_end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration<double, std::milli>(t_end - t_start);
        av_time += duration.count();
    }
    cout << "msec = " << av_time/10.0 << endl;
}
```

#### Optimized C++ Code

```cpp
// Written by Bill Punch
#include<iostream>
using std::cout; using std::endl; using std::boolalpha;
#include<chrono>
#include<iomanip>
#include<bitset>
using std::bitset;

int main() {
    const size_t sz = 100000;
    const int iter = 1000;
    cout << boolalpha << std::fixed << std::setprecision(5);
    int val;
    long av_time = 0;
    std::bitset<sz> v(0);
    std::bitset<sz> tmp(0);
    v[sz/2] = 1;
    using time_unit = std::chrono::microseconds;
    //using time_unit = std::chrono::milliseconds;
    for(int run = 0; run < 10; run++){
        auto start = std::chrono::steady_clock::now();
        for(int i = 0; i < iter; ++i){
            for(size_t i = v.size(); i >= 1; --i){
                val = (int)v[i-1] << 2 | (int)v[i] << 1 | (int)v[i+1];
                tmp[i] = (val == 3 || val == 5 || val == 6);
            }
            v = tmp;
        }
        auto duration = std::chrono::duration_cast<time_unit>(std::chrono::steady_clock::now() - start);
        av_time += duration.count();
    }
    cout << "msec = " << av_time/10.0 << endl;
}
```

## Results

What machine were these tested on? A desktop with:

- Xeon® Processor E5-1660 v4 (20M Cache, 3.2-3.6 GHz) 8C/16T 140W
- 4*32GB 2Rx4 4G x 72-Bit PC4-2400 CL17 Registered w/Parity 288-Pin DIMM (128GB Total)
- 2*GeForce GTX 1080 Ti Founders Edition (PNY), 11GB GDDR5X
- 960GB PM863a SATA 6Gb/s 2.5″ SSD, 1,366 TBW (OS and scratch); 1.92TB PM863a SATA 6Gb/s 2.5″ SSD, 2,773 TBW
- RAVE Proprietary Liquid Cooling Kit

Timing was measured inside the codes using internal clocks: a time was recorded right before the Wolfram code began running and right after it finished. The code was run ten times at each model size, and the average time over the ten runs was used.

And, here is the final result:

## Summary

In summary, we have compared timings for a Wolfram model code in basic Python, Numba and several versions of C++. We find that Numba is more than 100 times as fast as basic Python for this application. In fact, a straight conversion of the basic Python code to C++ is slower than Numba. With further optimization within C++, the Numba version could be beaten. While this was only one test case, it illustrates some obvious points:

- Python is slow.
- Numba speeds up basic Python by a lot with almost no effort.
- Prototyping in Python and converting to C++ can generate code slower than adding Numba.
- If you want a truly fast C++ code, you can write one, and it will beat Numba.

Our biggest concern is that the Wolfram model does not exercise floating-point operations. We are currently repeating this study with another test case (a Julia set) and hope to have it here for you soon.

In the meantime, please comment below with your thoughts, perspectives and experiences.

Following some of the comments we have received, and because the CA model used above might bias the conclusions, we have performed another set of speed comparisons using a Julia set calculation and exploring the parallel options within Numba. Go here to see that!

## Credits

## Data

| Code/Width | 100 | 300 | 1000 | 3000 | 10000 | 30000 | 100000 |
|---|---|---|---|---|---|---|---|
| Python | 0.1330 | 0.3990 | 1.3565 | 4.2989 | 15.4906 | 44.7538 | 148.8903 |
| Numba | 0.0193 | 0.0185 | 0.0231 | 0.0324 | 0.0617 | 0.1400 | 0.4058 |
| C++ Naive | 0.0390 | 0.1120 | 0.3350 | 1.0170 | 3.3470 | 9.9960 | 33.3370 |
| C++ Naive -O2 | 0.0100 | 0.0220 | 0.0590 | 0.1590 | 0.5180 | 1.5270 | 5.1140 |
| C++ Optimized | 0.0150 | 0.0330 | 0.0960 | 0.2720 | 0.8810 | 2.6120 | 8.7240 |
| C++ Optimized -O2 | 0.0001 | 0.0003 | 0.0009 | 0.0023 | 0.0067 | 0.0149 | 0.0365 |

## Comments

Have you guys tried using the parallelization option in Numba? From my experience with large array manipulation, it can give a further 40% speed boost.

Here is what I did:

“High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming”. Ernest Bonat, Ph.D. July 31, 2017. (http://dataconomy.com/2017/07/big-data-numpy-numba-python/)

Let me know what you think?
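For readers curious about the parallel option mentioned in this comment, enabling it looks roughly like the following sketch (our own toy function, not one of the benchmark codes; `prange` is Numba's parallel range, and we fall back to plain Python if Numba is not installed):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch runs without Numba installed.
    prange = range
    def njit(*args, **kwargs):
        return lambda f: f

@njit(parallel=True)  # Numba distributes prange iterations across threads
def scaled_sum(a, b):
    out = np.empty_like(a)
    for i in prange(a.shape[0]):
        out[i] = 2.0 * a[i] + b[i]
    return out

a = np.ones(1_000_000)
b = np.ones(1_000_000)
print(scaled_sum(a, b)[0])  # 3.0
```

Whether `parallel=True` helps depends on the workload; loops with independent iterations, like the one above, are the best candidates.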


To make it a proper comparison, you should port the optimized code back from C++ into Numba and create a new comparison point, “Numba optimized”. To make it even better, since the optimized C++ required someone experienced with C++ to create something tuned for C++, you should spend an equivalent amount of time creating a version that is optimized for Numba.

I agree; in fact, it looks like the main difference between the Numba code and the C++ code lies in what they do (what they allocate, the conditions they check) rather than in their language. A more efficient Numba code, closer to the C++ one, would be for example:

```python
import numpy as np
import numba as nb

sz = 100000
iterations = 1000

@nb.jit(nopython=True, parallel=False)
def Rule30_code():
    v = np.zeros(sz, np.int8)
    v[sz//2] = 1
    test = np.zeros(sz, np.int8)
    for it in range(iterations):
        test[1:sz-1] = (v[:sz-2] << 2) + (v[1:sz-1] << 1) + v[2:]
        for i in range(1, sz-1):
            v[i] = 1 if (0 < test[i] < 5) else 0
    return v

v_fast = Rule30_code()
%timeit -n 10 Rule30_code()  # IPython magic for timing
```

On my machine, this runs about 10.5-11 times faster than the posted numba code on the size=100000 example (producing the same result). This would make "optimized numba" just as fast as "C++ optimized -O2". Could someone add this option to the benchmark?

The naive C++ code is pretty bad. If you condense the else-if conditions into a handful, say two or three, you can speed it up quite a bit. Additionally, the naive C++ allocates a ton of std::vectors with all those initializer lists; if you get rid of those and take three ints as parameters instead of a std::vector, it runs even faster. On gcc with -O2, those two changes get the naive C++ down to an average run time of about 100 ms.