Numba Versus C++

2018-01-17T15:57:55-05:00

Have you tried the parallelization option in Numba? In my experience with large-array manipulation, it can give a further 40% speed boost.
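For readers who have not used it, the option in question is `parallel=True` combined with `numba.prange`. Below is a minimal sketch, assuming a hypothetical `rule30_step` helper that is not part of the original post. The `try`/`except` fallback is only there so the sketch still runs where Numba is not installed, and no particular speedup is implied.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(parallel=True)
def rule30_step(v, out):
    """Write one Rule 30 update of v into out (periodic boundaries)."""
    n = v.shape[0]
    # Each iteration reads only v and writes only out[i],
    # so the iterations are independent and safe to run in parallel.
    for i in prange(n):
        left = v[(i + n - 1) % n]
        center = v[i]
        right = v[(i + 1) % n]
        out[i] = left ^ (center | right)  # Rule 30 as a bitwise expression


v = np.zeros(1001, dtype=np.int8)
v[500] = 1                # single live cell in the middle
out = np.empty_like(v)
rule30_step(v, out)
```

Writing into a separate `out` array rather than updating `v` in place is what keeps the loop free of cross-iteration dependencies.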

2018-02-15T18:12:35-05:00

Here is what I did:

“High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming”. Ernest Bonat, Ph.D. July 31, 2017. (http://dataconomy.com/2017/07/big-data-numpy-numba-python/)

Let me know what you think.

2018-03-06T19:59:23-05:00

Your list of links stays on display over the top of the content. It is probably best to avoid such gimmickry anyway, but it’s really bad when it’s broken, as is the case on this site.

2019-06-02T13:10:24-04:00

To make it a proper comparison, you should port the optimized C++ code back to Numba and add a new comparison point, “Numba optimized”. Better still, since the optimized C++ code required someone experienced in C++ who wrote something tailored to C++, you should spend an equivalent amount of time creating a version that is optimized for Numba.

2019-06-24T03:09:12-04:00

I agree. In fact, it looks like the main difference between the Numba code and the C++ code lies in what they do (what they allocate, the conditions they check) rather than in their language. A more efficient Numba version, closer to the C++ one, would be for example:

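import numpy as np
import numba as nb

sz = 100000
iterations = 1000

@nb.jit(nopython=True, parallel=False)
def Rule30_code():
    v = np.zeros(sz, np.int8)
    v[sz//2] = 1
    test = np.zeros(sz, np.int8)
    for it in range(iterations):
        # encode each (left, center, right) neighborhood as a number from 0 to 7
        test[1:sz-1] = (v[:sz-2] << 2) + (v[1:sz-1] << 1) + v[2:]
        for i in range(1, sz-1):
            # Rule 30: neighborhoods 001, 010, 011 and 100 (values 1-4) give 1
            v[i] = 1 if (0 < test[i] < 5) else 0
    return v

v_fast = Rule30_code()
# IPython/Jupyter magic; the call above has already triggered compilation
%timeit -n 10 Rule30_code()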

On my machine, this runs about 10.5-11 times faster than the posted Numba code on the size=100000 example (producing the same result). That would make "optimized Numba" just as fast as "C++ Optimized -O2". Could someone add this option to the benchmark?
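The comparison with "C++ Optimized -O2" can be checked against the results table. The sketch below assumes the comment refers to the width-100000 column and that the reported speedup would carry over to the machine used for the benchmark; neither is confirmed by the post.

```python
# Values copied from the width-100000 column of the results table.
numba_time = 0.4058
cpp_opt_o2_time = 0.0365

# Speedup range reported in the comment above.
projected = {speedup: numba_time / speedup for speedup in (10.5, 11.0)}

for speedup, value in projected.items():
    print(f"{speedup}x faster -> {value:.4f} (C++ Optimized -O2: {cpp_opt_o2_time})")
```

Under those assumptions the projected time lands between roughly 0.037 and 0.039, close to the 0.0365 reported for "C++ Optimized -O2".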

2020-05-20T14:49:52-04:00

The naive C++ code is pretty bad. If you condense the else-if chain into a handful of conditions, say two or three, you can speed it up quite a bit. Additionally, the naive C++ allocates a ton of std::vectors with all those initializer lists; if you get rid of those and have next_val take three ints as parameters instead of a std::vector, you can get it to run even faster. On gcc with -O2, those two changes bring the naive C++ down to an average run time of about 100 ms.
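Both suggestions can be sketched in Python for brevity. The comment is about the C++ code, so this is an illustration rather than the commenter's implementation. The hypothetical `next_val` below takes three ints and replaces the condition chain with the identity Rule 30 = left XOR (center OR right), which the loop checks against the bits of the rule number.

```python
def next_val(left, center, right):
    """Rule 30 update from three ints instead of a container."""
    return left ^ (center | right)


# In Wolfram's numbering, bit k of the rule number is the output for the
# neighborhood whose bits (left, center, right) spell k in binary.
RULE = 30
for k in range(8):
    left, center, right = (k >> 2) & 1, (k >> 1) & 1, k & 1
    assert next_val(left, center, right) == (RULE >> k) & 1
```

Passing plain ints avoids building a temporary container for every cell, which is the allocation cost the comment points at.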

2020-08-13T17:32:55-04:00

Nice Work! Very tidy and informative!

2021-05-13T16:38:06-04:00

Why not compare Numba with C++ code compiled with -O3?

	#Written by David Butts
	#This code is an implementation of a Rule 30 Wolfram model written in Python.
	import numpy as np
	import time
	def Rule30_code():
	    Rule30 = np.zeros((1000,100000)) #initialize an array to run on (timesteps, width)
	    Rule30[0,50] = 1
	    for y in range(Rule30.shape[0]-1): #iterate through grid
	        for x in range(Rule30.shape[1]):
	            #update the next row's values according to neighbor & self value
	            right = x + 1
	            down = y + 1
	            left = x - 1 #for x = 0 this is -1, which indexes the last column
	            if right >= Rule30.shape[1]:
	                right = 0
	            if Rule30[y,right] == 1 and Rule30[y,left] == 0:
	                Rule30[down,x] = 1
	            elif Rule30[y,x] == 0 and Rule30[y,right] == 0 and Rule30[y,left] == 1:
	                Rule30[down,x] = 1
	            elif Rule30[y,x] == 1 and Rule30[y,right] == 0 and Rule30[y,left] == 0:
	                Rule30[down,x] = 1
	            else:
	                Rule30[down,x] = 0
	#average run time for 10 runs
	av_time = 0
	for run in range(10):
	    start = time.time()
	    Rule30_code()
	    end = time.time()
	    av_time += (end-start)
	print(av_time/10.0, 'seconds on average over 10 runs')

	#Written by David Butts
	#This code is an implementation of a Rule 30 Wolfram model written in Python.
	import numpy as np
	import time
	import numba as nb
	@nb.jit #numba the function
	def Rule30_code():
	    Rule30 = np.zeros((1000,100000)) #initialize an array to run on (timesteps, width)
	    Rule30[0,50] = 1
	    for y in range(Rule30.shape[0]-1): #iterate through grid
	        for x in range(Rule30.shape[1]):
	            #update the next row's values according to neighbor & self value
	            right = x + 1
	            down = y + 1
	            left = x - 1 #for x = 0 this is -1, which indexes the last column
	            if right >= Rule30.shape[1]:
	                right = 0
	            if Rule30[y,right] == 1 and Rule30[y,left] == 0:
	                Rule30[down,x] = 1
	            elif Rule30[y,x] == 0 and Rule30[y,right] == 0 and Rule30[y,left] == 1:
	                Rule30[down,x] = 1
	            elif Rule30[y,x] == 1 and Rule30[y,right] == 0 and Rule30[y,left] == 0:
	                Rule30[down,x] = 1
	            else:
	                Rule30[down,x] = 0
	#average run time for 10 runs (the first run includes JIT compilation)
	av_time = 0
	for run in range(10):
	    start = time.time()
	    Rule30_code()
	    end = time.time()
	    av_time += (end-start)
	print(av_time/10.0, 'seconds on average over 10 runs')

	//Written by David Butts
	#include<time.h>
	#include<chrono>
	#include<thread>
	#include<iomanip>
	#include<iostream>
	#include<vector>
	using std::cout; using std::endl; using std::vector;
	//Rule 30 lookup: neighborhoods 001, 010, 011 and 100 produce 1
	int next_val(const vector<int> &v){
	    if (v == vector<int>{0,0,0})
	        return 0;
	    else if (v == vector<int>{0,0,1})
	        return 1;
	    else if (v == vector<int>{0,1,0})
	        return 1;
	    else if (v == vector<int>{0,1,1})
	        return 1;
	    else if (v == vector<int>{1,0,0})
	        return 1;
	    else if (v == vector<int>{1,0,1})
	        return 0;
	    else if (v == vector<int>{1,1,0})
	        return 0;
	    else
	        return 0;
	}
	vector<int> one_iteration(const vector<int> &v){
	    vector<int> vec = v;
	    for(size_t i = 1; i < v.size() - 1; i++)
	        vec[i] = next_val({v[i-1],v[i],v[i+1]});
	    //the two edge cells wrap around
	    vec[0] = next_val({v[v.size()-1],v[0],v[1]});
	    vec[v.size()-1] = next_val({v[v.size() - 2],v[v.size() - 1],v[0]});
	    return vec;
	}
	int main() {
	    cout << std::fixed << std::setprecision(5);
	    int iter = 1000;
	    double av_time = 0; //accumulates fractional milliseconds
	    std::vector<int> v(100000,0);
	    v[50] = 1;
	    for(int run = 0; run < 10; run++){
	        auto t_start = std::chrono::high_resolution_clock::now();
	        for(int i=0; i<iter; ++i){
	            v = one_iteration(v);
	        }
	        auto t_end = std::chrono::high_resolution_clock::now();
	        auto duration = std::chrono::duration<double, std::milli>(t_end-t_start);
	        av_time += duration.count();
	    }
	    cout << "msec = " << av_time/10.0 << endl;
	}

	//Written by Bill Punch
	#include<iostream>
	using std::cout; using std::endl; using std::boolalpha;
	#include<chrono>
	#include<iomanip>
	#include<bitset>
	using std::bitset;

	int main() {
	    const size_t sz = 100000;
	    const int iter = 1000;
	    cout << boolalpha << std::fixed << std::setprecision(5);

	    int val;
	    long av_time = 0;
	    std::bitset<sz> v(0);
	    std::bitset<sz> tmp(0);
	    v[sz/2] = 1;
	    using time_unit = std::chrono::microseconds;
	    //using time_unit = std::chrono::milliseconds;
	    for(int run = 0; run < 10; run++){
	        auto start = std::chrono::steady_clock::now();
	        for(int i=0; i<iter; ++i){
	            //interior cells only, so j-1 and j+1 stay in range; the edge cells remain 0
	            for(size_t j = sz-2; j>=1; --j){
	                val = (int)v[j-1] << 2 | (int)v[j] << 1 | (int)v[j+1];
	                tmp[j] = (val >= 1 && val <= 4); //Rule 30: 001, 010, 011, 100 -> 1
	            }
	            v = tmp;
	        }
	        auto duration = std::chrono::duration_cast<time_unit>(std::chrono::steady_clock::now() - start);
	        av_time += duration.count();
	    }
	    cout << "usec = " << av_time/10.0 << endl; //time_unit is microseconds
	}

Code/Width	100	300	1000	3000	10000	30000	100000
Python	0.1330	0.3990	1.3565	4.2989	15.4906	44.7538	148.8903
Numba	0.0193	0.0185	0.0231	0.0324	0.0617	0.1400	0.4058
C++ Naive	0.0390	0.1120	0.3350	1.0170	3.3470	9.9960	33.3370
C++ Naive -O2	0.0100	0.0220	0.0590	0.1590	0.5180	1.5270	5.1140
C++ Optimized	0.0150	0.0330	0.0960	0.2720	0.8810	2.6120	8.7240
C++ Optimized -O2	0.0001	0.0003	0.0009	0.0023	0.0067	0.0149	0.0365
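To read the table as speedups, each entry can be divided into the Python entry of the same column. Here is a small sketch using the width-100000 column only; the numbers are copied from the table, and the dictionary names are illustrative.

```python
# Width-100000 column of the results table.
times_100000 = {
    "Python": 148.8903,
    "Numba": 0.4058,
    "C++ Naive": 33.3370,
    "C++ Naive -O2": 5.1140,
    "C++ Optimized": 8.7240,
    "C++ Optimized -O2": 0.0365,
}

baseline = times_100000["Python"]
# Speedup of each variant relative to the pure-Python run.
speedups = {name: baseline / t for name, t in times_100000.items()}

for name, factor in speedups.items():
    print(f"{name:<18} {factor:10.1f}x")
```

The same loop can be repeated per column to see how the ratios change with width.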

Numba Versus C++

High-Performance Python: Why?

Numba

Wolfram Models

Base Codes

Python Code

Numba Code

C++ Code

Optimized C++ Code

Results

Summary

Credits

Data

8 Replies to “Numba Versus C++”
