Unified code for different backends

We demonstrate here how Transonic can be used to accelerate a single code with different Python accelerators (so-called “backends”: currently Python, Pythran, Cython and Numba).

The code of the functions is taken from this Stack Overflow question.

As usual with Transonic, two modes are available: ahead-of-time compilation (with the boost decorator) and just-in-time compilation (with the jit decorator).

For both examples, we use some common code in a file util.py:

import numpy as np

from transonic.config import backend_default
from transonic.util import timeit


def check(functions, arr, columns):
    res0 = functions[0](arr, columns)
    for func in functions[1:]:
        assert np.allclose(res0, func(arr, columns))
    print("Checks passed: results are consistent")


def bench(functions, arr, columns):
    print(backend_default.capitalize())
    norm = None
    for func in functions:
        result = timeit("func(arr, columns)", globals=locals())
        if norm is None:
            # normalize by the first (high-level) implementation
            norm = result
        print(f"{func.__name__:20s} {result:.3e} s  (= {result / norm:5.2f} * norm)")
    print()
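The timing helper transonic.util.timeit accepts a code string and a globals mapping, much like the standard library's timeit. A rough stand-in using only the standard library (the repetition count here is our assumption, not something the original code specifies) looks like:

```python
import timeit

import numpy as np


def row_sum_reference(arr, columns):
    # sum, for each row, the entries in the selected columns
    return arr[:, columns].sum(axis=1)


arr = np.arange(16).reshape(4, 4)
columns = np.array([1, 3], dtype=np.int32)

# same calling pattern as in bench(): a code string timed
# against an explicit namespace
duration = timeit.timeit(
    "row_sum_reference(arr, columns)",
    globals={"row_sum_reference": row_sum_reference, "arr": arr, "columns": columns},
    number=100,
)
print(f"{'row_sum_reference':20s} {duration / 100:.3e} s")
```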

Ahead-of-time compilation

import numpy as np

from transonic import Array, boost, const

T_index = np.int32
# we use a type variable because it can be replaced by a fused type.
T = np.int64
A1d_i = Array[T_index, "1d"]
A1d = Array[T, "1d"]
A2d = Array[T, "2d"]
V1d_i = Array[T_index, "1d", "memview"]
V1d = Array[T, "1d", "memview"]
V2d = Array[T, "2d", "memview"]


@boost
def row_sum(arr: A2d, columns: A1d_i):
    return arr.T[columns].sum(0)


@boost(boundscheck=False, wraparound=False)
def row_sum_loops(arr: const(V2d), columns: const(V1d_i)):
    # locals type annotations are used only for Cython
    i: T_index
    j: T_index
    sum_: T
    # arr.dtype not supported for memoryview
    dtype = type(arr[0, 0])
    res: V1d = np.empty(arr.shape[0], dtype=dtype)
    for i in range(arr.shape[0]):
        sum_ = dtype(0)
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res


if __name__ == "__main__":

    from util import bench, check

    functions = [row_sum, row_sum_loops]
    arr = np.arange(1_000_000).reshape(1_000, 1_000)
    columns = np.arange(1, 1000, 2, dtype=T_index)

    check(functions, arr, columns)
    bench(functions, arr, columns)
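To make the two implementations concrete, here is a small self-contained check (illustration only; the pure-Python names below are ours, not part of the listing above). Both compute, for each row of arr, the sum of the entries in the selected columns: arr.T[columns] picks rows of the transpose, i.e. columns of arr, and summing over axis 0 yields one value per original row.

```python
import numpy as np

T_index = np.int32  # same index type as in the listing above


def row_sum_highlevel(arr, columns):
    # columns of arr are rows of arr.T, so this sums the selected
    # columns for each row in one vectorized expression
    return arr.T[columns].sum(0)


def row_sum_loops_py(arr, columns):
    # pure-Python transcription of row_sum_loops, for checking only
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res


arr = np.arange(12, dtype=np.int64).reshape(3, 4)
columns = np.array([0, 2], dtype=T_index)

# rows are [0,1,2,3], [4,5,6,7], [8,9,10,11]; columns 0 and 2
# give 0+2, 4+6 and 8+10
assert np.array_equal(row_sum_highlevel(arr, columns), [2, 10, 18])
assert np.array_equal(row_sum_loops_py(arr, columns), [2, 10, 18])
```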

To compile this file with different backends, one can run:

transonic -b python row_sum_boost.py
transonic -b cython row_sum_boost.py
transonic -b numba row_sum_boost.py
transonic -b pythran row_sum_boost.py -af "-march=native -DUSE_XSIMD"

To choose the backend at run time, we can set an environment variable, for example:

TRANSONIC_BACKEND="cython" python row_sum_boost.py

Then, on my PC (“gre”), I get:

Python
high level: 1.31e-03 s  (=  1.00 * norm)
low level:  1.04e-01 s  (= 79.27 * norm)

Cython
high level: 1.29e-03 s  (=  0.99 * norm)
low level:  4.10e-04 s  (=  0.31 * norm)

Numba
high level: 1.04e-03 s  (=  0.80 * norm)
low level:  2.69e-04 s  (=  0.21 * norm)

Pythran
high level: 7.68e-04 s  (=  0.59 * norm)
low level:  2.55e-04 s  (=  0.19 * norm)

The fastest solutions are in this case the Numba and Pythran backends for the implementation with explicit loops.

As usual, Pythran gives quite good results with the high-level implementation, but in this case, it is still more than twice as slow as the implementation with loops.

Cython does not accelerate high-level NumPy code, but gives good results for the implementation with loops.

Just-in-time compilation

With just-in-time compilation, type annotations for the arguments are not needed, and it does not really make sense to annotate all local variables. We thus remove all type annotations (which hurts the Cython backend, as the benchmark below shows).

import numpy as np

from transonic import jit


@jit(native=True, xsimd=True)
def row_sum(arr, columns):
    return arr.T[columns].sum(0)


@jit(native=True, xsimd=True)
def row_sum_loops(arr, columns):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res


if __name__ == "__main__":

    from util import bench, check

    from transonic import wait_for_all_extensions

    functions = [row_sum, row_sum_loops]
    arr = np.arange(1_000_000).reshape(1_000, 1_000)
    columns = np.arange(1, 1000, 2)

    check(functions, arr, columns)
    wait_for_all_extensions()
    check(functions, arr, columns)
    bench(functions, arr, columns)

The first call to check triggers the just-in-time compilations, which run in the background while the pure-Python versions are used; wait_for_all_extensions blocks until the compiled extensions are ready, so the second check and the benchmark use them. Running this script gives:

Python
high level: 1.20e-03 s  (=  1.00 * norm)
low level:  1.15e-01 s  (= 95.26 * norm)

Cython
high level: 1.25e-03 s  (=  1.04 * norm)
low level:  1.22e-02 s  (= 10.18 * norm)

Numba
high level: 1.12e-03 s  (=  0.93 * norm)
low level:  2.51e-04 s  (=  0.21 * norm)

Pythran
high level: 6.71e-04 s  (=  0.56 * norm)
low level:  2.41e-04 s  (=  0.20 * norm)