Thursday, October 29, 2015

PyPy 4.0.0 Released - A Jit with SIMD Vectorization and More

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the Python 2.7.10 compatible PyPy interpreter with a Just In Time compiler. We have improved warmup time and reduced the memory overhead of tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in RPython, and increased the functionality of numpy.
You can download the PyPy 4.0.0 release here:
We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering


Since the past release, PyPy 2.6.1, we decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.

Vectorization


Richard Plangger began work in March and continued over a Google Summer of Code to add a vectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run-time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release; since it is so new, it is off by default. To enable vectorization in the built-in JIT drivers (like the numpy ufuncs), add --jit vec=1; to enable all implemented vectorization, add --jit vec_all=1.
Benchmarks and a summary of this work appear here.
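As a rough illustration of what this means in practice, here is an invented little script (the file name and array sizes are made up, and actual speedups depend on your hardware) showing the kind of element-wise numpy code the new pass can vectorize when PyPy is started with --jit vec=1:

# vec_demo.py -- hypothetical example; run as:  pypy --jit vec=1 vec_demo.py
import numpy as np

a = np.arange(1000000, dtype=np.float64)
b = np.ones(1000000, dtype=np.float64)

# a single element-wise ufunc expression: the kind of loop the tracing
# JIT can now compile down to SSE instructions
c = a * b + 2.0
print c[:5]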

Internal Refactoring: Warmup Time Improvement and Reduced Memory Usage


Maciej Fijalkowski and Armin Rigo refactored internals of RPython that now allow PyPy to use guards in jitted code more efficiently. They also rewrote unrolling, leading to a warmup time improvement of 20% or so. The reduction in guards also means a reduction in memory use, again a saving of around 20%.

Numpy


Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via --jit vec=1 on the command line), and let us know what is missing for your code.
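If you want a quick way to exercise the reimplemented submodules, a short smoke test along these lines should do (illustrative only; exact function coverage may vary):

# smoke test for the cffi-based numpy submodules mentioned above
import numpy as np

m = np.random.rand(4, 4)             # numpy.random
v = np.random.rand(4)
x = np.linalg.solve(m, v)            # numpy.linalg, calling out to LAPACK
print np.allclose(np.dot(m, x), v)   # should print True
print np.fft.fft(v)                  # numpy.fft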

CFFI


While not applicable only to PyPy, cffi is arguably our most significant contribution to the Python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved management of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolating the const and restrict keywords from cdef to the generated C code.
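As one small example, ffi.memmove() gives you a memcpy/memmove-style primitive that works directly on cdata objects; a minimal sketch (assuming cffi 1.3 or later is installed):

import cffi
ffi = cffi.FFI()

src = ffi.new("int[4]", [1, 2, 3, 4])
dst = ffi.new("int[4]")

# copy four ints' worth of bytes from src to dst, like C memmove()
ffi.memmove(dst, src, 4 * ffi.sizeof("int"))
print [dst[i] for i in range(4)]     # -> [1, 2, 3, 4]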

What is PyPy?


PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, FreeBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Other Highlights (since 2.6.1 release two months ago)

  • Bug Fixes
    • Applied OpenBSD downstream fixes
    • Fix a crash on non-linux when running more than 20 threads
    • In cffi, ffi.new_handle() is more cpython compliant
    • Accept unicode in functions inside the _curses cffi backend exactly like cpython
    • Fix a segfault in itertools.islice()
    • Use gcrootfinder=shadowstack by default, asmgcc on linux only
    • Fix ndarray.copy() for upstream compatibility when copying non-contiguous arrays
    • Fix assumption that lltype.UniChar is unsigned
    • Fix a subtle bug with stacklets on shadowstack
    • Improve support for the cpython capi in cpyext (our capi compatibility layer). Fixing these issues inspired some thought about cpyext in general; stay tuned for more improvements
    • When loading dynamic libraries, in case of a certain loading error, retry loading the library assuming it is actually a linker script, like on Arch and Gentoo
    • Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
  • New features:
    • Add an optimization pass to vectorize loops using x86 SIMD intrinsics.
    • Support __stdcall on Windows in CFFI
    • Improve debug logging when using PYPYLOG=???
    • Deal with platforms with no RAND_egd() in OpenSSL
  • Numpy:
    • Add support for ndarray.ctypes
    • Fast path for mixing numpy scalars and floats
    • Add support for creating Fortran-ordered ndarrays
    • Fix casting failures in linalg (by extending ufunc casting)
    • Recognize and disallow (for now) pickling of ndarrays with objects embedded in them
  • Performance improvements and refactorings:
    • Reuse hashed keys across dictionaries and sets
    • Refactor JIT internals to improve warmup time by 20% or so at the cost of a minor regression in JIT speed
    • Recognize patterns of common sequences in the JIT backends and optimize them
    • Make the garbage collector more incremental over external_malloc() calls
    • Share guard resume data where possible which reduces memory usage
    • Fast path for zip(list, list)
    • Reduce the number of checks in the JIT for lst[a:]
    • Move the non-optimizable part of callbacks outside the JIT
    • Factor in field immutability when invalidating heap information
    • Unroll itertools.izip_longest() with two sequences
    • Minor optimizations after analyzing output from vmprof and trace logs
    • Remove many class attributes in rpython classes
    • Handle getfield_gc_pure* and getfield_gc_* uniformly in heap.py
    • Improve simple trace function performance by lazily calling fast2locals and locals2fast only if truly necessary
Please try it out and let us know what you think. We welcome feedback; we know you are using PyPy, so please tell us about it!
Cheers
The PyPy Team




Tuesday, October 20, 2015

Automatic SIMD vectorization support in PyPy

Hi everyone,

it took some time to catch up with the JIT refactorings merged this summer. But (drums) we are happy to announce that:

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

The goal of this project was to increase the speed of numerical applications in both the NumPyPy library and for arbitrary Python programs. In PyPy we have focused a lot on improvements in the 'typical python workload', which usually involves object and string manipulations, mostly for web development. We're hoping with this work that we'll continue improving the other very important Python use case - numerics.

What it can do!

It targets numerics only. It will not execute object manipulations faster, but it is capable of enhancing common vector and matrix operations.
The good news is that it is not tied specifically to the NumPy library or the PyPy virtual machine: any interpreter written in RPython is able to make use of the vectorization. For more information about that take a look here, or consult the documentation. For the time being it is not turned on by default, so be sure to enable it by specifying --jit vec=1 before running your program.

If your language (written in RPython) contains many array/matrix operations, you can easily integrate the optimization by adding the parameter 'vec=1' to the JitDriver.
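A minimal sketch of what that might look like for an interpreter author follows; the toy loop is invented, and the exact spelling of the flag on JitDriver (written as vectorize=True here) is an assumption based on the 'vec=1' parameter mentioned above, so check the RPython JIT documentation for the precise keyword:

# RPython sketch: a toy numeric loop whose JIT driver opts in to the
# new vectorizer; everything except JitDriver/jit_merge_point is illustrative
from rpython.rlib.jit import JitDriver

driver = JitDriver(greens=[], reds=['i', 'n', 'a', 'b', 'out'],
                   vectorize=True)   # assumed keyword, see note above

def add_arrays(a, b, out, n):
    i = 0
    while i < n:
        driver.jit_merge_point(i=i, n=n, a=a, b=b, out=out)
        out[i] = a[i] + b[i]
        i += 1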

NumPyPy Improvements

Let's take a look at the core functions of the NumPyPy library (*).
The following tests show the speedup of core functions commonly used in Python code interfacing with NumPy, on CPython with NumPy, on PyPy 2.6.1 released several weeks ago, and on the soon-to-be-released PyPy 4.0.0. timeit was used to measure the time needed to run the operation in the plot title on various vector (lower case) and square matrix (upper case) sizes displayed on the X axis. The Y axis shows the speedup compared to CPython 2.7.10, so higher is better.
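For reference, the measurements were of roughly this shape (the vector size and operation here are illustrative; the real benchmark code is linked in the footnote at the end of this post):

import timeit

setup = "import numpy as np; a = np.arange(16384, dtype=np.float64); b = a.copy()"

# time one of the element-wise operations shown in the plots
print min(timeit.repeat("a * b", setup=setup, repeat=3, number=10000))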


In comparison to PyPy 2.6.1, the speedup has greatly improved: the SIMD hardware support really cuts down the runtime of the vector and matrix operations. There is another operation we would like to highlight: the dot product.
It is a very common operation in numerics, and PyPy now (given a moderately sized matrix and vector) spends considerably less time in that operation. See for yourself:

These are nice improvements in the NumPyPy library, and we reached a competitive level using only SSE4.1.

Future work  


This is not the end of the road. The GSoC project showed that it is possible to implement this optimization in PyPy. There might be other improvements we can make to carry this further:
  • Check alignment at runtime to increase the memory throughput of the CPU
  • Support the AVX vector extension which (at least) doubles the size of the vector register
  • Handle each and every corner case in Python traces to enable it globally
  • Do not rely only on loading operations to trigger the analysis; there might be cases where combinations of floating point values could be computed in parallel
Cheers,
The PyPy Team

(*) The benchmark code can be found here; it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

Friday, October 16, 2015

PowerPC backend for the JIT

Hi all,

PyPy's JIT now supports the 64-bit PowerPC architecture! This is the third architecture supported, in addition to x86 (32 and 64) and ARM (32-bit only). More precisely, we support Linux running the big- and the little-endian variants of ppc64. Thanks to IBM for funding this work!

The new JIT backend has been merged into "default". You should be able to translate PPC versions as usual directly on the machines. For the foreseeable future, I will compile and distribute binary versions corresponding to the official releases (for Fedora), but of course I'd welcome it if someone else could step in and do it. Also, it is unclear yet if we will run a buildbot.

To check that the result performs well, I logged in to a ppc64le machine and ran the usual benchmark suite of PyPy (minus sqlitesynth: sqlite was not installed on that machine). I ran it twice, 12 hours apart, as an attempt to reduce the risk of other users suddenly using the machine. The machine was overall relatively quiet. Of course, this is scientifically not good enough; it is what I could come up with given the limited resources.

Here are the results, where the numbers are speed-up factors between the non-jit and the jit version of PyPy. The first column is x86-64, for reference. The second and third columns are the two ppc64le runs. All are Linux. A few benchmarks are not reported here because the runner doesn't execute them on non-jit (however, apart from sqlitesynth, they all worked).

    benchmark                  x86-64     ppc64le #1  ppc64le #2
    ai                        13.7342        16.1659     14.9091
    bm_chameleon               8.5944         8.5858        8.66
    bm_dulwich_log             5.1256         5.4368      5.5928
    bm_krakatau                5.5201         2.3915      2.3452
    bm_mako                    8.4802         6.8937      6.9335
    bm_mdp                     2.0315         1.7162      1.9131
    chaos                     56.9705        57.2608     56.2374
    sphinx
    crypto_pyaes               62.505         80.149     79.7801
    deltablue                  3.3403         5.1199      4.7872
    django                    28.9829         23.206       23.47
    eparse                     2.3164         2.6281       2.589
    fannkuch                   9.1242        15.1768     11.3906
    float                     13.8145        17.2582     17.2451
    genshi_text               16.4608        13.9398     13.7998
    genshi_xml                 8.2782         8.0879      9.2315
    go                         6.7458        11.8226     15.4183
    hexiom2                   24.3612        34.7991     33.4734
    html5lib                   5.4515         5.5186       5.365
    json_bench                28.8774        29.5022     28.8897
    meteor-contest             5.1518         5.6567      5.7514
    nbody_modified            20.6138        22.5466     21.3992
    pidigits                   1.0118          1.022      1.0829
    pyflate-fast               9.0684        10.0168     10.3119
    pypy_interp                3.3977         3.9307      3.8798
    raytrace-simple           69.0114       108.8875    127.1518
    richards                  94.1863       118.1257    102.1906
    rietveld                   3.2421         3.0126      3.1592
    scimark_fft
    scimark_lu
    scimark_montecarlo
    scimark_sor
    scimark_sparsematmul
    slowspitfire               2.8539         3.3924      3.5541
    spambayes                  5.0646         6.3446       6.237
    spectral-norm             41.9148        42.1831     43.2913
    spitfire                   3.8788         4.8214       4.701
    spitfire_cstringio          7.606         9.1809      9.1691
    sqlitesynth
    sympy_expand               2.9537         2.0705      1.9299
    sympy_integrate            4.3805         4.3467      4.7052
    sympy_str                  1.5431         1.6248      1.5825
    sympy_sum                  6.2519          6.096      5.6643
    telco                     61.2416        54.7187     55.1705
    trans2_annotate
    trans2_rtype
    trans2_backendopt
    trans2_database
    trans2_source
    twisted_iteration         55.5019        51.5127     63.0592
    twisted_names              8.2262         9.0062      10.306
    twisted_pb                12.1134         13.644     12.1177
    twisted_tcp                4.9778          1.934      5.4931

    GEOMETRIC MEAN               9.31           9.70       10.01

The last line reports the geometric mean of each column. We see that the goal was reached: PyPy's JIT actually improves performance by a factor of around 9.7 to 10 times on ppc64le. By comparison, it "only" improves performance by a factor of 9.3 on Intel x86-64. I don't know why, but I'd guess it mostly means that a non-jitted PyPy performs slightly better on Intel than it does on PowerPC.

Why is that? Actually, if we do the same comparison with an ARM column too, we also get higher numbers there than on Intel. When we discovered that a few years ago, we guessed that on ARM running the whole interpreter in PyPy takes up a lot of resources, e.g. of instruction cache, which the JIT's assembler doesn't need any more after the process is warmed up. And caches are much bigger on Intel. However, PowerPC is much closer to Intel, so this argument doesn't work for PowerPC. But there are other more subtle variants of it. Notably, Intel is doing crazy things about branch prediction, which likely helps a big interpreter---both the non-JITted PyPy and CPython, and both for the interpreter's main loop itself and for the numerous indirect branches that depend on the types of the objects. Maybe the PowerPC is as good as Intel, and so this argument doesn't work either. Another one would be: on PowerPC I did notice that gcc itself is not perfect at optimization. During development of this backend, I often looked at assembler produced by gcc, and there are a number of small inefficiencies there. All these are factors that slow down the non-JITted version of PyPy, but don't influence the speed of the assembler produced just-in-time.

Anyway, this is just guessing. The fact remains that PyPy can now be used on PowerPC machines. Have fun!

A bientôt,

Armin.

Monday, October 5, 2015

PyPy memory and warmup improvements (2) - Sharing of Guards

Hello everyone!

This is the second part of the series of improvements in warmup time and memory consumption in the PyPy JIT. This post covers recent work on sharing guard resume data that was recently merged to trunk. It will be a part of the next official PyPy release. To understand what it does, let's start with a loop for a simple example:

class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def call_method(self, z):
        return self.x + self.y + z

def f():
    s = 0
    for i in range(100000):
        a = A(i, 1 + i)
        s += a.call_method(i)

At the entrance of the loop, we have the following set of operations:

guard(i5 == 4)
guard(p3 is null)
p27 = p2.co_cellvars
p28 = p2.co_freevars
guard_class(p17, 4316866008, descr=<Guard0x104295e08>)
p30 = p17.w_seq
guard_nonnull(p30, descr=<Guard0x104295db0>)
i31 = p17.index
p32 = p30.strategy
guard_class(p32, 4317041344, descr=<Guard0x104295d58>)
p34 = p30.lstorage
i35 = p34.item0

The above operations get executed at the entrance of the loop, so each time we call f(). They ensure all the optimizations done below stay valid. Now, as long as nothing out of the ordinary happens, they only check that the world around us never changed. However, if e.g. someone puts new methods on class A, any of the above guards might fail. Despite the fact that it's a very unlikely case, PyPy needs to track how to recover from such a situation. Each of those points needs to keep the full state of the optimizations performed, so we can safely deoptimize and reenter the interpreter. This is vastly wasteful since most of those guards never fail, hence some sharing between guards has been performed.
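To make that "very unlikely case" concrete, here is an invented example of the kind of change that makes the guards above fail, forcing the JIT to deoptimize through the stored resume data and continue in the interpreter:

# illustrative only: mutating class A after the loop has been traced
# invalidates the guards protecting the optimized code
def patched_call_method(self, z):
    return self.x * self.y + z

A.call_method = patched_call_method   # class mutation -> guards fail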

We went a step further: when two guards are next to each other, or the operations in between them have no side effects, we can safely redo the operations, or, put simply, resume at the previous guard. That means every now and again we execute a few extra operations, but not storing the extra info saves quite a bit of time and memory. This is similar to the approach that LuaJIT takes, which is called sparse snapshots.

I've done some measurements on annotating & rtyping translation of pypy, which is a pretty memory hungry program that compiles a fair bit. I measured, respectively:

  • total time the translation step took (annotating or rtyping)
  • time it took for tracing (that excludes backend time for the total JIT time) at the end of rtyping.
  • memory the GC feels responsible for after the step. The real amount of memory consumed will always be larger; the coefficient of savings is in the 1.5-2x range

Here is the table:

branch    time annotation  time rtyping  memory annotation  memory rtyping  tracing time
default   317s             454s          707M               1349M           60s
sharing   302s             430s          595M               1070M           51s
win       4.8%             5.5%          19%                26%             17%

Obviously pypy translation is an extreme example - the vast majority of the code out there does not have that many lines of code to be jitted. However, it's at the very least a good win for us :-)

We will continue to improve the warmup performance and keep you posted!

Cheers,
fijal