September 24, 2021

What is time?

I think I am finally starting to understand what time is. If I understand correctly, time does not even exist. It is just a measure of change. Nobody cares about time other than some living creatures, and that is because we want to stay alive, to keep our integrity or to maintain our identity. What if we did not exist? Would time still be there? I think not. Change would be there, but not time. It looks like time is only a measurement. This is why it "flows" only forward.

September 19, 2021

Java 17 GA: Simple benchmark with Vector API (Second Incubator)

A few years ago I was hoping that Java would get a chance to become an important contender in the machine learning field again. I was hoping for interactivity, vectorization, and seamless integration with the external world (C/C++/Fortran). With the release of Java 17, the last two dreams are closer to reality than ever.

JEP 414: Vector API (Second Incubator) is something I had been waiting for, and I spent a few hours playing with it. Personally, I am really happy with the results, and I have a lot of motivation to migrate much of my linear algebra stuff to it. It looks really cool.

To make a long story short, I implemented a small set of microbenchmarks for two simple operations. The first operation is fillNaN; the second simply adds up the elements of a vector.

fillNaN

This is a common problem when working with large chunks of floating-point numbers: some of them are not numbers, for various reasons such as missing data or impossible operations. The pandas equivalent would be fillna. The whole idea is that, for a given vector, you want to replace all Double.NaN values with a given value to make arithmetic possible.

The following is a listing of the fillNaN benchmark.

As you can see, nothing fancy here. The `testFillNaNArrays` method iterates over the array and, if a value is Double.NaN, replaces it with the fill value. Pretty straightforward. How about the results? The vectorized version should be faster.
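As a rough sketch of the two approaches (illustrative code, not the benchmark itself; the class name `FillNaNSketch` and its method names are assumptions), the scalar loop and a Vector API version could look like this. The vectorized variant builds a mask of NaN lanes with `VectorOperators.IS_NAN` and blends the fill value into those lanes. On Java 17 it has to be compiled and run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, for illustration only.
final class FillNaNSketch {
    // Widest vector shape the current CPU supports for doubles.
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Plain loop: replace every NaN with the fill value, in place.
    static void fillNaNScalar(double[] a, double fill) {
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }

    // Vector API version: process SPECIES.length() elements per iteration.
    static void fillNaNVectorized(double[] a, double fill) {
        DoubleVector fillV = DoubleVector.broadcast(SPECIES, fill);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            // Mask is set on the lanes that hold NaN.
            VectorMask<Double> nan = v.test(VectorOperators.IS_NAN);
            // Take the fill value where the mask is set, keep v elsewhere.
            v.blend(fillV, nan).intoArray(a, i);
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }

    public static void main(String[] args) {
        double[] data = {1.0, Double.NaN, 3.0, Double.NaN, 5.0};
        fillNaNVectorized(data, 0.0);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```

The tail loop matters: `loopBound` rounds the length down to a multiple of the lane count, so the last few elements are handled one by one.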

Benchmark                                      Mode  Cnt   Score   Error   Units
VectorFillNaNBenchmark.testFillNaNArrays      thrpt   10   3.405 ± 0.149  ops/ms
VectorFillNaNBenchmark.testFillNaNVectorized  thrpt   10  41.930 ± 4.437  ops/ms
VectorFillNaNBenchmark.testFillNaNArrays       avgt   10   0.289 ± 0.002   ms/op
VectorFillNaNBenchmark.testFillNaNVectorized   avgt   10   0.023 ± 0.001   ms/op

But over 10 times faster? It is a really pleasant surprise, though not entirely unexpected. It is closely tied to how auto-vectorization works in Java. When it works, and for simple loops it does, the JIT applies intrinsic optimizations and sometimes even emits SIMD instructions. But a call such as Double.isNaN is not a simple thing, at least for auto-vectorization. In the new Vector API this operation is vectorized and we go fast, even though we use masks, which are not the lightest things in this new API. So we get a speedup of about 12x, which looks amazing.

sum and sumNaN

For the second microbenchmark, we have the same operation in two flavors. The first sum is computed over all elements, with no constraints. The second, which we call sumNaN, skips the potential non-numeric values and computes the sum of the remaining numbers. We do that to check two things. We want to know how explicit vectorization behaves compared to auto-vectorization (this is the normal sum, implemented as a simple loop that benefits from all possible optimizations). And we also want to see another operation with masks, compared with a plain scalar loop. Let's see the benchmark:
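As a rough sketch of the two flavors (illustrative code, not the benchmark itself; the class name `SumSketch` and its method names are assumptions), the plain sum accumulates whole vectors and reduces the lanes at the end, while sumNaN uses a masked add so that NaN lanes leave the accumulator untouched. On Java 17 it needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, for illustration only.
final class SumSketch {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Simple loop, a candidate for JIT auto-vectorization.
    static double sumScalar(double[] a) {
        double sum = 0;
        for (double x : a) {
            sum += x;
        }
        return sum;
    }

    // Scalar loop that skips NaN values.
    static double sumNaNScalar(double[] a) {
        double sum = 0;
        for (double x : a) {
            if (!Double.isNaN(x)) {
                sum += x;
            }
        }
        return sum;
    }

    // Lane-wise accumulation, reduced to a single double at the end.
    // Note: the summation order differs from the scalar loop, so results
    // can differ in the last bits for general floating-point input.
    static double sumVectorized(double[] a) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            acc = acc.add(DoubleVector.fromArray(SPECIES, a, i));
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Same as above, but only lanes that are not NaN are added.
    static double sumNaNVectorized(double[] a) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            VectorMask<Double> valid = v.test(VectorOperators.IS_NAN).not();
            // Masked add: lanes where the mask is unset keep acc's value.
            acc = acc.add(v, valid);
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            if (!Double.isNaN(a[i])) {
                sum += a[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, Double.NaN, 4.0, 5.0};
        System.out.println(sumNaNVectorized(data));
    }
}
```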


And, without further comment, the results:

Benchmark                                 Mode  Cnt   Score   Error   Units
VectorSumBenchmark.testSumArrays         thrpt   10   9.264 ± 1.591  ops/ms
VectorSumBenchmark.testSumVectorized     thrpt   10  12.222 ± 0.738  ops/ms
VectorSumBenchmark.testSumNanArrays      thrpt   10   2.692 ± 0.191  ops/ms
VectorSumBenchmark.testSumNanVectorized  thrpt   10  10.704 ± 0.428  ops/ms
VectorSumBenchmark.testSumArrays          avgt   10   0.120 ± 0.011   ms/op
VectorSumBenchmark.testSumVectorized      avgt   10   0.054 ± 0.011   ms/op
VectorSumBenchmark.testSumNanArrays       avgt   10   0.390 ± 0.018   ms/op
VectorSumBenchmark.testSumNanVectorized   avgt   10   0.068 ± 0.005   ms/op

We can see from these results that the unoptimized sumNaN code on arrays performs the worst by far. This is expected. What I personally did not expect was for the vectorized version with masks (sum nan vectorized) to perform better than the auto-vectorized version of the simple sum (sum arrays). Really good job. Hats off!

Conclusions

For the sake of reproducibility, I ran this on 'Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz/8cores/32GB RAM'. This processor can perform SIMD operations on 256-bit vectors, which means 4 double lanes at a time. A better one runs faster, of course. But the absolute numbers are not important here. What is important is that you can vectorize many things directly in Java, and that it is possible to implement complex things with masks, which, at least sometimes, is faster than auto-vectorization. This is a really amazing job.
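To check the vector width on another machine, a small sketch like the following can help (the class name `LaneCountSketch` and the `describe` helper are assumptions made for illustration). It reports the bit size and lane count of a species; for the fixed 256-bit double species that is 256 / 64 = 4 lanes, while the preferred species depends on the CPU. On Java 17 it needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, for illustration only.
final class LaneCountSketch {
    // Describes a species as "<bits> bits / <lanes> lanes of <bits per lane> bits".
    static String describe(VectorSpecies<Double> species) {
        return species.vectorBitSize() + " bits / "
                + species.length() + " lanes of "
                + species.elementSize() + " bits";
    }

    public static void main(String[] args) {
        // Fixed 256-bit shape, the width mentioned above.
        System.out.println("SPECIES_256:       " + describe(DoubleVector.SPECIES_256));
        // Whatever the JVM considers optimal on this machine.
        System.out.println("SPECIES_PREFERRED: " + describe(DoubleVector.SPECIES_PREFERRED));
    }
}
```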