
September 19, 2021

Java 17 GA: Simple benchmark with Vector API (Second Incubator)

A few years ago I was hoping that Java would get a chance to become, once again, an important contender in the machine learning field. I was hoping for interactivity, vectorization, and seamless integration with the external world (C/C++/Fortran). With the release of Java 17, the last two dreams are closer to reality than ever.

JEP 414: Vector API (Second Incubator) is something I have awaited for a long time, and I spent a few hours playing with it. Personally, I am really happy with the results, and I have a lot of motivation to migrate much of the linear algebra stuff onto it. It looks really cool.

To make a long story short, I implemented a small set of microbenchmarks for two simple operations. The first operation is fillNaN; for the second test, we simply add the elements of a vector.

fillNaN

This is a common problem when working with large chunks of floating point numbers: some of them are not numbers for various reasons: missing data, impossible operations, and so on. The pandas equivalent is fillna. The whole idea is that, for a given vector, you want to replace all Double.NaN values with a given value to make arithmetic possible.

The following is a listing of the fillNaN benchmark.
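The original listing was embedded in the post; a minimal sketch of the two implementations (the method bodies are my reconstruction of what such a benchmark measures, not the actual benchmark code) could look like this. It requires running with --add-modules jdk.incubator.vector:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class FillNaN {

    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // plain loop version: replace each NaN with the fill value
    static void fillNaNArrays(double[] a, double fill) {
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }

    // Vector API version: a mask marks the NaN lanes,
    // then blend() substitutes the fill value in those lanes
    static void fillNaNVectorized(double[] a, double fill) {
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            VectorMask<Double> isNaN = v.test(VectorOperators.IS_NAN);
            v.blend(fill, isNaN).intoArray(a, i);
        }
        // scalar tail for the remaining elements
        for (; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }
}
```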

As you can see, nothing fancy here. The `testFillNaNArrays` method iterates over the array and replaces each Double.NaN element with the given value. Pretty straightforward. How about the results? The vectorized version should be faster.

Benchmark                                      Mode  Cnt   Score   Error   Units
VectorFillNaNBenchmark.testFillNaNArrays      thrpt   10   3.405 ± 0.149  ops/ms
VectorFillNaNBenchmark.testFillNaNVectorized  thrpt   10  41.930 ± 4.437  ops/ms
VectorFillNaNBenchmark.testFillNaNArrays       avgt   10   0.289 ± 0.002   ms/op
VectorFillNaNBenchmark.testFillNaNVectorized   avgt   10   0.023 ± 0.001   ms/op

But over 10 times faster? It is a really pleasant surprise, but not a complete one. This is closely connected with auto-vectorization in Java. When it works, and for simple loops it works, it applies intrinsic optimizations, sometimes even SIMD-based ones. But a call like Double.isNaN is not a simple thing, at least not for auto-vectorization. In the new Vector API this operation is vectorized and we go fast, even though we use masks, which are not the lightest things in this new API. So we get a speedup of roughly 12x, which looks amazing.

sum and sumNaN

For the second microbenchmark, we have the same operation in two flavors. The first, sum, is computed over all elements, with no constraints. The second sum operation, which we call sumNaN, skips the potential non-numeric values and computes the sum of the remaining numbers. We do that to check two things. We want to know how explicit vectorization behaves compared to auto-vectorization (the plain sum is implemented as a simple loop that benefits from all possible optimizations). And we also want to see another operation with masks compared against auto-vectorized code. Let's see the benchmark:
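The benchmark listing was embedded in the original post; the two vectorized flavors could be sketched as follows (my reconstruction, not the actual benchmark code; it needs --add-modules jdk.incubator.vector). The sumNaN variant uses a masked add so that NaN lanes simply do not contribute to the accumulator:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorSum {

    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // unconstrained sum: accumulate lane-wise, reduce at the end
    static double sumVectorized(double[] a) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            acc = acc.add(DoubleVector.fromArray(SPECIES, a, i));
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // NaN-skipping sum: masked add leaves NaN lanes out of the accumulator
    static double sumNaNVectorized(double[] a) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            VectorMask<Double> notNaN = v.test(VectorOperators.IS_NAN).not();
            acc = acc.add(v, notNaN);
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            if (!Double.isNaN(a[i])) {
                sum += a[i];
            }
        }
        return sum;
    }
}
```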


And with no additional comments the results:

Benchmark                                 Mode  Cnt   Score   Error   Units
VectorSumBenchmark.testSumArrays         thrpt   10   9.264 ± 1.591  ops/ms
VectorSumBenchmark.testSumVectorized     thrpt   10  12.222 ± 0.738  ops/ms
VectorSumBenchmark.testSumNanArrays      thrpt   10   2.692 ± 0.191  ops/ms
VectorSumBenchmark.testSumNanVectorized  thrpt   10  10.704 ± 0.428  ops/ms
VectorSumBenchmark.testSumArrays          avgt   10   0.120 ± 0.011   ms/op
VectorSumBenchmark.testSumVectorized      avgt   10   0.054 ± 0.011   ms/op
VectorSumBenchmark.testSumNanArrays       avgt   10   0.390 ± 0.018   ms/op
VectorSumBenchmark.testSumNanVectorized   avgt   10   0.068 ± 0.005   ms/op

We can see from these results that the unoptimized sumNaN on arrays performs worst by a wide margin. This is expected. What I personally did not expect was that the vectorized version with masks (testSumNanVectorized) would perform better than the auto-vectorized version of the simple sum (testSumArrays). Really good job. Hats off!

Conclusions

For the sake of reproducibility, I ran the benchmarks on an 'Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz/8cores/32GB RAM'. This processor supports SIMD operations on 256-bit registers, which hold 4 doubles at a time. A better one runs faster, of course. But the absolute numbers are not important here. What is important is that you can vectorize many things in Java directly, and that it is possible to implement complex things with masks which, at least sometimes, are faster than auto-vectorization. This is really, really amazing work.
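On any given machine you can query the preferred vector shape directly; a tiny probe (my own snippet, run with --add-modules jdk.incubator.vector) reports the register width and how many doubles fit in it:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class LaneInfo {
    public static void main(String[] args) {
        // SPECIES_PREFERRED picks the widest shape the hardware supports well;
        // on an AVX2 machine like the one above this is 256 bits / 4 doubles
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        System.out.println(species.vectorBitSize() + " bits, "
                + species.length() + " doubles per vector");
    }
}
```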

October 16, 2013

Rapaio Random Forest - a benchmark - updated

While searching for various ideas and implementations of random forests I noticed the project called fast-random-forest. This is basically a faster and optimized version of Weka implementation. Take a look here: http://code.google.com/p/fast-random-forest/.

I won't discuss the implementation in detail. Some things seem to be faster than what Weka provides, others seem to offer no significant performance boost. Anyway, the author deserves kudos for the effort and for open sourcing the code.

However, what I found more interesting for my purpose is a small benchmark which compares some results of the Weka random forest and fast-random-forest. I checked my implementation against those two.

Here are some results:

         test         Weka       FastRF     RapaioRF
       anneal 0.9960000000 0.9930000000 0.8630289532
balance-scale 0.8160000000 0.8130000000 0.8208000000
     breast-w 0.9670000000 0.9680000000 0.9642346209
     credit-a 0.8620000000 0.8650000000 0.8724637681
     credit-g 0.7580000000 0.7510000000 0.7520000000
     diabetes 0.7630000000 0.7620000000 0.7565104167
  hypothyroid 0.9940000000 0.9950000000 0.9904559915
     kr-vs-kp 0.9920000000 0.9870000000 0.9881101377
     mushroom 1.0000000000 1.0000000000 1.0000000000
      segment 0.9810000000 0.9800000000 0.9766233766
         sick 0.9830000000 0.9840000000 0.9803817603
      soybean 0.9320000000 0.9250000000 0.9385065886
      vehicle 0.7470000000 0.7520000000 0.7517730496
        vowel 0.9810000000 0.9680000000 0.9585858586
       letter 0.9620000000 0.9620000000 0.9659500000
       splice 0.9300000000 0.9410000000 0.9639498433
waveform-5000 0.8480000000 0.8480000000 0.8494000000


Note: the comparison does not identify a best algorithm. The test is not intended to suggest a search for the best implementation. Such an attempt is a dead end for far too many reasons, and I won't spend time enumerating them here.

However, what I wanted to know is whether my implementation is comparable. And it is.

Later update: after fixing a bug, I ran the same test again.

         test         Weka       FastRF     RapaioRF
       anneal 0.9960000000 0.9930000000 0.9109131403
balance-scale 0.8160000000 0.8130000000 0.8208000000
     breast-w 0.9670000000 0.9680000000 0.9642346209
     credit-a 0.8620000000 0.8650000000 0.8753623188
     credit-g 0.7580000000 0.7510000000 0.7630000000
     diabetes 0.7630000000 0.7620000000 0.7643229167
  hypothyroid 0.9940000000 0.9950000000 0.9931071050
     kr-vs-kp 0.9920000000 0.9870000000 0.9871714643
     mushroom 1.0000000000 1.0000000000 1.0000000000
      segment 0.9810000000 0.9800000000 0.9753246753
         sick 0.9830000000 0.9840000000 0.9827677625
      soybean 0.9320000000 0.9250000000 0.9355783309
      vehicle 0.7470000000 0.7520000000 0.7565011820
        vowel 0.9810000000 0.9680000000 0.9626262626
       letter 0.9620000000 0.9620000000 0.9658500000
       splice 0.9300000000 0.9410000000 0.9620689655
waveform-5000 0.8480000000 0.8480000000 0.8522000000


This time the results of my implementation seem better. Again, I am not looking for the best implementation. There are, however, some differences which I think are worth highlighting here:
  • The Rapaio implementation follows the instructions proposed by Breiman regarding the impurity function. I use the Gini impurity function, and the splitting criterion is the Gini gain given by the split. Both Weka and fast-random-forest use InfoGain as the impurity function.
  • Gini variable importance, one of the variable importance measures proposed by Breiman, is computed based on the Gini index (again, both Weka and fast-random-forest use InfoGain for that purpose).
  • Rapaio and fast-random-forest use binary splits for growing trees. Weka does not.
  • Regarding missing values there are even more differences:
    • Impurity value computation: Weka incorporates missing values into the computed distributions, proportionally with the class frequencies. Fast-random-forest does the same, but uses only binary splits for nominal attributes. Rapaio ignores missing values for this computation.
    • Split data: Weka and Rapaio use all instances with missing values on the selected attribute; these instances are distributed to all children nodes (exactly 2 for Rapaio, possibly more than 2 for Weka) with weights adjusted by the proportions given by the class distribution. Fast-random-forest does not; it distributes such instances randomly among the children nodes, proportionally with the class distribution.
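For illustration, the Gini impurity and the Gini gain of a binary split mentioned above can be sketched like this (my own sketch of the standard formulas, not the actual Rapaio code):

```java
public class Gini {

    // Gini impurity of a class distribution given as counts:
    // 1 - sum over classes of p_k^2
    static double impurity(int[] counts) {
        double total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0;
        double sumSq = 0;
        for (int c : counts) {
            double p = c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    // Gini gain of a binary split: parent impurity minus the
    // size-weighted impurities of the two children
    static double gain(int[] parent, int[] left, int[] right) {
        double n = 0, nl = 0, nr = 0;
        for (int c : parent) n += c;
        for (int c : left) nl += c;
        for (int c : right) nr += c;
        return impurity(parent)
                - (nl / n) * impurity(left)
                - (nr / n) * impurity(right);
    }
}
```

A perfectly balanced two-class node has impurity 0.5, and a split that separates the classes completely yields the maximum gain of 0.5.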
The Rapaio implementation is probably the slowest of them all. That is explained partly by "premature optimization is the root of all evil" and partly by the fact that it makes the most expensive choices.