Aussie AI

Softmax Benchmarking Results

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Here are the results from my benchmark of 100,000 calls to each of these Softmax algorithms on a vector of 1024 elements, including both sequential and AVX parallel versions.

    Softmax benchmarks (N=1024, ITER=100000)
    Softmax basic: 13186 ticks (13.19 seconds)
    Softmax reciprocal: 12986 ticks (12.99 seconds)
    Softmax expf-first: 6977 ticks (6.98 seconds)
    Softmax expf-sum-fused: 6682 ticks (6.68 seconds)
    Softmax expf with AVX1: 1095 ticks (1.09 seconds)
    Softmax expf/sum AVX1: 910 ticks (0.91 seconds)
    Softmax fused expf/sum AVX1: 1095 ticks (1.09 seconds)
    Softmax fused expf/sum/mult AVX1: 831 ticks (0.83 seconds)
    Softmax expf with AVX2: 538 ticks (0.54 seconds)
    Softmax expf/sum AVX2: 306 ticks (0.31 seconds)
    Softmax fused expf/sum AVX2: 252 ticks (0.25 seconds)
    Softmax fused expf/sum/mult AVX2: 176 ticks (0.18 seconds)
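
For context, here is a minimal sketch of the kind of sequential code being compared (the function names and details are illustrative, not the exact benchmarked code, and numerical-stability refinements such as subtracting the maximum are omitted): the basic version calls expf twice per element and divides each output by the sum, whereas the fused version computes expf and the sum in a single pass and then multiplies by the precomputed reciprocal.

    #include <math.h>

    // Illustrative "basic" Softmax: expf is computed twice per element,
    // and each output uses a (slow) floating-point division.
    void softmax_basic_sketch(float v[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) sum += expf(v[i]);
        for (int i = 0; i < n; i++) v[i] = expf(v[i]) / sum;
    }

    // Illustrative fused Softmax: one pass stores expf in place and
    // accumulates the sum; the scale step multiplies by the reciprocal.
    void softmax_fused_sketch(float v[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);   // exponentiate in place ("expf-first")
            sum += v[i];         // fused summation
        }
        float recip = 1.0f / sum;
        for (int i = 0; i < n; i++) v[i] *= recip;  // multiply-by-scalar
    }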

Interestingly, fusing the expf and summation kernels was actually slower for AVX1, but faster for AVX2. Otherwise, our speedups were as expected, with the triple-AVX optimization of expf, summation, and multiply-by-scalar (the reciprocal) giving the best results by far. The triple-vectorized AVX2 version runs about 73 times faster than the naive sequential C++ version, taking only about 1.4% of its CPU time. And we haven't even tried AVX-512 optimization yet!
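
As a rough illustration of the vectorized idea, here is a hedged AVX sketch (assuming 256-bit registers holding 8 floats) in which the summation and multiply-by-reciprocal steps are vectorized with intrinsics; the expf pass is left scalar here, since a fully vectorized exponential needs a vendor math library or a polynomial approximation. The function name, the assumption that n is a multiple of 8, and the omitted tail-loop handling are all illustrative.

    #include <math.h>
    #include <immintrin.h>

    // Illustrative AVX sketch: scalar expf pass, then vectorized summation
    // and vectorized multiply-by-reciprocal, 8 floats at a time.
    // Assumes n is a multiple of 8 (a real version needs a scalar tail loop).
    void softmax_avx_sketch(float v[], int n)
    {
        for (int i = 0; i < n; i++) v[i] = expf(v[i]);  // scalar exponentials

        __m256 vsum = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            vsum = _mm256_add_ps(vsum, _mm256_loadu_ps(&v[i]));  // vertical adds
        }
        float partial[8];
        _mm256_storeu_ps(partial, vsum);   // spill the 8 lanes for a horizontal reduction
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) sum += partial[k];

        __m256 vrecip = _mm256_set1_ps(1.0f / sum);   // broadcast the reciprocal
        for (int i = 0; i < n; i += 8) {
            _mm256_storeu_ps(&v[i], _mm256_mul_ps(_mm256_loadu_ps(&v[i]), vrecip));
        }
    }

Fully fusing all three steps, as in the fastest benchmark above, would additionally vectorize expf itself and merge the loops to cut down the number of passes over memory.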

 
