6

Why are vectorization libraries faster than hand-written for loops? I mean, somewhere down the line the matrices/vectors must be multiplied (or whatever the operation is), so the elements must still be calculated and stored one by one (a for loop??).

Why is it then faster to use these libraries than just manually writing for loops all over the place?

I guess some low-level magic (OpenBLAS?) goes on there, but I just don't see it...

P.S. [Would have posted this on Stack Overflow, but I'd be ripped apart, so I'm pioneering new ways]
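As a concrete illustration of the premise (a hypothetical benchmark of my own; the array size is arbitrary and timings will vary by machine):

```python
import time
import numpy as np

# Hypothetical benchmark: dot product via an explicit Python loop
# vs. NumPy's vectorized np.dot, which calls into compiled BLAS code.
n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter()
loop_result = 0.0
for i in range(n):              # element-by-element, interpreted
    loop_result += a[i] * b[i]
loop_time = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = np.dot(a, b)       # one call into the optimized library
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```

On a typical machine the vectorized call wins by a large margin, even though both compute the same number.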

Comments
  • 0
    What language are you talking about?
  • 0
    Maybe threading, or GPU utilization?
  • 0
    @j4cobgarby A GPU is often slower than the CPU for small arrays, though, since the data has to be copied over first.
  • 0
    Probably SIMD instructions and cache-awareness.

    Also, if you write in C++, some of the libraries that accelerate your code, like LAPACK, are actually written in Fortran/C, which I think are inherently faster languages for numeric calculations.
  • 0
    @BigBoo Not a particular one. I just saw this as a general concept used everywhere from C++ to Python.

    @NickyBones This is what I think. The underlying code, written almost at the assembly level, gives it so much power.

    So pound for pound, are these libraries really only meant to simplify the code? Andrew Ng said in one of his videos that using for loops is actually SLOWER, so I'm just confused now.
  • 0
    @shinobiultra Check the implementation you use. Generally it's the same speed. Writing stuff in assembly does not make things faster than C++ just because; C++ compilers are usually well optimized.

    But using other people's implementations is nice for one reason, and it's apparent if you watch the CppCon talk about Facebook's strings.

    Other people's implementations can cover more use cases than you usually would on your own, unless you already have a really well-optimized structure.

    For example, it's beneficial for speed to keep things on the stack, but the stack is small. So one approach is to, for example, allocate strings below a certain size on the stack, and longer strings on the heap.

    This makes small strings faster, but it does not mean that all strings will be faster.

    There is no easy answer. It all comes down to the specific implementation of the specific library. There may be cases that perform better, but overall it's about the same.
  • 0
    @shinobiultra I know Andrew Ng from his DL work. In that area, there are cases where for loops can be replaced by matrix operations, and that indeed is a lot faster.

    LAPACK/BLAS/etc. are meant to save you the trouble of going super low-level and handling things like aliasing. By querying the hardware about its properties you can optimize your code per machine, and those libraries provide you with that abstraction layer.
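To make that concrete, here is a minimal sketch (matrix size and function names are my own) of a hand-written triple loop against NumPy's `@` operator, which dispatches to the underlying BLAS:

```python
import numpy as np

def matmul_loops(A, B):
    """Naive triple-loop matrix multiply, one element at a time."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.random((50, 50))
B = rng.random((50, 50))

C_loops = matmul_loops(A, B)
C_blas = A @ B  # one call; NumPy hands this to BLAS (e.g. OpenBLAS)

print(np.allclose(C_loops, C_blas))
```

Both produce the same matrix; the BLAS path is dramatically faster at realistic sizes because it is blocked for cache and uses SIMD.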
  • 1
    SIMD optimizations
  • 0
    @NickyBones Yeah, that's what I meant: matrix operations are actually faster than for loops, even though the result is the same.
  • 1
    @shinobiultra Matrix operations are like SIMD², since SIMD works on vectors. If your hardware supports matrix acceleration, like GPUs do, then matrix operations are the way to go.

    However, not everything can be done by converting to matrix operations, and not every machine has a powerful GPU (or a GPU at all), so libraries like LAPACK are still needed.
  • 1
    Modern CPUs are able to operate on registers wider than 64 bits. Let's say your CPU has 256-bit operations. If you wanted to add two vectors of four 64-bit integers each, done manually you'd need 4 add instructions. Vectorization libraries, however, abstract the low-level, non-portable instructions your CPU has for operating on 256 bits at once, which lets you add the two vectors in a single instruction. In a perfect world you'd get a 4x speedup in this particular case, although in practice it is rarely that high.
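The four-adds-collapse-into-one idea above, sketched in NumPy terms (whether the adds actually fuse into a single 256-bit instruction depends on your CPU and NumPy build):

```python
import numpy as np

# Two vectors of four 64-bit integers each.
x = np.array([1, 2, 3, 4], dtype=np.int64)
y = np.array([10, 20, 30, 40], dtype=np.int64)

# Manual version: four separate adds driven by the interpreter.
manual = np.empty(4, dtype=np.int64)
for i in range(4):
    manual[i] = x[i] + y[i]

# Vectorized version: one call into NumPy's compiled loop, which can
# use a single 256-bit SIMD add (e.g. AVX2) when the hardware has it.
vectorized = x + y

print(vectorized.tolist())  # [11, 22, 33, 44]
```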