How to use SIMD instructions in C++?



  • I have a simple structure.

    struct v2f 
    {
        float x;
        float y;
    };
    

    There are two functions that work with it.

    v2f addNormal(const v2f& v1, const v2f& v2) {
        v2f vec;
        vec.x = v1.x + v2.x;
        vec.y = v1.y + v2.y;
        return vec;
    }
    

    v2f addVectorized(const v2f& v1, const v2f& v2) {
        v2f vec;
        __m128 res = _mm_add_ps(
            _mm_loadu_ps((float*)(&v1)),
            _mm_loadu_ps((float*)(&v2))
        );
        memcpy(&vec, res.m128_f32, sizeof(float) * 2);
        return vec;
    }

    The vectorized function turns out to be twice as slow as the normal one.

    1. Is it legal to copy memory directly out of an __m128, or must I use _mm_storeu_ps()?
    2. Is it legal to load garbage into an __m128 (the structure is 8 bytes, so the load reads 8 extra bytes past it)?
    3. How do I load only two float values into an __m128? Can I load into an __m128 directly with memcpy?
    4. How do I get maximum speed? Is there any point in vectorizing for just two numbers?

    Here's the new code.

    v2f addVectorized(const v2f& v1, const v2f& v2) {
        v2f vec;
        const __m128 zero = _mm_setzero_ps();
        const __m128 res = _mm_add_ps(
            _mm_loadh_pi(zero, (__m64*)(&v1)),
            _mm_loadh_pi(zero, (__m64*)(&v2))
        );
        _mm_storeh_pi((__m64*)(&vec), res);
        return vec;
    }

    Now it is 1.5 times slower than the normal version.
    How else can it be optimized?

    Notably, when I add z and w fields to the structure and work with four values instead of two, the SSE code beats the normal code (by about a factor of two).



  • Questions 1-3 all come down to undefined behaviour. Your initial implementation also reads from unallocated memory. In other words, it may work or it may not, and there is no point in guessing: either you trust the compiler's optimizer, or you do it according to the documentation and the standard and hope that the implementation follows the documentation too. For example, you cannot load just two float values, but you can load four, two of which are unused. The two extra values have to come from somewhere in any case, and you must state where.

    It should be clarified that the above applies specifically to C++. In OpenCL, for example, the rules are different. Plain C is also more permissive, but even there you must not access unallocated memory.
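    The point that the two extra values must come from somewhere can be sketched as follows. This is only an illustration, not the answer's own method: the helper names load_v2f, store_v2f and addPadded are hypothetical, and the idea is to copy exactly 8 bytes into a zero-filled local buffer so that the 16-byte load never touches memory outside an object.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps
#include <cstring>      // std::memcpy

struct v2f {
    float x;
    float y;
};

// Hypothetical helper: builds a register from (x, y, 0, 0).
// Only sizeof(v2f) == 8 bytes are read from the source object.
inline __m128 load_v2f(const v2f& v) {
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};  // upper two lanes are explicitly zero
    std::memcpy(lanes, &v, sizeof(v2f));
    return _mm_loadu_ps(lanes);
}

// Hypothetical helper: writes the two low lanes back into a v2f.
inline v2f store_v2f(__m128 r) {
    float lanes[4];
    _mm_storeu_ps(lanes, r);  // documented way to get the lanes out of the register
    v2f out;
    std::memcpy(&out, lanes, sizeof(v2f));
    return out;
}

// Adds two v2f values through SSE without reading past either object.
inline v2f addPadded(const v2f& a, const v2f& b) {
    return store_v2f(_mm_add_ps(load_v2f(a), load_v2f(b)));
}
```

    Whether this is any faster than the plain addition is a separate question; the sketch only shows a way to make the loads and stores well defined.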

    Answer to 4: for your optimization attempts to work at all, the structure must be aligned. Check the documentation: most of the functions require 16-byte alignment. Without it, the program may crash, or the compiler will add alignment fix-ups at the cost of performance.
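    As a rough illustration of the alignment requirement, here is a small sketch. The names v2f_aligned, is_aligned16 and load4 are hypothetical and exist only for this example; it assumes an x86 target with SSE available.

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps (aligned), _mm_loadu_ps (unaligned)
#include <cstdint>      // std::uintptr_t

// alignas(16) raises both the alignment and, through padding, the size to 16 bytes.
struct alignas(16) v2f_aligned {
    float x;
    float y;
};

static_assert(alignof(v2f_aligned) == 16, "object starts on a 16-byte boundary");
static_assert(sizeof(v2f_aligned) == 16, "struct is padded to a full 16 bytes");

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: uses the aligned load only when the address allows it.
// Calling _mm_load_ps on a misaligned address may fault at run time.
inline __m128 load4(const float* p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

    In real code the branch would normally be avoided by guaranteeing alignment at the type level, which is what alignas(16) is for.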

    A more or less correct approach is:

    #include <xmmintrin.h>
    #include <cstring>
    struct alignas(16) v2f // ideally one should use __attribute__((packed, aligned(16)))
    {
        float x;
        float y;
    };
    v2f addNormal(const v2f& v1, const v2f& v2) {
        return {v1.x + v2.x, v1.y + v2.y};
    }

    v2f addVectorized(const v2f& v1, const v2f& v2) {
        // Local aligned copies, so the 16-byte loads read only memory we own.
        v2f v1_vec[2] = {v1, v1};
        v2f v2_vec[2] = {v2, v2};
        const __m128 v1_ = _mm_load_ps(&v1_vec[0].x);
        const __m128 v2_ = _mm_load_ps(&v2_vec[0].x);

        const __m128 res = _mm_add_ps(v1_, v2_);

        v2f res_v[2];
        _mm_store_ps(&res_v[0].x, res);
        return res_v[0];
    }

    Assembly generated by gcc: https://godbolt.org/z/7b86Kv5js 😞

    addNormal(v2f const&, v2f const&):
    movq xmm0, QWORD PTR [rdi]
    movq xmm1, QWORD PTR [rsi]
    addps xmm0, xmm1
    ret
    addVectorized(v2f const&, v2f const&):
    movaps xmm0, XMMWORD PTR [rsi]
    addps xmm0, XMMWORD PTR [rdi]
    ret

    The result is not equivalent, but at least there are no extra moves between registers, only a different way of initializing the register. Once again the compiler optimizes the plain version very well, and you can see that it uses vector instructions for two numbers on its own; that is normal.

    But even with the correct implementation, there is a problem:

    v2f test1() {
        v2f v1{1.1, 3.1};
        v2f v2{42.1, 1};
        return addNormal(v1, v2);
    }

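    v2f test2() {
        // Same inputs as test1; the constants .LC1 and .LC2 in the listing below encode 1.1, 3.1 and 42.1, 1.
        v2f v1{1.1, 3.1};
        v2f v2{42.1, 1};
        return addVectorized(v1, v2);
    }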

    test1():
    movq xmm0, QWORD PTR .LC0[rip]
    ret
    test2():
    mov QWORD PTR [rsp-32], 0
    mov rax, QWORD PTR .LC1[rip]
    mov QWORD PTR [rsp-16], 0
    mov QWORD PTR [rsp-40], rax
    mov rax, QWORD PTR .LC2[rip]
    movaps xmm0, XMMWORD PTR [rsp-40]
    mov QWORD PTR [rsp-24], rax
    addps xmm0, XMMWORD PTR [rsp-24]
    ret
    .LC0:
    .long 1110232268
    .long 1082340147
    .LC1:
    .long 1066192077
    .long 1078355558
    .LC2:
    .long 1109943910
    .long 1065353216

    The compiler fully optimized the plain C++ function, folding the whole call into a single constant load, but it was unable to do the same for the hand-vectorized function.
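    A related way to see why the plain version is so easy for the compiler: written as ordinary C++, the addition can even be made constexpr and checked at compile time. This is a sketch of a different mechanism from the optimizer's constant folding shown above, and the variable name sum is hypothetical.

```cpp
struct v2f {
    float x;
    float y;
};

// Plain C++ addition, marked constexpr so it may be evaluated at compile time.
constexpr v2f addNormal(const v2f& v1, const v2f& v2) {
    return {v1.x + v2.x, v1.y + v2.y};
}

// The whole call is a constant expression; no addition happens at run time.
constexpr v2f sum = addNormal({1.0f, 3.0f}, {3.0f, 1.0f});

static_assert(sum.x == 4.0f && sum.y == 4.0f, "evaluated at compile time");
```

    The values 1, 3, 3 and 1 are chosen only because they are exactly representable as float, which keeps the comparison in the static_assert exact.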


