The fastest integer type

  • Why is an integer type smaller than the machine word processed more slowly than a type whose size equals the machine word?

    And how fast will a type T be handled when int > T or int < T?

  • I wanted to play mythbuster... It didn't work out. The myth turned out to be true.

    I wrote a test program in C:

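    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
      unsigned char b,c,d;
      unsigned int a;
      struct timespec t,t1;
      int i;
      clock_gettime(CLOCK_REALTIME, &t);
      b=5; c=7; d=45;
      /* The timed loops are not shown here: per the assembly listing below,
         an outer loop over i runs an inner loop of 200 iterations over a
         that updates b, c and d. */
      clock_gettime(CLOCK_REALTIME, &t1);
      printf("Difference %ld %ld\n",t1.tv_sec-t.tv_sec,t1.tv_nsec-t.tv_nsec);
      printf("%u %d %d %d\n",a,b,c,d);
      return 0;
    }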

    Let's call this program A. The same program, but with the variables declared as unsigned int b,c,d, is program B. Compiled with gcc at optimization level -O3, the main loops produce the following assembly code, with my comments on the meaning of each operation:

          Program A                           Program B
    .L2:                                .L2:
        movl    $200, %eax   a=200          movl    $200, %eax     a=200
    .L3:                                .L3:
        addl    %r9d, %r8d   c+=d           addl    %r9d, %r8d     c+=d
        movl    %r9d, %ecx   X=d            xorl    %r8d, %ebx     b^=c
        movl    $5, %r9d     d=5            movl    %r9d, %r8d     c=d
        xorl    %r8d, %ebx   b^=c           movl    %ebx, %r9d     d=b
        subb    %bl, %cl     X-=b           subl    %ebx, %r8d     c-=b
        xorl    %ebx, %r9d   d^=b           xorl    $5, %r9d       d^=5
        subl    $1, %eax     a--            subl    $1, %eax       a--
        movl    %ecx, %r8d   c=X
        jne     .L3          for(a)         jne     .L3            for(a)
        subl    $1, %edx     i--            subl    $1, %edx       i--
        jne     .L2          for(i)         jne     .L2            for(i)
                     Execution results:

          Program A                           Program B
    Difference 22 -30098974             Difference 19 -697228347
    Difference 22 394100751             Difference 18 2860932
    Difference 22 -37226465             Difference 18 3254312
    Difference 22 67398660              Difference 18 43898871
    Difference 22 -29109230             Difference 18 449544279
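
    The two numbers in each result are the raw differences of tv_sec and tv_nsec, so a negative second number just means the elapsed time is slightly less than the seconds figure: "22 -30098974" is about 21.97 s. A minimal sketch of a helper that folds both fields into one value; the name elapsed_seconds is hypothetical and not part of the original program:

```c
#include <time.h>

/* Hypothetical helper: elapsed time between two timespec readings, in seconds.
   A negative nanosecond difference simply reduces the total. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    double sec  = (double)(end.tv_sec - start.tv_sec);
    double nsec = (double)(end.tv_nsec - start.tv_nsec);
    return sec + nsec / 1e9;
}
```

    In the test program this would be called as elapsed_seconds(t, t1) and printed with %f.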

    The first thing that catches the eye is that in version A the optimizer introduces a new "variable" X, or more precisely copies d into the ecx register, works with it there, and then moves the result back (c=X in the listing). This has to do with the variable d: it lives in the r9 register, whose low byte the optimizer evidently prefers not to operate on directly, while it still wants to use an operation of the size we asked for. That is why it uses ecx, whose low byte is available as cl.

    In fact, in this case the optimizer performs only the subtraction as a 1-byte operation. The addition and XOR are quietly carried out in full 4-byte registers, with no fear of side effects.
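
    The reason wide operations are safe here is that for addition, subtraction and XOR the low 8 bits of the result depend only on the low 8 bits of the operands. A small sketch that checks this exhaustively for all byte pairs; the function name and the junk constants in the upper bits are mine, not from the original program:

```c
#include <stdint.h>

/* Returns 1 if, for every pair of byte values, doing add, sub and xor in
   32-bit arithmetic and truncating gives the same byte as 8-bit arithmetic. */
int low_byte_matches_wide_ops(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        for (uint32_t y = 0; y < 256; y++) {
            /* Junk in the upper bits mimics a wide register whose high
               bytes are not part of the 8-bit variable. */
            uint32_t wx = x | 0xABCD0000u;
            uint32_t wy = y | 0x12340000u;
            uint8_t bx = (uint8_t)x, by = (uint8_t)y;

            if ((uint8_t)(wx + wy) != (uint8_t)(bx + by)) return 0;
            if ((uint8_t)(wx - wy) != (uint8_t)(bx - by)) return 0;
            if ((uint8_t)(wx ^ wy) != (uint8_t)(bx ^ by)) return 0;
        }
    }
    return 1;
}
```

    This only shows that the wide forms are a valid substitute; it says nothing about why the compiler kept the subtraction at byte width.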

    I decided to replace, in version A, the instruction subb %bl, %cl with the 4-byte subl %ebx, %ecx. A surprise was waiting for me: with no other changes, the program ran in 18 seconds instead of 22. The processor (in my case a Core i7) performs subtraction in 1-byte registers more slowly than in full 4-byte ones. I then tried the same with the add and xor operations and got the same results.

    After that I made a program C with the type unsigned long long. The compiler generated ordinary 64-bit operations, which showed the same 18 seconds. In 64-bit mode, operations on int and long are equally fast. Checking the type short int, i.e. 16 bits, gave 22 seconds, the same as 8-bit.
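
    To repeat the comparison across widths without maintaining several copies of the program, the loop can be generated per type with a macro. This is only a sketch: the loop body is reconstructed from the comments in the assembly listing (c+=d, b^=c, c=d-b, d=b^5), so it is an assumption rather than the original source, and the names DEFINE_KERNEL, kernel_fn, time_kernel and the outer parameter are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std modes */
#include <stdint.h>
#include <time.h>

/* Defines kernel_<name>(outer): runs the assumed loop body on type T,
   with 200 inner iterations per outer iteration as in the listing. */
#define DEFINE_KERNEL(name, T)                                   \
    uint64_t kernel_##name(unsigned long outer)                  \
    {                                                            \
        T b = 5, c = 7, d = 45;                                  \
        for (unsigned long i = 0; i < outer; i++) {              \
            for (unsigned int a = 200; a != 0; a--) {            \
                c += d;                                          \
                b ^= c;                                          \
                c = d - b;                                       \
                d = b ^ 5;                                       \
            }                                                    \
        }                                                        \
        return (uint64_t)(T)(b ^ c ^ d);                         \
    }

DEFINE_KERNEL(u8,  uint8_t)
DEFINE_KERNEL(u16, uint16_t)
DEFINE_KERNEL(u32, uint32_t)
DEFINE_KERNEL(u64, uint64_t)

typedef uint64_t (*kernel_fn)(unsigned long);

/* Runs one kernel, stores its checksum in *result so the work is not
   discarded, and returns the elapsed wall-clock time in seconds. */
double time_kernel(kernel_fn fn, unsigned long outer, uint64_t *result)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *result = fn(outer);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

    Calling time_kernel(kernel_u8, n, &r) and time_kernel(kernel_u32, n, &r) with the same n, built with gcc -O3, gives the kind of comparison described above. Whether the 22 versus 18 second gap reproduces depends on the CPU and on the code the compiler happens to generate.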

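    Summary: modern Intel processors, at least the Core i7, work with single-byte registers more slowly than with 4-byte ones; why is a question for Intel's processor designers. In addition, byte access to registers is restricted on x86: in 32-bit mode only 4 of the 8 general-purpose registers can be addressed as bytes, and in 64-bit mode the low bytes of the remaining registers need an extra REX prefix, so the optimizer has to generate more complex code to handle these types.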
