The fastest integer type
-
Why is an integer type smaller than the machine word processed more slowly than a type whose size equals the machine word?
And at what speed will types be processed for which int > T or int < T?
-
I wanted to play myth-buster... It didn't work out. The "myth" was confirmed.
I wrote a test program in C:
    #include <stdio.h>
    #include <time.h>

    int main()
    {
        unsigned char b, c, d;
        unsigned int a;
        struct timespec t, t1;
        int i;

        clock_gettime(CLOCK_REALTIME, &t);
        b = 5; c = 7; d = 45;
        for (i = 0; i < 50000000; i++) {
            for (a = 0; a < 200; a++) {
                b ^= c + d;
                c = d - b;
                d = b ^ 5;
            }
        }
        clock_gettime(CLOCK_REALTIME, &t1);
        /* prints the raw differences of seconds and nanoseconds; the latter can be negative */
        printf("Difference %ld %ld\n", t1.tv_sec - t.tv_sec, t1.tv_nsec - t.tv_nsec);
        printf("%d %d %d %d\n", a, b, c, d);
    }
Let's call it program "A". The same program, but with the variables declared as unsigned int b,c,d, is program "B". Compiling with gcc at optimization level -O3 gives the following assembly code for the main loops, with my comments on the meaning of each operation:

Program A                       Program B
.L2:                            .L2:
  movl $200, %eax  a=200          movl $200, %eax  a=200
.L3:                            .L3:
  addl %r9d, %r8d  c+=d           addl %r9d, %r8d  c+=d
  movl %r9d, %ecx  X=d            xorl %r8d, %ebx  b^=c
  movl $5, %r9d    d=5            movl %r9d, %r8d  c=d
  xorl %r8d, %ebx  b^=c           movl %ebx, %r9d  d=b
  subb %bl, %cl    X-=b           subl %ebx, %r8d  c-=b
  xorl %ebx, %r9d  d^=b           xorl $5, %r9d    d^=5
  subl $1, %eax    a--            subl $1, %eax    a--
  movl %ecx, %r8d  c=X
  jne .L3          for(a)         jne .L3          for(a)
  subl $1, %edx    i--            subl $1, %edx    i--
  jne .L2          for(i)         jne .L2          for(i)
Execution results (program A on the left, program B on the right):
Difference 22 -30098974         Difference 19 -697228347
Difference 22 394100751         Difference 18 2860932
Difference 22 -37226465         Difference 18 3254312
Difference 22 67398660          Difference 18 43898871
Difference 22 -29109230         Difference 18 449544279
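
The two numbers on each result line are the raw differences of the seconds and nanoseconds fields, which is why the second one is sometimes negative. A minimal sketch of folding them into a single elapsed time follows; the helper name elapsed_seconds is hypothetical and is not part of the original test program.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (an assumption, not from the original program):
   combines the seconds and nanoseconds differences into one value.
   A negative nanosecond difference simply reduces the total, which is
   the "borrow" from the seconds part. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    long long sec  = (long long)end.tv_sec  - (long long)start.tv_sec;
    long long nsec = (long long)end.tv_nsec - (long long)start.tv_nsec;

    /* end must not precede start */
    assert(sec > 0 || (sec == 0 && nsec >= 0));

    return (double)sec + (double)nsec * 1e-9;
}
```

Read this way, a line such as "Difference 22 -30098974" is about 21.97 seconds, and "Difference 19 -697228347" is about 18.30 seconds.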
The first thing that catches the eye is that in version "A" the optimizer introduces a new "variable" X, or more precisely, copies d into the register ecx, works with it there and then copies the result back. This is because for the variable d it uses the register R9, whose low byte cannot be operated on independently, and it prefers to use an operation of the size we asked for. That is why it uses ECX, whose low byte is available as CL. In fact, in this case the optimizer performs only the subtraction as a 1-byte operation. Addition and XOR it calmly performs in full, 4-byte registers without fear of side effects.
I decided to replace subb bl,cl in version "A" with the 4-byte subl ebx,ecx. And a surprise was waiting for me: without any other changes, the program began to run in 18 seconds instead of 22. The processor (in my case a Core i7) performs subtraction in 1-byte registers more slowly than in full, 4-byte ones. Then I tried the same thing with add and xor and got the same results.
After that I made a program "C" with the type unsigned long long; the compiler generated normal 64-bit operations, which showed the same 18 seconds. In 64-bit mode, operations on int and long are equally fast. A check of the type short int, i.e. 16 bits, gave 22 seconds, like the 8-bit one.
In summary: modern Intel processors, at least the Core i7, work with single-byte registers more slowly than with 4-byte ones; as for why, you would have to ask Intel. In addition, in the x86 architecture only 4 of the 16 general-purpose registers can be operated on as byte registers, so the optimizer has to produce more complex code to handle these types.
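
As an aside on why 4-byte instructions can stand in for 1-byte ones at all: addition, subtraction and XOR never let the high bits influence the low bits, so the low byte of a 32-bit result equals the 8-bit result. The sketch below checks this for the loop body of the test program. The names state8, state32, run8, run32 and check_low_bytes are hypothetical, and the sketch only compares values; it says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>

/* The three variables of the test program, in two widths. */
typedef struct { uint8_t  b, c, d; } state8;
typedef struct { uint32_t b, c, d; } state32;

/* Loop body with 8-bit variables, as in program "A":
   every assignment truncates the result to one byte. */
state8 run8(unsigned n)
{
    state8 s = { 5, 7, 45 };
    while (n--) {
        s.b ^= s.c + s.d;
        s.c = s.d - s.b;
        s.d = s.b ^ 5;
    }
    return s;
}

/* The same loop body with 32-bit variables, as in program "B". */
state32 run32(unsigned n)
{
    state32 s = { 5, 7, 45 };
    while (n--) {
        s.b ^= s.c + s.d;
        s.c = s.d - s.b;
        s.d = s.b ^ 5;
    }
    return s;
}

/* Asserts that the low byte of each 32-bit variable
   equals the corresponding 8-bit variable after n iterations. */
void check_low_bytes(unsigned n)
{
    state8  s8  = run8(n);
    state32 s32 = run32(n);
    assert(s8.b == (uint8_t)s32.b);
    assert(s8.c == (uint8_t)s32.c);
    assert(s8.d == (uint8_t)s32.d);
}
```

Calling check_low_bytes(1000), for example, should pass without tripping an assertion. The equivalence holds only as long as nothing reads the upper bytes, which is the case for these three operations but not for division, comparison or right shifts.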