Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2)
by compiling it into a*a
, but the call pow(a,6)
is not optimized and will actually call the library function pow
, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc
, will eliminate the library call for pow(a,6)
.)
What I am curious about is that when I replaced pow(a,6)
with a*a*a*a*a*a
using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4
", it uses 5 mulsd
instructions:
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
while if I write (a*a*a)*(a*a*a)
, it will produce
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm13, %xmm13
which reduces the number of multiply instructions to 3. icc
has similar behavior.
Why do compilers not recognize this optimization trick?