Optimisation

Posted: Fri Aug 05, 2005 10:17 am
by lincsimp
Hey
I know that premature optimisation is the root of all evil, etc., etc., but changing:

for (s32 i=0; i<16; ++i)
    M[i] = 0.0f;
M[0] = M[5] = M[10] = M[15] = 1;

in matrix4::makeidentity to:

M[0] = 1;
M[1] = 0;
M[2] = 0;
M[3] = 0;
M[4] = 0;
M[5] = 1;
M[6] = 0;
M[7] = 0;
M[8] = 0;
M[9] = 0;
M[10] = 1;
M[11] = 0;
M[12] = 0;
M[13] = 0;
M[14] = 0;
M[15] = 1;

reduces the time taken by the function by ~87% (30.39s to 4.09s for 1000000 calls)

which may help...
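
For anyone who wants to try this outside the engine, here is a minimal sketch of the two variants as free functions over a plain array of 16 floats. The names makeIdentityLoop, makeIdentityUnrolled and isIdentity are made up for illustration and are not part of Irrlicht; the real method works on the matrix4 member M.

```cpp
#include <cassert>

// Hypothetical stand-ins for matrix4::makeidentity, not Irrlicht's own code.

// Variant 1: zero all 16 elements in a loop, then overwrite the diagonal.
inline void makeIdentityLoop(float* M)
{
    assert(M != nullptr);
    for (int i = 0; i < 16; ++i)
        M[i] = 0.0f;
    M[0] = M[5] = M[10] = M[15] = 1.0f;
}

// Variant 2: write every element exactly once, with no loop.
inline void makeIdentityUnrolled(float* M)
{
    assert(M != nullptr);
    M[0]  = 1.0f; M[1]  = 0.0f; M[2]  = 0.0f; M[3]  = 0.0f;
    M[4]  = 0.0f; M[5]  = 1.0f; M[6]  = 0.0f; M[7]  = 0.0f;
    M[8]  = 0.0f; M[9]  = 0.0f; M[10] = 1.0f; M[11] = 0.0f;
    M[12] = 0.0f; M[13] = 0.0f; M[14] = 0.0f; M[15] = 1.0f;
}

// Helper to check a result: the diagonal sits at indices 0, 5, 10 and 15.
inline bool isIdentity(const float* M)
{
    for (int i = 0; i < 16; ++i)
        if (M[i] != ((i % 5 == 0) ? 1.0f : 0.0f))
            return false;
    return true;
}
```

Both functions should leave the array in the same state, so any speed difference comes down to what the compiler makes of each one.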

Posted: Sun Oct 01, 2006 12:45 am
by esaptonor
I know this post was a long time ago, but I have Irrlicht version 1.0 and it hasn't been changed, so is it worth changing? Or does that method not get called enough to merit changing?

Posted: Sun Oct 01, 2006 7:23 am
by hybrid
Simply tell your compiler to unroll loops and you'll get the same thing for free. No need to mess around with the code.

Posted: Sun Oct 01, 2006 2:26 pm
by RapchikProgrammer
I think his code should be at least a little better, because elements 0, 5, 10 and 15 are written twice! First to 0 and then to 1! And in my opinion even the smallest of changes here would be really useful, because I think the world, projection and view matrices are set to the identity matrix when rendering every frame!
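
If the double write is the worry, a loop can also write each element only once. This is just a rough sketch of that idea, with a made-up function name (makeIdentitySinglePass) and a plain float array standing in for the matrix4 data; whether it beats the other versions would need measuring.

```cpp
#include <cassert>

// Hypothetical helper, not Irrlicht code: each of the 16 elements is written
// exactly once. In a 4x4 matrix stored as 16 floats the diagonal sits at
// indices 0, 5, 10 and 15, which are the multiples of 5 below 16.
inline void makeIdentitySinglePass(float* M)
{
    assert(M != nullptr);
    for (int i = 0; i < 16; ++i)
        M[i] = (i % 5 == 0) ? 1.0f : 0.0f;
}
```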

Posted: Sun Oct 01, 2006 3:14 pm
by hybrid
The latest matrix4 code is much better: A memset clearing the complete data and just setting 4 floats. This should give much better improvement.
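
Roughly, the memset approach looks like the sketch below. It is not copied from the Irrlicht source; the function name makeIdentityMemset is hypothetical, and it assumes IEEE-754 floats, where an all-zero bit pattern is 0.0f.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch of the memset-based approach, not the actual matrix4 code.
// Assumes IEEE-754 floats, so zeroed bytes read back as 0.0f.
inline void makeIdentityMemset(float* M)
{
    assert(M != nullptr);
    std::memset(M, 0, 16 * sizeof(float)); // clear all 16 floats at once
    M[0] = M[5] = M[10] = M[15] = 1.0f;    // then set the 4 diagonal floats
}
```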

Posted: Sun Oct 01, 2006 6:34 pm
by CuteAlien
hybrid wrote: The latest matrix4 code is much better: A memset clearing the complete data and just setting 4 floats. This should give much better improvement.
Are you sure memset is really faster? I did some tests in gcc by profiling memset vs. loops. Memset was faster when not compiling optimized; with -O2 it had the same speed, and with -O3 the loop was faster (2x the speed of memset). So it seems to depend on how you compile the application. In games -O3 is often useful (not always), so memset seems to be worse here.

Posted: Sun Oct 01, 2006 7:59 pm
by hybrid
Did you use -mtune=i686 -msse2 (or whatever you have)? Memset uses intrinsics which are highly optimized to use the optimal machine code calls. However, for 16 floats (64 bytes) it might not always be better (because the compiler might produce equally good code for both).
Also, did you use other numbers in the memset call, and which additional overhead / cache strategy did you target?

Posted: Sun Oct 01, 2006 10:23 pm
by CuteAlien
I just compiled without optimization, with -O2 and with -O3. I've now tried optimizing for 686, but it doesn't seem to make a difference. Maybe the test ain't that good, as I'm just calling it a million times in a loop and measuring that time (I'm using the values after each step, so it won't optimize the loop away). This ain't such a good test, as the compiler could find it easier to optimize than it would with some more code around it.

But I did some more reading about it around the web, and well... seems like people can't agree which version is faster ;-). Guess I'll stay with loops until I get a faster result the other way :-)
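
In case someone wants to repeat the measurement, the kind of test described above can be sketched like this. It is not the exact code I ran; timeCalls is a made-up name, and the volatile sink is just one way of using a value after each call so the loop isn't optimized away.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical timing harness: calls fn(M) `calls` times on a local
// 16-float array and returns the elapsed time in seconds.
// fn is expected to write all 16 elements of M.
template <typename Fn>
double timeCalls(Fn fn, int calls, float& checksum)
{
    assert(calls > 0);
    float M[16];
    volatile float sink = 0.0f; // volatile: these accesses cannot be dropped
    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < calls; ++n) {
        fn(M);
        sink = sink + M[n % 16]; // use one value after each step
    }
    const auto stop = std::chrono::steady_clock::now();
    checksum = sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Pass each candidate function in turn with the same call count and build once per optimization level (none, -O2, -O3), since the ranking changed between those settings in my runs. As said above, a tight loop like this may still be easier for the compiler to optimize than real code, so treat the numbers with care.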