Hey
I know that early optimisation is the root of all evil, etc., etc., but changing:
for (s32 i=0; i<16; ++i)
    M[i] = 0.0f;
M[0] = M[5] = M[10] = M[15] = 1;
in matrix4::makeIdentity to:
M[0] = 1;
M[1] = 0;
M[2] = 0;
M[3] = 0;
M[4] = 0;
M[5] = 1;
M[6] = 0;
M[7] = 0;
M[8] = 0;
M[9] = 0;
M[10] = 1;
M[11] = 0;
M[12] = 0;
M[13] = 0;
M[14] = 0;
M[15] = 1;
reduces the time taken by the function by ~85% (30.39s to 4.09s / 1000000),
which may help...
Optimisation
-
- Posts: 279
- Joined: Fri Dec 24, 2004 6:37 pm
I think his code should be at least a little better, because the values at 0, 5, 10 and 15 are otherwise written twice: first to 0 and then to 1! And in my opinion even the smallest of changes here would be really useful, because I think the world, projection and view matrices are set to the identity matrix when rendering every frame!
hybrid wrote: The latest matrix4 code is much better: a memset clearing the complete data and then just setting 4 floats. This should give a much better improvement.
Are you sure memset is really faster? I did some tests in gcc, profiling memset vs. loops. Memset was faster when compiling without optimization, with -O2 both had the same speed, and with -O3 the loop was faster (2x the speed of memset). So it seems to depend on how you compile the application. In games -O3 is often useful (not always), so memset seems to be worse here.
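For reference, the memset variant described in the quote above might look roughly like the following. This is only a minimal sketch: `Matrix4Sketch` is a made-up stand-in, not the actual Irrlicht matrix4 class, and it assumes M is a plain array of 16 floats on a platform where all-zero bytes represent 0.0f (true for IEEE-754).

```cpp
#include <cstring>

// Hypothetical stand-in for matrix4; not the real Irrlicht class.
struct Matrix4Sketch {
    float M[16];

    void makeIdentity() {
        // Clear all 16 floats in one call; all-zero bytes are 0.0f
        // on IEEE-754 platforms (an assumption of this sketch).
        std::memset(M, 0, sizeof(M));
        // Then write only the four diagonal elements.
        M[0] = M[5] = M[10] = M[15] = 1.0f;
    }
};
```

Like the loop version, this still writes the four diagonal elements twice, so whether it beats the loop or the fully unrolled assignments probably depends on the compiler and flags, as discussed in this thread.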
-
- Admin
- Posts: 14143
- Joined: Wed Apr 19, 2006 9:20 pm
- Location: Oldenburg(Oldb), Germany
Did you use -mtune=i686 -msse2 (or whatever you have)? Memset uses intrinsics which are highly optimized to use the optimal machine instructions. However, for just 16 floats (64 bytes) it might not always be better (because the compiler might produce optimal code for both versions).
Also did you use other numbers in the memset call - and which additional overhead / cache strategy did you target?
I just compiled without optimization, with -O2 and with -O3. I have now tried optimizing for 686, but it doesn't seem to make a difference. Maybe the test isn't that good, as I'm just calling it a million times in a loop and measuring that time (I'm using the values after each step, so it won't optimize the loop away). This isn't such a good test, as it could be easier for the compiler to optimize than it would be with some more code around it.
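A test of the kind described above could be sketched roughly as follows. All names here (`identityLoop`, `identityMemset`, `runBench`, `BenchResult`) are hypothetical and not taken from the original test, and as noted, a tight loop like this may be easier for the compiler to optimize than real code, so the timings should be read with caution.

```cpp
#include <chrono>
#include <cstring>

// Hypothetical stand-ins for the two variants under test.
inline void identityLoop(float* M) {
    for (int i = 0; i < 16; ++i)
        M[i] = 0.0f;
    M[0] = M[5] = M[10] = M[15] = 1.0f;
}

inline void identityMemset(float* M) {
    std::memset(M, 0, 16 * sizeof(float));
    M[0] = M[5] = M[10] = M[15] = 1.0f;
}

struct BenchResult {
    double seconds;  // wall-clock time for all iterations
    float checksum;  // sum of values read back after each call
};

// Calls fn repeatedly and reads the matrix after every call, so the
// result of each call is actually used.
template <typename Fn>
BenchResult runBench(Fn fn, int iterations) {
    float M[16];
    float checksum = 0.0f;
    auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iterations; ++n) {
        M[1] = static_cast<float>(n);  // dirty one element before each call
        fn(M);
        checksum += M[0] + M[1];       // use the values after each step
    }
    auto stop = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(stop - start).count(), checksum };
}
```

Calling `runBench(identityLoop, 1000000)` and `runBench(identityMemset, 1000000)` under each of -O0, -O2 and -O3 and comparing the `seconds` fields would reproduce the shape of the comparison; the checksum should equal the iteration count in both cases.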
But I did some more reading about it around the web, and well... it seems like people can't agree on which version is faster ;-). Guess I'll stay with loops until I get a faster result the other way :-)