About SIMD math (with profiling code)

Discuss anything related to the Irrlicht Engine, or read announcements about any significant features or usage changes.
hendu
Posts: 2600
Joined: Sat Dec 18, 2010 12:53 pm

Re: About SIMD math (with profiling code)

Post by hendu »

Yet the X axis is stored as M[0], M[1], M[2]...
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany
Contact:

Re: About SIMD math (with profiling code)

Post by CuteAlien »

@devsh: 12, 13, 14, 15 is the last row in a row-major matrix. That's what row-major means.
As the documentation says, the matrix is interpreted the same way as in D3D. OpenGL puts the translation in a column (because some non-coding math guys prefer that) instead of in a row (as D3D does), which is why you might think it always has to be a column - but there's no reason for that; both are fine. And since GL then reads the memory as a column-major matrix, it ends up with the translation elements at 12, 13, 14, 15 in memory, just as in D3D.
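A minimal sketch of that layout (plain C++, not Irrlicht code - just the flat 16-float array both APIs consume):

Code: Select all

// Read row-major (D3D/Irrlicht style): the translation sits in the last row.
// Read column-major (GL style): the same bytes put it in the last column.
// Either way tx, ty, tz live at indices 12, 13, 14.
void buildTranslation(float* M, float tx, float ty, float tz)
{
    const float identity[16] = {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1
    };
    for (int i = 0; i < 16; ++i)
        M[i] = identity[i];
    M[12] = tx;
    M[13] = ty;
    M[14] = tz;
}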
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
mongoose7
Posts: 1227
Joined: Wed Apr 06, 2011 12:13 pm

Re: About SIMD math (with profiling code)

Post by mongoose7 »

Nothing you say makes sense. Let me say again, Irrlicht matrices are actually the transposes of the true matrices. Therefore, the translation vector is not in the fourth column, it is in the fourth row.
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

Actually the memory representation is independent of such stuff (devsh explained it well).
In math the translation is the last column, but you can represent the matrix with the translation either in M[3], M[7], M[11] or in M[12], M[13], M[14]. In both cases you are referring to a column, as long as you stay consistent across all the code.

Now, in D3D the translation is a row. So if Irrlicht takes that as its assumption (12, 13, 14), then from a D3D point of view Irrlicht is column-major. That can incidentally be thought of as a GL matrix with the translation in a column and row-major ordering - they are the same thing. However, that changes once matrix multiplication comes into play; from that you can tell whether the matrix class is meant to be D3D-ish or GL-ish (do you post-multiply a vector, or pre-multiply it, to get it transformed?).

Memory representation is independent of the math formalism. (Speculation: D3D put the translation in a row because 1) the C++ * operator happens to associate in the wrong direction, and 2) to break code and confuse users so they stick with D3D more easily.)
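To make the pre-/post-multiplication point concrete, here is a plain C++ sketch of my own (no engine code): the same 16 floats give the same transformed point whether you treat them as a row-major matrix with a row vector on the left, or as a column-major matrix with a column vector on the right.

Code: Select all

#include <cstdio>

// Same 16 floats either way; translation stored at indices 12, 13, 14.
static const float M[16] = {
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    5, 6, 7, 1
};

int main()
{
    const float v[4] = { 1, 2, 3, 1 };
    float out[4];

    // D3D-ish reading: row vector * row-major matrix, out[j] = sum_i v[i] * M[i*4 + j]
    for (int j = 0; j < 4; ++j) {
        out[j] = 0;
        for (int i = 0; i < 4; ++i)
            out[j] += v[i] * M[i*4 + j];
    }
    std::printf("row-vector convention:    %g %g %g\n", out[0], out[1], out[2]);

    // GL-ish reading of the same memory: column-major matrix * column vector,
    // out[i] = sum_j M[j*4 + i] * v[j] - identical arithmetic, identical result.
    for (int i = 0; i < 4; ++i) {
        out[i] = 0;
        for (int j = 0; j < 4; ++j)
            out[i] += M[j*4 + i] * v[j];
    }
    std::printf("column-vector convention: %g %g %g\n", out[0], out[1], out[2]);
}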
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Matrix multiplication is associative, so it doesn't matter whether you multiply matrices as A(BC) or (AB)C - you get the same answer either way - so C++ operator associativity is not an issue.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Anyway

TRY MY SIMD MATRIX AND DO SOME BENCHMARKS OF DEATH:
http://irrlicht.sourceforge.net/forum/v ... 04#p293604

I recommend trying my matrix inverse function (which is untested, so it may not work at all); it should be 10x faster than Irrlicht's.
thanhle
Posts: 325
Joined: Wed Jun 12, 2013 8:09 am

Re: About SIMD math (with profiling code)

Post by thanhle »

How do you use your SIMD Matrix operation classes?
Is the expected format similar to Irrlicht?
Row or Column major?
Do we need to convert Irrlicht vectors to the SIMD vector format, or matrix4 to a SIMD matrix, to use it?
Maybe you could put out one or more examples of how to use it.

Regards,
thanh
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

The matrix is a proper maths matrix; it is made of 4 SIMD vectors, each representing a row.

Yes, you need to convert between the Irrlicht vectors/matrices and the SIMD ones; there is no implicit conversion (for the sake of speed, and because of the bugs that could arise from the different formats).
There are some conversion functions for vectorSIMDf, but none for matrixSIMD4 so far.

The usage examples are in the other thread.
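Roughly what such a layout looks like in practice (a sketch of my own; the struct and function names below are not the actual vectorSIMDf/matrixSIMD4 API):

Code: Select all

#include <xmmintrin.h>   // SSE intrinsics

// Hypothetical layout: four SSE registers, one per matrix row.
struct MatrixSIMD4Sketch
{
    __m128 rows[4];
};

// Hypothetical conversion of three plain floats (e.g. an Irrlicht vector3df's
// X, Y, Z) into one SSE register, padding W with 1.
inline __m128 toSIMD(float x, float y, float z)
{
    return _mm_set_ps(1.f, z, y, x);   // _mm_set_ps takes its arguments high lane first
}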
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Using my properly implemented class (I was only pitting actual class instances of matrix4 and matrixSIMD4 against each other):

-O0
RUNTIME of SSE matrix mul 5054530 microseconds
RUNTIME of irrMatrix mul 6241177 microseconds
-O1
RUNTIME of SSE matrix mul 375871 microseconds
RUNTIME of irrMatrix mul 562569 microseconds
-O2
RUNTIME of SSE matrix mul 288762 microseconds
RUNTIME of irrMatrix mul 548489 microseconds
-O3
RUNTIME of SSE matrix mul 301609 microseconds
RUNTIME of irrMatrix mul 574059 microseconds
-O4
RUNTIME of SSE matrix mul 296882 microseconds
RUNTIME of irrMatrix mul 538160 microseconds

As we can see we have a clear winner here (SIMD intrinsics)

The funny thing is that the optimizations top out at -O2 (I guess there is nothing more the compiler can do with such simple code).


If we look at -O0, the perf gain is only around 10-15%, but from -O2 onwards we start hitting a 2x speedup.
The reason is that the compiler finally starts inlining all the functions and sees a lot of

Code: Select all

 
...
xmm0 = _mm_load_ps(memory);           // load the operand
xmm0 = _mm_someop_ps(xmm0, xmm1);     // placeholder for whatever SSE op runs here
_mm_store_ps(memory, xmm0);           // store the result back...
xmm0 = _mm_load_ps(memory);           // ...only to load the same value again
...
 
and eliminates the redundant loads and stores
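For example (my own illustration, not code from the class), once everything is inlined the chain effectively becomes register-only work with a single load and a single store:

Code: Select all

#include <xmmintrin.h>

// What the optimizer effectively produces: the intermediate value stays in a
// register, so the store/load pair in the middle of the naive sequence is gone.
void fusedOps(float* memory, __m128 xmm1, __m128 xmm2)
{
    __m128 xmm0 = _mm_load_ps(memory);
    xmm0 = _mm_add_ps(xmm0, xmm1);   // stand-in for the first operation
    xmm0 = _mm_mul_ps(xmm0, xmm2);   // stand-in for the next operation
    _mm_store_ps(memory, xmm0);      // one store at the end
}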
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

devsh wrote:matrix multiplication is associative, so it doesnt matter if you multiply matrices as A(BC) or (AB)C, you still get the same answer, so C++ operator associativity is not an issue
Yes, but it does matter if you have vectors on one side of the multiplication chain, and anyway in general (AB) is different from (BA). Many implementations would also swap pairs of operands, effectively turning A(BC) into A(CB) and then into (CB)A, which is wrong too. In general the "wrong" associativity of C++ and the particular implementation chosen force you to write the matrices in the wrong order compared to maths, and you will get different results.
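A tiny numeric check of that distinction (plain 2x2 scalar code of my own, nothing engine-specific): regrouping the product does not change the result, but swapping the factors does.

Code: Select all

#include <cstdio>

// 2x2 matrices, stored row-major as {a b; c d}.
struct M2 { float a, b, c, d; };

static M2 mul(const M2& x, const M2& y)
{
    return { x.a*y.a + x.b*y.c, x.a*y.b + x.b*y.d,
             x.c*y.a + x.d*y.c, x.c*y.b + x.d*y.d };
}

static void print(const char* name, const M2& m)
{
    std::printf("%s = [%g %g; %g %g]\n", name, m.a, m.b, m.c, m.d);
}

int main()
{
    M2 A{1, 2, 3, 4}, B{0, 1, 1, 0}, C{2, 0, 0, 2};
    print("(AB)C", mul(mul(A, B), C));   // associativity: same result as the next line
    print("A(BC)", mul(A, mul(B, C)));
    print("AB   ", mul(A, B));           // commutativity does NOT hold:
    print("BA   ", mul(B, A));           // AB and BA differ
}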


@devsh. going to test your code XD
Last edited by REDDemon on Wed Jun 10, 2015 9:48 am, edited 1 time in total.
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

-O2 -msse3

Code: Select all

 
RUNTIME of devsh SSE mat mul    1270000 microseconds
RUNTIME of SSE matrix mul       3384000 microseconds
RUNTIME of regularMatrix mul    3720000 microseconds
RUNTIME of irrMatrix mul        3542000 microseconds
 
-O3 -msse3

Code: Select all

 
RUNTIME of devsh SSE mat mul    1301000 microseconds
RUNTIME of SSE matrix mul       3126000 microseconds
RUNTIME of regularMatrix mul    3593000 microseconds
RUNTIME of irrMatrix mul        3532000 microseconds
 
very nice

You can run the benchmark too here:
https://github.com/Darelbi/PublicProfil ... iplication
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
RdR
Competition winner
Posts: 273
Joined: Tue Mar 29, 2011 2:58 pm
Contact:

Re: About SIMD math (with profiling code)

Post by RdR »

Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

RdR wrote:Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
Thanks! that's possible XD
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
RdR
Competition winner
Posts: 273
Joined: Tue Mar 29, 2011 2:58 pm
Contact:

Re: About SIMD math (with profiling code)

Post by RdR »

REDDemon wrote:
RdR wrote:Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
Thanks! that's possible XD
That would be nice.
Have to say I have not done much research on SIMD yet, but I would like to implement this in the future (or use devsh's code if it's available).
But how do you handle CPUs that don't support SIMD?
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

Good question. It is mostly a maintenance issue.

You need two branches of the same code (you can do preprocessor trickery, or just abuse the build system to include the correct file): one with regular C++ and the other with SIMD instructions.

You also have to build two different binaries (pretty easy as long as you stick with CMake, but a bit of a pain with VS or C::B) and warn users about the different download packages (or just do that selection at runtime; the selection would be platform dependent, so it requires extra code).
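For the runtime route, a minimal sketch of the usual x86 approach using the GCC/Clang __builtin_cpu_supports builtin (MSVC would need __cpuid instead; the two function names here are hypothetical):

Code: Select all

#include <cstdio>

// Hypothetical pair of implementations: in a real build the SSE3 one would
// live in a translation unit compiled with -msse3.
static void mulScalar() { std::puts("scalar path"); }
static void mulSSE3()   { std::puts("SSE3 path"); }

int main()
{
    // GCC/Clang builtin: checks CPUID at runtime, so one binary can carry both paths.
    if (__builtin_cpu_supports("sse3"))
        mulSSE3();
    else
        mulScalar();
}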

You can happily assume everyone has SSE2 (my laptop is 7 years old and has up to SSE3), but the most interesting stuff comes with SSE3 (horizontal add, for example: my SSE matrix multiplication is as slow as native code because it uses SSE2 and therefore has no horizontal add, while devsh's code uses SSE3, which gives a huge 3x boost).
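For reference, the SSE3 horizontal add looks like this in a dot-product sketch (my own illustration, built with -msse3; with plain SSE2 the two hadd calls have to be emulated with shuffles and adds):

Code: Select all

#include <pmmintrin.h>   // SSE3: _mm_hadd_ps
#include <cstdio>

// Dot product of two 4-float vectors using the SSE3 horizontal add.
static float dot4(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);   // (a0*b0, a1*b1, a2*b2, a3*b3)
    p = _mm_hadd_ps(p, p);         // (x0+x1, x2+x3, x0+x1, x2+x3)
    p = _mm_hadd_ps(p, p);         // all four lanes now hold the full sum
    return _mm_cvtss_f32(p);
}

int main()
{
    __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);   // (1, 2, 3, 4), low lane first
    __m128 b = _mm_set_ps(8.f, 7.f, 6.f, 5.f);   // (5, 6, 7, 8)
    std::printf("dot = %f\n", dot4(a, b));       // 1*5 + 2*6 + 3*7 + 4*8 = 70
}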

It surprises me that modern C++ compilers still can't do the following:

- Convert SIMD intrinsics into multiple regular x86 instructions (it is possible, and that would at least remove some maintenance burden from C++ developers: you would just write SIMD code). As far as I know Emscripten already does that, but that's for the web.
- Certain processors already just translate SIMD instructions into multiple microcode instructions.

The most important point is that optimizing for SIMD also requires some changes at a higher level (a different memory layout), and just hardcoding routines with SIMD instructions is not the smartest thing to do (though actually no programming language provides high-level control over the memory layout of data; I once wrote an article in Italian about the topic), so you have to take that into account.
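A generic sketch of that memory-layout point (array-of-structures vs. structure-of-arrays, not tied to any particular engine):

Code: Select all

// Array-of-structures: natural to write, but the x, y, z of one point are
// interleaved in memory, so a SIMD loop keeps shuffling lanes around.
struct PointAoS { float x, y, z; };
// PointAoS points[1024];

// Structure-of-arrays: each component is contiguous, so _mm_load_ps can pull
// four x's (or y's, or z's) straight into one register with no shuffling.
struct PointsSoA
{
    float x[1024];
    float y[1024];
    float z[1024];
};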
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me