About SIMD math (with profiling code)

Discuss anything related to the Irrlicht Engine, or read announcements about any significant features or usage changes.
hendu
Posts: 2600
Joined: Sat Dec 18, 2010 12:53 pm

Re: About SIMD math (with profiling code)

Post by hendu »

Yet the X axis is stored as M[0], M[1], M[2]...
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany
Contact:

Re: About SIMD math (with profiling code)

Post by CuteAlien »

@devsh: 12, 13, 14, 15 is the last row in a row-major matrix. That's what row-major means.
As the documentation says, the matrix is interpreted the same way as in D3D. OpenGL puts the translation in a column (because some non-coding math guys prefer that) instead of in a row (as D3D does), which is why you might think it always has to be a column - but there's no reason for that; both are fine. And since GL then reads the memory as a column-major matrix, it ends up with the translation elements at 12, 13, 14, 15 in memory, just as in D3D.
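A minimal sketch of that layout (plain C++, not Irrlicht code - just the flat 16-float array both APIs consume):

Code: Select all

// Read row-major (D3D/Irrlicht style): the translation sits in the last row.
// Read column-major (GL style): the same bytes put it in the last column.
// Either way tx, ty, tz live at indices 12, 13, 14.
void buildTranslation(float* M, float tx, float ty, float tz)
{
    const float identity[16] = {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1
    };
    for (int i = 0; i < 16; ++i)
        M[i] = identity[i];
    M[12] = tx;
    M[13] = ty;
    M[14] = tz;
}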
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
mongoose7
Posts: 1227
Joined: Wed Apr 06, 2011 12:13 pm

Re: About SIMD math (with profiling code)

Post by mongoose7 »

Nothing you say makes sense. Let me say again, Irrlicht matrices are actually the transposes of the true matrices. Therefore, the translation vector is not in the fourth column, it is in the fourth row.
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

Actually the memory representation is independent of such stuff (devsh explained it well).
In math the translation is the last column, but you can represent the matrix with the translation either in M[3], M[7], M[11] or in M[12], M[13], M[14]. In both cases you are referring to a column, as long as you stay consistent across all the code.

Now, in D3D the translation is a row. So if Irrlicht takes that as its assumption (12, 13, 14), then from a D3D point of view Irrlicht is column-major. That can incidentally be thought of as a GL matrix with the translation in a column and row-major ordering - they are the same thing. However, that changes once matrix multiplication comes into play; from that you can tell whether the matrix class is meant to be D3D-ish or GL-ish (do you post-multiply a vector, or pre-multiply it, to get it transformed?).

Memory representation is independent of the math formalism. (Speculation: D3D put the translation in a row because 1) the C++ * operator happens to associate in the wrong direction, and 2) to break code and confuse users so they stick with D3D more easily.)
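To make the pre-/post-multiplication point concrete, here is a plain C++ sketch of my own (no engine code): the same 16 floats give the same transformed point whether you treat them as a row-major matrix with a row vector on the left, or as a column-major matrix with a column vector on the right.

Code: Select all

#include <cstdio>

// Same 16 floats either way; translation stored at indices 12, 13, 14.
static const float M[16] = {
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    5, 6, 7, 1
};

int main()
{
    const float v[4] = { 1, 2, 3, 1 };
    float out[4];

    // D3D-ish reading: row vector * row-major matrix, out[j] = sum_i v[i] * M[i*4 + j]
    for (int j = 0; j < 4; ++j) {
        out[j] = 0;
        for (int i = 0; i < 4; ++i)
            out[j] += v[i] * M[i*4 + j];
    }
    std::printf("row-vector convention:    %g %g %g\n", out[0], out[1], out[2]);

    // GL-ish reading of the same memory: column-major matrix * column vector,
    // out[i] = sum_j M[j*4 + i] * v[j] - identical arithmetic, identical result.
    for (int i = 0; i < 4; ++i) {
        out[i] = 0;
        for (int j = 0; j < 4; ++j)
            out[i] += M[j*4 + i] * v[j];
    }
    std::printf("column-vector convention: %g %g %g\n", out[0], out[1], out[2]);
}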
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Matrix multiplication is associative, so it doesn't matter whether you multiply matrices as A(BC) or (AB)C - you get the same answer either way - so C++ operator associativity is not an issue.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Anyway

TRY MY SIMD MATRIX AND DO SOME BENCHMARKS OF DEATH:
http://irrlicht.sourceforge.net/forum/v ... 04#p293604

I recommend trying my matrix inverse function (which is untested, so it may not work at all); it should be 10x faster than Irrlicht's.
thanhle
Posts: 325
Joined: Wed Jun 12, 2013 8:09 am

Re: About SIMD math (with profiling code)

Post by thanhle »

How do you use your SIMD Matrix operation classes?
Is the expected format similar to Irrlicht?
Row or Column major?
Do we need to convert Irrlicht vectors to the SIMD vector format, or matrix4 to a SIMD matrix, to use it?
Maybe you could put out one or more examples of how to use it.

Regards,
thanh
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

The matrix is a proper maths matrix; it is made of 4 SIMD vectors, each representing a row.

Yes, you need to convert between the Irrlicht vectors/matrices and the SIMD ones; there is no implicit conversion (for the sake of speed, and because of the bugs that could arise from the different formats).
There are some conversion functions for vectorSIMDf, but none for matrixSIMD4 so far.

The usage examples are in the other thread.
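Roughly what such a layout looks like in practice (a sketch of my own; the struct and function names below are not the actual vectorSIMDf/matrixSIMD4 API):

Code: Select all

#include <xmmintrin.h>   // SSE intrinsics

// Hypothetical layout: four SSE registers, one per matrix row.
struct MatrixSIMD4Sketch
{
    __m128 rows[4];
};

// Hypothetical conversion of three plain floats (e.g. an Irrlicht vector3df's
// X, Y, Z) into one SSE register, padding W with 1.
inline __m128 toSIMD(float x, float y, float z)
{
    return _mm_set_ps(1.f, z, y, x);   // _mm_set_ps takes its arguments high lane first
}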
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: About SIMD math (with profiling code)

Post by devsh »

Using my properly implemented class (I was only pitting actual class instances of matrix4 and matrixSIMD4 against each other):

-O0
RUNTIME of SSE matrix mul 5054530 microseconds
RUNTIME of irrMatrix mul 6241177 microseconds
-O1
RUNTIME of SSE matrix mul 375871 microseconds
RUNTIME of irrMatrix mul 562569 microseconds
-O2
RUNTIME of SSE matrix mul 288762 microseconds
RUNTIME of irrMatrix mul 548489 microseconds
-O3
RUNTIME of SSE matrix mul 301609 microseconds
RUNTIME of irrMatrix mul 574059 microseconds
-O4
RUNTIME of SSE matrix mul 296882 microseconds
RUNTIME of irrMatrix mul 538160 microseconds

As we can see we have a clear winner here (SIMD intrinsics)

The funny thing is that the optimizations top out at -O2 (I guess there is nothing more the compiler can do with such simple code).


If we look at -O0, the perf gain is only around 10-15%, but from -O2 onwards we start hitting a 2x speedup.
The reason is that the compiler finally starts inlining all the functions and sees a lot of

Code: Select all

 
...
xmm0 = _mm_load_ps(memory);           // load the operand
xmm0 = _mm_someop_ps(xmm0, xmm1);     // placeholder for whatever SSE op runs here
_mm_store_ps(memory, xmm0);           // store the result back...
xmm0 = _mm_load_ps(memory);           // ...only to load the same value again
...
 
and eliminates the redundant loads and stores
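For example (my own illustration, not code from the class), once everything is inlined the chain effectively becomes register-only work with a single load and a single store:

Code: Select all

#include <xmmintrin.h>

// What the optimizer effectively produces: the intermediate value stays in a
// register, so the store/load pair in the middle of the naive sequence is gone.
void fusedOps(float* memory, __m128 xmm1, __m128 xmm2)
{
    __m128 xmm0 = _mm_load_ps(memory);
    xmm0 = _mm_add_ps(xmm0, xmm1);   // stand-in for the first operation
    xmm0 = _mm_mul_ps(xmm0, xmm2);   // stand-in for the next operation
    _mm_store_ps(memory, xmm0);      // one store at the end
}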
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

devsh wrote:matrix multiplication is associative, so it doesnt matter if you multiply matrices as A(BC) or (AB)C, you still get the same answer, so C++ operator associativity is not an issue
Yes, but it does matter if you have vectors on one side of the multiplication chain, and anyway in general (AB) is different from (BA). Many implementations would also swap pairs of operands, effectively turning A(BC) into A(CB) and then into (CB)A, which is wrong too. In general the "wrong" associativity of C++ and the particular implementation chosen force you to write the matrices in the wrong order compared to maths, and you will get different results.
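A tiny numeric check of that distinction (plain 2x2 scalar code of my own, nothing engine-specific): regrouping the product does not change the result, but swapping the factors does.

Code: Select all

#include <cstdio>

// 2x2 matrices, stored row-major as {a b; c d}.
struct M2 { float a, b, c, d; };

static M2 mul(const M2& x, const M2& y)
{
    return { x.a*y.a + x.b*y.c, x.a*y.b + x.b*y.d,
             x.c*y.a + x.d*y.c, x.c*y.b + x.d*y.d };
}

static void print(const char* name, const M2& m)
{
    std::printf("%s = [%g %g; %g %g]\n", name, m.a, m.b, m.c, m.d);
}

int main()
{
    M2 A{1, 2, 3, 4}, B{0, 1, 1, 0}, C{2, 0, 0, 2};
    print("(AB)C", mul(mul(A, B), C));   // associativity: same result as the next line
    print("A(BC)", mul(A, mul(B, C)));
    print("AB   ", mul(A, B));           // commutativity does NOT hold:
    print("BA   ", mul(B, A));           // AB and BA differ
}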


@devsh. going to test your code XD
Last edited by REDDemon on Wed Jun 10, 2015 9:48 am, edited 1 time in total.
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

-O2 -msse3

Code: Select all

 
RUNTIME of devsh SSE mat mul    1270000 microseconds
RUNTIME of SSE matrix mul       3384000 microseconds
RUNTIME of regularMatrix mul    3720000 microseconds
RUNTIME of irrMatrix mul        3542000 microseconds
 
-O3 -msse3

Code: Select all

 
RUNTIME of devsh SSE mat mul    1301000 microseconds
RUNTIME of SSE matrix mul       3126000 microseconds
RUNTIME of regularMatrix mul    3593000 microseconds
RUNTIME of irrMatrix mul        3532000 microseconds
 
very nice

You can run the benchmark too here:
https://github.com/Darelbi/PublicProfil ... iplication
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
RdR
Competition winner
Posts: 273
Joined: Tue Mar 29, 2011 2:58 pm
Contact:

Re: About SIMD math (with profiling code)

Post by RdR »

Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

RdR wrote:Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
Thanks! that's possible XD
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
RdR
Competition winner
Posts: 273
Joined: Tue Mar 29, 2011 2:58 pm
Contact:

Re: About SIMD math (with profiling code)

Post by RdR »

REDDemon wrote:
RdR wrote:Very nice indeed!
Any other benchmarks in the works? Like matrix inverse
Thanks! that's possible XD
That would be nice.
Have to say I have not done much research on SIMD yet, but I would like to implement this in the future (or use devsh's code if it's available).
But how do you handle CPUs that don't support SIMD?
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: About SIMD math (with profiling code)

Post by REDDemon »

Good question. It is mostly a maintenance issue.

You need two branches of the same code (you can do preprocessor trickery, or just abuse the build system to include the correct file): one with regular C++ and the other with SIMD instructions.

You also have to build two different binaries (pretty easy as long as you stick with CMake, but a bit of a pain with VS or C::B) and warn users about the different download packages (or just do that selection at runtime; the selection would be platform dependent, so it requires extra code).
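For the runtime route, a minimal sketch of the usual x86 approach using the GCC/Clang __builtin_cpu_supports builtin (MSVC would need __cpuid instead; the two function names here are hypothetical):

Code: Select all

#include <cstdio>

// Hypothetical pair of implementations: in a real build the SSE3 one would
// live in a translation unit compiled with -msse3.
static void mulScalar() { std::puts("scalar path"); }
static void mulSSE3()   { std::puts("SSE3 path"); }

int main()
{
    // GCC/Clang builtin: checks CPUID at runtime, so one binary can carry both paths.
    if (__builtin_cpu_supports("sse3"))
        mulSSE3();
    else
        mulScalar();
}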

You can happily assume everyone has SSE2 (my laptop is 7 years old and has up to SSE3), but the most interesting stuff comes with SSE3 (horizontal add, for example: my SSE matrix multiplication is as slow as native code because it uses SSE2 and therefore has no horizontal add, while devsh's code uses SSE3, which gives a huge 3x boost).
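For reference, the SSE3 horizontal add looks like this in a dot-product sketch (my own illustration, built with -msse3; with plain SSE2 the two hadd calls have to be emulated with shuffles and adds):

Code: Select all

#include <pmmintrin.h>   // SSE3: _mm_hadd_ps
#include <cstdio>

// Dot product of two 4-float vectors using the SSE3 horizontal add.
static float dot4(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);   // (a0*b0, a1*b1, a2*b2, a3*b3)
    p = _mm_hadd_ps(p, p);         // (x0+x1, x2+x3, x0+x1, x2+x3)
    p = _mm_hadd_ps(p, p);         // all four lanes now hold the full sum
    return _mm_cvtss_f32(p);
}

int main()
{
    __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);   // (1, 2, 3, 4), low lane first
    __m128 b = _mm_set_ps(8.f, 7.f, 6.f, 5.f);   // (5, 6, 7, 8)
    std::printf("dot = %f\n", dot4(a, b));       // 1*5 + 2*6 + 3*7 + 4*8 = 70
}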

It surprises me that modern C++ compilers still can't do the following:

- Convert SIMD intrinsics into multiple regular x86 instructions (it is possible, and that would at least remove some maintenance burden from C++ developers: you would just write SIMD code). As far as I know Emscripten already does that, but that's for the web.
- Certain processors already just translate SIMD instructions into multiple microcode instructions.

The most important point is that optimizing for SIMD also requires some changes at a higher level (a different memory layout), and just hardcoding routines with SIMD instructions is not the smartest thing to do (though actually no programming language provides high-level control over the memory layout of data; I once wrote an article in Italian about the topic), so you have to take that into account.
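A generic sketch of that memory-layout point (array-of-structures vs. structure-of-arrays, not tied to any particular engine):

Code: Select all

// Array-of-structures: natural to write, but the x, y, z of one point are
// interleaved in memory, so a SIMD loop keeps shuffling lanes around.
struct PointAoS { float x, y, z; };
// PointAoS points[1024];

// Structure-of-arrays: each component is contiguous, so _mm_load_ps can pull
// four x's (or y's, or z's) straight into one register with no shuffling.
struct PointsSoA
{
    float x[1024];
    float y[1024];
    float z[1024];
};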
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me