SSE vector3df and matrix4

Simpe · Post by **Simpe** » Wed Dec 15, 2010 8:37 am

... Or in a place where you can perform multiple vector operations at once, such as a brute-force culling algorithm.

However, I'm assuming that you've been testing on x86/x64, which are out-of-order CPUs, and they can be pretty awesome at swallowing "bad code". If you compare that to an in-order CPU (Xbox 360/PS3/mobile devices etc.), you'll get completely different results. If Irrlicht is meant to be used on non-x86/x64 platforms, SSE should definitely be implemented for performance.

Or if you're going to evaluate something like SSE, at least try multiple platforms, specifically platforms that benefit from it.

fmx · Post by **fmx** » Wed Dec 15, 2010 9:37 am

Simpe, you have a point, but remember that most uses of Irrlicht are currently on desktop systems which don't use out-of-order CPUs.

I think more tests should be done before this is outright rejected.
I'm developing a custom GLES 2.0 based renderer for my iOS engine (iPhones, etc) and I make use of many base irrlicht types, including Vector2D, Vector3D and Matrix4's.

devsh, if you can post your SSE versions of these, I can benchmark the performance of your changes on my iPhone 4.

devsh · Post by **devsh** » Wed Dec 15, 2010 8:14 pm

I have posted my matrix class, but the functions are incomplete. Either copy them from the original class and use pointer arithmetic to treat the 4 __m128 like 16 floats, or complete the SSE implementation; the reference is pretty easy to follow. SSE is most useful with matrices, so I'd start with that. You will have to test combinations of SSE functions against the normal functions. In my tests, multiplying by another matrix, assignment from a scalar, and especially computing the inverse were consistently faster, and assigning 4 __m128 was as fast as memcpy or faster. Other things I found slower, like some assignments (identity), and transposing can be slower with -O3 and -ffast-math.
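As a rough illustration of the matrix case described above, here is a minimal sketch of a 4x4 multiply with SSE intrinsics. The `Mat4` struct, its row-major layout, and the `multiply` function are assumptions made up for this example; they are not devsh's posted class and not Irrlicht's `core::matrix4`. The storage is 16 floats that are loaded as four `__m128` rows, which is one way of treating the two views as the same memory.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics; assumes an x86/x64 target

// Hypothetical stand-in for a matrix class (not Irrlicht's core::matrix4).
// Assumed layout: row-major, m[4 * row + col], 16-byte aligned.
struct alignas(16) Mat4 {
    float m[16];
};

// Returns a * b. Row r of the result is the sum over k of
// a[r][k] * (row k of b), so each output row takes 4 multiplies and 3 adds.
inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    // _mm_load_ps requires 16-byte aligned addresses; check in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(b.m) % 16 == 0);

    const __m128 b0 = _mm_load_ps(b.m + 0);
    const __m128 b1 = _mm_load_ps(b.m + 4);
    const __m128 b2 = _mm_load_ps(b.m + 8);
    const __m128 b3 = _mm_load_ps(b.m + 12);

    Mat4 out;
    for (int r = 0; r < 4; ++r) {
        // Broadcast each scalar of row r of a and scale the rows of b.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[4 * r + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * r + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * r + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * r + 3]), b3));
        _mm_store_ps(out.m + 4 * r, row);
    }
    return out;
}
```

Whether this beats the compiler's output for a plain scalar loop depends on the compiler, flags and CPU, so it is worth timing both versions, as suggested above.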

Simpe · Post by **Simpe** » Thu Dec 16, 2010 10:30 am

fmx wrote: Simpe, you have a point, but remember that most uses of Irrlicht are currently on desktop systems which don't use out-of-order CPUs.

I think you meant the other way around (most desktop systems are out-of-order execution CPUs)... possibly because I made a typo in my post.

But yeah, most Irrlicht users run on desktop systems, but for those who don't, I'd say that something like this is extremely important since it makes a huge difference in performance. Just like virtual calls do on in-order machines.

fmx · Post by **fmx** » Thu Dec 16, 2010 12:57 pm

I honestly had no idea, my experience is limited to consumer desktop PCs and iPhones, I still have a lot to learn about other hardware and differences in CPU architecture.

BlindSide · Post by **BlindSide** » Thu Dec 16, 2010 11:02 pm

You're gonna need to use SoA (structure-of-arrays) form if you want a decent speed boost; that would require rewriting a lot of the algorithms in Irrlicht.

One trick I found useful is to use a SOA3Vector class which holds four 3-dimensional vectors and performs all the ordinary operations on 4 vectors at once, so as long as you always have a long list of data to perform operations on, you should be fine.
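A minimal sketch of what such a class could look like, assuming an x86/x64 target with SSE. The names `SoAVec3x4`, `load4`, `dot` and `culledMask` are hypothetical and are not taken from BlindSide's code or from Irrlicht. The culling helper shows the kind of brute-force test mentioned earlier in the thread: four bounding-sphere centres are checked against one plane at once.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics; assumes an x86/x64 target

// Hypothetical SoA packet: lane i of x, y, z together forms vector i.
struct SoAVec3x4 {
    __m128 x, y, z;
};

// Loads four vectors from separate x, y and z arrays (4 floats each).
inline SoAVec3x4 load4(const float* xs, const float* ys, const float* zs) {
    assert(xs && ys && zs);
    // Unaligned loads, so the caller's arrays need no special alignment.
    return { _mm_loadu_ps(xs), _mm_loadu_ps(ys), _mm_loadu_ps(zs) };
}

// Four dot products at once; lane i holds dot(a_i, b_i).
inline __m128 dot(const SoAVec3x4& a, const SoAVec3x4& b) {
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(a.x, b.x), _mm_mul_ps(a.y, b.y)),
                      _mm_mul_ps(a.z, b.z));
}

// Tests four sphere centres against the plane n.p + d = 0.
// Bit i of the result is set when sphere i lies entirely behind the plane.
inline int culledMask(const SoAVec3x4& centres,
                      float nx, float ny, float nz, float d, float radius) {
    const SoAVec3x4 n = { _mm_set1_ps(nx), _mm_set1_ps(ny), _mm_set1_ps(nz) };
    const __m128 dist = _mm_add_ps(dot(centres, n), _mm_set1_ps(d));
    return _mm_movemask_ps(_mm_cmplt_ps(dist, _mm_set1_ps(-radius)));
}
```

No shuffles or horizontal adds are needed here because every lane does the same arithmetic on a different vector, which is where the SoA layout tends to pay off.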

devsh · Post by **devsh** » Tue Nov 26, 2013 6:10 pm

I'm resurrecting this effort because the CPU is dragging down the performance of Build a World.

the previous code I made is completely unusable because it's not proper SSE

this time the vector3d, matrix4, rect2d and aabbox classes will all be 16-byte aligned and padded to 4 floats, even on normal assignment or variable declaration

we'll provide an aligned16 call which will work on both Windows and Linux, as well as _SSSE3_ #ifdef/#else blocks, so that Irrlicht can be compiled without those
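A small sketch of the padding-plus-fallback idea, under a few assumptions. It uses C++11 `alignas` in place of a compiler-specific alignment macro, and it keys the SSE path on `__SSE__` (GCC/Clang) or the MSVC `_M_X64` / `_M_IX86_FP` macros rather than on the SSSE3 check mentioned in the post. `Vec3Padded`, `isAligned16`, `add` and `SKETCH_USE_SSE` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Pick the SSE path only when the compiler says SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define SKETCH_USE_SSE 1
#else
#  define SKETCH_USE_SSE 0
#endif

// Hypothetical padded vector (not Irrlicht's core::vector3df):
// v[0..2] are x, y, z and v[3] is padding so the struct fills one register.
struct alignas(16) Vec3Padded {
    float v[4];
};
static_assert(sizeof(Vec3Padded) == 16, "padded to 4 floats");
static_assert(alignof(Vec3Padded) == 16, "16-byte aligned");

inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Component-wise add with an SSE path and a scalar fallback.
inline Vec3Padded add(const Vec3Padded& a, const Vec3Padded& b) {
    assert(isAligned16(&a) && isAligned16(&b));
    Vec3Padded out;
#if SKETCH_USE_SSE
    // One aligned load per operand, one add, one aligned store.
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i) out.v[i] = a.v[i] + b.v[i];
#endif
    return out;
}
```

Note that `alignas` covers stack and static objects; heap allocations of such a type may still need an aligned allocator, depending on the language standard and platform.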

we'll release the whole thing when done and opengl 3.2 compliance is in

our irrlicht is always merged with the latest stable version ( 1.8 now, but merging with 1.8.1)

Granyte · Post by **Granyte** » Sat Jan 11, 2014 5:04 am

I have been working on an SSE implementation of matrix4; it's been able to make my FPS go up 40% in some of my math-heavy situations.
So far only the matrix work has been done, as padding the vector to 4 components ended poorly. Have you gotten it to work?

devsh · Post by **devsh** » Sun Jan 12, 2014 8:41 pm

so far, postponed until some game features are in and the OGL 3.2 core context gets sorted

devsh · Post by **devsh** » Fri Sep 05, 2014 3:50 am

only just sorted out the proper implementation... first I'm going to make/publish the classes and then you need to change the actual type of the matrix etc.

head over to http://irrlicht.sourceforge.net/forum/v ... =9&t=50230
