... Or in a place where you can perform multiple vector operations at once, such as a brute-force culling algorithm.
However, I'm assuming that you've been testing on x86/x64, which are out-of-order CPUs, and they can be pretty awesome at swallowing "bad code". If you compare that to an in-order CPU (Xbox 360/PS3/mobile devices etc.), you'll get completely different results. If Irrlicht is meant to be used on non-x86/x64 platforms, SSE-style SIMD should definitely be implemented for performance.
Or if you're going to evaluate something like SSE, at least try multiple platforms, specifically platforms that benefit from it.
SSE vector3df and matrix4
Simpe, you have a point, but remember that most uses of Irrlicht are currently on desktop systems which don't use out-of-order CPUs.
I think more tests should be done before this is outright rejected.
I'm developing a custom GLES 2.0 based renderer for my iOS engine (iPhones, etc.) and I make use of many base Irrlicht types, including Vector2D, Vector3D and Matrix4.
devsh, if you can post your SSE versions of these, then I can benchmark the performance of your changes on my iPhone 4.
I have posted my matrix class, but the functions are incomplete. Either copy them from the class and use pointer arithmetic to treat the 4 __m128 like 16 floats, or complete the SSE implementation; the reference is pretty easy to follow. SSE is most useful with matrices, so I'd start with that. You will have to test combinations of SSE functions against the normal functions. Stuff like multiplying by another matrix, assignment from a scalar, and especially computing the inverse is always obviously faster, and assignment of 4 __m128 is faster than memcpy or about the same. Other stuff I found slower, like some assignments (identity), and transposing can be slower with -O3 and -ffast-math.
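To make the "4 __m128 viewed as 16 floats" idea concrete, here is a minimal sketch of a 4x4 multiply. It is not devsh's class: the names Mat4Sketch, multiply and SKETCH_HAS_SSE are made up for illustration, row-major storage is an assumption, and the matrix is stored as 16 aligned floats that are loaded into registers row by row (which avoids casting between __m128 and float pointers). A scalar fallback is kept so the same code builds without SSE.

```cpp
// Detect SSE at compile time (GCC/Clang define __SSE__, MSVC uses _M_X64/_M_IX86_FP).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical matrix type: 16 floats, row-major, 16-byte aligned so that
// each row can be loaded with an aligned SSE load.
struct alignas(16) Mat4Sketch {
    float m[16];
};

// Returns a * b, where out[i][j] = sum over k of a[i][k] * b[k][j].
inline Mat4Sketch multiply(const Mat4Sketch& a, const Mat4Sketch& b)
{
    Mat4Sketch out;
#if SKETCH_HAS_SSE
    // Load the four rows of b once.
    const __m128 b0 = _mm_load_ps(b.m + 0);
    const __m128 b1 = _mm_load_ps(b.m + 4);
    const __m128 b2 = _mm_load_ps(b.m + 8);
    const __m128 b3 = _mm_load_ps(b.m + 12);
    for (int i = 0; i < 4; ++i) {
        // Row i of the result is a linear combination of the rows of b,
        // weighted by the four elements of row i of a.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 3]), b3));
        _mm_store_ps(out.m + 4 * i, r);
    }
#else
    // Scalar fallback with the same semantics.
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[4 * i + k] * b.m[4 * k + j];
            out.m[4 * i + j] = sum;
        }
    }
#endif
    return out;
}
```

Whether this actually beats the plain version depends on compiler flags and the target CPU, so it is worth timing both paths as described above.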
fmx wrote: Simpe, you have a point, but remember that most uses of Irrlicht are currently on desktop systems which don't use out-of-order CPUs.
I think you meant the other way around (most desktop systems are out-of-order execution CPUs)... possibly because I typoed it in my post.
But yeah, most Irrlicht users run on desktop systems, but for those who don't I'd say that something like this is extremely important, since it makes a huge difference in performance. Just like virtual calls do on in-order machines.
You're going to need to use SoA (structure-of-arrays) form if you want a decent speed boost, and that would require rewriting a lot of the algorithms in Irrlicht.
One trick I found useful is to use a SOA3Vector class which holds four 3-dimensional vectors and performs all the ordinary operations on 4 vectors at once. As long as you always have a long list of data to perform operations on, you should be fine.
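A rough sketch of what such a SOA3Vector could look like is below. The layout and the names SOA3VectorSketch, dot4 and SKETCH_HAS_SSE are assumptions for illustration, not the poster's actual class. The point is that the x, y and z components of four vectors each sit in their own 4-float lane group, so one dot product call produces four results with no horizontal shuffling.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical SoA container: four 3D vectors stored component-wise.
// Vector i is (x[i], y[i], z[i]).
struct alignas(16) SOA3VectorSketch {
    float x[4];
    float y[4];
    float z[4];
};

// Writes the four dot products a[i] . b[i] into out[0..3].
inline void dot4(const SOA3VectorSketch& a, const SOA3VectorSketch& b, float* out)
{
#if SKETCH_HAS_SSE
    __m128 r = _mm_mul_ps(_mm_load_ps(a.x), _mm_load_ps(b.x));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(a.y), _mm_load_ps(b.y)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(a.z), _mm_load_ps(b.z)));
    _mm_storeu_ps(out, r); // unaligned store: out may point anywhere
#else
    // Scalar fallback with the same semantics.
    for (int i = 0; i < 4; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
#endif
}
```

With a long list of vectors you would keep an array of these blocks and loop over it, which is the "long list of data" case mentioned above.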
Re: SSE vector3df and matrix4
I'm resurrecting this effort because the CPU is dragging down the performance of Build a World.
The previous code I made is completely unusable because it's not proper SSE.
This time the vector3d, matrix4, rect2d and aabbox classes will all be 16-byte aligned and padded to 4 floats,
even on normal assignment or variable declaration.
We'll provide an aligned16 call which will work on both Windows and Linux, as well as _SSSE3_ #ifdefs and #elses, so that Irrlicht can be compiled without those.
We'll release the whole thing when it's done and OpenGL 3.2 compliance is in.
Our Irrlicht is always merged with the latest stable version (1.8 now, but merging with 1.8.1).
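As a rough illustration of the plan above (not the code that was eventually released), a padded, 16-byte aligned vector with a compile-time SSE switch could look like the sketch below. The names SKETCH_ALIGNED16, SKETCH_HAS_SSE, Vec3Padded and add are hypothetical, and the check uses plain SSE rather than SSSE3. The static_asserts assume a C++11 compiler, where alignas(16) would be a portable alternative to the macro.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical alignment macro covering MSVC (Windows) and GCC/Clang (Linux).
#if defined(_MSC_VER)
#define SKETCH_ALIGNED16 __declspec(align(16))
#else
#define SKETCH_ALIGNED16 __attribute__((aligned(16)))
#endif

// A 3D vector padded to 4 floats: v[0..2] are x, y, z and v[3] is padding.
struct SKETCH_ALIGNED16 Vec3Padded {
    float v[4];
};

static_assert(sizeof(Vec3Padded) == 16, "vector should occupy exactly 4 floats");
static_assert(alignof(Vec3Padded) == 16, "vector should be 16-byte aligned");

// Component-wise add; the padding lane is added too, which is harmless.
inline Vec3Padded add(const Vec3Padded& a, const Vec3Padded& b)
{
    Vec3Padded out;
#if SKETCH_HAS_SSE
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    // Fallback so the code still compiles when SSE is unavailable.
    for (int i = 0; i < 4; ++i)
        out.v[i] = a.v[i] + b.v[i];
#endif
    return out;
}
```

The trade-off is that every vector grows from 12 to 16 bytes, so any code that assumes tightly packed 3-float vectors (vertex buffers, for example) would need checking.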
Re: SSE vector3df and matrix4
I have been working on an SSE implementation of matrix4. It's been able to make my FPS go up 40% in some of my math-heavy situations.
So far only the matrix work has been done, as padding the vector to 4 components ended poorly. Have you gotten it to work?
Re: SSE vector3df and matrix4
So far, postponed until some game features are in and the OGL 3.2 core context gets sorted.
Re: SSE vector3df and matrix4
Only just sorted out the proper implementation... First I'm going to make/publish the classes, and then you need to change the actual type of the matrix etc.
Head over to http://irrlicht.sourceforge.net/forum/v ... =9&t=50230