SSE vector3df and matrix4
I want to implement SSE 3D vectors and matrices to speed up the engine's CPU-side math by 50%. I have some results...
simple assignment (constructor and = operator) using vector3df with SSE takes 1400ms +/- 50ms for 50 MILLION assignments (loop)
standard irrlicht takes 1680ms +/- 50ms
That is JUST assignment
update:
assignment with values (x,y,z) is slower on SSE
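For what it's worth, here's a standalone sketch of why the (x,y,z) case can go the other way (the type names are made up for illustration, they're not Irrlicht's):

```cpp
#include <xmmintrin.h>

// Illustrative stand-ins, not the real Irrlicht types. The plain version
// writes three floats directly; the SSE version must first pack three
// scalars into one register (extra shuffles), which is a plausible reason
// the (x, y, z) constructor measures slower with SSE.
struct Vec3Plain
{
    float x, y, z;
    Vec3Plain(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}
};

struct Vec3SSE
{
    __m128 v; // lanes: x, y, z, unused
    // packing three scalars: _mm_set_ps takes arguments high lane first
    Vec3SSE(float nx, float ny, float nz) : v(_mm_set_ps(0.f, nz, ny, nx)) {}
    // broadcast from a single float: one splat, which is where SSE wins
    explicit Vec3SSE(float s) : v(_mm_set1_ps(s)) {}
    float x() const { return _mm_cvtss_f32(v); }
};
```

The plain constructor compiles to three scalar stores, while `_mm_set_ps` has to assemble the register first; only the single-float broadcast constructor maps to roughly one instruction.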
I tried everything... I even implemented SSE in the particle system in hopes of speeding up the billboarding... SSE simply doesn't accelerate non-16-aligned floats; on unaligned data it's just as fast as plain code. I conclude we shouldn't bother with SSE, as this would mean rewriting Irrlicht
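The alignment requirement is easy to verify on its own; a minimal sketch (not Irrlicht code, `AlignedVec` and `isAligned16` are just illustrative names):

```cpp
#include <xmmintrin.h>
#include <cstdint>

// Any type with an __m128 member is forced to 16-byte alignment, but data
// coming from plain float arrays or packed vertex buffers usually is not,
// so it can only be touched with slower unaligned loads (_mm_loadu_ps).
struct AlignedVec { __m128 v; };

// true if p is safe for aligned SSE loads/stores such as _mm_load_ps
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}
```

This is why bolting SSE onto existing 12-byte vector layouts gains so little: the data has to live at 16-byte boundaries before the aligned instructions pay off.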
SSE starts to lose its 20% advantage after enabling the -O3 flag in GCC
I decided to give it one last shot... I have made core::matrix4sse, and... it performs at avg. 4700 ms while the standard matrix4 performs at avg. 8800 ms
Here is the test code; of course, in the native Irrlicht version the matrix4sse declarations change to matrix4
EDIT: I do realize I would need the devs' open-mindedness... I can see this is not likely to make it into the engine, and moreover the SSE would need to be written in inline asm to be truly effective, which I can't do. So here is my matrix4 stub
Code: Select all
//We multiply many matrices together so it actually takes some time
#define ARRAY_SIZE 1024*64
void ComputeArrayCPlusPlus(core::matrix4sse* vec, core::matrix4sse* other, u32 i)
{
vec[i] = other[i]*other[i]*vec[i]*0.00001526f;
}
/* ... some code ... and then in the main function: */
core::matrix4sse vec[ARRAY_SIZE];
core::matrix4sse vec2[ARRAY_SIZE+1];
for (u32 i = 0; i<ARRAY_SIZE; i++)
{
vec2[i][0] = sinf(i*55.f-1.f);
vec2[i][1] = cosf(i+1.f);
vec2[i][2] = cosf(i*128.f-128.f);
vec2[i][3] = sinf(i*55.f-1.f);
vec2[i][4] = 1.f/sinf(i*55.f-1.f);
vec2[i][5] = 1.f/cosf(i+1.f);
vec2[i][6] = 1.f/cosf(i*128.f-128.f);
vec2[i][7] = 1.f/sinf(i*55.f-1.f);
vec2[i][8] = sinf(i*5.f-1.f);
vec2[i][9] = cosf(i+2.f);
vec2[i][10] = cosf(i*18.f-12.f);
vec2[i][11] = sinf(i*5.f-11.f);
vec2[i][12] = 1.f/sinf(i*535.f-12.f);
vec2[i][13] = 1.f/cosf(i*0.25f+14.f);
vec2[i][14] = 1.f/cosf(i*12.f-18.f);
vec2[i][15] = 1.f/sinf(i*0.5f-111.f);
}
for (u32 i = 0; i<ARRAY_SIZE; i++)
{
vec[i][0] = sinf(i*551.f-1.f);
vec[i][1] = cosf(i+11.f);
vec[i][2] = cosf(i*1268.f-1238.f);
vec[i][3] = sinf(i*535.f-13.f);
vec[i][4] = 15.f/sinf(i*55.f-1.f);
vec[i][5] = 17.f/cosf(i+1.f);
vec[i][6] = 13.f/cosf(i*128.f-128.f);
vec[i][7] = 14.f/sinf(i*55.f-1.f);
vec[i][8] = sinf(i*56.f-1.f);
vec[i][9] = cosf(i+261.f);
vec[i][10] = cosf(i*1813.f-112.f);
vec[i][11] = sinf(i*56.f-131.f);
vec[i][12] = 3.f/sinf(i*535.f-12.f);
vec[i][13] = 31.f/cosf(i*0.25f+14.f);
vec[i][14] = 3.f/cosf(i*12.f-18.f);
vec[i][15] = 3.f/sinf(i*0.5f-111.f);
}
u32 time = device->getTimer()->getRealTime();
for (u32 j = 0; j<512; j++)
{
for (u32 i = 0; i<ARRAY_SIZE; i++)
ComputeArrayCPlusPlus(vec,vec2,i);
}
printf("Time Taken: %u Result: %f, %f, %f \n",device->getTimer()->getRealTime()-time,vec[123][10],vec[123][1],vec[123][15]);
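One caveat with this kind of micro-benchmark: at -O3 the compiler may fold or hoist the whole inner loop if it can prove the results are unused (the printf helps, but only for the three elements it reads). A generic escape-hatch sketch, not Irrlicht code:

```cpp
// Forces 'value' to actually exist in memory by reading it through a
// volatile pointer, so the optimizer cannot discard the computation
// that produced it as dead code.
template <typename T>
inline void doNotOptimizeAway(const T& value)
{
    volatile const char* p = reinterpret_cast<volatile const char*>(&value);
    (void)*p;
}
```

Calling this on each result array after the timed loop makes the comparison between -O2 and -O3 runs more trustworthy.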
Code: Select all
// Copyright (C) 2002-2009 Nikolaus Gebhardt
// This file is part of the "Irrlicht Engine".
// For conditions of distribution and use, see copyright notice in irrlicht.h
#ifndef __IRR_MATRIX_4_SSE_H_INCLUDED__
#define __IRR_MATRIX_4_SSE_H_INCLUDED__
#include "irrMath.h"
#include "vector3d.h"
#include "vector2d.h"
#include "plane3d.h"
#include "aabbox3d.h"
#include "rect.h"
#include "irrString.h"
#include <xmmintrin.h>
// enable this to keep track of changes to the matrix
// and allow a simpler identity check for seldom-changing matrices;
// otherwise the identity check will always compare the elements
//#define USE_MATRIX_TEST
// this is only for debugging purposes
//#define USE_MATRIX_TEST_DEBUG
namespace irr
{
namespace core
{
//! 4x4 matrix. Mostly used as transformation matrix for 3d calculations.
/** The matrix is a D3D style matrix, row major with translations in the 4th row. */
class matrix4sse
{
public:
//! Constructor Flags
enum eConstructor
{
EM4CONST_NOTHING = 0,
EM4CONST_COPY,
EM4CONST_IDENTITY,
EM4CONST_TRANSPOSED,
EM4CONST_INVERSE,
EM4CONST_INVERSE_TRANSPOSED
};
//! Default constructor
/** \param constructor Choose the initialization style */
matrix4sse( eConstructor constructor = EM4CONST_IDENTITY );
//! Copy constructor
/** \param other Other matrix to copy from
\param constructor Choose the initialization style */
matrix4sse(const matrix4sse& other, eConstructor constructor = EM4CONST_COPY);
//! Simple operator for directly accessing every element of the matrix.
f32& operator()(const s32 row, const s32 col) { return ((f32*)(M+row))[col]; }
//! Simple operator for directly accessing every element of the matrix.
const f32& operator()(const s32 row, const s32 col) const { return ((f32*)(M+row))[col]; }
//! Simple operator for linearly accessing every element of the matrix.
f32& operator[](u32 index) { return ((f32*)M)[index]; }
//! Simple operator for linearly accessing every element of the matrix.
const f32& operator[](u32 index) const { return ((f32*)M)[index]; }
//! Sets this matrix equal to the other matrix.
inline matrix4sse& operator=(const matrix4sse &other);
//! Sets all elements of this matrix to the value.
inline matrix4sse& operator=(const f32& scalar);
//! Returns pointer to internal array
const f32* pointer() const { return (f32*)M; }
f32* pointer()
{
return (f32*)M;
}
//! Multiply by another matrix.
/** Calculate other*this */
matrix4sse operator*(const matrix4sse& other) const;
//! Multiply by another matrix.
/** Calculate and return other*this */
//matrix4sse& operator*=(const matrix4sse& other);
//! Multiply by scalar.
matrix4sse operator*(const f32& scalar) const;
//! Set matrix to identity.
inline matrix4sse& makeIdentity();
//! Gets transposed matrix
matrix4sse getTransposed() const;
//! Gets transposed matrix
inline void getTransposed( matrix4sse& dest ) const;
private:
//! Matrix data, stored in row-major order
__m128 M[4];
};
// Default constructor
inline matrix4sse::matrix4sse( eConstructor constructor )
{
switch ( constructor )
{
case EM4CONST_NOTHING:
case EM4CONST_COPY:
break;
case EM4CONST_IDENTITY:
case EM4CONST_INVERSE:
default:
makeIdentity();
break;
}
}
// Copy constructor
inline matrix4sse::matrix4sse( const matrix4sse& other, eConstructor constructor)
{
switch ( constructor )
{
case EM4CONST_IDENTITY:
makeIdentity();
break;
case EM4CONST_NOTHING:
break;
case EM4CONST_COPY:
*this = other;
break;
case EM4CONST_TRANSPOSED:
other.getTransposed(*this);
break;/*
case EM4CONST_INVERSE:
if (!other.getInverse(*this))
memset(M, 0, 16*sizeof(T));
break;
case EM4CONST_INVERSE_TRANSPOSED:
if (!other.getInverse(*this))
memset(M, 0, 16*sizeof(T));
else
*this=getTransposed();
break;*/
}
}
//! Multiply by scalar.
inline matrix4sse matrix4sse::operator*(const f32& scalar) const
{
matrix4sse temp ( EM4CONST_NOTHING );
__m128 scalarSSE = _mm_load1_ps(&scalar);
temp.M[0] = _mm_mul_ps(M[0],scalarSSE);
temp.M[1] = _mm_mul_ps(M[1],scalarSSE);
temp.M[2] = _mm_mul_ps(M[2],scalarSSE);
temp.M[3] = _mm_mul_ps(M[3],scalarSSE);
return temp;
}
//! multiply by another matrix
inline matrix4sse matrix4sse::operator*(const matrix4sse& m2) const
{
matrix4sse m3 ( EM4CONST_NOTHING );
/*
m3.M[0] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+0)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+1))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+2)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+3))));
m3.M[1] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+4)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+5))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+6)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+7))));
m3.M[2] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+8)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+9))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+10)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+11))));
m3.M[3] = _mm_add_ps(_mm_mul_ps(_mm_add_ps(M[0],_mm_load1_ps(((f32*)m2.M)+12)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+13))),_mm_mul_ps(_mm_add_ps(M[2],_mm_load1_ps(((f32*)m2.M)+14)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+15))));
*/
m3.M[0] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[0],m2.M[0],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[0],m2.M[0],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[0],m2.M[0],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[0],m2.M[0],0xff))));
m3.M[1] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[1],m2.M[1],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[1],m2.M[1],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[1],m2.M[1],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[1],m2.M[1],0xff))));
m3.M[2] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[2],m2.M[2],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[2],m2.M[2],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[2],m2.M[2],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[2],m2.M[2],0xff))));
m3.M[3] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[3],m2.M[3],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[3],m2.M[3],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[3],m2.M[3],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[3],m2.M[3],0xff))));
/*
const f32 *m1 = (f32*)M;
m3[0] = m1[0]*m2[0] + m1[4]*m2[1] + m1[8]*m2[2] + m1[12]*m2[3];
m3[1] = m1[1]*m2[0] + m1[5]*m2[1] + m1[9]*m2[2] + m1[13]*m2[3];
m3[2] = m1[2]*m2[0] + m1[6]*m2[1] + m1[10]*m2[2] + m1[14]*m2[3];
m3[3] = m1[3]*m2[0] + m1[7]*m2[1] + m1[11]*m2[2] + m1[15]*m2[3];
m3[4] = m1[0]*m2[4] + m1[4]*m2[5] + m1[8]*m2[6] + m1[12]*m2[7];
m3[5] = m1[1]*m2[4] + m1[5]*m2[5] + m1[9]*m2[6] + m1[13]*m2[7];
m3[6] = m1[2]*m2[4] + m1[6]*m2[5] + m1[10]*m2[6] + m1[14]*m2[7];
m3[7] = m1[3]*m2[4] + m1[7]*m2[5] + m1[11]*m2[6] + m1[15]*m2[7];
m3[8] = m1[0]*m2[8] + m1[4]*m2[9] + m1[8]*m2[10] + m1[12]*m2[11];
m3[9] = m1[1]*m2[8] + m1[5]*m2[9] + m1[9]*m2[10] + m1[13]*m2[11];
m3[10] = m1[2]*m2[8] + m1[6]*m2[9] + m1[10]*m2[10] + m1[14]*m2[11];
m3[11] = m1[3]*m2[8] + m1[7]*m2[9] + m1[11]*m2[10] + m1[15]*m2[11];
m3[12] = m1[0]*m2[12] + m1[4]*m2[13] + m1[8]*m2[14] + m1[12]*m2[15];
m3[13] = m1[1]*m2[12] + m1[5]*m2[13] + m1[9]*m2[14] + m1[13]*m2[15];
m3[14] = m1[2]*m2[12] + m1[6]*m2[13] + m1[10]*m2[14] + m1[14]*m2[15];
m3[15] = m1[3]*m2[12] + m1[7]*m2[13] + m1[11]*m2[14] + m1[15]*m2[15];*/
return m3;
}
//! Set matrix to identity.
inline matrix4sse& matrix4sse::makeIdentity()
{
// the off-diagonal elements must be cleared as well, not just the
// diagonal set, since EM4CONST_NOTHING leaves M uninitialized
M[0] = _mm_set_ps(0.f, 0.f, 0.f, 1.f);
M[1] = _mm_set_ps(0.f, 0.f, 1.f, 0.f);
M[2] = _mm_set_ps(0.f, 1.f, 0.f, 0.f);
M[3] = _mm_set_ps(1.f, 0.f, 0.f, 0.f);
return *this;
}
inline matrix4sse& matrix4sse::operator=(const matrix4sse &other)
{
if (this==&other)
return *this;
M[0] = other.M[0];
M[1] = other.M[1];
M[2] = other.M[2];
M[3] = other.M[3];
return *this;
}
inline matrix4sse& matrix4sse::operator=(const f32& scalar)
{
M[0] = M[1] = M[2] = M[3] = _mm_set1_ps(scalar);
return *this;
}
// returns transposed matrix
inline matrix4sse matrix4sse::getTransposed() const
{
matrix4sse t ( EM4CONST_NOTHING );
getTransposed ( t );
return t;
}
// stores the transposed matrix in o
inline void matrix4sse::getTransposed( matrix4sse& o ) const
{
o=*this;
_MM_TRANSPOSE4_PS(o.M[0], o.M[1], o.M[2], o.M[3]);
/*o[ 0] = ((f32*)M)[ 0];
o[ 1] = ((f32*)M)[ 4];
o[ 2] = ((f32*)M)[ 8];
o[ 3] = ((f32*)M)[12];
o[ 4] = ((f32*)M)[ 1];
o[ 5] = ((f32*)M)[ 5];
o[ 6] = ((f32*)M)[ 9];
o[ 7] = ((f32*)M)[13];
o[ 8] = ((f32*)M)[ 2];
o[ 9] = ((f32*)M)[ 6];
o[10] = ((f32*)M)[10];
o[11] = ((f32*)M)[14];
o[12] = ((f32*)M)[ 3];
o[13] = ((f32*)M)[ 7];
o[14] = ((f32*)M)[11];
o[15] = ((f32*)M)[15];*/
}
} // end namespace core
} // end namespace irr
#endif
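The shuffle-based multiply can be sanity-checked against a scalar reference. A standalone sketch of one output row, using the same broadcast pattern as the operator* above (function names are illustrative):

```cpp
#include <xmmintrin.h>

// One output row of the shuffle-based product used in matrix4sse::operator*:
// out = b[0]*aRow0 + b[1]*aRow1 + b[2]*aRow2 + b[3]*aRow3,
// where b is one row of the right-hand matrix and aRow* are rows of *this.
// The masks 0x00/0x55/0xaa/0xff broadcast lanes 0..3 of b across a register.
inline __m128 mulRow(const __m128 a[4], __m128 b)
{
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(a[0], _mm_shuffle_ps(b, b, 0x00)),
                   _mm_mul_ps(a[1], _mm_shuffle_ps(b, b, 0x55))),
        _mm_add_ps(_mm_mul_ps(a[2], _mm_shuffle_ps(b, b, 0xaa)),
                   _mm_mul_ps(a[3], _mm_shuffle_ps(b, b, 0xff))));
}

// Scalar reference for the same row, matching the commented-out
// element-by-element code kept in the header.
inline void mulRowScalar(const float a[16], const float b[4], float out[4])
{
    for (int c = 0; c < 4; ++c)
        out[c] = b[0]*a[c] + b[1]*a[4+c] + b[2]*a[8+c] + b[3]*a[12+c];
}
```

With exact-representable inputs the two agree bit-for-bit, which is a quick way to catch a wrong shuffle mask.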
devsh wrote:SSE is SSE, boosts 4d vector ops by 400%
You're missing my point. Even if it is 14% faster, you're just getting a nominal performance bump at the cost of specializing the code for a certain processor architecture. IRRLICHT_FAST_MATH is disabled by default for similar reasons.
I have a 13% increase on the += operator
None on the + operator
I have 30% on just the assignment from ONE float i.e. vector3df(1.f)
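Those three numbers line up with what the instruction streams look like; a rough sketch (illustrative type, not the real vector3df):

```cpp
#include <xmmintrin.h>

// A plausible explanation of the measurements, not a definitive one:
// += updates the register in place, while + must construct and return a
// temporary, whose store/reload around the call eats the SIMD gain.
// The single-float constructor case is one broadcast (_mm_set1_ps).
struct V
{
    __m128 v;
    explicit V(__m128 m) : v(m) {}
    explicit V(float s) : v(_mm_set1_ps(s)) {} // the vector3df(1.f) case
    V& operator+=(const V& o) { v = _mm_add_ps(v, o.v); return *this; }
    V operator+(const V& o) const { return V(_mm_add_ps(v, o.v)); }
};
```

So a gain on += and on the broadcast constructor, with none on +, is roughly what one would expect.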
- Admin
- Posts: 14143
- Joined: Wed Apr 19, 2006 9:20 pm
- Location: Oldenburg(Oldb), Germany
Did you run the regression suite against your new implementation? It looks like the internal structure has changed so much that safe access through all operators is no longer guaranteed.
There's no problem with specializing for one or the other architecture. We have this all over the place. But the additional maintenance overhead must be justified, either by being very low or by giving much benefit.
hybrid wrote:There's no problem with specializing for one or the other architecture. We have this all over the place. But the additional maintenance overhead must be justified, either by being very low or by giving much benefit.
What else in the engine is restricted to x86? I thought a big selling point of Irrlicht was that with a little tweaking it can be compiled for Xbox, Windows Mobile, iPhone OS, Android, etc.
slavik262 wrote:What else in the engine is restricted to x86? I thought a big selling point of Irrlicht was that with a little tweaking it can be compiled for Xbox, Windows Mobile, iPhone OS, Android, etc.
The code would not restrict anything to a certain platform. It would simply give certain extra support (performance, no extra functionality) on some platforms, pretty much like the D3D drivers, which are also Windows-only.
devsh wrote:actually x86 and x64, yeah I get the point... even I don't think it's worth it
... together with the PPC architectures, that's on the Xbox 360/PS3 as well. I wouldn't be surprised if SSE-like instructions were available on mobile devices quite soon. If you design your code to be SSE-friendly, you can get quite large optimizations in calculation-heavy places.
You might even get it so fast that a brute-force solution (such as culling dynamic objects) becomes faster and easier than some sort of culling tree, because of the sheer number of operations you can push through instead of stalling the CPU on L2 cache misses, mispredicted branches, and pipeline flushes on in-order architectures.
I think implementing this as an SSE library would be good, since it designs the software for the future. Maybe don't replace the existing vectors, but instead add a library that can be used in the places where there is a gain from it.
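To make the brute-force idea concrete, here is a sketch of culling four spheres per iteration over structure-of-arrays data (hypothetical names, not an Irrlicht API; assumes 16-byte-aligned arrays and a count divisible by 4):

```cpp
#include <xmmintrin.h>

// Brute-force distance cull against a reference point: four spheres per
// iteration over structure-of-arrays data. A real version would handle
// the array tail and use a proper frustum test; this only shows the
// data layout and the branch-free inner loop.
inline void cullSpheres(const float* cx, const float* cy, const float* cz,
                        const float* r, int count,
                        float px, float py, float pz, float maxDist,
                        int* visible)
{
    const __m128 qx = _mm_set1_ps(px), qy = _mm_set1_ps(py), qz = _mm_set1_ps(pz);
    const __m128 d  = _mm_set1_ps(maxDist);
    for (int i = 0; i < count; i += 4)
    {
        // squared distance from the point to four sphere centers at once
        __m128 dx = _mm_sub_ps(_mm_load_ps(cx + i), qx);
        __m128 dy = _mm_sub_ps(_mm_load_ps(cy + i), qy);
        __m128 dz = _mm_sub_ps(_mm_load_ps(cz + i), qz);
        __m128 dist2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)),
                                  _mm_mul_ps(dz, dz));
        // visible if dist <= maxDist + radius, compared in squared form
        __m128 limit = _mm_add_ps(d, _mm_load_ps(r + i));
        __m128 mask = _mm_cmple_ps(dist2, _mm_mul_ps(limit, limit));
        int bits = _mm_movemask_ps(mask); // one result bit per sphere
        for (int k = 0; k < 4; ++k)
            visible[i + k] = (bits >> k) & 1;
    }
}
```

Because the data is contiguous and the loop is branch-free, this tends to keep the FPU busy instead of stalling on cache misses the way a pointer-chasing culling tree does, which is the trade-off described above.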