Code:
//flattened three nested loops into one; it's not the prettiest, but it's about 30-40% faster depending on the specific platform (cache related?)
for (int i = 0; i < CHUNKTOTAL; i++)
{
    if (x >= CS) //logic to calculate whether to increment Y and reset X
    {
        x = 0;
        y++;
    }
    if (y >= CS) //logic to calculate whether to increment Z and reset Y
    {
        y = 0;
        z++;
    }
    if (z >= CS) //logic to calculate whether to reset Z
    {
        z = 0;
    }
    //push back 12 vertices, this is optimal for a cube.
    buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z - 0, -2, 2, -2, c, 0, 0));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z - 0, 2, 2, -2, c, 1, 0));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z - 0, 2, -2, -2, c, 1, 1));
    buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z - 0, -2, -2, -2, c, 0, 1));
    buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z + 1, -2, 2, 2, c, 1, 0));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z + 1, 2, 2, 2, c, 0, 0));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z + 1, 2, -2, 2, c, 0, 1));
    buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z + 1, -2, -2, 2, c, 1, 1));
    buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z - 0, -2, 2, -2, c, 1, 1));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z - 0, 2, 2, -2, c, 0, 1));
    buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z - 0, 2, -2, -2, c, 0, 0));
    buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z - 0, -2, -2, -2, c, 1, 0));
    //push back 36 indices
    buf->Indices.push_back(numVertices + 0);
    buf->Indices.push_back(numVertices + 1);
    buf->Indices.push_back(numVertices + 2);
    buf->Indices.push_back(numVertices + 2);
    buf->Indices.push_back(numVertices + 3);
    buf->Indices.push_back(numVertices + 0);
    buf->Indices.push_back(numVertices + 7);
    buf->Indices.push_back(numVertices + 6);
    buf->Indices.push_back(numVertices + 5);
    buf->Indices.push_back(numVertices + 5);
    buf->Indices.push_back(numVertices + 4);
    buf->Indices.push_back(numVertices + 7);
    buf->Indices.push_back(numVertices + 4);
    buf->Indices.push_back(numVertices + 0);
    buf->Indices.push_back(numVertices + 3);
    buf->Indices.push_back(numVertices + 3);
    buf->Indices.push_back(numVertices + 7);
    buf->Indices.push_back(numVertices + 4);
    buf->Indices.push_back(numVertices + 1);
    buf->Indices.push_back(numVertices + 5);
    buf->Indices.push_back(numVertices + 6);
    buf->Indices.push_back(numVertices + 6);
    buf->Indices.push_back(numVertices + 2);
    buf->Indices.push_back(numVertices + 1);
    buf->Indices.push_back(numVertices + 9);
    buf->Indices.push_back(numVertices + 8);
    buf->Indices.push_back(numVertices + 4);
    buf->Indices.push_back(numVertices + 4);
    buf->Indices.push_back(numVertices + 5);
    buf->Indices.push_back(numVertices + 9);
    buf->Indices.push_back(numVertices + 11);
    buf->Indices.push_back(numVertices + 10);
    buf->Indices.push_back(numVertices + 6);
    buf->Indices.push_back(numVertices + 6);
    buf->Indices.push_back(numVertices + 7);
    buf->Indices.push_back(numVertices + 11);
    numVertices += 12;
    x++; //increment x, otherwise the loop gets wonky, this one doesn't really need a comment but I commented everything else...
}
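For anyone reading along, the carry logic in that loop should be equivalent to decoding the flat index with division and modulo. A minimal sketch, assuming x, y and z all start at 0 before the loop and that CHUNKTOTAL is CS^3; the value CS = 16 and the names Cell, decode and carryMatchesDecode are made up for illustration:

```cpp
#include <cassert>

// Assumed chunk edge length; the snippet itself never states CS.
constexpr int CS = 16;
constexpr int CHUNKTOTAL = CS * CS * CS;

struct Cell { int x, y, z; };

// Decode a flat index into (x, y, z): x changes fastest, z slowest.
constexpr Cell decode(int i)
{
    return Cell{ i % CS, (i / CS) % CS, i / (CS * CS) };
}

// Replay the carry logic from the loop and compare it with decode().
inline bool carryMatchesDecode()
{
    int x = 0, y = 0, z = 0; // assumed starting values
    for (int i = 0; i < CHUNKTOTAL; i++)
    {
        if (x >= CS) { x = 0; y++; }
        if (y >= CS) { y = 0; z++; }
        if (z >= CS) { z = 0; }
        const Cell c = decode(i);
        if (c.x != x || c.y != y || c.z != z)
            return false;
        x++;
    }
    return true;
}
```

Whether the decode form is any faster than the carries is something I'd measure rather than assume; the sketch only shows that both visit the same cells in the same order.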
But that's still 196608 memory write operations (48 push_back calls per cube: 12 vertices plus 36 indices).
That is extremely slow.
If I could push back all verts as one operation and all indices as one operation, that'd only be 8192 operations.
I needn't point out why this is a bottleneck; that is blatantly obvious on its own (memory writes are slow, and I'm doing a lot of them).
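One way the "size it once, then fill it" idea could look is sketched here. It uses std::vector as a stand-in for Irrlicht's core::array (which, as far as I know, offers reallocate() and set_used() for the same purpose). Vertex, MeshData and buildChunk are hypothetical names, and the vertex and index values are placeholders rather than real cube geometry:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for video::S3DVertex (hypothetical, position only).
struct Vertex { float x, y, z; };

constexpr std::size_t VERTS_PER_CUBE = 12;   // as in the loop above
constexpr std::size_t INDICES_PER_CUBE = 36; // as in the loop above

struct MeshData
{
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Size both arrays once, then fill them through raw pointers,
// so no reallocation or per-element growth check happens in the loop.
MeshData buildChunk(std::size_t cubeCount)
{
    MeshData m;
    m.vertices.resize(cubeCount * VERTS_PER_CUBE);
    m.indices.resize(cubeCount * INDICES_PER_CUBE);

    Vertex* v = m.vertices.data();
    std::uint16_t* idx = m.indices.data();
    for (std::size_t cube = 0; cube < cubeCount; ++cube)
    {
        const std::size_t base = cube * VERTS_PER_CUBE;
        for (std::size_t k = 0; k < VERTS_PER_CUBE; ++k)
            *v++ = Vertex{ float(cube), float(k), 0.0f }; // placeholder position
        for (std::size_t k = 0; k < INDICES_PER_CUBE; ++k)
            *idx++ = static_cast<std::uint16_t>(base + k % VERTS_PER_CUBE); // placeholder pattern
    }
    return m;
}
```

The number of elements written stays the same (196608 for 4096 cubes), so any gain would come from avoiding repeated growth and call overhead. That is an expectation to profile, not a measured result.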
Other problems with that snippet:
32^3 cubes overflow a 16-bit index. That is the unoptimized case; the optimized version can be mathematically proven to contain exactly half as many indices as the unoptimized one in the worst case, but ideally I'd need more than 32^3. Either the index logic can be optimized (can you do better than 36 indices/cube?) or I need to use a 32-bit SMeshBuffer, although after peeking at the source code it looks like I'd have to manually patch Irrlicht to support this.
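To put numbers on the 16-bit limit: with 12 vertices per cube, a 16-bit index can address at most 65536 vertices, so the ceiling is on cubes per buffer. A rough sketch of that arithmetic, including how many 16-bit buffers a chunk would need if it were split instead of moving to 32-bit indices. The splitting idea and the helper names cubesPerBuffer and buffersNeeded are assumptions for illustration, not something taken from Irrlicht:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t VERTS_PER_CUBE = 12;     // as in the loop above
constexpr std::uint32_t MAX_VERTS_16BIT = 65536; // index values 0..65535

// Largest number of whole cubes whose vertices fit one 16-bit-indexed buffer.
constexpr std::uint32_t cubesPerBuffer()
{
    return MAX_VERTS_16BIT / VERTS_PER_CUBE;
}

// Number of 16-bit buffers needed for a given cube count (rounded up).
constexpr std::uint32_t buffersNeeded(std::uint32_t cubes)
{
    return (cubes + cubesPerBuffer() - 1) / cubesPerBuffer();
}
```

Under these assumptions a 16^3 chunk fits in one buffer while an unoptimized 32^3 chunk would need seven, each of which would presumably be drawn separately.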
In addition, at 3.54MB of just mesh data per chunk (a minimum, since a 16-bit type is only guaranteed to be at least 16 bits and can be larger), that gets unwieldy very fast. Is there some way I can optimize memory usage without sacrificing render performance? Instancing comes to mind, which replaces the 864 bytes/cube with sizeof(SMesh*).
But this would incur 1 drawcall/ptr, which would be several thousand drawcalls per chunk, and that compounds into hundreds of thousands of drawcalls in a scene. Unless I misunderstand instancing in this case (I'm fairly certain I don't), and even if I did, we're comparing 1 drawcall/chunk to a minimum of 6 drawcalls/chunk since there are 6 sides to a cube. Add in that it also makes greedy meshing or similar optimization algorithms impossible, and the performance still takes a plunge.
For rather obvious reasons that's not desirable either.
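The memory figures can be reproduced with a few constants. This is only the arithmetic from the post: 864 bytes/cube is taken as given, 4096 cubes per chunk is inferred from the 196608 and 8192 figures, and bytesForChunks is a made-up helper name:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t BYTES_PER_CUBE = 864;            // figure quoted in the post
constexpr std::uint64_t CUBES_PER_CHUNK = 16 * 16 * 16;  // inferred: 196608 / 48

// Mesh bytes for a number of fully populated, unoptimized chunks.
constexpr std::uint64_t bytesForChunks(std::uint64_t chunks)
{
    return chunks * CUBES_PER_CHUNK * BYTES_PER_CUBE;
}
```

One chunk comes out at 3,538,944 bytes (the 3.54MB above), and a hypothetical scene of 100 such chunks at roughly 354MB, which is where it gets unwieldy.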
So to reiterate: how do I solve these rather memory-constrained problems in a performance-friendly way? Is this really the most efficient way of allocating mesh data? I'd really rather not write a patch for an optimized vertex format and allocator just yet, as that would make my already constrained schedule (juggling coding with real life) less than easy to manage.
Well, I suppose the fact that it's 8 am may be clouding my judgement, but I can't find a more efficient solution. For my previous (far less memory- and performance-constrained) projects this allocation method has worked fine, but at this scale it completely falls apart.