Is there a more efficient way to allocate vertices/indices?

Cube_ · Post by **Cube_** » Sat May 14, 2016 11:07 am

Currently I'm doing this:

 
//unwrapped three loops into one, it's not the prettiest but it's about 30-40% faster depending on the specific platform (cache related?)
for (int i = 0; i < CHUNKTOTAL; i++)
    {
        if (x >= CS) //logic to calculate whether to increment Y and reset X
        {
            x = 0;
            y++;
        }
        if (y >= CS) //logic to calculate whether to increment Z and reset Y
        {
            y = 0;
            z++;
        }
        if (z >= CS)  //logic to calculate whether to reset Z
        {
            z = 0;
        }
 
                //push back 12 vertices, this is optimal for a cube.
        buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z - 0, -2, 2, -2, c, 0, 0));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z - 0, 2, 2, -2, c, 1, 0));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z - 0, 2, -2, -2, c, 1, 1));
        buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z - 0, -2, -2, -2, c, 0, 1));
        buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z + 1, -2, 2, 2, c, 1, 0));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z + 1, 2, 2, 2, c, 0, 0));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z + 1, 2, -2, 2, c, 0, 1));
        buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z + 1, -2, -2, 2, c, 1, 1));
        buf->Vertices.push_back(video::S3DVertex(x - 0, y + 1, z - 0, -2, 2, -2, c, 1, 1));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y + 1, z - 0, 2, 2, -2, c, 0, 1));
        buf->Vertices.push_back(video::S3DVertex(x + 1, y - 0, z - 0, 2, -2, -2, c, 0, 0));
        buf->Vertices.push_back(video::S3DVertex(x - 0, y - 0, z - 0, -2, -2, -2, c, 1, 0));
 
                //push back 36 indices
        buf->Indices.push_back(numVertices + 0);
        buf->Indices.push_back(numVertices + 1);
        buf->Indices.push_back(numVertices + 2);
        buf->Indices.push_back(numVertices + 2);
        buf->Indices.push_back(numVertices + 3);
        buf->Indices.push_back(numVertices + 0);
 
        buf->Indices.push_back(numVertices + 7);
        buf->Indices.push_back(numVertices + 6);
        buf->Indices.push_back(numVertices + 5);
        buf->Indices.push_back(numVertices + 5);
        buf->Indices.push_back(numVertices + 4);
        buf->Indices.push_back(numVertices + 7);
 
        buf->Indices.push_back(numVertices + 4);
        buf->Indices.push_back(numVertices + 0);
        buf->Indices.push_back(numVertices + 3);
        buf->Indices.push_back(numVertices + 3);
        buf->Indices.push_back(numVertices + 7);
        buf->Indices.push_back(numVertices + 4);
 
        buf->Indices.push_back(numVertices + 1);
        buf->Indices.push_back(numVertices + 5);
        buf->Indices.push_back(numVertices + 6);
        buf->Indices.push_back(numVertices + 6);
        buf->Indices.push_back(numVertices + 2);
        buf->Indices.push_back(numVertices + 1);
 
        buf->Indices.push_back(numVertices + 9);
        buf->Indices.push_back(numVertices + 8);
        buf->Indices.push_back(numVertices + 4);
        buf->Indices.push_back(numVertices + 4);
        buf->Indices.push_back(numVertices + 5);
        buf->Indices.push_back(numVertices + 9);
 
        buf->Indices.push_back(numVertices + 11);
        buf->Indices.push_back(numVertices + 10);
        buf->Indices.push_back(numVertices + 6);
        buf->Indices.push_back(numVertices + 6);
        buf->Indices.push_back(numVertices + 7);
        buf->Indices.push_back(numVertices + 11);
        numVertices += 12;
 
        x++; //increment x, otherwise the loop gets wonky, this one doesn't really need a comment but I commented everything else...
    }
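For reference, the counter logic in the loop above is equivalent to deriving the coordinates directly from the flat index. A minimal sketch, assuming CS is 16 (so CHUNKTOTAL is 4096, the cube count mentioned in this thread); CellPos and cellFromIndex are hypothetical names, not part of Irrlicht:

```cpp
#include <cassert>

// Assumed chunk side length; the thread's CS may differ.
constexpr int CS = 16;
constexpr int CHUNKTOTAL = CS * CS * CS;

struct CellPos { int x, y, z; };

// Recover (x, y, z) from the flat loop index i.
// x varies fastest, then y, then z, matching the counters in the loop.
inline CellPos cellFromIndex(int i)
{
    assert(i >= 0 && i < CHUNKTOTAL);
    return CellPos{ i % CS, (i / CS) % CS, i / (CS * CS) };
}
```

Whether this beats the running counters would need measuring; with a power-of-two CS, compilers typically reduce the divisions to shifts and masks.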

So, at face value that's pretty optimized compared to the buf->append(Vertices, 12, Indices, 36) method.
But that's still 196608 memory write operations (4096 cubes × 48 push_backs each).
That is extremely slow.

If I could push back all verts as one operation and all indices as one operation per cube, that'd only be 8192 operations.
I needn't point out why this is a bottleneck; that is blatantly obvious on its own (memory writes are slow, and I'm doing a lot of them).

Other problems with that snippet:
32^3 cubes cause a 16-bit index count overflow. [That's unoptimized; the optimized version can be mathematically proven to only ever contain exactly half the number of indices that the unoptimized one does in the worst case, but ideally I'd need more than 32^3.] Either the index logic can be optimized (can you do better than 36 indices/cube?) or I need to use a 32-bit SMeshBuffer (although after peeking at the source code, it sounds like I'd have to manually patch Irrlicht to support this).
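The overflow can be checked with a little arithmetic. A minimal sketch, assuming 12 vertices per cube as in the snippet and 16-bit indices that can address 65536 distinct vertices; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kVertsPerCube = 12;      // from the snippet above
constexpr std::uint32_t kMaxU16Vertices = 65536; // indices 0..65535

// Vertices needed for a fully populated chunk of side^3 cubes.
constexpr std::uint32_t verticesFor(std::uint32_t side)
{
    return side * side * side * kVertsPerCube;
}

constexpr bool fitsIn16BitIndices(std::uint32_t side)
{
    return verticesFor(side) <= kMaxU16Vertices;
}

// How many whole cubes one 16-bit-indexed buffer can hold.
inline std::uint32_t maxCubesPerBuffer(std::uint32_t vertsPerCube)
{
    assert(vertsPerCube > 0);
    return kMaxU16Vertices / vertsPerCube;
}

static_assert(verticesFor(16) == 49152, "16^3 cubes fit");
static_assert(fitsIn16BitIndices(16), "16^3 cubes fit");
static_assert(verticesFor(32) == 393216, "32^3 cubes do not fit");
static_assert(!fitsIn16BitIndices(32), "32^3 cubes do not fit");
```

So a fully populated 16^3 chunk fits in one 16-bit buffer, while 32^3 would need either several buffers or wider indices.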

In addition, at 3.54MB of just mesh data per chunk (minimum; a 16-bit value is only guaranteed to be at least 16 bits, it can be larger), that gets unwieldy very fast. Is there some way I can optimize memory performance without sacrificing render performance? Instancing comes to mind, which replaces the 864 bytes/cube with sizeof(SMesh*).
But this would incur 1 drawcall/ptr which would be several thousand drawcalls per chunk, that compounds into hundreds of thousands of drawcalls in a scene (unless I misunderstand instancing in this case, but I'm fairly certain I don't and even if I didn't we're comparing 1 drawcall/chunk to a minimum of 6 drawcalls/chunk since there are 6 sides to a cube, add in that it also makes greedy meshing or similar optimization algorithms impossible and the performance still takes a plunge).

For rather obvious reasons that's not desirable either.

So to reiterate: how do I solve these rather memory-constrained problems in a performance-friendly way? Is this really the most efficient way of allocating mesh data? I'd really rather not write a patch for an optimized vertex format and allocator just yet, as that would make my already constrained schedule (juggling coding with real life) less than easy to manage.
Well, I suppose the fact that it's 8 am may be clouding my judgement, although I can't find a more efficient solution - for my previous (far less memory and performance constrained) projects this allocation method has worked fine, but at this scale it completely falls apart.

hendu · Post by **hendu** » Sat May 14, 2016 4:33 pm

1. 200k memory writes of a few bytes each, that's a couple mb. Your RAM is easily 3GB/s.

Thus it's not the number of operations or bytes, but your overhead - if you don't pre-reserve the buf size, there's a lot of memory allocations and copies as it grows. The other slow part is that you call the S3DVertex constructor 200k times, and some of the parameters are always constant too.

Since S3DVertex has nothing magic, you should instead allocate them in one big block, and set just the changing attributes for each. No constructor calls then. As a final step, to get it into your SMeshBuffer, you'd set the buf size and just do a memcpy.

2. I think you do misunderstand instancing. "Please draw this cube 1 million times, in these positions, here's an array of per-instance data to use in the shader" - that's one draw call for all the cubes. Or if you wish, smaller units like a chunk or whatever.

They don't have to have the same textures, or have every side use the same texture. That's what you use the per-instance data for, mapping the desired texture to the desired sides in the shader.
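To make the "array of per-instance data" concrete, here is a rough CPU-side sketch. CubeInstance and buildInstances are hypothetical names, the layout is only one possibility, and no Irrlicht support is assumed; the actual draw (in raw OpenGL, something like glDrawElementsInstanced) is not shown:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-instance record: one per cube, read by the shader.
struct CubeInstance {
    float x, y, z;                // cube position within the chunk
    std::uint8_t faceTexture[6];  // texture slot for each of the 6 sides
};

// Build one instance per solid cell of a cs^3 occupancy grid.
// 'solid' is indexed as x + y*cs + z*cs*cs.
inline std::vector<CubeInstance> buildInstances(const std::vector<bool>& solid,
                                                int cs, std::uint8_t texture)
{
    assert(solid.size() == static_cast<std::size_t>(cs) * cs * cs);
    std::vector<CubeInstance> out;
    out.reserve(solid.size());
    std::size_t i = 0;
    for (int z = 0; z < cs; ++z)
        for (int y = 0; y < cs; ++y)
            for (int x = 0; x < cs; ++x, ++i) {
                if (!solid[i]) continue;
                CubeInstance inst;
                inst.x = static_cast<float>(x);
                inst.y = static_cast<float>(y);
                inst.z = static_cast<float>(z);
                for (int f = 0; f < 6; ++f) inst.faceTexture[f] = texture;
                out.push_back(inst);
            }
    return out;
}
```

The shared cube mesh is stored once; each cube then costs only its instance record (about 20 bytes in this layout) instead of a full set of vertices and indices.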

Cube_ · Post by **Cube_** » Sun May 15, 2016 1:54 am

hendu wrote:1. 200k memory writes of a few bytes each, that's a couple mb. Your RAM is easily 3GB/s.

Misnomer, that's best case.
The real case has other things writing to memory, not to mention that it does not have optimal bus allocation and thus wastes quite a bit of bus width for no good reason. The buffer is already pre-allocated (buf->Vertices.reallocate(CHUNKTOTAL*12), buf->Indices.reallocate(CHUNKTOTAL*36)), but this structure takes 150ms to build in the debug configuration with no optimization flags, which is unacceptably slow for 4096 cubes. With optimization flags (O1) it takes 50ms, which is still pretty slow for 4096 cubes; the real case should preferably be under 16ms/chunk even on a machine up to 40% slower than this one.

The goal here is to be able to construct the mesh in less than 5 frames worst case, 2 frames best case. Otherwise it just isn't interactive enough to be feasible, 150ms is unacceptable. 50ms is barely acceptable for a worst case, but that's the average case even with optimization flags (still an unstripped debug build but I also have a relatively beefy cpu as compared to target, which would run at least 20% slower on cpu operations).

hendu wrote:Thus it's not the number of operations or bytes, but your overhead - if you don't pre-reserve the buf size, there's a lot of memory allocations and copies as it grows. The other slow part is that you call the S3DVertex constructor 200k times, and some of the parameters are always constant too.

yes, there's a reason why I'm asking if there's a more efficient way to allocate vertices - I didn't find any after reading the docs and examples, rather this is the best I could figure out given the interfaces I'm aware of.

hendu wrote:Since S3DVertex has nothing magic, you should instead allocate them in one big block, and set just the changing attributes for each. No constructor calls then. As a final step, to get it into your SMeshBuffer, you'd set the buf size and just do a memcpy.

What, like a big vector of null initialized vertices that I iterate over and alter the relevant parameters as needed?
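The idea being discussed (size the storage once, copy a constant cube template, then touch only the positions) could look roughly like this. It is a sketch under assumptions: Vertex is a plain-data stand-in for video::S3DVertex, and fillChunk is a made-up name. Copying the result into an SMeshBuffer is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Plain-data stand-in for video::S3DVertex (assumption, no constructor).
struct Vertex {
    float x, y, z;
    float nx, ny, nz;
    std::uint32_t color;
    float u, v;
};
static_assert(std::is_trivially_copyable<Vertex>::value, "memcpy must be safe");

// Fill 'dst' with one copy of 'cubeTemplate' per cell of a cs^3 chunk.
// Normals, colour and UVs come from the template; only positions change.
inline void fillChunk(std::vector<Vertex>& dst,
                      const std::vector<Vertex>& cubeTemplate, int cs)
{
    assert(!cubeTemplate.empty() && cs > 0);
    const std::size_t perCube = cubeTemplate.size();
    const std::size_t cubes = static_cast<std::size_t>(cs) * cs * cs;
    dst.resize(cubes * perCube);  // one allocation, no growth while filling
    Vertex* out = dst.data();
    for (int z = 0; z < cs; ++z)
        for (int y = 0; y < cs; ++y)
            for (int x = 0; x < cs; ++x) {
                std::memcpy(out, cubeTemplate.data(), perCube * sizeof(Vertex));
                for (std::size_t k = 0; k < perCube; ++k) {
                    out[k].x += static_cast<float>(x);
                    out[k].y += static_cast<float>(y);
                    out[k].z += static_cast<float>(z);
                }
                out += perCube;
            }
}
```

Note that resize still zero-fills the storage once before the copies; whether that matters would need profiling. The index buffer could be filled the same way from a 36-entry template plus a per-cube base offset.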

hendu wrote:2. I think you do misunderstand instancing. "Please draw this cube 1 million times, in these positions, here's an array of per-instance data to use in the shader" - that's one draw call for all the cubes. Or if you wish, smaller units like a chunk or whatever.

I did say that with:

(unless I misunderstand instancing in this case, but I'm fairly certain I don't and even if I didn't we're comparing 1 drawcall/chunk to a minimum of 6 drawcalls/chunk since there are 6 sides to a cube, add in that it also makes greedy meshing or similar optimization algorithms impossible and the performance still takes a plunge).

but even then for optimization reasons I'd have to have one draw call per face, as they would be separate meshes.
That is six drawcalls/chunk.
Versus one drawcall/chunk for the current method.

It's fewer net drawcalls, yes. But it also risks vertex starving as I can no longer utilize strategies such as greedy meshing (and I seem to recall instancing being ineffectual for small meshes)
And that's not taking LOD meshes into account, they're too simplified and would require their own pre-calculated meshes.

hendu wrote:They don't have to have the same textures, or have every side use the same texture. That's what you use the per-instance data for, mapping the desired texture to the desired sides in the shader.

And does irrlicht support such instancing? I'd be willing to implement a test case at the very least.

Anyway, to summarize my whole issue with my current method of allocation (speed-wise):
The whole point is that it's faster to batch multiple writes in one operation, as it utilizes the memory bandwidth significantly better, especially on modern systems with wide 64-bit buses.
This could be significantly faster (at least 30ms faster).

As for compressing the vertex format: I wager irrlicht does not have such a format, but here's one way to save a lot of memory:
originally we have: S3DVertex (f32 x, f32 y, f32 z, f32 nx, f32 ny, f32 nz, SColor c, f32 tu, f32 tv) - or 48 bytes.
That's a big vertex.

This could logically be compressed in-memory (at loss of UV layout precision) to (this would, presumably, only save memory on the ghost-copy in RAM, the version in VRAM may require all of the 48 bytes to be present):
S3DVertex (f32 x, f32 y, f32 z, f32 UV) //16 bits for U, 16 bits for V - a simple bitmask, with almost no (or no, if the compiler is even remotely intelligent) decoding overhead.
That's significantly smaller at 16 bytes, you lose control over nx, ny, nz, and the color (but that would often be left default at a medium grey or a white, and then everything else is done with diffuse textures).
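A rough sketch of such a 16-byte vertex follows. It departs from the description in one respect, stated plainly: the packed UV is held in a 32-bit unsigned integer rather than an f32, since arbitrary bit patterns in a float can form NaNs. PackedVertex, packUV, unpackU and unpackV are hypothetical names, and U and V are assumed to lie in [0, 1].

```cpp
#include <cassert>
#include <cstdint>

struct PackedVertex {
    float x, y, z;
    std::uint32_t uv;  // high 16 bits: U, low 16 bits: V
};
static_assert(sizeof(PackedVertex) == 16, "expected a 16-byte vertex");

// Quantize u and v (each in [0, 1]) to 16 bits and pack them together.
inline std::uint32_t packUV(float u, float v)
{
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);
    const std::uint32_t qu = static_cast<std::uint32_t>(u * 65535.0f + 0.5f);
    const std::uint32_t qv = static_cast<std::uint32_t>(v * 65535.0f + 0.5f);
    return (qu << 16) | qv;
}

// Decoding is a shift or mask plus one division.
inline float unpackU(std::uint32_t uv)
{
    return static_cast<float>(uv >> 16) / 65535.0f;
}

inline float unpackV(std::uint32_t uv)
{
    return static_cast<float>(uv & 0xFFFFu) / 65535.0f;
}
```

The precision loss is at most half a step of 1/65535 per coordinate, which is likely fine for cube faces with corner UVs of 0 or 1.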

But that's a minor thing in comparison to the memory operations.

Far more relevant is 32-bit index buffers, I haven't found a way to enable them without manually patching the SMeshBuffer to allow such.

Well, I suppose it's a bit rambly.

The core questions can really be simplified as such:
1. how to allocate vertices and indices more efficiently? [answered above? I'm going to try a large vector and only changing what's needed, to see how it impacts performance]
2. Is there a more memory efficient vertex structure available?
3. Are 32-bit index buffers supported for SMeshBuffers?
4. If instancing is used, then what additional optimization schemes are available? Greedy meshing is likely out as that involves rewriting the underlying mesh, and I can't pre-calculate (CS^3)! chunks - even at 16^3 chunks that's ~Infinity variations (3.642736389e+13020 possible arrangements (severely rounded); at 3.54MB each that'd be something like 3.3e+13008TB of cache. It's actually about 10-50% of this in the real case, because the worst case is provably at most half of this simulated worst-case load, but even at 10% that's 3.3e+1301TB of cache or so, which is impossible within the 4G constraint I have for both hardware and scope reasons).
This forces me to pre-cache very small sets of data, such as one face, at this point the question becomes: what optimization schemes are possible, presumably instancing to this extent would have quite a bit of overhead. (although if each instance pointer is held in a red/black tree it'd at least have good rebuild speeds, what with searching, inserting, and removing being O(log n) - while taking O(n) space overhead).*

*In any case, I guess I'll read up on instancing in Irrlicht, it's at the very least worth setting up a test case to compare instanced vs non-instanced performance (I predict memory footprint will shrink, but memory overhead may get larger).

hendu · Post by **hendu** » Sun May 15, 2016 9:31 am

"-O1" - stop doing that. Check any compiler benchmark on Phoronix, at -O1 your performance can be five times worse vs -O3.

2. You can always create your own vertex structure, or try the fvf branch with it.
3. No.
4. So you're doing Minecraft except with tiny cubes, such that it vastly exceeds the capability of current hardware? I probably don't need to say what I think...

Irr does not support instancing. Mine does, and I guess devsh's will soon.

but even then for optimization reasons I'd have to have one draw call per face, as they would be separate meshes.

Why would they have to be separate? You say that, but give no reasons for it.

Cube_ · Post by **Cube_** » Sun May 15, 2016 1:12 pm

hendu wrote:"-O1" - stop doing that. Check any compiler benchmark on Phoronix, at -O1 your performance can be five times worse vs -O3.

You know what? I'll appease you and run some O3 tests and get back to you on that.
But O3 is terrible for debugging, and debugging with long delays is ineffectual.

hendu wrote:2. You can always create your own vertex structure, or try the fvf branch with it.

Right, I guess I'll go have a peek at the fvf branch to start with then.

hendu wrote:3. No.

Gotcha.

hendu wrote:4. So you're doing Minecraft except with tiny cubes, such that it vastly exceeds the capability of current hardware? I probably don't need to say what I think...

Not so, first of all cubes don't automagically make it minecraft - but aside from that, I'm by no means pushing the capability of current hardware, I'm only pushing my own ability to optimize it for current hardware.

hendu wrote:Irr does not support instancing. Mine does, and I guess devsh's will soon.

Right, gotcha - guess I'll have to find a patch or build the functionality myself then. At least that saves me time trying to figure out which option I'll have to pursue.

hendu wrote:Why would they have to be separate? You say that, but give no reasons for it.

Naturally to reduce the total number of triangles pushed to the GPU. At any point up to 3 cube faces can be visible - instancing every combination of 3 faces that share at least one edge would be a total of 8 instances, which would be more drawcalls than one instance per face (but it's a larger instance, so is it faster because of higher instancing efficiency? I honestly don't know, I only have tangential experience with instancing).
even an instanced mesh adds rendering overhead, and adding rendering overhead from faces that would be occluded? Doesn't sound like the smartest implementation.
Significantly more importantly however: presumably it would use up more VRAM (not to mention completely trash the PCIe bandwidth) which is at a higher premium than ever in any software dealing with large quantities of data.

Also: gut-feeling tells me that it's probably not a brilliant idea to instance one block/face/whatever at a time, simply due to the fact that write-ups and examples of voxel engines never utilize such a strategy, rather opting for more complex solutions such as greedy meshing.

Found this, well - whatever.

I don't think you can do it efficiently with instances. Vast majority of faces/cubes is never visible and you can save a lot of time by not rendering them. Unfortunately that makes every cube a different object.

from: http://stackoverflow.com/questions/9152 ... side-for-a

There may be an efficient instancing strategy available for something like this, I just can't think of it.
I was already going to instance things like dynamic entities (enemies and whatnot) for performance-determinism reasons.
Perhaps I should use some sort of combined strategy of greedy meshing and an SVO (Sparse Voxel Octree) or other such structure.

Well in any case, you more or less answered most of my questions

hendu · Post by **hendu** » Sun May 15, 2016 2:42 pm

Yeah, of course you'd debug with lower optimization. You however quoted benchmark scores from -O1.

Instancing always saves VRAM over sending everything comparing apples with apples, I don't know why you think it's otherwise. The triangles facing away are efficiently culled by the normal backface culling, they do have some hit, but doubtful it's that big.

Cube_ · Post by **Cube_** » Sun May 15, 2016 3:42 pm

hendu wrote:Yeah, of course you'd debug with lower optimization. You however quoted benchmark scores from -O1.

Ah yes, ran some O3 benchmarks
average case 0ms
worst case 5ms
best case 0ms

The problem, which is why I was building with O1 to begin with, is that when I get to 3.9G of memory it freezes with O2 and O3, with O1 and debug builds it does not, well - I guess I'll just have to solve that by not going past 3.9G of memory.

hendu wrote:Instancing always saves VRAM over sending everything comparing apples with apples, I don't know why you think it's otherwise. The triangles facing away are efficiently culled by the normal backface culling, they do have some hit, but doubtful it's that big.

I'm more concerned with bandwidth, mostly based on reading about instanced voxel worlds - I have little practical experience on the topic however, either way I suppose it's worth a test at least.

As for backface culling, the average visible face count for a voxel would be closer to 0 than 1 (most faces are occluded), I wager it'd be pretty expensive to cull.
And even if backface culling is cheap, there will also be depth culled meshes that take up time.
I suppose the question is, how much (which I won't know until I've written (or found) an instancing patch)

mongoose7 · Post by **mongoose7** » Sun May 15, 2016 10:47 pm

I don't know how often you intend to build these blocks, or what shape you are making (the example is simply a cube), but I think you should be doing something elementary in the building, like considering the position of the camera. Build the cubes closest to the camera and put a flag in a 3-dimensional table to say that there is a cube at that position. Then for every other cube, see if there are cubes in front of it by referencing the table, and add it if not, to the table as well. Also, don't build whole cubes, just one, two or three sides, depending on the speed of the camera and how often you build. In fact, as it takes less than 5 ms to build the mesh, you can devote a thread just to building the mesh while rendering the previous version.
Of course, if you want to use instancing, you need all the vertices, but I think you can upload the index list on a per frame basis.

Cube_ · Post by **Cube_** » Mon May 16, 2016 1:23 am

At this stage I primarily focus on cubes, other primitives may be implemented later but at this point it's not even a consideration.

The camera position is not considered at this point, although that's a good idea.

As for the optimizations: Yes, I know I should build at most 3 faces per cube (but I've yet to implement that logic, I had a slow naive version but I'm refactoring my slower classes and I've yet to get there*), as for threading: OpenGL isn't thread safe and while I have an idea on how to do it I'd have to implement nontrivial non-locking syncing mechanisms which is an optimization for later.

*optimization runs:
1. skip cubes that are completely occluded
2. build only faces touching air
3. discard chunks not hit by a camera ray, they can be rebuilt when they can be seen.
4. run greedy algorithm on remaining chunks to reduce triangle count.
5. Potentially other. if needed.
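Steps 1 and 2 of the list above can be sketched as a neighbour test over an occupancy grid. This is only an illustration: Chunk and countExposedFaces are hypothetical names, and cells outside the chunk are treated as air, which a real engine would probably replace with a lookup into the neighbouring chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Occupancy grid for one chunk; solid[x + y*cs + z*cs*cs] is true for a cube.
struct Chunk {
    int cs;
    std::vector<bool> solid;

    bool at(int x, int y, int z) const
    {
        // Assumption: everything outside the chunk counts as air.
        if (x < 0 || y < 0 || z < 0 || x >= cs || y >= cs || z >= cs)
            return false;
        const std::size_t n = static_cast<std::size_t>(cs);
        return solid[static_cast<std::size_t>(x) +
                     static_cast<std::size_t>(y) * n +
                     static_cast<std::size_t>(z) * n * n];
    }
};

// Count the faces that touch air; only these need vertices and indices.
// Fully enclosed cubes contribute nothing, which covers step 1 as well.
inline std::size_t countExposedFaces(const Chunk& c)
{
    assert(c.solid.size() == static_cast<std::size_t>(c.cs) * c.cs * c.cs);
    static const int dir[6][3] = {
        { 1, 0, 0 }, { -1, 0, 0 }, { 0, 1, 0 },
        { 0, -1, 0 }, { 0, 0, 1 }, { 0, 0, -1 }
    };
    std::size_t faces = 0;
    for (int z = 0; z < c.cs; ++z)
        for (int y = 0; y < c.cs; ++y)
            for (int x = 0; x < c.cs; ++x) {
                if (!c.at(x, y, z)) continue;
                for (int d = 0; d < 6; ++d)
                    if (!c.at(x + dir[d][0], y + dir[d][1], z + dir[d][2]))
                        ++faces;
            }
    return faces;
}
```

For a completely solid 16^3 chunk this leaves 6 × 16^2 = 1536 faces out of 4096 × 6 = 24576, before any greedy merging.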

As for instancing: I probably won't use that, instancing very large quantities of small meshes is inefficient to my understanding, and instancing chunks is impractical (that'd take petabytes of memory due to the sheer amount of viable variations - even if I break it down into sub-segments of 4^3 that's still 1.1e+77PB of ram - although in segments of 2^3 there are only some 40k variations, which would break down to (assuming all variations include mesh data, they don't but it gets hard to math otherwise) ~34mb - but that would require thousands upon thousands of instances of several 40k base objects, which would be ~40k drawcalls)

mongoose7 wrote:Of course, if you want to use instancing, you need all the vertices, but I think you can upload the index list on a per frame basis.

This part intrigues me though, effectively index the entire field of vertices and feed it indices as needed? This would cut the memory of all the vertices per chunk and only require me to store the indices, correct? (that'd save me 288 bytes per cube, which is ~15% of the memory used)

mongoose7 · Post by **mongoose7** » Mon May 16, 2016 8:07 am

I don't know what you mean. I mean that, with all the vertices in a GPU buffer, you could just pass the indices on each frame. This would be possible in OpenGL but I don't know if you can get Irrlicht to do it.

Also, the multithreading I was referring to doesn't require synchronisation primitives. The building thread performs
    if (last buffer released)
        set buffer in use
        build mesh
        set buffer ready

and the rendering thread:
    if (next buffer ready)
        release current
        render next (set as current)
    else
        render current buffer

That is, there is a buffer that is being rendered and one that is being built. The build thread marks the buffer as ready to be rendered and the rendering thread marks the buffer as released, if the other buffer is ready. There are three states and one thread can change STATE1 to STATE2 and STATE3 to STATE1 and the other thread can change STATE2 to STATE3, so they do not need synchronisation.

hendu · Post by **hendu** » Mon May 16, 2016 8:10 am

That does need locking because of compiler optimization and caches. Otherwise it can do things like move your state check around, or optimize it away.

mongoose7 · Post by **mongoose7** » Mon May 16, 2016 8:20 am

You'll have to give me a diagram.

The build thread can only set in-use and ready. It keeps looking for a buffer that is released. It won't find one until the rendering thread makes a change. OK so far?
The rendering thread presumably has a buffer to render (provided at least by init) and the last possible action was to set the other buffer to released.
Coherence means that the change made by one thread may not be seen for some microseconds, but this doesn't matter because the state won't be reversed.
The rendering thread sees the new buffer marked ready. It is not possible for this to occur until the buffer is built, because the last action of the building thread was to mark it ready, and the last action of the rendering thread was to mark it released. That is, both threads see it as released, the build thread changes it to in-use and then ready. All that the rendering thread is interested in is seeing ready, so it doesn't matter if it sees in-use or not. And once the build thread sets it to ready, it is no longer interested until it sees released.

I don't think the Petersen paradox has a look in here. But, please, show me your timing diagrams for each thread.

(Err, only need two states, of course. Build thread needn't change released until it sets ready. Build thread: released -> ready; rendering thread: ready -> released.)

hendu · Post by **hendu** » Mon May 16, 2016 8:57 am

Creating thread:

if (buffer[0] == free) {
  build();            // many writes into buffer 0's mesh data
  buffer[0] = ready;  // plain write: nothing orders it after build()
} else {
  // same for buffer[1]
}

The compiler can deduce the ready flag write is unrelated to the build writes. As such, it may move the ready write before the build. If the render thread run happens during this time, it accesses the buffer while it's being built, which of course blows up.

mongoose7 · Post by **mongoose7** » Mon May 16, 2016 11:20 am

There may be a danger that the compiler will write the ready flag before the buffer is built, but this can be prevented by declaring the flag volatile. Though for cache coherency, it may still appear that the flag was written before the buffer was complete. So you may need a fence here. Though, doesn't cache coherency ensure that writes are not reordered?

Or: if (build()) flag = ready;

hendu · Post by **hendu** » Mon May 16, 2016 7:03 pm

No, volatile does not prevent write reordering. It only prevents optimizing out. You need a memory barrier, and thread locking has an implicit one.

Cache coherency likewise doesn't affect write reordering. It prevents core #2 from using a cached value from its own cache if core #1 changed it (but only on x86! Other arches need explicit sync).

if (build()) flag = ready;

In this case, the compiler may decide or measure that build() succeeds more often than fails, and pre-emptively write the flag, restoring it if the if fails.

These things are so much fun to debug
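One way to get the required barrier without a mutex is a release/acquire pair on an atomic flag. This is a sketch that assumes C++11 std::atomic is available and uses the two-state scheme (released -> ready, ready -> released) from earlier in the thread; MeshSlot, tryBuild and tryRender are made-up names, with a vector of ints standing in for the mesh:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

enum BufferState : int { Released = 0, Ready = 1 };

struct MeshSlot {
    std::vector<int> mesh;              // stand-in for vertex/index data
    std::atomic<int> state{Released};
};

// Build thread: only touches the slot while it is Released.
inline bool tryBuild(MeshSlot& slot, int fillValue, std::size_t count)
{
    assert(count > 0);
    if (slot.state.load(std::memory_order_acquire) != Released)
        return false;
    slot.mesh.assign(count, fillValue); // the "build()" step
    // Release store: the mesh writes above cannot be reordered after it.
    slot.state.store(Ready, std::memory_order_release);
    return true;
}

// Render thread: only touches the slot while it is Ready.
inline bool tryRender(MeshSlot& slot, long& sumOut)
{
    // Acquire load: seeing Ready guarantees the finished mesh is visible.
    if (slot.state.load(std::memory_order_acquire) != Ready)
        return false;
    long sum = 0;
    for (int v : slot.mesh) sum += v;   // the "render" step
    sumOut = sum;
    slot.state.store(Released, std::memory_order_release);
    return true;
}
```

This only holds for exactly one build thread and one render thread per slot; with more threads on either side a lock or compare-and-swap would be needed.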
