optimizing render performance, looking for information

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 8:17 am

So my test scenes typically contain ~10-15k tris (well, primitives, but since I build all my meshes manually in code I can guarantee they're tris), yet the performance is terrible. For reference, I can handle about 100k tris in MC at about 15 FPS, and MC is terribly optimized; Blender handles over 1M tris, but Blender is really well optimized (3M tris at 5 FPS is the most I'd ever attempt for a scene on this machine).

I have a few theories as to why this is and I'm seeking information as to which one(s) may or may not be the case:

theory one: irrlicht renders each mesh buffer as a draw call, I have thousands of mesh buffers per mesh scene node.
theory two: irrlicht stores a copy of the material assigned to each mesh buffer and then renders the copies as separate draw calls (functionally identical to theory one but semantically different)
theory three: my mesh somehow doesn't agree with irrlicht
theory four: I am vertex bottlenecked at 3 verts per primitive, quads would improve the vert bottleneck at cost of fragment bottlenecking
theory five: irrlicht really doesn't agree with my graphics driver (I find this unlikely)
theory six: irrlicht is terribly slow on linux
theory seven: irrlicht is terribly slow (poor optimization?)

They are ranked by what I'd wager are their probabilities. If each mesh buffer is one draw call, then what's the max vert count for a mesh buffer? (Can a mesh buffer have multiple materials attached to different primitives?)

if theory two is true, then how would one inhibit this behavior?

if theory three is true I am at loss as to how to solve this.

if theory four is true, then that's a reasonably simple fix.

if theory five is true: then I'll just have to hack together my own openGL renderer (not an additional workload I'm looking forward to in this case)

theory six and seven: see theory five.

I wager most of you would be able to identify which theory is the correct one and possibly point out other reasons why my renders are so slow (the 15 fps number was picked because that's what my 10k scenes run at, 10x slower than MC is unacceptable)

Any other information on how to speed up renders is greatly appreciated.

here's my main loop, it's not particularly cluttered:

Code: Select all

    while(device->run())
    {
        if(ticks == 20)
        {
            cMgr->poll();
            ticks=0;
        }
        ++ticks;
        driver->beginScene(true, true, SColor(255,140,255,140));
        
        smgr->drawAll();
        guienv->drawAll();
 
        stringw str = L"Hexahedron World [";
        str += driver->getName();
        str += "] VERSION:";
        str += VERSION;
        str += "Triangles: ";
        str += driver->getPrimitiveCountDrawn();
        device->setWindowCaption(str.c_str());
        
        driver->endScene();
    }

The obvious question is: does cMgr->poll() slow things down? let's have a look

Code: Select all

void chunkMgr::poll()
{
    //unloadList();
 
    //loadList();
    
    buildList();
    
    //updateList();
}

let's check buildList then.

Code: Select all

void chunkMgr::buildList()
{
    for( int index = 0; index != indx; index++)
    {
        if(loadID.at(index) == NULL)
        {
            bList.at(index)=true;
        }
        else
        {
            //printf("Not building local chunk #%d (already built?)\n",index);
            bList.at(index)=false;
        }
    }
    build();
}

nothing odd here, it could be improved by setting a global flag that chunks need building but currently it goes through 1 iteration of a loop, does one if check and prints 1 message to console.
This happens every 20 frames.

let's check build for completeness

Code: Select all

void chunkMgr::build()
{
    for(int index = 0; index != indx; index++)
    {
        if(bList.at(index) == true)
        {
            loadID.at(index) = new chunk(index);
            bList.at(index) = false;
        }
    }
}

again, nothing particularly time consuming. 1 loop iteration (indx is world size cubed, calculated on initialization, 1 cubed is 1), 1 conditional check and that's it.
overall this might add a millisecond or two to the frame in question at most.
in other words: my rendering loop is not at fault.

but for sake of argument, let's get the exact triangle count and the correct mesh buffer count:

Irrlicht Engine version 1.8.1
Linux 4.0.5-1-ARCH #1 SMP PREEMPT Sat Jun 6 18:37:49 CEST 2015 x86_64
Creating X window...
Visual chosen: : 133
Using renderer: OpenGL 2.1
Mesa DRI Intel(R) Ironlake Mobile : Intel Open Source Technology Center
OpenGL driver version is 1.2 or better.
GLSL version: 1.2

Triangles: 10296
858 mesh buffers
FPS: 30 (avg), 31 (peak), 5 (low, only during the first few ms) [as reported by driver->getFPS()]

This performance is rather terrible, if we load more chunks it gets worse, let's see what happens with 8 chunks (all identical for sake of consistency)

Triangles: 82368
6864 mesh buffers
FPS: 4 (avg) 8 (peak) 1 (low, only during the first few ms) [as reported by driver->getFPS()]

The gpu in question may be absolutely terrible (about equivalent to a 2010/2011 smartphone... then again it is a mobile gpu, even supports OGL ES 2.0 as well as normal OGL 2.1, or so mesa reports. I've tried GL ES demos and they ran) but not this terribad, at least not given the other baselines.
other baseline values that aren't 15 fps:

TES IV oblivion, medium/low settings 30ish FPS (avg), 70 (peak, mostly in empty well lit areas), 10 (low, with litebrite enabled in dark dusky areas with a lot of shadows)
Fallout 3 and Fallout new vegas: medium settings, 45 FPS (avg), 70ish (peak), 30 (low)
Half-Life 2, medium settings, 30 FPS (avg), no other data collected. (low is terribly low, that's the best estimate I can give, high seems to be about avg, the game is really stable with its fps, the only times I get dips is with a lot of explosions and on the title screen that uses a fancy water shader)

This joke of a GPU is, again, really not good but it does hold up to some abuse, the values I get are far too low to indicate only gpu, there's a clear bottleneck.
and I intend to find it or I'm forced to hack together a rendering engine of my own as these FPS values (naive optimization) aren't even remotely in the target range (80k tris is a reasonable render count for a much larger render distance than 2 chunks in each direction, either way 80k tris is not that much, even the ps1 could push that if you really wanted to torture the thing).

As theory 1 and 2 state, this would indicate 6864 draw calls instead of a more realistic 16ish (assuming two mesh buffers per chunk @ 8 chunks), if this is indeed the case then the solution is reasonably obvious: use fewer mesh buffers.
the issue is, how does one merge mesh buffers?
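On the merging question, the figures above work out to 10296 / 858 = 12 triangles per mesh buffer, so each buffer appears to hold a single cube. Below is a minimal sketch of the usual way to merge such buffers: copy the vertices across and shift every index by the number of vertices already in the destination. `Vertex`, `MeshBuffer`, `appendBuffer` and `makeTriangle` are hypothetical stand-ins for illustration, not Irrlicht types or calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for an engine's vertex and mesh buffer types.
struct Vertex { float x, y, z; };

struct MeshBuffer {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices; // 16-bit indices, three per triangle
};

// Append src to dst. Indices in src refer to src's own vertex list, so each
// one is shifted by the number of vertices already stored in dst.
// Returns false and leaves dst untouched if the merged buffer could no
// longer be addressed with 16-bit indices.
bool appendBuffer(MeshBuffer& dst, const MeshBuffer& src)
{
    const std::size_t base = dst.vertices.size();
    if (base + src.vertices.size() > 65536)
        return false;

    dst.vertices.insert(dst.vertices.end(),
                        src.vertices.begin(), src.vertices.end());
    for (std::uint16_t i : src.indices) {
        assert(i < src.vertices.size()); // src must be internally consistent
        dst.indices.push_back(static_cast<std::uint16_t>(base + i));
    }
    return true;
}

// Build a one-triangle buffer, used here only to demonstrate merging.
MeshBuffer makeTriangle(float x)
{
    MeshBuffer b;
    b.vertices = { {x, 0.0f, 0.0f}, {x + 1.0f, 0.0f, 0.0f}, {x, 1.0f, 0.0f} };
    b.indices = { 0, 1, 2 };
    return b;
}
```

Merging two triangles this way yields one buffer with six vertices and indices 0..5, which an engine that issues one draw call per buffer would submit as a single call instead of two.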

On that matter, perhaps it's a better idea to just build the entire mesh in a vertex shader; given that I upload it to the GPU anyway with VBOs, this would seemingly skip an entire step that could potentially bottleneck the whole thing.
especially since I'll have to use a vertex shader to do greedy meshing, one could assume that perhaps it'd make more sense to build the entire mesh from scratch (or indeed, build it entirely from voxel metadata since that data is strictly speaking all that's needed to traverse the optimization algorithm)

Other potential issues I see: I was told one cannot allocate a node in the middle of a frame, does this also go for creating mesh buffers or meshes? or is this fine so long they aren't attached to a node until the frame has rendered?
if this is indeed the case, what about modifying a mesh attached to a node while mid-frame, is this also a no-no?
or to simplify: what mesh and node operations can and can't be done during the rendering of a frame (this is of course, vitally important since I'm going to generate chunks in a multithreaded fashion)

to reiterate, since it was mentioned way up: any general advice on speeding up irrlicht in addition to what I asked would be greatly appreciated, I need to squeeze out every bit of performance I can.

hendu · Post by **hendu** » Mon Jun 29, 2015 9:19 am

Yes, each mesh buffer is a draw call.

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 9:54 am

right, that explains a lot (a modern gpu could tank a few thousand calls but I'm testing on a mobile gpu roughly equivalent to the tegra 3 on purpose as that's minimum target).

now I just need to know the max vertex count for each mesh buffer to be able to add them all, and if a mesh buffer can have arbitrary material counts (say a mesh buffer containing 8 verts, 4 verts form a quad and the other form another quad, would they then be allowed different materials?)
Regardless I could probably write an omnishader that takes the block type as a parameter to determine what tile it should use so either way I could most likely design around such a potential requirement.
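One common way to let a single material cover many block types is a texture atlas, where the block type selects a tile and the tile determines the UV rectangle. The sketch below does that arithmetic on the CPU for clarity; a shader would compute the same thing. The atlas layout (square, row-major, equal-sized tiles) and the names `UvRect` and `tileUv` are assumptions for illustration, not something the thread or Irrlicht prescribes.

```cpp
#include <cassert>

// UV rectangle of one tile inside the atlas, in normalized [0, 1] coordinates.
struct UvRect { float u0, v0, u1, v1; };

// Map a tile index to its UV rectangle in a square atlas that holds
// tilesPerRow x tilesPerRow equally sized tiles, numbered row by row
// starting at the corner where u = 0 and v = 0.
UvRect tileUv(unsigned tile, unsigned tilesPerRow)
{
    assert(tilesPerRow > 0 && tile < tilesPerRow * tilesPerRow);
    const float size = 1.0f / static_cast<float>(tilesPerRow);
    const unsigned col = tile % tilesPerRow;
    const unsigned row = tile / tilesPerRow;
    return { col * size, row * size, (col + 1) * size, (row + 1) * size };
}
```

With a 4x4 atlas (16 tiles), tile 5 sits in column 1, row 1, so `tileUv(5, 4)` covers u and v from 0.25 to 0.5.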

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 10:02 am

UPDATE: Okay, now that works but chunks are WAY slower to generate
previous: 50ms build time
now: 3000ms build time

on the other hand it now runs at 60 fps (vsync locked), as expected it only works with one material but that can be addressed using a shader material.

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 6:03 pm

Okay, we're defo getting the sort of performance I was expecting: a 1km^3 area of the world is 5,271,552 tris and runs in real time (64 draw calls) at a solid 60fps (VSYNC LOCKED) even on a shite gpu.
However, it also takes about 3200ms per chunk... this might not sound like an issue at first, until you realize that there are hundreds of them loaded for almost an hour of just building the chunks, this would indicate that my optimization is slow.

the only change I made was instead of passing a mesh to my meshing function and then attaching one mesh buffer to it I instead pass a mesh buffer, add all the vertices in the chunk (30k for the chunk I'm testing with) and then add this singular mesh buffer regardless of if it's empty or not (an empty mesh buffer is required to not crash for a mesh scene node)

I don't see why this is this much slower, I expected maybe 10%, not 61.5x slowdown, that's insane.
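The thread never pins down the cause of this slowdown. One possibility, stated here purely as an assumption, is that appending many small pieces to one growing buffer triggers a reallocation and full copy on each append, which makes the total work quadratic in the number of pieces. The cost model below counts element copies under that assumption; `copiesPerformed` is a hypothetical helper and measures nothing about Irrlicht itself.

```cpp
#include <cassert>
#include <cstddef>

// Count element copies when `pieces` blocks of `pieceSize` elements are
// appended one after another to a single growable array.
//   exactFit = true : the array is reallocated to exactly the needed size on
//                     every append, so all existing elements are copied again.
//   exactFit = false: the final size was reserved once up front, so only the
//                     new elements are written.
std::size_t copiesPerformed(std::size_t pieces, std::size_t pieceSize, bool exactFit)
{
    assert(pieceSize > 0);
    std::size_t size = 0;
    std::size_t copies = 0;
    for (std::size_t i = 0; i < pieces; ++i) {
        if (exactFit)
            copies += size;   // existing contents moved to the new allocation
        copies += pieceSize;  // the appended elements themselves
        size += pieceSize;
    }
    return copies;
}
```

For 1000 appends of 12 vertices each, the model gives 12,000 copies with an up-front reservation and 6,006,000 with exact-fit growth. If a profiler shows time going into reallocation, reserving the worst-case size before the loop would be the first thing to try.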

Darktib · Post by **Darktib** » Mon Jun 29, 2015 7:21 pm

A draw call to pass geometry to the gpu (or to instruct it to use a vbo) is not that expensive. However, changing material (also a draw call, which means your draw call estimate is too low) can be really expensive. Switching materials thousands of times per frame is probably a huge problem - and I don't think Irrlicht has any sort of merger (i.e. reordering rendered nodes by shader to save shader switches).
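The reordering idea can be sketched independently of any engine: sort the draw list so that buffers sharing a material are adjacent, then bind a material only when it differs from the previous one. `DrawItem`, `countMaterialBinds` and `sortByMaterial` are hypothetical names for illustration, and the sketch assumes opaque geometry, since transparent surfaces usually have to be drawn in depth order instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One entry in a frame's draw list (hypothetical, for illustration).
struct DrawItem {
    int materialId; // which material or shader the buffer needs
    int bufferId;   // which mesh buffer to draw
};

// Number of material binds needed when drawing in the given order: one for
// the first item, plus one whenever the material differs from the previous.
std::size_t countMaterialBinds(const std::vector<DrawItem>& items)
{
    std::size_t binds = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (i == 0 || items[i].materialId != items[i - 1].materialId)
            ++binds;
    return binds;
}

// Reorder so that items sharing a material are adjacent. stable_sort keeps
// the original relative order of items within each material.
void sortByMaterial(std::vector<DrawItem>& items)
{
    auto byMaterial = [](const DrawItem& a, const DrawItem& b) {
        return a.materialId < b.materialId;
    };
    std::stable_sort(items.begin(), items.end(), byMaterial);
    assert(std::is_sorted(items.begin(), items.end(), byMaterial));
}
```

Five buffers alternating between two materials need five binds in their original order and two after sorting.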

You could try to profile your application, the visual studio profiler is quite good for cpu related profiling, or you could use gDEBugger for gpu profiling.
Also, maybe you can generate your meshes on a thread ? On completion, get the mesh back on the main thread, then assign materials and set it to a new scene node.

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 7:26 pm

Material switching is one draw call, however if each mesh buffer only has one material and each mesh buffer is one draw call then the draw call count is equivalent to the mesh buffer count.
more importantly though, one material seems to be the max supported by any one singular buffer, which is fine I can solve that with a shader.

interestingly though, if you overrun the cap performance drops like a rock, and the meshes get glitchy.

interestingly enough irrlicht spammed my console with this, suppressing all other output (as in, pushing it above scrollback limit):
Too many vertices for 16bit index type, render artifacts may occur.

I'd wager that puts the vertex cap at 65536 (counting index 0) per buffer
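A 16-bit index can address vertices 0 through 65535, which matches the 65536 estimate. A minimal sketch of how merged geometry could be kept under that cap: start a new buffer whenever the next piece would not fit. `planBuffers` and `kMaxVertices16` are hypothetical names, and the sketch assumes pieces (for example, one cube each) are never split across buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Largest vertex count addressable with 16-bit indices: indices 0..65535.
constexpr std::size_t kMaxVertices16 =
    static_cast<std::size_t>(std::numeric_limits<std::uint16_t>::max()) + 1;

// Given the vertex count of each piece to be merged, return the vertex count
// of each output buffer when a new buffer is started whenever the next piece
// would push the current one past the 16-bit limit.
std::vector<std::size_t> planBuffers(const std::vector<std::size_t>& pieceSizes)
{
    std::vector<std::size_t> buffers;
    std::size_t current = 0;
    for (std::size_t n : pieceSizes) {
        assert(n <= kMaxVertices16); // a single piece must fit on its own
        if (current + n > kMaxVertices16) {
            buffers.push_back(current); // close the full buffer
            current = 0;
        }
        current += n;
    }
    if (current > 0)
        buffers.push_back(current);
    return buffers;
}
```

For 3000 pieces of 24 vertices each (72,000 vertices in total), this yields two buffers of 65,520 and 6,480 vertices, so two draw calls instead of one oversized buffer with index overflow.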

Anywho, to address your response: one single draw call isn't expensive, but thousands of them are, especially since gpus do much better with large batches of vertices as opposed to getting sent a dozen or so, thousands of times over; that's a huge amount of overhead.

CuteAlien · Post by **CuteAlien** » Mon Jun 29, 2015 9:18 pm

The index buffer is 16-bit by default. You can create 32-bit buffers as well, but I've not worked with those yet myself.

devsh · Post by **devsh** » Mon Jun 29, 2015 10:16 pm

2,6,7

TOP limit on drawcalls per frame is 5000

Cube_ · Post by **Cube_** » Mon Jun 29, 2015 11:37 pm

5000, I'd assume that's top for decent performance, not a hardcoded top (unless it would take two frames to render the scene if that's a hardcoded limit)

anyway 32 bit buffers might work but I'm not sure what the hardware support is for that on various platforms, more importantly I need to figure out why passing the buffer instead of the mesh (and then appending the buffer when the mesh is full) is 62x slower than just creating a few thousand buffers, a lot slower actually, perhaps merging buffers is faster?

What I do is, I create a mesh and a mesh buffer, I call the meshing function 4096 times and attach either 12 verts and 36 indices or nothing to it (average about 1k appends), then when I've gone through that loop I attach the buffer to the mesh, drop the buffer and make a scene node, then I drop the mesh.

previously I did the same but instead of appending to the same buffer I used a new one each time resulting in ~1k buffers per chunk.

hendu · Post by **hendu** » Tue Jun 30, 2015 8:31 am

Some mobiles do not support 32-bit indices.

Darktib · Post by **Darktib** » Wed Jul 01, 2015 5:28 pm

Cube_ wrote: Material switching is one draw call.

Yes, but this is 1 draw call per mesh buffer (if you render it using a mesh). So, 1 mesh = at least 2 draw calls per mesh buffer

You could try rendering it manually: create your own scene node, and do 1 driver->setMaterial() followed by as much driver->drawMeshBuffer as you need.
For the shader, you should definitely pack your types, if 1 type = 1 texture you can put up to 16 types per material - this is a nice way to improve performance.
Also, how do you generate your mesh ? Do you put each cube individually (if yes, then that's definitely sub-optimal) ? As you're doing some voxels, you should generate the surface of the chunk, to "remove" all vertices inside the terrain.

Cube_ · Post by **Cube_** » Fri Jul 03, 2015 4:36 am

currently I'm using naive optimization of culling chunks not touching air, this will be improved to only generate faces touching air (haven't bothered yet) and then finally migrated to shader for greedy meshing (minimum vertices possible to generate the terrain)
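The "only generate faces touching air" step can be sketched as follows: for every solid voxel, look at its six neighbours and keep a face only where the neighbour is air. The `Chunk` class and `countVisibleFaces` are hypothetical stand-ins; the sketch assumes a cubic chunk and treats everything outside the chunk as air, so faces on the chunk boundary are kept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cubic chunk of voxels: true = solid, false = air.
class Chunk {
public:
    explicit Chunk(int n)
        : size_(n), solid_(static_cast<std::size_t>(n) * n * n, false)
    {
        assert(n > 0);
    }
    int size() const { return size_; }
    bool inside(int x, int y, int z) const
    {
        return x >= 0 && y >= 0 && z >= 0 && x < size_ && y < size_ && z < size_;
    }
    // Positions outside the chunk count as air.
    bool isSolid(int x, int y, int z) const
    {
        return inside(x, y, z) && solid_[index(x, y, z)];
    }
    void set(int x, int y, int z, bool value)
    {
        assert(inside(x, y, z));
        solid_[index(x, y, z)] = value;
    }
private:
    std::size_t index(int x, int y, int z) const
    {
        return (static_cast<std::size_t>(x) * size_ + y) * size_ + z;
    }
    int size_;
    std::vector<bool> solid_;
};

// Count the cube faces that border air; only these would need geometry.
std::size_t countVisibleFaces(const Chunk& c)
{
    static const int dirs[6][3] = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}
    };
    std::size_t faces = 0;
    for (int x = 0; x < c.size(); ++x)
        for (int y = 0; y < c.size(); ++y)
            for (int z = 0; z < c.size(); ++z) {
                if (!c.isSolid(x, y, z))
                    continue;
                for (const auto& d : dirs)
                    if (!c.isSolid(x + d[0], y + d[1], z + d[2]))
                        ++faces;
            }
    return faces;
}
```

A lone voxel keeps all 6 faces, two touching voxels keep 10 of 12, and a completely solid 2x2x2 chunk keeps 24 of 48. A real mesher would emit a quad where this sketch increments the counter.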

christianclavet · Post by **christianclavet** » Sat Jul 04, 2015 1:24 am

Two techniques that should improve performance:
- Deferred rendering
- Instancing (there are tons of variant)

Cube_ · Post by **Cube_** » Sun Jul 05, 2015 12:17 am

I considered instancing but that has a higher draw call overhead on OGL 2.1 and OGL ES 2.0, both of which I intend to support, in fact instancing in either is pretty much nonexistent without hacks, I'll go with greedy meshing.

As for deferred rendering... I don't particularly understand the topic well enough to even consider implementing or using it.
