efficiently allocate large amounts of meshes.

Cube_ · Post by **Cube_** » Fri Mar 06, 2015 2:57 pm

Okay, so after profiling my code I get this output:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 29.74      0.43     0.43  2097148     0.00     0.00  irr::core::CMatrix4<float>::operator==(irr::core::CMatrix4<float> const&) const
 11.89      0.60     0.17                             irr::scene::SMesh::getMeshBuffer(unsigned int) const
 10.49      0.75     0.15  2097148     0.00     0.00  irr::video::SMaterialLayer::operator!=(irr::video::SMaterialLayer const&) const
  9.10      0.88     0.13  3145748     0.00     0.00  irr::core::aabbox3d<float>::addInternalPoint(float, float, float)
  8.05      0.99     0.12                             irr::video::SMaterial::operator!=(irr::video::SMaterial const&) const
  5.60      1.07     0.08  1835016     0.00     0.00  irr::core::vector3d<float>::vector3d(float, float, float)
  4.90      1.14     0.07   262145     0.00     0.00  irr::IReferenceCounted::drop() const
  2.45      1.18     0.04                             irr::core::vector3d<float>::operator*=(irr::core::vector3d<float> const&)
  2.10      1.21     0.03                             irr::scene::SMesh::getMeshBufferCount() const
  1.40      1.23     0.02  2097148     0.00     0.00  irr::video::SColor::operator!=(irr::video::SColor const&) const
  1.40      1.25     0.02   262144     0.00     0.00  irr::scene::IMesh::~IMesh()
  1.40      1.27     0.02        1    20.01   140.07  chunk::createMesh()
  1.05      1.28     0.02  2883584     0.00     0.00  irr::core::array<irr::scene::IMeshBuffer*, irr::core::irrAllocator<irr::scene::IMeshBuffer*>

That's 89.57% of my program's entire runtime, most of it apparently spent on mesh work.
With profiling enabled it takes about 5000 ms (rounded up; 3550 ms without profiling) to generate the full 3.14M tris, using roughly 350 MB of RAM.
Here's the relevant code for generating the mesh data:

Code:

ISceneNode *node = smgr->addCubeSceneNode(1.0f, 0, -1, core::vector3df(x,y,z));
node->setMaterialFlag(video::EMF_WIREFRAME, true);

I'm aware that culling hidden faces would give me a nice speed-up, but I'm not really interested in that right now. Memory and runtime both matter because of my target specs, and at this point I'm comparing what different methods cost using the most naive implementation (fill the entire chunk with cubes, without any logic determining which faces to cull).

The sad part is that the extremely naive and silly method of generating 262144 cube scene nodes is still faster than this:

Code:

video::SColor c(255, rand() % 256, rand() % 256, rand() % 256);
//SNIP
video::S3DVertex vertices[24] =
                    {
                        // Up
                        video::S3DVertex(-2,+2,-2, 0,1,0, c, 0,1),
                        video::S3DVertex(-2,+2,+2, 0,1,0, c, 0,0),
                        video::S3DVertex(+2,+2,+2, 0,1,0, c, 1,0),
                        video::S3DVertex(+2,+2,-2, 0,1,0, c, 1,1),
                        // Down
                        video::S3DVertex(-2,-2,-2, 0,-1,0, c, 0,0),
                        video::S3DVertex(+2,-2,-2, 0,-1,0, c, 1,0),
                        video::S3DVertex(+2,-2,+2, 0,-1,0, c, 1,1),
                        video::S3DVertex(-2,-2,+2, 0,-1,0, c, 0,1),
                        // Right
                        video::S3DVertex(+2,-2,-2, 1,0,0, c, 0,1),
                        video::S3DVertex(+2,+2,-2, 1,0,0, c, 0,0),
                        video::S3DVertex(+2,+2,+2, 1,0,0, c, 1,0),
                        video::S3DVertex(+2,-2,+2, 1,0,0, c, 1,1),
                        // Left
                        video::S3DVertex(-2,-2,-2, -1,0,0, c, 1,1),
                        video::S3DVertex(-2,-2,+2, -1,0,0, c, 0,1),
                        video::S3DVertex(-2,+2,+2, -1,0,0, c, 0,0),
                        video::S3DVertex(-2,+2,-2, -1,0,0, c, 1,0),
                        // Back
                        video::S3DVertex(-2,-2,+2, 0,0,1, c, 1,1),
                        video::S3DVertex(+2,-2,+2, 0,0,1, c, 0,1),
                        video::S3DVertex(+2,+2,+2, 0,0,1, c, 0,0),
                        video::S3DVertex(-2,+2,+2, 0,0,1, c, 1,0),
                        // Front
                        video::S3DVertex(-2,-2,-2, 0,0,-1, c, 0,1),
                        video::S3DVertex(-2,+2,-2, 0,0,-1, c, 0,0),
                        video::S3DVertex(+2,+2,-2, 0,0,-1, c, 1,0),
                        video::S3DVertex(+2,-2,-2, 0,0,-1, c, 1,1),
                    };
 
                    u16 indices[6] = {0,1,2,2,3,0};
                    scene::SMesh *mesh = new scene::SMesh();
                    for (u32 i=0; i<6; ++i)
                    {
                        scene::IMeshBuffer *buf = new scene::SMeshBuffer();
                        buf->append(vertices + 4 * i, 4, indices, 6);
                        // Set default material
                        buf->getMaterial().setFlag(video::EMF_LIGHTING, false);
                        buf->getMaterial().setFlag(video::EMF_BILINEAR_FILTER, false);
                        buf->getMaterial().MaterialType = video::EMT_TRANSPARENT_ALPHA_CHANNEL_REF;
                        // Add mesh buffer to mesh
                        mesh->addMeshBuffer(buf);
                        buf->drop();
//SNIP
    scene::SAnimatedMesh *anim_mesh = new scene::SAnimatedMesh(mesh);
    mesh->drop();
    scaleMesh(anim_mesh, scale);  // also recalculates bounding box
    ISceneNode * Chunk = smgr->addMeshSceneNode(anim_mesh);

The above code takes 35098.5 ms (about 15000 ms without profiling) to run and uses 1.0 GB of RAM (1.1 GB peak).
Looking over the profile of this version, I can already tell why it's so bloody inefficient:

Code:

 14.03      2.42     2.42 18874364     0.00     0.00  irr::core::CMatrix4<float>::operator==(irr::core::CMatrix4<float> const&) const
  7.38      5.36     1.27 40866384     0.00     0.00  irr::core::irrAllocator<irr::core::CMatrix4<float> >::irrAllocator()
  5.34      7.22     0.92 18874364     0.00     0.00  irr::video::SMaterialLayer::operator!=(irr::video::SMaterialLayer const&) const
  3.95      9.51     0.68 34574928     0.00     0.00  irr::video::SMaterialLayer::~SMaterialLayer()
  3.31     10.08     0.57 53449312     0.00     0.00  irr::core::irrAllocator<irr::core::CMatrix4<float> >::internal_delete(void*)
  2.27     10.47     0.39 24372313     0.00     0.00  operator new(unsigned long, void*)
  2.00     10.82     0.35 34574928     0.00     0.00  irr::core::irrAllocator<irr::core::CMatrix4<float> >::~irrAllocator()
  1.86     11.14     0.32 34574928     0.00     0.00  irr::core::irrAllocator<irr::core::CMatrix4<float> >::deallocate(irr::core::CMatrix4<float>*)
  1.74     11.44     0.30 34574928     0.00     0.00  irr::core::irrAllocator<irr::core::CMatrix4<float> >::destruct(irr::core::CMatrix4<float>*)

Those numbers are huge, no wonder it's slow.
That's a lot of allocation calls, a lot of matrices, and a whole ton of insane overprocessing, it seems*.

*Note: all of the entries in the log are handpicked from the analysis to showcase the worst parts.

So, the question is:
What is a better, cleaner way to generate these meshes? Memory efficiency is more important than runtime speed; as long as generating the mesh doesn't take more than ~5 seconds, the runtime is acceptable (I reckon actually generating the optimized mesh will be orders of magnitude faster).

Ideally I'd like to use less than 40 MB for the mesh data, though this might be impossible with the naive approach of filling the entire volume with blocks (is it?).
Really, anything faster than the cube scene node method (preferably by building the mesh instead of calling a function as I'll want to generate fewer faces later, hence why I'm trying to write my own function and doing very poorly) would be fine.

Recap: I'm not looking for advice on how to optimize generation. I know full well that generating 262144 cubes (3.14M tris) per chunk isn't a viable long-term strategy, but I'm primarily concerned with getting something up so I can start implementing other features; mesh optimizations come further down the line than other things like paging.
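
Editor's sketch: the slow version in this post creates one SMeshBuffer (with its own material) per face, which comes to 6 x 262144 = 1572864 buffers per chunk. A common alternative is to append every quad into one shared vertex array and one shared index array, rebasing the `{0,1,2,2,3,0}` pattern from the post onto each quad's position. The code below is a standalone illustration, not Irrlicht code: `Vertex`, `ChunkMeshData`, `reserveQuads`, `appendQuad` and `fitsIn16BitIndices` are hypothetical names, and the 16-bit limit assumes the target buffer uses u16 indices, as the `u16 indices[6]` array in the post suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal vertex: position plus a packed 32-bit colour.
struct Vertex {
    float x, y, z;
    std::uint32_t argb;
};

// One growing vertex/index pair for the whole chunk (hypothetical type).
struct ChunkMeshData {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;

    // Reserve once so appending does not reallocate repeatedly.
    void reserveQuads(std::size_t quadCount) {
        vertices.reserve(quadCount * 4);
        indices.reserve(quadCount * 6);
    }

    // Appends four corners and the two triangles 0,1,2 and 2,3,0,
    // rebased onto this quad's position in the shared vertex array.
    void appendQuad(const Vertex (&corners)[4]) {
        assert(vertices.size() <= 0xFFFFFFFFu - 4u);
        const std::uint32_t base = static_cast<std::uint32_t>(vertices.size());
        vertices.insert(vertices.end(), corners, corners + 4);
        static const std::uint32_t pattern[6] = {0, 1, 2, 2, 3, 0};
        for (std::uint32_t offset : pattern) indices.push_back(base + offset);
    }

    // A 16-bit index can only address 65536 distinct vertices.
    bool fitsIn16BitIndices() const { return vertices.size() <= 65536; }
};
```

With 24 vertices per cube, a buffer limited to 16-bit indices would hold at most 2730 full cubes, so a chunk built this way would presumably need to be split across several buffers or use 32-bit indices.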

hendu · Post by **hendu** » Sat Mar 07, 2015 9:10 am

I'm afraid you will have to optimize your generation to get less RAM use. "Just pack the 256k cubes tighter" won't save you much, you need "the hidden cubes just don't exist".

Granyte · Post by **Granyte** » Sat Mar 07, 2015 11:36 am

You should just detect whether each cube is on the edge of the solid volume, and if it's inside, don't add it.

Also, optionally, you could look into instancing; that could help your speed a lot.
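
Editor's sketch of the edge test suggested above: a solid block can be skipped when all six of its face neighbours are solid as well. The `Chunk` type and the function names are hypothetical, and treating everything outside the chunk as empty is an assumption (a real world would consult the neighbouring chunk).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunk: a dense n*n*n grid where true means "solid block".
struct Chunk {
    int n;
    std::vector<bool> solid;

    explicit Chunk(int size, bool fill = true)
        : n(size), solid(static_cast<std::size_t>(size) * size * size, fill) {
        assert(size > 0);
    }

    // Anything outside the chunk is treated as empty (an assumption).
    bool isSolid(int x, int y, int z) const {
        if (x < 0 || y < 0 || z < 0 || x >= n || y >= n || z >= n) return false;
        return solid[(static_cast<std::size_t>(z) * n + y) * n + x];
    }

    // A solid block is hidden when all six face neighbours are solid too.
    bool isHidden(int x, int y, int z) const {
        return isSolid(x - 1, y, z) && isSolid(x + 1, y, z) &&
               isSolid(x, y - 1, z) && isSolid(x, y + 1, z) &&
               isSolid(x, y, z - 1) && isSolid(x, y, z + 1);
    }
};

// Counts the solid blocks that would still need geometry.
std::size_t countVisibleBlocks(const Chunk& c) {
    std::size_t visible = 0;
    for (int z = 0; z < c.n; ++z)
        for (int y = 0; y < c.n; ++y)
            for (int x = 0; x < c.n; ++x)
                if (c.isSolid(x, y, z) && !c.isHidden(x, y, z)) ++visible;
    return visible;
}
```

For a completely solid 64^3 chunk this leaves only the outer shell, 64^3 - 62^3 = 23816 blocks out of 262144.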

Cube_ · Post by **Cube_** » Sat Mar 07, 2015 3:20 pm

Well, I was really hoping to find a method of adding more cubes that doesn't involve adding a scene node per block, I'd love to merge them so I can later do optimizations on the meshes themselves (but more importantly so I can use VBOs).
The logic for determining which blocks should or shouldn't be generated is rather trivial (on that matter I should implement frustum culling...).
I do realize that I need to optimize out the cubes that aren't visible, but I was more hoping to find a way to generate meshes that use as little memory as possible (as in, I'm trying to keep the entire project under strict size limits).
I don't know what a vertex costs, memory wise, but I'd wager some 12 bytes (three floats for location, not sure what other information they store).
Not sure what indices take.
Not sure if the faces consume extra data; if so, how much?
Assuming the vertices are all the data required, each block should take 288 bytes (before optimizing out duplicate verts, which I assume should be reasonably possible so long as they have the same material), for a grand total of 75.5 megabytes in one chunk.
That's probably quite a bit off, but it might be sorta close at least.
Of course, the reason I need this is that I will have a lot of chunks loaded at any time. Even with optimized meshes I'd love to keep the memory usage down as much as possible, and that needs to start with my mesh allocator, so I need a more efficient way to generate meshes than either of the two methods I've tried (my super slow and expensive cube generator and the naive cube node generator). Realistically, 400 bytes of mesh data per cube (a full cube, that is) would be the highest I could go; that'd yield some 105 megabytes of memory usage.
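
Editor's sketch of the estimate above as a small calculation, so the inputs can be varied. It reproduces the 288 bytes per block and 75.5 MB per chunk from this post, then adds index data under the assumption of 16-bit (2-byte) indices, which matches the `u16` index arrays used elsewhere in the thread. In an indexed triangle list the faces are just consecutive index triples, so they should not cost anything beyond the indices themselves. `CubeLayout` and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of how one cube is stored.
struct CubeLayout {
    std::uint64_t verticesPerCube;
    std::uint64_t bytesPerVertex;
    std::uint64_t indicesPerCube;
    std::uint64_t bytesPerIndex;
};

// Vertex bytes plus index bytes for a single cube.
std::uint64_t bytesPerCube(const CubeLayout& l) {
    assert(l.verticesPerCube > 0);
    return l.verticesPerCube * l.bytesPerVertex +
           l.indicesPerCube * l.bytesPerIndex;
}

// Total mesh bytes for a chunk holding the given number of cubes.
std::uint64_t bytesPerChunk(const CubeLayout& l, std::uint64_t cubes) {
    return bytesPerCube(l) * cubes;
}

const std::uint64_t kCubesPerChunk = 64ull * 64 * 64;  // 262144

// The estimate from the post: 24 position-only vertices, indices ignored.
const CubeLayout kPostEstimate = {24, 12, 0, 0};
// The same vertices plus 36 two-byte indices (12 triangles).
const CubeLayout kWithIndices = {24, 12, 36, 2};
```

Under these assumptions the indices add 72 bytes per cube, for 360 bytes per cube and roughly 94.4 MB per full chunk.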

Cube_ · Post by **Cube_** » Sat Mar 07, 2015 4:03 pm

S3DVertex (f32 x, f32 y, f32 z, f32 nx, f32 ny, f32 nz, SColor c, f32 tu, f32 tv)
...
s3DVertex(4,4,4,4,4,4,16,4,4) that's a lot of bytes for one vertex, now I see why my code uses such a large amount of ram.
that's 301.989888 megabytes of just vertex data o-o;;

I get the x, y, z.
I don't get the nx, ny, nz, and I don't know why I'd need an SColor if I also have UVs.
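
Editor's note with a layout sketch: as far as I know, irr::video::SColor packs all four 8-bit channels into a single 32-bit value, so it should cost 4 bytes rather than 16, which would put S3DVertex at 36 bytes instead of 48. The normal (nx, ny, nz) is the surface direction used for lighting, and the colour is a per-vertex tint that is independent of the UVs, which only say where to sample a texture. The structs below are stand-ins that mirror those fields; they are not the Irrlicht types, and the 36-byte figure assumes 4-byte floats with no extra padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in types mirroring the fields of S3DVertex (NOT the Irrlicht types).
struct Vec3 { float x, y, z; };              // position or normal, 12 bytes
struct Vec2 { float u, v; };                 // texture coordinates, 8 bytes
struct PackedColor { std::uint32_t argb; };  // 8 bits per channel, 4 bytes

struct VertexSketch {
    Vec3 pos;           // where the vertex is
    Vec3 normal;        // surface direction, used for lighting
    PackedColor color;  // per-vertex colour
    Vec2 uv;            // where to sample the texture
};

static_assert(sizeof(VertexSketch) == 36, "expected a 36-byte vertex");

// Vertex bytes for a chunk, given how many vertices each cube uses.
std::size_t vertexBytes(std::size_t cubes, std::size_t verticesPerCube) {
    assert(verticesPerCube > 0);
    return cubes * verticesPerCube * sizeof(VertexSketch);
}
```

With this layout, 262144 cubes at 24 vertices each come to about 226.5 MB of vertex data, and about 75.5 MB at 8 shared vertices per cube.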

Cube_ · Post by **Cube_** » Sat Mar 07, 2015 9:28 pm

Success! Sorta.
I've managed to generate a full chunk using only 234mb (including the extra ram consumed by chunk data and the rest of the program).

With a slight caveat, there are a bunch of meshes missing from the chunk.

This is the math I use for each block

Code:

video::S3DVertex vertices[8] =
                    {
                        video::S3DVertex(-1,+1,-1, 0,0,0, c, 1,0),
                        video::S3DVertex(+1,+1,-1, 0,0,0, c, 0,1),
                        video::S3DVertex(+1,-1,-1, 0,0,0, c, 1,1),
                        video::S3DVertex(-1,-1,-1, 0,0,0, c, 0,0),
                        video::S3DVertex(-1,+1,+1, 0,0,0, c, 1,0),
                        video::S3DVertex(+1,+1,+1, 0,0,0, c, 0,1),
                        video::S3DVertex(+1,-1,+1, 0,0,0, c, 1,1),
                        video::S3DVertex(-1,-1,+1, 0,0,0, c, 0,0),
                    };
                    u16 indices[36] = {         //Front
                                        0, 1, 2,
                                        1, 2, 3,
                                                //Back
                                        4, 5, 6,
                                        5, 6, 7,
                                                //Left
                                        4, 5, 0,
                                        5, 0, 1,
                                                //Right
                                        2, 3, 6,
                                        3, 6, 7,
                                                //Top
                                        1, 5, 3,
                                        5, 3, 7,
                                                //Bottom
                                        0, 4, 2,
                                        4, 2, 6};
 
                    scene::IMeshBuffer *buf = new scene::SMeshBuffer();
                    buf->append(vertices, 8, indices, 32);
 
                    buf->getMaterial().setFlag(video::EMF_LIGHTING, false);
                    buf->getMaterial().setFlag(video::EMF_BILINEAR_FILTER, false);
                    buf->getMaterial().MaterialType = video::EMT_TRANSPARENT_ALPHA_CHANNEL_REF;
 
                    mesh->addMeshBuffer(buf);
                    buf->drop();

I think that my code *may* be a slight wee bit broken in some regards, I went through the vertex math on paper and came up with this:

I'm not entirely certain why it asplodes; I'm probably abusing buf->append by feeding it weird data.
Or maybe it's because I don't understand all the fields of S3DVertex, so I just filled in the ones I understood (and I guessed on the UV fields; I don't need UVs for what I'm doing).
It is also noticeably slower than the naive method of cube scene nodes. The program overall runs faster, but it takes 7711 ms to generate these cubes instead of 3550 ms. That's probably an optimization issue and is fine; it's below my 8 second mark anywho (based on estimating the time it'd take someone to traverse an entire chunk).

So now my questions are:
Why is this math broken, and where is it broken?
Second: is there a more efficient vertex type than S3DVertex? I really only need the vertex location and the color (not sure why SColor takes four 32 bit values since it only goes to 255 in each field; it should only need 8 bits per field for this).

Ideally I'd hope there's a vertex type that stores this data and nothing but:
vertex location (3 values, floating point probably) - 96 bits (3x32 bit values) (12 bytes)
vertex color (4 values, ARGB each being 8 bits) - 32 bits (4 bytes); hell, I only need 8 bit color depth, so I'd love a type that only uses 2 bit alpha and 2 bits per color channel.
That'd allow me to hit 128 bytes per cube (8 vertices at 16 bytes each, vertex data only) for a total of about 33.5 megabytes of vertex data. This would be ideal; is there such a vertex type in Irrlicht? If not, does Irrlicht have some system for easily adding such a type (preferably easily ported to the OGL ES version as well, since I intend to use it for mobile platforms)?
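
Editor's sketch of the "2 bits per channel" idea: four 2-bit channels fit in one byte, and multiplying each by 85 expands them back to the usual 0..255 range. This is a standalone illustration, not an Irrlicht vertex type; `packArgb2222`, `expandArgb2222` and `CompactVertex` are hypothetical names, and the AARRGGBB bit order is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>

// Packs four 2-bit channels (each 0..3) into one byte, laid out as AARRGGBB.
std::uint8_t packArgb2222(unsigned a, unsigned r, unsigned g, unsigned b) {
    assert(a < 4 && r < 4 && g < 4 && b < 4);
    return static_cast<std::uint8_t>((a << 6) | (r << 4) | (g << 2) | b);
}

// Expands a packed byte to 8 bits per channel (0xAARRGGBB).
// Multiplying by 85 maps 0..3 onto 0, 85, 170, 255.
std::uint32_t expandArgb2222(std::uint8_t packed) {
    const std::uint32_t a = ((packed >> 6) & 3u) * 85u;
    const std::uint32_t r = ((packed >> 4) & 3u) * 85u;
    const std::uint32_t g = ((packed >> 2) & 3u) * 85u;
    const std::uint32_t b = (packed & 3u) * 85u;
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// Hypothetical compact vertex: position plus one colour byte.
struct CompactVertex {
    float x, y, z;
    std::uint8_t color;  // ARGB2222
};
```

One caveat on the design: on common ABIs `CompactVertex` is padded up to 16 bytes because of the float alignment, which is the same size as a position plus a 32-bit colour. The one-byte colour would only save memory if the colours were stored in a separate byte array or a palette.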

mongoose7 · Post by **mongoose7** » Sat Mar 07, 2015 11:44 pm

I think the cube looks OK, you just have to fix the winding. (You need to traverse the vertices clockwise from the outside. (I think it is clockwise.))

Cube_ · Post by **Cube_** » Sun Mar 08, 2015 3:19 am

Aight, I'm surprised my math is ok.
Looking over a wireframe I see a few issues

It seems some of my indices are wrong, some of the winding might be wrong as well, it's hard to tell.
My vertices are all in the right place at least so there's that (beats my last two attempts at generating cubes by a longshot).

Edit:
My cubes are now working fine with proper backface culling, it was indeed an issue with my indices.
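
Editor's sketch of what a corrected index list could look like for the eight-corner cube posted earlier; it is not the author's actual fix, which the thread does not show. Two things stand out in that earlier snippet: the second triangle of each face reuses an edge of the first instead of covering the other half of the quad, and `append` is given 32 as the index count although the array holds 36. The table below keeps the post's corner order and winds every triangle so that (b - a) x (c - a) points away from the cube centre, which agrees with the `{0,1,2,2,3,0}` quads and their normals in the first post (assumed here to be Irrlicht's front-facing convention). `kCorners`, `kCubeIndices`, `facesOutward` and `countOutwardTriangles` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// The eight corners in the same order as the post's vertex array.
const float kCorners[8][3] = {
    {-1, +1, -1}, {+1, +1, -1}, {+1, -1, -1}, {-1, -1, -1},
    {-1, +1, +1}, {+1, +1, +1}, {+1, -1, +1}, {-1, -1, +1},
};

// Two triangles per face; each pair covers the whole quad.
const std::uint16_t kCubeIndices[36] = {
    0, 1, 2,  0, 2, 3,   // z = -1
    5, 4, 7,  5, 7, 6,   // z = +1
    4, 5, 1,  4, 1, 0,   // y = +1
    3, 2, 6,  3, 6, 7,   // y = -1
    1, 5, 6,  1, 6, 2,   // x = +1
    4, 0, 3,  4, 3, 7,   // x = -1
};

// True when (b - a) x (c - a) points away from the cube centre (the origin).
bool facesOutward(const float* a, const float* b, const float* c) {
    const float u[3] = {b[0] - a[0], b[1] - a[1], b[2] - a[2]};
    const float v[3] = {c[0] - a[0], c[1] - a[1], c[2] - a[2]};
    const float n[3] = {u[1] * v[2] - u[2] * v[1],
                        u[2] * v[0] - u[0] * v[2],
                        u[0] * v[1] - u[1] * v[0]};
    // Dot the normal with the sum of the corners (the centroid direction).
    const float dot = n[0] * (a[0] + b[0] + c[0]) +
                      n[1] * (a[1] + b[1] + c[1]) +
                      n[2] * (a[2] + b[2] + c[2]);
    return dot > 0.0f;
}

// Counts the triangles in an index list whose winding faces outward.
int countOutwardTriangles(const std::uint16_t* indices, int indexCount) {
    assert(indexCount % 3 == 0);
    int outward = 0;
    for (int i = 0; i < indexCount; i += 3) {
        if (facesOutward(kCorners[indices[i]], kCorners[indices[i + 1]],
                         kCorners[indices[i + 2]]))
            ++outward;
    }
    return outward;
}
```

A table like this would be passed to `buf->append` together with its full length of 36 indices.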

Cube_ · Post by **Cube_** » Sun Mar 08, 2015 5:39 pm

I should hack the S3DVertex type to take an 8 bit ARGB value (using bit twiddling?) and to remove these things (that I either don't understand or don't use):

S3DVertex (f32 x, f32 y, f32 z,f32 nx, f32 ny, f32 nz, SColor c, f32 tu, f32 tv) //so wasteful o-o;;

f32 nx, f32 ny, f32 nz, (what are these for anyway?)
, f32 tu, f32 tv (I don't need UVs for my project and they just end up being a waste of memory)
S3DVertex (f32 x, f32 y, f32 z, SColor c) //this is all I need and it's still wasting memory, 32 bits for each channel? I only need 8 per channel since I am going for an 8-bit look with vibrant colors, besides with 8 bits per channel this STILL equals 32 bit color depth, I could easily do with 2 bits and get a nice 8-bit color depth .-. )

Really, this would give me an improvement of 35 bytes per vertex, even with an optimized mesh I can expect to have at least a few ten thousand verts visible at any time, most likely more (the tegra 3 isn't vertex starved and that's my minimum target spec, it can handle up to 1.5M verts (theoretically) with a decent vertex shader; in this worst case I save a whopping 52.5 megabytes, this is an insane improvement and as such I'd loooove that.... sadly the tegra 3 is fill rate starved, I'll need to write a nice fragment shader and probably employ upscaling techniques).

Granyte · Post by **Granyte** » Mon Mar 09, 2015 6:56 am

You would need the shader pipeline to do this; in it you can create your own vertex format.

Also, you could use something like PolyVox to generate a mesh based on a volume.

CuteAlien · Post by **CuteAlien** » Mon Mar 09, 2015 10:33 am

As long as you only have cubes you only need a single mesh. The rest could be nodes. Not saying this is faster (many nodes will render slowly again). Doing a cube world just by having real cubes all over the place is likely the wrong approach.

Cube_ · Post by **Cube_** » Mon Mar 09, 2015 2:10 pm

CuteAlien wrote: As long as you only have cubes you only need a single mesh. The rest could be nodes. Not saying this is faster (many nodes will render slowly again). Doing a cube world just by having real cubes all over the place is likely the wrong approach.

I am fully aware, but having the underlying logic be more efficient helps when I start generating only the required meshes and then applying optimizations (greedy meshing and whatnot, still not sure how I'd do this in irrlicht as I don't know how to edit the contents of an IAnimatedMesh but that's for later anywho)

I dare say though, 100 ms using 6.6 MB of RAM is rather optimized for a chunk size of 16^3, which I determined was the optimal chunk size for this project. That might change later if I decide to have bigger chunks that are optimized (split into regions? sorta like an octree per chunk; using one for the entire world is woefully slow). If I employ multithreading (which... should work since they'd be using the same device) I could generate n chunks at the same time; if each takes the same amount of time that'd be nice. My minimum target happens to have 4 hardware threads, so I'd probably do 4 chunks at a time (software threads should work fine for this too, so maybe I could do up to 8 at a time).

I really should switch to the OGL ES version anyway, I'm testing this on a mobile gpu and it supports OGL ES 2.0, pretty sure OGL ES would be more efficient than OGL 2.1 anyway (seeing it's a mobile gpu)
