Per Face Culling

agi_shi · Post by **agi_shi** » Sun Jun 15, 2008 11:20 am

Mirror wrote: @agi_shi: when you say "winding order culling", are you referring to backface culling? Specifically this flag: node->setMaterialFlag(video::EMF_BACK_FACE_CULLING, true); which, as far as I can understand, is described in this link: http://msdn.microsoft.com/en-us/library ... S.85).aspx
p.s. what's your graphics card ?

I don't call it back face culling because it can refer to front face culling as well (anti-clockwise). I'm using an 8800GTS. Combine your numFaces dot products with real-time VBOs being uploaded to the GPU, and there is absolutely no way that you can out-perform the GPU in culling. Like I said - GPUs these days will happily render millions of polygons practically "for free" - it's the fragment processing that matters.
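
For readers following along, the CPU-side test being compared against hardware culling amounts to one dot product per triangle. A minimal sketch in plain C++, not the code from the thread: `Vec3`, `facesCamera` and `cullFaces` are hypothetical names, and counter-clockwise front faces are an assumption (the real code would use Irrlicht types and whatever winding the mesh uses).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal vector type standing in for an engine vector class.
struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// A triangle faces the camera when the camera lies on the side its normal points to.
// Assumption of this sketch: front faces are wound counter-clockwise.
bool facesCamera(Vec3 a, Vec3 b, Vec3 c, Vec3 cameraPos) {
    const Vec3 normal = cross(sub(b, a), sub(c, a));
    return dot(normal, sub(cameraPos, a)) > 0.0f;
}

// Builds a new index list that holds only the camera-facing triangles.
std::vector<unsigned short> cullFaces(const std::vector<Vec3>& vertices,
                                      const std::vector<unsigned short>& indices,
                                      Vec3 cameraPos) {
    assert(indices.size() % 3 == 0);
    std::vector<unsigned short> visible;
    visible.reserve(indices.size());  // worst case: nothing is culled
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        if (facesCamera(vertices[indices[i]], vertices[indices[i + 1]],
                        vertices[indices[i + 2]], cameraPos)) {
            visible.push_back(indices[i]);
            visible.push_back(indices[i + 1]);
            visible.push_back(indices[i + 2]);
        }
    }
    return visible;
}
```

The loop runs once per face every frame and the resulting index list has to be handed to the GPU again, which is the cost the post above is pointing at.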

Mirror · Post by **Mirror** » Sun Jun 15, 2008 2:11 pm

hybrid wrote: Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.

Yes, it seems that hybrid is right on this, and agi_shi as well. The benefit from this will come only when a low-end graphics card is paired with an average-to-good CPU.

the results from shadowslair and agi_shi are perfectly explained.

agi_shi :

497 FPS with per face culling
594 FPS with native hardware vertex winding culling

It's obvious that his graphics card (8800 GTS) is so good that it can render 39600 polys at 594 FPS. Also, since the calculation of whether the polys should be culled or not takes about 2 milliseconds, he gets about 500 FPS with per-face culling: 500 x 2 ms = 1000 ms = 1 second. So it doesn't help here, because his graphics card is too good and the frame rate gets limited by the CPU.

shadowslair wrote: This is interesting...

I've got an average of 209 FPS for the hardware and about 135 FPS in the other executable. Test made on AMD k7 1,14 512DDR 128 Ati Radeon.

This is also perfectly explained. His CPU is too slow and it takes several milliseconds to make the loop calculation. My guess is that the cycle takes about 1000/135 = 7.4 milliseconds, so I suppose dt reads 7 or 8 in the debug text. shadowslair, can you please check this?
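
A side note on the arithmetic: 1000/fps is the time for a whole frame, so the share attributable to the CPU culling is probably better estimated from the difference between the two runs. A small C++ sketch of that calculation; the function names are made up for illustration, and it assumes the two executables differ only in the culling step.

```cpp
#include <cassert>

// Converts a frame rate into the time spent on one frame, in milliseconds.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame when the frame rate drops from fasterFps to slowerFps.
double extraCostMs(double fasterFps, double slowerFps) {
    return frameTimeMs(slowerFps) - frameTimeMs(fasterFps);
}
```

With the numbers reported in this thread, that comes to roughly 0.33 ms per frame on agi_shi's machine (594 vs 497 FPS) and roughly 2.6 ms on shadowslair's (209 vs 135 FPS).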

Also, I noticed that I'm calling cam->getPosition() 20.000 times / millisecond, so I moved it out into the main loop, thus improving the calculation speed a bit. I produced a new .exe. shadowslair/agi_shi, would you be kind enough to test the FPS with the new .exe? Here is the link:

http://irrlichtirc.g0dsoft.com/Ogami_It ... better.exe

though i don't expect this will produce a spectacular improvement.
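
The change described above is ordinary loop-invariant hoisting. A minimal C++ sketch, assuming a hypothetical `Camera` type that counts its position queries; to keep it short, the facing test is reduced to a dot product between a face normal and the camera position, with every face taken to sit at the origin.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical stand-in for a camera node; counts how often its position is queried.
struct Camera {
    Vec3 position{0.0f, 0.0f, 5.0f};
    mutable std::size_t queries = 0;
    Vec3 getPosition() const { ++queries; return position; }
};

// Before: the camera is asked for its position once per face.
std::size_t countFacingPerFaceQuery(const Camera& cam, const std::vector<Vec3>& normals) {
    std::size_t facing = 0;
    for (const Vec3& n : normals) {
        const Vec3 p = cam.getPosition();  // loop-invariant call repeated every iteration
        if (n.x * p.x + n.y * p.y + n.z * p.z > 0.0f) ++facing;
    }
    return facing;
}

// After: the position is read once and reused for every face.
std::size_t countFacingHoisted(const Camera& cam, const std::vector<Vec3>& normals) {
    const Vec3 p = cam.getPosition();      // hoisted out of the loop
    std::size_t facing = 0;
    for (const Vec3& n : normals) {
        if (n.x * p.x + n.y * p.y + n.z * p.z > 0.0f) ++facing;
    }
    return facing;
}
```

Both versions return the same count; only the number of `getPosition()` calls changes, from one per face to one per frame.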

PI · Post by PI » Sun Jun 15, 2008 5:27 pm

I've tested it again, now with the PerFaceCulling_better.exe, and here is how it has changed:

PerFaceCulling_better.exe:
At the startup screen, 490 FPS, 7400 tris rendered.
Inside the sphere, 980 FPS, 0 tris rendered.
No matter how far from the sphere, but facing it, not going under 430 FPS.

If you look at my previous post, you'll see the results now outperform HW culling. Cool.

Cheers,
PI

agi_shi · Post by **agi_shi** » Sun Jun 15, 2008 5:53 pm

Yup, definitely faster now. ~590FPS with the per-face culling, which is just about on-par with the native GPU culling. (all of this is with the whole sphere in view)

Agreed, this technique would rather help out on powerful CPUs with weak GPUs - but newer GPUs definitely have the raw power to rip through a large number of triangles pretty much "for free". I'm a bit harsh on older technology.

Anyways, I suspect this will help out the most with LOTS of forward fragment processing. But at the same time, it will scale absolutely horribly as the scene grows. So, it's really a trade off between CPU and GPU power, depends on your target.

Mirror · Post by **Mirror** » Sun Jun 15, 2008 9:03 pm

i believe i can greatly enhance the performance by lowering the cost of the "if"

instead of

buffer->Indices.push_back(indicesc[i]);
buffer->Indices.push_back(indicesc[i+1]);
buffer->Indices.push_back(indicesc[i+2]);

I could use something like buffer->Indices.push_back(indices, 3), but there is no such method.
Furthermore, visible (V) and not visible (N) faces are not interleaved one by one; they come in chunks, for example:

VVVVVVVVVVNNNNNNNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVVVVVVNNNNNNNNNN

So a method like push_back(indices, visible_polys_chunk_count) could become VERY handy: it would lower the cost a lot, and it would probably make this culling method worthwhile even for high-end graphics cards.

If someone could code such a push_back, it would be very nice, as it would help a lot.
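
The chunked append asked for here could look roughly like the following. This is a sketch with `std::vector` standing in for the engine's index array (its range `insert` plays the role of the missing `push_back(indices, count)`); the function name and the `visible` flag array are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends the indices of visible faces to `out`, one contiguous run at a time.
// visible[f] says whether face f passed the culling test; each face owns 3 indices.
void appendVisibleRuns(const std::vector<unsigned short>& indices,
                       const std::vector<bool>& visible,
                       std::vector<unsigned short>& out) {
    assert(indices.size() == visible.size() * 3);
    const std::size_t faceCount = visible.size();
    std::size_t f = 0;
    while (f < faceCount) {
        if (!visible[f]) { ++f; continue; }          // skip culled faces
        const std::size_t runStart = f;
        while (f < faceCount && visible[f]) ++f;     // extend the run of visible faces
        // One range insert copies the whole run instead of three push_backs per face.
        out.insert(out.end(),
                   indices.begin() + static_cast<std::ptrdiff_t>(runStart * 3),
                   indices.begin() + static_cast<std::ptrdiff_t>(f * 3));
    }
}
```

Whether this beats per-index push_back in practice would need measuring; the per-face visibility test still runs for every face, only the copying is batched.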

p.s. what's forward fragment processing?

agi_shi · Post by **agi_shi** » Mon Jun 16, 2008 12:02 am

Mirror wrote: p.s. whats forward fragment processing ?

Fragment shading done on the fly. Meaning, when you render your objects, you render them with all of your lighting and shadowing shaders and what-not right on the spot. With deferred shading, on the other hand, you basically render all of the geometry to a texture, and then all of the fragment processing effectively becomes a post process. That is, regardless of your geometry, lighting and shadowing for a single light takes constant time (when not culling the light for extra performance), since it is done on the screen, not on the geometry (thus, only on what is visible).
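
To put rough numbers on that difference, here is a toy cost model in C++. It only counts lighting evaluations, and it assumes that every rasterized fragment gets shaded in the forward case (no early-Z rejection) and that the deferred geometry pass costs nothing, neither of which holds exactly on real hardware. The function names are invented for this sketch.

```cpp
#include <cassert>

// Forward shading: every rasterized fragment, including ones that are later
// overdrawn, is lit by every light.
long long forwardLightingEvals(long long screenPixels, int overdraw, int lights) {
    assert(screenPixels >= 0 && overdraw >= 1 && lights >= 0);
    return screenPixels * overdraw * lights;
}

// Deferred shading: lighting runs as a screen-space pass, so each final pixel
// is lit once per light no matter how much geometry was drawn behind it.
long long deferredLightingEvals(long long screenPixels, int lights) {
    assert(screenPixels >= 0 && lights >= 0);
    return screenPixels * lights;
}
```

For a 1024x768 target with an average overdraw of 3 and 4 lights, the model gives 9,437,184 evaluations for forward shading against 3,145,728 for deferred.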

hybrid · Post by **hybrid** » Mon Jun 16, 2008 8:07 am

Mirror, you should simply reallocate the necessary memory region, probably just use the size of the original mesh. Then, push_back is almost for free.
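
A sketch of that suggestion, with `std::vector::reserve` standing in for the engine array's reallocation call (an assumption of this example; the function name is also made up). The point is that the storage is sized once for the worst case, so the per-face push_back never has to grow the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rebuilds the visible index list without allocating inside the loop.
// `out` is reused from frame to frame and keeps its capacity.
void rebuildIndices(const std::vector<unsigned short>& original,
                    const std::vector<bool>& visible,
                    std::vector<unsigned short>& out) {
    assert(original.size() == visible.size() * 3);
    out.clear();                    // drops the elements, keeps the capacity
    out.reserve(original.size());   // worst case: every face is visible
    for (std::size_t f = 0; f < visible.size(); ++f) {
        if (!visible[f]) continue;
        out.push_back(original[3 * f]);
        out.push_back(original[3 * f + 1]);
        out.push_back(original[3 * f + 2]);
    }
}
```

After the first frame the buffer already has room for the full mesh, so later frames only write into existing memory.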

Mirror · Post by **Mirror** » Mon Jun 16, 2008 12:09 pm

I haven't tested VBOs yet, but I suspect that with VBOs rendering will be so fast that it will not matter at all how fast the culling becomes, even if it takes 1 microsecond, as the CPU<->GPU bus (AGP/PCI Express) will always be slower than the GPU<->GPU memory bus. If this is true, I don't see any reason to continue the effort of optimizing it.

Dorth · Post by **Dorth** » Mon Jun 16, 2008 3:04 pm

Because not everyone will be able to use VBOs, be it because of old cards, or simply because they are not making a game (or are making a weird one) and their meshbuffers are always changing. ^^

PI · Post by PI » Mon Jun 16, 2008 3:24 pm

I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence there this technique could be handy.

Also, it'd be a great alternate solution for lower-end computers.

What do you think?

Cheers,
PI

Nadro · Post by **Nadro** » Mon Jun 16, 2008 9:22 pm

PI wrote: I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence there this technique could be handy.

Also, it'd be a great alternate solution for lower-end computers.

What do you think?

Cheers,
PI

I agree with PI 100% :)

agi_shi · Post by **agi_shi** » Tue Jun 17, 2008 12:40 am

PI wrote: I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence there this technique could be handy.

Also, it'd be a great alternate solution for lower-end computers.

What do you think?

Cheers,
PI

Doesn't matter, VBOs outperform mesh buffers of any kind regardless of whether they're changing or not. There's that, and anyone who wants performance would surely do hardware skinning.

hybrid · Post by **hybrid** » Tue Jun 17, 2008 7:35 am

Try this culling with the software renderers, maybe it's working with them. Since you'd trade CPU time for CPU time it could be an immediate win for the renderers.

BlindSide · Post by **BlindSide** » Tue Jun 17, 2008 9:20 am

Yes I did this for a software ray tracer and it was a great optimization.

varmint · Post by **varmint** » Thu Jun 19, 2008 7:58 pm

So very kewl idea!! I was wanting to do something like this for our animation system. So I stuck it in during our weight calculation code to avoid any extra loops.

On a Quad Core I was getting 30 extra FPS than normal. So definitely faster on a quad core. On a dual core I get -2 FPS and on a single core about -20 FPS.

I'm thinking of trying this code on the GPU to see what the results may be.

Thx
V
