Per Face Culling

agi_shi
Posts: 122
Joined: Mon Feb 26, 2007 12:46 am

Post by agi_shi »

Mirror wrote: @agi_shi: when you say "winding order culling", are you referring to backface culling? Specifically this flag: node->setMaterialFlag(video::EMF_BACK_FACE_CULLING, true); which, as far as I can understand, is described at this link: http://msdn.microsoft.com/en-us/library ... S.85).aspx
P.S. What's your graphics card?
I don't call it back-face culling because winding-order culling can just as well cull front faces (anti-clockwise). I'm using an 8800 GTS. Combine your numFaces dot products with real-time VBOs being uploaded to the GPU, and there is absolutely no way that you can outperform the GPU in culling. Like I said - GPUs these days will happily render millions of polygons practically "for free" - it's the fragment processing that matters.
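For readers following along, the "numFaces dot products" mentioned above boil down to one facing test per triangle. Here is a minimal sketch of that test, not the code from the posted .exe: `Vec3` and `facesCamera` are made-up names standing in for Irrlicht's core::vector3df and whatever the demo actually does, and the winding convention (front side is where the cross product points) is an assumption.

```cpp
#include <cassert>

// Minimal 3D vector; a hypothetical stand-in for Irrlicht's core::vector3df.
struct Vec3 {
    float x, y, z;
};

Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// True when the camera lies on the side of triangle (v0, v1, v2) that the
// cross product (v1 - v0) x (v2 - v0) points toward. Which side counts as
// "front" depends on the winding convention; this choice is an assumption.
bool facesCamera(const Vec3& v0, const Vec3& v1, const Vec3& v2,
                 const Vec3& camPos) {
    const Vec3 normal = cross(v1 - v0, v2 - v0);  // unnormalized face normal
    return dot(normal, camPos - v0) > 0.0f;       // one dot product per face
}
```

If the mesh uses the opposite winding, flipping the comparison to `< 0.0f` gives the other convention. Either way the cost is one cross product and one dot product per face on the CPU, every frame.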
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

hybrid wrote: Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.
Yes, it seems hybrid is right on this, and agi_shi as well. The benefit will only show up when a low-end graphics card is paired with an average-to-good CPU.

This explains the results from shadowslair and agi_shi perfectly.

agi_shi :
497 FPS with per face culling
594 FPS with native hardware vertex winding culling
It's obvious that his graphics card (8800 GTS) is so good that it can render 39600 polys at 594 FPS. Also, calculating whether the polys should be culled or not takes about 2 milliseconds, which is why he gets about 500 FPS with per-face culling: 500 x 2 ms = 1000 ms = 1 second. So it doesn't help here, because his graphics card is too good and the frame rate ends up limited by the CPU.
shadowslair :
This is interesting...

I've got an average of 209 fps for the hardware and about 135 fps in the other executable. Test made on AMD k7 1,14 512DDR 128 Ati Radeon.
This is also perfectly explained. His CPU is too slow and it takes several milliseconds to run the loop calculation. My guess is that the cycle takes about 1000/135 ≈ 7.4 milliseconds, so I suppose dt = 7 or 8 in the debug text. shadowslair, can you please check this?
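The FPS-to-milliseconds arithmetic used above can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and the second function assumes per-frame costs simply add up, which ignores any overlap between CPU and GPU work.

```cpp
#include <cassert>
#include <cmath>

// Time spent on one frame, in milliseconds, at a given frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame when the frame rate drops from fpsWithout to
// fpsWith. A rough estimate only; it assumes the added work is purely additive.
double addedCostMs(double fpsWithout, double fpsWith) {
    return frameTimeMs(fpsWith) - frameTimeMs(fpsWithout);
}
```

With the numbers quoted in this thread, `frameTimeMs(135)` is about 7.4 ms, `addedCostMs(209, 135)` is about 2.6 ms, and `addedCostMs(594, 497)` is about 0.33 ms.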

Also, I noticed that I'm calling cam->getPosition() 20,000 times / millisecond, so I moved it out of the per-face loop into the main loop, which speeds up the calculation a bit. I produced a new .exe. shadowslair/agi_shi, would you be kind enough to test the FPS with the new .exe? Here is the link:
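The change described here is plain loop-invariant hoisting: fetch the camera position once per frame instead of once per face. A rough sketch of the idea follows, assuming 16-bit indices and a counter-style `Camera` stand-in; `Vec3`, `Camera` and `cullFaces` are hypothetical names, not Irrlicht classes and not the code from the .exe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Hypothetical camera stand-in that counts how often its position is queried.
struct Camera {
    Vec3 position;
    mutable int calls;
    explicit Camera(Vec3 p) : position(p), calls(0) {}
    Vec3 getPosition() const { ++calls; return position; }
};

// Returns the indices of all triangles facing the camera.
std::vector<unsigned short> cullFaces(const std::vector<Vec3>& vertices,
                                      const std::vector<unsigned short>& indices,
                                      const Camera& cam) {
    assert(indices.size() % 3 == 0);
    const Vec3 camPos = cam.getPosition();  // hoisted: fetched once, not per face
    std::vector<unsigned short> visible;
    visible.reserve(indices.size());        // worst case: every face is visible
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const Vec3& v0 = vertices[indices[i]];
        const Vec3& v1 = vertices[indices[i + 1]];
        const Vec3& v2 = vertices[indices[i + 2]];
        if (dot(cross(v1 - v0, v2 - v0), camPos - v0) > 0.0f) {
            visible.push_back(indices[i]);
            visible.push_back(indices[i + 1]);
            visible.push_back(indices[i + 2]);
        }
    }
    return visible;
}
```

The `calls` counter exists only to make the effect visible: after one call to `cullFaces` it reads 1, no matter how many faces the mesh has.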

http://irrlichtirc.g0dsoft.com/Ogami_It ... better.exe

Though I don't expect this to produce a spectacular improvement.
PI
Posts: 176
Joined: Tue Oct 09, 2007 7:15 pm
Location: Hungary

Post by PI »

I've tested it again, now with the PerFaceCulling_better.exe, and here is how it has changed:

PerFaceCulling_better.exe:
At the startup screen, 490 FPS, 7400 tris rendered.
Inside the sphere, 980 FPS, 0 tris rendered.
No matter how far from the sphere, but facing it, not going under 430 FPS.

If you look at my previous post, you'll see the results now outperform HW culling. Cool. :D

Cheers,
PI
agi_shi
Posts: 122
Joined: Mon Feb 26, 2007 12:46 am

Post by agi_shi »

Yup, definitely faster now. ~590 FPS with the per-face culling, which is just about on par with the native GPU culling. (All of this is with the whole sphere in view.)

Agreed, this technique mostly helps on powerful CPUs paired with weak GPUs - but newer GPUs definitely have the raw power to rip through a large number of triangles pretty much "for free". I'm a bit harsh on older technology :lol:.

Anyways, I suspect this will help out the most with LOTS of forward fragment processing. But at the same time, it will scale absolutely horribly as the scene grows. So it's really a trade-off between CPU and GPU power, and it depends on your target.
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

I believe I can greatly enhance the performance by lowering the cost of the "if".

instead of

Code:

buffer->Indices.push_back(indicesc[i]);
buffer->Indices.push_back(indicesc[i+1]);
buffer->Indices.push_back(indicesc[i+2]);
I could use something like buffer->Indices.push_back(indices, 3), but there is no such method.
Furthermore, visible (V) and not-visible (N) faces do not alternate one by one; they come in chunks, for example:

VVVVVVVVVVNNNNNNNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVVVVVVNNNNNNNNNN

So a method like push_back(indices, visible_polys_chunk_count) could become VERY handy: it would lower the cost a lot, and would probably make this culling method worthwhile even on high-end graphics cards.

If someone could code that kind of push_back, it would help a lot.
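One way the chunked append could look is sketched below. It uses std::vector as a stand-in for buffer->Indices, with std::vector::insert over an iterator range playing the role of the wished-for push_back(indices, count). The function name and the `faceVisible` flags are assumptions made for illustration; the per-face visibility test still has to run for every face, so only the copying gets cheaper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends the indices of all visible triangles to `out`, copying each run of
// consecutive visible triangles with a single range insert.
// faceVisible[f] says whether triangle f (indices 3f .. 3f+2) is visible.
void appendVisibleRuns(const std::vector<unsigned short>& indices,
                       const std::vector<bool>& faceVisible,
                       std::vector<unsigned short>& out) {
    assert(indices.size() == faceVisible.size() * 3);
    const std::size_t faceCount = faceVisible.size();
    std::size_t f = 0;
    while (f < faceCount) {
        if (!faceVisible[f]) {  // skip culled faces (the "N" stretches)
            ++f;
            continue;
        }
        const std::size_t runStart = f;
        while (f < faceCount && faceVisible[f]) {
            ++f;                // extend the run of visible faces ("V" stretch)
        }
        out.insert(out.end(),
                   indices.begin() + static_cast<std::ptrdiff_t>(3 * runStart),
                   indices.begin() + static_cast<std::ptrdiff_t>(3 * f));
    }
}
```

For a VVNV pattern over twelve indices this copies two ranges (indices 0..5 and 9..11) instead of making nine separate push_back calls. Whether that is measurably faster would need testing.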

P.S. What's forward fragment processing?
agi_shi
Posts: 122
Joined: Mon Feb 26, 2007 12:46 am

Post by agi_shi »

Mirror wrote: P.S. What's forward fragment processing?
Fragment shading done on the fly. Meaning, when you render your objects, you render them with all of your lighting and shadowing shaders and what-not right on the spot. With deferred shading, on the other hand, you basically render all of the geometry to a texture, and then all of the fragment processing effectively becomes a post process. That is, regardless of your geometry, lighting and shadowing for a single light takes constant time (when not culling the light for extra performance), since it is done on the screen, not on the geometry (thus, only on what is visible).
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Mirror, you should simply reallocate the necessary memory region up front, probably just using the size of the original mesh. Then push_back is almost free.
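A sketch of that idea, assuming it means reserving capacity once for the worst case (every face visible). std::vector stands in for the mesh buffer's index array here, and `prepareIndexBuffer` is a made-up helper name; Irrlicht's own array class would use its own calls for this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Empties `out` and makes sure it can hold `maxIndexCount` entries without
// growing, so the push_back calls that follow never trigger a reallocation.
void prepareIndexBuffer(std::vector<unsigned short>& out,
                        std::size_t maxIndexCount) {
    out.clear();                 // drops the elements, keeps the allocation
    out.reserve(maxIndexCount);  // does nothing once capacity is large enough
}
```

Called once per frame before the culling loop with the index count of the original mesh, the allocation happens on the first frame only; after that each push_back is a bounds check plus a store.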
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

I haven't tested VBOs yet, but I suspect that with VBOs rendering will be so fast that it will not matter at all how fast the culling becomes, even if it takes 1 microsecond, as the CPU<->GPU bus (AGP/PCI Express) will always be slower than the GPU<->GPU memory bus. If this is true, I don't see any reason to continue the effort of optimizing it :D
Dorth
Posts: 931
Joined: Sat May 26, 2007 11:03 pm

Post by Dorth »

Because not everyone will be able to use VBOs, be it because of old cards, or simply because they are not making a game (or are making a weird one) and their mesh buffers are always changing. ^^
PI
Posts: 176
Joined: Tue Oct 09, 2007 7:15 pm
Location: Hungary

Post by PI »

I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence this technique could be handy there.

Also, it'd be a great alternative solution for lower-end computers.

What do you think?

Cheers,
PI
Nadro
Posts: 1648
Joined: Sun Feb 19, 2006 9:08 am
Location: Warsaw, Poland

Post by Nadro »

PI wrote:I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

I agree with PI 100% :)
agi_shi
Posts: 122
Joined: Mon Feb 26, 2007 12:46 am

Post by agi_shi »

PI wrote:I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?

Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence this technique could be handy there.

Doesn't matter, VBOs outperform mesh buffers of any kind regardless of whether they're changing or not. There's that, and anyone who wants performance would surely do hardware skinning.
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Try this culling with the software renderers, maybe it works with them. Since you'd trade CPU time for CPU time, it could be an immediate win for those renderers.
BlindSide
Admin
Posts: 2821
Joined: Thu Dec 08, 2005 9:09 am
Location: NZ!

Post by BlindSide »

Yes I did this for a software ray tracer and it was a great optimization.
varmint
Posts: 46
Joined: Fri Oct 06, 2006 4:33 pm

Post by varmint »

So very kewl idea!! I was wanting to do something like this for our animation system. So I stuck it into our weight calculation code to avoid any extra loops.

On a quad core I was getting 30 extra FPS compared to normal. So definitely faster on a quad core. On a dual core I get -2 FPS and on a single core about -20 FPS.

I'm thinking of trying this code on the GPU to see what the results may be.

Thx
V :D