I don't call it back-face culling because it can refer to front-face culling as well (anti-clockwise winding). I'm using an 8800 GTS. Combine your numFaces dot products with real-time VBOs being uploaded to the GPU, and there is absolutely no way that you can out-perform the GPU in culling. Like I said - GPUs these days will happily render millions of polygons practically "for free" - it's the fragment processing that matters.

Mirror wrote: @agi_shi: when you say "winding order culling", are you referring to back-face culling - node->setMaterialFlag(video::EMF_BACK_FACE_CULLING, true); - specifically this flag? Which, as far as I can understand, is described in this link: http://msdn.microsoft.com/en-us/library ... S.85).aspx
p.s. what's your graphics card?
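(For readers following along, here is a minimal sketch of what the CPU-side per-face culling under discussion looks like: one cross product and one dot product per triangle, keeping only the faces that point towards the camera. The function name buildVisibleIndices and the visibility convention are illustrative assumptions, not code from this thread.)

Code: Select all
#include <irrlicht.h>
using namespace irr;

// Rebuild an index list containing only triangles that face the camera.
void buildVisibleIndices(const video::S3DVertex* vertices,
                         const u16* indices, u32 indexCount,
                         const core::vector3df& camPos,
                         core::array<u16>& out)
{
    out.set_used(0); // reuse the array's memory across frames

    for (u32 i = 0; i < indexCount; i += 3)
    {
        const core::vector3df& a = vertices[indices[i    ]].Pos;
        const core::vector3df& b = vertices[indices[i + 1]].Pos;
        const core::vector3df& c = vertices[indices[i + 2]].Pos;

        // Face normal from the winding order; no need to normalise,
        // since only the sign of the dot product matters.
        const core::vector3df normal = (b - a).crossProduct(c - a);

        if (normal.dotProduct(a - camPos) < 0.f) // facing the camera
        {
            out.push_back(indices[i]);
            out.push_back(indices[i + 1]);
            out.push_back(indices[i + 2]);
        }
    }
}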
Per Face Culling
Yes, it seems that hybrid is right on this, and agi_shi as well. The benefit from this will come only on low-end graphics cards paired with average-to-good CPUs.

hybrid wrote: Remember that it only seems to be useful if the app is bandwidth-limited from CPU to GPU, or if it is really poly-limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.
The results from shadowslair and agi_shi are perfectly explained.
agi_shi:
It's obvious that his graphics card (8800 GTS) is so good that it can render 39,600 polys at 594 FPS. Also, since the culling calculation pushes the frame time up to about 2 milliseconds, he gets about 500 FPS with per-face culling (500 frames x 2 ms = 1000 ms = 1 second). So it doesn't help here, because his graphics card is too good and it gets limited by the CPU.

497 FPS with per face culling
594 FPS with native hardware vertex winding culling
This is also perfectly explained: his CPU is too slow, and it takes several milliseconds to run the loop calculation. My guess is that one cycle takes about 1000/135 ≈ 7.4 milliseconds, so I suppose dt shows 7 or 8 in the debug text. shadowslair, can you please check this?

This is interesting...
I've got an average of 209 FPS with the hardware culling and about 135 FPS with the other executable. Test made on an AMD K7 1.14 GHz, 512 MB DDR, 128 MB ATI Radeon.
Also, I noticed that I was calling cam->getPosition() 20,000 times per millisecond, so I moved it out into the main loop, improving the calculation speed a bit. I produced a new .exe. shadowslair/agi_shi, would you be kind enough to test the FPS with the new .exe? Here is the link:
http://irrlichtirc.g0dsoft.com/Ogami_It ... better.exe
Though I don't expect this to produce a spectacular improvement.
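(The change being described is just hoisting a loop-invariant call out of the per-face loop; roughly this, with illustrative names:)

Code: Select all
// Before: cam->getPosition() was called once per face, every frame.
// After: the camera position is loop-invariant, so fetch it once.
const core::vector3df camPos = cam->getPosition();
for (u32 i = 0; i < indexCount; i += 3)
{
    // ... the same per-face visibility test as before,
    // now using the cached camPos instead of cam->getPosition() ...
}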
Re:
I've tested it again, now with the PerFaceCulling_better.exe, and here is how it has changed:
PerFaceCulling_better.exe:
At the startup screen, 490 FPS, 7400 tris rendered.
Inside the sphere, 980 FPS, 0 tris rendered.
No matter how far from the sphere, as long as I'm facing it, it doesn't go under 430 FPS.
If you look at my previous post, you'll see the results now outperform HW culling. Cool.
Cheers,
PI
Yup, definitely faster now. ~590 FPS with the per-face culling, which is just about on par with the native GPU culling. (All of this is with the whole sphere in view.)
Agreed, this technique would rather help out on powerful CPUs with weak GPUs - but newer GPUs definitely have the raw power to rip through a large number of triangles pretty much "for free". I'm a bit harsh on older technology.
Anyways, I suspect this will help out the most with LOTS of forward fragment processing. But at the same time, it will scale absolutely horribly as the scene grows. So it's really a trade-off between CPU and GPU power; it depends on your target.
I believe I can greatly enhance the performance by lowering the cost of the "if".
Instead of

Code: Select all
buffer->Indices.push_back(indicesc[i]);
buffer->Indices.push_back(indicesc[i+1]);
buffer->Indices.push_back(indicesc[i+2]);

I could use something like buffer->Indices.push_back(indices, 3), but there is no such method.
Furthermore, visible and not-visible faces do not come one after the other; rather, they come in chunks, for example:
VVVVVVVVVVNNNNNNNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNVVVVVVVVVVVVVVVVVVNNNNNNNNNN
So a method like push_back(indices, visible_polys_chunk_count) could come in VERY handy: it would lower the cost a lot, making this method much better, and would probably make this culling approach good even for high-end graphics cards.
If someone could code such a push_back it would be very nice, as it would help a lot.
p.s. what's forward fragment processing?
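(Even without a new method in core::array, runs of visible faces can already be appended in one grow-and-copy. A minimal sketch, assuming Irrlicht's core::array::set_used() reallocates when growing and pointer() exposes the raw storage; pushBackRange, appendVisibleChunks and faceVisible are illustrative names, not existing API:)

Code: Select all
#include <irrlicht.h>
#include <cstring> // memcpy
using namespace irr;

// Append 'count' indices with one grow and one copy instead of
// one push_back per element.
static void pushBackRange(core::array<u16>& dst, const u16* src, u32 count)
{
    const u32 oldSize = dst.size();
    dst.set_used(oldSize + count); // grows the allocation if needed
    memcpy(dst.pointer() + oldSize, src, count * sizeof(u16));
}

// Copy whole runs ("chunks") of visible triangles in one memcpy each.
void appendVisibleChunks(core::array<u16>& indicesOut, const u16* indicesc,
                         const bool* faceVisible, u32 numFaces)
{
    u32 face = 0;
    while (face < numFaces)
    {
        if (!faceVisible[face]) { ++face; continue; } // skip an N-run

        const u32 runStart = face;
        while (face < numFaces && faceVisible[face])  // extend the V-run
            ++face;

        pushBackRange(indicesOut,
                      &indicesc[runStart * 3],        // 3 indices per face
                      (face - runStart) * 3);
    }
}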
Fragment shading done on the fly. Meaning, when you render your objects, you render them with all of your lighting and shadowing shaders and whatnot right on the spot. With deferred shading, on the other hand, you basically render all of the geometry to a texture first, and then all of the fragment processing effectively becomes a post-process. That is, regardless of your geometry, lighting and shadowing for a single light takes constant time (when not culling the light for extra performance), since it is done on the screen, not on the geometry (thus, only on what is visible).

Mirror wrote: p.s. what's forward fragment processing?
I haven't tested VBOs yet, but I suspect that with VBOs rendering will be so fast that it will not matter at all how fast the culling becomes, even if it takes 1 microsecond, as the CPU<->GPU bus (AGP/PCI Express) will always be slower than the GPU<->GPU-memory bus. If this is true, I don't see any reason to continue the effort of optimizing it.
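(For reference: in Irrlicht, VBOs are opt-in through hardware mapping hints, and - if my reading of the API is right, which is worth double-checking against your Irrlicht version - the two approaches can coexist by keeping the vertices static on the GPU and re-uploading only the rebuilt index buffer:)

Code: Select all
// Vertices never change: upload them to the GPU once.
// Indices are rebuilt by the CPU culling: mark them as streaming.
scene::IMeshBuffer* mb = mesh->getMeshBuffer(0);
mb->setHardwareMappingHint(scene::EHM_STATIC, scene::EBT_VERTEX);
mb->setHardwareMappingHint(scene::EHM_STREAM, scene::EBT_INDEX);

// ... after rewriting the index list with only the visible faces ...
mb->setDirty(scene::EBT_INDEX); // re-upload just the indices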
Re:
I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?
Because I think animated meshes could benefit from this. Tell me if it's nonsense. Animated meshes are changing every frame, so - I guess - they won't benefit from VBOs at all. Hence this technique could be handy there.
Also, it'd be a great alternate solution for lower-end computers.
What do you think?
Cheers,
PI
Re:
I agree with PI 100% :)

PI wrote: I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?
Library helping with network requests, tasks management, logger etc in desktop and mobile apps: https://github.com/GrupaPracuj/hermes
Re:
Doesn't matter; VBOs outperform plain mesh buffers of any kind, regardless of whether they're changing or not. There's that, and anyone who wants performance would surely do hardware skinning.

PI wrote: I'd encourage you to optimise it! Also, I'd like to see this integrated into Irrlicht. Why?
Yes, I did this for a software ray tracer, and it was a great optimization.
ShadowMapping for Irrlicht!: Get it here
Need help? Come on the IRC!: #irrlicht on irc://irc.freenode.net
So very kewl idea!! I was wanting to do something like this for our animation system, so I stuck it into our weight-calculation code to avoid any extra loops.
On a Quad Core I was getting 30 extra FPS over normal, so it's definitely faster on a quad core. On a dual core I get -2 FPS, and on a single core about -20 FPS.
I'm thinking of trying this code on the GPU to see what the results may be.
Thx
V
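(The scaling with core count suggests the per-face loop parallelises well: each face's test is independent, so the loop can be split across threads with no locking. A minimal sketch using C++11 std::thread for illustration - not the poster's actual code - with cullFacesParallel and faceVisible as assumed names:)

Code: Select all
#include <irrlicht.h>
#include <thread>
#include <vector>
using namespace irr;

// Fill faceVisible[] in parallel; each thread tests a contiguous slice
// of faces. No shared writes to the same element, so no locks needed.
void cullFacesParallel(const video::S3DVertex* vertices, const u16* indices,
                       u32 numFaces, const core::vector3df& camPos,
                       bool* faceVisible, u32 numThreads)
{
    std::vector<std::thread> pool;
    const u32 slice = (numFaces + numThreads - 1) / numThreads;

    for (u32 t = 0; t < numThreads; ++t)
    {
        const u32 begin = t * slice;
        const u32 end = core::min_(begin + slice, numFaces);
        pool.emplace_back([=]()
        {
            for (u32 f = begin; f < end; ++f)
            {
                const core::vector3df& a = vertices[indices[f * 3    ]].Pos;
                const core::vector3df& b = vertices[indices[f * 3 + 1]].Pos;
                const core::vector3df& c = vertices[indices[f * 3 + 2]].Pos;
                const core::vector3df n = (b - a).crossProduct(c - a);
                faceVisible[f] = n.dotProduct(a - camPos) < 0.f;
            }
        });
    }
    for (std::thread& th : pool)
        th.join();
}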