Per Face Culling

Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm


Post by Mirror »

Hi. This is my first attempt (or rather second) at posting some code of mine, so don't be too harsh on me, I'm still a n00b inside!

OK, in this app I do some polygon (triangle) culling based on whether the triangles are facing the camera or not; the triangles that face away are culled. I have included in the file a sphere with 39,600 polys. With standard rendering all of them are rendered, but with per face culling (that's what I call it now; I used to call it "backface culling" but renamed it to avoid confusion) anywhere from 0 up to 20k triangles get rendered, most of the time about 10-15k with the current example. Maybe you will not notice a big difference in FPS with the current mesh, which has only 39,600 triangles, but with larger meshes the difference is quite big.

If you think this is interesting, we could integrate it into OctTreeSceneNode as a second "filter", which would remove even more polys from the rendering pipeline and thus boost performance.

Code:

#include "stdafx.h"
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <cstring>  // memcpy
#include <irrlicht.h>

using namespace std;
 
using namespace irr;
 
using namespace core;
using namespace scene;
using namespace video;
using namespace io;
using namespace gui;
 
#pragma comment(lib, "Irrlicht.lib")
//#pragma comment(linker, "/SUBSYSTEM:windows /ENTRY:mainCRTStartup") 
 
int main(int argc, char *argv[])
{
	IrrlichtDevice *device = createDevice(video::EDT_DIRECT3D9, dimension2d<s32>(1024, 768), 32, true, true, false);
	if ( device == 0 ) return 1;
 
	IVideoDriver* driver = device->getVideoDriver();
	ISceneManager* smgr = device->getSceneManager();
	IGUIEnvironment* guienv = device->getGUIEnvironment();
 
	driver->setTextureCreationFlag(video::ETCF_ALWAYS_32_BIT, true);
 
	IGUIStaticText* fpstext = guienv->addStaticText(L"", rect<s32>(0,0,700,15), true, false, 0, -1, true);
 
	IGUIFont* font = guienv->getFont("fonthaettenschweiler.bmp");
	IGUISkin* skin = guienv->getSkin();
	if (font) skin->setFont(font);

 	scene::ICameraSceneNode* MMOCam = smgr->addCameraSceneNodeFPS(0,100,20,-1,0,0,0,100);
 
	MMOCam->setFarValue(1000000.0f);
	MMOCam->setNearValue(0.1f);
	MMOCam->setPosition(core::vector3df(0,120,0));
	MMOCam->setTarget(core::vector3df(0,100,100));


	scene::IAnimatedMesh* terrmesh1 = smgr->getMesh("sphere.obj");

	CMeshBuffer<S3DVertex>* buffer=(CMeshBuffer<S3DVertex>*)terrmesh1->getMeshBuffer(0); 
	printf("Vertex Count: %u\n",buffer->getVertexCount());
	printf("Index Count: %u\n\n",buffer->getIndexCount());
	printf("Total Triangles Count: %u\n\n",buffer->getIndexCount()/3);

	u16* indices = buffer->getIndices();
	void* vertices = buffer->getVertices();
	S3DVertex* vertex = (S3DVertex *) vertices;
	s32 indexc = buffer->getIndexCount();
	// Private copy of the original index list; buffer->Indices is rebuilt every frame.
	u16* indicesc = (u16* )malloc(sizeof(u16)*indexc);
	// One normal and one reference point per triangle, hence indexc/3 entries.
	vector3df* normals = (vector3df *)malloc(sizeof(vector3df)*(indexc/3));
	vector3df* vertpnt = (vector3df *)malloc(sizeof(vector3df)*(indexc/3));
	triangle3df poly;

	// Precompute, per triangle, the face normal (normals) and one point on the face (vertpnt).
	memcpy(indicesc, indices, sizeof(u16)*indexc);
	for(s32 i=0;i<indexc;i+=3) {
		poly.pointA = vertex[indicesc[i]].Pos;
		poly.pointB = vertex[indicesc[i+1]].Pos;
		poly.pointC = vertex[indicesc[i+2]].Pos;
		normals[i/3] = poly.getNormal().normalize();
		vertpnt[i/3] = vertex[indicesc[i]].Pos;
	}

	video::SMaterial material;
	material.Lighting=false;
	material.Wireframe=true;
	material.BackfaceCulling=false;

	int m=0,k=0;

	u32 t1,t2,dt=0;

/*	//uncomment this and comment out PerFaceCulling to test performance 
	scene::ISceneNode* terrnode1 = smgr->addAnimatedMeshSceneNode(terrmesh1);
	terrnode1->setMaterialFlag(video::EMF_NORMALIZE_NORMALS, true);
	terrnode1->setMaterialFlag(video::EMF_WIREFRAME, true);
	terrnode1->setMaterialFlag(video::EMF_BACK_FACE_CULLING, false);
	terrnode1->setPosition(core::vector3df(0,0,0));
	terrnode1->setMaterialFlag(video::EMF_LIGHTING, false);
	terrnode1->setVisible(false);
*/
	device->getCursorControl()->setVisible(false);
 
	while(device->run())
	{
		stringw str = L"FPS: ";str += driver->getFPS();
		str += " TRI:";str += driver->getPrimitiveCountDrawn();
		//comment out the following lines for better performance--start commenting
		str += " Total Triangles: ";str += m;
		str += " Visible Triangles: ";str += k;
		str += " Not Visible Triangles: ";str += m-k;
		str += " Index Count: ";str += k*3;
		str += " dt:";str += dt;
		//--end commenting
		fpstext->setText(str.c_str());
		m = k = 0;

		driver->beginScene(true, true, 0);

// PER FACE CULLING - START

				// Draws the index list built during the previous frame (all triangles on the first frame).
				driver->setMaterial(material);
				driver->setTransform(video::ETS_WORLD, core::matrix4());
				driver->drawMeshBuffer(buffer);

				t1 = device->getTimer()->getRealTime();
				buffer->Indices.erase(0, buffer->getIndexCount());
				for(s32 i=0;i<indexc;i+=3) {
					m++;
					if (normals[i/3].dotProduct(vertpnt[i/3]-MMOCam->getPosition())<0){
						// memcpy(&indices[i], &indicesc[i], 3) did not work: the size is in bytes, so three
						// u16 indices need 3*sizeof(u16), and the destination must be the next free slot, not i.
						buffer->Indices.push_back(indicesc[i]);
						buffer->Indices.push_back(indicesc[i+1]);
						buffer->Indices.push_back(indicesc[i+2]);
						k++;
					}
				}
				t2 = device->getTimer()->getRealTime();
				dt=t2-t1;

// PER FACE CULLING - END

			smgr->drawAll();
			guienv->drawAll();
		driver->endScene();

	}
	// Release the buffers allocated with malloc above.
	free(indicesc);
	free(normals);
	free(vertpnt);
	device->drop();
	return 0;
}
Here is the link to the precompiled example and the source:

http://irrlichtirc.g0dsoft.com/Ogami_It ... ulling.rar
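The facing test from the listing above can be isolated from Irrlicht. The following is a minimal sketch, not the posted implementation: `Vec3`, `faceNormals` and `cullFaces` are hypothetical names standing in for `vector3df` and the loop inside the render loop, and it assumes the same conventions as the listing (normal computed as (B - A) x (C - A), a triangle is kept when the dot product with the camera-to-face vector is negative).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for irr::core::vector3df (hypothetical, for illustration only).
struct Vec3 {
    float x, y, z;
};

inline Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Precompute one normal per triangle as (B - A) x (C - A).
// Only the sign of the later dot product matters, so the normals are left unnormalized.
std::vector<Vec3> faceNormals(const std::vector<Vec3>& pos,
                              const std::vector<std::uint16_t>& idx) {
    assert(idx.size() % 3 == 0);
    std::vector<Vec3> normals;
    normals.reserve(idx.size() / 3);
    for (std::size_t i = 0; i < idx.size(); i += 3) {
        const Vec3& a = pos[idx[i]];
        normals.push_back(cross(sub(pos[idx[i + 1]], a), sub(pos[idx[i + 2]], a)));
    }
    return normals;
}

// Rebuild `visible` with the indices of every triangle whose normal points
// toward the camera. Returns the number of triangles kept.
std::size_t cullFaces(const std::vector<Vec3>& pos,
                      const std::vector<std::uint16_t>& idx,
                      const std::vector<Vec3>& normals,
                      const Vec3& camera,
                      std::vector<std::uint16_t>& visible) {
    assert(normals.size() * 3 == idx.size());
    visible.clear();  // keeps the allocated capacity between frames
    for (std::size_t i = 0; i < idx.size(); i += 3) {
        const Vec3 toFace = sub(pos[idx[i]], camera);  // camera -> first vertex of the face
        if (dot(normals[i / 3], toFace) < 0.0f) {
            visible.insert(visible.end(), idx.begin() + i, idx.begin() + i + 3);
        }
    }
    return visible.size() / 3;
}
```

Reusing one `visible` vector across frames avoids reallocating the index list on every pass; whether that matters in practice depends on the container used for the real index buffer.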
rogerborg
Admin
Posts: 3590
Joined: Mon Oct 09, 2006 9:36 am
Location: Scotland - gonnae no slag aff mah Engleesh
Contact:

Post by rogerborg »

Have you profiled it on a SVN trunk build with VBOs? I'm wondering if culling the triangles in the app is quicker than letting the driver do it when VBOs are being used.
Please upload candidate patches to the tracker.
Need help now? IRC to #irrlicht on irc.freenode.net
How To Ask Questions The Smart Way
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

rogerborg wrote: Have you profiled it on a SVN trunk build with VBOs? I'm wondering if culling the triangles in the app is quicker than letting the driver do it when VBOs are being used.
I haven't tried VBOs yet because I get some compile errors with the SVN build, but yes, letting the driver do it would be faster. Does a method that does exactly this already exist?
rogerborg
Admin
Posts: 3590
Joined: Mon Oct 09, 2006 9:36 am
Location: Scotland - gonnae no slag aff mah Engleesh
Contact:

Post by rogerborg »

The driver does it. The real question is whether it's faster to do it in the app and send triangles to the driver, or faster to send indices to the driver and let it do the... word... what... CULLING itself.

Oops. SNAKE!!!! Very drunk at the time.
Halifax
Posts: 1424
Joined: Sun Apr 29, 2007 10:40 pm
Location: $9D95

Post by Halifax »

Honestly I didn't know that OpenGL/Direct3D did per face culling this early in the process. I thought they did winding order culling, or rendering triangles, and then choosing whether they should enter the pipeline.

But maybe I am wrong.

EDIT:

Actually, yes, I am right. OpenGL doesn't perform dot product face culling, but instead does screen space winding order culling which is further down the pipeline than it is when you just eliminate the polygon all-in-one.

So I don't know, this appears as though it would be useful.

EDIT2:

This is in fact useful, and has already been implemented as an OpenGL extension, and it does in fact provide for faster rendering, "In many circumstances, using this extension results in faster rendering, because it culls faces at an earlier stage of the rendering pipeline."

If you would like to use the extension, it is called GL_EXT_cull_vertex.

So this would be a beneficial fallback if the extension were to not be available.
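For comparison with the dot-product test, screen-space winding order culling can be sketched as a signed-area check on the projected triangle. This is only a conceptual illustration of the idea, not driver or hardware code; `Vec2`, `FrontFace`, `signedArea2` and `countFrontFacing` are hypothetical names, and the vertices are assumed to be already projected to 2D.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A vertex after projection to the screen (hypothetical type for illustration).
struct Vec2 {
    float x, y;
};

enum class FrontFace { CounterClockwise, Clockwise };

// Twice the signed area of triangle (a, b, c).
// Positive for counter-clockwise order when y points up; the sign flips when y points down.
inline float signedArea2(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// A triangle is back-facing when its winding is the opposite of the front-face convention.
// Zero-area triangles are treated as not back-facing in this sketch.
inline bool isBackFacing(const Vec2& a, const Vec2& b, const Vec2& c, FrontFace front) {
    const float area = signedArea2(a, b, c);
    return front == FrontFace::CounterClockwise ? area < 0.0f : area > 0.0f;
}

// Count the triangles of an indexed list that survive winding order culling.
std::size_t countFrontFacing(const std::vector<Vec2>& screen,
                             const std::vector<std::uint16_t>& idx,
                             FrontFace front) {
    assert(idx.size() % 3 == 0);
    std::size_t kept = 0;
    for (std::size_t i = 0; i < idx.size(); i += 3) {
        if (!isBackFacing(screen[idx[i]], screen[idx[i + 1]], screen[idx[i + 2]], front)) {
            ++kept;
        }
    }
    return kept;
}
```

The point of the comparison is where the test sits: this check needs projected vertices, so every vertex has already been transformed by the time a triangle is rejected, whereas the dot-product test works on untransformed data.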
TheQuestion = 2B || !2B
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Well, what is before the winding order culling? The major benefit of using the cull vertex extension is that it uses the user normal to cull vertices. So it neither needs to calculate the face normal or winding, and it can cull a shared vertex with just one operation thereby possibly removing several faces. So indeed this extension is really useful (and my eeePC does support it, so I want it :) ), but I'd really like to see whether the presented code does give any advantages over the standard techniques. Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

hybrid wrote:Well, what is before the winding order culling? The major benefit of using the cull vertex extension is that it uses the user normal to cull vertices. So it neither needs to calculate the face normal or winding, and it can cull a shared vertex with just one operation thereby possibly removing several faces. So indeed this extension is really useful (and my eeePC does support it, so I want it :) ), but I'd really like to see whether the presented code does give any advantages over the standard techniques. Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.
Yes, I will provide some info right now.

Sphere polys: 39,600

a) Hardware back face culling ENABLED: all 39,600 polys are rendered ALL the time, 30 FPS on my computer (NVIDIA 6600).

b) Per face culling ENABLED, hardware back face culling DISABLED: in the worst case about 20,000 polygons are rendered, on average anywhere from 5k to 15k. 70-140 FPS on my computer (NVIDIA 6600).

I updated release.rar with the screenshots I'm posting here and with two precompiled .exe files for testing performance: PerFaceCulling.exe and PerFaceCulling_HardwareBackFaceCulling.exe. Both load sphere.obj with 39,600 polys. The screenshots show two different angles, one at 98 FPS and one at 140 FPS.

The more polys the objects have, the bigger the benefit, which means you will not notice any big difference if you have a very fast GPU and very few polys. For example, with a sphere of just 9,800 polys, one guy from the IRC who has an overclocked NVIDIA 8600 saw only a change from 694 to 704 FPS.

About the CPU time required for the calculations: it depends on the CPU power and on the number of polys. On my PC (P4 @ 2.8 GHz) it takes only 2 milliseconds with about 40k polys, which means it will not become the bottleneck at framerates up to 500 FPS. The debug data (polys visible/invisible, required time) is all shown in the debug text of the app.
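The 500 FPS figure follows from 1000 ms / 2 ms per frame. Here is a small sketch of that arithmetic together with a finer-grained timer; `fpsCeiling` and `timeMs` are hypothetical helpers, and `std::chrono::steady_clock` is swapped in for the millisecond-resolution `getRealTime()` used in the listing, since a 2 ms measurement taken with a 1 ms clock is quite coarse.

```cpp
#include <cassert>
#include <chrono>

// Upper bound on the frame rate if the CPU-side culling pass alone
// takes `cullMs` milliseconds per frame and nothing else costs time.
inline double fpsCeiling(double cullMs) {
    assert(cullMs > 0.0);
    return 1000.0 / cullMs;
}

// Run `work` once and return the elapsed wall time in milliseconds,
// measured with a monotonic clock and kept as a fractional value.
template <typename F>
double timeMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the culling loop in `timeMs` and averaging over many frames would give a steadier number than a single frame's reading.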


Image

Image

Image

I think the best option would be to make this a second filter in the octree scene node and upload only the indices to the GPU while keeping the vertices static (VBO).
BlindSide
Admin
Posts: 2821
Joined: Thu Dec 08, 2005 9:09 am
Location: NZ!

Post by BlindSide »

That's quite an impressive improvement in framerate, but...

Did you only perform this test in wireframe mode? Because that would seriously affect the test environment. What kind of results do you get with normal shading, textures etc.?
ShadowMapping for Irrlicht!: Get it here
Need help? Come on the IRC!: #irrlicht on irc://irc.freenode.net
Nadro
Posts: 1648
Joined: Sun Feb 19, 2006 9:08 am
Location: Warsaw, Poland

Post by Nadro »

If the effect is also good in a test with normal shading and textures, this will be a very useful method :)
Library helping with network requests, tasks management, logger etc in desktop and mobile apps: https://github.com/GrupaPracuj/hermes
shadowslair
Posts: 758
Joined: Mon Mar 31, 2008 3:32 pm
Location: Bulgaria

Post by shadowslair »

This is interesting...

I've got an average of 209 FPS for the hardware executable and about 135 FPS for the other one. Test made on AMD k7 1,14 512DDR 128 Ati Radeon.

I'm not that fluent in culling methods; I'm just wondering whether it could give better performance in a real game, where the level is for example 30k polys and there are 15 character models with 2000 polys each (60k polys in total), without any additional optimisation? Keeping in mind some additional calcs of course... :wink:

Is it for OpenGL only? 'Cause I'm getting better results with DX9.

Anyway this is cool... :D
"Although we walk on the ground and step in the mud... our dreams and endeavors reach the immense skies..."
agi_shi
Posts: 122
Joined: Mon Feb 26, 2007 12:46 am

Post by agi_shi »

497 FPS with per face culling
594 FPS with native hardware vertex winding culling

There is no possible way that locking a mesh for thousands of dot products per frame and then re-uploading it to the GPU card is faster than native GPU winding order culling with over drawing. Why? Because these days vertex transformations come practically for free. You can render millions of polygons in one draw call, still in the hundreds of FPS.
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

BlindSide wrote:That's quite an impressive improvement in framerate, but...

Did you only perform this test in wireframe mode? Because that would seriously affect the test environment. What kind of results do you get with normal shading, textures etc.?
Yes, good question. I tested with a texture and one light. Here are the results:

a) BACKFACE CULLING (all 39,600 polys rendered): 30 FPS (1st screenshot)

b) Per face culling (5-15k polys rendered): 90-120 FPS (2nd and 3rd screenshots)



Image

Image

Image

I can't understand the weird results from shadowslair and agi_shi; maybe their graphics cards are so fast that 40k polys don't really matter to them.

@agi_shi: when you say "winding order culling", are you referring to backface culling, specifically this flag: node->setMaterialFlag(video::EMF_BACK_FACE_CULLING, true); ? As far as I understand, it is described at this link: http://msdn.microsoft.com/en-us/library ... S.85).aspx
P.S. What's your graphics card?
Mirror
Posts: 218
Joined: Sat Dec 01, 2007 4:09 pm

Post by Mirror »

Halifax wrote:Honestly I didn't know that OpenGL/Direct3D did per face culling this early in the process. I thought they did winding order culling, or rendering triangles, and then choosing whether they should enter the pipeline.

But maybe I am wrong.

EDIT:

Actually, yes, I am right. OpenGL doesn't perform dot product face culling, but instead does screen space winding order culling which is further down the pipeline than it is when you just eliminate the polygon all-in-one.

So I don't know, this appears as though it would be useful.

EDIT2:

This is in fact useful, and has already been implemented as an OpenGL extension, and it does in fact provide for faster rendering, "In many circumstances, using this extension results in faster rendering, because it culls faces at an earlier stage of the rendering pipeline."

If you would like to use the extension it is called, GL_EXT_cull_vertex.

So this would be a beneficial fallback if the extension were to not be available.
hybrid wrote:Well, what is before the winding order culling? The major benefit of using the cull vertex extension is that it uses the user normal to cull vertices. So it neither needs to calculate the face normal or winding, and it can cull a shared vertex with just one operation thereby possibly removing several faces. So indeed this extension is really useful (and my eeePC does support it, so I want it :) ), but I'd really like to see whether the presented code does give any advantages over the standard techniques. Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.
I read the paper link provided by Halifax and I also found another link for this extension.

http://personal.redestb.es/jmovill/open ... in-19.html

From what I read there, this OpenGL extension culls a polygon ONLY IF all 3 of its vertices are culled, which means it does 3 dot product calculations per polygon (triangle) instead of the 1 that I do per poly, so THEORETICALLY the extension should be slower :P (unless it's done in hardware).
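The per-vertex rule described above can be sketched as follows. This is an illustration of the rule as stated in this thread (a triangle is dropped only when all three of its vertices are culled), not the extension's actual implementation; `Vec3`, `classifyVertices` and `countKeptTriangles` are hypothetical names, and treating a dot product of exactly zero as culled is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for a 3D vector type (hypothetical, for illustration only).
struct Vec3 {
    float x, y, z;
};

inline Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One dot product per vertex: a vertex is marked culled when its normal
// does not point toward the eye.
std::vector<bool> classifyVertices(const std::vector<Vec3>& pos,
                                   const std::vector<Vec3>& vertexNormals,
                                   const Vec3& eye) {
    assert(pos.size() == vertexNormals.size());
    std::vector<bool> culled(pos.size());
    for (std::size_t i = 0; i < pos.size(); ++i) {
        culled[i] = dot(vertexNormals[i], sub(eye, pos[i])) <= 0.0f;
    }
    return culled;
}

// A triangle is dropped only when all three of its vertices are culled;
// this step is three flag lookups per triangle, with no further dot products.
std::size_t countKeptTriangles(const std::vector<bool>& culled,
                               const std::vector<std::uint16_t>& idx) {
    assert(idx.size() % 3 == 0);
    std::size_t kept = 0;
    for (std::size_t i = 0; i < idx.size(); i += 3) {
        const bool allCulled = culled[idx[i]] && culled[idx[i + 1]] && culled[idx[i + 2]];
        if (!allCulled) {
            ++kept;
        }
    }
    return kept;
}
```

Written this way the dot products are counted per vertex rather than three per triangle, which is hybrid's point about shared vertices: in an indexed mesh where triangles share vertices, the number of dot products can be lower than the number of triangles.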
BlindSide
Admin
Posts: 2821
Joined: Thu Dec 08, 2005 9:09 am
Location: NZ!

Post by BlindSide »

That's true, just because an extension is supported there's no guarantee that it's performed in hardware; a lot of extensions are just the equivalent of utility functions.

Cheers
PI
Posts: 176
Joined: Tue Oct 09, 2007 7:15 pm
Location: Hungary

Post by PI »

Hello guys,

Very interesting conversation. Reminds me of what I did back in the day with Revolution3D. I was storing a face normal for each face, and by adding the face normal vector to the camera view normal vector I could decide which faces to draw and which to cull, based on the length of the result. It wasn't bad, but I found that hardware backface culling does the trick faster than computing it on the CPU.
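One way to read the sum-length idea: for unit vectors n and v, |n + v|^2 = 2 + 2 (n . v), so the sum is shorter than sqrt(2) exactly when the dot product is negative. The sketch below assumes that reading and assumes both vectors are normalized; `Vec3`, `facesCameraBySumLength` and `facesCameraByDot` are hypothetical names, and this is not the Revolution3D code.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for a 3D vector type (hypothetical, for illustration only).
struct Vec3 {
    float x, y, z;
};

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline bool isUnit(const Vec3& v) { return std::fabs(dot(v, v) - 1.0f) < 1e-4f; }

// Sum-length form: a face is treated as facing the camera when the squared
// length of (normal + viewDir) is below 2, i.e. the sum is shorter than sqrt(2).
inline bool facesCameraBySumLength(const Vec3& n, const Vec3& viewDir) {
    assert(isUnit(n) && isUnit(viewDir));
    const Vec3 s{n.x + viewDir.x, n.y + viewDir.y, n.z + viewDir.z};
    return dot(s, s) < 2.0f;
}

// Equivalent dot-product form, since |n + v|^2 = 2 + 2 (n . v) for unit vectors.
inline bool facesCameraByDot(const Vec3& n, const Vec3& viewDir) {
    assert(isUnit(n) && isUnit(viewDir));
    return dot(n, viewDir) < 0.0f;
}
```

Note that a single view direction for the whole mesh is only an approximation under perspective projection; the per-face test earlier in the thread uses the vector from the camera position to each face instead.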

Anyway, I have an ATI Radeon HD 2600 Pro, and an Intel Dual Core processor, here are my test results:

Per face culling:
At the startup screen, 365 FPS, 7400 tris rendered.
Inside the sphere, 607 FPS, 0 tris rendered.
No matter how far from the sphere, but facing it, not going under 320 FPS.
Looking away from it - as I suspect - you're still computing, so it's still 320 FPS.

Hardware backface culling:
At the startup screen, 377 FPS, 39600 tris rendered.
Inside the sphere, 378 FPS, 39600 tris rendered.
No matter how far from the sphere, but facing it, not going under 320 FPS.
However, looking away from it, the FPS goes up to 1700.

Cheers,
PI

P.S. Guys, do you know if there's any chance that Hardware Occlusion Culling will be added to Irrlicht? And, by the way, what do you think about it?