Per Face Culling

Posted: Fri Jun 13, 2008 7:38 pm
by Mirror
Hi. This is my first attempt (or rather second) at posting some code of mine, so don't be too harsh on me, I'm still a n00b inside!

OK, in this app I do some polygon (triangle) culling based on whether the triangles are facing the camera or not; the triangles that are not visible are culled. I have included in the file a sphere with 39600 polys. With standard rendering all of them are rendered, but with per face culling you will get anywhere from 0 up to 20k triangles rendered, and most of the time about 10-15k with the current example. (I used to call it "backface culling", but I renamed it to avoid confusion.) Maybe you will not notice a big difference in FPS with the current mesh, which has only 39600 triangles, but with larger meshes the difference is quite big.

If you think this is interesting, we could integrate it into OctTreeSceneNode as a second "filter" that removes even more polys from the rendering pipeline and thus boosts performance.

Code:

#include "stdafx.h"
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <cstring>  // memcpy
#include <irrlicht.h>

using namespace std;
 
using namespace irr;
 
using namespace core;
using namespace scene;
using namespace video;
using namespace io;
using namespace gui;
 
#pragma comment(lib, "Irrlicht.lib")
//#pragma comment(linker, "/SUBSYSTEM:windows /ENTRY:mainCRTStartup") 
 
int main(int argc, char *argv[])
{
	IrrlichtDevice *device = createDevice(video::EDT_DIRECT3D9, dimension2d<s32>(1024, 768), 32, true, true, false);
	if ( device == 0 ) return 1;
 
	IVideoDriver* driver = device->getVideoDriver();
	ISceneManager* smgr = device->getSceneManager();
	IGUIEnvironment* guienv = device->getGUIEnvironment();
 
	driver->setTextureCreationFlag(video::ETCF_ALWAYS_32_BIT, true);
 
	IGUIStaticText* fpstext = guienv->addStaticText(L"", rect<s32>(0,0,700,15), true, false, 0, -1, true);
 
	IGUIFont* font = guienv->getFont("fonthaettenschweiler.bmp");
	IGUISkin* skin = guienv->getSkin();
	if (font) skin->setFont(font);

 	scene::ICameraSceneNode* MMOCam = smgr->addCameraSceneNodeFPS(0,100,20,-1,0,0,0,100);
 
	MMOCam->setFarValue(1000000.0f);
	MMOCam->setNearValue(0.1f);
	MMOCam->setPosition(core::vector3df(0,120,0));
	MMOCam->setTarget(core::vector3df(0,100,100));


	scene::IAnimatedMesh* terrmesh1 = smgr->getMesh("sphere.obj");

	CMeshBuffer<S3DVertex>* buffer=(CMeshBuffer<S3DVertex>*)terrmesh1->getMeshBuffer(0); 
	printf("Vertex Count: %u\n",buffer->getVertexCount());
	printf("Index Count: %u\n\n",buffer->getIndexCount());
	printf("Total Triangles Count: %u\n\n",buffer->getIndexCount()/3);

	u16* indices = buffer->getIndices();
	void* vertices = buffer->getVertices();
	S3DVertex* vertex = (S3DVertex *) vertices;
	s32 indexc = buffer->getIndexCount();
	u16* indicesc = (u16* )malloc(sizeof(short int)*buffer->getIndexCount());
	vector3df* normals = (vector3df *)malloc(sizeof(vector3df)*buffer->getIndexCount());
	vector3df* vertpnt = (vector3df *)malloc(sizeof(vector3df)*buffer->getIndexCount());
	triangle3df poly;

	// Precompute, per triangle, the face normal (normals) and one point on the face (vertpnt)
	memcpy(indicesc, indices, sizeof(short int)*buffer->getIndexCount());
	for(s32 i=0;i<indexc;i+=3) {
		poly.pointA = vertex[indicesc[i]].Pos;
		poly.pointB = vertex[indicesc[i+1]].Pos;
		poly.pointC = vertex[indicesc[i+2]].Pos;
		normals[i/3] = poly.getNormal().normalize();
		vertpnt[i/3] = vertex[indicesc[i]].Pos;
	}

	video::SMaterial material;
	material.Lighting=false;
	material.Wireframe=true;
	material.BackfaceCulling=false;

	int m=0,k=0;

	u32 t1,t2,dt=0;

/*	//uncomment this and comment out PerFaceCulling to test performance 
	scene::ISceneNode* terrnode1 = smgr->addAnimatedMeshSceneNode(terrmesh1);
	terrnode1->setMaterialFlag(video::EMF_NORMALIZE_NORMALS, true);
	terrnode1->setMaterialFlag(video::EMF_WIREFRAME, true);
	terrnode1->setMaterialFlag(video::EMF_BACK_FACE_CULLING, false);
	terrnode1->setPosition(core::vector3df(0,0,0));
	terrnode1->setMaterialFlag(video::EMF_LIGHTING, false);
	terrnode1->setVisible(false);
*/
	device->getCursorControl()->setVisible(false);
 
	while(device->run())
	{
		stringw str = L"FPS: ";str += driver->getFPS();
		str += " TRI:";str += driver->getPrimitiveCountDrawn();
		//comment out the following lines for better performance--start commenting
		str += " Total Triangles: ";str += m;
		str += " Visible Triangles: ";str += k;
		str += " Not Visible Triangles: ";str += m-k;
		str += " Index Count: ";str += k*3;
		str += " dt:";str += dt;
		//--end commenting
		fpstext->setText(str.c_str());
		m = k = 0;

		driver->beginScene(true, true, 0);

// PER FACE CULLING - START

				driver->setMaterial(material);
				driver->setTransform(video::ETS_WORLD, core::matrix4());
				driver->drawMeshBuffer(buffer);

				t1 = device->getTimer()->getRealTime();
				buffer->Indices.erase(0, buffer->getIndexCount());
				for(s32 i=0;i<indexc;i+=3) {
					m++;
					if (normals[i/3].dotProduct(vertpnt[i/3]-MMOCam->getPosition())<0){
						// A memcpy with a size of 3 copies only 3 bytes; three u16 indices need 3*sizeof(u16).
						// The visible indices must also end up packed back to back, so push_back is used instead.
						buffer->Indices.push_back(indicesc[i]);
						buffer->Indices.push_back(indicesc[i+1]);
						buffer->Indices.push_back(indicesc[i+2]);
						k++;
					}
				}
				t2 = device->getTimer()->getRealTime();
				dt=t2-t1;

// PER FACE CULLING - END

			smgr->drawAll();
			guienv->drawAll();
		driver->endScene();

	}
	free(indicesc);
	free(normals);
	free(vertpnt);
	device->drop();
	return 0;
}
Here is the link to the precompiled example and the source:

http://irrlichtirc.g0dsoft.com/Ogami_It ... ulling.rar
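The visibility test in the loop above comes down to one dot product per triangle. Here is a minimal standalone sketch of just that test, with a tiny `Vec3` struct standing in for `vector3df`. The names `Vec3`, `faceNormal` and `facesCamera` are made up for illustration, and which side counts as the "front" depends on the winding convention of your mesh, so the sign of the comparison may need flipping.

```cpp
// Minimal 3D vector; a stand-in for irr::core::vector3df (assumption for this sketch).
struct Vec3 {
    float x, y, z;
};

inline Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Unnormalized face normal; its direction depends on the vertex order (a, b, c).
inline Vec3 faceNormal(const Vec3& a, const Vec3& b, const Vec3& c) {
    return cross(b - a, c - a);
}

// True when the normal points back towards the camera, i.e. the vector
// from the camera to a point on the face opposes the face normal.
inline bool facesCamera(const Vec3& normal, const Vec3& pointOnFace, const Vec3& cameraPos) {
    return dot(normal, pointOnFace - cameraPos) < 0.0f;
}
```

Only the sign of the dot product matters here, so the normal does not have to be normalized for this test.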

Posted: Fri Jun 13, 2008 9:47 pm
by rogerborg
Have you profiled it on an SVN trunk build with VBOs? I'm wondering if culling the triangles in the app is quicker than letting the driver do it when VBOs are being used.

Posted: Fri Jun 13, 2008 10:15 pm
by Mirror
I haven't tried VBOs yet because I get some compile errors with the SVN version, but yes, letting the driver do it would be faster. Does a method that does exactly this already exist?

Posted: Fri Jun 13, 2008 10:42 pm
by rogerborg
The driver does it already. The question is whether it's faster to do it in the app and send triangles to the driver, or faster to send indices to the driver and let it do the... word... what... CULLING itself.

Oops. SNAKE!!!! Very drunk at the time.

Posted: Fri Jun 13, 2008 10:56 pm
by Halifax
Honestly I didn't know that OpenGL/Direct3D did per face culling this early in the process. I thought they did winding order culling, or rendering triangles, and then choosing whether they should enter the pipeline.

But maybe I am wrong.

EDIT:

Actually, yes, I am right. OpenGL doesn't perform dot product face culling, but instead does screen space winding order culling which is further down the pipeline than it is when you just eliminate the polygon all-in-one.

So I don't know, this appears as though it would be useful.

EDIT2:

This is in fact useful, and has already been implemented as an OpenGL extension, and it does in fact provide for faster rendering, "In many circumstances, using this extension results in faster rendering, because it culls faces at an earlier stage of the rendering pipeline."

If you would like to use the extension, it is called GL_EXT_cull_vertex.

So this would be a beneficial fallback if the extension is not available.
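For comparison with the dot product approach, here is a rough sketch of what a screen space winding order test looks like once the three vertices have been projected to 2D. This is only an illustration: the real test runs inside the driver or hardware, and `Vec2`, `signedArea2`, `FrontFace` and `isBackFacing` are names invented for this example.

```cpp
// 2D point after projection to screen space (assumption: y axis points up).
struct Vec2 {
    float x, y;
};

// Twice the signed area of triangle (a, b, c).
// Positive when the vertices run counter-clockwise, negative when clockwise.
inline float signedArea2(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Which on-screen vertex order is treated as the front side.
enum class FrontFace { CounterClockwise, Clockwise };

// A triangle is back-facing when its on-screen winding is the opposite
// of the configured front-face winding. Zero-area triangles are not
// reported as back-facing by this sketch.
inline bool isBackFacing(const Vec2& a, const Vec2& b, const Vec2& c, FrontFace front) {
    const float area = signedArea2(a, b, c);
    return front == FrontFace::CounterClockwise ? area < 0.0f : area > 0.0f;
}
```

The point to notice is that this test needs the projected positions, so the vertices have already been transformed by the time it runs, whereas the dot product test works on untransformed data.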

Posted: Sat Jun 14, 2008 9:47 am
by hybrid
Well, what is before the winding order culling? The major benefit of using the cull vertex extension is that it uses the user normal to cull vertices. So it neither needs to calculate the face normal or winding, and it can cull a shared vertex with just one operation thereby possibly removing several faces. So indeed this extension is really useful (and my eeePC does support it, so I want it :) ), but I'd really like to see whether the presented code does give any advantages over the standard techniques. Remember that it only seems to be useful if the app is bandwidth limited from CPU to GPU, or if it is really poly limited. And whether the CPU has enough power left to really give some benefits here, or will just stall the GPU instead, is still open.

Posted: Sat Jun 14, 2008 12:45 pm
by Mirror
Yes, I will provide some info right now.

Sphere polys: 39,600

a) Hardware back face culling ENABLED: all 39,600 polys are rendered ALL the time, 30 FPS on my computer (nvidia 6600).

b) Per face culling ENABLED, hardware back face culling DISABLED: in the worst case about 20,000 polygons are rendered, on average anywhere from 5k to 15k. 70-140 FPS on my computer (nvidia 6600).

I updated release.rar with the screenshots I'm posting here and with 2 precompiled .exe files for testing performance: PerFaceCulling.exe and PerFaceCulling_HardwareBackFaceCulling.exe. They both load sphere.obj with 39,600 polys. In the screenshots you can see 2 different angles, one with 98 FPS and one with 140 FPS.

The more polys the objects have, the bigger the benefit, which means you will not notice any big difference if you have a super GPU and very few polys. For example, with a sphere of just 9800 polys, one guy from IRC who has an overclocked nvidia 8600 saw a difference from 694 to 704 FPS.

About the time the CPU needs for the calculations: it depends on the CPU power and on the number of polys. On my PC (P4 @ 2.8 GHz) it takes only 2 milliseconds with about 40k polys, which means it will not be a bottleneck for framerates up to 500 FPS. The debug data (polys visible/invisible, required time) is all present in the debug text of the app.
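The 500 FPS figure follows from 1000 ms divided by the 2 ms spent culling. A small sketch of that arithmetic is below; the second function additionally assumes that culling and the rest of the frame run one after the other on the same thread, which is an assumption of this example and not something measured in the thread. Both function names are made up.

```cpp
// Upper bound on the frame rate if the culling loop alone took cullMs per frame.
inline double maxFpsFromCullTime(double cullMs) {
    return 1000.0 / cullMs;
}

// Frame rate when culling (cullMs) and all other per-frame work (restMs)
// are assumed to run back to back, with no overlap between CPU and GPU.
inline double serialFps(double cullMs, double restMs) {
    return 1000.0 / (cullMs + restMs);
}
```

For example, `maxFpsFromCullTime(2.0)` gives 500, while a frame with 2 ms of culling plus 8 ms of other work comes out at 100 FPS under the serial assumption.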


[Images: three screenshots]

I think the best option would be to make this a second filter in the octree scene node, uploading only the indices to the GPU while keeping the vertices static (VBO).
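As a rough idea of what "only the indices change" could look like on the CPU side, here is a sketch that packs the indices of the visible triangles into a reusable output array. It uses `std::vector` instead of Irrlicht's containers, and `compactVisibleIndices` is a hypothetical helper, not an existing Irrlicht function. How the result gets uploaded to the GPU is left out on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 {
    float x, y, z;
};

inline Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Copies the indices of camera-facing triangles from `source` into `out`,
// packed back to back, and returns the number of indices written.
// `normals` and `facePoints` hold one precomputed entry per triangle.
std::size_t compactVisibleIndices(const std::vector<std::uint16_t>& source,
                                  const std::vector<Vec3>& normals,
                                  const std::vector<Vec3>& facePoints,
                                  const Vec3& cameraPos,
                                  std::vector<std::uint16_t>& out)
{
    assert(source.size() % 3 == 0);
    assert(normals.size() == source.size() / 3);
    assert(facePoints.size() == normals.size());

    out.clear();                  // keeps the capacity from earlier frames
    out.reserve(source.size());   // worst case: every triangle is visible

    for (std::size_t tri = 0; tri < normals.size(); ++tri) {
        if (dot(normals[tri], facePoints[tri] - cameraPos) < 0.0f) {
            out.push_back(source[3 * tri]);
            out.push_back(source[3 * tri + 1]);
            out.push_back(source[3 * tri + 2]);
        }
    }
    return out.size();
}
```

Reusing the same `out` vector every frame means no allocation happens after the first call, since `clear()` does not release the storage.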

Posted: Sat Jun 14, 2008 2:37 pm
by BlindSide
That's quite an impressive improvement in framerate, but...

Did you only perform this test in wireframe mode? Because that would seriously affect the test environment. What kind of results do you get with normal shading, textures, etc.?

Posted: Sat Jun 14, 2008 5:09 pm
by Nadro
If the results are also good in a test with normal shading and textures, this will be a very useful method :)

Posted: Sat Jun 14, 2008 5:45 pm
by shadowslair
This is interesting...

I've got an average of 209 FPS for the hardware version and about 135 FPS in the other executable. Test made on AMD k7 1,14 512DDR 128 Ati Radeon.

I'm not that fluent in culling methods; I'm just wondering if it could give better performance in a real game where the level is, for example, 30k polys and there are 15 character models with 2000 polys each (60k polys in total), without any additional optimisation? Keeping in mind some additional calcs of course... ;)

Is it for OpenGL only, 'cause I'm getting better results with DX9?

Anyway this is cool... :D

Posted: Sat Jun 14, 2008 6:11 pm
by agi_shi
497 FPS with per face culling
594 FPS with native hardware vertex winding culling

There is no possible way that locking a mesh for thousands of dot products per frame and then re-uploading it to the GPU card is faster than native GPU winding order culling with over drawing. Why? Because these days vertex transformations come practically for free. You can render millions of polygons in one draw call, still in the hundreds of FPS.

Posted: Sat Jun 14, 2008 7:32 pm
by Mirror
BlindSide wrote: That's quite an impressive improvement in framerate, but...

Did you only perform this test in wireframe mode? Because that would seriously affect the test environment. What kind of results do you get with normal shading, textures, etc.?
Yes, good question. I tested with a texture and one light. Here are the results:

a) BACKFACE CULLING (all 39,600 polys rendered): 30 FPS (1st screenshot)

b) Per face culling (5-15k polys rendered): 90-120 FPS (2nd and 3rd screenshots)



[Images: three screenshots]

I can't understand the weird results from shadowslair and agi_shi. Maybe their graphics cards are so fast that 40k polys don't actually matter to them.

@agi_shi: when you say "winding order culling", are you referring to backface culling? Specifically this flag: node->setMaterialFlag(video::EMF_BACK_FACE_CULLING, true); ? As far as I understand, it is described at this link: http://msdn.microsoft.com/en-us/library ... S.85).aspx
P.S. What's your graphics card?

Posted: Sat Jun 14, 2008 7:55 pm
by Mirror
I read the paper linked by Halifax and I also found another link for this extension:

http://personal.redestb.es/jmovill/open ... in-19.html

From what I read there, this OpenGL extension culls a polygon ONLY IF all 3 vertices of the polygon are culled, which means it does 3 dot products for each polygon (triangle) instead of the 1 that I do per poly, so THEORETICALLY the extension should be slower :P (unless it's done in hardware).
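To make the comparison concrete, here is a sketch of the per-vertex scheme as it is described in this thread: each vertex is flagged using its own normal, and a triangle is dropped only when all three of its vertices are flagged. This follows the description above, not the extension's specification, and the function names are invented. Note that the flags are computed once per vertex, so in an indexed mesh where vertices are shared between triangles the number of dot products is the vertex count, which can be lower than three per triangle (hybrid's point about shared vertices).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 {
    float x, y, z;
};

inline Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// One dot product per vertex: a vertex is flagged as culled when its
// normal does not point back towards the eye position.
std::vector<bool> flagCulledVertices(const std::vector<Vec3>& positions,
                                     const std::vector<Vec3>& normals,
                                     const Vec3& eye)
{
    assert(positions.size() == normals.size());
    std::vector<bool> culled(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        culled[i] = dot(normals[i], positions[i] - eye) >= 0.0f;
    }
    return culled;
}

// Keeps every triangle that has at least one vertex that is not culled.
std::vector<std::uint16_t> dropFullyCulledTriangles(const std::vector<std::uint16_t>& indices,
                                                    const std::vector<bool>& culled)
{
    assert(indices.size() % 3 == 0);
    std::vector<std::uint16_t> kept;
    kept.reserve(indices.size());
    for (std::size_t i = 0; i < indices.size(); i += 3) {
        const bool allCulled = culled[indices[i]] &&
                               culled[indices[i + 1]] &&
                               culled[indices[i + 2]];
        if (!allCulled) {
            kept.push_back(indices[i]);
            kept.push_back(indices[i + 1]);
            kept.push_back(indices[i + 2]);
        }
    }
    return kept;
}
```

Whether this ends up faster or slower than one dot product per face depends on how many vertices are shared and on where the work is done, so it would have to be measured.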

Posted: Sun Jun 15, 2008 4:20 am
by BlindSide
That's true. Just because an extension is supported, there's no guarantee that it's performed in hardware; a lot of extensions are just the equivalent of utility functions.

Cheers

Posted: Sun Jun 15, 2008 11:07 am
by PI
Hello guys,

Very interesting conversation. It reminds me of what I did back in the day with Revolution3D. I was storing the face normal for each face, and by adding the face normal vector to the camera view normal vector I could decide which faces to draw and which to cull. It was based on the length of the result. It wasn't bad, but I found that hardware backface culling does the trick faster than computing on the CPU.
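One way to read the "add the two vectors and look at the length" idea: for unit vectors n and v, the squared length of n + v equals 2 + 2(n·v), so comparing it against 2 gives the same answer as checking the sign of the dot product. The sketch below shows that equivalence. The threshold of 2 and the function names are assumptions for this example, since the post does not say which threshold was actually used.

```cpp
struct Vec3 {
    float x, y, z;
};

inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline float lengthSquared(const Vec3& v) {
    return dot(v, v);
}

// Sum-based test: both inputs must be unit length.
// |n + v|^2 = 2 + 2 * dot(n, v), so a value below 2 means dot(n, v) < 0.
inline bool facesViewerBySum(const Vec3& unitNormal, const Vec3& unitViewDir) {
    return lengthSquared(unitNormal + unitViewDir) < 2.0f;
}

// Dot-based test: the face is turned towards the viewer when its normal
// opposes the view direction.
inline bool facesViewerByDot(const Vec3& unitNormal, const Vec3& unitViewDir) {
    return dot(unitNormal, unitViewDir) < 0.0f;
}
```

The dot product version needs no vector addition and does not require unit-length inputs for the sign check, which is probably why it is the more common form.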

Anyway, I have an ATI Radeon HD 2600 Pro, and an Intel Dual Core processor, here are my test results:

Per face culling:
At the startup screen, 365 FPS, 7400 tris rendered.
Inside the sphere, 607 FPS, 0 tris rendered.
No matter how far from the sphere, but facing it, not going under 320 FPS.
Looking away from it - as I suspect - you're still computing, so it's still 320 FPS.

Hardware backface culling:
At the startup screen, 377 FPS, 39600 tris rendered.
Inside the sphere, 378 FPS, 39600 tris rendered.
No matter how far from the sphere, but facing it, not going under 320 FPS.
However, looking away from it, the FPS goes up to 1700.

Cheers,
PI

P.S. Guys, do you know if there's any chance that Hardware Occlusion Culling will be added to Irrlicht? And, by the way, what do you think about it?