Call overhead stats

fmx · Post by **fmx** » Fri Sep 07, 2012 8:04 pm

The uniform cache and these misc material state optimisations should REALLY help to boost performance on mobile platforms.
Irrlicht has far too much happening under the hood at the moment, which is why I opted for my own rendering engine for commercial iOS projects.
I'd love to use Irrlicht more though, can't wait for the next couple of Irrlicht versions to come out.

Keep up the great work guys

hendu · Post by **hendu** » Sat Sep 08, 2012 7:48 am

You also get big fps advances by optimizing your shaders with Mesa's GLSL optimizer, as most mobile drivers suck. Wouldn't surprise me in the least if state setting was done badly there, too.

hendu · Post by **hendu** » Sat Sep 08, 2012 10:13 am

Doing that change didn't seem to break anything, as long as the correct lastMaterial was also passed (the current call passes the current material as both the current and the old material, obviously breaking the comparison). Perhaps even material for 1.8?

Patch: https://sourceforge.net/tracker/?func=d ... tid=540678

Will remeasure the call overhead.
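To illustrate why the lastMaterial argument matters, here is a minimal sketch. The three-field `Material` struct and the `applyMaterial` helper are hypothetical and much simpler than Irrlicht's real driver code; the function only counts how many state calls a material switch would issue.

```cpp
// Hypothetical, heavily simplified material: three independent render states.
struct Material {
    bool lighting;
    bool zWrite;
    int  textureId;
};

// Returns how many (pretend) GL state calls a material switch would issue.
// `resetAll` forces every state to be re-sent regardless of `last`.
int applyMaterial(const Material& current, const Material& last, bool resetAll) {
    int calls = 0;
    if (resetAll || current.lighting != last.lighting)   ++calls; // would toggle lighting
    if (resetAll || current.zWrite != last.zWrite)       ++calls; // would set the depth mask
    if (resetAll || current.textureId != last.textureId) ++calls; // would bind a texture
    return calls;
}
```

With a full reset every state is re-sent. With the real previous material only the changed state is sent. If the current material is passed as both arguments and the reset is dropped, nothing is sent at all, so a genuine change would be missed.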

hendu · Post by **hendu** » Sat Sep 08, 2012 10:23 am

Total number of calls in one frame: 17557 (~17.5k!)
Average call took: 0.260 usec
Total CPU time spent in these calls: 4559 usec (~4.5ms)

----

All calls, without draw calls took: 2598.2 usec (~2.6ms)

----

Call count: http://pastebin.com/RjirLzDR

----

So in total, removing that unnecessary state reset saved 2.5k gl calls, totalling 0.5ms savings.

hybrid · Post by **hybrid** » Sat Sep 08, 2012 10:37 am

Yeah, sounds like a bug then, or at least a very suboptimal implementation. I'll keep this thread here, though, for general discussions of these optimizations. The patch will be considered from the tracker anyway, so we don't need another thread in the bug forum.

hendu · Post by **hendu** » Sat Sep 08, 2012 11:54 am

Added a nice uniform cache. It removed 215 calls to GetUniformLocation, that is, all of them for this late frame. The total calls dropped to 17342 for the frame.

It has no collisions with my app, where the largest shader has 14 uniforms. Will see how easily it patches to svn trunk.
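The patch itself is not shown in the thread, so the following is only a guess at the general shape of such a cache: a small fixed-size table indexed by a hash of the uniform name. The class name, the slot count, the FNV-1a hash and the injected `lookup` callback (standing in for the real GetUniformLocation call) are all assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical hash-indexed cache of uniform locations.
class UniformLocationCache {
public:
    // `lookup` stands in for the driver query (e.g. glGetUniformLocation).
    explicit UniformLocationCache(std::function<int(const char*)> lookup)
        : lookup_(std::move(lookup)) {}

    int get(const char* name) {
        Slot& slot = slots_[hash(name) % kSlots];
        if (slot.used && slot.name == name)
            return slot.location;          // cache hit, no driver call
        // Miss, or a different name hashed to this slot: query and overwrite.
        slot.name = name;
        slot.location = lookup_(name);
        slot.used = true;
        return slot.location;
    }

private:
    struct Slot {
        std::string name;
        int location = -1;
        bool used = false;
    };

    // FNV-1a, chosen here only because it is short.
    static std::uint32_t hash(const char* s) {
        std::uint32_t h = 2166136261u;
        for (; *s; ++s) {
            h ^= static_cast<unsigned char>(*s);
            h *= 16777619u;
        }
        return h;
    }

    static constexpr std::size_t kSlots = 64;
    Slot slots_[kSlots];
    std::function<int(const char*)> lookup_;
};
```

In this sketch two names that hash to the same slot would keep evicting each other, which is why a collision-free set of uniform names matters for the hit rate.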

hendu · Post by **hendu** » Sat Sep 08, 2012 12:08 pm

https://sourceforge.net/tracker/?func=d ... tid=540678

Patch.

hendu · Post by **hendu** » Sat Sep 08, 2012 12:37 pm

Backporting the SetWrapMode skip check removed another 1466 calls. This one was already in trunk.

hendu · Post by **hendu** » Sat Sep 08, 2012 1:05 pm

Backported another active texture check in setbasicrendermodes, a couple k calls gone.

Then for active matters, I had 249 calls to glDepthMask. So I added tracking for that, now it's 2 - a drop of 247 calls.
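The depth mask tracking can be sketched roughly as follows. `DepthMaskCache` is a hypothetical name, and the `setter` callback stands in for the real glDepthMask call so that the example runs without an OpenGL context.

```cpp
#include <functional>
#include <utility>

// Hypothetical tracker for the depth-write state.
class DepthMaskCache {
public:
    // `setter` represents the actual driver call (e.g. glDepthMask).
    explicit DepthMaskCache(std::function<void(bool)> setter)
        : setter_(std::move(setter)) {}

    // Forwards to the driver only when the requested value differs from the
    // last value sent, or when nothing has been sent yet.
    void set(bool enabled) {
        if (known_ && last_ == enabled)
            return;                 // redundant call, skip it
        setter_(enabled);
        last_ = enabled;
        known_ = true;
    }

private:
    std::function<void(bool)> setter_;
    bool last_ = false;
    bool known_ = false;
};
```

Five requests in the pattern true, true, true, false, false reach the driver only twice here, which is the same kind of reduction as the drop from 249 calls to 2.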

Current gl call count is at 11183, 56% of where we started from two days ago.

---

CPU overhead currently stands at 3.9ms total, 2.0ms without draw calls. About one ms shaved off of each.

hendu · Post by **hendu** » Sat Sep 08, 2012 1:13 pm

Depth mask patch:
https://sourceforge.net/tracker/?func=d ... tid=540678

Nadro · Post by **Nadro** » Sat Sep 08, 2012 1:55 pm

For the uniform cache, a better solution is to add an integer value to SUniformInfo, called e.g. CachedLocation. The initial value of CachedLocation would be -1. In setPixelShaderConstant we would just use it like this:

```cpp
GLint Location = UniformInfo[i].CachedLocation;

if (Location < 0)
{
    if (Program2)
        Location = Driver->extGlGetUniformLocation(Program2,name);
    else
        Location = Driver->extGlGetUniformLocationARB(Program,name);

    UniformInfo[i].CachedLocation = Location;
}
```

With this implementation the interface stays cleaner, and it is also the faster solution, because we access the value directly rather than iterating over a const char*.

Thanks for all patches and interest in the subject.

hendu · Post by **hendu** » Sat Sep 08, 2012 2:42 pm

You're right there, let's do it that way.

edit: Patch updated.

Though, now that I took a look at the uniforminfo, that is a linear search. For every uniform update, there are up to N string compares, where N is how many uniforms you have.

Should probably be replaced with a map (RB tree). Or sorted after we know that all uniforms have been added there.

hendu · Post by **hendu** » Sat Sep 08, 2012 3:14 pm

Updated patch with sort + binary search. Yay SVN and no way to do incremental patches.
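A minimal sketch of the sort + binary search idea, using the standard library instead of Irrlicht's own containers. `UniformEntry`, `sortUniforms` and `findUniform` are hypothetical names for this example and are not taken from the patch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the uniform info list.
struct UniformEntry {
    std::string name;
    int location;
};

// Sort once by name, after all uniforms of a program have been collected.
void sortUniforms(std::vector<UniformEntry>& uniforms) {
    std::sort(uniforms.begin(), uniforms.end(),
              [](const UniformEntry& a, const UniformEntry& b) {
                  return a.name < b.name;
              });
}

// Binary search by name: O(log N) string compares instead of up to N.
// Returns the index into `uniforms`, or -1 if the name is not present.
int findUniform(const std::vector<UniformEntry>& uniforms, const char* name) {
    auto it = std::lower_bound(uniforms.begin(), uniforms.end(), name,
                               [](const UniformEntry& e, const char* n) {
                                   return e.name.compare(n) < 0;
                               });
    if (it == uniforms.end() || it->name != name)
        return -1;
    return static_cast<int>(it - uniforms.begin());
}
```

The sort has to happen after the last uniform is added, since inserting into the list afterwards would break the ordering the binary search relies on.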

Nadro · Post by **Nadro** » Sat Sep 08, 2012 8:38 pm

Thanks. I also have an idea to extend the cache system: eliminate the search for constant values (many values like texture ID, light colors etc. are constant for one shader and don't change in subsequent frames):

```cpp
const s32 i = UniformInfo.binary_search(target);
```

I think that we can extend IMaterialRendererServices by e.g.:

```cpp
/* This function removes the uniform from the UniformInfo list and moves it to
   CachedUniformInfo. Each CachedUniformInfo element also stores the
   float/int/bool value assigned to that uniform. In the
   COpenGLSLMaterialRenderer::OnRender method all CachedUniformInfo elements
   should be uploaded to the GPU. */
virtual bool addCachedPixelShaderConstant(const c8* name, const s32* ints, int count) = 0;
```

To use addCachedPixelShaderConstant efficiently we should also extend IShaderConstantSetCallBack, e.g. by:

```cpp
/* This function is called only once, after the material is created. The
   addCachedPixelShaderConstant calls should go here. */
virtual void OnSetCachedConstants(video::IMaterialRendererServices* services, s32 userData) = 0;
```

When the UniformInfo array is smaller, setting values for dynamic uniforms will be faster (fewer elements for binary_search to go through). I think this is a good optimization for shaders with many uniforms, and it doesn't complicate the existing interface too much. I will prepare a patch soon. What do you think about this optimization? Personally I think that hendu's patches which don't change the existing interface can be applied to v1.8, but I'm a relatively new developer on the team, so hybrid and CuteAlien should decide about it.

hendu · Post by **hendu** » Sat Sep 08, 2012 9:19 pm

I don't think it's irr's place to decide that. It should be up to the app to decide when to upload what.

For example, in my app I send some uniforms only once (texture sampler slots), some per frame (matrices), some per phase, etc etc.
Setting up one system would be far too limiting, and trying to enumerate all possibilities in an API would be impossible.
