Regarding vcalls
Hi,
I've come to the conclusion that some of the design decisions in Irrlicht lead to very frequent calls to virtual functions. On most modern desktop architectures, such as the i7, this isn't much of a problem: they have long enough CPU pipelines and big enough caches to hide this type of issue. On architectures such as the PPC cores in the Xbox 360 and the PS3, however, these calls become a lot more expensive.
My idea for a solution is an intermediate structure, similar to a command buffer, that is built up by the scene manager's drawAll() method. The command buffer holds platform-independent commands that are sent to the device once the buffer is built. That way the platform-specific implementation of the device can eliminate not only a lot of vcalls but a lot of function calls in total. The idea is that adding a command is inlined and thus does not involve any real function call.
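To make the "adding a command is inlined" part concrete, here is a minimal sketch. All names in it (ECommandType, SCommand, CCommandBuffer, recordFrame) are made up for illustration and are not existing Irrlicht API; it assumes every command fits in one small plain-data record.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; none of this is existing Irrlicht API.
enum class ECommandType { Clear, SetRenderTarget, Draw };

// Plain data record: no virtual functions, so appending one is just a copy.
struct SCommand
{
    ECommandType Type;
    int Arg0; // meaning depends on Type (here: start index for Draw)
    int Arg1; // meaning depends on Type (here: index count for Draw)
};

class CCommandBuffer
{
public:
    // Defined in the class body, so the compiler is free to inline it.
    void push(ECommandType type, int arg0 = 0, int arg1 = 0)
    {
        Commands.push_back(SCommand{type, arg0, arg1});
    }

    std::size_t size() const { return Commands.size(); }

    const SCommand& operator[](std::size_t i) const
    {
        assert(i < Commands.size());
        return Commands[i];
    }

    // Empties the buffer for the next frame.
    void clear() { Commands.clear(); }

private:
    std::vector<SCommand> Commands;
};

// Stand-in for what a scene manager's drawAll() might record in one frame.
inline void recordFrame(CCommandBuffer& buffer, int meshCount)
{
    buffer.push(ECommandType::Clear);
    for (int i = 0; i < meshCount; ++i)
        buffer.push(ECommandType::Draw, i * 36, 36);
}
```

Recording a frame then touches no virtual function at all; the only place a virtual call would still be needed is the single hand-off of the finished buffer to the device.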
This really should speed Irrlicht up on scenes with a high number of draw calls.
Would Irrlicht welcome such an implementation?
Cheers,
Simp.
I think we should rather concentrate on writing good benchmarks first before starting to rework the architecture. So far I've only found time to write a bunch of benchmarks for some of the low-level classes (arrays, strings), but I still plan to improve my profiler far enough that I feel comfortable adding it to Irrlicht. That would be the first step. The next would be to think up some benchmark framework. At least that's what I would like to see.
I wouldn't think too much about reworking the architecture before we have hard data; guessing about speed always goes wrong.
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
Hi CuteAlien,
With all respect, with the current solution you'll see vcall issues on shorter-pipeline architectures once you reach roughly 300-500 draw calls. Anything less than that will most likely not show up as noticeable in a profiler.
The difference is not only the two fetches that a vcall lookup costs, but also instruction-cache misses and pipeline flushes, not to mention the overhead of making the function call at all. A lot of research has been done on this issue, and while I'm not recommending micro-optimizations like this everywhere (which is just unnecessary), I've pinpointed a very specific case: an interface that is used a lot in the inner loops of the entire renderer.
Published comparisons show results like these (ignore the absolute times and look at the ratios):
virtual functions: 5008477914 ticks (2282.140339 ms)
inline functions: 701710284 ticks (319.738127 ms)
normal (direct) functions: 4791365040 ticks (2183.227339 ms)
per-object function pointers: 5290894092 ticks (2410.842116 ms)
That's on a 2.2 GHz laptop. The inlined version takes about 14% of the time of the virtual one on this type of machine.
The same kind of measurement on the Xbox 360:
virtual: 159.856 ms
direct: 67.962 ms
inline: 8.040 ms
These numbers are from either the "Some Assembly Required" or the "Mischief Mayhem Soap" blog. I suggest reading up on them.
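For anyone who wants to reproduce this kind of comparison, a rough sketch of such a measurement could look like the following. This is not the code behind the numbers above; the names (ICounter, CVirtualCounter, CPlainCounter, timeCalls) are invented here, and an optimizing compiler may devirtualize or inline either loop, so the timings only mean something after checking the generated code on the target platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <memory>

// Interface with a virtual call in the hot loop (hypothetical example type).
struct ICounter
{
    virtual ~ICounter() = default;
    virtual void add(std::uint64_t v) = 0;
    std::uint64_t Total = 0;
};

struct CVirtualCounter : ICounter
{
    void add(std::uint64_t v) override { Total += v; }
};

// Same work without any virtual function; the call can be inlined.
struct CPlainCounter
{
    void add(std::uint64_t v) { Total += v; }
    std::uint64_t Total = 0;
};

struct SResult
{
    std::uint64_t Total;  // checksum, so the loop has an observable result
    double Milliseconds;  // wall-clock time for the whole loop
};

// Calls counter.add() 'iterations' times and reports the elapsed time.
template <typename Counter>
SResult timeCalls(Counter& counter, std::uint64_t iterations)
{
    assert(iterations > 0 && "nothing to measure");
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        counter.add(i);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return SResult{counter.Total, elapsed.count()};
}

inline SResult measureVirtual(std::uint64_t iterations)
{
    std::unique_ptr<ICounter> counter(new CVirtualCounter());
    return timeCalls<ICounter>(*counter, iterations); // dispatches through the base interface
}

inline SResult measureDirect(std::uint64_t iterations)
{
    CPlainCounter counter;
    return timeCalls(counter, iterations);
}
```

Comparing measureVirtual(n).Milliseconds with measureDirect(n).Milliseconds for a large n gives a first impression, but a body this small says little about a real renderer, where cache behaviour around the call matters more.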
I'm not asking you guys to do this; it's something I'm considering doing, and I'm wondering whether you would accept such a patch. It would essentially mean that these kinds of virtuals would be removed, and instead the driver would be called once per frame to walk through the command buffer and resolve all the commands at once, rather than splitting the work across the vcall interface that we have now. The same kind of "extensibility" would still exist in Irrlicht.
What kind of profiler are you developing for Irrlicht?
Cheers,
S
The point is not that inline functions are faster than virtual function calls; that's obvious. The point is that working on major architecture changes for optimization without a framework to actually test the improvements easily results in doing the wrong optimizations, or even in accidentally adding worse cases.
The profiler I want to add uses manual begin/end blocks and is mostly useful for finding peak values, which the usual sample-based profiler tools are bad at. It's the typical kind of profiler you find in many engines, and I've been using it myself for a long time. But I want to make it a little more comfy to use before adding it to the engine. The last patch for it is: http://www.michaelzeilfelder.de/irrlich ... iler.patch
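As a rough illustration of the begin/end idea (this is not the patch linked above; CProfiler, SProfileData and CScopedProfile are names invented for this sketch), a block profiler that keeps the peak per named block can be as small as:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Statistics kept per named block.
struct SProfileData
{
    unsigned Calls = 0;
    double TotalMs = 0.0;
    double PeakMs = 0.0; // longest single begin/end interval seen so far
};

class CProfiler
{
public:
    using Clock = std::chrono::steady_clock;

    void begin(const std::string& name) { Starts[name] = Clock::now(); }

    void end(const std::string& name)
    {
        const auto it = Starts.find(name);
        assert(it != Starts.end() && "end() without matching begin()");
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - it->second;
        Starts.erase(it);
        add(name, elapsed.count());
    }

    // Kept separate from end() so already measured durations can be fed in too.
    void add(const std::string& name, double ms)
    {
        SProfileData& d = Data[name];
        ++d.Calls;
        d.TotalMs += ms;
        d.PeakMs = std::max(d.PeakMs, ms);
    }

    // Throws std::out_of_range if nothing was recorded under 'name'.
    const SProfileData& get(const std::string& name) const { return Data.at(name); }

private:
    std::map<std::string, Clock::time_point> Starts;
    std::map<std::string, SProfileData> Data;
};

// RAII helper: begin() on construction, end() when the scope is left.
class CScopedProfile
{
public:
    CScopedProfile(CProfiler& profiler, const std::string& name)
        : Profiler(profiler), Name(name)
    {
        Profiler.begin(Name);
    }
    ~CScopedProfile() { Profiler.end(Name); }

private:
    CProfiler& Profiler;
    std::string Name;
};
```

Wrapping a draw loop in a CScopedProfile and reading PeakMs afterwards shows the single worst frame, which an averaging, sample-based tool tends to smooth away.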
(and every speed-optimization can show up in profilers when you just call it often enough...)
edit: But please do make a short example of what kind of commands you are thinking of. We're certainly interested in architecture ideas.
Sure, all speed optimizations show up in a profiler, but what I mean is that the vcall cost doesn't stand out when you run a profiler to "find" what the "issue" is.
Anyway, what I had in mind was something along the lines of:
Code:
    struct SGeometry; // defined elsewhere

    struct SDrawCall
    {
        const SGeometry* m_GeometryPtr; // geometry to draw
        int m_StartIndex;               // first index to draw
        int m_IndicesNum;               // number of indices to draw
    };
You could probably turn that into a linked list instead of a single pointer if you want to be able to have several geometries for one draw call (which you should). SDrawCall should probably hold a lot of other information as well, such as sampler modes, blend modes, depth tests, fragment/vertex programs, etc., but I'm just trying to make a point here.
Then you have:
Code:
    struct SRenderCommand : public SDrawCall
    {
        enum CommandType
        {
            CLEAR,
            SET_RENDERTARGET
            // etc.
        };
        CommandType m_CommandType; // tells the device how to interpret this command
    };
With specific implementations:
Code:
    struct SClearRenderCommand : public SRenderCommand
    {
        enum ClearType
        {
            COLOR,
            DEPTH,
            STENCIL
        };
        // Command specific implementations here
    };
And whichever implementation you choose for specific commands.
You can then provide a simple structure that you use to push the commands onto a stack, or whichever structure you want to use. Once the structure is built, you pass it to a device and let the device go through the non-device-specific command list and handle each command the way it wants to. That would also make it easier to "multithread" Irrlicht if one wanted to. You can always build one list while the other one is being rendered; you just need two (or more) buffers to build the commands into.
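A minimal sketch of that last step, assuming a flattened command record rather than the inheritance shown in the structs above. SCommand, SCountingDriver and CDoubleBuffer are hypothetical names, the "driver" only counts what it would do, and no real thread synchronization is shown:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ECommandType { Clear, SetRenderTarget, Draw };

// Flattened command record (an assumption of this sketch).
struct SCommand
{
    ECommandType Type;
    int StartIndex; // only meaningful for Draw
    int IndicesNum; // only meaningful for Draw
};

using CommandList = std::vector<SCommand>;

// Stand-in for one platform-specific device; it only counts what it would do.
struct SCountingDriver
{
    int Clears = 0;
    int TargetChanges = 0;
    int IndicesDrawn = 0;

    // Called once per frame; per-command dispatch is a plain switch, not a vcall.
    void execute(const CommandList& commands)
    {
        for (const SCommand& c : commands)
        {
            switch (c.Type)
            {
            case ECommandType::Clear:           ++Clears; break;
            case ECommandType::SetRenderTarget: ++TargetChanges; break;
            case ECommandType::Draw:            IndicesDrawn += c.IndicesNum; break;
            default: assert(false && "unknown command type"); break;
            }
        }
    }
};

// Two lists: one being recorded while the other one is being rendered.
class CDoubleBuffer
{
public:
    CommandList& recording() { return Lists[RecordIndex]; }
    const CommandList& submitted() const { return Lists[1 - RecordIndex]; }

    // Call at the frame boundary: the recorded list becomes the submitted one,
    // and the old submitted list is emptied for new recording.
    // A threaded version would need a synchronization point here.
    void flip()
    {
        RecordIndex = 1 - RecordIndex;
        Lists[RecordIndex].clear();
    }

private:
    CommandList Lists[2];
    std::size_t RecordIndex = 0;
};
```

Each frame the scene side fills recording(), flip() is called at the frame boundary, and the device walks submitted() in a single execute() call.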
Thanks for the ideas. Well, I guess this is atm more Hybrid's territory than mine (I try to get more familiar with device-coding myself, but not there yet for such decisions). I'll work for now on my profiler (when I get to it). But maybe others have comments on this...
That is the way most AAA engines handle their rendering these days: queued render commands, resulting in minimal drawing overhead in a dedicated thread.
I'm not sure how much Irrlicht would need to change in order to cope with this new render framework, but it's definitely a can of worms all by itself.
Take into account the changes needed for flexible vertex formats, and you're talking about a complete rewrite of Irrlicht, or a new engine altogether...
I still don't see too much relation between the virtual method calls and the render call batching. But IMHO the former is not our main problem with the engine right now, and the latter would require too much middleware if this gets exposed to the user (talking about custom scene nodes etc). But things like render call batching for the GUI are already in discussion, which might go in this direction, or in a more pragmatic one just on the low-level structures.
hybrid wrote: I still don't see too much relation between the virtual method calls and the render call batching.
Well, batching render calls means that you can essentially exchange a lot of vcalls for a lot fewer. That's the relation.
hybrid wrote: But IMHO the former is not our main problem with the engine right now,
Hmm, I'm not trying to claim that this is the main problem. I'm just pointing out the issue and asking whether such an API change is something that Irrlicht would accept, should I decide to implement it.
hybrid wrote: and the latter would require too much middleware if this gets exposed to the user (talking about custom scene nodes etc).
Huh? I'm not sure I follow; what does this have to do with middleware? The idea isn't to expose this outside the Irrlicht API. It's an internal thing (unless people have extended IVideoDriver, which is the interface that would change).
Cheers,
/S