The woes of Irrlicht's Skinning - Hardware GPU Skinning
I'm currently wrapping up the only thing that is preventing us from releasing a new version of BAW, BAWIrrlicht and using a GL 3.3+ core context...
Hardware Skinning
Why am I complaining again?
Simply because, for some strange reason, Irrlicht again does things completely differently from even the most basic tutorial found on the internet.
And it's about to do it again... (lack of TBOs)
What is wrong? - a whole list of things
1) The Normals being calculated are completely wrong (like 2 levels of wrong)
2) The weighting of vertices is managed through a list of indices per bone and a whole ton of linked lists
3) position, rotation and scale hints on joints are completely useless
4) ever present recursion
5) constant recalculation of the bounding box for the mesh
What will be wrong in Irrlicht 1.9?
1) Constant Waterfalling in Hardware Skinning -- you need to implement TBOs before you skin
2) BBox update for skinned meshes
3) The Normals
How did I fix these in my fork?
Texture Buffer Objects
If you attempt to pass joint/bone transformation matrices to the shader as a uniform (GLSL) / constant (HLSL) array, you're going to run into constant waterfalling.
Essentially, unless you run GCN 2.0 and use Uniform Buffer Objects, the uniform data sits in registers.
The GPU is a SIMD processor, i.e. the same instruction is carried out across all "threads", usually 32 at a time; in the case of a vertex shader, 32 vertices are processed at a time in one "warp".
This means that while e.g. a MUL instruction can take 32 different values as operands, every thread has to read them from the same register, and uniform array elements are distinct registers.
The texture fetch instruction is an exception, because the value in the register determines the memory location to fetch.
So when some threads use values from different registers (array indices), the instruction gets carried out multiple times and the results you don't want are masked out.
A similar thing happens with "divergent flow control", a.k.a. if-statements in shaders.
This is not a problem if all threads use the same register most of the time, but in skinning 32 subsequent vertices are very unlikely to be influenced by the same set of bones in the same order.
And that is why I implemented TBOs, which sit on top of IGPUBuffers that can be updated any way you deem appropriate (discard/recreate, BufferSubData, persistent mapping, N-buffer round robin),
and the data is fetched inside the shader through "texelFetch" from a "samplerBuffer" in parallel.
The only way one could implement GPU skinning right now without the TBO and IGPUBuffer infrastructure, and without suffering from constant waterfalling, would be to update an actual 2D (but really 1D) texture all the time.
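A minimal CPU-side sketch of how bone matrices could be flattened into the float array that backs such a TBO. The 3x4 layout (three RGBA32F texels per bone), the `BoneMatrix3x4` type and the `packBonesForTBO` name are assumptions made for illustration, not the fork's actual code; the upload itself (BufferSubData, persistent mapping, ...) is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3x4 affine bone matrix: three rows of (x, y, z, translation).
using BoneMatrix3x4 = std::array<std::array<float, 4>, 3>;

// Assumed layout: each matrix row becomes one RGBA32F texel.
constexpr std::size_t kTexelsPerBone = 3;

// Flatten all bone matrices into one contiguous float array.
// Bone b occupies texels [3*b, 3*b + 2], four floats per texel.
std::vector<float> packBonesForTBO(const std::vector<BoneMatrix3x4>& bones)
{
    std::vector<float> texels;
    texels.reserve(bones.size() * kTexelsPerBone * 4);
    for (const BoneMatrix3x4& m : bones)
        for (const std::array<float, 4>& row : m)
            texels.insert(texels.end(), row.begin(), row.end());
    assert(texels.size() == bones.size() * kTexelsPerBone * 4);
    return texels;
}
```

With this layout a vertex shader would read row r of bone b with something like texelFetch(boneTBO, int(b) * 3 + r), so every thread in the warp issues the same fetch instruction with its own address.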
Not Recalculating a Bounding Box for the skinned mesh from the vertices' positions
One of the complaints about SVN Irrlicht, which will become version 1.9, is that bounding boxes are not updated for hardware skinned meshes, and well, they can't be.
That's because the moved vertices are sent from the vertex shader straight to the rasterizer and no copy is kept.
Hell, even if a copy were retrieved, it would be stupid to download it for culling after the mesh has already been drawn.
So if you want a bounding box, you'd need to skin at least the positions on the CPU, which kind of defeats the objective!
WRONG
If you notice, the final vertex position is a linear combination of the original vertex position transformed by N bone/joint matrices.
The weights add up to 1 (and none are negative), so the combination MUST lie between the different blended positions (be contained by the 3D convex hull enclosing the positions being mixed).
So if you make a BoundingBox for each bone/joint by adding every vertex it influences (weight>0.f) into the box, then transform those boxes by the bone matrices after animation and merge them into one, you get a CONSERVATIVE bounding box for your skinned mesh, which completely contains the one you would have made by recalculating it from the moved vertex positions.
And all this at least 100x faster, or 800x if you use my new transformBoxEx() function.
Not only that, but you can draw the BoundingBox of the bone and get a much better visualization of the bone than just a line to its parent.
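The per-bone box idea above can be sketched roughly like this. `Vec3`, `Affine`, `AABB`, `transformBox` and `conservativeSkinnedBox` are hypothetical names for illustration; this is the plain 8-corner transform, not the transformBoxEx() function mentioned above, and it assumes the bone matrices are the final skinning matrices applied to bind-pose positions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical row-major 3x4 affine matrix: p' = M * (p, 1).
struct Affine { float m[3][4]; };

Vec3 transformPoint(const Affine& a, const Vec3& p)
{
    return { a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2] * p.z + a.m[0][3],
             a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2] * p.z + a.m[1][3],
             a.m[2][0] * p.x + a.m[2][1] * p.y + a.m[2][2] * p.z + a.m[2][3] };
}

struct AABB {
    // Starts out "inverted", i.e. empty, until the first point is added.
    Vec3 lo{ std::numeric_limits<float>::max(), std::numeric_limits<float>::max(), std::numeric_limits<float>::max() };
    Vec3 hi{ std::numeric_limits<float>::lowest(), std::numeric_limits<float>::lowest(), std::numeric_limits<float>::lowest() };

    bool empty() const { return lo.x > hi.x; }

    void add(const Vec3& p)
    {
        if (p.x < lo.x) lo.x = p.x;
        if (p.y < lo.y) lo.y = p.y;
        if (p.z < lo.z) lo.z = p.z;
        if (p.x > hi.x) hi.x = p.x;
        if (p.y > hi.y) hi.y = p.y;
        if (p.z > hi.z) hi.z = p.z;
    }
};

// Transform all 8 corners of the box and re-enclose them.
AABB transformBox(const Affine& a, const AABB& b)
{
    AABB out;
    if (b.empty()) return out;
    for (int i = 0; i < 8; ++i) {
        const Vec3 corner{ (i & 1) ? b.hi.x : b.lo.x,
                           (i & 2) ? b.hi.y : b.lo.y,
                           (i & 4) ? b.hi.z : b.lo.z };
        out.add(transformPoint(a, corner));
    }
    return out;
}

// boneBoxes[i] is built once at load time from every bind-pose vertex with
// weight > 0 for bone i; boneMatrices[i] is that bone's matrix this frame.
AABB conservativeSkinnedBox(const std::vector<AABB>& boneBoxes,
                            const std::vector<Affine>& boneMatrices)
{
    assert(boneBoxes.size() == boneMatrices.size());
    AABB merged;
    for (std::size_t i = 0; i < boneBoxes.size(); ++i) {
        const AABB moved = transformBox(boneMatrices[i], boneBoxes[i]);
        if (moved.empty()) continue; // bone influences no vertices
        merged.add(moved.lo);
        merged.add(moved.hi);
    }
    return merged;
}
```

The per-frame cost depends only on the number of bones (8 corner transforms each), not on the vertex count, which is where the speedup over re-scanning the skinned positions comes from.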
The Awful Linked Lists
The joint has a list of the vertices it influences and the weights it exerts on them; to skin, one must make a bool helper array to know whether it's the first bone to modify the position (use '=') or a later one (use '+=').
Linked lists are horribly inefficient, especially if I have to traverse the vertex array randomly to modify the values.
After we added a flexible vertex format (it supports all OpenGL vertex attribute input data: floats, integers, packed formats like R10G10B10A2), it became really expensive to set or read a position, which made the whole thing even slower.
And the recursion, it's just awful!
Instead we keep a list of up to 4 boneIDs per vertex that influence it, and cap the maximum number of bones at 256... everybody does it, even Crysis (except for the 256 bone limit).
We also notice that the weights have to add up to 1, so it's useless to store the 4th weight, and that we don't need the full range of a "float".
We use the RGB10A2 format for the weights and use the last 2 bits to tell us how many bones influence the vertex (1 up to 4).
This all boils down to only 8 bytes of extra data per vertex, and a 4x speed increase.
Every skinning tutorial does it like this.
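A rough sketch of the 8-byte layout described above: four 8-bit bone IDs plus one 32-bit word holding three 10-bit weights and a 2-bit count. The struct and function names are hypothetical, and storing the count as "influences minus one" in the top two bits is an assumption; the post only says the last 2 bits encode 1 to 4 bones.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct PackedSkin {
    std::uint8_t  boneIDs[4]; // up to 4 influences, bone IDs capped at 256
    std::uint32_t weights;    // bits 0-9, 10-19, 20-29: weights 0..2 as 10-bit UNORM
                              // bits 30-31: influence count minus one (assumed encoding)
};
static_assert(sizeof(PackedSkin) == 8, "expected 8 bytes of skinning data per vertex");

std::uint32_t packWeights(const float w[4], unsigned count)
{
    assert(count >= 1 && count <= 4);
    std::uint32_t packed = (count - 1u) << 30;
    for (unsigned i = 0; i < 3; ++i) {
        const float clamped = std::min(std::max(w[i], 0.f), 1.f);
        const std::uint32_t q = static_cast<std::uint32_t>(std::lround(clamped * 1023.f));
        packed |= q << (10u * i);
    }
    return packed; // the 4th weight is never stored
}

void unpackWeights(std::uint32_t packed, float w[4], unsigned& count)
{
    count = (packed >> 30) + 1u;
    float sum = 0.f;
    for (unsigned i = 0; i < 3; ++i) {
        w[i] = static_cast<float>((packed >> (10u * i)) & 1023u) / 1023.f;
        sum += w[i];
    }
    // Weights add up to 1, so the last one is implied; clamp away rounding error.
    w[3] = std::max(0.f, 1.f - sum);
}
```

In a shader the same reconstruction would be a single subtraction, and the 2-bit count lets the loop over influences stop early.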
Useless Caching - Pos/Rot/Scale Hints
I made myself a grid of 100 by 100 animated dwarves, all was fine until I set different animation speeds on them.
It turned out the dwarf was only being skinned once for all 10000 of them, because the same frame was being requested all the time.
In practice it almost never happens that all instances of an animated mesh play the same animation, at the same speed and perfectly in sync.
So instead of trying to accelerate the lookup with hints, I used std::lower_bound to find my frame keys. If log(N) proves to be too slow (versus the N of the linear search you get once a hint is off by more than one key),
one can use a fixed number of bins (e.g. 1024, which are fetched in O(1)) that give us smaller ranges than (0,maxFrameForLastKey) to binary search.
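The hint-free key lookup can be sketched like this. `PositionKey` and `findKey` are hypothetical names, and the keys are assumed to be sorted by frame; the binning refinement is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical key layout: a frame time plus the keyed position.
struct PositionKey { float frame; float x, y, z; };

// Index of the first key whose frame is >= the requested frame,
// clamped to the last key. No per-mesh hint state is needed, so any
// number of nodes can query the same key array at different frames.
std::size_t findKey(const std::vector<PositionKey>& keys, float frame)
{
    assert(!keys.empty());
    auto it = std::lower_bound(keys.begin(), keys.end(), frame,
        [](const PositionKey& k, float f) { return k.frame < f; });
    if (it == keys.end())
        --it; // past the last key: clamp
    return static_cast<std::size_t>(it - keys.begin());
}
```

The key before the returned index (if there is one) is the other end of the interval to interpolate across.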
Normals - Level 1 Of Wrong
Simply multiplying with the upper-left 3x3 sub-matrix of the transformation matrix will not transform the normal properly as soon as that matrix contains non-uniform scale or shear; the InverseTranspose of that 3x3 is the correct NormalMatrix!!!
Every skinning tutorial on the internet mentions THIS!
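To make the point above concrete, here is a small sketch of the inverse transpose computed as the cofactor matrix divided by the determinant. `Mat3`, `Vec3`, `mul`, `dot` and `inverseTranspose` are names picked for this example, not Irrlicht API.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<std::array<float, 3>, 3>; // row-major

Vec3 mul(const Mat3& m, const Vec3& v)
{
    return { m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2],
             m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2],
             m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] };
}

float dot(const Vec3& a, const Vec3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// inverse(M) = cofactor(M)^T / det(M), so inverse(M)^T = cofactor(M) / det(M).
Mat3 inverseTranspose(const Mat3& m)
{
    Mat3 c;
    c[0][0] = m[1][1] * m[2][2] - m[1][2] * m[2][1];
    c[0][1] = m[1][2] * m[2][0] - m[1][0] * m[2][2];
    c[0][2] = m[1][0] * m[2][1] - m[1][1] * m[2][0];
    c[1][0] = m[0][2] * m[2][1] - m[0][1] * m[2][2];
    c[1][1] = m[0][0] * m[2][2] - m[0][2] * m[2][0];
    c[1][2] = m[0][1] * m[2][0] - m[0][0] * m[2][1];
    c[2][0] = m[0][1] * m[1][2] - m[0][2] * m[1][1];
    c[2][1] = m[0][2] * m[1][0] - m[0][0] * m[1][2];
    c[2][2] = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    const float det = m[0][0] * c[0][0] + m[0][1] * c[0][1] + m[0][2] * c[0][2];
    assert(det != 0.f); // singular matrices have no normal matrix
    for (auto& row : c)
        for (float& e : row)
            e /= det;
    return c;
}
```

For a matrix that scales x by 2, a surface with tangent (1,-1,0) and normal (1,1,0) ends up with tangent (2,-1,0); the naively transformed normal (2,1,0) is no longer perpendicular to it, while the inverse-transpose result (0.5,1,0) is.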
Normals - Level 2 Of Wrong
Here I can't blame anyone, as no implementation really takes care of it: blending the correct 3x3 inverse transposes does not always give the correct normals, unless all vertices involved are influenced by 1 bone with a weight of 1.
The weights change from vertex to vertex, hence they vary across the triangle face, which makes the triangle stretch and rotate, and that invalidates any normals which were pre-calculated.
Imagine a cube, where 4 corners at the top fully belong to bone A and the 4 at the bottom belong to B. Now scale bone A down or rotate it, and you'll see that the sides now have wrong normals, but the top and bottom are fine.
There are some solutions to this in research papers, so I will update you on how I solve that.
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Thanks. Unfortunately we simply don't have any coder working on the animation system since Luke created it years ago.
Would maybe help if you could post test-cases which show the problems with normals.
And yeah - Irrlicht uses linked-list everywhere. I hate it, but it's hard to change without breaking interfaces all over the place. Not sure in this case (I didn't work on animation system, so not very familiar with it).
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
DEVsh, do you plan to use characters in the next release of BAW?
It would be great if someone skilled enough to read your code, once you implement this in IrrlichtBAW, could also fix this problem in Irrlicht. The modifications I've made to the animation system to take the animations out of the mesh data and store them in the node data work, but they are really inefficient, too much memory wasted... The struct holds too much stuff and I reused it "as-is"...
I'm still happy that Luke contributed to the animation system, before he improved it, it was much worse...
EDIT:
devsh wrote: Instead we keep a list of up to 4 boneIDs per vertex that influence it...
So would it mean a vertex could be affected by up to 4 bones? I've seen an issue with another engine that used only 4 and had issues with models from Mixamo; they implemented up to 8 instead to fix it. Here is the link: http://steamcommunity.com/app/443970/di ... 489402139/
Last edited by christianclavet on Sun Sep 25, 2016 2:00 pm, edited 1 time in total.
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
That was a well-written post devsh. Your style has much improved.
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Can't wait to see some of your work, devsh. I recently started using Irrlicht trunk, as the shader pipeline hasn't been updated in a long time.
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
"Normals - Level 2 Of Wrong" maybe could be fixed using quaternion slerps instead of matrices linear interpolations. Normals don't change with position, they are rotational vectors by nature, thus interpolating any amount of transformations to get a proper rotation is the work of the quaternions. I don't know how much complexity would add, though, but sounds definitively reasonable.
"There is nothing truly useless, it always serves as a bad example". Arthur A. Schmitt
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Think about the cube, the rotation of the vertices is not uniform throughout the mesh
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
I'm about to open my veins...
I've flattened out the Bone/Joint tree into a single flat array where the children sit at later positions than their parents; this removes the recursion and will allow for GPU-Boning.
The caching mechanism prevents the mesh from being skinned or animated multiple times when getMeshForCurrentFrame() is requested.
The thing is, if you have multiple nodes using the same mesh, they have different CurrentFrames, so they pollute the LastFrameAnimated and SkinnedThisFrame states inside CSkinnedMesh.
So basically unless you have just 1 node per mesh, or all your nodes using the mesh are playing the same animation loop in sync... the mesh gets skinned 3 times and its BBox is recalculated 3 times!!!!
This also means that every node needs to be funking animated and skinned twice just so you can cull it from rendering XD
christianclavet wrote: So would it mean a vertex could be affected by up to 4 bones? I've seen an issue with an other engine that used only 4 and had issues with models from Mixamo, they implemented up to 8 instead to fix it.
It doesn't matter, I reskin the mesh to only consider the 4 most powerful bones... again, you could modify the mesh loader to allow for more weights and bones.
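The flattened joint array mentioned earlier in this post (children stored after their parents) can be sketched as below. To keep it short the local transform is reduced to a translation, which is a simplification for illustration; a real joint would concatenate full matrices in the same loop. `Joint` and `computeGlobalOffsets` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct Joint {
    int  parent; // index into the same array, -1 for a root; always smaller than own index
    Vec3 local;  // simplified local transform: translation relative to the parent
};

// One forward pass, no recursion: because every parent precedes its
// children, the parent's global transform is already final when read.
std::vector<Vec3> computeGlobalOffsets(const std::vector<Joint>& joints)
{
    std::vector<Vec3> global(joints.size());
    for (std::size_t i = 0; i < joints.size(); ++i) {
        const Joint& j = joints[i];
        assert(j.parent < static_cast<int>(i)); // ordering invariant of the flat array
        global[i] = j.local;
        if (j.parent >= 0) {
            const Vec3& p = global[static_cast<std::size_t>(j.parent)];
            global[i].x += p.x;
            global[i].y += p.y;
            global[i].z += p.z;
        }
    }
    return global;
}
```

Since the loop only needs the joint array and an output array, each node can own its own output and the shared mesh stays free of per-frame state.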
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
devsh wrote: I'm about to open my veins...
The Caching mechanism prevents the mesh being skinned or animated multiple times when a getMeshForCurrentFrame() is being requested.
The thing is, if you have multiple nodes using the same mesh they have different CurrentFrames so they pollute the LastFrameAnimated and SkinnedThisFrame states inside CSkinnedMesh
Hi. Was almost doing the same thing when I was looking at the CSkinnedMesh struct, and your kung-fu programming skills are way more powerful than mine!
I'm sure that you'll be able to come up with a much better solution!
devsh wrote: So basically unless you have just 1 node per mesh, or all your nodes using the mesh are playing the same animation loop in sync... the mesh gets skinned 3 times and its BBox is recalculated 3 times!!!!
This also means that every node needs to be funking animated and skinned twice just so you can cull it from rendering XD
My idea was to put the animation data in the node after loading it, so each instance could get access to its own animation data, which allows for much more flexibility (multiple instances of the same mesh having different animations, and easier animation tweaking)... Later I found out that this could be done externally, but the CSkinnedMesh and the struct for the weights and animation are not that efficient...
devsh wrote: It doesn't matter, I reskin the mesh to only consider the 4 most powerful bones.. again you could modify the mesh loader to allow for more weights and bones
Hummm. If you can, try to get some meshes from Mixamo with the auto-rigger and test...
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Good skinning for characters seldom requires even more than 3 bones.
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Threw out IBoneSceneNodes as well as EJUOR_enums, IBoneSceneNode is now a subclass of ISkinningStateManager and updatingAbsolutePositions makes the parents recalculate (without using recursive functions).
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Aaaaand I've finished
My GPU implementation gives a 6-8x (99 vs 600-800) FPS speedup over the CPU skinning from Irrlicht 1.8.4 against the code sample posted below.
I was expecting the gain to be MUCH MUCH more, but:
a) the dwarf mesh only has 1200 vertices (nowadays skinned meshes regularly have in excess of 16k)
b) my XLoader loads un-interleaved vertex data (will fix this!!!), which costs about 20% of render speed
c) the dwarf has many bones (46) and a very deep bone hierarchy (12 - 16 levels)
d) the performance deterioration under CPU skinning is ~linear (8000fps 1 dwarf -> 99fps 100 dwarves),
under GPU skinning it's better than linear (11600fps 1 dwarf -> 650fps 100 dwarves)
This means that CPU skinning has the balance skewed in its favour, otherwise I'd be 60-100x faster
My CPU bottleneck is updating (streaming) the backing buffer for the TBO, and calculating the bone positions.
I'll update you on how much INSTANCING has helped once I implement it!!!
P.S. My Irrlicht fork is 45% faster to begin with XD
Code:
#include <irrlicht.h>
#include <iostream>
#include "driverChoice.h"

using namespace irr;

#ifdef _MSC_VER
#pragma comment(lib, "Irrlicht.lib")
#endif

int main()
{
    // ask if user would like shadows (only decides whether a stencil buffer is requested)
    char i;
    std::cout << "Please press 'y' if you want to use realtime shadows." << std::endl;
    std::cin >> i;
    const bool shadows = (i == 'y');

    // ask user for driver
    video::E_DRIVER_TYPE driverType = driverChoiceConsole();
    if (driverType == video::EDT_COUNT)
        return 1;

    /*
    Create device and exit if creation failed. We make the stencil flag
    optional to avoid slow screen modes for runs without shadows.
    */
    IrrlichtDevice* device =
        createDevice(driverType, core::dimension2d<u32>(1920, 1080),
            16, false, shadows);
    if (device == 0)
        return 1; // could not create selected driver.

    video::IVideoDriver* driver = device->getVideoDriver();
    scene::ISceneManager* smgr = device->getSceneManager();

    // Load the animated dwarf; bail out if the file is missing.
    scene::IAnimatedMesh* mesh = smgr->getMesh("../../media/dwarf.x");
    if (mesh == 0)
    {
        device->drop();
        return 1;
    }

    // Instance the dwarf on a square grid, every node with its own
    // animation speed so that no two nodes request the same frame.
    const size_t kInstanceSquareSize = 10;
    for (size_t x = 0; x < kInstanceSquareSize; x++)
        for (size_t z = 0; z < kInstanceSquareSize; z++)
        {
            scene::IAnimatedMeshSceneNode* anode = smgr->addAnimatedMeshSceneNode(mesh);
            anode->setScale(core::vector3df(0.5f));
            anode->setPosition(core::vector3df(float(x), 0.f, float(z)) * 40.f);
            anode->setAnimationSpeed(18.f * float(x + 1 + (z + 1) * kInstanceSquareSize)
                / float(kInstanceSquareSize * kInstanceSquareSize));
        }

    scene::ICameraSceneNode* camera = smgr->addCameraSceneNodeFPS();
    camera->setPosition(core::vector3df(-50, 50, -150));
    camera->setFarValue(10000.0f); // push the far clip plane out so the whole grid stays visible

    // disable mouse cursor
    device->getCursorControl()->setVisible(false);

    // Finally we simply have to draw everything, that's all.
    s32 lastFPS = -1;
    while (device->run())
        if (device->isWindowActive())
        {
            driver->beginScene(true, true, 0);
            smgr->drawAll();
            driver->endScene();

            const s32 fps = driver->getFPS();
            if (lastFPS != fps)
            {
                core::stringw str = L"Irrlicht Engine - SpecialFX example [";
                str += driver->getName();
                str += "] FPS:";
                str += fps;
                device->setWindowCaption(str.c_str());
                lastFPS = fps;
            }
        }

    device->drop();
    return 0;
}
Re: The woes of Irrlicht's Skinning - Hardware GPU Skinning
Sounds great, devsh. Will this work with trunk?