I'm here to explain why your hardware skinning might be slow or might not work at all.
[WARNING] - Blindslide's shader code seems to work only on GLSL 4.0 (as on my primary PC)! I tested it on my HTPC (Radeon 4350 - Pentium - GLSL 3.3) and the shader gives 9 errors and fails to compile! I found this out after Lazerblade's message... We'll need to convert the shader to an earlier GLSL version, as my current knowledge of shaders is insufficient for the task. Until then, the OpenGL version will only work on very recent cards... (I wonder how shadowlair could have tested OpenGL?!)
With: 580 fps
Without: 830 fps
Conclusion: my hardware sucks! =/
With: 23 fps
Without: 47 fps
Lol

Constant Cascading
This is probably why the shader refuses to work on pre-OpenGL 4.0 hardware: you're passing the bone data as a uniform array (BAAAAAAD).
When the vertex shader accesses the bone array by bone ID stored in the vertex attribute, it can ask for different bone IDs between the vertex shader invocations.
On pre-GL 4.0 hardware, the uniform data sits in constant registers, which do not allow divergent access.
Basically, 8 to 32 invocations of the vertex shader run at once as a warp. If they ask for different bone data (different indices into the array), this causes branching: the shader fetches all the requested bone data serially and then masks the results for each vertex-shader thread in the warp.
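To put rough numbers on the argument above, here is a toy cost model in C++. It is only a sketch under a loud assumption: each distinct bone index inside one warp costs one serialized constant fetch. The function name serializedFetches and the model itself are made up for illustration; real hardware will differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Toy model (an assumption, not a measurement): within one warp, every
// distinct bone index costs one serialized fetch from constant registers.
// boneIds holds the bone index requested by each vertex-shader invocation.
std::size_t serializedFetches(const std::vector<int>& boneIds, std::size_t warpSize)
{
    if (warpSize == 0) {
        return 0; // nothing sensible to compute
    }
    std::size_t total = 0;
    for (std::size_t start = 0; start < boneIds.size(); start += warpSize) {
        const std::size_t end = std::min(start + warpSize, boneIds.size());
        // Count the distinct indices requested inside this warp.
        const std::set<int> distinct(
            boneIds.begin() + static_cast<std::ptrdiff_t>(start),
            boneIds.begin() + static_cast<std::ptrdiff_t>(end));
        total += distinct.size();
    }
    return total;
}
```

Under this model, a warp of 32 invocations that all ask for the same bone costs 1 fetch, while 32 invocations asking for 32 different bones cost 32.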
You need to make a class for the Texture Buffer Object (TBO), which is a texture "window" onto a GPU buffer. It lets you use texelFetch() inside the shader, and the fetched values are cached in the texture cache.
You also need my transient or granular buffer to update the buffer object... or just call glBufferSubData every frame.
Then, if different "threads" ask for different texels, they can do so in parallel, just as the pixel shader does when texturing.
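A minimal sketch of what the shader side of the TBO approach could look like, kept as a C++ string so it can be handed straight to glShaderSource. Everything named here (kSkinningVertexShader, boneTBO, viewProjection, the attribute locations) is hypothetical, and the layout of three RGBA32F texels per bone (the rows of a 3x4 matrix) is an assumption, not the layout used by the shader discussed in this thread. On the host side this assumes the buffer is attached with glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, buffer) and the bone IDs are set up with glVertexAttribIPointer.

```cpp
#include <string>

// Hypothetical GLSL 3.30 vertex shader that reads bone data from a
// buffer texture instead of a uniform array.
const std::string kSkinningVertexShader = R"GLSL(
#version 330 core
layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec4 inBoneWeights; // may be normalized uint8 in the vertex buffer
layout(location = 2) in ivec4 inBoneIDs;    // integer attribute, not converted to float

uniform samplerBuffer boneTBO;  // assumed: GL_RGBA32F, 3 texels per bone
uniform mat4 viewProjection;

// Transform point p by bone 'id', stored as the three rows of a 3x4 matrix.
vec3 skin(int id, vec4 p)
{
    vec4 r0 = texelFetch(boneTBO, id * 3 + 0);
    vec4 r1 = texelFetch(boneTBO, id * 3 + 1);
    vec4 r2 = texelFetch(boneTBO, id * 3 + 2);
    return vec3(dot(r0, p), dot(r1, p), dot(r2, p));
}

void main()
{
    vec4 p = vec4(inPosition, 1.0);
    vec3 skinned = vec3(0.0);
    for (int i = 0; i < 4; ++i)
        skinned += inBoneWeights[i] * skin(inBoneIDs[i], p);
    gl_Position = viewProjection * vec4(skinned, 1.0);
}
)GLSL";
```

Buffer textures and texelFetch() on a samplerBuffer are core since OpenGL 3.1, so a shader along these lines should not need GLSL 4.0.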
If you really want to test if I'm correct, then "freeze" the animation and shove the uniform bone data into a small 1xN floating point texture and make the shader get its data from the texture instead.
Not to mention that a TBO gives you something like 128MB to play with (strictly, the spec only guarantees 65536 texels for GL_MAX_TEXTURE_BUFFER_SIZE, so query it rather than assume), as opposed to the 16KB of uniforms you can have in one shader. It's also persistent (you don't have to re-upload every time you set the shader), and its size allows for a number of bones limited only by the bit depth of your per-vertex boneID.
Additionally, the 128MB can be used for skinned mesh instancing (1000 dwarves with 34 bones each, stored as 8-float dual quaternions, all drawn in one pass).
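The numbers for the dwarf example can be checked with a few lines of C++. This is a sketch with made-up helper names (boneDataBytes, maxDualQuatsInBudget, firstTexel). It assumes 4-byte floats and that each dual quaternion occupies two consecutive RGBA32F texels, laid out instance by instance.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "the arithmetic below assumes 4-byte floats");

constexpr std::size_t kFloatsPerDualQuat = 8; // real part + dual part

// Bytes needed to store every bone of every instance.
constexpr std::size_t boneDataBytes(std::size_t instances, std::size_t bonesPerInstance)
{
    return instances * bonesPerInstance * kFloatsPerDualQuat * sizeof(float);
}

// How many dual quaternions fit into a given byte budget.
constexpr std::size_t maxDualQuatsInBudget(std::size_t budgetBytes)
{
    return budgetBytes / (kFloatsPerDualQuat * sizeof(float));
}

// Index of the first texel of a bone, assuming 2 RGBA32F texels per
// dual quaternion and instances stored back to back.
constexpr std::size_t firstTexel(std::size_t instance,
                                 std::size_t bonesPerInstance,
                                 std::size_t boneId)
{
    return (instance * bonesPerInstance + boneId) * 2;
}
```

boneDataBytes(1000, 34) comes to 1,088,000 bytes (a little over 1MB), while maxDualQuatsInBudget(16 * 1024) is 512, which is only enough for 15 complete 34-bone dwarves in a 16KB uniform budget.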
Other reasons:
A) Too much data. Do you really need floating-point bone weights? Why not use normalized uint8 (a.k.a. bone weights as color)?
B) Why not store bone transformations as dual quaternions? You don't need 4x4 or 4x3 matrices! Blending dual quaternions is much faster!
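Both points can be sketched on the CPU in C++. The names packWeights, DualQuat and blend are hypothetical. The code assumes four weights per vertex that are non-negative and sum to roughly 1, quaternions stored as (x, y, z, w), and plain linear blending followed by normalization. Treat it as an illustration of the idea rather than a drop-in implementation.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

using Vec4 = std::array<float, 4>;

static float dot(const Vec4& a, const Vec4& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// A) Quantize float weights to normalized uint8. The rounding error is
// given to the largest weight so the four bytes always sum to 255.
std::array<std::uint8_t, 4> packWeights(const Vec4& w)
{
    std::array<int, 4> q{};
    int sum = 0;
    std::size_t largest = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        q[i] = static_cast<int>(std::lround(w[i] * 255.0f));
        sum += q[i];
        if (q[i] > q[largest]) largest = i;
    }
    q[largest] += 255 - sum;
    return {{static_cast<std::uint8_t>(q[0]), static_cast<std::uint8_t>(q[1]),
             static_cast<std::uint8_t>(q[2]), static_cast<std::uint8_t>(q[3])}};
}

// B) A bone transform as a dual quaternion: 8 floats instead of 12 or 16.
struct DualQuat {
    Vec4 real; // rotation
    Vec4 dual; // encodes translation
};

// Weighted linear blend of four dual quaternions, then normalization by
// the length of the real part. Assumes the blended real part is non-zero.
DualQuat blend(const std::array<DualQuat, 4>& bones, const Vec4& w)
{
    DualQuat out{};
    for (std::size_t i = 0; i < 4; ++i) {
        // Flip bones lying in the opposite hemisphere from the first one,
        // so the blend takes the short way round.
        const float s = dot(bones[0].real, bones[i].real) < 0.0f ? -w[i] : w[i];
        for (std::size_t k = 0; k < 4; ++k) {
            out.real[k] += s * bones[i].real[k];
            out.dual[k] += s * bones[i].dual[k];
        }
    }
    const float len = std::sqrt(dot(out.real, out.real));
    for (std::size_t k = 0; k < 4; ++k) {
        out.real[k] /= len;
        out.dual[k] /= len;
    }
    return out;
}
```

With uint8 weights the four weights of a vertex fit in 4 bytes instead of 16, and the blend is 8 multiply-adds per bone plus one normalization.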