write your own intrinsic implementation? (e.g. _mm_mul_ps() etc.)
- Convert SIMD instructions into multiple regular x86 instructions. It is possible, and it would at least remove some maintenance burden from C++ developers: you just write the SIMD code once. As far as I know, Emscripten already does that, but that's for the web.
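To make the "write your own intrinsic" idea concrete, here is a rough sketch of a plain scalar stand-in for _mm_mul_ps(). The names `Vec4f`, `fallback_set_ps` and `fallback_mul_ps` are made up for illustration and are not part of any real header; I'm only assuming the usual semantics of the real intrinsics (four float lanes, lane-wise multiply, and _mm_set_ps() taking its arguments highest lane first).

```cpp
#include <cstddef>

// Hypothetical stand-in for __m128: four packed single-precision floats.
struct alignas(16) Vec4f {
    float lane[4];
};

// Scalar counterpart of _mm_set_ps(). Like the real intrinsic, the
// arguments are given highest lane first, so e0 ends up in lane 0.
inline Vec4f fallback_set_ps(float e3, float e2, float e1, float e0) {
    return Vec4f{{e0, e1, e2, e3}};
}

// Scalar counterpart of _mm_mul_ps(): multiply the two inputs lane by lane
// using ordinary float multiplies instead of a single SIMD instruction.
inline Vec4f fallback_mul_ps(const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}
```

Code written against wrappers like these could, in principle, be switched between the real intrinsics and the scalar versions with a compile-time flag, though whether the compiler re-vectorizes the loop is up to the optimizer.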
Anyway, I'm going to explore the matrix associativity issue and let you guys know later how it went.