Unicode-aware Irrlicht
So, I briefly discussed this with CuteAlien and he suggested that I move this discussion to a public place where other programmers can take a look at it.
Basically, I wanted to do some basic i18n (internationalization and localization) stuff with Irrlicht and came to realize two basic problems with Irrlicht:
1) wchar_t is not consistent between platforms, and Irrlicht uses it everywhere.
2) Irrlicht assumes that multibyte characters won't be used.
The issue is that adding any sort of serious i18n features to Irrlicht would require a lot of internal changes to get it working. Multiple times throughout the code, Irrlicht assumes that multibyte characters don't exist. It takes a wchar_t array and iterates through it, parsing one element at a time as if it were one character. On Linux and Mac OS X, where wchar_t is 32 bits, this is fine, but on Windows, where it is 16 bits, this will cause issues when multibyte characters (in UTF-16 terms, characters stored as a surrogate pair) come up.
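To make the problem concrete, here is a minimal sketch of the kind of per-element loop described above. It assumes C++11's char16_t/char32_t types as stand-ins for a 16-bit and a 32-bit wchar_t; the function name is hypothetical and this is not Irrlicht code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mimics a loop that treats every 16-bit element as one complete character
// and hands it on (for example to a font) as if it were a code point.
std::vector<char32_t> naiveGlyphRequests(const std::u16string& text)
{
    std::vector<char32_t> requests;
    for (std::size_t i = 0; i < text.size(); ++i)
        requests.push_back(text[i]); // a surrogate half is passed on unchanged
    return requests;
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), the UTF-16 string holds the two units 0xD834 0xDD1E, so this loop produces two requests, neither of which is a valid character. The same text stored as UTF-32 is a single element.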
Another issue is that the XML reader cannot convert between the UTF versions correctly, which will break serialization if any multibyte characters are used on a platform where wchar_t isn't 32 bits.
The obvious solution is to make Irrlicht use UTF-16 internally, be able to convert to UTF-8, UTF-32, and wchar_t when needed, and to be aware of multibyte characters when processing text. This is no small feat considering that all references to wchar_t would have to be removed.
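As an illustration of one of the conversions mentioned here, this is a rough sketch of encoding a single UTF-32 code point as UTF-8. The function name is made up for the example and std::string is used only to keep it self-contained; an Irrlicht version would presumably use its own containers.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for surrogate values and for
// anything above U+10FFFF, since neither is a valid character.
bool appendUtf8(char32_t cp, std::string& out)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return false;

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

With this, U+20AC comes out as the bytes E2 82 AC and U+1D11E as F0 9D 84 9E.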
One idea that I had was to either change the Irrlicht string class or build a new one that stores UTF-16 strings. It would be able to convert between the UTF versions and wchar_t internally so you could pass/return different string versions. It would contain an iterator that returns a UTF-32 character so one can iterate through the string character by character, taking into account multibyte characters. operator[] would also do the same thing, returning a single UTF-32 character. These would make it very easy to alter Irrlicht to support multibyte characters.
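A rough sketch of what such an iterator could look like, assuming C++11 and std::u16string as the storage. The class and function names are hypothetical and this is not the ustring interface; it only shows the idea of stepping over UTF-16 storage while handing out UTF-32 values.

```cpp
#include <cstddef>
#include <string>

// Hypothetical read-only iterator: walks UTF-16 storage, yields UTF-32 characters.
class Utf16Iterator
{
public:
    Utf16Iterator(const std::u16string& text, std::size_t unitOffset)
        : Text(&text), Pos(unitOffset) {}

    // Decodes the character at the current position.
    char32_t operator*() const
    {
        const char16_t lead = (*Text)[Pos];
        if (atPair())
        {
            const char16_t trail = (*Text)[Pos + 1];
            return 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10)
                           + (static_cast<char32_t>(trail) - 0xDC00);
        }
        return lead; // BMP character (an unpaired surrogate is passed through as-is)
    }

    // Advances by one character: two units for a surrogate pair, otherwise one.
    Utf16Iterator& operator++()
    {
        Pos += atPair() ? 2 : 1;
        return *this;
    }

    bool operator!=(const Utf16Iterator& other) const { return Pos != other.Pos; }

private:
    bool atPair() const
    {
        if (Pos + 1 >= Text->size())
            return false;
        const char16_t a = (*Text)[Pos];
        const char16_t b = (*Text)[Pos + 1];
        return a >= 0xD800 && a <= 0xDBFF && b >= 0xDC00 && b <= 0xDFFF;
    }

    const std::u16string* Text;
    std::size_t Pos;
};

// Usage: collect every character of a UTF-16 string as UTF-32.
std::u32string decodeAll(const std::u16string& text)
{
    std::u32string out;
    for (Utf16Iterator it(text, 0), end(text, text.size()); it != end; ++it)
        out += *it;
    return out;
}
```

Code that previously looped over wchar_t elements could then loop over the iterator instead and always see whole characters.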
Personally, I chose to use UTF-16 strings because ICU uses them too. It would make it easier for Irrlicht to work with ICU. Plus, ICU makes a good case for it:
http://userguide.icu-project.org/icufaq ... ifference-
Having the string's functions return UTF-32 characters is so Irrlicht can retain the ability to process a single character at a time as there are no multibyte UTF-32 characters. Plus, it would make it easier to integrate Irrlicht with FreeType, which takes UTF-32 characters in the FT_Get_Char_Index() function.
I was planning on attempting this sometime in the future and I figure it would be best to ask about it first so I don't potentially make something that can't be used.
-------------------
Latest version of ustring: Download - DEAD LINK
-------------------
UPDATE: The culmination of this topic can be found here:
http://irrlicht.sourceforge.net/phpBB2/ ... hp?t=37296
Last edited by Nalin on Thu Jun 16, 2011 2:18 am, edited 3 times in total.
Well, I started doing a little bit of work on creating a new string class to handle unicode strings. You can see the current progress here:
http://irrlicht.pastebin.com/f69552966
I have not tested the code at all, and some of it hasn't been worked on yet (i.e., some of the string manipulation functions aren't finished). It is hastily thrown together. I'm posting this to help generate some discussion on the best design for this. I figure some code to look at will help with that.
The commented uchar_t stuff was related to the commented toUTF*() functions, so they can be ignored, unless anybody wants to comment on them. Basically, I had originally wanted to use the string class to store the utf-8, utf-32, and wchar_t strings, but conflicts with some of the operator+= overloads prevented that from working. I messed around with the uchar_t stuff before deciding it was just not worth it, so I switched the toUTF*() functions to return a pointer to an array that must be manually deleted.
In the current design, it makes extensive use of an iterator that iterates through the character string converting the multi-byte utf-16 characters into a single utf-32 character. Most of the string manipulation functions use the iterator, except the few where it is beneficial to parse by hand (like the remove() function).
Anyways, as I said, it is very basic. It doesn't check for invalid unicode strings at all and the iterator could use some work. It is mainly just a container that helps Irrlicht interface with external libraries, like ICU and FreeType.
Great work so far. I've only had a few minutes to browse it (sorry, hard for me to find time currently), but I think it looks very good. Not converting to the string classes is certainly a little sad, but I guess we could do that probably with friend functions and also by adding corresponding conversions in the other string classes (but I have not tried that, maybe that also conflicts with operator=). But that's details, I guess we will find some solution for that.
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
CuteAlien wrote: Great work so far. I've only had a few minutes to browse it (sorry, hard for me to find time currently), but I think it looks very good. Not converting to the string classes is certainly a little sad, but I guess we could do that probably with friend functions and also by adding corresponding conversions in the other string classes (but I have not tried that, maybe that also conflicts with operator=). But that's details, I guess we will find some solution for that.
The problem came from this:
Code:
string<T>& operator += (T c)
string<T>& operator += (const int i)
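One plausible reading of that conflict, sketched with a simplified stand-in template. MiniString is hypothetical, and it assumes the int overload appends the number as text; the real irr::core::string has more overloads than shown here.

```cpp
#include <string>

// Simplified stand-in with the same two overloads as quoted above.
template <typename T>
struct MiniString
{
    std::basic_string<T> data;

    // Appends one character.
    MiniString& operator+=(T c)
    {
        data += c;
        return *this;
    }

    // Appends the decimal text of a number (assumed behaviour).
    MiniString& operator+=(const int i)
    {
        const std::string digits = std::to_string(i);
        for (char d : digits)
            data += static_cast<T>(d);
        return *this;
    }
};

// With T = char16_t:
//   s += u'A';   picks the character overload and appends "A"
//   s += 0x41;   picks the int overload and appends "65", not "A"
// A code point held in an unsigned int matches neither overload exactly,
// so `s += cp;` is ambiguous and does not compile. If T itself were int,
// the two overloads would have the same signature and MiniString<int>
// could not be instantiated at all.
```

So a character type that is just another integer typedef collides with the numeric overloads, either at compile time or, worse, silently.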
Good work, Nalin!
I can recommend implementation of Qt string class. Very valuable reference.
But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes, both with conversion methods. I have done some projects with all European chars and one with Japanese, and 16 bits were more than enough. But if some project requires utf32 we can write something like #define _USE_UTF32 1. Combining surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).
zet.dp.ua wrote: Good work, Nalin! I can recommend implementation of Qt string class. Very valuable reference.
Careful, qt is using another license. I look a lot at qt interfaces, but so far I avoided looking into implementations for that reason.
zet.dp.ua wrote: But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes. Both with conversion methods. I have done some projects with all european chars and one with japanese and 16 bits were more than enough. But if some project will require utf32 we can write smth like #define _USE_UTF32 1. Combining of surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).
Speed is certainly a valid concern. But maybe rather hard to solve at the same time as correctness. It's getting complex.... so maybe it's not a bad idea just to go on doing a correct solution including utf32 first. I suppose adding something like defines for optimization would be another step. Adding a define will probably mean working with a typedef in the engine and being really careful to never confuse length with size. But not sure yet if the interface will be similar enough in the end to allow that (can c_str() still be used similar?).
CuteAlien wrote: Careful, qt is using another license. I look a lot at qt interfaces, but so far I avoided looking into implementations for that reason.
No, no, i don't say "take qstring and include", just look at idea of data organization and usage. I think it is not forbidden.
Usually a string class in the engine should manage a limited number of methods: add, insert, remove, replace, find, compare, iteration. Conversions are required when we have to communicate with an external environment (WinAPI, Carbon, ...). It is not a problem to write toCStr(), toWChar(), toUtf8/16/32(), but how many changes does it require? XML, font...
zet.dp.ua wrote: But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes. Both with conversion methods. I have done some projects with all european chars and one with japanese and 16 bits were more than enough. But if some project will require utf32 we can write smth like #define _USE_UTF32 1. Combining of surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).
Yes, all the handling of surrogate pairs will decrease performance, but what I was going for was a unicode string class that would return full utf-32 characters for font drawing. Ignoring surrogate pairs leads to the original problem where Irrlicht was passing the individual surrogates to the font drawing routines, instead of converting them into a valid character.
CuteAlien wrote: Speed is certainly a valid concern. But maybe rather hard to solve at the same time as correctness. It's getting complex.... so maybe it's not a bad idea just to go on doing a correct solution including utf32 first. I suppose adding something like defines for optimization would be another step. Adding a define will probably mean working with a typedef in the engine and being really careful to never confuse length with size. But not sure yet if the interface will be similar enough in the end to allow that (can c_str() still be used similar?).
Well, there are a couple choices from here.
A) Develop a utf-16 string class that takes into account surrogate pairs in the string manipulation functions. Slowest, but safest for the programmer.
B) Develop a utf-16 string class that doesn't take into account surrogate pairs in the string manipulation functions. Faster, but it will be very easy to mangle the unicode string. In fact, that is practically guaranteed to happen.
C) Develop a utf-32 string class. The most simple solution, but it takes up the most memory as it uses 4 bytes per character.
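To show how option B can mangle a string, here is a small sketch assuming C++11 and std::u16string; both helper names are invented for the example. One function removes a single 16-bit unit without looking at surrogate pairs, the other checks whether a UTF-16 string is still well formed.

```cpp
#include <cstddef>
#include <string>

// True when every lead surrogate is directly followed by a trail surrogate
// and no trail surrogate appears on its own.
bool isWellFormedUtf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF)        // lead surrogate
        {
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;
            ++i;                               // skip the matching trail unit
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)   // trail surrogate with no lead
        {
            return false;
        }
    }
    return true;
}

// Option B style removal: drops one 16-bit unit, ignoring surrogate pairs.
std::u16string eraseUnit(std::u16string s, std::size_t unitIndex)
{
    s.erase(unitIndex, 1);
    return s;
}
```

For the string "a", U+1D11E, "b" (four units), erasing unit 1 or unit 2 leaves half of the pair behind and the check fails, while erasing unit 0 or unit 3 is harmless. For comparison with option C, the same three characters take 8 bytes in UTF-16 and 12 bytes in UTF-32.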
What I was thinking about was doing solution A like you already do. But so far the interface of your class looks similar to your old string class - so I was just thinking that it might be possible (not sure yet) to use a typedef in Irrlicht which can be switched with a define. Then you could switch between current solution for speed and your solution A for correctness.
So far I think the way you do it right now is the best way when people need unicode. But I certainly also see zet.dp.ua's point and as long as it's possible I see no problem in a define which allows switching unicode support off.
Nalin wrote: Well, there are a couple choices from here.
A) Develop a utf-16 string class that takes into account surrogate pairs in the string manipulation functions. Slowest, but safest for the programmer.
B) Develop a utf-16 string class that doesn't take into account surrogate pairs in the string manipulation functions. Faster, but it will be very easy to mangle the unicode string. In fact, it can be guaranteed on.
C) Develop a utf-32 string class. The most simple solution, but it takes up the most memory as it uses 4 bytes per character.
Right, i would even say there are 2 choices:
1. Develop a utf-16 string class that takes into account surrogate pairs and is used in cache-based text elements instead of using draw(stringw ...)-like methods - fast and safe.
2. Develop utf16-sliced/utf32 string class with common string-related methods and specialized conversion, so user can decide what is better suitable for the title - very fast but can require memory.
I can post a sample textnode class that draws a buffer generated when the text is set, if you want to optimize the current textnode for option A, plus a bitmap font tool with a glyph packer and a font generator from .psd files.
zet.dp.ua wrote: Right, i would even say there are 2 choices: 1. Develop a utf-16 string class that takes into account surrogate pairs and is used in cache-based text elements instead of using draw(stringw ...)-like methods - fast and safe.
Well, what would be used in the draw methods if not the utf-16 string? Would you require that the utf-16 string be converted into a utf-32 string first? That would take even longer than just iterating through the utf-16 string, as converting a utf-16 surrogate pair to a utf-32 character is just some bitshifts and a few subtractions. And you would have to do that anyways to convert it to utf-32.
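For reference, the arithmetic for a surrogate pair is small. A sketch assuming C++11 character types, with hypothetical function names:

```cpp
// Combines a lead (0xD800..0xDBFF) and trail (0xDC00..0xDFFF) surrogate
// into one UTF-32 code point. The caller must pass a valid pair.
char32_t combineSurrogates(char16_t lead, char16_t trail)
{
    return 0x10000
         + ((static_cast<char32_t>(lead) - 0xD800) << 10)
         + (static_cast<char32_t>(trail) - 0xDC00);
}

// The reverse direction, for a code point in the range U+10000..U+10FFFF.
void splitToSurrogates(char32_t cp, char16_t& lead, char16_t& trail)
{
    const char32_t v = cp - 0x10000;                      // 20-bit value
    lead  = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    trail = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
}
```

For example, the pair 0xD834 0xDD1E gives (0x34 << 10) + 0x11E + 0x10000 = 0x1D11E.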
zet.dp.ua wrote: 2. Develop utf16-sliced/utf32 string class with common string-related methods and specialized conversion, so user can decide what is better suitable for the title - very fast but can require memory.
That is always an option. Develop two different unicode string classes that can be toggled with some #defines. But a utf-32 string class is easy, as all you need to do is use the existing string class and just add some conversion functions, so I'm not going to focus on that at this point. But it wouldn't be hard to add that ability.
zet.dp.ua wrote: I can post a sample textnode class that draws buffer generated at text set time if you want to optimize current textnode in case of A way + bitmap font tool with glyphs packer and font generator from .psd files.
In the long run, I definitely think the text scene nodes need to be changed to cache the text.
I updated the class again:
http://irrlicht.pastebin.com/f2de564ba
I got rid of some of the magic numbers (the UTF-16 surrogate starting values) and converted more of the string manipulation functions (split is, I think, the only one not implemented currently.)
I also worked on the iterator a bit, improving the performance, and made some of the functionality consistent by making all of the functions, including the iterator constructors, specify position in terms of characters, instead of individual 16-bit code units. Only the *_raw() functions deal with individual code units.
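The difference between the two kinds of position can be sketched like this. It assumes C++11 and std::u16string, and the function name is hypothetical rather than part of the class:

```cpp
#include <cstddef>
#include <string>

// Maps a character index to the offset of its first 16-bit unit.
// Returns text.size() when charIndex is at or past the end.
std::size_t unitOffsetOfChar(const std::u16string& text, std::size_t charIndex)
{
    std::size_t unit = 0;
    while (unit < text.size() && charIndex > 0)
    {
        const char16_t u = text[unit];
        const bool pair = u >= 0xD800 && u <= 0xDBFF
            && unit + 1 < text.size()
            && text[unit + 1] >= 0xDC00 && text[unit + 1] <= 0xDFFF;
        unit += pair ? 2 : 1;   // a surrogate pair is one character, two units
        --charIndex;
    }
    return unit;
}
```

In the string "a", U+1D11E, "b", character 2 ("b") sits at unit offset 3, which is why a character-based function and a raw function can disagree about the same index.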
I think all I need to do is develop a new split() function and have validate() also determine the validity of the UTF-16 string. Do you have any other suggestions for the class?
What should the next step be after I test the class and fix bugs? Should I work on the XML parser and make sure it serializes the class correctly, as well as reads/converts unicode properly, or should I work on some new text scene nodes that take advantage of the unicode string class?
Having a few tests/examples would be the most important for now I guess. I'll also try to find some time on the weekend to check it out some more, but no promises.
A little side-note on the TAlloc - there's a recently found bug in irr::string for that which you also copied: http://sourceforge.net/tracker/?func=de ... tid=540676
In short - if you would use another allocator than the default one it would crash because it would no longer use the operator=. But well, we also haven't fixed that yet in the engine (and it's been in a _long_ time in the engine without anyone complaining so I guess allocators are not really used that much...).
Nalin wrote: Well, what would be used in the draw methods if not the utf-16 string? Would you require that the utf-16 string be converted into a utf-32 string first? That would take even longer than just iterating through the utf-16 string, as converting a utf-16 surrogate pair to a utf-32 character is just some bitshifts and a few subtractions. And you would have to do that anyways to convert it to utf-32.
Only vertex buffer in draw method of textnode class. setText(utf16) -> iterate each symbol -> extract glyph data -> fill vertex buffer -> draw buffer
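A very rough sketch of that pipeline, assuming C++11. Glyph, Vertex and CachedTextNode are hypothetical types invented for the example, the font is reduced to a map of metrics, and the actual draw call is left out because it depends on the renderer.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Glyph  { float width, height, advance; }; // hypothetical glyph metrics
struct Vertex { float x, y; };                   // a real vertex would also carry UVs

class CachedTextNode
{
public:
    explicit CachedTextNode(const std::map<char32_t, Glyph>& font) : Font(font) {}

    // Decodes the UTF-16 text once and rebuilds the vertex buffer.
    void setText(const std::u16string& text)
    {
        Buffer.clear();
        float penX = 0.0f;
        for (std::size_t i = 0; i < text.size(); ++i)
        {
            char32_t cp = text[i];
            // Combine a surrogate pair into a single character.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
                && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
                ++i;
            }

            const std::map<char32_t, Glyph>::const_iterator g = Font.find(cp);
            if (g == Font.end())
                continue; // no glyph for this character: skip it

            // One quad (four corners) per glyph.
            const Glyph& glyph = g->second;
            Buffer.push_back(Vertex{penX, 0.0f});
            Buffer.push_back(Vertex{penX + glyph.width, 0.0f});
            Buffer.push_back(Vertex{penX + glyph.width, glyph.height});
            Buffer.push_back(Vertex{penX, glyph.height});
            penX += glyph.advance;
        }
    }

    // A draw() method would submit this buffer as-is, with no string work per frame.
    const std::vector<Vertex>& buffer() const { return Buffer; }

private:
    std::map<char32_t, Glyph> Font;
    std::vector<Vertex> Buffer;
};
```

The surrogate handling then runs only when the text changes, not on every frame, which is the point of caching.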
Nalin wrote: I updated the class again: http://irrlicht.pastebin.com/f2de564ba
Some c++ comments:
Don't use virtual functions in such "low-level" classes, try to use inner iterator and access classes instead. And extract all possible common code blocks into separate functions (for example, the code that converts a surrogate pair into a single UTF-32 character).