Unicode-aware Irrlicht

Nalin · Post by **Nalin** » Wed Dec 09, 2009 10:00 pm

So, I briefly discussed this with CuteAlien and he suggested that I move this discussion to a public place where other programmers can take a look at it.

Basically, I wanted to do some basic i18n/l10n (internationalization and localization) stuff with Irrlicht and ran into two basic problems:

1) wchar_t is not consistent between platforms, and Irrlicht uses it everywhere.
2) Irrlicht assumes that multibyte characters won't be used.

The issue is that adding any sort of serious i18n features to Irrlicht would require a lot of internal changes. In many places throughout the code, Irrlicht assumes that multibyte characters don't exist. It takes a wchar_t array and iterates through it, parsing one element at a time. On Linux and Mac OS X, where wchar_t is 32 bits, this is fine, but on Windows, where it is 16 bits, this causes problems as soon as multibyte characters (UTF-16 surrogate pairs) come up.
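
To make the platform difference concrete, here is a minimal sketch (not Irrlicht code) that counts how many storage units one character needs for a given unit width. The helper names `unitsNeeded` and `wcharUnitsNeeded` are made up for illustration, and only widths of 2 and 4 bytes are considered.

```cpp
#include <cstddef>
#include <cstdint>

// Number of storage units one Unicode character occupies when each unit
// is `unitBytes` wide (2 = UTF-16 style, 4 = UTF-32 style).
// Hypothetical helper, for illustration only.
inline std::size_t unitsNeeded(std::uint32_t codePoint, std::size_t unitBytes)
{
    if (unitBytes >= 4)
        return 1;                          // every character fits in 32 bits
    return codePoint > 0xFFFF ? 2 : 1;     // above U+FFFF: surrogate pair
}

// The same question for whatever wchar_t happens to be on this platform.
inline std::size_t wcharUnitsNeeded(std::uint32_t codePoint)
{
    return unitsNeeded(codePoint, sizeof(wchar_t));
}
```

Code that walks a wchar_t array one element at a time is only correct where this count is always 1.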

Another issue is that the XML reader cannot convert between the UTF versions correctly, which will break serialization if any multibyte characters are used on a platform where wchar_t isn't 32 bits.

The obvious solution is to make Irrlicht use UTF-16 internally, be able to convert to UTF-8, UTF-32, and wchar_t when needed, and to be aware of multibyte characters when processing text. This is no small feat considering that all references to wchar_t would have to be removed.

One idea that I had was to either change the Irrlicht string class or build a new one that stores UTF-16 strings. It would be able to convert between the UTF versions and wchar_t internally so you could pass/return different string versions. It would contain an iterator that returns a UTF-32 character so one can iterate through the string character by character, taking into account multibyte characters. operator[] would also do the same thing, returning a single UTF-32 character. These would make it very easy to alter Irrlicht to support multibyte characters.
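
A rough sketch of what such an iterator could look like, assuming UTF-16 data held in a std::vector. `Utf16Cursor` and `toUtf32` are hypothetical names, not the ustring interface, and unpaired surrogates are simply passed through unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical read-only cursor over UTF-16 storage.
class Utf16Cursor
{
public:
    explicit Utf16Cursor(const std::vector<std::uint16_t>& units)
        : Units(units), Pos(0) {}

    bool atEnd() const { return Pos >= Units.size(); }

    // Returns the character at the cursor as one UTF-32 value.
    std::uint32_t current() const
    {
        const std::uint32_t first = Units[Pos];
        if (first >= 0xD800 && first <= 0xDBFF && Pos + 1 < Units.size())
        {
            const std::uint32_t second = Units[Pos + 1];
            if (second >= 0xDC00 && second <= 0xDFFF)
                return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
        }
        return first; // 16-bit character, or an unpaired surrogate passed through
    }

    // Steps over one whole character: two units for a surrogate pair.
    void advance() { Pos += (current() > 0xFFFF) ? 2 : 1; }

private:
    const std::vector<std::uint16_t>& Units;
    std::size_t Pos;
};

// Collects every character of a UTF-16 sequence as UTF-32.
inline std::vector<std::uint32_t> toUtf32(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint32_t> out;
    for (Utf16Cursor it(units); !it.atEnd(); it.advance())
        out.push_back(it.current());
    return out;
}
```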

Personally, I chose to use UTF-16 strings because ICU uses them too. It would make it easier for Irrlicht to work with ICU. Plus, ICU makes a good case for it:
http://userguide.icu-project.org/icufaq ... ifference-

Having the string's functions return UTF-32 characters is so Irrlicht can retain the ability to process a single character at a time as there are no multibyte UTF-32 characters. Plus, it would make it easier to integrate Irrlicht with FreeType, which takes UTF-32 characters in the FT_Get_Char_Index() function.

I was planning on attempting this sometime in the future and I figure it would be best to ask about it first so I don't potentially make something that can't be used.

-------------------

Latest version of ustring: Download - DEAD LINK

-------------------

UPDATE: The culmination of this topic can be found here:
http://irrlicht.sourceforge.net/phpBB2/ ... hp?t=37296

Nalin · Post by **Nalin** » Wed Dec 16, 2009 4:27 am

Well, I started doing a little bit of work on creating a new string class to handle unicode strings. You can see the current progress here:
http://irrlicht.pastebin.com/f69552966

I have not tested the code at all, and some of it hasn't been worked on yet (ie, some of the string manipulation functions aren't finished yet.) It is hastily thrown together. I'm posting this to help generate some discussion on the best design implementation for this. I figure some code to look at will help with that.

The commented uchar_t stuff was related to the commented toUTF*() functions, so they can be ignored, unless anybody wants to comment on them. Basically, I had originally wanted to use the string class to store the utf-8, utf-32, and wchar_t strings, but conflicts with some of the operator+= overloads prevented that from working. I messed around with the uchar_t stuff before deciding it was just not worth it, so I switched the toUTF*() functions to return a pointer to an array that must be manually deleted.

In the current design, it makes extensive use of an iterator that iterates through the character string converting the multi-byte utf-16 characters into a single utf-32 character. Most of the string manipulation functions use the iterator, except the few where it is beneficial to parse by hand (like the remove() function).

Anyways, as I said, it is very basic. It doesn't check for invalid unicode strings at all and the iterator could use some work. It is mainly just a container that helps Irrlicht interface with external libraries, like ICU and FreeType.

CuteAlien · Post by **CuteAlien** » Wed Dec 16, 2009 7:13 am

Great work so far. I've only had a few minutes to browse it (sorry, hard for me to find time currently), but I think it looks very good. Not converting to the string classes is certainly a little sad, but I guess we could do that probably with friend functions and also by adding corresponding conversions in the other string classes (but I have not tried that, maybe that also conflicts with operator=). But that's details, I guess we will find some solution for that.

Nalin · Post by **Nalin** » Wed Dec 16, 2009 8:23 am

CuteAlien wrote:Great work so far. I've only had a few minutes to browse it (sorry, hard for me to find time currently), but I think it looks very good. Not converting to the string classes is certainly a little sad, but I guess we could do that probably with friend functions and also by adding corresponding conversions in the other string classes (but I have not tried that, maybe that also conflicts with operator=). But that's details, I guess we will find some solution for that.

The problem came from this:

string<T>& operator += (T c);
string<T>& operator += (const int i);   // same signature as the line above when T is int

When you create a core::string<s32> for a utf-32 string, you end up with duplicate function declarations, as s32 is a typedef for int. It would work properly if I made use of the C++0x unicode data types char16_t and char32_t, but the only compiler that supports them is gcc 4.5, and that isn't acceptable for a cross-platform library like Irrlicht. The uchar_t hack I experimented with worked, but it didn't work smoothly, so I scrapped it to just return a character array.
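
For illustration, the clash can be reproduced with a stripped-down stand-in. This assumes a C++11 compiler, where char32_t is a distinct built-in type. `MiniString` is a made-up class and not core::string, and its integer overload is simplified to non-negative values.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

typedef int s32; // mirrors the typedef described above

// Hypothetical stand-in for a string template with both overloads.
template <typename T>
class MiniString
{
public:
    MiniString& operator+=(T c)             // append one character
    {
        Data.push_back(c);
        return *this;
    }

    MiniString& operator+=(const int i)     // append a number as decimal digits
    {
        // Simplified: non-negative values only.
        if (i >= 10)
            *this += i / 10;
        Data.push_back(static_cast<T>('0' + i % 10));
        return *this;
    }

    std::size_t size() const { return Data.size(); }

private:
    std::vector<T> Data;
};

// MiniString<s32> would not compile: with T = int both operators end up
// with the same signature. char32_t is its own type, so this one is fine.
static_assert(!std::is_same<char32_t, int>::value, "char32_t is distinct from int");
typedef MiniString<char32_t> MiniString32;
```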

zet.dp.ua · Post by **zet.dp.ua** » Wed Dec 16, 2009 9:59 am

Good work, Nalin!
I can recommend implementation of Qt string class. Very valuable reference.
But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes. Both with conversion methods. I have done some projects with all european chars and one with japanese and 16 bits were more than enough. But if some project will require utf32 we can write smth like #define _USE_UTF32 1. Combining of surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).

CuteAlien · Post by **CuteAlien** » Wed Dec 16, 2009 10:23 am

zet.dp.ua wrote:Good work, Nalin!
I can recommend implementation of Qt string class. Very valuable reference.

Careful, qt is using another license. I look a lot at qt interfaces, but so far I avoided looking into implementations for that reason.

zet.dp.ua wrote: But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes. Both with conversion methods. I have done some projects with all european chars and one with japanese and 16 bits were more than enough. But if some project will require utf32 we can write smth like #define _USE_UTF32 1. Combining of surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).

Speed is certainly a valid concern. But maybe rather hard to solve at the same time as correctness. It's getting complex.... so maybe it's not a bad idea just to go on doing a correct solution including utf32 first. I suppose adding something like defines for optimization would be another step. Adding a define will probably mean working with a typedef in the engine and being really careful to never confuse length with size. But not sure yet if the interface will be similar enough in the end to allow that (can c_str() still be used similar?).

zet.dp.ua · Post by **zet.dp.ua** » Wed Dec 16, 2009 1:42 pm

CuteAlien wrote: Careful, qt is using another license. I look a lot at qt interfaces, but so far I avoided looking into implementations for that reason.

No, no, i don't say "take qstring and include", just look at idea of data organization and usage. I think it is not forbidden.

Usually string class in the engine should manage a limited number of methods: add, insert, remove, replace, find, compare, iteration. Conversions are required when we have to communicate with an external environment (WinAPI, Carbon, ...). It is not a problem to write toCStr(), toWChar(), toUtf8/16/32(), but how many changes it requires? XML, font...

Nalin · Post by **Nalin** » Wed Dec 16, 2009 3:21 pm

zet.dp.ua wrote:But maybe it would be better for simplicity and performance to have simple u16 (without taking into account surrogate pairs) and u32 string classes. Both with conversion methods. I have done some projects with all european chars and one with japanese and 16 bits were more than enough. But if some project will require utf32 we can write smth like #define _USE_UTF32 1. Combining of surrogate pairs can decrease performance at runtime (in the current non-cached static text implementation).

Yes, all the handling of surrogate pairs will decrease performance, but what I was going for was a unicode string class that would return full utf-32 characters for font drawing. Ignoring surrogate pairs leads to the original problem where Irrlicht was passing the individual surrogates to the font drawing routines, instead of converting them into a valid character.

CuteAlien wrote:Speed is certainly a valid concern. But maybe rather hard to solve at the same time as correctness. It's getting complex.... so maybe it's not a bad idea just to go on doing a correct solution including utf32 first. I suppose adding something like defines for optimization would be another step. Adding a define will probably mean working with a typedef in the engine and being really careful to never confuse length with size. But not sure yet if the interface will be similar enough in the end to allow that (can c_str() still be used similar?).

Well, there are a couple choices from here.
A) Develop a utf-16 string class that takes into account surrogate pairs in the string manipulation functions. Slowest, but safest for the programmer.
B) Develop a utf-16 string class that doesn't take into account surrogate pairs in the string manipulation functions. Faster, but it will be very easy to mangle the unicode string. In fact, that is practically guaranteed to happen.
C) Develop a utf-32 string class. The most simple solution, but it takes up the most memory as it uses 4 bytes per character.
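
The difference between A and B shows up as soon as a string is cut. A small sketch, with made-up function names and std::vector standing in for the string storage:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint16_t> Utf16;

inline bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Option B style: cut after `count` 16-bit units, surrogate pairs ignored.
inline Utf16 truncateUnits(const Utf16& s, std::size_t count)
{
    if (count > s.size())
        count = s.size();
    return Utf16(s.begin(), s.begin() + static_cast<Utf16::difference_type>(count));
}

// Option A style: same cut, but never leave half of a surrogate pair behind.
inline Utf16 truncateSafe(const Utf16& s, std::size_t count)
{
    Utf16 out = truncateUnits(s, count);
    if (!out.empty() && out.size() < s.size() && isHighSurrogate(out.back()))
        out.pop_back(); // drop the dangling high surrogate
    return out;
}
```

With the input 0x0041 0xD834 0xDD1E and a cut at 2, the first function returns a string ending in a lone 0xD834, while the second returns just 0x0041.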

CuteAlien · Post by **CuteAlien** » Wed Dec 16, 2009 3:43 pm

What I was thinking about was doing solution A like you already do. But so far the interface of your class looks similar to your old string class - so I was just thinking that it might be possible (not sure yet) to use a typedef in Irrlicht which can be switched with a define. Then you could switch between current solution for speed and your solution A for correctness.
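
As a sketch of that idea only: the macro `MYENGINE_UNICODE_STRINGS` and the typedef `EngineString` are invented names, and the standard string types merely stand in for the current string class and a pair-aware UTF-16 class.

```cpp
#include <string>

// Hypothetical switch; a real build would set it in a configuration header.
// #define MYENGINE_UNICODE_STRINGS

#ifdef MYENGINE_UNICODE_STRINGS
typedef std::u16string EngineString;  // a pair-aware UTF-16 class would go here
#else
typedef std::wstring EngineString;    // stands in for the current string class
#endif

// Engine code names only the typedef, so it compiles either way.
inline EngineString::size_type unitCount(const EngineString& s)
{
    return s.size(); // a count of storage units, not necessarily of characters
}
```

This only works as long as both classes expose the same member functions with compatible meanings, which is exactly the open question about the interface.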

So far I think the way you do it right now is the best way when people need unicode. But I certainly also see zet.dp.ua's point and as long as it's possible I see no problem in a define which allows switching unicode support off.

zet.dp.ua · Post by **zet.dp.ua** » Wed Dec 16, 2009 4:21 pm

Nalin wrote: Well, there are a couple choices from here.
A) Develop a utf-16 string class that takes into account surrogate pairs in the string manipulation functions. Slowest, but safest for the programmer.
B) Develop a utf-16 string class that doesn't take into account surrogate pairs in the string manipulation functions. Faster, but it will be very easy to mangle the unicode string. In fact, that is practically guaranteed to happen.
C) Develop a utf-32 string class. The most simple solution, but it takes up the most memory as it uses 4 bytes per character.

Right, i would even say there are 2 choices:
1. Develop a utf-16 string class that takes into account surrogate pairs and is used in cache-based text elements instead of using draw(stringw ...)-like methods - fast and safe.
2. Develop utf16-sliced/utf32 string class with common string-related methods and specialized conversion, so user can decide what is better suitable for the title - very fast but can require memory.

I can post a sample textnode class that draws buffer generated at text set time if you want to optimize current textnode in case of A way + bitmap font tool with glyphs packer and font generator from .psd files.

Nalin · Post by **Nalin** » Thu Dec 17, 2009 10:46 pm

zet.dp.ua wrote:Right, i would even say there are 2 choices:
1. Develop a utf-16 string class that takes into account surrogate pairs and is used in cache-based text elements instead of using draw(stringw ...)-like methods - fast and safe.

Well, what would be used in the draw methods if not the utf-16 string? Would you require that the utf-16 string be converted into a utf-32 string first? That would take even longer than just iterating through the utf-16 string, as converting a utf-16 surrogate pair to a utf-32 character is just some bitshifts and a few subtractions. And you would have to do that anyways to convert it to utf-32.
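
For reference, the arithmetic in question looks roughly like this. The function names are made up for the example. The constants are the standard UTF-16 surrogate range starts.

```cpp
#include <cassert>
#include <cstdint>

inline bool isHighSurrogate(std::uint16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool isLowSurrogate(std::uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

// Combines a high/low surrogate pair into one UTF-32 character.
inline std::uint32_t decodeSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    const std::uint32_t top    = static_cast<std::uint32_t>(high - 0xD800) << 10; // upper 10 bits
    const std::uint32_t bottom = static_cast<std::uint32_t>(low - 0xDC00);        // lower 10 bits
    return 0x10000 + top + bottom;
}
```

For example, the pair 0xD834 0xDD1E decodes to U+1D11E.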

zet.dp.ua wrote:2. Develop utf16-sliced/utf32 string class with common string-related methods and specialized conversion, so user can decide what is better suitable for the title - very fast but can require memory.

That is always an option. Develop two different unicode string classes that can be toggled with some #defines. But a utf-32 string class is easy, as all you need to do is use the existing string class and just add some conversion functions, so I'm not going to focus on that at this point. But it wouldn't be hard to add that ability.

zet.dp.ua wrote:I can post a sample textnode class that draws buffer generated at text set time if you want to optimize current textnode in case of A way + bitmap font tool with glyphs packer and font generator from .psd files.

In the long run, I definitely think the text scene nodes need to be changed to cache the text.

Nalin · Post by **Nalin** » Fri Dec 18, 2009 12:52 am

I updated the class again:
http://irrlicht.pastebin.com/f2de564ba

I got rid of some of the magic numbers (the UTF-16 surrogate starting values) and converted more of the string manipulation functions (split is, I think, the only one not implemented currently.)

I also worked on the iterator a bit, improving the performance, and made some of the functionality consistent by making all of the functions, including the iterator constructors, specify position in terms of characters, instead of individual 16-bit code units. Only the *_raw() functions deal with individual code units.
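
To show the difference between the two kinds of position, here is a small sketch that maps a character position to the index of its first 16-bit element. `rawIndexOf` is a hypothetical helper and not one of the *_raw() functions from the class.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps a character position to the index of its first 16-bit element.
// Returns units.size() if the string has fewer characters than charPos.
inline std::size_t rawIndexOf(const std::vector<std::uint16_t>& units, std::size_t charPos)
{
    std::size_t raw = 0;
    while (raw < units.size() && charPos > 0)
    {
        const bool high = (units[raw] & 0xFC00) == 0xD800;
        const bool pair = high && raw + 1 < units.size()
                          && (units[raw + 1] & 0xFC00) == 0xDC00;
        raw += pair ? 2 : 1;   // a surrogate pair counts as one character
        --charPos;
    }
    return raw;
}
```

In the sequence 0x0041 0xD834 0xDD1E 0x0042, the third character (position 2) sits at raw index 3.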

I think all I need to do is develop a new split() function and have validate() also determine the validity of the UTF-16 string. Do you have any other suggestions for the class?
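
One possible shape for such a validity check, as a sketch only: `isValidUtf16` is a made-up free function, and it tests nothing beyond surrogate pairing.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if every surrogate in the sequence is part of a proper
// high-then-low pair.
inline bool isValidUtf16(const std::vector<std::uint16_t>& units)
{
    for (std::size_t i = 0; i < units.size(); ++i)
    {
        const std::uint16_t u = units[i];
        if ((u & 0xFC00) == 0xDC00)
            return false;               // low surrogate with no high before it
        if ((u & 0xFC00) == 0xD800)
        {
            if (i + 1 >= units.size() || (units[i + 1] & 0xFC00) != 0xDC00)
                return false;           // high surrogate not followed by a low
            ++i;                        // skip the low half of the pair
        }
    }
    return true;
}
```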

What should the next step be after I test the class and fix bugs? Should I work on the XML parser and make sure it serializes the class correctly, as well as reads/converts unicode properly, or should I work on some new text scene nodes that take advantage of the unicode string class?

CuteAlien · Post by **CuteAlien** » Fri Dec 18, 2009 8:32 am

Having a few tests/examples would be the most important for now I guess. I'll also try to find some time on the weekend to check it out some more, but no promises.

A little side-note on the TAlloc - there's a recently found bug in irr::string for that which you also copied: http://sourceforge.net/tracker/?func=de ... tid=540676
In short - if you would use another allocator than the default one it would crash because it would no longer use the operator=. But well, we also haven't fixed that yet in the engine (and it's been in a _long_ time in the engine without anyone complaining so I guess allocators are not really used that much...).

zet.dp.ua · Post by **zet.dp.ua** » Fri Dec 18, 2009 8:38 am

Nalin wrote:Well, what would be used in the draw methods if not the utf-16 string? Would you require that the utf-16 string be converted into a utf-32 string first? That would take even longer than just iterating through the utf-16 string, as converting a utf-16 surrogate pair to a utf-32 character is just some bitshifts and a few subtractions. And you would have to do that anyways to convert it to utf-32.

Only vertex buffer in draw method of textnode class. setText(utf16) -> iterate each symbol -> extract glyph data -> fill vertex buffer -> draw buffer
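
A very reduced sketch of that pipeline, under heavy assumptions: `Glyph`, `Vertex` and `CachedText` are invented types, the font is just a map from character to metrics, and each quad has a fixed height of 1. It is not the textnode class mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct Glyph  { float width; float advance; };  // hypothetical per-glyph data
struct Vertex { float x, y; };                  // position only, for brevity

class CachedText
{
public:
    explicit CachedText(const std::map<std::uint32_t, Glyph>& font) : Font(font) {}

    // All per-character work happens here, once per text change.
    void setText(const std::vector<std::uint16_t>& utf16)
    {
        Buffer.clear();
        float penX = 0.0f;
        for (std::size_t i = 0; i < utf16.size(); ++i)
        {
            std::uint32_t cp = utf16[i];
            if ((cp & 0xFC00) == 0xD800 && i + 1 < utf16.size()
                && (utf16[i + 1] & 0xFC00) == 0xDC00)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
                ++i; // consumed the low surrogate as well
            }
            const std::map<std::uint32_t, Glyph>::const_iterator g = Font.find(cp);
            if (g == Font.end())
                continue; // no glyph for this character: skip it
            // One quad per glyph, stored as four corner vertices.
            const float x0 = penX;
            const float x1 = penX + g->second.width;
            const Vertex quad[4] = { {x0, 0.0f}, {x1, 0.0f}, {x1, 1.0f}, {x0, 1.0f} };
            Buffer.insert(Buffer.end(), quad, quad + 4);
            penX += g->second.advance;
        }
    }

    // A draw() call would hand this buffer to the renderer as-is.
    const std::vector<Vertex>& vertices() const { return Buffer; }

private:
    std::map<std::uint32_t, Glyph> Font;
    std::vector<Vertex> Buffer;
};
```

The surrogate handling runs only when the text changes, so its cost no longer matters per frame.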

Nalin wrote:I updated the class again:
http://irrlicht.pastebin.com/f2de564ba

Some c++ comments:
Don't use virtual functions in such "low-level" classes, try to use inner iterator and access classes instead. And extract all possible common code blocks into separate functions (Convert the surrogate pair into a single UTF-32 character...)

zet.dp.ua · Post by **zet.dp.ua** » Fri Dec 18, 2009 8:44 am

One suggestion: uchar16_t* toUTF16() -> core::array<uchar16_t> toUTF16()
to simplify delete []
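
A sketch of the container-returning variant, with std::vector standing in for core::array and plain integer types standing in for uchar16_t. The input is assumed to be valid UTF-32. The free function below is illustrative and not the member function from the class.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The caller receives an owning container, so there is no delete [] to forget.
inline std::vector<std::uint16_t> toUTF16(const std::vector<std::uint32_t>& codePoints)
{
    std::vector<std::uint16_t> out;
    out.reserve(codePoints.size());
    for (std::size_t i = 0; i < codePoints.size(); ++i)
    {
        const std::uint32_t cp = codePoints[i];
        if (cp < 0x10000)
        {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
        else
        {
            const std::uint32_t v = cp - 0x10000; // 20 bits to split in half
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```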