i18n with irrlicht for cyrillic

CuteAlien · Post by **CuteAlien** » Thu Apr 19, 2007 5:16 pm

I've worked on a cyrillic version for my game the last few days. That was surprisingly difficult despite the
fact that we are not using much text, so i thought it might be of some general interest how to do that stuff with Irrlicht.
All sources posted by me in this thread are public domain and maybe parts of it will find their way back in Irrlicht.
Oh - and i18n stands for "Internationalization".

What i needed:
- Font output using the truetype font class from here: http://irrlicht.sourceforge.net/phpBB2/ ... highlight=
- Support for cyrillic fonts in the editbox (so i needed the correct event.KeyInput.Char)
- Support for keyboard keys for playing (so i needed a usable event.KeyInput.Key) with the correct names
- Has to work on linux and windows 98 to windows vista

Some preparations i had already done in advance, which are always a good idea in any application:
- All texts which are used for display were put in a stringtable (there's a simple stringtableclass here, but it does not work yet with i18n for reasons described below: http://www.michaelzeilfelder.de/irrlicht.htm)
- All those texts are internally using widechar strings (wchar_t, irr::stringw, std::wstring)

Also some general information you will need to understand when programming with Unicode and widechars:
The C++ type wchar_t has 2 bytes on Windows and 4 bytes on Linux. It is therefore typically used for the Unicode formats UTF-16/UCS-2 on Windows and UTF-32/UCS-4 on Linux.
UTF-32 and UCS-4 are identical, but there's a difference between UTF-16 and UCS-2: UTF-16 can use two 16-bit numbers
in a row (a surrogate pair) to encode characters which won't fit in 16 bits, while UCS-2 can't.
This does not matter much for cyrillic, which can be represented in UCS-2 and has identical codes in UTF-16 and UTF-32/UCS-4. You might have to care about it when using some more eastern languages or any other characters outside the first 65536 code points. (Fantasy scripts are a special case: Klingon, for example, was proposed for Unicode but rejected, so it only exists as an unofficial private-use mapping.)
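The difference between one and two 16-bit units can be sketched in a few lines. This is only an illustration of the UTF-16 rule, not code from the game or from Irrlicht, and the function name encodeUtf16 is made up for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take a single unit (this covers all of cyrillic);
// everything above is split into a high and a low surrogate.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF);
    assert(codePoint < 0xD800 || codePoint > 0xDFFF); // surrogates are not characters

    std::vector<std::uint16_t> units;
    if (codePoint < 0x10000)
    {
        units.push_back(static_cast<std::uint16_t>(codePoint));
    }
    else
    {
        const std::uint32_t v = codePoint - 0x10000; // 20 bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

For a cyrillic letter such as U+0416 the result is a single unit with the same value, which is why UCS-2, UTF-16 and UTF-32 agree for this language.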

Ok, so i'm happy with wchar_t's using UCS-2 for text output. But now i stumbled upon my first problem - my stringtable class is using tinyXML and that is using yet another Unicode format called UTF-8. For plain ASCII text (e.g. English) UTF-8 is byte-for-byte identical to ASCII, as it only uses a single byte for those characters. But UTF-8 can represent any Unicode char, and it does that by using several bytes in a row for a single character. Cyrillic chars, for example, won't fit in one byte, so UTF-8 uses two bytes for each of them (the same is true for German umlauts).
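How many bytes a character needs in UTF-8 follows directly from its code point. A minimal sketch of the encoding rule (encodeUtf8 is a hypothetical helper for illustration, not part of tinyXML or Irrlicht):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80)                       // 1 byte: plain ASCII
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)                 // 2 bytes: cyrillic, umlauts, ...
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)               // 3 bytes: rest of the first 65536 code points
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else                                 // 4 bytes: everything above U+FFFF
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, the cyrillic letter U+0416 becomes the two bytes 0xD0 0x96, while 'A' stays a single byte.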

For this conversion from UTF-8 to UTF-16 i used the function FromUtf8 from this site: http://www.codeproject.com/useritems/UtfConverter.asp
Edit (2007-04-20): I'm not 100% sure if FromUtf8 and ToUtf8 will work in all cases, and i have now found one problem with those functions. The resulting strings seemed ok, but the == operator of the string class failed when i assigned the resulting strings to other strings and compared those (presumably the result string carries extra characters after the terminating null, which count towards its length). To fix that i changed the functions so they no longer return the result string directly but do it like this:

Code: Select all

return std::wstring( resultstring.c_str() );

This function does build upon a file from http://www.unicode.org/Public/PROGRAMS/CVTUTF/
And wow - that was already all i needed to print some nice cyrillic characters. Btw, while those transformations caused some work for me, it is actually a way often recommended when doing i18n stuff to use UTF-8 for files and UCS-2 or UCS-4 within your application. This way your application will be fast, but you can still use all tools which are working with ASCII-text files.
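For readers who can't use the linked converter, the direction needed here (UTF-8 file content to wide strings) can be sketched as follows. This is not the FromUtf8 function from the linked site but a simplified stand-in: utf8ToWide and appendCodePoint are made-up names, malformed input is replaced by U+FFFD, and overlong or surrogate encodings are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const wchar_t REPLACEMENT_CHAR = static_cast<wchar_t>(0xFFFD);

// Append one code point; uses a surrogate pair where wchar_t has only 16 bits (Windows).
static void appendCodePoint(std::wstring& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (sizeof(wchar_t) == 2 && cp >= 0x10000)
    {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    }
    else
    {
        out.push_back(static_cast<wchar_t>(cp));
    }
}

std::wstring utf8ToWide(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = lead;
        std::size_t extra = 0;   // number of continuation bytes expected
        bool ok = true;
        if (lead < 0x80)                { /* ASCII, nothing to do */ }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { ok = false; } // stray continuation byte

        if (ok && i + extra >= in.size())
            ok = false;                                 // sequence cut off at the end
        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (ok && cp > 0x10FFFF)
            ok = false;

        if (ok) { appendCodePoint(out, cp); i += extra + 1; }
        else    { out.push_back(REPLACEMENT_CHAR); ++i; }
    }
    return out;
}
```

A stringtable loader would call something like this once per text after reading it from the XML file, so the rest of the application only ever sees wide strings.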

Now to the keyboard input. Let's do Linux first.
The keyevents needed to be fixed for two cases
a) KeyInput.Key should have an EKEY_CODE for russian keyboards
b) KeyInput.Char should have the correct Unicode id for cyrillic chars

For a) i found only a rather ugly solution for Linux (maybe there's a better one, but i gave up on that).
It will work for most keys by returning the corresponding english key, which is usually also printed on russian keyboards.
I just extended the keymap in createKeyMap in CIrrDeviceLinux.cpp like this:

Code: Select all

#ifdef XK_CYRILLIC
    KeyMap.push_back(SKeyMap(XK_Cyrillic_shorti, KEY_KEY_Q));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_SHORTI, KEY_KEY_Q));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_tse, KEY_KEY_W));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_TSE, KEY_KEY_W));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_u, KEY_KEY_E));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_U, KEY_KEY_E));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ka, KEY_KEY_R));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_KA, KEY_KEY_R));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ie, KEY_KEY_T));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_IE, KEY_KEY_T));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_en, KEY_KEY_Y));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_EN, KEY_KEY_Y));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ghe, KEY_KEY_U));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_GHE, KEY_KEY_U));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_sha, KEY_KEY_I));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_SHA, KEY_KEY_I));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_shcha, KEY_KEY_O));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_SHCHA, KEY_KEY_O));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ze, KEY_KEY_P));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ZE, KEY_KEY_P));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ha, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_HA, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_hardsign, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_HARDSIGN, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ef, KEY_KEY_A));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_EF, KEY_KEY_A));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_yeru, KEY_KEY_S));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_YERU, KEY_KEY_S));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ve, KEY_KEY_D));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_VE, KEY_KEY_D));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_a, KEY_KEY_F));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_A, KEY_KEY_F));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_pe, KEY_KEY_G));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_PE, KEY_KEY_G));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_er, KEY_KEY_H));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ER, KEY_KEY_H));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_o, KEY_KEY_J));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_O, KEY_KEY_J));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_el, KEY_KEY_K));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_EL, KEY_KEY_K));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_de, KEY_KEY_L));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_DE, KEY_KEY_L));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_zhe, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ZHE, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_e, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_E, 0));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ya, KEY_KEY_Z));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_YA, KEY_KEY_Z));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_che, KEY_KEY_X));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_CHE, KEY_KEY_X));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_es, KEY_KEY_C));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_ES, KEY_KEY_C));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_em, KEY_KEY_V));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_EM, KEY_KEY_V));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_i, KEY_KEY_B));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_I, KEY_KEY_B));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_te,  KEY_KEY_N));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_TE,  KEY_KEY_N));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_softsign, KEY_KEY_M));
    KeyMap.push_back(SKeyMap(XK_Cyrillic_SOFTSIGN, KEY_KEY_M));
#endif // #ifdef XK_CYRILLIC
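How such a table is then used can be sketched independently of X11. The following is only an illustration of the idea (sort once, then binary search per key event), which is, as far as can be told from the engine source, roughly what the Linux device does with its KeyMap. The names KeyMapEntry and KeyMapSketch are hypothetical, and the numeric values used with it are placeholders rather than values taken from the X11 or Irrlicht headers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct KeyMapEntry
{
    unsigned long x11Key; // X11 keysym
    int engineKey;        // engine key code, 0 meaning "no mapping"

    bool operator<(const KeyMapEntry& other) const { return x11Key < other.x11Key; }
};

class KeyMapSketch
{
public:
    void add(unsigned long x11Key, int engineKey)
    {
        entries.push_back({x11Key, engineKey});
        sorted = false;
    }

    // Returns the engine key for a keysym, or 0 if the keysym is unknown.
    int lookup(unsigned long x11Key)
    {
        if (!sorted)
        {
            std::sort(entries.begin(), entries.end());
            sorted = true;
        }
        assert(std::is_sorted(entries.begin(), entries.end()));
        const auto it = std::lower_bound(entries.begin(), entries.end(),
                                         KeyMapEntry{x11Key, 0});
        if (it != entries.end() && it->x11Key == x11Key)
            return it->engineKey;
        return 0;
    }

private:
    std::vector<KeyMapEntry> entries;
    bool sorted = false;
};
```

Because the upper and lower case keysyms are separate entries, both have to be added, which is why every letter appears twice in the list above.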

For b) i found some help on the web. Those are public domain sources to translate those codes from the internal x-windows representation to Unicode:
http://www.cl.cam.ac.uk/~mgk25/ucs/keysym2ucs.h
http://www.cl.cam.ac.uk/~mgk25/ucs/keysym2ucs.c
I added those files to the engine and had to change some lines in bool CIrrDeviceLinux::run() in the cases of KeyPress and KeyRelease:

Code: Select all

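// old solution
//mbtowc(&irrevent.KeyInput.Char, buf, 4);
// new solution: translate the X11 keysym to a Unicode code point
long int ucsCode = keysym2ucs(mp.X11Key);
// keysym2ucs returns -1 for keys without a character (shift, ctrl, esc, ...)
if (ucsCode == -1)
    ucsCode = 0;
// copies the first sizeof(wchar_t) bytes of ucsCode, which are the low-order bytes only on little-endian CPUs
memcpy( &irrevent.KeyInput.Char, &ucsCode, sizeof(wchar_t) );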
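What the keysym translation does can be sketched like this. It is a stand-in for illustration, not the linked keysym2ucs.c: the two direct rules (Latin-1 keysyms and keysyms carrying the 0x01000000 flag) are how that file handles those ranges as far as recalled here, while the real function uses a large lookup table for everything else. The two cyrillic entries and their numeric values are assumptions that should be checked against keysymdef.h; keysymToUcsSketch and toEventChar are made-up names.

```cpp
#include <cassert>

// Simplified stand-in for keysym2ucs(): returns a Unicode code point or -1.
long keysymToUcsSketch(unsigned long keysym)
{
    // Latin-1 keysyms are identical to their Unicode code points.
    if ((keysym >= 0x0020 && keysym <= 0x007e) ||
        (keysym >= 0x00a0 && keysym <= 0x00ff))
        return static_cast<long>(keysym);

    // Keysyms flagged with 0x01000000 carry the Unicode value directly.
    if ((keysym & 0xff000000UL) == 0x01000000UL)
        return static_cast<long>(keysym & 0x00ffffffUL);

    // Two sample table entries (assumed values, for illustration only).
    if (keysym == 0x06c1) return 0x0430; // cyrillic small a
    if (keysym == 0x06e1) return 0x0410; // cyrillic capital A

    return -1; // no character: modifier keys, function keys, ...
}

// Convert the result into the character stored in the key event.
// Keys without a character must end up as 0, not as a truncated -1.
wchar_t toEventChar(long ucs)
{
    assert(ucs >= -1);
    return ucs < 0 ? L'\0' : static_cast<wchar_t>(ucs);
}
```

Assigning by value as in toEventChar also avoids any dependence on byte order or on the size of long.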

So now i already had mostly working support for cyrillic in Linux. The last missing step was that i needed the cyrillic names for all EKEY_CODEs. The solution was to add a GetNameButton function to my keyboard interface class, which now returns a stringtable entry for each EKEY_CODE. So i can now translate the key names in an xml file.
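The key name lookup could look roughly like the sketch below. The actual keyboard interface class and stringtable are not shown in this thread, so everything here (the class KeyNameTable, its methods, and the "?" fallback) is a hypothetical illustration of the idea only.

```cpp
#include <cassert>
#include <map>
#include <string>

// Maps key codes to translated, displayable key names.
// In a real application the names would be loaded from the stringtable.
class KeyNameTable
{
public:
    void set(int keyCode, const std::wstring& name)
    {
        assert(!name.empty());
        names[keyCode] = name;
    }

    // Returns the translated name, or "?" for keys without an entry.
    std::wstring getNameButton(int keyCode) const
    {
        const auto it = names.find(keyCode);
        return it != names.end() ? it->second : std::wstring(L"?");
    }

private:
    std::map<int, std::wstring> names;
};
```

The point of the indirection is that the game logic keeps working with key codes, while only the displayed text changes with the language.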

Ok, same stuff for Windows. I should mention in advance that i had some more restrictions for Windows, which complicated things somewhat and which you might not have. First of all i need to support Windows 98. While this is often no problem, it is a big problem when it comes to Unicode, because Windows 95/98 and ME had basically no support for it. Then it has to compile with MinGW. And last, it must work legally with a commercially distributed game.

Now Microsoft has been offering the "Microsoft Layer for Unicode on Windows 95/98/ME Systems" for a few years. Which is nice.
They even allow you to get and distribute it for free, which is even nicer. Well, really great would have been if they
also released such code, which basically fixes stuff that sucks in 95/98/ME, as open source with a license which can
just be used. They are not that nice. Actually they have the usual EULA-stuff, which for example only allows you to use
that layer when you put such a restrictive EULA in front of your own game. And it's certainly the usual closed source lib
which is harder to get working with other environments like MinGW. I'm not even sure if it would be legal to use there.

There are two open libraries which try to help you there.
libunicows allows you using the MS Unicode layer with other compilers, like MinGW: http://libunicows.sourceforge.net/
opencow is a free replacement for the MS Unicode layer, but seems not yet to be complete: http://opencow.sourceforge.net/
Adding new libs is always some trouble, and i didn't know if those libs would work or if the Mozilla license of opencow
would be compatible with my project. Still, those libs might be fine and useful for you in case you're working on similar stuff.

But in the end i found another way to do it, which will need another short introduction to a few things:
ANSI codepages: Windows uses separate codepages for different groups of languages. A codepage is basically a table of character codes in the multibyte format (yet another format - it's not yet Unicode).
HKL (i think it stands for: handle keyboard layout): It tells you the language id of the language to which your keyboard is currently set. You can change that with the language symbols in the tray if you have installed several languages in Windows.

I haven't found any function to get the ANSI codepage from the HKL or the language ids, but i found a table for it on
http://www.science.co.il/Language/Local ... ?s=decimal
So i wrote that function myself:

Code: Select all

unsigned int LangIdToCodepage(unsigned int langId_)
{
    switch ( langId_ )
    {
        case 1098:  // Telugu
        case 1095:  // Gujarati
        case 1094:  // Punjabi
        case 1103:  // Sanskrit
        case 1111:  // Konkani
        case 1114:  // Syriac
        case 1099:  // Kannada
        case 1102:  // Marathi
        case 1125:  // Divehi
        case 1067:  // Armenian
        case 1081:  // Hindi
        case 1079:  // Georgian
        case 1097:  // Tamil
            return 0;
        case 1054:  // Thai
            return 874;
        case 1041:  // Japanese
            return 932;
        case 2052:  // Chinese (PRC)
        case 4100:  // Chinese (Singapore)
            return 936;
        case 1042:  // Korean
            return 949;
        case 5124:  // Chinese (Macau S.A.R.)
        case 3076:  // Chinese (Hong Kong S.A.R.)
        case 1028:  // Chinese (Taiwan)
            return 950;
        case 1048:  // Romanian
        case 1060:  // Slovenian
        case 1038:  // Hungarian
        case 1051:  // Slovak
        case 1045:  // Polish
        case 1052:  // Albanian
        case 2074:  // Serbian (Latin)
        case 1050:  // Croatian
        case 1029:  // Czech
            return 1250;
        case 1104:  // Mongolian (Cyrillic)
        case 1071:  // FYRO Macedonian
        case 2115:  // Uzbek (Cyrillic)
        case 1058:  // Ukrainian
        case 2092:  // Azeri (Cyrillic)
        case 1092:  // Tatar
        case 1087:  // Kazakh
        case 1059:  // Belarusian
        case 1088:  // Kyrgyz (Cyrillic)
        case 1026:  // Bulgarian
        case 3098:  // Serbian (Cyrillic)
        case 1049:  // Russian
            return 1251;
        case 8201:  // English (Jamaica)
        case 3084:  // French (Canada)
        case 1036:  // French (France)
        case 5132:  // French (Luxembourg)
        case 5129:  // English (New Zealand)
        case 6153:  // English (Ireland)
        case 1043:  // Dutch (Netherlands)
        case 9225:  // English (Caribbean)
        case 4108:  // French (Switzerland)
        case 4105:  // English (Canada)
        case 1110:  // Galician
        case 10249:  // English (Belize)
        case 3079:  // German (Austria)
        case 6156:  // French (Monaco)
        case 12297:  // English (Zimbabwe)
        case 1069:  // Basque
        case 2067:  // Dutch (Belgium)
        case 2060:  // French (Belgium)
        case 1035:  // Finnish
        case 1080:  // Faroese
        case 1031:  // German (Germany)
        case 3081:  // English (Australia)
        case 1033:  // English (United States)
        case 2057:  // English (United Kingdom)
        case 1027:  // Catalan
        case 11273:  // English (Trinidad)
        case 7177:  // English (South Africa)
        case 1030:  // Danish
        case 13321:  // English (Philippines)
        case 15370:  // Spanish (Paraguay)
        case 9226:  // Spanish (Colombia)
        case 5130:  // Spanish (Costa Rica)
        case 7178:  // Spanish (Dominican Republic)
        case 12298:  // Spanish (Ecuador)
        case 17418:  // Spanish (El Salvador)
        case 4106:  // Spanish (Guatemala)
        case 18442:  // Spanish (Honduras)
        case 3082:  // Spanish (International Sort)
        case 13322:  // Spanish (Chile)
        case 19466:  // Spanish (Nicaragua)
        case 2058:  // Spanish (Mexico)
        case 10250:  // Spanish (Peru)
        case 20490:  // Spanish (Puerto Rico)
        case 1034:  // Spanish (Traditional Sort)
        case 14346:  // Spanish (Uruguay)
        case 8202:  // Spanish (Venezuela)
        case 1089:  // Swahili
        case 1053:  // Swedish
        case 2077:  // Swedish (Finland)
        case 5127:  // German (Liechtenstein)
        case 1078:  // Afrikaans
        case 6154:  // Spanish (Panama)
        case 4103:  // German (Luxembourg)
        case 16394:  // Spanish (Bolivia)
        case 2055:  // German (Switzerland)
        case 1039:  // Icelandic
        case 1057:  // Indonesian
        case 1040:  // Italian (Italy)
        case 2064:  // Italian (Switzerland)
        case 2068:  // Norwegian (Nynorsk)
        case 11274:  // Spanish (Argentina)
        case 1046:  // Portuguese (Brazil)
        case 1044:  // Norwegian (Bokmal)
        case 1086:  // Malay (Malaysia)
        case 2110:  // Malay (Brunei Darussalam)
        case 2070:  // Portuguese (Portugal)
            return 1252;
        case 1032:  // Greek
            return 1253;
        case 1091:  // Uzbek (Latin)
        case 1068:  // Azeri (Latin)
        case 1055:  // Turkish
            return 1254;
        case 1037:  // Hebrew
            return 1255;
        case 5121:  // Arabic (Algeria)
        case 15361:  // Arabic (Bahrain)
        case 9217:  // Arabic (Yemen)
        case 3073:  // Arabic (Egypt)
        case 2049:  // Arabic (Iraq)
        case 11265:  // Arabic (Jordan)
        case 13313:  // Arabic (Kuwait)
        case 12289:  // Arabic (Lebanon)
        case 4097:  // Arabic (Libya)
        case 6145:  // Arabic (Morocco)
        case 8193:  // Arabic (Oman)
        case 16385:  // Arabic (Qatar)
        case 1025:  // Arabic (Saudi Arabia)
        case 10241:  // Arabic (Syria)
        case 14337:  // Arabic (U.A.E.)
        case 1065:  // Farsi
        case 1056:  // Urdu
        case 7169:  // Arabic (Tunisia)
            return 1256;
        case 1061:  // Estonian
        case 1062:  // Latvian
        case 1063:  // Lithuanian
            return 1257;
        case 1066:  // Vietnamese
            return 1258;
    }
    return 65001;   // utf-8
}
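The decimal ids in the table are Windows language identifiers, which pack two fields into 16 bits: the low 10 bits are the primary language and the upper 6 bits the sub-language. A small portable sketch of that split (the helper names are made up; on Windows the PRIMARYLANGID and SUBLANGID macros do the same):

```cpp
#include <cassert>

// Low 10 bits of a language id: the primary language (e.g. 0x19 = Russian).
unsigned int primaryLanguage(unsigned int langId)
{
    assert(langId <= 0xFFFF);
    return langId & 0x3FF;
}

// Upper 6 bits of a language id: the regional or script variant.
unsigned int subLanguage(unsigned int langId)
{
    assert(langId <= 0xFFFF);
    return (langId >> 10) & 0x3F;
}
```

This also shows why the switch above can't simply be reduced to the primary language: 2074 (Serbian Latin) and 3098 (Serbian Cyrillic) share the primary language 0x1A but need the codepages 1250 and 1251 respectively.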

I added that function to CIrrDeviceWin32.cpp and i also added some variables (outside all scopes - this could certainly be done nicer, but it's easier to paste here that way).

Code: Select all

static HKL KEYBOARD_INPUT_HKL=0;
static unsigned int KEYBOARD_INPUT_CODEPAGE = 1252;

I set them once at the end of the CIrrDeviceWin32 constructor to initialize them.

Code: Select all

	// get the codepage used for keyboard input
    KEYBOARD_INPUT_HKL = GetKeyboardLayout(0);
    KEYBOARD_INPUT_CODEPAGE = LangIdToCodepage( LOWORD(KEYBOARD_INPUT_HKL) );

And as the user can change it at runtime i catch that in the WndProc

Code: Select all

case WM_INPUTLANGCHANGE:
        // get the new codepage used for keyboard input
        KEYBOARD_INPUT_HKL = GetKeyboardLayout(0);
        KEYBOARD_INPUT_CODEPAGE = LangIdToCodepage( LOWORD(KEYBOARD_INPUT_HKL) );
        return 0;

With that information i can now change WM_KEYDOWN:

Code: Select all

	case WM_KEYDOWN:
		{
			event.EventType = irr::EET_KEY_INPUT_EVENT;
			event.KeyInput.Key = (irr::EKEY_CODE)wParam;
			event.KeyInput.PressedDown = true;
			dev = getDeviceFromHWnd(hWnd);

			BYTE allKeys[256];
			WORD KeyAsc=0;
			GetKeyboardState(allKeys);
			// ToAscii wouldn't work for unicode on newer window systems.
			// ToAsciiEx returns how many characters it wrote to KeyAsc (0: none, negative: dead key).
			int numChars = ToAsciiEx(wParam,lParam,allKeys,&KeyAsc,0,KEYBOARD_INPUT_HKL);

			event.KeyInput.Shift = ((allKeys[VK_SHIFT] & 0x80)!=0);
			event.KeyInput.Control = ((allKeys[VK_CONTROL] & 0x80)!=0);

			// convert from the ANSI codepage of the keyboard layout to a wide char
			WORD unicodeChar = 0;	// stays 0 when the key produces no character
			if ( numChars > 0 )
			{
				MultiByteToWideChar(
					KEYBOARD_INPUT_CODEPAGE,
					MB_PRECOMPOSED, // default
					(LPCSTR)&KeyAsc,
					sizeof(KeyAsc),
					(WCHAR*)&unicodeChar,
					1 );
			}
			event.KeyInput.Char = unicodeChar;

			if (dev)
				dev->postEventFromUser(event);

			return 0;
		}
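To make the codepage step more concrete, here is a portable sketch of what the conversion from codepage 1251 to Unicode amounts to for the Russian letters. It is not how MultiByteToWideChar is implemented and it only covers ASCII, the contiguous block of 64 letters at 0xC0..0xFF, and the two letters Ё/ё; cp1251ToUnicode is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Convert one byte in Windows codepage 1251 to a Unicode code point.
// Returns 0 for bytes this sketch does not cover.
std::uint16_t cp1251ToUnicode(unsigned int byte)
{
    assert(byte <= 0xFF);
    if (byte < 0x80)
        return static_cast<std::uint16_t>(byte);                   // ASCII is unchanged
    if (byte >= 0xC0)
        return static_cast<std::uint16_t>(0x0410 + (byte - 0xC0)); // U+0410..U+044F
    if (byte == 0xA8)
        return 0x0401;                                             // capital IO
    if (byte == 0xB8)
        return 0x0451;                                             // small io
    return 0;
}
```

The same byte means something different in every codepage (0xC0 is a Latin letter with an accent in 1252), which is why the codepage has to follow the current keyboard layout.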

Hm, that's it already. If you ever have to do that it's probably still some work, but i hope this text will help you a little. I can't promise yet that it's completely without bugs - i could only test it on two systems so far, but i will do some more testing in the next weeks and will let you know if something does not work yet.

Edit (2007-04-23): I found another i18n problem which can happen when saving files. Cyrillic filenames are not supported by all systems (i don't even know yet which work and which don't). Just converting strings to Utf8-Names can even result in filenames which can, for example in Windows 98, no longer be renamed or deleted (at least outside the dos-box). So far i have no solution for that, except to use English filenames if you can.
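One way to follow the "use English filenames" advice in code is to check a wanted name before using it and fall back to a safe one. This is only a sketch of that workaround with made-up helper names, not a fix for the underlying problem:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if every character is printable ASCII (0x20..0x7E).
bool isPlainAscii(const std::wstring& name)
{
    for (std::size_t i = 0; i < name.size(); ++i)
    {
        if (name[i] < 0x20 || name[i] > 0x7E)
            return false;
    }
    return true;
}

// Use the wanted filename only if it is plain ASCII, otherwise the fallback.
std::wstring safeFileName(const std::wstring& wanted, const std::wstring& fallback)
{
    assert(isPlainAscii(fallback));
    return isPlainAscii(wanted) ? wanted : fallback;
}
```

A profile named by the player in cyrillic would then be stored under a generic name such as "profile01.dat", with the real name kept inside the file.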

Edit (2007-05-30): Added lester's fix for spaces appearing when pressing shift, ctrl, esc.

lester · Post by **lester** » Fri May 25, 2007 12:00 pm

Thank you for this wonderful how-to, CuteAlien! It works great but I have one issue. See, if you press a shift or control key, ucsCode is set to -1, which is what the keysym2ucs function returns for keys without a character. But the memcpy copies that value into irrevent.KeyInput.Char, producing an unwanted space in the text field. So I suggest adding this statement before the memcpy call:

Code: Select all

if (ucsCode == -1) ucsCode =0;

which fixes this behavior. Btw it also fixes that annoying bug with the Esc key.

CuteAlien · Post by **CuteAlien** » Wed May 30, 2007 2:10 am

Thanks lester, i have fixed it now.