Unicode-aware Irrlicht

Nalin · Post by **Nalin** » Fri Dec 18, 2009 10:36 am

zet.dp.ua wrote:Only vertex buffer in draw method of textnode class. setText(utf16) -> iterate each symbol -> extract glyph data -> fill vertex buffer -> draw buffer

Yeah, I plan on creating a new text scene node that does that.

zet.dp.ua wrote:Some c++ comments:
Don't use virtual functions in such "low-level" classes, try to use inner iterator and access classes instead. And extract all possible common code blocks into separate functions (Convert the surrogate pair into a single UTF-32 character...)

Are you talking about the virtual functions in the _ustring_base class? I had to add those to work around a limitation in C++. Basically, the iterator references ustring, and ustring references the iterator. If that workaround wasn't there, it would fail to compile the iterator because ustring hasn't been defined yet.

Do you have any idea on how to fix it?

EDIT: Oh. I think I may have realized how to do it. :x

zet.dp.ua wrote:One suggestion: uchar16_t* toUTF16() -> core::array<uchar16_t> toUTF16()
to simplify delete []

That is a good idea. I can't believe I didn't even think of that.
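
A minimal sketch of that suggestion, for readers following along. It uses std::vector as a stand-in for core::array so it compiles without Irrlicht, and toUTF16 here is a hypothetical free function over a vector of code points, not the member function from the pastebin. The point is only that a container returned by value owns its storage.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the thread's uchar16_t / uchar32_t typedefs.
typedef std::uint16_t uchar16_t;
typedef std::uint32_t uchar32_t;

// Encode UTF-32 code points as UTF-16 and return the result by value.
// The container owns its storage, so the caller never calls delete[].
// Assumes valid input: no lone surrogates, nothing above U+10FFFF.
std::vector<uchar16_t> toUTF16(const std::vector<uchar32_t>& codePoints)
{
    std::vector<uchar16_t> out;
    for (std::size_t i = 0; i < codePoints.size(); ++i)
    {
        uchar32_t c = codePoints[i];
        if (c >= 0x10000)
        {
            // Outside the BMP: split into a lead/trail surrogate pair.
            c -= 0x10000;
            out.push_back(static_cast<uchar16_t>(0xD800 | (c >> 10)));
            out.push_back(static_cast<uchar16_t>(0xDC00 | (c & 0x3FF)));
        }
        else
        {
            out.push_back(static_cast<uchar16_t>(c));
        }
    }
    return out;
}
```

With this shape, U+10405 comes back as the two units 0xD801 0xDC05, and the buffer is released when the returned vector goes out of scope.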

zet.dp.ua · Post by **zet.dp.ua** » Fri Dec 18, 2009 2:39 pm

Nalin wrote:Are you talking about the virtual functions in the _ustring_base class? I had to add those to work around a limitation in C++. Basically, the iterator references ustring, and ustring references the iterator. If that workaround wasn't there, it would fail to compile the iterator because ustring hasn't been defined yet.

Do you have any idea on how to fix it?

EDIT: Oh. I think I may have realized how to do it.

Yes, maybe you know already. Something like how it is done in std or irrMap:

Code:

class string
{
public:
  class const_iterator
  {
     ...
  };
  ...
};
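
A fuller sketch of the nested-iterator layout, under some assumptions: ustring_sketch, appendUnit, startsPair and combine are hypothetical names, the storage is a plain std::vector of UTF-16 units, and the input is assumed to be well-formed enough that a lead surrogate not followed by a trail is simply returned as-is. Because the iterator is declared inside the string class, its member functions can use the enclosing class directly, so no shared base class with virtual functions is needed. The surrogate-pair conversion is pulled into one helper, as suggested earlier in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature of a UTF-16 backed string; not the pastebin class.
class ustring_sketch
{
public:
    // Declared inside the string class, so it can refer to ustring_sketch
    // directly and no base class is needed to break the cycle.
    class const_iterator
    {
    public:
        const_iterator(const ustring_sketch& s, std::size_t p) : ref(&s), pos(p) {}

        // Decode the code point at the current position.
        std::uint32_t operator*() const
        {
            const std::uint16_t unit = ref->data[pos];
            return startsPair() ? combine(unit, ref->data[pos + 1]) : unit;
        }

        // Step over one code point (one or two UTF-16 units).
        const_iterator& operator++()
        {
            pos += startsPair() ? 2 : 1;
            return *this;
        }

        bool operator!=(const const_iterator& o) const { return pos != o.pos; }

    private:
        // True when a lead surrogate at pos is followed by a trail surrogate.
        bool startsPair() const
        {
            const std::vector<std::uint16_t>& d = ref->data;
            return pos + 1 < d.size()
                && d[pos] >= 0xD800 && d[pos] <= 0xDBFF
                && d[pos + 1] >= 0xDC00 && d[pos + 1] <= 0xDFFF;
        }

        // The shared helper: turn a surrogate pair into one UTF-32 character.
        static std::uint32_t combine(std::uint16_t lead, std::uint16_t trail)
        {
            return 0x10000u
                 + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)
                 + (static_cast<std::uint32_t>(trail) - 0xDC00u);
        }

        const ustring_sketch* ref;
        std::size_t pos;
    };

    void appendUnit(std::uint16_t u) { data.push_back(u); }
    const_iterator begin() const { return const_iterator(*this, 0); }
    const_iterator end() const { return const_iterator(*this, data.size()); }

private:
    std::vector<std::uint16_t> data;
};
```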

Nalin · Post by **Nalin** » Fri Dec 18, 2009 8:56 pm

Again: http://irrlicht.pastebin.com/f26df2de4

I merged the iterator classes into the ustring class to get rid of the _ustring_base hack, like how irrMap does it.
I changed the to*() functions to return a core::array instead of a pointer.
I fixed the string allocator issue.
I simplified some of the code.
I fixed a couple bugs.

I won't be able to work on this for a couple days as I will be very busy this weekend. If anybody else chimes in with some suggestions, I'll get to them next week.

zet.dp.ua · Post by **zet.dp.ua** » Sat Dec 19, 2009 10:25 am

Great! It keeps getting better and better! Thank you for your work.

One more suggestion:

Code:

class _ustring_iterator_access : public _ustring_const_iterator_access
{
public:
  _ustring_iterator_access& operator=(const uchar32_t c)
  {
    ...
  }
};

The same goes for _ustring_iterator.

Nalin · Post by **Nalin** » Sat Dec 19, 2009 6:59 pm

zet.dp.ua wrote:Great! It keeps getting better and better! Thank you for your work.

One more suggestion:
Code:
class _ustring_iterator_access : public _ustring_const_iterator_access
{
public:
  _ustring_iterator_access& operator=(const uchar32_t c)
  {
    ...
  }
};
The same goes for _ustring_iterator.

Wouldn't that then inherit the const ustring ref, making the operator= function fail?

zet.dp.ua · Post by **zet.dp.ua** » Mon Dec 21, 2009 8:17 am

Nalin wrote: Wouldn't that then inherit the const ustring ref, making the operator= function fail?

Yes, but it is possible to remove the const modifier. Const is useful, but in some cases it can be removed without any problems. I just think that the simpler the better: there is less chance of making a copy-paste error.

But you decide.
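
One way the trade-off discussed here could look, as a sketch only. The class names const_char_access and char_access and the buffer_t typedef are hypothetical, and the storage holds one element per character for brevity, whereas the real class stores UTF-16 and a write may change the string's length. The read-only proxy keeps a non-const pointer (hence the const_cast) so that the writable proxy can inherit everything and add only operator=.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the string's storage; one element per character for brevity.
typedef std::vector<std::uint32_t> buffer_t;

// Read-only proxy. It stores a non-const pointer so the derived class can
// reuse it; nothing in this class's own interface writes through it.
class const_char_access
{
public:
    const_char_access(const buffer_t& b, std::size_t p)
        : buf(const_cast<buffer_t*>(&b)), pos(p) {}

    // Reading the proxy yields the character value.
    operator std::uint32_t() const { return (*buf)[pos]; }

protected:
    buffer_t* buf;
    std::size_t pos;
};

// Writable proxy: inherits the read side and adds only assignment.
// It can only be built from a non-const buffer, so the write is legal.
class char_access : public const_char_access
{
public:
    char_access(buffer_t& b, std::size_t p) : const_char_access(b, p) {}

    char_access& operator=(std::uint32_t c)
    {
        (*buf)[pos] = c;
        return *this;
    }
};
```

The cost is that const-correctness is now enforced by the class interfaces instead of by the member's type, which is the concern raised above.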

Nalin · Post by **Nalin** » Tue Dec 22, 2009 11:12 am

New version: http://irrlicht.pastebin.com/f90b1b63

I finished implementing split() and started testing the class, fixing bugs as I came across them.

I also implemented toUTF8_s() and toWCHAR_s(), which return a stringc and a stringw. I also implemented toUTF16_s() and toUTF32_s() functions, which are only turned on for C++0x-compatible compilers (they will only work with GCC 4.5 at the moment, as that is the only compiler with the new Unicode string data types). This is the only solution I can come up with currently because of the issues with the string class. Also, I added a couple of C++0x move semantics for fun.

I have attached the code I used to test the class. It shows all of the functions that I have currently tested. I think there are only a few I haven't tested yet. And I still need to test the correctness of my UTF-8 conversion functions.

EDIT: I got around to testing all the last bits of code, including the UTF-8 conversion functions. It all worked perfectly. Here are the updated tests:

Code:

	core::ustring test(L"This is a test: ");
	core::ustring test2(test);
	test.append((uchar32_t)0x10405);
	test += " - ???";

	core::ustring::iterator i = test.begin();
	i += 3;
	*i = 0x10400;
	++i;
	*i = 0x10401;

	test.erase(3);
	s32 loc1 = test.find(L"is");
	s32 loc2 = test.findLast((uchar32_t)'e');
	s32 loc3 = test.findFirst((uchar32_t)0x10401);

	uchar32_t buf[3] = {0x10400, 0x10401, 0x10405};
	s32 loc4 = test.findFirstChar(buf, 3);
	s32 loc5 = test.findFirstCharNotInList(buf, 3);
	s32 loc6 = test.findLastChar(buf, 3);
	s32 loc7 = test.findLastCharNotInList(buf, 3);
	uchar32_t lc = test.lastChar();
	test.replace('s', 'S');

	bool e1 = test.equalsn(test2, 5);
	bool e2 = test.equalsn(test2, 3);
	bool e3 = test.equalsn(test2, 4);
	bool e4 = (test == test2);
	bool e5 = (test != test2);

	uchar32_t ct1 = test[3];
	uchar32_t ct2 = test[4];
	test[4] = (uchar32_t)'I';
	
	core::ustring test3 = test.subString(3, 6);

	uchar32_t splitPos = 'S';
	core::list<core::ustring> sstrings;
	test.split<core::list<core::ustring> >(sstrings, &splitPos, 1);

	i.toStart();
	while (!i.atEnd())
	{
		uchar32_t c = *i;
		device->getLogger()->log(core::stringw(c).c_str());
		++i;
	}

	core::stringw t2 = test.toWCHAR_s();
	core::stringc t3 = test.toUTF8_s();

	io::IWriteFile* f1 = device->getFileSystem()->createAndWriteFile("test.txt");
	f1->write(t3.c_str(), t3.size());
	f1->drop();

	io::IReadFile* f2 = device->getFileSystem()->createAndOpenFile("test.txt");
	c8* fbuf = new c8[f2->getSize() + 1];
	f2->read(fbuf, f2->getSize());
	fbuf[f2->getSize()] = 0;
	core::ustring test4(fbuf);
	delete[] fbuf;
	f2->drop();

	bool u8test = (test == test4);

	core::ustring bigunicode;
	bigunicode.append(0x10401);
	test2.removeChars("s");
	test.removeChars(bigunicode);

zet.dp.ua · Post by **zet.dp.ua** » Wed Dec 23, 2009 10:08 am

Excellent work!!!
Now the hardest part: deciding where to use the Unicode string.

I think it is better to have a single string class in any engine. But this will require a lot of changes almost everywhere. It is a major API change.

Some pedantic notes:
1. Where is the simple insert function? And are you sure that insert_raw(..., 0) will work (u32, --, 0)?
2. Life is much easier when the string has something like friend bool operator== (const T* const strB, const string<T>& strA) (to be able to compare "" == ustring)
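
A small sketch of the second note, assuming a hypothetical string_sketch template backed by std::basic_string (this is neither Irrlicht's core::string nor the ustring from the pastebin). The member operator== handles string == literal, and the friend overload handles the reversed order, literal == string. Both assume a non-null, zero-terminated pointer.

```cpp
#include <string>

// Hypothetical minimal string wrapper used only to show the overload pair.
template <typename T>
class string_sketch
{
public:
    string_sketch(const T* s) : data(s) {}

    // Covers: str == "literal"
    bool operator==(const T* other) const { return data == other; }

    // Covers: "literal" == str. Defined as a friend because the left-hand
    // operand is a raw pointer, so a member function cannot be used.
    friend bool operator==(const T* lhs, const string_sketch<T>& rhs)
    {
        return rhs.data == lhs;
    }

private:
    std::basic_string<T> data;
};
```

With both overloads in place, a check such as "" == str compiles and reads the same in either order.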

CuteAlien · Post by **CuteAlien** » Wed Dec 23, 2009 12:13 pm

zet.dp.ua wrote:Excellent work!!!
Now the hardest part: deciding where to use the Unicode string :)
I think it is better to have a single string class in any engine. But this will require a lot of changes almost everywhere. It is a major API change.

I think getting Unicode support is worth an API break, and it should probably be used throughout. Not before the 1.7 release, as we're trying to get that out ASAP, but afterward this is something we will consider.

Dorth · Post by **Dorth** » Wed Dec 23, 2009 3:48 pm

Yeah, Unicode is really a must nowadays, and it's been too long in coming.

Nalin · Post by **Nalin** » Wed Dec 23, 2009 7:38 pm

zet.dp.ua wrote:Excellent work!!!
Now the hardest part: deciding where to use the Unicode string.
I think it is better to have a single string class in any engine. But this will require a lot of changes almost everywhere. It is a major API change.

Some pedantic notes:
1. Where is the simple insert function? And are you sure that insert_raw(..., 0) will work (u32, --, 0)?
2. Life is much easier when the string has something like friend bool operator== (const T* const strB, const string<T>& strA) (to be able to compare "" == ustring)

Well, what do you know. I forgot the insert function. Nice catch.

New version: http://irrlicht.pastebin.com/f7e652f3d

I added the missing insert() function and added a bunch of doxygen comments.

Nalin · Post by **Nalin** » Sun Jan 17, 2010 1:24 am

It has been a while, but I have released a new version:
http://irrlicht.pastebin.com/f374c5218

I focused on validating strings this time. This version validates UTF-8 strings when you pass one to the class, replacing invalid byte sequences with U+FFFD, the Unicode "replacement character." The validate() function now also checks the validity of the UTF-16 string, in addition to its original job of making sure no \0 characters exist in the middle of the string.
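
For readers who want to see the idea in isolation, here is a rough sketch of UTF-8 validation with U+FFFD substitution. It is not the routine from the pastebin: sanitizeUTF8 is a hypothetical free function over std::string, and emitting one replacement character per offending byte is an assumption (other policies replace a whole bad sequence with a single U+FFFD). It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Return a copy of `in` in which every byte that does not start a valid
// UTF-8 sequence is replaced by U+FFFD (encoded as EF BF BD).
std::string sanitizeUTF8(const std::string& in)
{
    static const char kReplacement[] = "\xEF\xBF\xBD";
    std::string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;     // expected sequence length, 0 = bad lead byte
        unsigned long cp = 0;    // decoded code point
        unsigned long minCp = 0; // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }

        bool ok = (len != 0) && (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                ok = false;                  // not a continuation byte
            else
                cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        if (ok && (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.append(in, i, len); i += len; }
        else    { out += kReplacement;    i += 1; }
    }
    return out;
}
```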

I hope to start working on some scene nodes soon that take advantage of this class.

CuteAlien · Post by **CuteAlien** » Sun Jan 17, 2010 7:42 am

Hm, instead of working on scene nodes, maybe think about working on XML saving first. Right now we can load different formats, but not save them, and that's a real missing feature.
That way we could get the new string class into Irrlicht, have it already do some useful work, and have more of a chance to test it out before starting to replace current strings. So that might be a good place to introduce a new string class.

For the other strings, like those in GUI elements and scene nodes, I'm wondering if we should maybe start thinking about using string tables. For any international application I'm already using a patched Irrlicht myself, where all classes using strings always have another variable for a string-table entry, and the real string is set from that when the string-table entry is not empty (and when the string-table entry is not found in the string table, its label value is used directly as the string. Also, the real string certainly only changes when I change the string-table entry or one of its parameters (yes, it can do parameters)). The reason I'm mentioning this is that it might maybe (maybe(!), not sure at all) even allow keeping the current interface intact and just extending it. The string tables could return the new string format. That could be converted to the current string format for output. (I'll put my stringtable implementation online later on, it just needs a little clean-up first. Edit: online now: http://www.michaelzeilfelder.de/irrlich ... bleComplex).

Nalin · Post by **Nalin** » Tue Jan 19, 2010 5:12 am

Bug fixes:
http://irrlicht.pastebin.com/f73ea61b5

Mainly some bug fixes. I fixed a crash bug and I added a new operator=() function to solve an interesting recursion bug.

Also, I have created a new XML writer class to save XML in UTF-8 format. I then modified Irrlicht's XML reader to correctly convert unicode (Irrlicht cannot convert multi-byte unicode correctly). Finally, I modified CFileSystem::createXMLWriter() to create an instance of my UTF-8 XML writer class. The end result is that Irrlicht can load UTF-8, UTF-16, or UTF-32 encoded XML files and write UTF-8 XML files.
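
As an illustration of the multi-byte conversion a UTF-8 writer needs, here is a sketch of encoding a single code point. appendUTF8 is a hypothetical helper, not code from the patch, and mapping invalid input to U+FFFD is an assumption made for the example.

```cpp
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
// Surrogates and values above U+10FFFF are written as U+FFFD instead.
void appendUTF8(std::string& out, unsigned long cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, U+10405 (used in the tests earlier in the thread) becomes the four bytes F0 90 90 85.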

Here is a patch that includes all of those changes (including irrUString.h):
http://nalin.suckerfree.org/public/code ... utf8.patch

EDIT: Oops, I found an error in my UTF-8 validation routine that caused it to insert a mangled U+FFFD replacement character. I have fixed it and updated both links.

CuteAlien · Post by **CuteAlien** » Tue Jan 19, 2010 9:06 am

Thanks, great work again :-)
I suppose you are using some test code for development. If you could post that as well, it would be easier for others to check the patch out. We're all still deep in 1.7 release fixing at the moment, though, so please be patient (I'm also even deeper in finishing a day-project this week).