[fixed]XML reader can't read files produced by XML writer

nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Post by nburlock »

I've just finished writing the Linux implementation of IrrFontTool, but getFont rejects the XML font file it produces. The problem seems to be that the XML writer produces a file with four bytes per character, which is the size of a wchar_t on 64-bit Linux, while the reader seems to handle only two bytes per character in an XML file.

I have a couple of questions for someone knowledgeable:
1) Is there some sort of "quick fix" for 64-bit systems that I haven't heard about that makes this work?
2) If this is a problem, is it that the reader should be able to handle four-byte characters, or that the writer shouldn't be producing four-byte characters?

As soon as I've got this out of the way, I'll be able to finish testing the Linux implementation of IrrFontTool and release it.
Dorth
Posts: 931
Joined: Sat May 26, 2007 11:03 pm

Post by Dorth »

Just a thought: great way to make an entrance ^^
Finding a bug and extending Irrlicht in your first two posts. Nice :)
rogerborg
Admin
Posts: 3590
Joined: Mon Oct 09, 2006 9:36 am
Location: Scotland - gonnae no slag aff mah Engleesh

Re: XML reader can't read files produced by XML writer

Post by rogerborg »

nburlock wrote: 1) Is there some sort of "quick fix" that I haven't heard about for 64bit systems that makes this work?
I believe that wchar_t is 32 bits by default with gcc on Linux, regardless of the CPU architecture being targeted. -fshort-wchar should force it to be 16 bits.
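
A quick way to confirm which case applies to a given compiler and set of flags is to look at the size of wchar_t directly. This is only a minimal sketch; the helper name wcharBits is made up for illustration and is not part of Irrlicht.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper (not an Irrlicht function): width of wchar_t in bits
// for the compiler and flags this translation unit is built with.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

On a default gcc build on Linux this should report 32, and 16 when built with -fshort-wchar. Note that the gcc documentation warns that code built with -fshort-wchar is not binary compatible with code built without it, so the flag would have to be applied to everything that exchanges wide strings.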

nburlock wrote:2) If this is a problem, then is it that the reader should be able to handle four byte characters, or is it that the writer shouldn't be producing four byte characters?
It's a fundamental problem with wchar_t, which is why it's not a good type for data exchange. It would be great if Irrlicht defined its own wide type instead, perhaps a UCS-2 type (since UTF-16 brings its own sizing problems to the party).

Hmm, I'm meandering here. I guess I should actually look into doing a patch for this, although robustly testing it across all platforms will be interesting.
Please upload candidate patches to the tracker.
Need help now? IRC to #irrlicht on irc.freenode.net
How To Ask Questions The Smart Way
nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Re: XML reader can't read files produced by XML writer

Post by nburlock »

rogerborg wrote:-fshort-wchar should force it to be 16 bits.
Great info, thanks for that.

I've logged it as a bug:

https://sourceforge.net/tracker2/?func= ... tid=540676
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany

Post by CuteAlien »

I don't think that's the problem. All XML files produced by Irrlicht on Linux are (unfortunately) always 4 bytes per character, and Irrlicht can usually read them too.

I have no experience with IrrFontTool, but search around the forum; I remember seeing a few threads about it already.
IRC: #irrlicht on irc.libera.chat
Code snippet repository: https://github.com/mzeilfelder/irr-playground-micha
Free racer made with Irrlicht: http://www.irrgheist.com/hcraftsource.htm
nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Post by nburlock »

The problem is happening inside the read method of IXMLReader. If I give it a four byte per char file (created by Font Tool), that function will fail. If I strip the extra 2 bytes out of each char in the XML file, then read will work. I've also noticed that Text Editor (Ubuntu's Wordpad equivalent) can't open the four byte per char XML file (it thinks it's a binary file), while Firefox can.

I went back and had Irrlicht create the simplest possible XML file, just a header and one tag, and the same problem is present. I checked the file in a Hex editor, and apart from the Unicode header in the first two bytes of the file, 0xFFFE, everything else is one character value followed by 3 zero bytes which should be legal. Again, Firefox can open this file, but Text Editor and Irrlicht can't.
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany

Post by CuteAlien »

Irrlicht checks for the following formats:

Code:

const unsigned char UTF8[] = {0xEF, 0xBB, 0xBF}; // 0xEFBBBF;
const int UTF16_BE = 0xFFFE;
const int UTF16_LE = 0xFEFF;
const int UTF32_BE = 0xFFFE0000;
const int UTF32_LE = 0x0000FEFF;
So 0xFFFE would be UTF16_BE; only if it's followed by 0000 is it UTF32_BE.
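
For comparison, the byte-order marks are fixed byte sequences, so the same check can be sketched byte by byte, without reading the data through a wider integer type. This is not Irrlicht's code; the enum and function names below are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only, not Irrlicht's API.
enum TextFormat
{
    FORMAT_UNKNOWN,
    FORMAT_UTF8,
    FORMAT_UTF16_BE,
    FORMAT_UTF16_LE,
    FORMAT_UTF32_BE,
    FORMAT_UTF32_LE
};

// Identify a byte-order mark by comparing raw bytes, so the result does not
// depend on the host's endianness or on the size of any integer type.
inline TextFormat detectBom(const unsigned char* d, std::size_t size)
{
    // UTF-32 is tested first because the UTF-32 LE mark (FF FE 00 00)
    // begins with the UTF-16 LE mark (FF FE).
    if (size >= 4 && d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF)
        return FORMAT_UTF32_BE;
    if (size >= 4 && d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00)
        return FORMAT_UTF32_LE;
    if (size >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
        return FORMAT_UTF8;
    if (size >= 2 && d[0] == 0xFE && d[1] == 0xFF)
        return FORMAT_UTF16_BE;
    if (size >= 2 && d[0] == 0xFF && d[1] == 0xFE)
        return FORMAT_UTF16_LE;
    return FORMAT_UNKNOWN;
}
```

One known limitation of this ordering: a UTF-16 LE file whose first character after the mark is U+0000 would be reported as UTF-32 LE, which should not occur in a well-formed XML file.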

I'm not really an expert on IrrXML, but I often use UTF-32 files with Irrlicht, so I would be surprised to see a problem there. Which version of Irrlicht are you using?
nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Post by nburlock »

I'm running 1.4.2.

I've tracked the problem down. It starts at line 573 of CXMLReaderImpl.h:

Code:

char32* data32 = reinterpret_cast<char32*>(data8);
Then, the following is defined a little further on:

Code:

const int UTF32_BE = 0xFFFE0000;
const int UTF32_LE = 0x0000FEFF;
Two if statements are used to determine whether the first four bytes of the file are big endian (line 587) or little endian (line 594):

Code:

if (size >= 4 && data32[0] == (char32)UTF32_BE)

if (size >= 4 && data32[0] == (char32)UTF32_LE)
Both tests fail because:

Code:

data32[0] = 0x0000FEFF
(char32) UTF32_BE = 0xFFFE0000
(char32) UTF32_LE = 0xFEFF
And the code goes on to determine that it's a 2 byte character file of type UTF16_LE, which is why it doesn't work. This will need someone with more experience of the system to say what needs to be fixed.
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany

Post by CuteAlien »

Looks like something for hybrid (I guess he's currently on holiday, as he hasn't posted in the last few days and it's holiday time in his area).

Still, I don't really get it, as 0xFEFF should be equal to 0x0000FEFF, so it should recognize the UTF32_LE in that 'if' clause.
nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Post by nburlock »

I mistyped the value of data32[0] in my previous post; it's actually 0x3C0000FFFE.

char32 is defined as an unsigned long, which is eight bytes on my 64-bit system. That explains why this isn't working: it's comparing the first eight bytes of the file against a four-byte value. The following code demonstrates the problem:

Code:

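#include <cstdio>

typedef unsigned long char32; // as defined in Irrlicht 1.4.2: 8 bytes on 64-bit Linux

int main()
{
    // unsigned char, because 0xFE and 0xFF do not fit in a signed char
    unsigned char data8[8] = { 0xFE,0xFF,0x00,0x00,0x3C,0x00,0x00,0x00 };
    char32* data32 = reinterpret_cast<char32*>(&data8[0]);
    const int UTF32_BE = 0xFFFE0000;
    const int UTF32_LE = 0x0000FEFF;

    // With an 8-byte char32, data32[0] also spans the 0x3C character, so neither test matches
    if (data32[0] == (char32)UTF32_BE)
        printf("big endian\n");

    if (data32[0] == (char32)UTF32_LE)
        printf("little endian\n");
    return 0;
}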
So then I guess that the solution is to change char32 to some type that is four bytes long on all platforms.
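
As a rough sketch of what that could look like, assuming a compiler that ships the C++11 header <cstdint> (which may not hold for every compiler Irrlicht supports): the alias fixed_char32 and the helper readFirstUnit are names invented for this example, not anything in Irrlicht.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: C++11 <cstdint> is available. The alias is illustrative only.
typedef std::uint32_t fixed_char32;

// Holds wherever a byte is 8 bits; fails to compile otherwise.
static_assert(sizeof(fixed_char32) == 4, "fixed_char32 must be four bytes");

// Hypothetical helper: copy the first four bytes of a buffer into a 32-bit
// value. memcpy avoids casting the buffer pointer, so alignment is not an issue.
inline bool readFirstUnit(const unsigned char* data, std::size_t size, fixed_char32& out)
{
    if (size < sizeof(fixed_char32))
        return false;
    std::memcpy(&out, data, sizeof(fixed_char32));
    return true;
}
```

With a type like this, the value read from a file starting FF FE 00 00 is 0x0000FEFF on a little-endian host no matter how wide long is, so a comparison against a four-byte constant only ever looks at four bytes.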
CuteAlien
Admin
Posts: 9734
Joined: Mon Mar 06, 2006 2:25 pm
Location: Tübingen, Germany

Post by CuteAlien »

nburlock wrote: So then I guess that the solution is to change char32 to some type that is four bytes long on all platforms.
Yes, that sounds like a rather good idea :-)
nburlock
Posts: 17
Joined: Tue Oct 21, 2008 3:33 am
Location: Australia

Post by nburlock »

I've posted the info to the bug report, but I'm not going to post a patch - I've no idea what types are constant across all the different compilers and platforms Irrlicht supports :P
Last edited by nburlock on Thu Oct 30, 2008 3:17 am, edited 1 time in total.
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany

Post by hybrid »

Hmm, a long type is indeed not a good idea. I also thought that I had fixed the 64-bit problems some months ago, but I'll check when I'm home from holidays.
rogerborg
Admin
Posts: 3590
Joined: Mon Oct 09, 2006 9:36 am
Location: Scotland - gonnae no slag aff mah Engleesh

Post by rogerborg »

Do we just want char32 to be an unsigned 32 bit type?

Presumably u32 is an unsigned 32 bit type, even on a 64 bit system?

Unfortunately, we can't just "typedef u32 char32", since that farks up the string<char32> type defined by CXMLReaderImpl ( operator += (const unsigned int i) is the same as operator += (T c) )

What a pretty pickle!
vitek
Bug Slayer
Posts: 3919
Joined: Mon Jan 16, 2006 10:52 am
Location: Corvallis, OR

Post by vitek »

rogerborg wrote:Do we just want char32 to be an unsigned 32 bit type?
I wouldn't think so. I think that a char32 should be a 32-bit integral type that has the same signedness as a char.
rogerborg wrote:Presumably u32 is an unsigned 32 bit type, even on a 64 bit system?
Yeah, it should.
rogerborg wrote:that farks up the string<char32> type defined by CXMLReaderImpl (operator += (const unsigned int i) is the same as operator += (T c) )
There are ways around this. One would be to just remove the operator overloading and use unique method names. Of course, that breaks source compatibility for some users. Another way is to use SFINAE and remove one of the overloads when T is unsigned int.
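
A rough sketch of the SFINAE route, assuming a C++11 compiler for std::enable_if and default template arguments on member templates. TinyString is a made-up stand-in for the engine's string template, not the real class, and the real operator += (const unsigned int) may behave differently from the digit-appending version shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a string<T> template; not Irrlicht's class.
template <typename T>
class TinyString
{
public:
    // Append a single character.
    TinyString& operator+=(T c)
    {
        chars.push_back(c);
        return *this;
    }

    // Append the decimal digits of a number. When T is unsigned int this
    // overload would have the same signature as the one above, so SFINAE
    // removes it from overload resolution in that case.
    template <typename U = T,
              typename std::enable_if<!std::is_same<U, unsigned int>::value, int>::type = 0>
    TinyString& operator+=(const unsigned int i)
    {
        const std::string digits = std::to_string(i);
        for (char d : digits)
            chars.push_back(static_cast<T>(d));
        return *this;
    }

    std::size_t size() const { return chars.size(); }
    T operator[](std::size_t index) const { return chars[index]; }

private:
    std::vector<T> chars;
};
```

In this sketch TinyString<unsigned int> compiles and `s += 65u` appends one character, while for TinyString<char> the same expression appends the digits "65". That difference in meaning is something any real patch along these lines would have to weigh.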

Travis