[fixed]XML reader can't read files produced by XML writer

nburlock · Post by **nburlock** » Tue Oct 21, 2008 3:56 am

I've just finished writing the Linux implementation of IrrFontTool, but I've been having trouble with getFont rejecting the XML font file it produces. The problem seems to be that the XML writer is producing a file with four bytes per character, which is the size of a wchar_t on 64-bit Linux, but the reader only seems able to handle two bytes per character in an XML file.

I have a couple of questions for someone knowledgeable:
1) Is there some sort of "quick fix" that I haven't heard about for 64bit systems that makes this work?
2) If this is a problem, then is it that the reader should be able to handle four byte characters, or is it that the writer shouldn't be producing four byte characters?

As soon as I've got this out the way, I'll be able to finish testing the Linux implementation of IrrFontTool and release it.

Dorth · Post by **Dorth** » Tue Oct 21, 2008 6:24 am

Just a thought: great way to make an entrance ^^
Finding a bug and extending Irrlicht in your first two posts. Nice!

rogerborg · Post by **rogerborg** » Tue Oct 21, 2008 10:52 am

nburlock wrote:1) Is there some sort of "quick fix" that I haven't heard about for 64bit systems that makes this work?

I believe that wchar_t is 32 bits by default with gcc on Linux, regardless of the CPU architecture being targeted. -fshort-wchar should force it to be 16 bits.
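To check what a given compiler and flag set actually does, the width can be queried directly. A minimal sketch; the helper names `wcharBits` and `rawWideBytes` are made up for illustration and are not part of Irrlicht:

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits for the current compiler and flags.
// Typically 32 with gcc on Linux, and 16 with -fshort-wchar or on Windows.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Bytes a writer emits for `count` characters if it dumps wchar_t values
// as-is (hypothetical helper, not an Irrlicht function).
inline std::size_t rawWideBytes(std::size_t count)
{
    return count * sizeof(wchar_t);
}
```

Calling `wcharBits()` in builds with and without `-fshort-wchar` shows whether the flag took effect. Code built with `-fshort-wchar` is generally not binary compatible with libraries built without it, so the whole program would need the same setting.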

nburlock wrote:2) If this is a problem, then is it that the reader should be able to handle four byte characters, or is it that the writer shouldn't be producing four byte characters?

It's a fundamental problem with wchar_t, which is why it's not a good type for data exchange. It would be great if Irrlicht defined its own wide type instead, perhaps a UCS-2 type (since UTF-16 brings its own sizing problems to the party).

Hmm, I'm meandering here. I guess I should actually look into doing a patch for this, although robustly testing it across all platforms will be interesting.

nburlock · Post by **nburlock** » Tue Oct 21, 2008 12:20 pm

rogerborg wrote:-fshort-wchar should force it to be 16 bits.

Great info, thanks for that.

I've logged it as a bug:

https://sourceforge.net/tracker2/?func= ... tid=540676

CuteAlien · Post by **CuteAlien** » Tue Oct 21, 2008 2:24 pm

I don't think that's the problem. All XML files produced by Irrlicht on Linux are (unfortunately) always 4 bytes per character, and it can usually read them too.

I have no experience with the IrrFontTool, but search the forum; I remember seeing a few threads about that already.

nburlock · Post by **nburlock** » Tue Oct 21, 2008 4:31 pm

The problem is happening inside the read method of IXMLReader. If I give it a four byte per char file (created by Font Tool), that function will fail. If I strip the extra 2 bytes out of each char in the XML file, then read will work. I've also noticed that Text Editor (Ubuntu's Wordpad equivalent) can't open the four byte per char XML file (it thinks it's a binary file), while Firefox can.

I went back and had Irrlicht create the simplest possible XML file, just a header and one tag, and the same problem is present. I checked the file in a hex editor, and apart from the Unicode header in the first two bytes of the file, 0xFFFE, everything else is one character value followed by three zero bytes, which should be legal. Again, Firefox can open this file, but Text Editor and Irrlicht can't.

CuteAlien · Post by **CuteAlien** » Tue Oct 21, 2008 5:30 pm

Irrlicht checks for the following formats:

Code:

const unsigned char UTF8[] = {0xEF, 0xBB, 0xBF}; // 0xEFBBBF;
const int UTF16_BE = 0xFFFE;
const int UTF16_LE = 0xFEFF;
const int UTF32_BE = 0xFFFE0000;
const int UTF32_LE = 0x0000FEFF;

So 0xFFFE would be UTF16_BE; only if it's followed by 0000 is it UTF32_BE.

I'm not really an expert on IrrXML, but I often use UTF-32 files with Irrlicht, so I would be surprised to see a problem there. Which version of Irrlicht are you using?
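The ordering point above (test the four-byte marks before the two-byte ones) can be sketched byte by byte, which avoids depending on the host's endianness or on the size of any integer type. This is not Irrlicht's implementation; `BomKind` and `detectBom` are hypothetical names:

```cpp
#include <cstddef>

enum BomKind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE };

// Classify the byte order mark at the start of a buffer by looking at the
// raw bytes in file order.
inline BomKind detectBom(const unsigned char* p, std::size_t size)
{
    // UTF-32 is tested first: the UTF-32 LE mark FF FE 00 00 begins with
    // the UTF-16 LE mark FF FE.
    if (size >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF32_LE;
    if (size >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32_BE;
    if (size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (size >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;
    if (size >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

One caveat of this sketch: a UTF-16 LE file whose first character after the mark is U+0000 would be classified as UTF-32 LE. An XML document normally starts with `<`, so that case should not come up here.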

nburlock · Post by **nburlock** » Wed Oct 22, 2008 12:19 am

I'm running 1.4.2

I've tracked the problem down. It starts at line 573 of CXMLReaderImpl.h:

Code:

char32* data32 = reinterpret_cast<char32*>(data8);

Then, the following is defined a little further on:

Code:

const int UTF32_BE = 0xFFFE0000;
const int UTF32_LE = 0x0000FEFF;

Two if statements are used to determine if the first four bytes of the file are big-endian (line 587) or little-endian (line 594):

Code:

if (size >= 4 && data32[0] == (char32)UTF32_BE)

if (size >= 4 && data32[0] == (char32)UTF32_LE)

Both tests fail because:

Code:

data32[0] = 0x0000FEFF
(char32) UTF32_BE = 0xFFFE0000
(char32) UTF32_LE = 0xFEFF

And the code goes on to determine that it's a 2 byte character file of type UTF16_LE, which is why it doesn't work. This will need someone with more experience of the system to say what needs to be fixed.
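One further detail that may matter when `char32` is wider than `int`: the constants are declared as `const int`, so `0xFFFE0000` is stored as a negative value in the usual 32-bit `int`, and casting it to a 64-bit unsigned type sign-extends it. A small sketch of that effect, assuming a two's-complement platform; `widenViaInt` is a hypothetical helper, not Irrlicht code:

```cpp
#include <cstdint>

// Mimics `(char32)UTF32_BE` when the constant is a 32-bit int and char32 is
// a 64-bit unsigned type: the value passes through a signed int first.
inline std::uint64_t widenViaInt(std::uint32_t bits)
{
    // Out-of-range conversion to a signed type is implementation-defined
    // before C++20; common compilers wrap around.
    const std::int32_t asInt = static_cast<std::int32_t>(bits);
    // Converting a negative int to a wider unsigned type sign-extends it.
    return static_cast<std::uint64_t>(asInt);
}
```

Under these assumptions `widenViaInt(0xFFFE0000u)` gives `0xFFFFFFFFFFFE0000`, while `widenViaInt(0x0000FEFFu)` stays `0xFEFF`, so the cast big-endian constant would no longer equal `0xFFFE0000` when `char32` is 64 bits wide.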

CuteAlien · Post by **CuteAlien** » Wed Oct 22, 2008 3:53 am

Looks like something for hybrid (I guess he's currently on holiday, as he hasn't posted in the last few days and it's holiday time in his area).

Still, I don't really get it, as 0xFEFF should be equal to 0x0000FEFF, so it should recognize UTF32_LE in that 'if' clause.

nburlock · Post by **nburlock** » Wed Oct 22, 2008 4:58 am

I mistyped the value of data32[0] in my previous post; it's actually 0x3C0000FFFE.

char32 is defined as an unsigned long, which is eight bytes on my 64-bit system. That explains why this isn't working: it's comparing the first eight bytes of the file against a four-byte value. The following code demonstrates the problem:

Code:

        #include <cstdio>

        typedef unsigned long char32; // as described above; 8 bytes on 64-bit Linux

        int main()
        {
            unsigned char data8[8] = { 0xFE,0xFF,0x00,0x00,0x3C,0x00,0x00,0x00 }; // unsigned: values above 0x7F
            char32* data32 = reinterpret_cast<char32*>(&data8[0]); // reads sizeof(char32) bytes
            const int UTF32_BE = 0xFFFE0000;
            const int UTF32_LE = 0x0000FEFF;

            if (data32[0] == (char32)UTF32_BE)
                printf("big endian\n");

            if (data32[0] == (char32)UTF32_LE)
                printf("little endian\n");
            return 0;
        }

So then I guess that the solution is to change char32 to some type that is four bytes long on all platforms.
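A rough illustration of why the width matters, using explicit byte arithmetic so the result does not depend on the host. The sample bytes used in the note below are the standard UTF-32 LE byte order mark (FF FE 00 00) followed by `<`, chosen for illustration rather than copied from the file in question. `readLittleEndian` and `FixedChar32` are hypothetical names, and `<cstdint>` assumes a C++11 compiler:

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// A candidate replacement type: exactly 32 bits wherever it exists.
typedef std::uint32_t FixedChar32;
static_assert(sizeof(FixedChar32) * CHAR_BIT == 32, "FixedChar32 must be 32 bits wide");

// Interpret the first `width` bytes at `p` as a little-endian unsigned value.
// width 4 models a 4-byte char32; width 8 models `unsigned long` on LP64.
inline std::uint64_t readLittleEndian(const unsigned char* p, std::size_t width)
{
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width && i < 8; ++i)
        value |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return value;
}
```

For the bytes FF FE 00 00 3C 00 00 00, a 4-byte read yields `0x0000FEFF` and matches `UTF32_LE`, whereas an 8-byte read yields `0x0000003C0000FEFF`, which matches neither constant.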

CuteAlien · Post by **CuteAlien** » Wed Oct 22, 2008 6:57 am

nburlock wrote: So then I guess that the solution is to change char32 to some type that is four bytes long on all platforms.

Yes, that sounds like a rather good idea :-)

nburlock · Post by **nburlock** » Wed Oct 22, 2008 9:17 am

I've posted the info to the bug report, but I'm not going to post a patch: I've no idea which types have the same size across all the different compilers and platforms Irrlicht supports.

hybrid · Post by **hybrid** » Wed Oct 22, 2008 7:04 pm

Hmm, the long type is not a good idea, indeed. I also thought that I had fixed the 64-bit problems some months ago, but I'll check when I'm home from my holidays.

rogerborg · Post by **rogerborg** » Mon Nov 17, 2008 10:45 pm

Do we just want char32 to be an unsigned 32 bit type?

Presumably u32 is an unsigned 32 bit type, even on a 64 bit system?

Unfortunately, we can't just "typedef u32 char32", since that farks up the string<char32> type defined by CXMLReaderImpl ( operator += (const unsigned int i) is the same as operator += (T c) )

What a pretty pickle!

vitek · Post by **vitek** » Tue Nov 18, 2008 5:45 am

rogerborg wrote:Do we just want char32 to be an unsigned 32 bit type?

I wouldn't think so. I think that a char32 should be a 32-bit integral type that has the same signedness as a char.
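One way to express that idea, sketched with C++11 type traits (an assumption, since the engine may need to build with older compilers); `Char32Like` is a hypothetical name:

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Pick the 32-bit integer type whose signedness matches plain `char`,
// which is signed on some platforms and unsigned on others.
typedef std::conditional<std::is_signed<char>::value,
                         std::int32_t,
                         std::uint32_t>::type Char32Like;

static_assert(sizeof(Char32Like) * CHAR_BIT == 32, "Char32Like must be 32 bits wide");
```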

rogerborg wrote:Presumably u32 is an unsigned 32 bit type, even on a 64 bit system?

Yeah, it should.

rogerborg wrote:that farks up the string<char32> type defined by CXMLReaderImpl (operator += (const unsigned int i) is the same as operator += (T c) )

There are ways around this. One would be to just remove the operator overloading and use unique method names. Of course, that breaks source compatibility for some users. Another way is to use SFINAE and remove one of the overloads when T is unsigned int.
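A minimal sketch of the SFINAE option, assuming C++11 (`std::enable_if` and default template arguments on function templates). `MiniString` is a hypothetical stand-in, not the engine's string class:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical miniature string, standing in for the engine's string<T>.
template <typename T>
class MiniString
{
public:
    // Append one character of the string's own type.
    MiniString& operator+=(T c)
    {
        data_.push_back(c);
        return *this;
    }

    // Append the decimal digits of a number. U defaults to T so that the
    // enable_if condition is checked during overload resolution; when T is
    // unsigned int this overload drops out and only operator+=(T) remains.
    template <typename U = T,
              typename std::enable_if<!std::is_same<U, unsigned int>::value, int>::type = 0>
    MiniString& operator+=(const unsigned int i)
    {
        const std::string digits = std::to_string(i);
        for (std::size_t k = 0; k < digits.size(); ++k)
            data_.push_back(static_cast<T>(digits[k]));
        return *this;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<T> data_;
};
```

With `MiniString<char>`, `s += 42u` appends the two digits, while with `MiniString<unsigned int>` the same expression appends a single character with value 42. That silent change in meaning is a design trade-off worth weighing against the unique method names option.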

Travis
