Guidelines for generating XML

Over the last little while I've come across quite a few XML feed generators written in PHP, with varying degrees of 'correctness'. Even though generating XML should be very simple, there are still quite a few pitfalls I feel every PHP (or insert-your-language) developer should know about.

1. You are better off using an XML library

This is the first and foremost rule. Most people end up generating their XML using simple string concatenation, even though there are many dedicated tools out there that really help you generate your own XML.

In PHP land the best example is XMLWriter. It is actually quite easy to use:

  <?php

  $xmlWriter = new XMLWriter();
  $xmlWriter->openMemory();
  $xmlWriter->startDocument('1.0','UTF-8');
  $xmlWriter->startElement('root');
  $xmlWriter->text('Contents of the root tag');
  $xmlWriter->endElement(); // root
  $xmlWriter->endDocument();
  echo $xmlWriter->outputMemory();

  ?>

Granted, XMLWriter is verbose, but you have to worry a lot less about escaping and about producing well-formed XML documents.
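To make the escaping point concrete, here is a minimal sketch. The element name 'item' and the attribute 'title' are made up for illustration; the point is only that the special characters are passed in raw and come out escaped.

```php
<?php

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('item'); // hypothetical element name

// Attribute values are escaped by the writer.
$xmlWriter->writeAttribute('title', 'Fish & chips');

// So are markup-significant characters in text nodes.
$xmlWriter->text('a < b & c');

$xmlWriter->endElement(); // item
$xmlWriter->endDocument();

$document = $xmlWriter->outputMemory();
echo $document;
```

With string concatenation, every one of those values would need a manual escaping call, and forgetting a single one breaks the feed.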

2. Understand Unicode

Do you know the difference between a byte, a character and a codepoint? If you don't, I'd probably think twice about hiring you. It's absolutely shocking how many programmers are out there that don't understand the basics of unicode, UTF-8 and how it relates to the web.

An often-heard excuse is that you don't have to care about non-ASCII characters, for instance because all your users live in English-speaking countries. However, if you need to use the euro sign (€) or if you deal with people copy-pasting from Word documents, you most definitely will come across problems.

A simple call to utf8_encode is not actually enough. If some of your source data was already encoded as UTF-8, it gets encoded a second time and your non-ASCII characters end up garbled. Only use utf8_encode if you know your source data is encoded as ISO-8859-1.
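If you are stuck with a source that mixes both encodings, a guarded conversion is one possible stopgap. This is only a sketch: to_utf8 is a hypothetical helper name, it assumes the mbstring extension is loaded, and it assumes that anything that is not valid UTF-8 is ISO-8859-1. A latin1 string can by coincidence also be valid UTF-8, so this is a heuristic and not a replacement for fixing the source.

```php
<?php

/**
 * Returns $input as UTF-8.
 *
 * Assumption: input that is not valid UTF-8 is ISO-8859-1.
 */
function to_utf8(string $input): string
{
    // Already valid UTF-8: leave it alone to avoid double encoding.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Otherwise treat the bytes as ISO-8859-1 and convert them.
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

echo urlencode(to_utf8("\xFC")), "\n";     // latin1 input gets converted
echo urlencode(to_utf8("\xC3\xBC")), "\n"; // UTF-8 input is returned unchanged
```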

The one true way to go about it is to make sure that every step of the way in your web application is UTF-8, including your HTTP/HTML content type, MySQL database and anything that ingests data for your application (email, CSV importers, XML readers, web services). Once you are absolutely sure every part of your application is UTF-8, and you have converted any old data, things will start to behave correctly.

3. CDATA is never a solution

It might be tempting to solve any encoding issues by simply surrounding it with <![CDATA[ and ]]>. This might make sure that XML parsers don't throw an error when reading, but they still have 'incorrect' characters. If your XML document has CDATA tags, or you think you need CDATA, you are probably wrong.

More often than not, using CDATA actually stems from encoding problems (see section 2). CDATA is not a method to encode binary data; XML parsers will still throw errors if they come across certain byte sequences. If you really do need to encode binary data in XML, the best way is to use something like base64_encode instead.
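A minimal sketch of the base64 route. The element name 'attachment' and the 'encoding' attribute are made up for illustration; whatever convention you pick has to be agreed on with the consumer of the feed.

```php
<?php

// Arbitrary bytes, including ones that are not allowed in XML text.
$binary = "\x00\x07\xFF binary payload";

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('attachment');          // hypothetical element name
$xmlWriter->writeAttribute('encoding', 'base64'); // hypothetical convention
$xmlWriter->text(base64_encode($binary));         // base64 output is plain ASCII
$xmlWriter->endElement(); // attachment
$xmlWriter->endDocument();

$document = $xmlWriter->outputMemory();
echo $document;
```

The consumer reads the element contents and runs them through base64_decode to get the original bytes back.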

If your XML feed uses CDATA because of encoding issues you actually defer your problem to the consumer of your XML feed. So instead of seeing 'weird characters' on your side, the person that reads your xml feed now has no good way to detect which encoding was actually used. If it's for example an RSS feed you're generating, this can result in RSS readers throwing errors, or characters showing up incorrectly.

4. Be liberal with whitespace

An error like "unexpected character at line 1, column 176456" is much harder to debug than "line 5078, column 24". Whitespace between XML tags usually does not have any significance, so you can add as much indentation and as many line breaks (\n) as you want. Note that tools such as XMLWriter can indent for you automatically; with XMLWriter this is off by default and enabled with setIndent(true).
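A short sketch of letting XMLWriter take care of the whitespace. The element names are made up, and the two-space indent string is just a preference.

```php
<?php

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();

// Indentation is off by default and has to be switched on after opening.
$xmlWriter->setIndent(true);
$xmlWriter->setIndentString('  ');

$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('order');               // hypothetical element names
$xmlWriter->writeElement('order-number', '5078');
$xmlWriter->endElement(); // order
$xmlWriter->endDocument();

$document = $xmlWriter->outputMemory();
echo $document;
```

Every element now ends up on its own line, so a parser error points at a useful line number.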

5. Be verbose

Even though you might easily figure out that <ORD_NR> means 'order number', there's no reason why you shouldn't actually state it as <order-number>. Note that the following rules appear to be favored by most people:

  • Use lowercase for tags and attribute names.
  • Use dashes (-) to separate words, not underscores (_).
  • Minimize the use of attributes, nested tags allow more flexibility.

6. Be careful with entities

XML only predefines five entities: &lt; (<), &gt; (>), &amp; (&), &quot; (") and &apos; ('). Any other named entity will simply not work and will throw errors, unless the document's DTD declares it.

HTML DTDs add many entities, so if you're mostly used to HTML you might expect other entities to work. If your source data already has entities, you might have to get rid of these first.

In PHP it means you should use htmlspecialchars, instead of htmlentities.
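As a sketch of how the two steps could fit together when the source data is HTML-flavoured: first decode whatever HTML entities are in there, then escape only what XML requires. The function name xml_escape is made up, and the sketch assumes the input is UTF-8 text that may contain HTML entities.

```php
<?php

/**
 * Turns HTML-flavoured text into text that is safe to embed in XML.
 *
 * Assumption: $text is UTF-8 and may contain HTML entities such as &eacute;.
 */
function xml_escape(string $text): string
{
    // Step 1: turn HTML entities back into real characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 2: escape only the characters XML cares about.
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo xml_escape('caf&eacute; &amp; bar'), "\n";
```

If you use XMLWriter as suggested in section 1, only the first step is needed, because the writer does the escaping itself.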

Feel free to discuss, disagree, or add on to this list in the comments, I'm happy to hear your experiences.

mbstring function overloading: don't use it

As a library author, the worst thing I have to deal with is PHP settings that affect global behaviour. Some examples of what this means in practice:

  • Make sure that the library still works in your specific locale setting.
  • Don't rely on a specific error_reporting setting to catch errors.
  • If it was 1997, don't rely on a specific magic_quotes or register_globals setting.
  • Don't rely on the current setting of mb_internal_encoding, and instead always pass the desired encodings to the mb_* functions.

Not only should I not rely on these settings, I also can't change them. I should assume that the application using my library might have a preference for a specific setting, so I can't dictate what the setting should be. The exception to this are cases where I change a setting temporarily and revert it.

Obviously I'm not perfect and not aware of every flag that changes the environment. When I come across incompatibility bug reports I'll quickly try to change the bits that affect this compatibility.

So now I'm faced with a bug report about my library failing when mbstring function overloading is turned on. Definitely something I've missed.

mbstring overloading alters the behaviour of 17 common PHP string functions, such as strpos and substr. Because I deal with binary data, this fails in a number of places. The only solution is to look for all the instances where I'm using these functions and, for example, replace strlen($string) with mb_strlen($string, '8bit').
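One way to keep that replacement manageable is to route all byte-level string handling through a pair of small wrappers. This is only a sketch: binary_strlen and binary_substr are made-up names, and the fallback branch assumes that when mbstring is not loaded, overloading cannot be active either.

```php
<?php

/**
 * Length of $data in bytes, regardless of mbstring.func_overload.
 */
function binary_strlen(string $data): int
{
    if (function_exists('mb_strlen')) {
        // '8bit' makes mbstring count bytes instead of characters.
        return mb_strlen($data, '8bit');
    }
    return strlen($data); // no mbstring, so no overloading
}

/**
 * Byte-wise substr, regardless of mbstring.func_overload.
 */
function binary_substr(string $data, int $start, ?int $length = null): string
{
    if (function_exists('mb_substr')) {
        return mb_substr($data, $start, $length, '8bit');
    }
    return $length === null ? substr($data, $start) : substr($data, $start, $length);
}

echo binary_strlen("\xC3\xBC"), "\n"; // number of bytes, not characters
```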

I'm using these functions in a ton of places though. I'm wondering in this case if I should simply throw an error when I find out function overloading is turned on.
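For reference, such a guard could look roughly like the sketch below. The function name is made up. It relies on mbstring.func_overload being a bitmask in which the value 2 covers the str* functions, and on ini_get returning false when a setting does not exist, which is the case on PHP versions that no longer ship the setting.

```php
<?php

/**
 * Tells whether mbstring overloads the plain string functions.
 */
function string_overloading_enabled(): bool
{
    // false means the setting does not exist in this PHP build.
    $value = ini_get('mbstring.func_overload');

    // Bit 2 of the mask covers strlen, strpos, substr and friends.
    return $value !== false && ((int) $value & 2) === 2;
}

if (string_overloading_enabled()) {
    throw new RuntimeException(
        'This library handles binary data and does not work with mbstring.func_overload enabled.'
    );
}
```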

Conclusion

To make a long story short: if you ever intend to use external PHP libraries, there's a very good chance they haven't accounted for mbstring.func_overload. I can highly recommend always using the mb_* functions directly, and keeping that setting off.

Filesystem encoding and PHP

Many PHP applications save files to a local filesystem. Most of the time, the bulk of readers here will only ever store files with US-ASCII filenames, either because the filenames are simply based on database fields (as you should try to do in most cases), or because most of your users never have a need for non-English characters.

When you do though, it's important to know how operating systems cope with these characters. Unsurprisingly, all of them do this differently.

To illustrate the differences, I'm going to do some tests on Ubuntu, OS/X 10.6.3 and Windows XP and 7.

Linux

In Linux filenames are binary. Linux does not care what encoding your filenames are in, and it will accept anything besides 0x00 and the path separator (/). This means filenames can contain newlines (\n), tabs (\t) or even a bell (ascii code 07).

To illustrate this, I'm going to make a tiny file using a php script:

  <?php
  file_put_contents("saved by the \x07.txt","contents");
  ?>

After running this I simply get a question mark when viewing the file using 'ls', but when I auto-complete it, it expands to ^G (which is bell). In Nautilus, this is displayed:

[Screenshot: fsencoding_gnome.png]

If I run this script:

  <?php
  print_r(glob('saved*'));
  ?>

The output is simply missing my bell character, and I get a short beep.

This doesn't mean it's a good idea to do this. Even though the underlying filesystem is binary-safe, applications that list filenames will still have to make a decision on an encoding to display the characters to the user. You can't even show this character in any PHP page, and firewalls might even block this if you used this in a url.
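If filenames come from users, one option is to refuse control characters before they ever reach the disk. A minimal sketch, with a made-up function name, that only looks at the ASCII control range and leaves every other byte alone:

```php
<?php

/**
 * Tells whether a filename contains ASCII control characters
 * (0x00-0x1F and 0x7F), such as bell, tab or newline.
 */
function has_control_characters(string $filename): bool
{
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match('/[\x00-\x1F\x7F]/', $filename) === 1;
}

var_dump(has_control_characters("saved by the \x07.txt"));
var_dump(has_control_characters("test_\xC3\xBC.txt"));
```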

This also applies to the applications on your Linux machine. Most of them, such as Gnome Terminal and Nautilus, default to UTF-8. However, I believe for the PuTTY application this was for the longest time ISO-8859-1 (latin1). A symptom of this is that any non-ascii characters look different when you read them in PuTTY vs. Nautilus.

The other thing I wanted to test on Linux is how it behaves if I create a file in the file manager using a special character. For this example I'm using ü, because it's a bit ambiguous as there are multiple ways to encode it using unicode (more on this later) and it also appears in ISO-8859-1.

Back to the test. I'm now creating a new file from the Nautilus interface, and want to see how it shows up for PHP. I'm creating a file called test_ü.txt and listing it with the following script:

  <?php
  list($file) = glob('test_*');
  echo urlencode($file) . "\n";
  ?>

Output:

  test_%C3%BC.txt

%C3%BC is the UTF-8 encoding of codepoint U+00FC, which is the most common way to encode ü. Great!

The last test is to create this file using ISO-8859-1/latin1 encoding. The latin1 representation of ü is 0xFC. The script for this:

  <?php
  file_put_contents("uumlaut_\xFC.txt","contents");
  ?>

Linux stores the file with that exact byte sequence. 'ls' shows the question mark again, and this time in GNOME I'm getting the typical 'incorrect encoding' question mark.

OS/X

On OS/X all filenames are encoded as UTF-16. You don't have to know about this, because the APIs PHP uses are UTF-8, and filenames are transparently translated for you.

We'll start with the bell test. The result is the same as on Linux. The bell character is represented by ?. When checking it out in Finder, the character is missing altogether. It's definitely still there though, as the following script illustrates:

  <?php
  list($filename) = glob('saved*');
  echo urlencode($filename) . "\n";
  ?>

Output:

  saved+by+the+%07.txt

Next, we're going to do the ü test. First, I'll encode it as latin-1, which would be invalid for this UTF-8 filesystem.

  <?php
  file_put_contents("uumlaut_\xFC.txt","contents");
  ?>

This one is weird. If I now do 'ls', the result is this:

  drwxr-xr-x 10 evert2 staff 340 16 Apr 17:08 .
  drwxr-xr-x 32 evert2 staff 1088 16 Apr 16:53 ..
  -rw-r--r-- 1 evert2 staff 8 16 Apr 16:54 saved by the ?.txt
  -rw-r--r-- 1 evert2 staff 121 16 Apr 16:54 test1.php
  -rw-r--r-- 1 evert2 staff 8 16 Apr 16:54 test2.php
  -rw-r--r-- 1 evert2 staff 101 16 Apr 17:07 test3.php
  -rw-r--r-- 1 evert2 staff 57 16 Apr 17:08 test4.php
  -rw-r--r-- 1 evert2 staff 8 16 Apr 17:08 uumlaut_%FC.txt

Instead of taking the literal bytes, OS/X urlencoded them, and stored those sequences instead. This translation is transparent; but it might be confusing if you ever try to store latin1 filenames from your users.

The last test is to store the umlaut again, but this time using the correct utf-8 sequence:

  <?php
  file_put_contents("uumlaut2_\xC3\xBC.txt","contents");
  ?>

Upon first sight this seems to have worked as expected, but it gets weird when we check out how this was actually stored:

  <?php
  list($file) = glob('uumlaut2_*');
  echo urlencode($file) . "\n";
  ?>

Output:

  uumlaut2_u%CC%88.txt

OS/X stored a u followed by the bytes 0xCC 0x88 instead of 0xC3 0xBC. Note that the u is not a typo: OS/X uses a different way to store the ü. The sequence we wrote is the UTF-8 encoding of the single unicode codepoint U+00FC, which is ü. OS/X instead stores a plain u followed by the combining diaeresis (U+0308, the two little dots) as separate codepoints, taking up 3 bytes instead of 2.

This is called normalization. Unicode defines a few different normalization models which dictate how these combinations of characters are stored. So even though they are different byte-sequences and different codepoints they are still considered equivalent.

The PHP intl extension includes a class that allows you to do the unicode normalization yourself, namely the Normalizer class. The documentation also includes a short description of what the 4 different normalization forms are. OS/X uses a slightly modified version of Normalization Form D (yes, nobody can ever standardize on anything).

This is how you would do this conversion yourself:

  <?php

  $before = "\xC3\xBC";
  $after = Normalizer::normalize($before, Normalizer::FORM_D);

  echo 'Before: ', urlencode($before), "\n";
  echo 'After: ', urlencode($after), "\n";
  ?>

Output:

  Before: %C3%BC
  After: u%CC%88

This normalization process for OS/X is also transparent. Whenever you try to open a file with the wrong normalization form, OS/X will put it in form D before opening.
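Opening files may be transparent, but comparing a name you wrote with a name you read back (from glob, for instance) is not: the bytes differ. A sketch of a comparison that normalizes both sides first. The function name is made up, form C is an arbitrary choice for the comparison, and the code falls back to a plain byte comparison when the intl extension is missing or a name is not valid UTF-8.

```php
<?php

/**
 * Compares two filenames while ignoring unicode normalization differences.
 */
function same_filename(string $a, string $b): bool
{
    if (!class_exists('Normalizer')) {
        return $a === $b; // intl not available: byte comparison only
    }

    $normalizedA = Normalizer::normalize($a, Normalizer::FORM_C);
    $normalizedB = Normalizer::normalize($b, Normalizer::FORM_C);

    // normalize() returns false for invalid UTF-8 input.
    if ($normalizedA === false || $normalizedB === false) {
        return $a === $b;
    }

    return $normalizedA === $normalizedB;
}

// The name as written by PHP versus the name as stored by OS/X.
var_dump(same_filename("uumlaut2_\xC3\xBC.txt", "uumlaut2_u\xCC\x88.txt"));
```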

Windows

Windows also uses UTF-16 to store filenames (using NTFS). Just like on OS/X, this translation is done automatically by the filesystem APIs PHP uses. We'll start with the bell test:

  <?php
  file_put_contents("saved by the \x07.txt","contents");
  ?>

Output:

  Warning: file_put_contents(saved by the .txt): failed to open stream: Invalid argument in C:\Documents and Settings\Administrator\test\test.php on line 2

Indeed, Windows does not allow control characters such as bell. The second thing we'll try is the latin-1 encoded ü:

  <?php
  file_put_contents("uumlaut_\xFC.txt","contents");
  list($file) = glob('uumlaut_*');
  echo urlencode($file) . "\n";
  ?>

Output:

  uumlaut_%FC.txt

Not only did Windows accept this encoding, it also displayed correctly in both cmd.exe and Windows Explorer. So it appears that Windows and PHP actually translate from and to ISO-8859-1/latin1 instead of UTF-8. Trying this with the UTF-8 encoding of ü confirms it.

  <?php
  file_put_contents("uumlaut2_\xC3\xBC.txt","contents");
  list($file) = glob('uumlaut2_*');
  echo urlencode($file) . "\n";
  ?>

Output:

  uumlaut2_%C3%BC.txt

While Windows stores these bytes as given, the filename is now garbled in cmd.exe and Windows Explorer. Here it looks like Ã¼. This is pretty bad. I do know that Windows does support Unicode, so I can't help but wonder what would happen if I do the exact opposite: making a file containing non-ascii characters in Windows Explorer, and reading out the filename in PHP.

The results were interesting. I used the ü again, and 한글, which is the name of the Korean writing system, hangul. With 2 files in this directory, I simply did:

  <?php
  $files = glob('*');
  foreach($files as $file) {
      echo urlencode($file), "\n";
  }
  echo "total: " . count($files) . "\n";
  ?>

Output:

  test.php
  uumlaut%FC.txt
  total: 2

My Korean file was completely missing. Just to make sure I did the same with scandir:

  <?php
  $files = scandir('.');
  foreach($files as $file) {
      echo urlencode($file), "\n";
  }
  echo "total: " . count($files) . "\n";
  ?>

Output:

  .
  ..
  hangul_%3F%3F.txt
  test.php
  uumlaut%FC.txt
  total: 5

Oddly enough it did show up here. This time however, the Korean characters were replaced by %3F, which is, surprise: the question mark. We've seen characters replaced by question marks before, but this is the first time it ends up in a literal string.

Conclusion

Using non-latin characters in filenames is messy. It would be possible to provide a consistent experience, if it weren't for Windows. Windows does have all the proper APIs to deal with international filenames, but I can only assume PHP simply does not use them. I do believe this was scheduled for PHP 6, but that is off the table for now. I hope the filesystem APIs are replaced even before the entire language is unicode-based.

While the Linux solution (treat everything as binary, allow everything besides 0x00) might seem like the most straightforward, in the end filenames are meant to be written or read by people, which means they have to be encoded one way or another.

The best system in this case really is OS/X, which not only treats everything as UTF-8, it also handles incorrect sequences well and makes sure that characters with an identical meaning are also always stored the same way (normalization).

Here's what I recommend:

If you want to support all characters on all operating systems in a consistent manner, you have no other option than to use an intermediate encoding. You could for instance simply urlencode all your filenames before writing them to disk.

Url-encoding does not mean you can forget about the encoding though. urlencoding means that a different way is used to store certain bytes, but the characters they represent remain the same. Therefore, you should always make sure that the filenames you're using are valid UTF-8 sequences. UTF-8 is today's encoding of choice.
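Put together, that recommendation could be sketched as follows. The function name encode_filename is made up. The sketch assumes the mbstring extension for the UTF-8 check, and it additionally normalizes to form C when the intl extension is available, which is my own addition so that equivalent names end up as the same bytes on every platform.

```php
<?php

/**
 * Turns a user-supplied name into an ASCII-only filename.
 *
 * @throws InvalidArgumentException when $name is not valid UTF-8
 */
function encode_filename(string $name): string
{
    if (!mb_check_encoding($name, 'UTF-8')) {
        throw new InvalidArgumentException('Filename is not valid UTF-8');
    }

    // Optional: store equivalent names as identical byte sequences.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
        if ($normalized !== false) {
            $name = $normalized;
        }
    }

    // Only ASCII characters are left after this step.
    return urlencode($name);
}

echo encode_filename("test_\xC3\xBC.txt"), "\n";
```

Reading the original name back is a matter of calling urldecode on the name found on disk.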

If you are absolutely sure you will only use characters in the ISO-8859-1/latin-1 character set, the following table applies:

Operating system | Recommendation
Windows | Encode using ISO-8859-1.
Linux | Encode using UTF-8 (will accept other encodings, but not recommended).
OS/X | Encode using UTF-8. Will transparently encode to normalization form D.

Here's a table of sequences and what happens on specific operating systems:

url-encoded filename | description | Linux | OS/X | Windows
%07 | bell | %07 on disk | %07 on disk | throws error and doesn't save
%FC | ü in ISO-8859-1 | %FC on disk, question marks in UIs | %25FC on disk (%25 = %, so the literal string %FC on disk) | %FC on disk, correct in UI
%C3%BC | ü in UTF-8 normalization form C | %C3%BC on disk, correct in UI | u%CC%88 on disk, correct in UI | %C3%BC on disk, shows up as Ã¼ in UIs
u%CC%88 | ü in UTF-8 normalization form D | u%CC%88 on disk, correct in UI | u%CC%88 on disk, correct in UI | untested, but assumed to be similar to the last test case

Configuration list

Lastly, the list of relevant software I used for this:

  • Windows
    • Tested on XP SP3 and 7
    • PHP 5.3.2 VC9 x86 build from windows.php.net
    • NTFS filesystem
  • Linux
    • Ubuntu 9.10
    • PHP 5.2.10 from ubuntu package repository
    • ext3 filesystem
  • OS/X
    • v10.6.3
    • PHP 5.3.1 as shipped with OS/X
    • HFS+ filesystem

Unicode nearing 50% of the web

According to a recent post from the Google Blog, Unicode is nearing 50% uptake on the web. The graph is rather steep as well:

[Figure: unicode uptake graph]

This is pretty good news. I've had the 'pleasure' of working on a number of integration projects where the 3rd party was still using iso-8859-1 (aka latin-1). Usually when this is the case, it's not by choice but because of their software's default settings (browsers, MySQL, etc.). I for one hope non-unicode charsets will soon be a thing of the past.

One other note in the post was about ligatures, such as ﬁ and the Dutch ĳ. If this is the first time you've heard about these, you might be surprised to see that you can (likely) only copy-paste ĳ as a whole, and not just the i or j. It's one unicode character, not two. It just made me wonder: what kind of software would generate these, and more importantly why?

javascript's escape and encodeURI vs. PHP $_POST

I just stumbled upon an odd encoding issue with a web application.

Basically, data is coming into our PHP application through JavaScript's XMLHttpRequest (ajax). The data is sent using the standard form encoding (application/x-www-form-urlencoded), and picked up by PHP using the $_POST array. Any strings in a form POST request are 'urlencoded', also known as percent-encoding. As an example, this will turn a space into the often-seen %20.

Normally everything in the $_POST and $_GET arrays is already decoded, so when you're dealing with these arrays you don't really have to think about this. This time however, I was dealing with some non-latin unicode characters and for some reason they were never decoded and ended up in the database as raw url-encoded strings.

Doing a bit of research led me to the following: normally any special character is encoded as %XX, XX being two hexadecimal digits. These values simply represent bytes. The values I got were different altogether and took the form %uXXXX. I just assumed this was part of standard uri-encoding for unicode characters, so I was still a bit shook-up to see that PHP didn't just pick them up.

After a bit more research, I found out that the %u representation was rejected by the W3C, which is probably also why the PHP authors decided not to implement it. JavaScript actually has 2 different methods to do percent-encoding, namely:

escape("☢"); // returns %u2622
encodeURI("☢"); // returns %E2%98%A2

Guess which one we were using?

Even though the %u syntax is arguably better to represent unicode characters, the W3C seems to have voted against the syntax for backwards compatibility reasons. Before this happened the escape method was already adopted in JavaScript, which in turn caused me to stumble upon this problem and write an article about it.
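The proper fix is on the JavaScript side: stop using escape. For data that has already ended up in the database in the %uXXXX form, a server-side cleanup could look roughly like the sketch below. The function name is made up and the sketch assumes the mbstring extension. It only touches %uXXXX sequences and treats consecutive ones as UTF-16, so that surrogate pairs survive; plain %XX sequences are left alone.

```php
<?php

/**
 * Replaces %uXXXX sequences, as produced by JavaScript's escape(),
 * with the UTF-8 encoding of the characters they stand for.
 */
function decode_percent_u(string $value): string
{
    $result = preg_replace_callback(
        '/(?:%u[0-9a-fA-F]{4})+/',
        function (array $match): string {
            // Collect the whole run as UTF-16BE code units first.
            $utf16 = '';
            foreach (explode('%u', substr($match[0], 2)) as $hex) {
                $utf16 .= pack('n', hexdec($hex));
            }
            return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        },
        $value
    );

    // preg_replace_callback returns null on failure: keep the input then.
    return $result ?? $value;
}

echo urlencode(decode_percent_u('%u2622')), "\n";
```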

The more you know..

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
