Guidelines for generating XML

Over the last little while I've come across quite a few XML feed generators written in PHP, with varying degrees of 'correctness'. Even though generating XML should be very simple, there are still quite a few pitfalls I feel every PHP (or insert-your-language) developer should know about.

1. You are better off using an XML library

This is the first and foremost rule. Most people end up generating their XML using simple string concatenation, even though there are many dedicated tools out there that really help you generate XML.

In PHP land the best example is XMLWriter. It is actually quite easy to use:

  <?php

  $xmlWriter = new XMLWriter();
  $xmlWriter->openMemory();
  $xmlWriter->startDocument('1.0', 'UTF-8');
  $xmlWriter->startElement('root');
  $xmlWriter->text('Contents of the root tag');
  $xmlWriter->endElement(); // root
  $xmlWriter->endDocument();
  echo $xmlWriter->outputMemory();

  ?>

Granted, XMLWriter is verbose, but you have to worry a lot less about escaping and about the well-formedness of your XML documents.
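To make the escaping point concrete, here is a minimal sketch comparing string concatenation with XMLWriter for a value that contains special characters. The element name 'title', the attribute name 'label' and the sample text are made up for this example.

```php
<?php

// Sample value containing characters that are special in XML.
$title = 'Fish & Chips <"special">';

// String concatenation: the ampersand and angle brackets end up
// unescaped, so the result is not well-formed XML.
$broken = '<title>' . $title . '</title>';

// XMLWriter escapes text and attribute values for us.
$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startElement('title');
$xmlWriter->writeAttribute('label', $title);
$xmlWriter->text($title);
$xmlWriter->endElement(); // title
$safe = $xmlWriter->outputMemory();
```

Feeding $broken to a parser such as simplexml_load_string fails, while $safe parses and gives back the original text.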

2. Understand Unicode

Do you know the difference between a byte, a character and a codepoint? If you don't, I'd probably think twice about hiring you. It's absolutely shocking how many programmers out there don't understand the basics of Unicode, UTF-8 and how they relate to the web.
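As a quick illustration of the difference, the sketch below counts bytes and codepoints for the same strings. It assumes the mbstring extension is available; the variable names are only for this example.

```php
<?php

// The euro sign is a single codepoint (U+20AC) that takes
// three bytes when encoded as UTF-8.
$euro = "\xE2\x82\xAC";

$euroBytes      = strlen($euro);             // counts bytes
$euroCodepoints = mb_strlen($euro, 'UTF-8'); // counts codepoints

// An "e" followed by a combining acute accent (U+0301) looks like
// one character on screen, but consists of two codepoints.
$accented = "e\xCC\x81";

$accentedBytes      = strlen($accented);
$accentedCodepoints = mb_strlen($accented, 'UTF-8');
```

Here strlen reports 3 bytes for both strings, while mb_strlen reports 1 codepoint for the euro sign and 2 for the accented "e".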

An often-heard excuse is that there is no need to care about non-ASCII characters, for example because all users are in English-speaking countries. However, if you need to use the euro sign (€) or if you deal with people copy-pasting from Word documents, you most definitely will come across problems.

A simple call to utf8_encode is not actually enough. If some of your source data was already encoded as UTF-8, it gets encoded a second time and comes out garbled. Only use utf8_encode if you know your source data is encoded as ISO-8859-1.

The one true way to go about it is to make sure that every step of the way in your web application is UTF-8. This includes your HTTP/HTML content type, your MySQL database and basically anything that ingests data for your application (email, CSV importers, XML readers, web services). Once you are absolutely sure every part of your application is UTF-8, and you have converted any old data, things will start to behave correctly.
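If you are stuck with source data of mixed origin, a defensive conversion could look roughly like the sketch below. The helper name toUtf8 is hypothetical, it assumes the mbstring extension, and it guesses that anything that is not valid UTF-8 must be ISO-8859-1. That guess is a stopgap, not a replacement for knowing your encodings.

```php
<?php

// Hypothetical helper: returns $input as UTF-8, converting from
// ISO-8859-1 only when the bytes are not already valid UTF-8.
function toUtf8($input) {
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it untouched
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1 = toUtf8("caf\xE9");     // "café" as ISO-8859-1 bytes
$fromUtf8   = toUtf8("caf\xC3\xA9"); // "café" as UTF-8 bytes
```

Both calls end up with the same UTF-8 bytes, and the second string is not encoded twice.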

3. CDATA is never a solution

It might be tempting to solve any encoding issues by simply surrounding your content with <![CDATA[ and ]]>. This might make sure that XML parsers don't throw an error when reading, but the document still contains 'incorrect' characters. If your XML document has CDATA sections, or you think you need CDATA, you are probably wrong.

More often than not, using CDATA actually stems from encoding problems (see section 2). CDATA is not a method to encode binary data; XML parsers will still throw errors if they come across certain byte sequences. If you really do need to embed binary data in XML, the best way is to use something like base64_encode instead.
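A minimal sketch of the base64 approach follows. The element name 'attachment' and the attribute 'encoding' are made up for this example; they are not part of any standard.

```php
<?php

// Arbitrary binary data, including bytes that are illegal in XML.
$binary = "\x00\x01\xFF]]>";

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
// 'attachment' and 'encoding' are hypothetical names.
$xmlWriter->startElement('attachment');
$xmlWriter->writeAttribute('encoding', 'base64');
$xmlWriter->text(base64_encode($binary));
$xmlWriter->endElement(); // attachment
$xmlWriter->endDocument();
$xml = $xmlWriter->outputMemory();

// The consumer reverses the encoding after parsing.
$decoded = base64_decode((string) simplexml_load_string($xml));
```

Because base64 only uses plain ASCII letters, digits and a few symbols, the document stays well-formed no matter what the original bytes were.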

If your XML feed uses CDATA because of encoding issues you actually defer your problem to the consumer of your XML feed. So instead of seeing 'weird characters' on your side, the person that reads your xml feed now has no good way to detect which encoding was actually used. If it's for example an RSS feed you're generating, this can result in RSS readers throwing errors, or characters showing up incorrectly.

4. Be liberal with whitespace

An error like "unexpected character at line 1, column 176456" is much harder to debug than "line 5078, column 24". Whitespace between XML tags usually has no significance, so you can add as much indentation and as many linebreaks (\n) as you want. Note that tools such as XMLWriter can indent for you, although XMLWriter only does so once you enable it with setIndent(true).
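For XMLWriter, indentation is off by default. A short sketch of switching it on, with made-up element names:

```php
<?php

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
// Enable indentation before writing anything.
$xmlWriter->setIndent(true);
$xmlWriter->setIndentString('  ');
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('orders');
$xmlWriter->startElement('order');
$xmlWriter->writeElement('order-number', '1234');
$xmlWriter->endElement(); // order
$xmlWriter->endElement(); // orders
$xmlWriter->endDocument();
$xml = $xmlWriter->outputMemory();
```

Each element now ends up on its own line, so parser errors point at a useful line number.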

5. Be verbose

Even though you might easily figure out that <ORD_NR> means 'order number', there's no reason why you shouldn't actually state it as <order-number>. The following conventions seem to be favored by most people:

  • Use lowercase for tags and attribute names.
  • Use dashes (-) to separate words, not underscores (_).
  • Minimize the use of attributes, nested tags allow more flexibility.

6. Be careful with entities

The only predefined entities in XML are &lt; (<), &gt; (>), &amp; (&), &quot; (") and &apos; ('). Any other named entity will simply not work and will throw errors, unless it is declared in a DTD.

HTML DTDs add many entities, so if you're mostly used to HTML you might expect other entities to work. If your source data already has entities, you might have to get rid of these first.

In PHP this means you should use htmlspecialchars instead of htmlentities.
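As a sketch of what that looks like when the source data already contains HTML entities: decode them to real characters first, then escape only what XML needs. The sample string and the element name 'name' are made up, and the flags assume PHP 5.4 or newer.

```php
<?php

// Source data that already contains HTML entities.
$source = 'Caf&eacute; &amp; bar &euro;5';

// Step 1: turn the HTML entities back into real UTF-8 characters.
$plain = html_entity_decode($source, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Step 2: escape only the characters XML cares about.
$forXml = htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');

$xml = '<name>' . $forXml . '</name>';
```

Using htmlentities in step 2 would bring back &eacute; and &euro;, which an XML parser does not know about.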

Feel free to discuss, disagree, or add on to this list in the comments, I'm happy to hear your experiences.

Converting ICalendar to XML

I've started working on a CalDAV implementation, which also requires analysis of iCalendar (RFC 2445) objects.

iCalendar objects have properties, components (such as VEVENT, VTODO) and attributes. This is awfully similar to XML. So instead of trying to come up with a complicated parser and object structure, I decided to just convert it to XML and use PHP's SimpleXML.

This is my current script:

  <?php

  function iCalendarToXML($icalendarData) {

      // Detecting line endings. strpos() returns 0 for a match at the very
      // start of the string, so we compare against false explicitly.
      if (strpos($icalendarData, "\r\n") !== false) $lb = "\r\n";
      elseif (strpos($icalendarData, "\n") !== false) $lb = "\n";
      else $lb = "\r\n";

      // Splitting up items per line
      $lines = explode($lb, $icalendarData);

      // Properties can be folded over multiple lines. In this case every
      // continuation line will be preceded by a space or tab.
      $lines2 = array();
      foreach ($lines as $line) {

          // Skip empty lines, such as the one after the final line ending
          if ($line === '') continue;

          if (($line[0] == " " || $line[0] == "\t") && $lines2) {
              $lines2[count($lines2) - 1] .= substr($line, 1);
              continue;
          }

          $lines2[] = $line;

      }

      $xml = '<?xml version="1.0"?>' . "\n";

      $spaces = 0;
      foreach ($lines2 as $line) {

          $matches = array();
          // This matches PROPERTYNAME;ATTRIBUTES:VALUE
          if (preg_match('/^([^:;]*)(?:;([^:]*))?:(.*)$/', $line, $matches)) {
              $propertyName = strtoupper($matches[1]);
              $attributes = $matches[2];
              $value = $matches[3];

              // If the line was in the format BEGIN:COMPONENT or END:COMPONENT, we need to special case it.
              if ($propertyName == 'BEGIN') {
                  $xml .= str_repeat(" ", $spaces);
                  $xml .= '<' . strtoupper($value) . ">\n";
                  $spaces += 2;
                  continue;
              } elseif ($propertyName == 'END') {
                  // Never go below zero, even if BEGIN/END are unbalanced
                  $spaces = max(0, $spaces - 2);
                  $xml .= str_repeat(" ", $spaces);
                  $xml .= '</' . strtoupper($value) . ">\n";
                  continue;
              }

              $xml .= str_repeat(" ", $spaces);
              $xml .= '<' . $propertyName;
              if ($attributes) {
                  // There can be multiple attributes
                  $attributes = explode(';', $attributes);
                  foreach ($attributes as $att) {

                      // An attribute without '=' gets an empty value
                      list($attName, $attValue) = array_pad(explode('=', $att, 2), 2, '');
                      $xml .= ' ' . $attName . '="' . htmlspecialchars($attValue) . '"';

                  }
              }

              $xml .= '>' . htmlspecialchars($value) . '</' . $propertyName . ">\n";

          }

      }

      return $xml;

  }

  ?>

This will convert:

  BEGIN:VCALENDAR
  VERSION:2.0
  PRODID:-//Example Corp.//CalDAV Client//EN
  BEGIN:VTIMEZONE
  LAST-MODIFIED:20040110T032845Z
  TZID:US/Eastern
  BEGIN:DAYLIGHT
  DTSTART:20000404T020000
  RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4
  TZNAME:EDT
  TZOFFSETFROM:-0500
  TZOFFSETTO:-0400
  END:DAYLIGHT
  BEGIN:STANDARD
  DTSTART:20001026T020000
  RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
  TZNAME:EST
  TZOFFSETFROM:-0400
  TZOFFSETTO:-0500
  END:STANDARD
  END:VTIMEZONE
  BEGIN:VEVENT
  DESCRIPTION:Hello Im evert
   Next line also
    Blabla
  ATTENDEE;PARTSTAT=ACCEPTED;ROLE=CHAIR:mailto:cyrus@example.com
  ATTENDEE;PARTSTAT=NEEDS-ACTION:mailto:lisa@example.com
  DTSTAMP:20060206T001220Z
  DTSTART;TZID=US/Eastern:20060104T100000
  DURATION:PT1H
  LAST-MODIFIED:20060206T001330Z
  ORGANIZER:mailto:cyrus@example.com
  SEQUENCE:1
  STATUS:TENTATIVE
  SUMMARY:Event #3
  UID:DC6C50A017428C5216A2F1CD@example.com
  X-ABC-GUID:E1CX5Dr-0007ym-Hz@example.com
  END:VEVENT
  END:VCALENDAR

To:

  <?xml version="1.0"?>
  <VCALENDAR>
    <VERSION>2.0</VERSION>
    <PRODID>-//Example Corp.//CalDAV Client//EN</PRODID>
    <VTIMEZONE>
      <LAST-MODIFIED>20040110T032845Z</LAST-MODIFIED>
      <TZID>US/Eastern</TZID>
      <DAYLIGHT>
        <DTSTART>20000404T020000</DTSTART>
        <RRULE>FREQ=YEARLY;BYDAY=1SU;BYMONTH=4</RRULE>
        <TZNAME>EDT</TZNAME>
        <TZOFFSETFROM>-0500</TZOFFSETFROM>
        <TZOFFSETTO>-0400</TZOFFSETTO>
      </DAYLIGHT>
      <STANDARD>
        <DTSTART>20001026T020000</DTSTART>
        <RRULE>FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10</RRULE>
        <TZNAME>EST</TZNAME>
        <TZOFFSETFROM>-0400</TZOFFSETFROM>
        <TZOFFSETTO>-0500</TZOFFSETTO>
      </STANDARD>
    </VTIMEZONE>
    <VEVENT>
      <DESCRIPTION>Hello Im evertNext line also Blabla</DESCRIPTION>
      <ATTENDEE PARTSTAT="ACCEPTED" ROLE="CHAIR">mailto:cyrus@example.com</ATTENDEE>
      <ATTENDEE PARTSTAT="NEEDS-ACTION">mailto:lisa@example.com</ATTENDEE>
      <DTSTAMP>20060206T001220Z</DTSTAMP>
      <DTSTART TZID="US/Eastern">20060104T100000</DTSTART>
      <DURATION>PT1H</DURATION>
      <LAST-MODIFIED>20060206T001330Z</LAST-MODIFIED>
      <ORGANIZER>mailto:cyrus@example.com</ORGANIZER>
      <SEQUENCE>1</SEQUENCE>
      <STATUS>TENTATIVE</STATUS>
      <SUMMARY>Event #3</SUMMARY>
      <UID>DC6C50A017428C5216A2F1CD@example.com</UID>
      <X-ABC-GUID>E1CX5Dr-0007ym-Hz@example.com</X-ABC-GUID>
    </VEVENT>
  </VCALENDAR>
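Once the data is in this shape, reading it with SimpleXML could look roughly like the sketch below. To keep it self-contained it uses a shortened, hand-written fragment of the output above rather than calling the conversion function; the variable names are only for this example.

```php
<?php

// A shortened fragment in the shape of the converted output.
$xmlString = <<<XML
<?xml version="1.0"?>
<VCALENDAR>
  <VEVENT>
    <ATTENDEE PARTSTAT="ACCEPTED" ROLE="CHAIR">mailto:cyrus@example.com</ATTENDEE>
    <ATTENDEE PARTSTAT="NEEDS-ACTION">mailto:lisa@example.com</ATTENDEE>
    <LAST-MODIFIED>20060206T001330Z</LAST-MODIFIED>
    <SUMMARY>Event #3</SUMMARY>
  </VEVENT>
</VCALENDAR>
XML;

$calendar = simplexml_load_string($xmlString);
$event = $calendar->VEVENT;

$summary = (string) $event->SUMMARY;
// Element names containing a dash need the curly-brace syntax.
$lastModified = (string) $event->{'LAST-MODIFIED'};

// Map each attendee address to its participation status.
$attendees = array();
foreach ($event->ATTENDEE as $attendee) {
    $attendees[(string) $attendee] = (string) $attendee['PARTSTAT'];
}
```

Properties become child elements and iCalendar attributes become XML attributes, so both are reachable with plain property and array access.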

I hope this is useful to anyone else.

CDATA in xml.. bad idea?

While working on a simple feed parser, I hit upon some wordpress feeds.

I noticed that WordPress feeds make heavy use of CDATA to encode content (example here). I always figured this was a bad idea if you cannot control what ends up in the XML feed.

Doing some googling to check that I'm not just kicking up dust brought me to an xml.com article titled 'Escaped Markup Considered Harmful', which seems to agree with my standpoint for the following reason:

Escaping markup, particularly with CDATA sections, just doesn't work. There are other things that might be wrong that would make the documents not well formed. There are Unicode characters that are forbidden, there are encoding issues for the characters that are allowed, and there are sequences of characters that must be avoided. (e.g., "]]>"). Not to mention the fact that CDATA sections don't nest.

CDATA can't be used to just dump in any type of content that won't work in normal XML sections. You're still obligated to make your data valid Unicode. In fact, it's the opposite: there's no way to escape the ]]> character sequence inside a CDATA section.
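The only workaround I know of is not an escape at all: you end the CDATA section in the middle of the ]]> sequence and immediately start a new one. A minimal sketch, with a hypothetical helper name cdataWrap and a made-up element name 'code':

```php
<?php

// Hypothetical helper: wraps $text in CDATA, splitting every "]]>"
// over two sections so the sequence never appears inside one.
function cdataWrap($text) {
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$content = 'if (a[b[0]]> 1) { return; }';
$xml = '<code>' . cdataWrap($content) . '</code>';
```

A parser joins the two sections back together, but the fact that this is needed shows how little CDATA protects you from.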


About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
