Over the last little while I've come across quite a few XML feed generators written in PHP, with varying degrees of 'correctness'. Even though generating XML should be very simple, there are still quite a few pitfalls I feel every PHP (or insert-your-language) developer should know about.
1. You are better off using an XML library
This is the first and foremost rule. Most people end up generating their XML using simple string concatenation, even though there are many dedicated tools out there that really help you generate it.
In PHP land the best example is XMLWriter. It is actually quite easy to use:
<?php

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('root');
$xmlWriter->text('Contents of the root tag');
$xmlWriter->endElement(); // root
$xmlWriter->endDocument();
echo $xmlWriter->outputMemory();

?>
Granted, XMLWriter is verbose, but you have to worry a lot less about escaping and validating your xml documents.
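To make the escaping point concrete, here is a minimal sketch that feeds XMLWriter a string containing the characters that usually break hand-concatenated XML. The element name, attribute name and input string are made up for this example.

```php
<?php

// Hypothetical input containing characters that need escaping in XML.
$title = 'Tom & Jerry <"The Movie">';

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('movie');
// Attribute values are escaped by the writer, including the double quotes.
$xmlWriter->writeAttribute('title', $title);
// Text nodes are escaped as well.
$xmlWriter->text($title);
$xmlWriter->endElement(); // movie
$xmlWriter->endDocument();
$xml = $xmlWriter->outputMemory();

echo $xml;
```

Reading the result back with any XML parser should give you the original string again, both from the attribute and from the element contents.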
2. Understand Unicode
Do you know the difference between a byte, a character and a codepoint? If you don't, I'd probably think twice about hiring you. It's absolutely shocking how many programmers are out there that don't understand the basics of unicode, UTF-8 and how it relates to the web.
An often-heard excuse for not having to care about non-ASCII characters is that the application is only used by people in English-speaking countries. However, if you need to use the euro sign (€) or if you deal with people copy-pasting from Word documents, you most definitely will come across problems.
A simple call to utf8_encode is not actually enough. If some of your source-data was already encoded as UTF-8 you will end up losing data. Only use utf8_encode if you know your source-data is encoded as ISO-8859-1.
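The sketch below shows what goes wrong when already-encoded data is converted a second time, and one way to guard against it. The helper name toUtf8 is hypothetical, and the code assumes the mbstring extension is available. It uses mb_convert_encoding rather than utf8_encode, which is deprecated as of PHP 8.2. Note that the check is a heuristic: it assumes anything that is not valid UTF-8 is ISO-8859-1, which you still have to know about your own data.

```php
<?php

/**
 * Hypothetical helper: returns $value as UTF-8, assuming anything that is
 * not already valid UTF-8 is ISO-8859-1. The function cannot verify that
 * assumption; it has to hold for your data.
 */
function toUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, converting again would mangle it
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1 (é is the single byte 0xE9)
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8 (é is the two bytes 0xC3 0xA9)

$fromLatin1 = toUtf8($latin1); // converted to UTF-8
$fromUtf8   = toUtf8($utf8);   // left untouched

// Blindly converting the UTF-8 string again yields the familiar "cafÃ©".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
```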
The one true way to go about it is to make sure that every step of the way in your web application is UTF-8. This includes your HTTP/HTML content type, your MySQL database and basically anything that ingests data for your application (email, CSV importers, XML readers, web services). Once you are absolutely sure every part of your application is UTF-8, and you have converted any old data, things will start to behave correctly.
3. CDATA is never a solution
It might be tempting to solve any encoding issues by simply surrounding it with <![CDATA[ and ]]>. This might make sure that XML parsers don't throw an error when reading, but they still have 'incorrect' characters. If your XML document has CDATA tags, or you think you need CDATA, you are probably wrong.
More often than not, using CDATA actually stems from encoding problems (see section 2). CDATA is not a method to encode binary data: XML parsers will still throw errors if they come across certain byte sequences. If you really do need to include binary data in XML, the best way is to use something like base64_encode instead.
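As a rough illustration of the base64 approach, the sketch below writes arbitrary bytes into an element and decodes them again on the consuming side. The element name and the 'encoding' attribute are invented for this example; they are not part of any standard.

```php
<?php

// Arbitrary bytes, including a NUL byte and sequences that are not valid UTF-8.
$binary = "\x00\xFF\xFE binary \x1B data";

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
// The element and attribute names are made up for this example.
$xmlWriter->startElement('attachment');
$xmlWriter->writeAttribute('encoding', 'base64');
$xmlWriter->text(base64_encode($binary));
$xmlWriter->endElement(); // attachment
$xmlWriter->endDocument();
$xml = $xmlWriter->outputMemory();

// The consumer reverses the encoding after parsing the document.
$decoded = base64_decode((string) simplexml_load_string($xml), true);
```

Because base64 output only uses plain ASCII characters, the document stays well-formed no matter what the original bytes were.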
If your XML feed uses CDATA because of encoding issues you actually defer your problem to the consumer of your XML feed. So instead of seeing 'weird characters' on your side, the person that reads your xml feed now has no good way to detect which encoding was actually used. If it's for example an RSS feed you're generating, this can result in RSS readers throwing errors, or characters showing up incorrectly.
4. Be liberal with whitespace
An error like "unexpected character at line 1, column 176456" is much harder to debug than "line 5078, column 24". Whitespace between XML tags usually has no significance, so you can add as much indentation and as many line breaks (\n) as you want. Note that tools such as XMLWriter can indent for you automatically once you enable it with setIndent(true).
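With XMLWriter, indentation is off by default and has to be switched on after opening the writer. A minimal sketch, with made-up element names and a two-space indent chosen arbitrarily:

```php
<?php

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
// Indentation is disabled by default; enable it before writing anything.
$xmlWriter->setIndent(true);
$xmlWriter->setIndentString('  '); // two spaces per nesting level
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('orders');
$xmlWriter->startElement('order');
$xmlWriter->writeElement('order-number', '5078');
$xmlWriter->endElement(); // order
$xmlWriter->endElement(); // orders
$xmlWriter->endDocument();
$xml = $xmlWriter->outputMemory();

echo $xml;
```

Each element now ends up on its own line, so a parser error points at a useful line number instead of a column somewhere deep into line 1.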
5. Be verbose
Even though you might easily figure out that <ORD_NR> means 'order number', there's no reason why you shouldn't actually state it as <order-number>. Note that the following rules seem to be favored by most people:
- Use lowercase for tags and attribute names.
- Use dashes (-) to separate words, not underscores (_).
- Minimize the use of attributes, nested tags allow more flexibility.
6. Be careful with entities
The only valid entities in XML are &lt; (<), &gt; (>), &amp; (&) and &quot; ("), so any other entity will simply not work and throw errors.
HTML DTD's add many entities, so if you're mostly used to using HTML you might expect other entities to work. If your source-data already has entities, you might have to get rid of these first.
In PHP it means you should use htmlspecialchars, instead of htmlentities.
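The difference is easiest to see side by side. The sketch below assumes source data that already contains HTML entities (the sample string is made up): it first decodes them back to real characters and then escapes only what XML needs.

```php
<?php

// Hypothetical source data that already contains HTML entities.
$source = 'Caf&eacute; &amp; bar &euro;5 <b>';

// Step 1: turn every HTML entity back into the character it stands for.
$plain = html_entity_decode($source, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 2: escape only the characters that XML requires to be escaped.
$forXml = htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');

// For comparison: htmlentities() produces HTML-only entities such as
// &eacute; again, which an XML parser will reject unless they are declared.
$forHtml = htmlentities($plain, ENT_QUOTES | ENT_HTML401, 'UTF-8');
```

After step 2 the é and € are plain UTF-8 characters in the output, which is exactly what an XML document declared as UTF-8 expects.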
Feel free to discuss, disagree, or add on to this list in the comments, I'm happy to hear your experiences.

With regard to #4, whitespace (carriage return, new line, space, tab) should only be added to XML during development and to troubleshoot issues. It has an impact in production, such as on bandwidth (10 bytes of whitespace per request x 1,000 requests per second x 86,400 seconds per day = roughly 864 MB of "wasted" bandwidth per day).
Jeff, 10 bytes is a bit of a bad example, because the chance that an additional TCP packet will be needed is relatively small.
If the number of bytes is significant you are correct though.
Why not just use DOMDocument -> http://www.formatix.eu/en/php-how-to-create-an-xml-file.html
I personally don't mind DOMDocument, and use it for SabreDAV.
If your needs are simple, I think XMLWriter is easier to use.
Hi!
this is an interesting topic, for all the wrong reasons (that's not your fault though :). You point it out yourself:
"Even though generating XML should be very simple, ..."
It should be simple, but it's not. Even more troublesome is reading XML (yes, even with a good XML parser). Anyway, before I digress too much:
Re #2, You mention understanding unicode, and the difference between byte, codepoint and character. But in my experience character encoding is more often a source of headaches. I mean that in practice character encoding mismatches (regardless of the characterset) are more often a source of errors than misunderstanding the difference between a codepoint and a character.
Re #3, CDATA sections are certainly an appropriate solution in any case where preserving whitespace is of the utmost importance. Code snippets come to mind - I really want them to be as is, and I don't want to risk change of semantics either due to typical whitespace semantics or character entities for XML metacharacters. I agree to the point that CDATA is never a solution for encoding binary data - but that is simply because CDATA sections have nothing to do with encoding :)
Re #6, "The only valid entities in XML are &lt; (<), &gt; (>), &amp; (&) and &quot; (")"
This is not true. You can use any entity that is declared. In addition to the list you mention, &apos; (for the single quote) is also implicitly declared as per the standard. You can declare whatever entity you like yourself in the DTD, and as long as the character reference is valid you can use the numerical notation (ampersand, hash, digits, semicolon) too.
kind regards,
Roland Bouman
mm, it seems the comment module doesn't strip/encode XML metacharacters... Re #6, I was pointing out that apos is a valid entity name too, for the single quote, and that you can also use the "ampersand - hash - digits - semicolon" notation as long as the character is valid.
Hi Roland,
You bring up good points. Some responses:
About unicode: I agree that creating good XML does not directly require understanding what a codepoint is. My argument is that as a programmer you _should_ fully understand unicode and how it works. The fact that this is not always true baffles me, but I wanted to make this quite clear.
About CDATA: That's definitely a good usecase. My perspective comes from what CDATA _is_ used for 99% of the time.
About entities: Totally agree, but I personally hope I don't have to deal much with XML documents declaring custom or existing DTD's. It definitely adds to the complexity. Just because you can, doesn't mean you should.
I mainly wanted to list some of the things I often see going wrong or abused. I'm often a consumer of rather shitty xml data, and this is my top 6 of things I don't like seeing I suppose.
@Jeff Stoner:
You should compress your generated XML using gzip instead of crippling the readability and maintainability of your XML code.
In relation to XML Writer , here is a class that allows you to transform a multilevel php array to an xml string using xml writer :
http://www.berejeb.com/2010/06/utilisez-xmlwriter-pour-generer-du-flux-xml-avec-php/
The post is in french but you can just pick the class code ;-)
Personally I like using attributes; being less flexible they are less error prone; an attribute value must be a string - it cannot contain (for example) sub-nodes, comments, CDATA section etc. This simplifies the structure of the document and makes it less likely that someone will form it in a valid way which your parser doesn't understand (because of nodes in the DOM it isn't expecting, for instance)
Not to mention the fact that attribute names only appear once (per element) in the document, which means you use fewer characters.
Would you condescend to write about "the difference between a byte, a character and a codepoint"? As you explicitly claim to be more knowledgeable than most programmers on the subject, I'm sure we can all learn a thing or two from you.
I wonder if since so many - a "shocking" number, no less - programmers "don't understand the basics of unicode, UTF-8 and how it relates to the web", well then maybe it's either not that basic, or not that related to the web.
Anyway, looking forward to you explaining all.
P.S Why would a session expiring prevent posting of a blog comment?
Hi Steven,
I think I'm detecting some sarcasm. Perhaps I didn't make the right choice of words there.
My experience is mostly personal, so I really don't have numbers to back it up. I work mostly in the web industry, and I just see this stuff happening around me.
I personally feel Unicode should be in everybody's curriculum, whether you're an author of HTML documents or a backend developer. The implications of not understanding it can cause security bugs, data-loss or plain unexpected behavior. My annoyance stems from having to fix these things often.
If you weren't being sarcastic about wanting to know more about it, a quick Google search turned up an article from Joel Spolsky:
http://www.joelonsoftware.com/articles/Unicode.html
Evert
P.S.: Not sure about the session problem. I've had some issues too
HEHE ... article is ok but some of the comments are laughable : -)
Kind of agree with you that web-devs should have utf-8 understanding but it's not 'basics' from my experience. Not as long as all packages/tools/languages support it in easy to use fashion :- )
All in all good advice.
btw. in my opinion Joel's books are whack, don't buy them.
Art