Arnaud's Blog

Opinions on open source, standards, and other things

XML vs Open

I heard Microsoft claiming that OOXML is open because it is in XML. By “open” they mean that anyone can use, process, manipulate, and interpret OOXML documents. Is that really so? I say not!

A while ago my colleague Kelvin Lawrence wrote a blog entry titled “It uses XML so it is a standard right? wrong!” about a type of abuse regarding XML: people claiming that because their format is in XML, it is a standard. I commented on Kelvin’s entry at the time, pointing out another fallacy regarding XML, which is that because a format is in XML, anybody can process it.

Microsoft’s claim that OOXML is open because it is an XML format is exactly the fallacy I was pointing out. It is just plain wrong, and people need to understand why. So I’m going to expand a bit on what I said in my comment on Kelvin’s entry.

The best analogy I’ve found to help people understand why this assertion is false is this: saying that your format is in XML is about the same as saying that your language uses the Roman alphabet. That alone clearly doesn’t guarantee that anyone who knows the Roman alphabet can understand your language.

At most, knowing the Roman alphabet guarantees that you can decipher the letters, one by one. It certainly doesn’t guarantee that you will be able to understand the words, let alone the sentences, that the letters form.

The same is true of XML formats. Knowing that a format is in XML merely guarantees that you can parse the document. In computer science, parsing is the operation that scans a document, typically a file, to extract the information it contains. XML makes it easy to perform this operation and turn the content of an XML document into a structure in memory. But what that structure represents, and what the pieces of that structure represent, you don’t know. They are just bits and pieces in a hierarchical form.
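To make this concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. The document, its tag names, and the `outline` helper are all made up for illustration. The parser happily builds a tree, but all the program gets back is names, attributes, and strings in a hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tags and attributes are invented for this example.
DOC = '<doc><P align="center">Hello</P><P align="left">World</P></doc>'

def outline(element, depth=0):
    """Return (depth, tag, attributes, text) for every element, in document order."""
    rows = [(depth, element.tag, dict(element.attrib), (element.text or "").strip())]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
for depth, tag, attrs, text in outline(root):
    # The parser hands back structure only: nothing here says what "P" means.
    print("  " * depth + f"{tag} {attrs} {text!r}")
```

Nothing in that output tells the program whether “P” is a paragraph, a person, or a price.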

Because XML is a text-based format in which data is tagged, you might, as a human being, be able to guess a bit more by looking inside the document. If you see a tag called “table”, for instance, it’s probably safe to infer that this part of the document contains tabular data. But you’re unlikely to get much further than that, and a program certainly won’t do any of that guessing.

If the document comes with a schema, such as an XML schema, the structure in memory may be a bit richer. Instead of only having character strings, you’ll have typed data for one thing. So, for instance, instead of having the character string “123”, you may have the number 123. You may also know that a set of pieces of data is referenced as some kind of record called “customer”. But you still won’t have much more than that.
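As a rough illustration of what typed data buys you, the sketch below uses a plain Python dictionary as a stand-in for a schema. This is an assumption made to keep the example self-contained; it is not how an XML Schema processor works, and the `customer` document, `FIELD_TYPES`, and `load_record` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element names are invented for this example.
DOC = "<customer><id>123</id><name>Ada</name></customer>"

# Stand-in for the type information a schema would declare: element name -> type.
FIELD_TYPES = {"id": int, "name": str}

def load_record(xml_text, field_types):
    """Parse a flat record and convert each child's text using the declared type."""
    root = ET.fromstring(xml_text)
    values = {
        child.tag: field_types.get(child.tag, str)(child.text or "")
        for child in root
    }
    return root.tag, values

record_name, record = load_record(DOC, FIELD_TYPES)
# "id" is now the number 123 rather than the string "123",
# but the program still has no idea what a customer is.
print(record_name, record)
```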

Tim Berners-Lee intends to go one step further with a set of technologies the W3C has been developing under the umbrella of the “Semantic Web”. However, we have yet to see how far this will get us, and in any case it doesn’t apply to formats such as OOXML, which don’t use this technology.

So the only way to know more is to have documentation that tells you what the format is really made of, what each tag corresponds to, and how the tags relate to each other. This is where the specification comes into play.

The specification is the document that tells you that the “P” tag corresponds to a paragraph and that you can expect to find on the “P” tag the “align” attribute that specifies the paragraph alignment. The specification is what defines the semantics, the meaning of what’s in the document, beyond the XML format.

Only by carefully reading the specification, and writing code that interprets the document content accordingly, will you be able to fully process the document as intended. Without the specification, how are you supposed to know that “P” stands for a paragraph rather than, say, a person?
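Here is a small sketch of that ambiguity. Both functions below parse the very same hypothetical document without any trouble; they differ only in the meaning they assign to “P”. The document and both function names are invented for illustration, and only a specification can tell you which reading is the intended one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; nothing in the XML itself says what "P" means.
DOC = '<doc><P align="center">Ada Lovelace</P></doc>'

def read_as_paragraphs(xml_text):
    """Reading A: "P" is a paragraph and "align" is its alignment."""
    return [
        {"kind": "paragraph", "align": p.get("align"), "text": p.text or ""}
        for p in ET.fromstring(xml_text).iter("P")
    ]

def read_as_people(xml_text):
    """Reading B: "P" is a person and its text is the person's name."""
    return [
        {"kind": "person", "name": p.text or ""}
        for p in ET.fromstring(xml_text).iter("P")
    ]

# Both readings are equally valid as far as the XML parser is concerned.
print(read_as_paragraphs(DOC))
print(read_as_people(DOC))
```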

This is why the specification is so important, and this is one of the reasons so many people have been complaining about OOXML. OOXML is so poorly defined that there is no way two engineers in two different places in the world can sit down, implement the specification, and expect the same behavior. The OOXML specification has far too many unspecified or incompletely specified features.

This isn’t to say that there is no value in a format being XML based. Obviously I wouldn’t have spent several years working on XML if I thought so. Having a format in XML allows you to use existing code to parse the document into memory rather than having to write a different parser for every document format. There is definitely value in that, and it does contribute to making a format more open by lowering the cost of implementation, but that’s not enough to make it “open”.

Interestingly enough, if Microsoft fully documented its existing binary format for Office and made that documentation freely available to all without any legal barrier, their binary format could be more open than OOXML is, even though it’s not XML based.

Of course the fact that Microsoft keeps referring to its format as “Open XML” only makes the situation more confusing.

In any case, don’t fall for it. Look beyond the claims.


Posted October 23, 2007 | open, standards | 2 Comments