Arnaud's Blog

Opinions on open source, standards, and other things

XML vs Open

I heard Microsoft claim that OOXML is open because it is in XML. By “open” they mean that anyone can use, process, manipulate, and interpret OOXML documents. Is that really so? I say not!

A while ago my colleague Kelvin Lawrence had a blog entry titled “It uses XML so it is a standard right? wrong!,” about a type of abuse of XML in which people claim that because their format is in XML it is a standard. At the time I commented on Kelvin’s entry, pointing out another fallacy regarding XML: that because a format is in XML anybody can process it.

The claim from Microsoft regarding OOXML being open because it is an XML format hits that very point I was making. This is just plain wrong and people need to understand why. So I’m going to expand a bit on what I said in my comment to Kelvin’s entry.

The best analogy I’ve found to get people to understand why this assertion is false is that saying your format is in XML is about the same as saying that your language uses the Roman alphabet. This alone clearly doesn’t guarantee that anyone who knows the Roman alphabet can understand your language.

At most, knowing the Roman alphabet only guarantees that you can decipher the letters, one by one. This certainly doesn’t guarantee that you will be able to understand the words, let alone the sentences, that the letters form.

The same is true about XML formats. Knowing that a format is in XML merely guarantees that you can parse the document. Parsing in computer science is the function that scans a document, typically a file, to extract the information it contains. XML makes it easy to do this operation and turn the content of an XML document into a structure in memory. But what that structure represents, what the pieces of that structure represent, you don’t know. They are just bits and pieces in a hierarchical form.
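To make this concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The document is one I made up for illustration (it is not real OOXML markup), and the outline helper is my own name, not part of any library. It shows that parsing hands you tags, attributes, and text arranged in a tree, and nothing about what any of them mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical document invented for this example; not real OOXML markup.
DOC = '<doc><P align="right">Hello</P><P>World</P></doc>'

def outline(element, depth=0):
    """Return (depth, tag, attributes, text) for every element in the tree."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, dict(element.attrib), text)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
for row in outline(root):
    # The parser gives us names and strings in a hierarchy,
    # but nothing here says what "P" or "align" stands for.
    print(row)
```

The parser did its job perfectly, and yet the program still has no idea whether “P” is a paragraph, a person, or a price.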

Because XML is a text-based format in which data is tagged, as a human being, you might actually be able to guess a bit more by looking inside the document. If you see a tag called “table” for instance, it’s probably safe to infer that this part of the document contains tabular data. But you’re unlikely to go much further than that and a program certainly won’t do any of that guessing.

If the document comes with a schema, such as an XML schema, the structure in memory may be a bit richer. Instead of only having character strings, you’ll have typed data for one thing. So, for instance, instead of having the character string “123”, you may have the number 123. You may also know that a set of pieces of data is referenced as some kind of record called “customer”. But you still won’t have much more than that.
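Here is a rough sketch of that difference in Python. The standard library does not validate against XML Schema, so the FIELD_TYPES table below is a hand-written stand-in for the type information a schema would declare. The “customer” document, the field names, and the load_record helper are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical record invented for this example.
DOC = "<customer><id>123</id><name>Ada</name></customer>"

# Stand-in for what a schema would declare about each field (an assumption,
# not real XML Schema processing).
FIELD_TYPES = {"id": int, "name": str}

def load_record(xml_text):
    """Parse a flat record and convert each field using the declared type."""
    root = ET.fromstring(xml_text)
    values = {}
    for child in root:
        convert = FIELD_TYPES.get(child.tag, str)  # undeclared fields stay strings
        values[child.tag] = convert(child.text or "")
    return root.tag, values

record_name, values = load_record(DOC)
print(record_name, values)
```

With the type table the program gets the number 123 instead of the string “123”, inside a record called “customer”. What a customer is, or what the program should do with one, is still nowhere to be found.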

Tim Berners-Lee intends to go one step further with a set of technologies the W3C has been developing under the umbrella of the “Semantic Web”. However, we have yet to see how far this will get us, and in any case this doesn’t apply to formats such as OOXML, for which this technology isn’t used.

So the only way to know more is to have documentation that tells you what the format is really made of, what each tag corresponds to, and how the tags relate to each other. This is where the specification comes into play.

The specification is the document that tells you that the “P” tag corresponds to a paragraph and that you can expect to find on the “P” tag the “align” attribute that specifies the paragraph alignment. The specification is what defines the semantics, the meaning of what’s in the document, beyond the XML format.

Only by carefully reading the specification, and writing code that interprets the document content accordingly, will you be able to fully process the document as intended. Without the specification, how are you supposed to know that “P” stands for a paragraph rather than, say, a person?
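The paragraph-or-person question can be sketched in a few lines of Python. Both functions below read the very same parsed tree. Which one is correct depends entirely on a specification that lives outside the XML. The document, the function names, and the default alignment of “left” are all assumptions I made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document invented for this example.
DOC = '<doc><P align="center">Ada Lovelace</P></doc>'

def read_as_paragraphs(root):
    """Interpretation under an assumed spec: P is a paragraph, align its alignment."""
    return [
        {"kind": "paragraph", "text": p.text or "", "align": p.get("align", "left")}
        for p in root.iter("P")
    ]

def read_as_people(root):
    """Interpretation under a different assumed spec: P is a person's name."""
    return [{"kind": "person", "name": p.text or ""} for p in root.iter("P")]

root = ET.fromstring(DOC)
print(read_as_paragraphs(root))
print(read_as_people(root))
```

Both functions run without complaint on the same input. Nothing in the XML itself can tell you which one the author of the format had in mind.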

This is why the specification is so important, and this is one of the reasons so many people have been complaining about OOXML. OOXML is so poorly defined that there is no way two engineers in two different places in the world can sit down, implement the specification, and expect the same behavior. The OOXML specification has way too many unspecified or incompletely specified features.

This isn’t to say that there is no value in a format being XML based. Obviously I wouldn’t have spent several years working on XML if I thought so. Having a format in XML allows you to use existing code to parse the document into memory rather than having to write a different parser for every document format. There is definitely value in that, and it does contribute to making a format more open by lowering the cost of implementation, but that’s not enough to make it “open”.

Interestingly enough, if Microsoft fully documented its existing binary format for Office and made that documentation freely available to all without any legal barrier, their binary format could be more open than OOXML is, even though it’s not XML based.

Of course the fact that Microsoft keeps referring to its format as “Open XML” only makes the situation more confusing.

In any case, don’t fall for it. Look beyond the claims.

October 23, 2007 Posted by | open, standards | 2 Comments

Introduction

I work for Bob Sutor in IBM’s Open source and standards project office. Given Bob’s level of activity and celebrity in the “blogosphere” I suppose it won’t be a surprise to anyone if I say that he’s been trying to get me to create my own blog for a long time.

So, why am I doing this now? Well, I’ve been participating in various public debates lately, such as the Goscon panel, and the need to be able to follow up and say what I think on certain issues, such as OOXML, has been nagging me. Since blogs have now become the main channel for expressing oneself, for better or for worse, I’ve decided to put aside all of the issues I have with them and forge ahead.

Now, let me explain why despite Bob’s insistence I have until now refrained from creating a public blog.

The first reason is that I find blogs to be very egocentric. And while I don’t claim to be particularly more humble than anyone else this fundamentally bugs me.

I’m a longtime internet user and I used to be very active in public forums (a.k.a. newsgroups) and mailing lists. The fundamental difference between these communication channels and blogs is that each of them typically focuses on a particular topic. Blogs, on the other hand, are centered around individuals. Furthermore, while comments and trackbacks provide for some level of dialog, blogs are primarily one-way communication channels, unlike newsgroups and mailing lists, which are essentially symmetrical. The fact that most syndication feeds merely communicate the blogger’s entries and not the comments only makes this worse.

Aside from the egocentric nature of blogs, another reason for not having created a blog earlier is the fact that blogs very often have no particular topic. This means that readers have to wade through all sorts of information that they may have no interest in to get to the information they do care about. A perfect example of this is Bob’s blog. I’m very interested in Bob’s opinion when it comes to open source and standards. In fact, open source and standards being the focus of my own job, it’s pretty essential for me to read Bob’s blog. But quite frankly I’m not so interested in his opinion on music and Bob Dylan. Not that I think there is anything wrong with his taste. I might actually be happy to engage in a discussion on that very topic while having a casual dinner with him, but in the context of my work this is just noise.

Finally, the third reason for not having created a blog earlier is simply time. Lack of it that is. Like many, I already have a hard time keeping up with the email I receive and dealing with the long list of projects I’m responsible for. I know from past experience that engaging in public discussions on the internet can be very time consuming and I just don’t know that I can dedicate enough time to this to do it well.

For what it’s worth I’ve been wondering how Bob manages to write so much. Having just traveled with him I think I now have part of the answer. I think the amount of travel he does, combined with the fact that he insists on being at the airport up to 4 hours ahead of his flight, has something to do with it… 😉

Don’t expect this to be my personal diary. Even though the opinions expressed here are only mine and do not necessarily represent those of my employer, I intend to use this blog primarily for work purposes.

Of course, the lack of time will remain a problem and for that reason, if nothing else, I won’t commit to writing on a regular basis. Hopefully, though, readers will find something of interest in what I manage to publish.

October 22, 2007 Posted by | blogs, ibm, open, opensource, standards | 7 Comments