The XML Meta-Architecture

Henry S. Thompson

HCRC Language Technology Group
University of Edinburgh

World Wide Web Consortium




Presented at XML DevCon, London, 2001-02-21




© 2001 Henry S. Thompson

XML has grown

XML the language

Namespaces

A great success

As long as you keep your expectations suitably low

XSLT/XPath/DOM


XLink/XPointer

XML Schema

Canonical XML/XML Signatures


XML Query/XML Protocols


What’s missing?

In the interests of time, XML 1.0 did not define its own data model

So XPath had to define it

And XLink had to define it

And the DOM had to define it

Finally, later than we’d have liked, we’re about to get

The XML Information Set

Or Infoset

(now in Last Call)


What’s the Infoset?

The XML 1.0 plus Namespaces abstract data model

What’s an ‘abstract data model’?

The thing that a sequence of start tags and attributes and character data represents

A formalization of our intuition of what it means to “be the same document”

The thing that’s common to all the uninterestingly different ways of representing it

Single or double quotes

Whitespace inside tags

General entity and character references

Alternate forms of empty content

Specified vs. defaulted attribute values
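
A minimal sketch of that equivalence, using Python’s standard ElementTree (the comparison covers only the properties just listed):

import xml.etree.ElementTree as ET

# Two spellings that differ only in the 'uninteresting' ways listed
# above: quote style, whitespace inside tags, a character reference,
# and the empty-element shorthand.
doc_a = '<doc id="1">caf&#233;<note/></doc>'
doc_b = "<doc  id='1' >caf\u00e9<note></note></doc>"

def same_infoset(x, y):
    # Compare what the Infoset keeps, ignoring how it was spelled
    return (x.tag == y.tag and x.attrib == y.attrib
            and (x.text or "") == (y.text or "")
            and (x.tail or "") == (y.tail or "")
            and len(x) == len(y)
            and all(same_infoset(c, d) for c, d in zip(x, y)))

print(same_infoset(ET.fromstring(doc_a), ET.fromstring(doc_b)))  # True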


What does it mean to be ‘abstract’?

The Infoset is a description of the information in a document

It’s a vocabulary for expressing requirements on XML applications

It’s a bit like numbers

As opposed to numerals

If you’re a type theorist

It’s just the definition of the XML Document type


What the Infoset isn’t

It’s not the DOM

Much higher level

It’s not about implementation or interfacing at all

But you can think of it as a kind of fuzzy data structure if that helps

It’s not an SGML property set/grove

But it’s close


Infoset details

Defines a modest number of information items

Element, attribute, namespace declaration, comment, processing instruction, document ...

Each one is composed of properties

Which in turn may have information items as values

Both element and attribute information items have [local name] and [namespace URI] properties

Element information items have [children] and [attributes]

Attribute information items have a [normalized value]
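
A toy rendering in Python, with dataclasses standing in for information items; only the properties just listed are modelled, and the field names are mine:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AttributeItem:
    local_name: str               # [local name]
    namespace_uri: Optional[str]  # [namespace URI]
    normalized_value: str         # [normalized value]

@dataclass
class ElementItem:
    local_name: str               # [local name]
    namespace_uri: Optional[str]  # [namespace URI]
    attributes: List[AttributeItem] = field(default_factory=list)  # [attributes]
    children: List["ElementItem"] = field(default_factory=list)    # [children], simplified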

For more details, see my colleague Richard Tobin’s talk on Thursday

He’s the editor of the Infoset spec.


The Infoset Revolution

We’ve sort of understood that XML is special because of its universality

Schemas and stylesheets and queries and … are all notated in XML

But now we can understand this in a deeper way

The Infoset is the common currency of all the XML specs and languages

XML applications can best be understood as Infoset pipelines

Angle brackets and equal signs are just an Infoset’s way of perpetuating itself
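
Taken literally, the pipeline view is just function composition over trees. A minimal sketch, with the stage names purely hypothetical:

from functools import reduce

def pipeline(*stages):
    # Compose Infoset-to-Infoset stages into a single application
    return lambda infoset: reduce(lambda tree, stage: stage(tree), stages, infoset)

# app = pipeline(parse, schema_validate, xinclude, transform)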


The Infoset Pipeline begins

An XML Parser builds an Infoset from a character stream

A streaming parser gives only a limited view of it

A validating parser builds a richer Infoset than a non-validating one

Defaulted values

Whitespace normalisation

Ignorable whitespace

If a document isn’t well-formed, or isn’t Namespace-conformant

It doesn’t have an Infoset!
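
A sketch of the richer-Infoset point, assuming lxml as a stand-in parser (attribute_defaults asks it to read the DTD and apply defaulted values):

from lxml import etree

DOC = b"""<!DOCTYPE doc [
  <!ELEMENT doc EMPTY>
  <!ATTLIST doc mode CDATA "fast">
]>
<doc/>"""

# Plain parse: the defaulted attribute never appears
print(etree.fromstring(DOC).attrib)  # {}

# DTD-aware parse: a richer Infoset, with the default filled in
dtd_parser = etree.XMLParser(attribute_defaults=True)
print(etree.fromstring(DOC, dtd_parser).attrib)  # {'mode': 'fast'}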


The XML Schema comes next

Validity and well-formedness are XML 1.0 concepts

They are defined over character sequences

Namespace-conformance is a Namespaces concept

It’s defined over character sequences too

Schema-validity is the XML Schema concept

It is defined over Infosets


The Schema and the Infoset

So crucially, schemas are about Infosets, not character sequences

You could schema-validate a DOM tree you built by hand!

Using a schema which itself exists only as in-memory data structures, for that matter
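
With lxml again as the stand-in, a sketch of exactly that: a tree built by hand, validated against a schema that never existed as characters on disk:

from lxml import etree

# An element built by hand -- no character stream involved
doc = etree.Element("price")
doc.text = "42"

# A schema held only as in-memory data structures
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:integer"/>
</xs:schema>"""
schema = etree.XMLSchema(etree.fromstring(XSD))

print(schema.validate(etree.ElementTree(doc)))  # True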


The Infoset grows

Crucially, schemas are about much more than validation

They tell you much more than ‘yes’ or ‘no’

They assign types to every element and attribute information item they validate

This is done by adding properties to the Infoset

To produce what’s called the post schema-validation Infoset (or PSVI)

So schema-aware processing is a mapping from Infosets to Infosets
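
No standard API spells this out, but as a toy sketch (every name below is hypothetical), schema processing hands back the same tree with properties added:

def schema_validate(item, schema):
    # Toy Infoset-to-PSVI map: annotate this item, then recurse
    decl = schema.lookup(item.local_name)  # hypothetical schema API
    item.properties["[validity]"] = "valid" if decl.accepts(item) else "invalid"
    item.properties["[type definition]"] = decl.type_name
    for child in item.children:
        schema_validate(child, schema)
    return item  # the PSVI: same tree, more properties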


The Infoset is transformed

XSLT 1.0 defined its own data model

And distinguished between source and result models

XSLT 2.0 will unify the two

And make use of the Infoset abstraction to describe them

So XSLT will properly be understood as mapping from one Infoset to another
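
lxml’s rendering of XSLT 1.0 already has this shape, for instance: tree in, tree out, with serialization optional at either end:

from lxml import etree

transform = etree.XSLT(etree.fromstring("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="name">
    <greeting><xsl:value-of select="."/></greeting>
  </xsl:template>
</xsl:stylesheet>"""))

result = transform(etree.fromstring("<name>world</name>"))
print(etree.tostring(result))  # b'<greeting>world</greeting>'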


The Infoset is composed

XLink resources (the things pointed to by XPointers) can now be understood as items in Infosets

The XInclude proposal in particular fits into my story

It provides for the merger of (parts of) one Infoset into another
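
A concrete rendering with lxml, assuming a chapter1.xml sitting next to the script:

from lxml import etree

main = etree.fromstring("""\
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="chapter1.xml"/>
</doc>""")

etree.ElementTree(main).xinclude()  # splices chapter1.xml's Infoset in, in place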


The Infoset is accessed

XML Query of course provides for more sophisticated access to the Infoset

It also allows structuring of the results into new Infoset items
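
XQuery itself is beyond a short sketch, but the select-and-restructure idea looks like this with the standard library:

import xml.etree.ElementTree as ET

src = ET.fromstring("<po><item price='3'/><item price='9'/></po>")

# Select items out of one Infoset ...
dear = [i for i in src.findall("item") if int(i.get("price")) > 5]

# ... and structure the results into new Infoset items
out = ET.Element("expensive")
out.extend(dear)
print(ET.tostring(out))  # b'<expensive><item price="9" /></expensive>'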


The Infoset is transmitted

And finally XML Protocol can best be understood as parcelling up information items and shipping them out to be reconstructed elsewhere


A big step forward

This is so much better than the alternatives

Either

Pretending to talk about character sequences all the time

Or

Requiring each member of the XML standards family to define its own data model


Schemas at the heart

I would say that, wouldn’t I :-)

Seriously, schema processing can be integrated into this story in a way DTDs could not

You may want to schema-process both before and after XInclude

Or between every step in a sequence of XSLT transformations

We are actually still missing a piece of the XML story

How do we describe Infoset pipelines?


Types and the Infoset

The most important contribution to the PSVI

Every element and attribute information item is labelled with its type

Integer, date, boolean, …

Address, employee, purchaseOrder

XPath 2.0 and XML Query will be type-aware

Types will play a key role in the next generation of XML applications


XML is ASCII for the 21st century

ASCII (ISO 646) solved a fundamental interchange problem for flat text documents

What bits encode what characters

(For a pretty parochial definition of 'character')

Unicode/ISO 10646 extends that solution to the whole world

XML thought it was doing the same for simple tree-structured documents

The emphasis in the XML design was on simplifying SGML to move it to the Web

XML didn't touch SGML's architectural vision

flexible linearisation/transfer syntax

for tree-structured prose documents with internal links


The alternative take on XML?

It's a markup language used for transferring data

It is concerned with data models

to convert between application-appropriate and transfer-appropriate forms

It is not concerned with human beings

It's produced and consumed by programs


Application data as structured markup

<POORDERHDR>
 <DATETIME qualifier="DOCUMENT">
  <YEAR>1996</YEAR>
  <MONTH>06</MONTH>
  <DAY>30</DAY>
  <HOUR>23</HOUR>
  <MINUTE>59</MINUTE>
  <SECOND>59</SECOND>
  <SUBSECOND>0000</SUBSECOND>
  <TIMEZONE>+0100</TIMEZONE>
 </DATETIME>
 <OPERAMT qualifier="EXTENDED" type="T">
  <VALUE>670000</VALUE>
  <NUMOFDEC>2</NUMOFDEC>
  <SIGN>+</SIGN>
  <CURRENCY>USD</CURRENCY>
  . . .

What just happened!?

The whole transfer syntax story just went meta, that's what happened!

XML has been a runaway success, on a much greater scale than its designers anticipated

Not for the reason they had hoped

Because separation of form from content is right

But for a reason they barely thought about

Data must travel the web

Tree-structured documents (Infosets) are a usable transfer syntax for just about anything

So data-oriented web users think of XML as a transfer mechanism for their data


The new challenge

So how do we get back and forth between application data and the Infoset?

Old answer

Write lots of script

New answer

Exploit schemas and types

A type may be either

simple, for constraining string values

complex, for constraining elements which contain other elements
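
A sketch of what the new answer buys: once each simple type carries a converter, the leaf-level mapping becomes table-driven (the table here is illustrative):

from datetime import date

# Simple types constrain strings, so each can carry a converter
# from the string value to an application value
SIMPLE_TYPES = {
    "xs:integer": int,
    "xs:boolean": lambda s: s in ("true", "1"),
    "xs:date":    date.fromisoformat,
}

print(SIMPLE_TYPES["xs:date"]("1996-06-30"))  # 1996-06-30, as a datetime.date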

Mapping between layers

We can think of this in two ways

In terms of abstract data modelling languages

Entity-Relationship

UML

RDF

In concrete implementation terms

Tables and rows

Class instances and instance variables

The first is more portable

The second more immediately useful


Mapping between layers 2

Regardless of what approach we take, we need

A vocabulary of data model components

An attachment of that vocabulary to types

Sample vocabularies

entity, relationship, collection

table, row, column

instance, variable, list, dictionary

Where should attachment be specified?

In the schema

convenient

Outside it

modular
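
The ‘outside it’ option can be as simple as a separate table from schema types to data-model components; all names below are illustrative:

# Attachment specified outside the schema: modular, and swappable
# per target (tables here, class instances elsewhere)
ATTACHMENT = {
    "purchaseOrder": ("table", "purchase_orders"),
    "address":       ("row", "addresses"),
    "xs:integer":    ("column", int),
}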


Overall Conclusion

Think about things in terms of Infosets and Infoset pipelines

Modular

Powerful

Scalable

Use XML Schema and its type system to facilitate mapping

Unmarshalling is easy

Marshalling takes a little longer
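
By way of illustration, a minimal unmarshaller, with a hand-written table standing in for the type labels a schema processor would have left in the PSVI:

import xml.etree.ElementTree as ET

TYPES = {"VALUE": int, "NUMOFDEC": int}  # stand-in for PSVI type labels

def unmarshal(elem):
    # Elements with children become dicts; typed leaves become values
    if len(elem) == 0:
        return TYPES.get(elem.tag, str)(elem.text or "")
    return {child.tag: unmarshal(child) for child in elem}

amt = unmarshal(ET.fromstring(
    "<OPERAMT><VALUE>670000</VALUE><NUMOFDEC>2</NUMOFDEC></OPERAMT>"))
print(amt)  # {'VALUE': 670000, 'NUMOFDEC': 2}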