Introduction to XML Schema

Henry S. Thompson

HCRC Language Technology Group
University of Edinburgh

World Wide Web Consortium

Markup Technology Ltd



Copyright © 2001 Henry S. Thompson


Basic Concepts and Vocabulary

What is an XML application?

We define an XML application as having

A form:  what do all the documents involved in this application share?

A vocabulary (elements and attributes)

A grammar (how they are allowed to combine)

A function:  what those elements and attributes mean

You already know the basic story about defining a syntax

You can use English (or French or . . .)

You have used a DTD

Now you can use an XML Schema


Components of the XML family

XML Namespaces

Managing multiple vocabularies

XSLT

Transforming XML

XLink/XPointer

Connecting XML documents

XML Schema

Defining XML document families

XML Query

Database-style query language

XML Protocols

XML-based communication


Namespaces for XML

First, an example

<xh:p xmlns:xh='http://www.w3.org/1999/xhtml'>So the result can be expressed as <!-- (a+b)^2 -->
  <mml:apply xmlns:mml='http://www.w3.org/TR/REC-MathML'>
   <mml:power/>
   <mml:apply>
    <mml:plus/>
    <mml:ci>a</mml:ci>
    <mml:ci>b</mml:ci>
   </mml:apply>
   <mml:cn>2</mml:cn>
  </mml:apply>
</xh:p>


Namespaces for XML, cont’d

Where did those colons come from?

xh:this, mml:that, xml:the_other

Two communities pushed for namespaces

Vendors, to manage the composition of document fragments

E.g. the inclusion of mathematical formulae in a document

Working groups, to reserve names without compromising users' freedom to name things

E.g. it wouldn't do for XML-link to reserve <link> for simple links, or XSL to reserve <text>


Namespaces, cont'd

A W3C Recommendation was endorsed in January 1999

There was a lot of vendor pressure to get something in place, which caused political tension and at least one resignation from the WG

The example illustrates how namespaces are declared, scoped and used


Namespaces defined

You can use qualified names, consisting of two simple names separated by a colon (:)

The namespace prefix is an abbreviation for a URI which uniquely identifies the owner/meaning/identity of the source of the name

Using a namespace essentially cedes responsibility for the meaning of the qualified names to the owner of the URI


Declaring a namespace

The association between namespace prefixes and URIs is declared using reserved attributes

<doc xmlns:mml='http://www.w3.org/TR/REC-MathML/'>
...</doc>

  Anywhere inside the above doc element mml is a legal namespace prefix, standing for the URI given

  There is also a mechanism for defining the default (unprefixed) namespace

Declarations are scoped

Prefixed names can be used for

Element type names

Attribute names
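
As a minimal sketch of the default-namespace mechanism and of scoping: the report vocabulary and its URI http://example.org/report are hypothetical, chosen only for illustration. An unprefixed element name takes the default namespace in scope, whereas an unprefixed attribute name is in no namespace at all.

```xml
<report xmlns='http://example.org/report'
        xmlns:mml='http://www.w3.org/TR/REC-MathML/'>
  <!-- title is unprefixed, so it is in the default namespace;
       its lang attribute is unprefixed and so is in no namespace -->
  <title lang='en'>Results</title>
  <!-- mml is in scope anywhere inside report -->
  <mml:ci>a</mml:ci>
  <!-- a nested declaration overrides the default inside this element only -->
  <p xmlns='http://www.w3.org/1999/xhtml'>Some XHTML text</p>
</report>
```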


XML Schema: some details

XML Schema is a language for defining the structure of XML documents

  Notated in XML itself

So there are elements defined for use in schemas to define. . .

Elements :-)

Attributes

Types


Terminology

Documents have structure

Document types

Document instances

Structure can be defined

Informally (D. S. D.)

SGML DTD

XML DTD

Schema using XML


Why validate?

A D. S. D. is a contract between producers and consumers

It provides a guaranteed interface

Producers validate to ensure they are providing what they promised

Consumers validate to check up on producers

and to protect their applications

Application authors validate to simplify their task

Leave error detection and analysis to the validating parser


Why validate? cont'd

Validation is fundamental to the distributed application

It guarantees a minimum level of data integrity

Validate early, validate often

Localise the source of error

Schema-based validation gives you more

Type assignment


A simple example

<!ELEMENT text (#PCDATA|emph|name)*>

<!ATTLIST text
        timestamp NMTOKEN #REQUIRED>

<xs:element name="text">
  <xs:complexType mixed="true">
    <xs:choice minOccurs="0"
               maxOccurs="unbounded">
      <xs:element ref="emph"/>
      <xs:element ref="name"/>
    </xs:choice>
    <xs:attribute name="timestamp"
                  type="xs:date" use="required"/>
  </xs:complexType>
</xs:element>
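
For illustration, here is an instance that both definitions above should accept, assuming emph and name are declared elsewhere and allow text content. The sample text is invented. The comment notes where the two definitions differ: the schema version asks for a date, which the DTD cannot express.

```xml
<!-- Accepted by both: 2001-03-16 is an NMTOKEN and also a valid xs:date.
     A value such as timestamp='yesterday' would still satisfy the DTD
     (it is an NMTOKEN) but not the schema (it is not an xs:date). -->
<text timestamp='2001-03-16'>A note from <name>Henry</name>, with <emph>emphasis</emph>.</text>
```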


The Schema Architecture:  Static

A document or an application or a user identifies a schema document

Document and schema document are well-formed XML

The document is schema-valid with respect to the schema

(The schema document is schema-valid with respect to the schema for schemas)


The Schema Architecture:  Dynamic

An XML application (XSP) which schema-validates

And augments the information with defaults, types, etc.


The state of play

Chartered in the autumn of 1998

Requirements document out in February of 1999

Three component documents

Primer (non-normative)

Structures

Datatypes

8 public working drafts so far

May, September, November 1999

February, April, September, October 2000

March 2001:

http://www.w3.org/TR/xmlschema-1/

[contains pointers to previous drafts]

Proposed Recommendation

Member comments due by 16 April 2001


XML Schema: Four requirements

Reconstruct DTD functionality using XML

'Eat your own cooking'

Integrate Namespaces

Modular schemas for modular document types

Provide a usable inventory of basic datatypes

For elements as well as attributes

Support object-oriented design

Kind-of as well as part-of


Modular design

Schemas are about elements and attributes named by qualified names

A pair of namespace name and local name

A schema may include components for multiple namespaces

Schema documents are primarily about one namespace

But you can assemble multiple schema documents to build a single schema

include a schema document for the same namespace

import a schema for another namespace
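
A sketch of how the two assembly mechanisms might look in a schema document. The namespace URIs, file names and the addressType definition are all hypothetical. The sketch assumes order-types.xsd has the same target namespace as this document and that address.xsd defines addressType in the address namespace.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:addr='http://example.org/address'
           targetNamespace='http://example.org/order'>
  <!-- include: more components for the SAME target namespace -->
  <xs:include schemaLocation='order-types.xsd'/>
  <!-- import: components from ANOTHER namespace -->
  <xs:import namespace='http://example.org/address'
             schemaLocation='address.xsd'/>
  <!-- an element in the order namespace whose type comes from the imported one -->
  <xs:element name='shipTo' type='addr:addressType'/>
</xs:schema>
```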


Simple Type Definitions

Treats attributes and sub-elements the same

A frequently-expressed requirement for XML

We need an inventory of simple types for strings

<xs:attribute name='birthday' type='xs:date'/>

Other built-in simple types:

boolean, decimal, anyURI, hexBinary, dateTime, duration, . . .

QName, NOTATION, . . .

integer, NCName, ID, IDREFS, . . .


Object-oriented design

Type definitions are distinct from attribute and element declarations

The tag-type distinction

Type definitions can be based on other definitions

restriction

extension

list

union
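
Restriction, extension and union are illustrated elsewhere in these slides, so here is a minimal sketch of derivation by list. The type and element names (percentage, percentageList, scores) are made up for the example. A list type's values are whitespace-separated sequences of values of its item type.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <!-- restriction: integers from 0 to 100 -->
  <xs:simpleType name='percentage'>
    <xs:restriction base='xs:integer'>
      <xs:minInclusive value='0'/>
      <xs:maxInclusive value='100'/>
    </xs:restriction>
  </xs:simpleType>
  <!-- list: a whitespace-separated sequence of percentages -->
  <xs:simpleType name='percentageList'>
    <xs:list itemType='percentage'/>
  </xs:simpleType>
  <!-- e.g. <scores>90 75 100</scores> -->
  <xs:element name='scores' type='percentageList'/>
</xs:schema>
```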


The XML Schema worldview

Validity and well-formedness are XML 1.0 concepts

They are defined over character sequences

Namespace-compliant is a Namespace concept

It's defined over character sequences too

Schema-validity is the XML Schema concept

It is defined over XML document Infosets

So the whole XML Schema exercise is predicated on and layered on top of XML 1.0 well-formedness plus Namespaces

Because they are constitutive of the Infoset


What's the Infoset?

The XML 1.0 plus Namespaces abstract data model

Defines a modest number of information items

Element, attribute, namespace declaration, ...

Each has required and optional properties

Name, children, …


The Schema and the Infoset

So crucially, schemas are about infosets, not character sequences

You could schema-validate a DOM tree you built by hand!

Using a schema which exists only as a DOM tree ditto

This simplifies things tremendously

but is hard to get your head around at first


Where did the Infoset come from?

In the interests of time, XML 1.0 did not define its own data model

So XPath had to define it

And XLink had to define it

And the DOM had to define it

Finally, later than we’d have liked, we’re about to get

The XML Information Set

Or Infoset

(now in Last Call)


What’s the Infoset? Take two.

The XML 1.0 plus Namespaces abstract data model

What’s an ‘abstract data model’?

The thing that a sequence of start tags and attributes and character data represents

A formalization of our intuition of what it means to “be the same document”

The thing that’s common to all the uninterestingly different ways of representing it

Single or double quotes

Whitespace inside tags

General entity and character references

Alternate forms of empty content

Specified vs. defaulted attribute values
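
To make some of these differences concrete, the two note elements below differ in quoting, whitespace inside tags, use of a character reference and the form of empty content, yet are intended to represent the same information. The forms wrapper and the element names are hypothetical.

```xml
<forms>
  <!-- Form 1: single quotes, character reference, empty-element tag -->
  <note lang='en'>caf&#xE9;<br/></note>
  <!-- Form 2: double quotes, extra whitespace inside the tag,
       literal character, start-tag/end-tag pair -->
  <note   lang="en" >café<br></br></note>
</forms>
```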


What does it mean to be ‘abstract’?

The Infoset is a description of the information in a document

It’s a vocabulary for expressing requirements on XML applications

It’s a bit like numbers

As opposed to numerals

If you’re a type theorist

It’s just the definition of the XML Document type


What the Infoset isn’t

It’s not the DOM

Much higher level

It’s not about implementation or interfacing at all

But you can think of it as a kind of fuzzy data structure if that helps

It’s not an SGML property set/grove

But it’s close


Infoset details

Defines a modest number of information items

Element, attribute, namespace declaration, comment, processing instruction, document ...

Each one is composed of properties

Which in turn may have information items as values

Both element and attribute information items have [local name] and [namespace name] properties

Element information items have [children] and [attributes]

Attribute information items have a [normalized value]


The Infoset Revolution

We’ve sort of understood that XML is special because of its universality

Schemas and stylesheets and queries and … are all notated in XML

But now we can understand this in a deeper way

The Infoset is the common currency of all the XML specs and languages

XML applications can best be understood as Infoset pipelines

Angle brackets and equal signs are just an Infoset’s way of perpetuating itself


The Infoset Pipeline begins

An XML Parser builds an Infoset from a character stream

A streaming parser gives only a limited view of it

A validating parser builds a richer Infoset than a non-validating one

Defaulted values

Whitespace normalisation

Ignorable whitespace

If a document isn’t well-formed or isn’t Namespace-conformant

It doesn’t have an Infoset!


The XML Schema comes next

Validity and well-formedness are XML 1.0 concepts

They are defined over character sequences

Namespace-compliant is a Namespace concept

It’s defined over character sequences too

Schema-validity is the XML Schema concept

It is defined over Infosets


The Infoset grows

Crucially, schemas are about much more than validation

They tell you much more than ‘yes’ or ‘no’

They assign types to every element and attribute information item they validate

This is done by adding properties to the Infoset

To produce what’s called the post schema-validation Infoset (or PSVI)

So schema-aware processing is a mapping from Infosets to Infosets


The XML Schema Type System

DTD-based validation is based entirely on element types

XML Schema adds attribute types, simple and complex types to this

Simple types consist of strings

Complex types consist of sets of attribute information items (AIIs) plus sequences of characters and element information items (EIIs)


More terminology

Types are (usually infinite) sets

Type definitions (and element and attribute declarations) are the characteristic functions for such sets

Expressed as necessary and sufficient conditions on membership


Attribute Declarations

The simple case

An association between a qualified name (local name plus optional namespace URI) and a simple type definition

Determines a set of AIIs

[local name] and [namespace name] must match

[normalized value] must satisfy the simple type def’n

May be scoped by a particular complex type definition

I.e. two AIIs with the same name may have different types if they occur within different EIIs

May include default/fixed value
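
A sketch of the points above, with made-up names (currency, shirt, shoe, size). It shows a global attribute declaration carrying a default value, and two local declarations of size that have different types because each is scoped by a different complex type.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <!-- global declaration with a default value -->
  <xs:attribute name='currency' type='xs:NMTOKEN' default='GBP'/>

  <xs:element name='shirt'>
    <xs:complexType>
      <!-- local declaration: within shirt, size is one of three tokens -->
      <xs:attribute name='size'>
        <xs:simpleType>
          <xs:restriction base='xs:token'>
            <xs:enumeration value='S'/>
            <xs:enumeration value='M'/>
            <xs:enumeration value='L'/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute ref='currency'/>
    </xs:complexType>
  </xs:element>

  <xs:element name='shoe'>
    <xs:complexType>
      <!-- local declaration: within shoe, size is a decimal number -->
      <xs:attribute name='size' type='xs:decimal' use='required'/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```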


Element Declarations

An association between a qualified name and

A type definition (simple or complex)

A set of identity constraints

A substitution group head (optional)

Determines a set of EIIs

[local name] and [namespace name] must match*

[children] must satisfy the type definition

[attributes] must satisfy the type definition

May be scoped by a particular complex type definition


Element Declaration, cont’d

Subtree of IIs rooted at the EII must satisfy the identity constraints, if any

Three kinds of identity constraints, over (sequences of) values identified by XPath expressions:

Unique: no duplicates allowed

Key: no duplicates, must exist

Keyref: must match some value of a named key

*EIIs which satisfy element declarations which name this one as their substitution group head (transitively) are also allowed

May include default/fixed value
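
A sketch of a key and a keyref attached to an element declaration. The order, part and line vocabulary is hypothetical. Each part must carry a unique id, and every line must refer to the id of some part within the same order.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='order'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='part' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:attribute name='id' type='xs:NCName' use='required'/>
          </xs:complexType>
        </xs:element>
        <xs:element name='line' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:attribute name='partRef' type='xs:NCName' use='required'/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Key: every part has an id, and no two parts share one -->
    <xs:key name='partKey'>
      <xs:selector xpath='part'/>
      <xs:field xpath='@id'/>
    </xs:key>
    <!-- Keyref: each partRef must match some value of partKey -->
    <xs:keyref name='lineToPart' refer='partKey'>
      <xs:selector xpath='line'/>
      <xs:field xpath='@partRef'/>
    </xs:keyref>
  </xs:element>
</xs:schema>
```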


Simple Type Definitions

Based on ISO 11404

Distinguishes between lexical and value spaces

Identifies fundamental and constraining facets

For example, the decimal type has

Lexical space: [+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)

Value space: the decimal numbers, i.e. those expressible with a finite number of decimal digits

Fundamental facets: Ordered: yes; Cardinality: countably infinite; Bounded: no; etc.

Constraining facets: minInclusive, maxInclusive, enumeration, …


Simple type definition example

<xs:simpleType name='bodytemp'>
 <xs:restriction base='xs:decimal'>
  <xs:totalDigits value='4'/>
  <xs:fractionDigits value='1'/>
  <xs:minInclusive value='97.0'/>
  <xs:maxInclusive value='105.0'/>
 </xs:restriction>
</xs:simpleType>


Complex Type Definitions

Constrains [attributes]

Required/optional

Local or global declarations

Constrains [children]

Finite-state grammar for EII sequence

Interpolated characters allowed or not

Simple type for text-only case

Local or global declarations


Complex Type Definition, cont’d

Membership assessment is two-part

Locally valid

All required attributes present

No non-declared attributes present

Sequence of names of EII children, if any, satisfies content model

Recursively valid

All attributes/children not exempted have known types

All attributes with known types satisfy them

All EII children with known types satisfy them


Wildcards

The <any/> content model particle, in all of its forms, allows EIIs regardless of local name

A true ‘any’, i.e. any well-formed XML

<any/> allows a single well-formed element information item

the namespace attribute allows finer control

##any

##other

##targetNamespace

##local

<anyAttribute/> has a similar semantics for attributes
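
A sketch of both wildcards in one complex type. The section type, its heading element and the target namespace URI are hypothetical. The processContents attribute is not covered on this slide; it controls how strictly the matched items are themselves validated.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           targetNamespace='http://example.org/report'>
  <xs:complexType name='section'>
    <xs:sequence>
      <xs:element name='heading' type='xs:string'/>
      <!-- any number of elements from namespaces other than the target namespace -->
      <xs:any namespace='##other' processContents='lax'
              minOccurs='0' maxOccurs='unbounded'/>
    </xs:sequence>
    <!-- any attributes that are in no namespace -->
    <xs:anyAttribute namespace='##local'/>
  </xs:complexType>
</xs:schema>
```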


Type definition by derivation

XML Schema makes it easy to construct type definitions which restrict or extend other type definitions, by specifying only the method of derivation and the differences between the base and derived type definitions.


Derived type definition

<xs:simpleType name='healthyBodytemp'>
 <xs:restriction base='bodytemp'>
  <xs:maxInclusive value='99.5'/>
 </xs:restriction>
</xs:simpleType>

The healthyBodytemp type definition is defined by closing down the permitted range of bodytemp.  We say it 'inherits' the other facets of bodytemp, so the 'effective type definition' of healthyBodytemp  is


Effective type definition

<xs:simpleType name='healthyBodytemp'>
 <xs:restriction base='xs:decimal'>
  <xs:maxInclusive value='99.5'/>
  <xs:totalDigits value='4'/>
  <xs:fractionDigits value='1'/>
  <xs:minInclusive value='97.0'/>
 </xs:restriction>
</xs:simpleType>


Extension for complex types

The next simplest case is extension for complex types

Start with this base type

<xs:complexType name='name'>
  <xs:sequence>
    <xs:element name='title'
                minOccurs='0'/>
    <xs:element name='forename'
                minOccurs='0'
                maxOccurs='unbounded'/>
    <xs:element name='surname'/>
  </xs:sequence>
</xs:complexType>


Derived type definition

<xs:complexType name='fullName'>
  <xs:complexContent>
    <xs:extension base='name'>
      <xs:sequence>
        <xs:element name='genMark'
                    minOccurs='0'/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>


The effective type definition

<xs:complexType name='fullName'>
  <xs:sequence>
    <xs:element name='title'
                minOccurs='0'/>
    <xs:element name='forename'
                minOccurs='0'
                maxOccurs='unbounded'/>
    <xs:element name='surname'/>
    <xs:element name='genMark'
                minOccurs='0'/>
  </xs:sequence>
</xs:complexType>


Restriction for complex types

Restriction for complex types is harder to handle syntactically, because of the significance of linear order in content models, but the semantics are completely parallel to the simple type case:


Restriction example

<xs:complexType name='simpleName'>
  <xs:complexContent>
    <xs:restriction base='name'>
      <xs:sequence>
        <xs:element name='forename'
                    minOccurs='1'/>
        <xs:element name='surname'/>
      </xs:sequence>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>


Restriction and Inheritance

There must be a one-to-one line-up between the particles in the restriction and the particles in the base

Unlike the <simpleType> case, what you see is what you get, so the effective type definition of simpleName is just the same

But for attributes, it works like the <simpleType> case, with unmentioned attributes being inherited unchanged


Element Substitution Groups

An element declaration can identify another declaration as something it wants to be equivalent to

<xs:element name='cat'
            substitutionGroup='pet'/>

Two things follow from this:

The type of cat must be derived from the type of pet

Wherever a pet is allowed, so is a cat:

<element ref='pet'/>

is equivalent to

<choice><element ref='pet'/>

        <element ref='cat'/></choice>
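
A fuller sketch of the same idea. The type names petType and catType, the indoor attribute and the household element are assumptions made for the example. The type of cat is derived from the type of pet by extension, and household refers only to pet, yet a cat child is accepted there too.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:complexType name='petType'>
    <xs:sequence>
      <xs:element name='name' type='xs:string'/>
    </xs:sequence>
  </xs:complexType>
  <!-- the type of cat must be derived from the type of pet -->
  <xs:complexType name='catType'>
    <xs:complexContent>
      <xs:extension base='petType'>
        <xs:attribute name='indoor' type='xs:boolean'/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name='pet' type='petType'/>
  <xs:element name='cat' type='catType' substitutionGroup='pet'/>
  <!-- the content model mentions only pet, but cat may appear in its place -->
  <xs:element name='household'>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref='pet' maxOccurs='unbounded'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```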


Union types

<simpleType name="maxType">

  <union
  memberTypes="nonNegativeInteger">

  <simpleType>

   <restriction base="token">

    <enumeration value="unbounded"/>

   </restriction>

  </simpleType>

  </union>

</simpleType>


Open Enumerations

<simpleType name="color">

  <union>

  <simpleType>

   <restriction base="token">

     <enumeration value="red"/>

     <enumeration value="green"/>

     <enumeration value="blue"/>

   </restriction>

  </simpleType>

  <simpleType>

   <restriction base="token"/>

  </simpleType>

  </union>

</simpleType>


Conclusions

XML Schema has a substantial inventory of mechanisms for defining the structure of documents

Its type system is the basis for the interface between application semantics and transfer syntax

The Infoset is the abstraction which application developers should think in terms of

Start learning more with the XML Schema Primer.