Adding simple type definitions by union to XML Schema

Paul V. Biron
Allen Brown
Martin Gudgin
Ashok Malhotra
Henry S. Thompson
26 June 2000

1.   Introduction and Background

The issue of union types, often discussed in the past by this group, was raised again as a Last Call issue (LC-2: Conjunction types? and others). In its simplest form, the requirement at the core of this and other comments is for a simple type definition which disjunctively combines two or more other simple type definitions, with a result whose lexical and value spaces are the unions of those of its input types. A simple and indeed prototypical example is the maxOccurs attribute on the element element in XML Schema itself: it should be constrained as a union of non-negative-integer and an enumeration with the single member unbounded, but the current mechanisms for defining simple types do not allow for this.

In response to this issue, the above-named authors examined a number of approaches and present the following design for consideration by the group. We believe it is simple both to explain and to implement, and that it will address the stated requirements well. The opportunity to reconsider certain aspects of the existing design has also allowed for some overall simplification and cleanup, with an increase in parallelism with the definition of complex types at the level of XML representation.

2.   Changes to the simple type definition schema component

From the current unitary form of simple type definitions, with six properties (name, base type definition, facets, fundamental facets, variety and target namespace), we move to a core+variants structure, as follows:

core
    Properties name (optional), target namespace, base type definition and variety, with possible values for the latter of atomic, list or union, which in turn determine which of the subsequent variants are filled in:
atomic
    primitive type definition, facets and fundamental facets
list
    item type definition (an atomic or union simple type definition) and facets (limited to enumeration, length, min/maxLength and pattern)
union
    constituent type definitions (atomic or list simple type definitions) and facets (limited to enumeration and pattern)

The semantics are straightforward: a string is schema-valid per a union type definition iff it satisfies the specified facets and is valid per at least one of the constituent type definitions. Processors check the constituents in the order given and must not check beyond the first one which the string satisfies. The type definition outcome property in the PSV infoset reports both the union type definition and the constituent type definition which matched.

We considered a range of other strategies and constraints, particularly on whether to allow overlapping lexical spaces or not, and in the end concluded that attempting to rule out overlapping was a bad idea, as it would rule out e.g. float and double as members of a union. We also considered allowing nested spaces to be disambiguated in favour of the most specific, but concluded that, given that user-order would have to play a role in the case of non-nested overlap, it was better to use it for everything. A note will be needed that having e.g. double before float will be pointless given our choice. The constraint on the constituents of union types to be atomic or list does not rule out unions of unions at the XML representation level; it simply requires them to be unfolded at schema construction time. We also considered making binary a fourth possible value of variety, to reflect both its parameterisation by an encoding type and its (very) limited inventory of allowed facets, but didn't reach consensus on this point.
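
As a hedged sketch of the ordering point, the fragment below writes such a union using the types attribute described in section 3. The type name number is hypothetical. With float listed first, a string such as 1.5 is matched against float and checking stops there. Reversing the order would mean float is never reported as the matching constituent, assuming every string valid as a float is also valid as a double.

```xml
<!-- Hypothetical name 'number'; constituents are tried in the order listed -->
<simpleType name='number'>
 <union types='float double'/>
</simpleType>
```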

3.   Changes to the XML representation of simple type definitions

We thought we would take this opportunity to move <simpleType> more in line with the (itself newly simplified) form of <complexType>. Accordingly the three basic ways of producing new simple type definitions from old all have a common shape: a simpleType element with an optional name and a choice between restriction, list or union as the single required child (after optional annotation).

The restriction option has either a base attribute (a QName) or a simpleType child, and allows the facets appropriate to that base type as children. Also, if the base is a list, then a list child whose type restricts the base's type is alternatively allowed. Similarly, if the base is a union, a union child whose types restrict the base's types is a possibility.

The list option has either a type attribute (a QName) or a simpleType child.

The union option has a types attribute (a list of QNames), any number of simpleType children, or both.
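
The restriction and list options might be sketched as follows, using the attribute forms rather than nested simpleType children. The type names sizes and threeSizes are hypothetical, chosen only for illustration.

```xml
<!-- A list whose item type is given by QName -->
<simpleType name='sizes'>
 <list type='positive-integer'/>
</simpleType>

<!-- A restriction of that list, using a facet allowed for lists -->
<simpleType name='threeSizes'>
 <restriction base='sizes'>
  <length value='3'/>
 </restriction>
</simpleType>
```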

This design tightens the content models and matches them better to their use, without completely eliminating semantic dependencies. So although we can now do much better at allowing only the appropriate facets for lists, the allowed facets for the restriction case are still a function of the base type, which cannot be expressed in the schema for schema documents.

An example taken from XHTML (I gather) of an attribute definition would be as follows on this account:

<xs:attribute name="size">
 <xs:simpleType>
  <xs:union>
   <xs:simpleType>
    <xs:restriction base="xs:positive-integer">
     <xs:maxInclusive value="10"/>
    </xs:restriction>
   </xs:simpleType>
   <xs:simpleType>
    <xs:restriction base="xs:NMTOKEN">
     <xs:enumeration value="small"/>
     <xs:enumeration value="medium"/>
     <xs:enumeration value="large"/>
    </xs:restriction>
   </xs:simpleType>
  </xs:union>
 </xs:simpleType>
</xs:attribute>

This example uses only embedded anonymous simple types, but a list of QName-references can be used for named constituents, or the two combined as required. The parallelism of the cases means that you are never forced to name a type definition just in order to use it as part of another definition, so for instance to constrain the length of a list of constrained strings without exposing the string type itself, the following will work:

<simpleType name='fourTuple'>
 <restriction>
  <simpleType>
   <list>
    <simpleType>
     <restriction base='string'>
      <enumeration value='1'/>
      <enumeration value='one'/>
     </restriction>
    </simpleType>
   </list>
  </simpleType>
  <length value='4'/>
 </restriction>
</simpleType>
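
The maxOccurs case mentioned in the introduction could be written along the same lines, here combining a QName reference with one embedded anonymous type. This is only a sketch: the use of NMTOKEN as the base of the enumeration is an assumption, not something the proposal fixes.

```xml
<attribute name='maxOccurs'>
 <simpleType>
  <!-- Either a non-negative integer or the single token 'unbounded' -->
  <union types='non-negative-integer'>
   <simpleType>
    <restriction base='NMTOKEN'>
     <enumeration value='unbounded'/>
    </restriction>
   </simpleType>
  </union>
 </simpleType>
</attribute>
```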

4.   Changes to the XML representation of complex type definitions

We strongly recommend that a further change be made to the content model of <complexType> to bring the two completely into line: eliminate the derivedBy attribute here as well, in favour of a required child, either <restriction> or <extension>, with a choice between a base attribute and a nested type definition, as above. Only <restriction> would be allowed to have neither a base attribute nor a nested type definition, in which case the actual base would default to the appropriate flavour of urType, as it does now.
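
Under this recommendation, a derived complex type might look like the sketch below. The type and element names are hypothetical, and the exact content allowed inside <extension> is an assumption, since the proposal does not spell it out.

```xml
<!-- The derivedBy attribute is replaced by a required child element -->
<complexType name='intlAddress'>
 <extension base='address'>
  <element name='country' type='string'/>
 </extension>
</complexType>
```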

5.   A note on 'cost'

This does represent a backward-incompatible change: existing schema documents will become invalid. To reduce the practical impact, Martin Gudgin has produced XSLT stylesheets which do the necessary forward conversions, and we'll make these available if these changes are agreed.