Keywords: Content Model, Schema, W3C XML Schema, XML
Biography
Henry S. Thompson is Reader in Artificial Intelligence and Cognitive Science in the Division of Informatics at the University of Edinburgh, based in the Language Technology Group of the Human Communication Research Centre, and Managing Director of Markup Technology Ltd. He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. His current research is focussed on articulating and extending the architectures of XML. He was a member of the SGML Working Group of the World Wide Web Consortium which designed XML, is the author of XED, the first free XML instance editor, and co-author of the LT XML toolkit, and is currently a member of the XSL and XML Schema Working Groups of the W3C. He currently splits his time between the World Wide Web Consortium, the University of Edinburgh and Markup Technology, and is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSL and XML Schemas in both industrial and public settings over the last five years.
Biography
Richard Tobin is a researcher in the Language Technology Group (LTG) in the Human Communication Research Centre, University of Edinburgh. He received his BA in Mathematics from Cambridge in 1983. His current research is in the area of XML, which is used in LTG for representing and processing linguistic data, including large corpora of natural language. He is the author of RXP, a widely-used validating XML parser, and is co-author of XSV, an XML Schema validator. He is a member of the W3C's XML Core Working Group, and was co-editor of the XML Infoset specification. He is also editor of the proposed 1.1 revision of the XML Namespaces specification.
Implementing validation and restriction checking for W3C XML Schema content models is harder than for DTDs. This paper gives complete details on how to convert W3C XML Schema content models to Finite State Automata, including handling of numeric exponents and wildcards. Enforcing the Unique Particle Attribution constraint and implementing restriction checking in polynomial time using these FSAs is also described.
The Problems and an Overview of our Solutions
Finite State Automata for Content Model computations
Algorithm 1: Conversion to FSA
XML Schema abstract data model
Algorithm details
Algorithm Tp(S)
Algorithm Tt(S)
Comments on the algorithm
Algorithm 2: UPA
Algorithm details
Comments on the algorithm
Algorithm 3: Subsumption
Algorithm details
Matches
A problem and a solution
Additional issues
Further properties of element declarations
Another problem with wildcards
Notes on complexity
Acknowledgements
Bibliography
W3C XML Schema content models [XML Schema] are richer and more powerful than their XML DTD [XML] equivalent in several ways:
Content models are also required to obey the Unique Particle Attribution constraint, a weak form of determinism requirement, and, if derived by restriction, to stand in a certain relationship to the content model of the base from which they are derived.
In this paper we present a complete approach to all these issues, based on the translation of content models into augmented Finite State Automata (hereafter FSAs).
Three algorithms are presented:
one for converting content models to FSAs, which handles numeric exponents other than 0, 1 or unbounded by unrolling the sub-expression governed by the indicator;
XML DTD content models are regular expressions over an atomic term vocabulary of element names. Alternation and concatenation (indicated with vertical bar (|) and comma (,) respectively) are the basic term constructors for composing complex terms from other terms. In addition the three Kleene operators ?, + and *, indicating optionality, one-or-more and zero-or-more respectively, can be added to any term.
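As an illustrative aside (our example, not from the paper), a DTD content model such as (a, (b | c)*, d?) maps directly onto an ordinary regular expression over element names, so standard regular-expression machinery suffices for DTD validation:

```python
import re

# Hypothetical content model (a, (b | c)*, d?); each child element name is
# encoded as a token followed by a space so names cannot run together.
model = re.compile(r"a (?:(?:b|c) )*(?:d )?$")

def dtd_valid(children):
    """Check a sequence of child element names against the model."""
    return model.match("".join(name + " " for name in children)) is not None
```

It is exactly this direct mapping which the three XML Schema additions described below break.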
XML Schema goes beyond this in three ways:
The impact of these three additions is that the schema-validation process cannot use existing approaches to regular-expression matching to check that a content model is satisfied. The first algorithm we present addresses this problem, by augmenting the normal algorithm for converting a regular expression to an FSA in ways which accommodate the three differences outlined above. The result can not only serve to check local schema-validity (that is, check that the sequence of element children of a parent in an instance is accepted by the parent's content model), but also as the basis for solving our other two content-model related problems.
The algorithm that follows is a modification of the one beginning on page 95 of [Aho & Ullman] .
We need to explain the abstract data model which XML Schema uses for content models before setting out the algorithm in detail.
The {content model} of a complex type definition is a single particle.
If {min occurs} is 0, the {term} is optional.
XML Schema also defines an all term, but its use is heavily constrained -- in particular it cannot occur within other terms, but only as the {term} of the particle which is a content model. Given this, and the fact that it does not have any simple translation into an ordinary FSA, it is appropriate that it should be handled specially. Accordingly we will have nothing further to say about such particles in this paper.
The algorithm for converting a content model to an FSA has two parts, one for particles and one for terms.
To translate a particle to an FSA ending at a state S:
[FSA diagram: the term machine runs from state n to state t; a lambda edge from t back to n allows repetition, and a lambda edge from n to S allows the term to be skipped entirely.]
For example, for min=2 max=4:
[FSA diagram: copies of the term machine in sequence from n to x cover the required occurrences; further copies from x to S cover the optional occurrences, with lambda edges allowing the remaining optional copies to be skipped straight to S.]
To translate a term to an FSA ending at a state S:
It should be noted that the non-lambda edges of the resulting automata are all labelled either with wildcard or element declaration terms.
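The particle translation Tp can be sketched as follows (in Python; this is our own simplified rendering, not the paper's implementation). States are integers, edges are (source, label, target) triples with label None for a lambda edge, terms are reduced to atomic labels, and numeric occurrence bounds are handled by unrolling as described above:

```python
UNBOUNDED = object()  # stands in for maxOccurs='unbounded'

class FSA:
    def __init__(self):
        self.next_state = 0
        self.edges = []  # (source, label, target); label None = lambda

    def new_state(self):
        s = self.next_state
        self.next_state += 1
        return s

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

def tt(fsa, term):
    """Term machine; here terms are just atomic labels (element names)."""
    s, e = fsa.new_state(), fsa.new_state()
    fsa.add_edge(s, term, e)
    return s, e

def tp(fsa, min_, max_, term):
    """Particle machine for (min_, max_, term), built by unrolling."""
    start = fsa.new_state()
    prev = start
    for _ in range(min_):                  # required copies
        s, e = tt(fsa, term)
        fsa.add_edge(prev, None, s)
        prev = e
    end = fsa.new_state()
    if max_ is UNBOUNDED:                  # one looping copy
        s, e = tt(fsa, term)
        fsa.add_edge(prev, None, s)
        fsa.add_edge(e, None, s)           # repeat
        fsa.add_edge(e, None, end)
        fsa.add_edge(prev, None, end)      # no further occurrences
    else:
        for _ in range(max_ - min_):       # optional copies
            fsa.add_edge(prev, None, end)  # stop early
            s, e = tt(fsa, term)
            fsa.add_edge(prev, None, s)
            prev = e
        fsa.add_edge(prev, None, end)
    return start, end

def accepts(fsa, start, end, seq):
    """Simple NFA run with lambda closure, for checking the construction."""
    def closure(states):
        states, frontier = set(states), list(states)
        while frontier:
            st = frontier.pop()
            for s, lab, d in fsa.edges:
                if s == st and lab is None and d not in states:
                    states.add(d)
                    frontier.append(d)
        return states
    current = closure({start})
    for sym in seq:
        current = closure({d for s, lab, d in fsa.edges
                           if s in current and lab == sym})
    return end in current
```

With min=2 max=4 over an element a, the resulting machine accepts between two and four a children, as in the example above.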
There are two circumstances in which the algorithms above produce a state with no edges leaving it, which may (although it need not) render the whole FSA unsatisfiable:
An empty choice, <xs:choice/>. This results from an empty particle.
An abstract element declaration which is not named as the substitutionGroup of any other element declaration. This results from an empty substitution group at the component level.
It should be obvious that FSAs produced by the above algorithm can be used either as-is or after determinisation and/or minimisation to do local schema-validity assessment.
XML Schema's UPA (Unique Particle Attribution) constraint on content models (http://www.w3.org/TR/xmlschema-1/#cos-nonambig in [XML Schema] ) reconstructs a constraint inherited by XML DTDs from SGML. Given an FSA constructed from a content model using Algorithm 1: Conversion to FSA , it is easy to enforce this constraint by simple manipulations and checks.
Since step 3.1 above is by construction operating on an FSA which has been determinised with respect to terms, any name collision is necessarily between two distinct terms from the original content model, thus violating UPA. Steps 3.2 and 3.3 are the obvious additions to handle wildcard-wildcard and wildcard-element declaration overlaps respectively.
Note there's nothing special in any of the above for numeric ranges. That is all handled by Algorithm 1: Conversion to FSA .
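The three collision checks can be sketched as follows (our own simplified rendering, in Python). Edge labels are modelled as ('elt', namespace, localname) for element declarations and ('wild', allowed_namespaces) for wildcards; real wildcards (##any, ##other) need a richer representation:

```python
def upa_violations(edges_by_state):
    """edges_by_state maps a state to the labels of its outgoing edges;
    returns the list of (state, label, label) collisions found."""
    bad = []
    for state, labels in edges_by_state.items():
        elts = [l for l in labels if l[0] == 'elt']
        wilds = [l for l in labels if l[0] == 'wild']
        # element vs element: two distinct edges with the same expanded name
        seen = set()
        for _, ns, local in elts:
            if (ns, local) in seen:
                bad.append((state, ('elt', ns, local), ('elt', ns, local)))
            seen.add((ns, local))
        # wildcard vs wildcard: overlapping namespace sets
        for i, w1 in enumerate(wilds):
            for w2 in wilds[i + 1:]:
                if set(w1[1]) & set(w2[1]):
                    bad.append((state, w1, w2))
        # wildcard vs element declaration: wildcard admits the element's ns
        for w in wilds:
            for e in elts:
                if e[1] in w[1]:
                    bad.append((state, w, e))
    return bad
```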
The XML Schema REC [XML Schema] defines two ways by which new complex type definitions can be derived from old, extension and restriction. Restriction, as its name suggests, is intended to define a new type, call it D for 'derived', which accepts a subset of what its base type definition (call this B) accepts. The REC as it stands does not, however, base the formal definition of restriction on an appeal to subsetting, but rather on a set of detailed constraints on the allowed relationship between the content models of B and D (see http://www.w3.org/TR/xmlschema-1/#cos-particle-restrict in [XML Schema] ). Not only is the current approach complex and difficult to understand (and implement), it is also inaccurate, in that it both allows some definitions as restrictions which accept more than their base, and bars some definitions which accept less. There is therefore strong interest in moving to a definition which correctly implements the subset story.
Two broad classes of approach to this can be imagined: One which deals explicitly with types, defined as (infinite) sets of infosets, and one which deals only with type definitions. We adopt the latter approach, as it fits neatly with the use of FSAs set out above, which we have already adopted in our XML Schema implementation [XSV] .
Since we are working in the domain of type definitions, it's appropriate to describe our goal as checking whether the content model of a type definition B subsumes that of a derived type definition D. Here's a detailed sketch of an algorithm which attempts to check subsumption of two FSAs B and D, constructed with Algorithm 1: Conversion to FSA and determinised and checked with Algorithm 2: UPA:
1. Initialise a set SS of state pairs to contain the single pair <B.initialState,D.initialState>, writing .base for the first member of a pair and .derived for the second.
2. For each pair P in SS, and for each edge DE leaving P.derived, find the single* edge BE leaving P.base which Matches DE, then add a pair <BE.end,DE.end> to SS (unless it's already there). If there is no such edge, the check fails.
*There can be only one such edge because B satisfies Unique Particle Attribution .
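The pair-propagation loop can be sketched as follows (our own simplified rendering, in Python). Both FSAs are assumed deterministic, with edges represented as {state: {label: next_state}}; matches stands in for the Matches relation, reduced here to label equality, and the final-state condition (wherever D can stop, B must be able to stop too) is made explicit:

```python
def subsumes(b_edges, b_init, b_finals, d_edges, d_init, d_finals,
             matches=lambda be, de: be == de):
    seen = {(b_init, d_init)}
    agenda = [(b_init, d_init)]
    while agenda:
        b, d = agenda.pop()
        # wherever D can stop, B must be able to stop too
        if d in d_finals and b not in b_finals:
            return False
        for d_label, d_next in d_edges.get(d, {}).items():
            candidates = [(bl, bn) for bl, bn in b_edges.get(b, {}).items()
                          if matches(bl, d_label)]
            if not candidates:
                return False   # D accepts something B does not
            (bl, b_next), = candidates  # unique, because B satisfies UPA
            if (b_next, d_next) not in seen:
                seen.add((b_next, d_next))
                agenda.append((b_next, d_next))
    return True
```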
An edge BE matches another edge DE iff one of the following three cases holds:
The above algorithm is too strict in one area, involving wildcards. A single wildcard in D may require two wildcards in B to cover it, as it were. Consider the following:
B: <xs:choice> <xs:any ns='a'/> <xs:any ns='b c d'/> </xs:choice>
vs.
D: <xs:choice> <xs:any ns='a b'/> <xs:any ns='c'/> </xs:choice>
The second clearly accepts a subset of what the first accepts, but Algorithm 3: Subsumption will reject it, because no edge in the top choice subsumes the 'a b' edge in the bottom one.
Here's the fix:
In all cases, then proceed with Algorithm 3: Subsumption as given above.
Note that to avoid violating the UPA there cannot be more than one negated wild edge leaving any state, and when there is one all the non-negated wild edges can only allow namespaces which are negated by the negated one.
The simple example above is covered by case (2) in the fixup above. Here's an example that is covered by (3) and so requires unpacking:
B: <xs:choice> <xs:seq> <xs:any ns='a'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='not(a)'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
versus
D: <xs:choice> <xs:seq> <xs:any ns='b'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='not(b)'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
The rules for unpacking give us a transformation in two steps:
First D becomes
<xs:choice> <xs:seq> <xs:any ns='b'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='q a'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
then
<xs:choice> <xs:seq> <xs:any ns='b'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='q'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='a'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
and it should be clear that the subsumption check will now succeed, as it should.
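The splitting step of the fixup can be sketched as follows (our own rendering, in Python). Namespace sets are plain Python sets of names; negated (not(...)) wildcards, which require the unpacking shown above, are not modelled here:

```python
def split_wildcard(d_namespaces, b_wildcards):
    """Partition d_namespaces so each piece is a subset of some member of
    b_wildcards; return the pieces, or None if a namespace is uncovered."""
    pieces = []
    remaining = set(d_namespaces)
    for b_ns in b_wildcards:
        piece = remaining & set(b_ns)
        if piece:
            pieces.append(frozenset(piece))
            remaining -= piece
    return pieces if not remaining else None
```

With the simple example above, the derived wildcard ns='a b' splits into the pieces {'a'} and {'b'}, each covered by one of the base wildcards.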
So far we've been looking at very local properties of content models, considering the elements accepted only in terms of their names. To complete the picture, we need to check a few other properties of the element declarations involved, and place a recursive constraint on their type definitions. For this we need to look at some other properties of element declarations:
Each of these properties needs to be checked to ensure valid restriction. The following should be considered as additions to clause (1) in the definition of Matches above:
Clause (1) above is the point at which the necessary recursion into the children of an element and their types is invoked---each step of the derivation chain will itself be checked by Algorithm 3: Subsumption .
There is one place where the move beyond local issues discussed immediately above interacts with our earlier treatment of wildcards, and the fact that that treatment ignores the {process contents} property. This leads to problems in certain corner cases which arise when B has a wildcard and D has one or more explicit element declarations:
B: <any ns='xyzzy' processContents='strict'/>
D: <elt name='{xyzzy}foo' type='xs:decimal'/>
when there is a global
<elt name='{xyzzy}foo' type='xs:integer'/>
element declaration available.
As defined above, Algorithm 3: Subsumption allows this, via clause (2) of the definition of Matches . But this is wrong, since D accepts things not accepted by B, for example <foo>3.1415</foo>.
To fix this we need to both take account of {process contents}, and extend the consideration of type definitions from clause (1) to clause (2) of Matches .
In clause (2) of Matches , we have a wildcard in B which corresponds to an element declaration in D. Let EEN be the expanded name (that is, the {local name} and {namespace name}) of the explicit element declaration, and ET be its {type definition}.
If the wildcard doesn't allow EEN's namespace name, then clause (2) fails right away.
If it does allow it, then we need to consider sub-cases based on the wildcard's {process contents}:
Basically this addition to (2) takes account of the fact that lax and skip wildcards may function as if they were element declarations.
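The extended clause (2) can be sketched as follows (our own rendering, with deliberately simplified type handling). global_decls maps expanded names to type names, and derives_from stands in for valid type derivation, reduced here to equality:

```python
def wildcard_matches_element(wildcard_ns, process_contents, elt_name,
                             elt_type, global_decls,
                             derives_from=lambda t, b: t == b):
    ns, local = elt_name
    if ns not in wildcard_ns:
        return False                # clause (2) fails right away
    if process_contents == 'skip':
        return True                 # skip: no declaration is consulted
    decl_type = global_decls.get(elt_name)
    if decl_type is None:
        # lax: absence of a declaration is fine; strict: it is not
        return process_contents == 'lax'
    # strict (or lax with a declaration): the derived element's type must
    # validly derive from the globally declared one
    return derives_from(elt_type, decl_type)
```

This mirrors the {xyzzy}foo example: with a global foo of type xs:integer, a derived foo of type xs:decimal is rejected under a strict wildcard, as it should be.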
Algorithm 1: Conversion to FSA is polynomial in time and exponential in space with respect to the size of the original content model considered as a regular expression.
The time complexity of determinisation is quadratic in the number of states. The time complexity of the UPA checking part of Algorithm 2: UPA has a linear term with respect to the size of the input FSA and an exponential term with respect to the maximum fanout of the input FSA.
The time complexity of Algorithm 3: Subsumption is linear with respect to the size of the FSA for D and the product of the maximum fanouts of the two FSAs. It's not exponential because both FSAs are known to respect the Unique Particle Attribution constraint, which is sufficient to render the pseudo-parsing deterministic in all cases. The exponential space requirement arises because of the unfolding approach to numeric exponents of Algorithm 1: Conversion to FSA .
The work reported here was supported in part by a W3C Fellowship for the first author and a research contract from Microsoft.
The algorithms presented here have been discussed by the W3C XML Schema Interest Group [XML Schema IG] , and a number of improvements derive from comments made there. Particular thanks to Sandy Gao of IBM for identifying several flaws in an early version.