Biography
Henry S. Thompson is Managing Director of Markup Technology Ltd. He is also Reader in Artificial Intelligence and Cognitive Science in the Division of Informatics at the University of Edinburgh, based in the Language Technology Group of the Human Communication Research Centre. He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. His research interests have ranged widely, including natural language parsing, speech recognition, machine translation evaluation, modelling human lexical access mechanisms, the fine structure of human-human dialogue, language resource creation and architectures for linguistic annotation. His current research is focussed on articulating and extending the architectures of XML. Markup Technology is focussed on delivering the power of XML pipelines for application development. Henry was a member of the SGML Working Group of the World Wide Web Consortium which designed XML, is the author of XED, the first free XML instance editor, is co-author of the LT XML toolkit, and is currently a member of the XSL and XML Schema Working Groups of the W3C. He is also a member of the W3C Team, and is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSL and XML Schemas in both industrial and public settings over the last five years.
This paper defines a logic in which to express constraints on W3C XML Schema components and relationships between components and XML infoset items, for use in a formal rewriting of the W3C XML Schema Recommendations. The logic is essentially a constraint language over path expressions, interpreted equally with respect to an Infoset graph or a schema component graph. By 'logic' I mean the traditional three-part story consisting of a sentential form, a model theory and an interpretation.
Introduction
Illustrative rewrite
A preliminary sketch of the logic
Sentential form
The Model
Signatures
Interpretation and satisfaction
Open questions and next steps
Component-valued properties identified by name
Focussing on Infosets
Bibliography
In the period after the publication of [XML Schema] the W3C's XML Schema Working Group received frequent requests to tighten up the normative definitions of schema components, both in terms of the constraints on their well-formedness and their validation semantics. And as our experience has grown the Working Group has wished to clarify the fundamental nature of schemas as such, for instance when discussing a possible move to subsumption-based restriction checking (see [FSA for Schema Restriction] ).
I have thought for some time that the basis for a response to these requests lies in the work of Rounds, Moshier, Johnson, Carpenter and others on a logic of feature structures. This paper is a preliminary sketch of what such a logic might look like, and how it might be used, intended to provide sufficient detail to assess the likely value of the idea, without being anywhere near complete.
The basic idea is to define a logic in which to express constraints on components and relationships between components and infoset items. By 'a logic' I mean a pretty traditional three-part story:
A brief trivial example may help explain what these three things are. First, a BNF to define our WFFs:
Some examples of WFFs are then
Our models are very simple -- a model consists of a set of individuals (the domain) and a set of named disjoint subsets of that set (the colourings), e.g.
Finally, the constructive definition of validity. This is expressed in terms of a mapping from WFFs to conditions on the model. We use double square-brackets 〚 〛 to denote this mapping. We treat our birds and stuff as variables, that is, they have an arbitrary association with individuals in a model. We denote the variable assignment function which establishes this assignment using the Greek letter iota (ι). The mapping rules then are as follows, using X and Y as meta-variables over parts of WFFs:
So, given the specific model above and a variable assignment which maps 'doves' to 'f', we get, for the last example WFF above, the interpretation:
( f ∈ {a, b, c} ) ∨ ( f ∉ {a, b, c} )
which is evidently true. Furthermore, for models in which every individual has a colour, it should be clear that it's a tautology, that is, it's true in all such models for any variable assignment.
This may serve to explain why Frege was not being childish when he, famously, said
〚Snow is white〛 is true just in case 〚snow〛 is 〚white〛 [punctuation added :-]
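Before moving on to XML Schema proper, here, purely by way of illustration, is a minimal executable Python sketch of the same three-part story for the trivial example. The particular individuals, the colouring names 'red' and 'blue' and the tuple encoding of WFFs are assumptions of the sketch, chosen only to be consistent with the example interpretation above:

# Purely illustrative: one possible encoding of the trivial logic above.
# Model: a domain of individuals plus named, disjoint colourings.
domain = {"a", "b", "c", "d", "e", "f"}
colourings = {"red": {"a", "b", "c"}, "blue": {"d", "e", "f"}}

# Variable assignment (iota): variables have an arbitrary association
# with individuals in the model.
iota = {"doves": "f"}

# WFFs as nested tuples: ("is", X, C), ("not", W), ("and", W1, W2), ("or", W1, W2).
def interpret(wff):
    op = wff[0]
    if op == "is":                  # [[X is C]]: iota(X) is a member of colouring C
        _, x, c = wff
        return iota[x] in colourings[c]
    if op == "not":
        return not interpret(wff[1])
    if op == "and":
        return interpret(wff[1]) and interpret(wff[2])
    if op == "or":
        return interpret(wff[1]) or interpret(wff[2])
    raise ValueError(f"unknown operator: {op!r}")

# The last example WFF above, (f in {a,b,c}) or (f not in {a,b,c}):
# true in this model, and a tautology in any model where every
# individual has a colour.
print(interpret(("or", ("is", "doves", "red"),
                 ("not", ("is", "doves", "red")))))    # True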
So the idea for XML Schema is to define a sentential form which can express all component constraints and validation rules, a model which formalises components and the PSVI, and an interpretation which maps WFFs from the sentential form onto set-theoretic constraints on the model. We won't do anywhere near all the work ourselves -- the sentential form is a modest extension of the standard sentential form for first-order predicate calculus with equality, and the model and interpretation bootstrap from both first-order logic and graph theory.
Before beginning the formal definitions required, the next section provides an extended worked example.
What follows is a rewrite, in the proposed new approach, of the Schema Component Constraint: Complex Type Definition Properties Correct from [XML Schema]. The original looks like this:
The rewrite would look something like this, with informal glosses on formal statements of the constraints. We use a variable C to stand for the type definition node itself, and double vertical bars to signal de-referencing of a QName, with the sort of component indicated by a subscript, e.g. a subscript t for a type definition. We miss out the first constraint, which will be handled elsewhere via signatures for the sorts:
We use the same sentential form as standard first order predicate calculus with equality, replacing the standard atoms with paths. A path is a specification of a route through the component/PSVI graph, denoting a (possibly empty) set. It starts from a graph node, and takes any number of property-name-labelled steps (∙ is used to join steps, * indicates transitive closure):
For example, here's a familiar path, assuming S is a node variable which contains a simple type definition:
S∙base type definition∙variety
This denotes the {variety} (atomic, list or union) of the {base type definition} of S.
The optional star indicates transitive closure, so for example
S∙base type definition*∙name
would denote a set consisting of the {name}s of every type in the {base type definition} chain up from S.
The double vertical bar indicates QName lookup, with the sort key identifying the sort of the named component. The question of where namespace bindings come from for this remains to be addressed. For example, here's a path from a top-level element named my:doc to its {type definition}:
‖my:doc‖e∙type definition
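As a purely illustrative aid (not part of the formal proposal), here is a small Python sketch of how such paths might be evaluated over a toy graph in which nodes are dicts from property names to lists of values. The node contents, the reading of * as one or more steps, and the QName lookup table are all assumptions of the sketch:

# Purely illustrative: paths evaluated over a toy property graph in which a
# node is a dict from property names to lists of target nodes or atomic values.

def step(nodes, prop):
    # One property-name-labelled step: follow 'prop' from every node in the set.
    out = []
    for n in nodes:
        out.extend(n.get(prop, []))
    return out

def closure(nodes, prop):
    # The '*' operator, read here as one or more 'prop' steps.
    seen = []
    frontier = step(nodes, prop)
    while frontier:
        seen.extend(frontier)
        frontier = [m for m in step(frontier, prop) if m not in seen]
    return seen

# A toy chain of simple type definitions: myType -> baseType -> anySimpleType.
any_simple = {"name": ["anySimpleType"]}
base = {"name": ["baseType"], "variety": ["atomic"],
        "base type definition": [any_simple]}
S = {"name": ["myType"], "variety": ["atomic"],
     "base type definition": [base]}

# S . base type definition . variety
print(step(step([S], "base type definition"), "variety"))     # ['atomic']

# S . base type definition* . name
print(step(closure([S], "base type definition"), "name"))     # ['baseType', 'anySimpleType']

# ||my:doc||e . type definition, with QName lookup via an assumed table of
# top-level element declarations.
elements_by_qname = {"my:doc": {"name": ["doc"], "type definition": [S]}}
print(step([elements_by_qname["my:doc"]], "type definition") == [S])   # True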
The model starts from the standard formulation of a labelled directed graph, and adds some bells and whistles we'll need. A model is a model for a component language. A component language is a four-tuple <S, P, A, G>:
Given a component language, a model for that language is a triple <N, E, L>:
This model does not represent sets as such; rather, they are represented by allowing multiple edges with the same label. That is, a property such as [children] will appear in our model as zero or more child edges. Indexed edge labels are there to provide for sequences, so that for instance a model group with two particles would be represented in the model by a node with edges labelled <particle,1> and <particle,2>.
We constrain our models to use 1-origin indexing and to make coherent use of indexed labels as follows:
∀ <n, <p, i>, m> ∈ E , (i > 0) ∧ (∀ j ∋ 0 < j < i , ∃ m′ ∋ <n, <p, j>, m′> ∈ E)
We constrain node labels so that atomic values label leaf vertices and sorts label the rest:
∀ n ∈ N , (L(n) ∈ A) ≡ ¬ (∃ l, m , <n, l, m> ∈ E)
This approach (instead of just partitioning N into sorted internal vertices and atomic values) is perhaps overkill -- it allows us to represent bags of atomic values, which I don't believe is required by anything in the PSVI or component inventory.
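By way of illustration only, the following Python sketch encodes a tiny model in this style, with a toy model group node, and checks the two constraints just stated. The particular sort and atomic-value vocabularies, and the use of None for un-indexed labels, are choices of the sketch rather than part of the proposal:

# Purely illustrative: a tiny model <N, E, L> with indexed edge labels,
# and checks for the two constraints above.
SORTS = {"Model Group", "Particle"}
ATOMS = {"sequence", "1"}

N = {"g", "p1", "p2", "comp", "occ"}
L = {"g": "Model Group", "p1": "Particle", "p2": "Particle",
     "comp": "sequence", "occ": "1"}
E = {("g", ("compositor", None), "comp"),
     ("g", ("particle", 1), "p1"),
     ("g", ("particle", 2), "p2"),
     ("p1", ("min occurs", None), "occ"),
     ("p2", ("min occurs", None), "occ")}

def indexing_coherent(E):
    # Every indexed edge <n, <p, i>, m> has i > 0 and, for each 0 < j < i,
    # some edge labelled <p, j> leaving the same node (1-origin, no gaps).
    for (n, (p, i), _m) in E:
        if i is None:
            continue
        if i <= 0:
            return False
        for j in range(1, i):
            if not any(n2 == n and lab == (p, j) for (n2, lab, _m2) in E):
                return False
    return True

def labels_coherent(N, E, L):
    # Atomic values label exactly the leaf nodes; sorts label the rest.
    for n in N:
        has_outgoing = any(src == n for (src, _lab, _tgt) in E)
        if (L[n] in ATOMS) == has_outgoing:
            return False
    return True

print(indexing_coherent(E), labels_coherent(N, E, L))     # True True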
Signatures are the way we specify what properties go with what sorts, and what we can expect about their values. For example,
corresponding to the definition of the Attribute Use component in [XML Schema].
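The worked signature itself is not reproduced above. Purely as an illustration of the idea, here is one possible Python rendering of what a signature for Attribute Use might record, using the property inventory of the Recommendation ({required}, {attribute declaration}, {value constraint}); the column layout is an invention of this sketch, not the five-place form used in the completeness discussion below:

# Purely illustrative: one possible shape for the signature function G,
# listing the properties of the Attribute Use sort and what to expect of
# their values. Property names follow the XML Schema Recommendation; the
# columns (required?, component-valued?, value sort) are choices of this sketch.
G = {
    "Attribute Use": [
        ("required",              True,  False, "boolean"),
        ("attribute declaration", True,  True,  "Attribute Declaration"),
        ("value constraint",      False, False, "value constraint pair"),
    ],
}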
We define the relationship between the sentential form and model in terms of the normal satisfaction relation, written with the double-turnstile character (⊧). Here are enough examples of what we add to the rules we inherit from first-order predicate logic with equality to make the project clear, I hope, without pretending to be complete.
global.
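Since those example clauses are not reproduced above, the following is a rough, purely illustrative sketch of the general shape they might take once atoms are paths denoting (possibly empty) sets. The path encoding and the reading of path equality as equality of denoted sets are assumptions of the sketch:

# Purely illustrative: satisfaction of a path-atom, given an evaluation of
# paths to the (possibly empty) sets they denote (here just Python lists).
def eval_path(iota, path):
    # A path is (variable, property, property, ...); nodes are dicts from
    # property names to lists of values, as in the path sketch above.
    nodes = [iota[path[0]]]
    for prop in path[1:]:
        nodes = [v for n in nodes for v in n.get(prop, [])]
    return nodes

def satisfies_equal(iota, p1, p2):
    # One possible clause:  iota |= p1 = p2  iff the two paths denote the same set.
    return eval_path(iota, p1) == eval_path(iota, p2)

# e.g. with S bound to a simple type definition whose {variety} matches
# that of its {base type definition}:
S = {"variety": ["atomic"],
     "base type definition": [{"variety": ["atomic"]}]}
print(satisfies_equal({"S": S},
                      ("S", "variety"),
                      ("S", "base type definition", "variety")))    # True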
This section surveys the areas in which work remains to be done before the logic proposed here might be ready for use in a revised version of [XML Schema] .
As presented above, the logic makes no provision for anything less than complete schemas, with no dangling references -- indeed, the whole issue of schema construction has been ignored. If we want to use this approach to address schema construction and composition, we need to look more closely at this area.
Ignoring annotation-related properties and the schema component itself, [XML Schema] has the following component-valued properties today (* means that when a schema is constructed from a schema document, the value is always established by dereferencing a QName; ? means it may be established by dereferencing):
There are actually only a small number of cases here:
There are a number of ways we could deal with QName resolution and laziness -- I've picked one to illustrate how it would be formalised.
Suppose the signature of all * and ? properties has in fifth (value sort) place not only every 'real' value sort but also an expanded-name micro-component placeholder sort with a related name (e.g. Element Declaration Name goes with Element Declaration). Now we can actually define what it means for a schema to be complete:
We say a model <N, E, L> is complete if and only if
∀ o ∈ N , L(o) ∈ S ⇒ ¬ ∃ p, x, a, s , (<p, x, component, a, s> ∈ G(L(o)) ∧ s ∈ S_N)
where by S_N I mean the set of 'placeholder' components defined above. This formalises the notion of completeness as the state in which no component-valued property of any component has as its value a placeholder component.
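As an illustration only, the following Python sketch implements that prose reading directly over the model graph; the sort names, the placeholder sorts standing in for S_N, and the edge representation are toy assumptions of the sketch:

# Purely illustrative: the prose reading of completeness, namely that no
# component-valued property of any component has a placeholder
# (expanded-name) component as its value.
SORTS = {"Element Declaration", "Complex Type Definition"}
PLACEHOLDER_SORTS = {"Element Declaration Name", "Type Definition Name"}

def is_complete(N, E, L):
    for o in N:
        if L[o] not in SORTS:
            continue                       # only component nodes are checked
        for (src, _label, tgt) in E:
            if src == o and L[tgt] in PLACEHOLDER_SORTS:
                return False               # a QName reference is still dangling
    return True

# A schema in which an element declaration's {type definition} is still
# just an expanded name rather than a type definition component:
N = {"e", "tname"}
L = {"e": "Element Declaration", "tname": "Type Definition Name"}
E = {("e", ("type definition", None), "tname")}
print(is_complete(N, E, L))                # False: not yet complete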
The design presented here intentionally covers both schema components and infoset items. Some of its complexity derives from needing to cover both -- it would be worth exploring a logic for the infoset alone, comparable to [The Essence of XML].