XPath for lxtransduce users

Datatypes and expressions

XPath has four datatypes: string, number, boolean and node-set. A node is an XML element, attribute, or piece of text (pcdata). The absence of a single-node datatype may seem strange, but you can usually ignore it; where you might expect an expression to return a node, it instead returns a node-set with only one member.

All expressions are evaluated with respect to a context node. In lxtransduce, the context node is the element currently being matched or, in character-level grammars, a fake text node containing the characters matched by a regular expression.

Operators

The XPath arithmetic operators are + - * and div. Slash (/) is not used for division because of its use in paths. All arithmetic is done in floating point.

The comparison operators are = != < <= > and >=. When comparisons are done, the operands are converted to a common datatype: in particular, nodes are often compared as strings. There is a peculiarity about comparisons involving node-sets: the comparison is true if it is true for any of the nodes in the set. So if a and b are node-sets, both a = b and a != b may be true! This is never a problem if both node-sets have only one member.
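
The "true for any of the nodes" rule can be modelled in a few lines of Python. This is only a sketch of the semantics, not of how lxtransduce evaluates XPath: node-sets are represented as plain lists of string-values, and the helper names ns_eq and ns_ne are invented for the illustration.

```python
from itertools import product

def ns_eq(a, b):
    """Model of a = b: true if some pair of nodes has equal string-values."""
    return any(x == y for x, y in product(a, b))

def ns_ne(a, b):
    """Model of a != b: true if some pair of nodes has different string-values."""
    return any(x != y for x, y in product(a, b))

a = ["noun", "verb"]  # string-values of the nodes in a
b = ["noun"]          # string-values of the nodes in b

# With a two-member set, both comparisons hold at once.
both_true = ns_eq(a, b) and ns_ne(a, b)

# With one node on each side, they can never both hold.
single = ["noun"]
never_both = ns_eq(single, b) and ns_ne(single, b)
```

An empty node-set makes both comparisons false in this model, because there is no pair of nodes to satisfy either test.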

Lxtransduce provides an extension to standard XPath for matching strings against regular expressions; this is not available in XSLT. The tilde operator (~) takes a string as its left-hand operand and a regular expression (also written as a string) as its right-hand operand.

Paths

Nodes from an XML document are matched or selected by paths. Paths look like unix filenames: a/b/c for example. When used to match a node (in the match attribute of an XSLT <template> or an lxtransduce <query> rule for example), a/b/c matches any <c> element that is the child of a <b> element that is the child of an <a> element. When used to select nodes (in the select attribute of an XSLT <apply-templates> call or the test attribute of an lxtransduce <constraint> for example) it selects all the <c> children of all the <b> children of all the <a> children of the context node. Star (*) matches any element name.
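
Selection along a path can be tried outside lxtransduce with Python's standard xml.etree.ElementTree module, which supports a small subset of XPath. The following is a sketch under that assumption; the sample document and its element contents are made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<a><b><c>one</c><c>two</c></b><b><d>skip</d></b></a>"
    "<a><b><c>three</c></b></a>"
    "<x><b><c>not under a</c></b></x>"
    "</root>"
)

# With <root> as the context node, a/b/c selects every <c> child
# of every <b> child of every <a> child.
selected = [c.text for c in doc.findall("a/b/c")]

# Star matches any element name, so */b/c also reaches the <c> under <x>.
with_star = [c.text for c in doc.findall("*/b/c")]
```

Here selected holds the three <c> texts found under <a> elements, while with_star additionally picks up the <c> under <x>.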

The parts of a path separated by slashes are referred to as steps.

Attributes are distinguished from elements by the use of an @ sign. @x selects the x attribute of the context node. a/@x selects the x attributes of all the <a> children of the context node. Since attributes don't have children, only the last step in a path can be an @ step.

The context node is referred to by a dot (.). The parent of a node is referred to by dot-dot (..). Again, this is similar to unix filenames. When a node is compared with a string, its string-value is used, which means the concatenation of all its text descendants. In lxtransduce, the elements that rules apply to typically contain just text, so . can be used to refer to that text in comparisons.
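
The string-value of an element can be computed by hand to see what a comparison against . actually tests. This sketch again uses Python's xml.etree.ElementTree; the helper name string_value and the sample markup are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def string_value(elem):
    """Concatenate all text descendants of elem, in document order."""
    return "".join(elem.itertext())

s = ET.fromstring("<s><w>hello</w> <w>wor<b>ld</b></w></s>")
words = s.findall("w")

first = string_value(words[0])   # a <w> containing just text
second = string_value(words[1])  # text split across a nested element
whole = string_value(s)          # includes the space between the two <w>
```

The nested <b> does not affect the result: the second <w> still has the string-value "world".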

Predicates

The nodes matched or selected by a path can be restricted by predicates. These are tests that can be applied to a step by placing them in square brackets. For example, a[@b = "foo"] matches <a> elements with a b attribute whose value is "foo". Within a predicate, dot refers to the node matched by the step, so w[. = "hello"] matches a <w> element whose text content is hello.
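
Both kinds of predicate shown above fall within the XPath subset of Python's xml.etree.ElementTree (the text test needs Python 3.7 or later), so they can be checked with a short sketch. The sample elements are invented.

```python
import xml.etree.ElementTree as ET

s = ET.fromstring(
    '<s>'
    '<a b="foo">first</a>'
    '<a b="bar">second</a>'
    '<w>hello</w>'
    '<w>goodbye</w>'
    '</s>'
)

# a[@b = "foo"]: <a> elements whose b attribute is "foo".
by_attribute = [e.text for e in s.findall("a[@b='foo']")]

# w[. = "hello"]: <w> elements whose text content is hello.
by_text = [e.text for e in s.findall("w[.='hello']")]
```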

A step can have multiple predicates, and predicates can contain several tests combined with and and or. For example, a[. = "may"][@cat = "verb"] and a[. = "may" and @cat = "verb"] both match <a> elements whose text content is may and whose cat attribute is verb.

The function position() returns the position of an element amongst the other matching elements. So for example w[position() = 1] matches only the <w> element that is first amongst its <w> siblings. The function last() returns the number of matching elements, so that w[position() = last()] matches only the last <w>. As an abbreviation, when the value of a predicate is a number, it is implicitly compared against the value of position(), so that w[1] and w[last()] are equivalent to the earlier versions. (If you want to combine a position test with something else, you will have to write it out in full, for example w[position() = 1 and @cat = "det"].)

Note that positions are among the nodes matching any previous predicates, so w[@cat="verb"][1] means the first <w> whose category is verb, but w[1][@cat="verb"] means the first <w>, provided its category is verb.
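
The effect of predicate order can be modelled by treating each predicate as a filter applied to whatever the previous one left. This is a plain-Python sketch of that idea rather than an XPath engine: each <w> is represented as a (text, cat) pair, and the helpers where_cat and at_position are hypothetical names.

```python
# The <w> siblings, in document order, as (text, cat) pairs.
words = [("the", "det"), ("dog", "noun"), ("barks", "verb"), ("runs", "verb")]

def where_cat(nodes, cat):
    """Model of [@cat="..."]: keep the nodes with that category."""
    return [n for n in nodes if n[1] == cat]

def at_position(nodes, k):
    """Model of [k]: keep the k-th (1-based) of the remaining nodes."""
    return nodes[k - 1:k]

# w[@cat="verb"][1]: filter by category first, then take the first survivor.
first_verb = at_position(where_cat(words, "verb"), 1)

# w[1][@cat="verb"]: take the first <w>, then keep it only if it is a verb.
first_if_verb = where_cat(at_position(words, 1), "verb")
```

With this data the first expression yields the word barks, while the second yields nothing because the first <w> is a determiner.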

Functions

There are numerous other functions in addition to position() and last(), which have already been mentioned. string(nodes) converts its node-set argument to a string by returning the string value of the first node in the set. not(x) returns the negation of x, so not(. ~ "^[a-z]+$") is true if the node's string value does not consist entirely of lower-case letters. count(nodes) returns the number of nodes in the set. Lxtransduce has an extension function join(nodes, separator) which returns a string consisting of the string values of all the nodes in the set joined by the string separator. There are several XPath functions that have not yet been implemented in lxtransduce.
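
The behaviour of string(), count() and join() described above can be mimicked in Python to make the differences concrete. This is a sketch of the described semantics only: node-sets are modelled as lists of string-values in document order, and the xp_ function names are invented.

```python
def xp_string(nodes):
    """Model of string(nodes): string-value of the first node ("" if empty)."""
    return nodes[0] if nodes else ""

def xp_count(nodes):
    """Model of count(nodes): the number of nodes in the set."""
    return len(nodes)

def xp_join(nodes, separator):
    """Model of the join(nodes, separator) extension described above."""
    return separator.join(nodes)

cats = ["noun", "verb", "adj"]  # e.g. the string-values of some <cat> nodes

first_only = xp_string(cats)  # only the first node contributes
how_many = xp_count(cats)
joined = xp_join(cats, "|")   # every node contributes
```

The contrast between first_only and joined is the practical reason for join(): string() silently ignores all but the first node.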

Quoting

A literal string can be written with either kind of quote. XPaths usually appear in attributes, and of course these have to be quoted too. You can use the usual XML entities &quot; and &apos; for quotes in your XPaths, but remember that these are expanded by the XML parser, before the XPath string is parsed. There is no way to escape characters at the XPath level, so you can't have a literal XPath string containing both kinds of quote.

If you need a quote in a string or regular expression, you will have to use the other kind of quote around the string. You will also have to use an entity reference for whichever one of them you use around the whole attribute value. So if you want to compare an element against the word don't, you will need to use double quotes around the string, and escape it either as match="w[. = &quot;don't&quot;]" or as match='w[. = "don&apos;t"]'.
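
That the XML parser expands entity references before the XPath is read can be checked with any XML parser. The sketch below uses Python's xml.etree.ElementTree as a stand-in for the parser inside lxtransduce; it only parses the attribute and does not evaluate the XPath.

```python
import xml.etree.ElementTree as ET

# Attribute in double quotes: the inner double quotes must be &quot;.
double_quoted = ET.fromstring('''<query match="w[. = &quot;don't&quot;]"/>''')

# Attribute in single quotes: the apostrophe in don't must be &apos;.
single_quoted = ET.fromstring("""<query match='w[. = "don&apos;t"]'/>""")

# What an XPath parser would receive after entity expansion.
xpath_a = double_quoted.get("match")
xpath_b = single_quoted.get("match")
```

Both attribute values come out as the same XPath string, w[. = "don't"], which is what the XPath parser then sees.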

Examples

Match elements called w:

<query match="w"/>

Match elements that have a cat attribute:

<query match="*[@cat]"/>

Match elements called w that have a cat attribute (a predicate whose value is a node-set is true if the node-set is non-empty):

<query match="w[@cat]"/>

Match elements called w that have a cat attribute whose value is noun:

<query match="w[@cat = 'noun']"/>

Match elements called w whose text content is hello:

<query match="w[. = 'hello']"/>

Match elements called w whose text content consists only of letters:

<query match="w[. ~ '^[a-zA-Z]+$']"/>

Match the first w element:

<query match="w[1]"/>

Match the last w element:

<query match="w[last()]"/>

Match the first w element, if its cat attribute is det:

<query match="w[1][@cat = 'det']"/>

or

<query match="w[position() = 1 and @cat = 'det']"/>

Bind an lxtransduce variable to the cat attribute of the current element:

<var name="c" value="@cat"/>

Lexicon examples

The lxtransduce <lexicon> element defines a new XPath function that looks up the current element's text content in the lexicon. The name of the function is given by the name attribute of the <lexicon> element. The function returns (a node-set containing) the element in the lexicon for that word; it usually has one or more <cat> children whose text contents are the names of the lexical categories applying to the word.

Test that the word has lexical category noun (and possibly others), that is, check that it has a <cat> child with text value noun:

<query match="w" constraint="lex()/cat = 'noun'"/>

or

<query match="w" constraint="lex()[cat='noun']"/>

(The XPath in the first version tests whether any <cat> children of the entry have a text value noun; the XPath in the second version returns a node-set containing the entry if it has a <cat> child whose text value is noun, which has the same effect since the constraint will be satisfied if the node-set is non-empty.)

Test that the word has lexical category noun and no others, that is, check that it has a <cat> child with text value noun and that it has only one <cat> child:

<query match="w" constraint="lex()[cat='noun' and count(cat) = 1]"/>

Test that the word has a lexical category that is not noun (maybe having noun as well), that is, check that it has a <cat> child whose text value is not noun:

<query match="w" constraint="lex()[cat != 'noun']"/>

Test that the word does not have lexical category noun, that is, check that it has no <cat> children with text value noun:

<query match="w" constraint="not(lex()/cat = 'noun')"/>
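
The four constraints above differ only in how they treat words with several categories, which is easiest to see side by side. The following Python sketch models them over made-up lexicon entries; the element name <lex>, the sample words, and the function names are all assumptions for the illustration, and the real entries come from whatever the <lexicon> element loads.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon entries, each with one or more <cat> children.
entries = {
    "dog": ET.fromstring("<lex><cat>noun</cat></lex>"),
    "run": ET.fromstring("<lex><cat>noun</cat><cat>verb</cat></lex>"),
    "eat": ET.fromstring("<lex><cat>verb</cat></lex>"),
}

def cats(word):
    """Text values of the <cat> children of the entry for word."""
    return [c.text for c in entries[word].findall("cat")]

def has_noun(word):      # lex()/cat = 'noun'
    return any(c == "noun" for c in cats(word))

def only_noun(word):     # lex()[cat='noun' and count(cat) = 1]
    return has_noun(word) and len(cats(word)) == 1

def has_non_noun(word):  # lex()[cat != 'noun']
    return any(c != "noun" for c in cats(word))

def lacks_noun(word):    # not(lex()/cat = 'noun')
    return not has_noun(word)

table = {
    w: (has_noun(w), only_noun(w), has_non_noun(w), lacks_noun(w))
    for w in entries
}
```

For the ambiguous word run, both has_noun and has_non_noun are true, which is why cat != 'noun' cannot be used to mean "is not a noun"; the not(...) form is needed for that.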