Lxtransduce, a replacement for fsgmatch

Richard Tobin, 2004–2005

"Safe if used as directed."

What is it?

Lxtransduce is a replacement for fsgmatch. We need a replacement so that we have control of the source and can extend it for our own needs.

Status

Lxtransduce now provides most of the functionality of fsgmatch, and more.

The program lxconvfsgmatch can be used to convert old fsgmatch rule files to the new format. There are several constructs it does not handle, but it reports when this is the case. Please check its output carefully.

New!

You can now use the mult attribute on all match elements (except ones that are themselves repetition constructs).
There is an explicit <repeat> element.
<all> is now a synonym for <and>.
<or> is now a synonym for <first>.
There is a <not> element that succeeds if its child rule fails.

To Do

Handle encodings reasonably for non-XML files.
Improve conversion tool for old fsgmatch rule files.
Unify lxtransduce variables with XPath variables. Done!
Allow arguments to be passed in to rules. Done!
Add and element. Done!
Add not element. Done!
Improve efficiency:
- Don't copy nodes so much.
- Don't construct rewrites that are never used.
- Do more at compile time: parsing rewrite and attrs strings.
Check for memory leaks.
Provide a way to pass in parameters on the command line.

Processing model

The input to lxtransduce may be either a plain text file or an XML file. Grammars may be defined over characters (character-level grammars) or over XML elements (xml-level grammars).

Only character-level grammars can be applied to plain text files. The input is taken a line at a time, and the top-level rule applied starting at successive character positions of each line, skipping over matched text (except for suppressed right context). Text matched by a rule is replaced by the rule's rewrite; unmatched characters are output unchanged.

Both kinds of grammar can be applied to XML files. When processing an XML file, a query is specified on the command line (with -q) to select the elements on which the processor works - the equivalent of lines. When a character-level grammar is applied to an XML file, the text children of the matching elements play the role of lines. In the common initial case of a file of text wrapped in a single element, the text is thus effectively a single line.

When an XML-level grammar is applied, the child elements of the selected elements play the role of characters. The top-level rule is applied starting at successive child elements.

There is usually no backtracking between rules: a <ref> with * or + multiplicity matches the longest sequence it can, without regard for later matches. However, the <backtrack> construct allows a <ref> to backtrack until another rule succeeds, and there is an explicit <repeat-until> construct that consumes matching items until an end condition is met.

Whether the query is applied recursively is controlled by a command-line option (-r). For example, if the query is s and the input contains nested <s> elements, only the children of the outermost <s> elements are processed if -r is not specified, but the children of the inner <s> elements are processed if -r is specified. Outer matches are processed before inner ones: this makes a difference if the rewrites introduce new matching elements.

Rule file format

The fsgmatch rule file format is rather inconsistent, and has been changed. For example, in fsgmatch almost everything is a <REL>, and the match attribute means various different things depending on the value of the type attribute.

A rule file (or grammar file) contains a set of rules, each of which may match part of the input and has a rewrite which replaces the matched text. Rules may have simple regular-expression matches, or they may contain references to other rules in sequences and disjunctions. The rewrite of a rule may use variables to refer to the rewrites of the rules it references.

The rule file DTD is called lxtransduce.dtd. You can put a copy of it in the directory with your rule files, or you can set an environment variable to enable XML catalog processing in which case it will be found from the catalog directory. In either case, use <DOCTYPE rules SYSTEM "lxtransduce.dtd"> to refer to it. See running lxtransduce for more details.

<rules>

The top-level element of the rule file must be <rules>, and it must have an apply attribute specifying the name of the top-level rule. To apply more than one rule, specify a rule containing a disjunction. The apply rule can be overridden on the command line with the -a rule argument. The optional type attribute controls whether the file contains character- or xml-level rules. It takes the values xml and plain, defaulting to plain. The optional name attribute has no effect whatsoever.

<rule>

The children of the <rules> element are <rule> and <lexicon> elements. Each <rule> must have a name attribute. Names must be unique and follow the rules for XML names (essentially letters and digits, starting with a letter). A <rule> typically has a rewrite attribute, specifying the output of the rule. If omitted, the rewrite is the rewrite of the <rule>'s child element. A rule may have match and constraint attributes, which are just a shorthand described below.

The (single) child of a <rule> specify what the rule matches. It can be a <ref>, <regexp>, <query>, <seq>, <first>, <best>, <and>, <lookup>, <repeat-until> or <backtrack>; these are called match elements. All match elements may have a rewrite attribute. They may also have the match-var, rewrite-var and suppress attributes described under variables and matching in context below. In xml-level grammars, <rule> elements and match elements may also have the wrap and attrs attributes described under wrapping the rewrite below.

<lexicon>

A <lexicon> element identifies a lexicon which can be referenced in the constraint queries of <regex> and <query> elements. It must have an href attribute giving the location of the lexicon, and a name attribute which is the name of the function which will be used to access it in queries. Lexicon locations can also be specified on the command line.

Lexicons are described below.

<regex>

<regex> is the most basic kind of match for character-level grammars. It must have a match attribute whose value is the regular expression to match. The regular expression syntax is that of XML Schemas, extended so that ^ matches the start of the input and $ matches the end. If the match and, optionally, constraint attributes are specified on a <rule> element in a character-level grammar, this is equivalent to a <regex> child with the same attributes. If the rewrite attribute is not specified, the rewrite is the same as the matched text.

The characters which have syntactic significance in regular expressions, and which consequently need to be escaped (by preceding them with \) in some contexts, are \ | . - ^ $ ? * + { } ( ) [ ]. In addition, the escapes \n and \t are used to represent linefeed and tab respectively. Several other escapes defined in the XML Schema specification are not yet implemented.

A <regex> element may have a constraint attribute, and <constraint> and <var> child elements. These are described under the <query> element below.

<query>

<query> is the most basic kind of match for xml-level grammars. It must have a match attribute whose value is the query to match. The query syntax is that of XPath match patterns. When an element matches a query, any immediately following non-element siblings are implicitly attached to it; this has the effect of preserving whitespace between elements in the output (unless the order of elements is changed, in which case there attached whitespace will be moved correspondingly). If the match and, optionally, constraint attributes are specified on a <rule> element in an xml-level grammar, this is equivalent to a <query> child with the same attributes. If the rewrite attribute is not specified, the rewrite is the same as the matched element(s).

Queries can have standard XPath-style predicates. For example, match="w[1]" will match the first <w> element, match="w[last()]" will match the last, and match="w[@type='noun']" will match a <w> element with a type attribute whose value is noun.

The XPath syntax is extended to include a ~ operator, which expects a string as its left operand and a regular expression (as a string) as its right operand. So to match a <w> element containing exactly the word is, you can use match="w[. ~ '^is$']". (For an exact match though it is more efficient simply to use match="w[. = 'is']"). Note that predicate order is important: match="w[. ~ '^is$'][1]" matches the first is, but match="w[1][. ~ '^is$']" matches the first word, if it is is.

<query> and <regex> elements may have <constraint> and <var> children. An attribute constraint="xpath-expression" is equivalent to a child element <constraint test="xpath-expression">. Any <constraint> and <var> children are evaluated in order.

A <constraint> child is the similar to the match attribute: its test attribute is an XPath expression whose value must be true for the rule to match. It is intended to be used to impose extra constraints on the match, in particular for looking up words in lexicons. The same effect could be obtained by using the query as a predicate in the match attribute; for example, the following are all equivalent:

<query match="w" constraint="lex()/cat='noun'"/>

<query match="w"> <constraint test="lex()/cat='noun'"/> </query>

<query match="w[lex()/cat='noun']"/>

When a constraint attribute is used in a character-level grammar (on a <regex> element), there is no XML element in the input for it to apply to. Instead, a dummy XML text node is constructed containing the text matched by the match element, and the constraint query is applied to that.

A <var> child binds a variable to the value of an XPath expression. The name attribute specifies the variable name, and the value attributes specifies the XPath expression whose value is assigned to the variable. See variables below for more details.

<lookup>

A <lookup> element combines the effect of a query and a lexicon lookup. The match attribute specifies a query to match words, and the lexicon attribute specifies a lexicon in which the word(s) must appear. The simplest form <lookup match="w" lexicon="lex"/> it is equivalent to a query with a constraint: <query match="w" constraint="lex()"/>.

The optional case attribute allows you to override the case-sensitivity specified in the lexicon. If it is present it must have the value yes (case-sensitive) or no (case-insensitive). If it is absent, the sensitivity specified by the lexicon is used.

If the optional phrase attribute has the value true, the rule will match the longest phrase (starting at the current point) that appears in the lexicon. The phrase is constructed by taking the text value of the matching elements and joining them with spaces. So if the lexicon has an entry for "Edinburgh University", the rule <lookup match="w" lexicon="lex" phrase="true"/> will match the input <w>Edinburgh</w> <w>University</w>.

Like <query> and <regex>, <lookup> elements may have <constraint> and <var> children. The context node for XPath expressions in these elements is the entry from the lexicon (not the word matched), so you can easily test the lexical category or store it in a variable, for example <var name="a" value="cat"/>. Note that if you have a case-sensitive lexicon containing two case-variants of a word, and you do a case-insensitive lookup, the first entry will be used as the context node.

For a phrase lookup, the longest phrase satisfying the constraints is found. This is a change: before, if the longest phrase didn't satisfy the constraints, the lookup failed even if a shorter phrase would have satisfied them.

<ref>

A <ref> element refers to another rule, and must have a name attribute specifying it. Only references to rules earlier in the rule file are allowed; this prevents circularities. A <ref> matches the input if the referenced rule matches. If the rewrite attribute is not specified, the rewrite is that of the referenced rule.

A <ref> element may have one or more <with-param> children specifying values to be passed to the referenced rules. These elements bind variables in the referenced rules, and take the same attributes as <var> elements. Note that because the value is specified as an XPath, to pass a constant string you need to put quotes around the string as well as around the attribute, resulting in something like <with-param name="cat" value="'noun'"/>

<repeat>

A <repeat> element contains a single match element, which it applies repeatedly. The repetition can be controlled by a mult attribute, with the value *, +, or ?, or more generally by min-matches and max-matches attributes. The default rewrite is the concatenation of all the rewrites of the child element.

It is usually simpler just to use a mult attribute on the element directly than to enclose it in a <repeat> element.

<repeat-until>

Like <ref>, a <repeat-until> element refers to another rule, called the repeated rule, and must have a name attribute specifying it. It also contains a match element specifying an end-condition. The repeated rule is matched until either the end condition matches or no more input matches the rule. The input matched by the end condition is not considered part of the match, and the default rewrite is the concatenation of all the rewrites of the repeated rule.

Note that the rule can succeed without the end condition matching if there is no more input matching the repeated rule. If you want to require the end condition to match, use a <seq> containing the <repeat-until> followed by another copy of the end condition.

By default, a <repeat-until> will succeed even if there are no matches of the repeated rule. The min-matches attribute can be used to specify a minimum number of matches (typically 1).

<backtrack>

A <backtrack> element contains a <ref>, whose mult must be * or +, and another element which is matched after the <ref>. If this fails to match, successively shorter sequences matching the <ref> are tried until it succeeds (or if it never does, the whole <backtrack> fails). It is therefore something like a backtracking <seq>.

<seq>

A <seq> element contains a number of match elements, and matches the input if the child elements match in order. If the rewrite attribute is not specified, the rewrite is the concatenation of all the child rewrites.

<and>

An <and> element contains a number of match elements, and matches the input if all of the child elements match (all at the current point). If the rewrite attribute is not specified, the rewrite is the rewrite of the last child. <all> is a synonym for <and>.

<not>

A <not> element contains a single match element, and matches the input if the child element does not match. A successful <not> consumes no input; its match and rewrite are always empty.

<first> and <best>

<first> and <best> elements contain a number of match elements, and matches the input if one of the child elements matches. In the case of <first> the match is the first child element to match; in the case of <best> it is the child element that matches the longest stretch of the input. If the rewrite attribute is not specified, the rewrite is the rewrite of the matching child. <or> is a synonym for <first>.

<start>, <end>, <not-start>, and <not-end>

These special match elements do not consume any of the input data, but succeed at the start, at the end, not at the start, and not at the end of the line or container element. They are typically used at the start or end of a <seq> to constrain it to match only (say) at the start of a sentence. No attributes are allowed on these elements, and their match and rewrite strings are always empty.

Repetition

All match elements, except the repetition elements and those that never consume input, can be have a mult attribute specifying the multiplicity with which the rule must match: one of *, +, and ?. These have their familiar meanings. When an element is repeated in this way, the default rewrite is the concatenation of all the rewrites.

Variables

Match elements may have a match-var attribute specifying a variable name that can be used to refer to the matched text and a rewrite-var attribute specifying a variable name that can be used to refer to the rewrite. <query> and <regex> elements may contain <var> elements that bind a variable to the value of an XPath expression. Variable names should consist of letters and numbers, beginning with a letter. Case is significant: A and a are different variables. An element's rewrite and attrs attributes may contain references to the variables bound in its descendant match elements. These references take the form $name or ${name}; the latter form is useful if the variable name is followed by a character that would otherwise be treated as part of the name. The values of variables used in a rewrite attribute in a character-level grammar, and those used in an attrs attribute, are converted to strings before use. Variables used in a rewrite attribute in an xml-level grammar must have node-list values.

Variables may also be referred to in XPath expressions so, for example, a constraint can depend on a variable bound earlier in a <seq>.

The special variable references $- and $+ can be used to refer to the complete matching string and the default rewrite respectively. These can't be used in XPaths because the XPath syntax doesn't allow them.

Matching in context

Left and right context for a <seq> match may be specified by use of the suppress attribute on the first and last children of the <seq>. The text matched by a child element with suppress="true" is not counted as part of the match of the parent <seq>, nor is its rewrite included in the default rewrite for the <seq>. Furthermore, if one or more elements at the end of a <seq> are suppressed, matching continues from before the suppressed matches.

Note that left context is consumed by the rule: it must not already have been consumed by another rule. Elements that have already been consumed can be tested with XPath expressions that use the preceding-sibling axis.

Wrapping the rewrite

In xml-level grammars, the rewrite may be wrapped in a new element before it is returned. The name of this element is given in the wrap attribute. Any attributes desired for this wrapper element are given in the attrs attribute, which has the form attrs="name1='value1' name2='value2' ...".

If the attrs attribute is given without a wrap attribute, the attributes are applied to the rewrite itself, which must contain only a single element. If it already has any attributes with the same names as those in the attrs attribute, they are replaced. An attribute may be removed by omitting the value (and its quotes: otherwise you get an attribute whose value is an empty string), for example attrs="name=".

Using lexicons

Lexicons are stored as XML files. The program lxconvlex can be used to translate old fsgmatch-style lexicons to the new format. The lexicon DTD is called lexicon.dtd. As with the rule file DTD, you can either put it in the directory containing your lexicons or use the XML catalog.

The top-level element of a lexicon is <lexicon>. It has an optional attribute case whose value must be either "yes" or "no". If it is "yes", word-matching will be case-sensitive unless over-ridden in a particular lexical entry. The default is "no".

The children of the <lexicon> element are <lex> elements. These have a required word attribute and an optional case attribute. If the case attribute is present, it over-rides the default specified on the <lexicon> element.

Lexicon entries are matched in order; the first entry matching a word (taking into account the specified case-sensitivity) applies. Lxtransduce gives a error if an entry can never match because it is a duplicate of an earlier one.

The content of a <lex> element is not interpreted by lxtransduce itself, but rather by the constraint queries of <regex> and <query> elements. Typically it is one or more <cat> elements, whose text content is the lexical category.

Consider the following example:

<lexicon name="dates" href="lexicons/dates.lex">

 <lex word="May" case="yes">
  <cat>month</cat>
  <cat>verb</cat>
 </lex>

 <lex word="may">
  <cat>verb</cat>
 </lex>

</lexicon>

The capitalized word "May" matches the first entry; uncapitalized it matches the second. In the first case it has both the lexical categories "month" and "verb", in the second only "verb".

Input may be matched against the lexicon either by using a <lookup> rule or by using the XPath function whose name is that of the lexicon, in this case "dates". It takes an argument whose string-value is looked up in the lexicon, and returns the matching <lex> element. The argument may be omitted, in which case the current node is used. The children of the returned value may be examined in the query in the usual way. A typical xml-level example is <query match="w" constraint="dates()/cat='month'"/> which matches w elements whose content appears in the dates lexicon with category "month".

In character-level grammars the constraint is applied to the text matched by the regular expression, so to match words whose category is "month" you might use <regex match="[a-z]+" constraint="dates()/cat='month'"/>.

Data from a lexicon can be incorporated in the rewrite of a rule by using variables. For example, if lex has been declared as a lexicon, <var name="lexcat" value="string(lex()/cat)"/> will bind the variable lexcat to the first lexical category of the word matched by the containing <query>. The function join() (an extension to XPath) is useful in this context. Its first argument is a node-set, and its second is a string to use as a separator. It returns the string-values of the nodes in the node-set joined by the separator. So if a word has lexical categories "noun" and "verb", <var name="lexcat" value="join(lex()/cat, ',')"/> will bind lexcat to the string "noun,verb".

Adding XML markup in character-level grammars

As an early stage of processing, it is common to use a character-level grammar to identify words and wrap them in individual elements. Just using < and > to create XML elements doesn't work when the output is XML because the serializer will escape them as < and >. Lxtransduce provides three "magic entities" that are handled specially by the serializer:

&xlt; is output as <
&xgt; is output as >
&xamp; is output as &

This facility should be used with care as it allows ill-formed XML to be output.

Rule file examples

This rule file translates some American English words to British English:

<!DOCTYPE rules SYSTEM "lxtransduce.dtd">

<rules apply="all">

 <rule name="col" rewrite="colour">
  <regex match="color"/>
 </rule>

 <rule name="egg" rewrite="aubergine">
  <regex match="eggplant"/>
 </rule>

 <rule name="all">
  <first>
   <ref name="col"/>
   <ref name="egg"/>
  </first>
 </rule>

</rules>

The simple regular expression rules could have been written more compactly:

 <rule name="col" match="color" rewrite="colour"/>

 <rule name="egg" match="eggplant" rewrite="aubergine"/>

Or they could have been written directly within the disjunction:

 <rule name="all">
  <first>
   <regex match="color" rewrite="colour"/>
   <regex match="eggplant" rewrite="aubergine"/>
  </first>
 </rule>

In this example of using right context, the word "on" and any following space are removed, but only if followed by a day of the week. So "I went on the bus on tuesday" will be replaced by "I went on the bus tuesday":

<!DOCTYPE rules SYSTEM "lxtransduce.dtd">
 
<rules apply="on">
 
 <rule name="day" match="(mon|tues|wednes|thurs|fri|satur|sun)day"/>
 
 <rule name="on" rewrite="">
  <seq>
   <regex match="on[ \n]+"/>
   <ref name="day" suppress="true"/>
  </seq>
 </rule>
 
</rules>

This example, which puts square brackets around upper-case words, illustrates some uses of variables. The rules "caps" and "token" match words containing only upper-case letters and all word-like tokens respectively. In each case the following white space is matched but the rewrite uses a variable to select only the word part. The reference to "caps" in the top-level rule rewrites upper-case words to have brackets around them. The top-level rule then adds a space after each rewritten word. (In practice it might have been simpler to use suppress="true" on the whitespace, so that we would not need to add it back at the end.):

<!DOCTYPE rules SYSTEM "lxtransduce.dtd">

<rules apply="all">

 <rule name="caps" rewrite="$A">
  <seq>
   <regex match-var="A" match="[A-Z]+"/>
   <regex match="[ \n]+"/>
  </seq>
 </rule>
 
 <rule name="token" rewrite="$A">
  <seq>
   <regex match-var="A" match="[^ \n]+"/>
   <regex match="[ \n]+"/>
  </seq>
 </rule>
 
 <rule name="all" rewrite="$+ ">
  <first>
   <ref name="caps" rewrite="[$+]"/>
   <ref name="token"/>
  </first>
 </rule>
 
</rules>

Running lxtransduce

To run lxtransduce on a plain text file mytext.txt with rule file myrules.gr type:

lxtransduce myrules.gr mytext.txt

To run it on an XML file, use the -q option to specify a query to select the elements whose content is used. These queries are "streamable" XPath match patterns: patterns restricted to refer only to the node being tested (including its attributes) and it ancestors. In particular, you can't use queries like s[1] or s[count(w) > 5] since these would require access to the preceding siblings and children of the node, and they are not available when the match is tested. Typically the query is just an element name:

lxtransduce -q para myrules.gr mytext.xml

If the input filename is omitted, lxtransduce reads from standard input:

plain2xml.perl mytext.txt | lxtransduce -q TEXT myrules.gr

Lexicons may be defined on the command line using the -l option. This is useful when the lexicon is dynamically generated and its name is not known until run-time:

lxtransduce -q para -l mylex=mylexicon.lex myrules.gr mytext.xml

There is an undocumented debugging option that allows rule invocations to be traced. If you think you need to use it, ask Richard.

Files under Solaris

The program lxtransduce is in /usr/contrib/bin.
The DTD is in /usr/contrib/share/lib/xml/LTXML2/lxtransduce.dtd.
To use the XML catalog for entity resolution set the environment variable XML_CATALOG_FILES to /usr/contrib/share/lib/xml/catalog.xml.

Files under DICE

The program lxtransduce is in /group/ltg/projects/lcontrib/bin.
The DTD is in /group/ltg/projects/lcontrib/share/lib/xml/LTXML2/lxtransduce.dtd.
To use the XML catalog for entity resolution set the environment variable XML_CATALOG_FILES to /group/ltg/projects/lcontrib/share/lib/xml/catalog.xml.