Name

ltxml2 — Overview of the LTXML2 toolkit

Overview

The LTXML2 toolkit is a set of programs for manipulating XML documents. It was developed by Richard Tobin in the Edinburgh University Language Technology Group (LTG). Though designed to support LTG's natural language processing, most of the programs are useful for general-purpose XML processing.

The programs are designed for use on the Unix command line, either singly or in pipelines, and in shell scripts. Most programs read a document named on the command line or provided on standard input, and write the result to standard output.

Many programs take arguments specifying elements or attributes to work on; these are referred to as queries and are expressed as XPaths. For more information on XPaths see the W3C XPath specification. The LTXML2 XPath implementation includes an extension for regular expression matching using the operator ~ (tilde), for example meeting[@date ~ '[Mm]onday'].
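
As a rough illustration of what the example query above selects, the following Python sketch mimics the predicate with the standard re module. It is not how LTXML2 implements the operator. The sample document and the helper name select_by_attribute_regex are invented for this illustration, and the sketch assumes the pattern may match anywhere in the attribute value (an unanchored search), which the text above does not confirm.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """<diary>
  <meeting date="Monday 3 March"/>
  <meeting date="tuesday 4 March"/>
  <meeting date="next monday"/>
</diary>"""


def select_by_attribute_regex(context, tag, attribute, pattern):
    """Return child elements named `tag` whose `attribute` matches `pattern`.

    Assumption: the match is unanchored (re.search). Whether the LTXML2
    `~` operator anchors its pattern is not stated in the overview.
    """
    regex = re.compile(pattern)
    selected = []
    for element in context.findall(tag):  # direct children, like the XPath step
        value = element.get(attribute)
        if value is not None and regex.search(value):
            selected.append(element)
    return selected


root = ET.fromstring(SAMPLE)
matches = select_by_attribute_regex(root, "meeting", "date", "[Mm]onday")
for element in matches:
    print(element.get("date"))
```

Under that assumption the sketch selects the first and third meeting elements and skips the one dated "tuesday 4 March".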

The programs are:

ltguessencoding

A program to determine the encoding of a file.

lxaddids

A program to add IDs to an XML document.

lxcount

A program to count elements in an XML document.

lxdiff

An XML version of the Unix "diff" program.

lxgrep

An XML version of the Unix "grep" program, using XPaths.

lxuniq

An XML version of the Unix "uniq" program.

lxplain2xml

A program to convert a text file to XML by wrapping it in an element.

lxprintf

A program to format text extracted from an XML document.

lxsort

An XML version of the Unix "sort" program.

lxviewport

A program to run another program on selected parts of an XML document.

lxinclude

An XInclude processor.

lxconvert

A program to convert data between XML and other formats, using XSLT-style templates.

lxreplace

A program to make replacements or deletions in an XML document.

lxt

An XSLT 1.0 processor.

lxtransduce

An XML transducer.

Streaming processing

Several of the tools are designed so that they can process documents too large to be held in memory. They do this by streaming, which means processing the document sequentially as it is read. Often a document is read until some relevant element is encountered, and the element is then processed.

A consequence is that XPaths identifying such elements cannot refer to the siblings or descendants of the element, though they can refer to its attributes and ancestors. In some cases XPaths that operate on the chosen element can refer to its descendants, since the subtree has been read by then; in other cases they are processed without reading the subtree. Queries affected by streaming are noted in the manual pages.
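
The restriction can be illustrated with a minimal Python sketch of the same idea, using the standard library's incremental parser instead of LTXML2 itself. The sample document and the function name stream_elements are invented for this illustration. When an element's start tag is seen, only its attributes and the currently open ancestors are used to select it; its content is used only once its end tag has been read.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = (
    b'<log><day name="Mon">'
    b'<entry id="1"><text>first</text></entry>'
    b'<entry id="2"><text>second</text></entry>'
    b"</day></log>"
)


def stream_elements(source, tag):
    """Yield events for elements named `tag` while reading `source` once."""
    open_tags = []  # names of the ancestors that are currently open
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if element.tag == tag:
                # Selection step: rely only on attributes and ancestors,
                # because the subtree may not have been read yet.
                yield ("selected", element.get("id"), "/".join(open_tags))
            open_tags.append(element.tag)
        else:
            open_tags.pop()
            if element.tag == tag:
                # Processing step: the whole subtree has now been read.
                yield ("processed", element.get("id"), "".join(element.itertext()))
                element.clear()  # release the subtree to keep memory use low


events = list(stream_elements(io.BytesIO(SAMPLE), "entry"))
for item in events:
    print(item)
```

Clearing each element after it has been processed is what keeps memory use bounded in this sketch; the LTXML2 tools may manage memory differently.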

XML Catalogs

RXP, the XML parser used in LTXML2, supports XML Catalogs and uses the XML_CATALOG_FILES environment variable to locate catalogs. If set, it should be a colon-separated list of catalog files.
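
The expected format of the variable can be shown with a short Python sketch that splits a value into its catalog files. It does not reproduce RXP's own parsing. The function name catalog_files and the paths are hypothetical, and dropping empty entries (for example from a trailing colon) is an assumption of this sketch.

```python
import os


def catalog_files(environ=os.environ):
    """Return the catalog files listed in XML_CATALOG_FILES, in order.

    Assumption: empty entries are ignored. How RXP treats them is not
    stated in the overview.
    """
    value = environ.get("XML_CATALOG_FILES", "")
    return [path for path in value.split(":") if path]


# Hypothetical paths, used only to show the colon-separated format.
example = {"XML_CATALOG_FILES": "/etc/xml/catalog:/home/user/catalog.xml"}
print(catalog_files(example))
```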