LT XML: Enabling Access to Structured Documents
Henry S. Thompson

18 January 1999

1. What is XML?

XML stands for eXtensible Markup Language [1]. It is a simple, standard way to mark up the structure of documents, and is the responsibility of the W3C (the World Wide Web Consortium). It combines the simplicity and ease of use of HTML with the power and flexibility of SGML. Here is a simple example of what it looks like (we'll call this sample.xml further down):

<p id='p1'>
 <s id='s1'>
  <w pos='prep'>In</w><w pos='art'>the</w><w pos='n'>beginning</w>
  <w pos='v'>was</w><w pos='art'>the</w><w pos='n'>word</w><c>.</c>
 </s>
</p>

XML itself determines the syntax which distinguishes markup from the text marked up: the angle brackets, equals signs and quotes in the example above. The document designer determines the markup vocabulary: the element types (p, s, w and c in the example), the attribute types (id and pos) and their grammar, e.g. the fact that a p may contain a number of s elements.
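To make the split between syntax and vocabulary concrete, here is a minimal sketch that lets a generic parser handle the syntax and then reads the vocabulary off the parsed document. It uses Python's standard xml.etree.ElementTree module as a stand-in parser (it is not LT XML), and the function name vocabulary is ours, chosen for illustration. The sample is embedded as a string, with its s and p elements closed, so that the block runs on its own.

```python
import xml.etree.ElementTree as ET

# The sample document from the text, embedded so the sketch is self-contained.
SAMPLE = """<p id='p1'>
 <s id='s1'>
  <w pos='prep'>In</w><w pos='art'>the</w><w pos='n'>beginning</w>
  <w pos='v'>was</w><w pos='art'>the</w><w pos='n'>word</w><c>.</c>
 </s>
</p>"""


def vocabulary(xml_text):
    """Return (element types, attribute names), each in order of first appearance.

    Hypothetical helper: the parser deals with angle brackets, equals signs and
    quotes; what is left is the designer's vocabulary.
    """
    root = ET.fromstring(xml_text)
    element_types, attribute_names = [], []
    for elem in root.iter():  # document order, starting at the root
        if elem.tag not in element_types:
            element_types.append(elem.tag)
        for name in elem.attrib:
            if name not in attribute_names:
                attribute_names.append(name)
    return element_types, attribute_names


if __name__ == "__main__":
    types, attrs = vocabulary(SAMPLE)
    print("element types:", types)
    print("attribute names:", attrs)
```

Nothing in the function mentions p, s, w or c: the same code would list the vocabulary of any well-formed XML document, which is the point of separating the fixed syntax from the designer's markup.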

The tremendous advantage of XML lies in its very simple low-level syntax, which makes it possible to write very fast and light-weight XML parsers (see [2] for pointers to a number of them). Because XML itself provides a mechanism for specifying the markup grammar of a document or family of documents, a single XML parser can be used for many different types of document without modification.

For language resources, this is a great step forward, as it means an end to the all-too-common necessity of writing yet another parser each time you get a new resource to work with. Already major providers of language resources such as the LDC ([8]) and ELRA ([9]) are delivering resources marked up using XML.

2. What is LT XML?

LT XML ([3]), developed by the Language Technology Group ([4]) of the Human Communication Research Centre ([5]) at the University of Edinburgh, is an XML parser with a flexible API, together with a large collection of pre-built tools for processing XML-marked-up material. LT XML is free for non-commercial use, and is available in both source (for UN*X, WIN32 and Macintosh) and binary (for WIN32) distributions. Over 3000 licenses have been issued to a wide range of institutions, both in Europe and further afield.

LT XML's pre-built command-line tools include the following:

textonly
    Extracts the text content and adds separators:
    cmd> textonly -s ' ' < sample.xml
    In the beginning was the word .

sgcount
    Tabulates element type usage:
    cmd> sgcount < sample.xml
    p 1 s 1 w 6 c 1

sggrep
    Provides powerful search and filtering:
    cmd> sggrep -q '.*/w[pos="n"]' sample.xml
    <w pos='n'>beginning</w> <w pos='n'>word</w>

sgrpg
    Combines complex searching with reformatting. (For sophisticated use, a control file, itself written in XML, is required. The example below illustrates the restricted subset of functionality available from the command line.)
    cmd> sgrpg -q '.*/w' -f '%s/%s ' '<CDATA>' 'pos' < sample.xml
    In/prep the/art beginning/n was/v the/art word/n

sgsort
    Sorts sub-elements by content.

sgmltrans
    Production-rule-style reformatting (comes with a sample trivial XML-to-LaTeX downtranslator).
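For readers without LT XML to hand, the following sketch reproduces the results of the first four examples on the sample document. It is only an illustration written with Python's standard library: the function names are hypothetical, none of this is how the LT XML tools are implemented, and it makes no attempt to support their query language or options.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample document, embedded (with s and p closed) so the sketch runs alone.
SAMPLE = """<p id='p1'>
 <s id='s1'>
  <w pos='prep'>In</w><w pos='art'>the</w><w pos='n'>beginning</w>
  <w pos='v'>was</w><w pos='art'>the</w><w pos='n'>word</w><c>.</c>
 </s>
</p>"""


def text_only(root, sep=" "):
    """Like the textonly example: join the non-blank text of each element.

    Tail text is ignored; in this sample it is whitespace only.
    """
    pieces = [e.text.strip() for e in root.iter() if e.text and e.text.strip()]
    return sep.join(pieces)


def count_types(root):
    """Like the sgcount example: tabulate how often each element type occurs."""
    return Counter(e.tag for e in root.iter())


def words_with_pos(root, pos):
    """Like the sggrep example: the content of every w whose pos matches."""
    return [w.text for w in root.iter("w") if w.get("pos") == pos]


def word_slash_pos(root):
    """Like the sgrpg example: print each w as content/pos."""
    return " ".join(f"{w.text}/{w.get('pos')}" for w in root.iter("w"))


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(text_only(root))
    for tag, n in count_types(root).items():
        print(tag, n)
    print(words_with_pos(root, "n"))
    print(word_slash_pos(root))
```

Each function is a few lines because the parser has already done the hard part, which is the argument of section 1 in miniature.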

Each of these tools is reasonably powerful in its own right, but a crucial property of the LT XML architecture, made possible by the fact that XML documents can carry their own structure definitions with them, is that pipelines of tools can be composed for complex tasks.
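The pipeline idea can be sketched as a chain of stages, each of which reads an XML document and writes an XML document, so that any stage can be added, removed or reordered without touching the others. The sketch below is an assumption-laden illustration in Python's standard library, not LT XML itself: the two stages (number_words and drop_punctuation) and the run_pipeline helper are hypothetical names invented here.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# The sample document, embedded (with s and p closed) so the sketch runs alone.
SAMPLE = """<p id='p1'>
 <s id='s1'>
  <w pos='prep'>In</w><w pos='art'>the</w><w pos='n'>beginning</w>
  <w pos='v'>was</w><w pos='art'>the</w><w pos='n'>word</w><c>.</c>
 </s>
</p>"""


def number_words(xml_text):
    """Hypothetical stage: give every w element an id (w1, w2, ...)."""
    root = ET.fromstring(xml_text)
    for i, w in enumerate(root.iter("w"), start=1):
        w.set("id", f"w{i}")
    return ET.tostring(root, encoding="unicode")


def drop_punctuation(xml_text):
    """Hypothetical stage: remove every c element from its parent."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):  # snapshot, since the tree is modified
        for c in parent.findall("c"):
            parent.remove(c)
    return ET.tostring(root, encoding="unicode")


def run_pipeline(xml_text, stages):
    """Feed the output of each stage to the next, like a shell pipe."""
    return reduce(lambda doc, stage: stage(doc), stages, xml_text)


if __name__ == "__main__":
    print(run_pipeline(SAMPLE, [number_words, drop_punctuation]))
```

Every stage re-parses its input, much as separate processes in a shell pipeline would. That costs some time, but it means a stage needs to know only about the markup it reads and writes, and markup added by earlier stages simply passes through.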

The pre-built LT XML tools are based on the LT XML API (Application Programming Interface). Users can write C programs against this interface to tackle more complex and sophisticated tasks. The API offers both a low-level (event-orientated) and a high-level (tree/sub-tree orientated) view of XML documents, and is based on RXP ([6]), a very fast XML parser.
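The difference between the two views is easiest to see side by side. The sketch below uses Python's standard xml.etree.ElementTree module purely as an analogy: it is neither the LT XML C API nor RXP, and the function names event_view and tree_view are ours. In the event-orientated view the program sees a stream of start and end notifications; in the tree-orientated view it is handed a whole (sub-)tree to navigate.

```python
import io
import xml.etree.ElementTree as ET

# The sample document, embedded (with s and p closed) so the sketch runs alone.
SAMPLE = """<p id='p1'>
 <s id='s1'>
  <w pos='prep'>In</w><w pos='art'>the</w><w pos='n'>beginning</w>
  <w pos='v'>was</w><w pos='art'>the</w><w pos='n'>word</w><c>.</c>
 </s>
</p>"""


def event_view(xml_bytes):
    """Event-orientated: collect (event, element type) pairs as they arrive."""
    events = []
    stream = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        events.append((event, elem.tag))
    return events


def tree_view(xml_text):
    """Tree-orientated: build the tree, then ask the s sub-tree for its nouns."""
    root = ET.fromstring(xml_text)
    sentence = root.find("s")
    return [w.text for w in sentence.findall("w[@pos='n']")]


if __name__ == "__main__":
    for event, tag in event_view(SAMPLE.encode("utf-8"))[:4]:
        print(event, tag)
    print(tree_view(SAMPLE))
```

The event view suits large documents and simple filters, since nothing forces the whole document into memory; the tree view is more convenient when a task needs to look around inside a sentence or paragraph.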

The power and flexibility of XML in general, and of the LT XML architecture in particular, were crucial to the LTG's ability to put together our first entry in the most recent DARPA Message Understanding Conference (MUC). The LTG's entry was for the named entity recognition task, and was composed of 17 stages in a pipeline ([7]). It came top of all the entries for that task, and generated a lot of interest in the US language technology community because of the ease of (re)configuration that the pipelined LT XML architecture provided.

3. Where is XML going?

The W3C is sponsoring further development of standards associated with XML, including XSL (eXtensible Stylesheet Language), XML-Link (support for inter-document linking) and XML Schema (extending XML's facilities for defining the structure of XML documents).

4. Where is LT XML going?

We're still expanding and extending LT XML. The next release will include support for validation, a Python-language interface to the API for rapid prototyping, and support for automatically generating graphical user interfaces for common annotation tasks.

5. Acknowledgments

LT XML grew out of work on the MULTEXT project, sponsored by the European Union through the LRE programme. More recent development has been funded by the UK Economic and Social Research Council via HCRC's core funding, by the UK Engineering and Physical Sciences Research Council via project NSCOPE, and by Sun Microsystems and Microsoft.

6. Useful Pointers