There are at least three reasons why separating markup from the material marked up ("standoff annotation") may be an attractive proposition:
In this paper we introduce two kinds of semantics for hyperlinks to facilitate this type of annotation, and describe the LT NSL toolset which supports these semantics.
The two kinds of hyperlink semantics which we describe are inclusion and inverse replacement.
We also address the issues of indexing large files to improve the speed of accessing SGML elements in the base files.
This is not a paper about browsers or rendering. It is a paper about novel uses of hyperlinks to address a range of problems in document management and corpus annotation. The canonical application environments for the kinds of markup I will discuss are information retrieval, message understanding, machine translation and text summarisation: in other words, language- and content-oriented applications.
This is not a paper about process architecture (that's another paper). I assume a pipelined architecture where individual tools operate on an SGML or XML document stream, augmenting, transforming and modifying it step by step. It is a paper about document architecture: how to use specialised link semantics to organise information across documents in a sensible and powerful way, while hiding that distribution from applications.
The crucial idea is that standoff annotation allows us to distribute aspects of virtual document structure, i.e. markup, across more than one actual document, while the pipelined architecture allows tools that need to work with the virtual document stream to do so.
This can be seen as taking the SGML paradigm one step further: whereas SGML allows single documents to be stored transparently in any desired number of entities, our proposal allows virtual documents to be composed transparently from any desired network of component documents.
Note that the examples which follow use a simplified version of the syntax for links proposed in the XML-LINK draft [1], which notates a span of elements with two TEI (Text Encoding Initiative) extended pointer expressions separated by two dots ('..').
Consider marking sentence structure in a read-only corpus of text which is already marked up with tags for words and punctuation, but nothing more:
With an inclusion semantics, I can mark sentences in a separate document as follows:
Now crucially (and our LT NSL and LT XML products [2] already implement this semantics), we want our application to see this document collection as a single stream with the words nested inside the sentences:
Note that the linking attribute is gone from the S start-tag, because its job has been done.
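As a rough illustration of inclusion semantics, the following Python sketch resolves sentence links against a word-level base document. It is not the LT NSL implementation: the sample documents, the `href` attribute name, the `#id(x)..id(y)` span notation (modelled loosely on the simplified link syntax mentioned above) and the function `resolve_inclusion` are all assumptions made for the example.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Hypothetical read-only base document: words and punctuation only.
BASE = """<text>
<w id="w1">The</w><w id="w2">cat</w><w id="w3">sat</w><c id="c1">.</c>
<w id="w4">It</w><w id="w5">purred</w><c id="c2">.</c>
</text>"""

# Hypothetical standoff document: each <s> points at a span of base elements.
STANDOFF = """<sentences>
<s id="s1" href="base.xml#id(w1)..id(c1)"/>
<s id="s2" href="base.xml#id(w4)..id(c2)"/>
</sentences>"""

# Assumed span notation: two id() pointers separated by two dots.
SPAN = re.compile(r"#id\((?P<start>[^)]+)\)\.\.id\((?P<end>[^)]+)\)$")


def resolve_inclusion(standoff_xml, base_xml):
    """Return the standoff tree with each linked span copied in as children."""
    base_items = list(ET.fromstring(base_xml))
    position = {el.get("id"): i for i, el in enumerate(base_items)}
    root = ET.fromstring(standoff_xml)
    # Snapshot the elements first, because the loop adds children to the tree.
    for link in list(root.iter()):
        href = link.get("href")
        if href is None:
            continue
        match = SPAN.search(href)
        if match is None:
            raise ValueError(f"unsupported link: {href!r}")
        first, last = position[match["start"]], position[match["end"]]
        if first > last:
            raise ValueError(f"span runs backwards: {href!r}")
        # Copy the span so the base tree is never modified.
        for el in base_items[first:last + 1]:
            link.append(copy.deepcopy(el))
        # The linking attribute has done its job, so drop it.
        del link.attrib["href"]
    return root


merged = resolve_inclusion(STANDOFF, BASE)
print(ET.tostring(merged, encoding="unicode"))
```

An application reading `merged` sees each `s` element with its words and punctuation nested inside and no linking attribute, while the base document is left unchanged.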
We believe this simple approach will have a wide range of powerful applications. We are currently using it in the development of a shared research database, allowing the independent development of orthogonal markup by different sub-groups in the lab.
We use inverse replacement semantics for tasks such as correcting errors in read-only material. Suppose our previous example actually had:
If we interpret the following with inverse replacement semantics:
we mean "take everything from the base document except word 15, for which use my content". In other words, we can take this document and use it as the target for the references in the sentence example, and we get a composition of links that produces a stream of sentences containing the corrected word(s).
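Inverse replacement can be sketched in the same spirit. The Python below is only an illustration under stated assumptions: the misspelt base word, the `corrections` and `fix` element names, the `href` attribute, the `#id(x)` pointer notation and the function `resolve_inverse_replacement` are hypothetical and are not taken from LT NSL.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Hypothetical read-only base document containing an error at word 15.
BASE = '<text><w id="w14">over</w><w id="w15">teh</w><w id="w16">fence</w></text>'

# Hypothetical correction document: the link names the element to override,
# and the content of the linking element is what should be used instead.
PATCH = ('<corrections><fix href="base.xml#id(w15)">'
         '<w id="w15">the</w></fix></corrections>')

# Assumed pointer notation for a single target element.
TARGET = re.compile(r"#id\(([^)]+)\)$")


def resolve_inverse_replacement(patch_xml, base_xml):
    """Return the base stream, with linked-to elements replaced by patch content."""
    replacements = {}
    for fix in ET.fromstring(patch_xml).iter():
        href = fix.get("href")
        if href is None:
            continue
        match = TARGET.search(href)
        if match is None:
            raise ValueError(f"unsupported link: {href!r}")
        replacements[match.group(1)] = list(fix)

    base = ET.fromstring(base_xml)  # parsed copy; the base text is untouched
    result = ET.Element(base.tag, base.attrib)
    result.text = base.text
    for el in base:
        target_id = el.get("id")
        if target_id in replacements:
            # Use the patch content in place of the base element.
            result.extend([copy.deepcopy(r) for r in replacements[target_id]])
        else:
            # Everything else comes through from the base document as-is.
            result.append(el)
    return result


corrected = resolve_inverse_replacement(PATCH, BASE)
print(ET.tostring(corrected, encoding="unicode"))
```

Because the result is again an ordinary stream of word elements, it could serve as the target of sentence links, which is the composition described above.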
The paradigm of pipelined processes communicating via (normalised) SGML is very suitable where the desired processing only requires localised sequential access to the document. This is the case in many language-based algorithms. For example, for a message understanding application, the corpus can be treated as a sequence of messages, each of which can be read into memory (as an SGML document tree) and processed. However, sequential processing is less efficient in cases where random access to large documents is required. In such cases a database solution or separate index files are required.
The LT NSL system contains programs which create index files for SGML documents. These provide a random access mapping from a subset of the elements of the document (selectable by a query language) to a character offset and file name (since SGML documents can be distributed over several files). A separate program builds content-addressable indices by providing a flexible method of indexing SGML elements by their text contents [3]. Finally, we provide retrieval programs for both of these indexing schemes. These programs have been used by us and by a group at Sheffield University to index the large British National Corpus.
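The idea of an offset index can be illustrated with a small Python sketch. It is not the LT NSL indexing program or its file format: the functions `build_index` and `fetch`, the file names, and the use of a set of tag names as a stand-in for the query language are assumptions, and offsets are counted in bytes (which coincide with characters for ASCII input).

```python
import re
import tempfile
from pathlib import Path

# Start-tags carrying an id attribute; a regular expression is only a
# stand-in here for a real SGML parser.
START_TAG = re.compile(rb'<(\w+)[^>]*\bid="([^"]+)"[^>]*>')


def build_index(paths, wanted=("s",)):
    """Map element id -> (file name, byte offset) for the selected tag names."""
    index = {}
    for path in paths:
        data = Path(path).read_bytes()
        for match in START_TAG.finditer(data):
            if match.group(1).decode() in wanted:
                index[match.group(2).decode()] = (str(path), match.start())
    return index


def fetch(index, element_id, window=4096):
    """Seek straight to an indexed element and return its text.

    Assumes the element fits in `window` bytes and does not contain a
    nested element of the same name.
    """
    path, offset = index[element_id]
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read(window)
    tag = re.match(rb"<(\w+)", chunk).group(1)
    close = b"</" + tag + b">"
    end = chunk.find(close)
    if end == -1:
        raise ValueError(f"end-tag for {element_id!r} not found in window")
    return chunk[:end + len(close)].decode()


if __name__ == "__main__":
    # One logical document distributed over two hypothetical files.
    with tempfile.TemporaryDirectory() as folder:
        part1 = Path(folder) / "part1.sgm"
        part2 = Path(folder) / "part2.sgm"
        part1.write_bytes(b'<text><s id="s1"><w id="w1">Hello</w></s>\n')
        part2.write_bytes(b'<s id="s2"><w id="w2">World</w></s></text>\n')
        index = build_index([part1, part2])
        print(fetch(index, "s2"))
```

The index is built in one sequential pass, after which any indexed element can be retrieved without reading the material that precedes it.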
At present these indexing programs are separate from the LT NSL application program interface and from the work on hyperlinking described above. It is clear that it would be very useful to develop a single abstraction covering both. Further work is continuing on discovering the most common patterns of usage for referring to SGML elements via a random access index. Two possible uses would be (a) to allow access to the text of a footnote or bibliographic item while processing a paragraph containing a reference to it; and (b) to allow the hyperlinks to refer to an index file, allowing random access to hyperlinked elements (the current implementation is much more efficient if the linked-to elements are not out of order in the target document).
The work described here was conducted in the Language Technology group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council. The work was initiated in the context of the EU-funded MULTEXT project, and is now being carried forward with support from the UK Science and Engineering Research Council.
[1] T. Bray and S. DeRose, eds, Extensible Markup Language (XML) Version 1.0, World Wide Web Consortium, 1997.
[3] A. Mikheev and D. McKelvie, Indexing SGML files using LT NSL, Technical Report, Language Technology Group, HCRC, University of Edinburgh, 1997.