Corpus Creation for Data-Intensive Linguistics
Henry S. Thompson
Language Technology Group
University of Edinburgh

Tuesday 24 February 1998

1. Introduction

The success of the whole data-intensive or corpus-based approach to (computational) linguistics research rests, not surprisingly, on the quality of the corpora themselves. This chapter addresses this issue at its foundation, at the level of methodology and procedures for the collection, preparation and distribution of corpora.

This chapter starts from the assumption that corpus design and component selection are separate issues, addressed elsewhere. It also considers textual material only---spoken material raises many altogether different issues.

Where possible, issues will be addressed in general terms, but it is in the nature of the enterprise that the majority of the problems and difficulties lie in the details. Accordingly the approach adopted here is an example-guided one, drawing on experience in the production of several large multilingual CD-ROM-based corpora.

A note on terminology: When writing on this subject, one is immediately confronted with the necessity of choosing a word or phrase to refer to text stored in some digital medium, as opposed to text in general or printed or written text. I'm unhappy with all the phrases I can think of, and so will use "eText" throughout for this purpose.

2. Collection

Material suitable for the creation of corpora doesn't grow on trees, and even if it did, the orchard owner would have a say in what you could do with it. It is all too easy to embark on a corpus-based research or development effort without allocating anywhere near enough resources to simply getting your hands on the raw material, with the legal right to use it for your intended purposes. This section covers a number of the issues which may arise in this phase of a corpus creation project, both technical and legal.

2.1. Finding the data

There is a lot of eText out there, more every day, but actually locating eText appropriate to your needs can be quite difficult. Suppose that your intended application, your corpus design or simply your inclination has identified financial newspaper data, or legal opinions, or popular fiction, as your target text type. How do you go about locating suitable sources of eText of this type?

It's best to tackle this from two directions at once. On the one hand, virtually any text type has some kind of professional association connected to its production: the association of newspaper editors, or legal publishers, or magazine publishers. In this day and age, such a group in turn almost certainly has a sub-group or a working party concerned with eText (although they are unlikely to call it that). The association concerned will almost certainly be willing to make the membership of such a group available on request, provided it is clear the motivation involves research and is not part of a marketing exercise.

The key to success once you have a list of possible contacts is personal contact. Letters and telephone calls are not the right medium through which to get what you want. In most cases the people involved in such activities will respond positively to a polite request for a face-to-face meeting, if you make it clear you are interested in exploring with them the possibility of making use of their eText for research purposes. A letter, followed up by a telephone call, requesting a meeting, has always worked in my experience. The simple fact of your willingness to travel to their establishment, together with the implied compliment your interest represents, appears to be sufficient to gain initial access.

The other way in is through your own professional contacts. The Internet is a wonderful resource, and your colleagues out there are in general a friendly and helpful group. A message or news item directed to the appropriate fora (the linguist and corpora mailing lists and the comp.ai.nat-lang newsgroup) asking for pointers to existing corpora and/or known sources of eText of the appropriate text type may well yield fruit.

By far the best bet is to combine both approaches. If you can get a personal introduction from a colleague to a member of an appropriate professional association, you have a much greater chance of success. We got off to a flying start in one corpus collection project by being introduced to a newspaper editor by a colleague, and then using his agreement to participate as a form of leverage when we went on to visit other editors to whom we did not have an introduction.

So what do you say once you're in someone's office, you've introduced yourself, and you actually have to start selling? Here's a checklist that has worked for me:

  1. Enthusiastic description of the science
  2. No commercial impact
  3. Respect for copyright
  4. Visibility
  5. Technology payback
  6. The bandwagon effect

The first of these is easy to forget, but my experience underlines its importance. Usually people, even quite senior people, are flattered and excited by the thought that their eText has a potential part to play in a respectable scientific enterprise. Don't be afraid of this naive enthusiasm---it does no harm to encourage people's speculative ideas about your project, however uninformed they may be. Indeed sometimes you'll be surprised and learn something useful.

But once this phase of the conversation is slowing down, you have to move quickly to reassure them that this scientific contribution won't impact on their business. The line to take here is "Form, not content", i.e., we want your data because it's (a particular type of) English (or French or Thai), not because we want to exploit its content. An offer to use older material, or every other [pick an appropriate unit: issue, article, entry, . . .], can help to make this point clear.

Closely related to the previous point, you cannot emphasise too early your commitment to respect and enforce (insofar as you can) their copyright of the material involved. You cannot honourably negotiate with major eText owners unless you are prepared to take copyright seriously, and you will need to convince them that this is the case. There is no quicker way to find yourself headed for the door than to suggest that once text is on the Internet, no-one owns it anymore! If you are genuinely of that opinion yourself, you should think twice before getting into the corpus business at all, or at any rate look exclusively to materials already clearly in the public domain.

There are also some concrete benefits to the owner of the eText from contributing, which you should point out. The story of the Wall Street Journal is often useful here: for many years after the publication of the ACL DCI CD-ROM, the Wall Street Journal was English for computational purposes, establishing the brand name in a pre-eminent position, for free, in a situation with huge potential downstream impact. You don't want to over-promise the potential impact of your work, but it's perfectly reasonable to say something to the effect that if they give you the data, and your work is successful, their name will be at the heart of an area of technology development (financial management systems, medical informatics, etc.) which may in turn develop into a market opportunity for them.

This leads on to the next concrete payback. Some domains in general, and/or some potential providers in particular, are just on the verge of recognising their eText as a real business resource. An offer not only to hand back the normalised data at the end of the day, but also to make the normalising process available, may in such circumstances be quite appealing.

Finally, if this is not the first visit you have made, letting them know that you already have agreement from so-and-so can be a good idea. It's a judgement call whether to do so if so-and-so is a business competitor, but if they are from another country or work in another language it is almost certainly worth mentioning.

2.2. Legally acquiring the data

Once you've convinced the owners of some eText that providing you with data is something they'd like to do, you still have some hurdles to jump before you can start negotiating with the technical people. With luck, you shouldn't have to pay for data, but if you do, a rough guideline, based on what I'm aware has been paid in the past, is around $100 per million words of raw data, in other words not very much.

Whether you pay for it or not, you need more than just informal agreement to give you the data---you need a licence which allows you to do what you intend to do with it. This is where the point above about respect for copyright comes home. You either have to undertake not to distribute the material at all, that is to be sure it never gets beyond your own laboratory, or, preferably, you have to negotiate a licence which allows further distribution. To get this, you will almost certainly have to agree the terms of the licence agreement you will require from anyone to whom you redistribute the data. You will also need to agree just what uses you (or other eventual recipients) can make of the material. Someone giving you data for free is entitled to restrict you to non-commercial uses, but you should be careful to get the right to modify the material, since uptranslation of character set and markup and addition of further markup is likely to be on your agenda.

The best thing to do if you are seriously intending to distribute corpus materials is to look at some sample licence agreements from one of the big corpus providers such as the LDC or ELRA. Indeed if you are intending any form of publication for your corpus, you may find it to your advantage to involve one of those non-profit establishments in the negotiations, so that the licence is actually between them and the supplier.

2.3. Physically acquiring the data

Not least because what you are likely to get for free is not the most recent, the physical medium on which eText is delivered to you may not be under your control. If it is from archives of some sort, it may only be available in one physical form, and even if it is online in some sense, technical staff may be unwilling (especially if you are getting the data for free) or unable to deliver it in a form you would prefer. Mark Liberman described the first delivery of material for the ACL Data Collection Initiative as "Someone opened the back door to the warehouse, grunted, and tossed a cardboard box of nine-track mag tapes out into the parking lot."

If you're lucky, you'll get the material in a form compatible with your local computer system. If not, your first stop is of course whatever form of central computing support your institution or organisation can provide. Absent this, or if they are unable to help, there are commercial firms which specialise in this sort of thing, at a price. Searching for "data conversion" with a WWW search engine is a reasonable starting point if you are forced down this road.

3. Preparation

In the wonderful world of the future, all electronic documents will be encoded using Unicode and tagged with SGML, but until that unlikely state of bliss is achieved, once you've got your hands on the raw material for your corpus the work has only just begun---indeed you won't know for a while yet just how much work there is. For the foreseeable future, particularly because of the rate at which technology penetrates various commercial, governmental and industrial sectors, as well as issues of commercial value and timeliness, much of the material available for corpus creation will be encoded and structured in out-of-date, obscure and/or proprietary ways. This section covers the kind of detective work which may be involved in discovering what the facts are about your material in its raw state, the desired outcome, and plausible routes for getting from one to the other. The discussion is separated into two largely independent parts: character sets and encodings, and document structure.

3.1. Character set issues

What do you want to end up with, what do you have to start with, and how do you get from one to the other? Not as simple questions as you might think (or hope).

3.1.1. Choosing a delivery character set and encoding

Without getting into the religious wars that lurk in this area, let us say that characters are abstract objects corresponding to some (set of) graphical objects called glyphs, and that a character set consists inter alia of a mapping from a subsequence of the integers (called code-points) to a set of characters. An encoding is in turn a mapping from a computer-representable byte- or word-stream to a sequence of code-points. ASCII, UNICODE (ISO 10646), JIS and ISO-Latin-1 (ISO-8859-1) are examples of character sets. UTF-8 is an encoding from 8-bit byte-streams to UNICODE code-points, coinciding with ASCII over the first 128 of them. Shift-JIS is an encoding from 8-bit byte-streams to JIS. The best place for a detailed introduction to all these issues is ???

Need a reference here, and to check whether I've been hopelessly wrong about [shift-]JIS.

So in which of these should you deliver your corpus? There is a lot to be said for international (ISO) standards, but if all your data and your users are Japanese, or Arabic, then national/regional de-facto standards may make sense, even if they limit your potential clientele. As an over-broad generalisation: use ASCII or regional standards if you are concerned with a large existing customer base with conservative habits; use the identity encoding of ISO-Latin-n if your material will fit and your customer base is relatively up-to-date; but for preference use UTF-8-encoded UNICODE. This will ensure the widest possible audience and the longest lifetime, at the expense of some short-term frustration as tool distribution catches up. Note that you pay no storage penalty for using UTF-8 for files which contain only ASCII code-points.
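
To make the character/encoding distinction concrete, here is a small Perl fragment (a sketch, not part of any of the tools discussed in this chapter) which computes by hand the byte sequences representing the single code-point U+00FC, latin small letter u with diaeresis, under three encodings. An unaccented ASCII letter would be represented by the same single byte under all three, which is why UTF-8 costs nothing extra for purely ASCII files.

use strict;
# UTF-8 represents code-points between 0x80 and 0x7ff in two bytes,
# 110xxxxx 10xxxxxx, filled with the eleven low-order bits of the code-point.
my $cp = 0xfc;                       # U+00FC, latin small letter u with diaeresis
my $b1 = 0xc0 | ($cp >> 6);          # 0xc3
my $b2 = 0x80 | ($cp & 0x3f);        # 0xbc
printf "UTF-8      %02x %02x\n", $b1, $b2;
printf "ISO-8859-1 %02x\n", $cp;     # Latin-1 code-points are their own bytes
printf "CP437      81\n";            # the same character at a quite different point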

3.1.2. Recoding

The hardest thing about dealing with large amounts of arbitrary non-SGML eText is that you can't trust anything anyone tells you about it. This applies particularly at the level of character sets and encoding. My experience suggests that the only safe thing to do is to construct a character histogram for the entire collection as the very first step in processing. Note that if you're not sure about whether an 8- or 16-bit encoding has been used, some preliminary detective work with a byte-level viewer is probably required. For substantial amounts of data, it's worth writing a small C program to build the histogram---alas I am not aware of a public-domain source for one. Once you've got that, you need to look very carefully at the results. To understand what you're looking at, you'll also need a good set of character set and encoding tabulations. The most comprehensive tabulation of 8-bit encodings I know of is RFC 1345, which can be found on the World Wide Web at a number of places: use your favourite search engine. For Chinese, Japanese and Korean character sets and encodings, the work of Ken Lunde is invaluable---see the references in the Resources section below.
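
For moderate amounts of data a scripting language will do perfectly well. The following Perl sketch reads a byte-stream on standard input and prints one line per code-point seen, in roughly the layout of the tabulation below; for really large collections a small C program along the same lines will be faster.

#!/usr/bin/perl -w
use strict;
binmode STDIN;
my @count = (0) x 256;
my $buf;
while (read(STDIN, $buf, 65536)) {
    $count[$_]++ for unpack("C*", $buf);
}
for my $cp (0 .. 255) {
    next unless $count[$cp];
    # show a glyph for printable ASCII code-points, '.' otherwise
    my $glyph = ($cp >= 0x21 && $cp <= 0x7e) ? chr($cp) : ".";
    printf "%s \\%03o %02x %d\n", $glyph, $cp, $cp, $count[$cp];
}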

Should we include a few common code pages at the end, borrowed from TEI P1?

A preliminary indication of the character set and encoding will usually be given by its origin. The first thing to do is to confirm that you're in the right ball-park by checking the code-points for the core alphabetic characters. I usually work with a multi-column tabulation, illustrated below, showing: ASCII/ISO-8859-1 glyph or name (since that's what is most likely to appear on your screen, faute de mieux); code-point in octal and hex; count; RFC 1345 code and name in the currently-hypothesised character-set; code-point in the target character-set/encoding if translation is required; and a comment.

The example below shows extracts from a tabulation for a German eText which appears to be using something close to IBM Code Page 437. Note the difference between code points 5d and 81: the former is recoded to a point other than the one the hypothesised encoding would suggest, whereas the latter, as indicated by the comment, is recoded consistently with that hypothesis.


                              Hypothesised
chr  oct   hx  count    RFC   name                         recode  comment
                        1345

^I   \011  09  26062    HT    character tabulation (ht)            pageref if
                                                                   before number
!    \041  21  6283     !     exclamation mark
"    \042  22  25       "     quotation mark               ?
#    \043  23  0        Nb    number sign
Y    \131  59  1136     Y     latin capital letter y
Z    \132  5a  51403    Z     latin capital letter z
[    \133  5b  30       <(    left square bracket          Ä
\    \134  5c  14       //    reverse solidus              Ö
]    \135  5d  79       )>    right square bracket         Ü
^    \136  5e  0        '>    circumflex accent
x    \201  81  257074   u:    latin small letter
                              u with diaeresis             ü       yes

The only way to deal with material about which you have even the slightest doubt is to examine every row of the histogram for things out of the ordinary. Basically any count which stands out on either the high or the low side needs to be investigated. For instance in the above example the count for 09 is suspiciously high, and indeed turns out to be playing a special role, whereas the counts for 22 and 5b-5d are suspiciously low, so they in turn were examined and context used to identify their recodings.

Note that it is not always necessary to be (or have access to someone who is) familiar with the language in question to do this work. In the example above, the suspicious code-points were not the only ones used for the target character. For instance, not only was 5b used for latin-capital-a-with-diaeresis, but also 8e, the correct code-point for this character in the IBM 437 enconding. This was detected by finding examples of the use of the 5b code-point (using a simple search tool such as grep), and then looking for the result with some other character in the crucial place. For instance if we search for 5b (left square bracket) in the source eText, we find inter alia the word '[gide'. If we look for words of the form '.gide', where the full-stop matches any character, we find in addition to '[gide' also '\216gide', and no others. When '[ra' and '[nderungsantrag' yield the same pattern, it is clear, even without knowing any German, that we've hit the right recoding, and a quick check with a German dictionary confirms this, given octal 216 is hex 8e.

Even if you do have access to someone who knows the language, this kind of preliminary hypothesis testing is a good idea, because not only will it confirm that the problem code-point is at least being deployed consistently, but it will also reduce the demands you make on those you ask for help.

Once you've identified the mechanical recoding you need to do, how you go about doing it depends on the consistency of the recoding. If your eText makes consistent and correct use of only one encoding, then the recode software from the Free Software Foundation can be used to translate efficiently between a wide range of encodings. If, as in the example above, a mixture of encodings has been used, then you'll have to write a program to do the job. Almost any scripting language will suffice to do the work if your dataset is small, but if you're working with megabytes of eText you probably will want to use C. If you do have, or have access to, C expertise, recode is a good place to start, as you can use it to produce the declaration of a mapping array for the dominant encoding to get you started. I also use recode to produce the listing of names and code-points for hypothesised source encodings, as illustrated above.
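
For the mixed case the program need not be complicated. The following Perl sketch starts from the identity mapping, overrides the code-points identified during the investigation (the overrides shown follow the tabulation above, with ISO-8859-1 as the target; substitute whatever your own investigation turns up), and copies standard input to standard output. A C version would follow exactly the same pattern, starting from a mapping array produced by recode for the dominant encoding.

#!/usr/bin/perl -w
use strict;
my @map = (0 .. 255);       # identity by default; in practice start from a full
                            # table for the dominant encoding (here CP437)
$map[0x81] = 0xfc;          # u with diaeresis: consistent with the CP437 hypothesis
$map[0x8e] = 0xc4;          # A with diaeresis: likewise
$map[0x5b] = 0xc4;          # '[' also used for A with diaeresis in this eText
$map[0x5c] = 0xd6;          # '\' used for O with diaeresis
$map[0x5d] = 0xdc;          # ']' used for U with diaeresis
binmode STDIN;
binmode STDOUT;
my $buf;
while (read(STDIN, $buf, 65536)) {
    print pack("C*", map { $map[$_] } unpack("C*", $buf));
}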

3.1.3. Word-boundary issues

Once you've done the basic recoding, you still may have another tedious task ahead of you, if line-breaks have been introduced into your eText in a non-reversible way. Various code-points may be functioning as some sort of cross between end-of-line and soft hyphen, but you can't assume that deleting them will always be the right thing to do. One trick which may be helpful is to use the text itself to produce a wordlist for the language involved. You can then use that as a filter to help you judge what to do with unclear cases.

For example, if preliminary investigations suggest that 'X' is such a character, and you find the sequence " abcXdef " in your eText, if you have 'abcdef' as a word elsewhere in the text, but no occurrences of either 'abc' or 'def', then you can pretty confidently elide the 'X' in this case.
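
A sketch of such a filter in Perl follows; 'X' stands for the suspect code-point, as above, and the word pattern is deliberately naive. Anything the wordlist test cannot settle is simply listed for inspection.

#!/usr/bin/perl -w
use strict;
my $suspect = "X";                       # substitute the real code-point
my (%words, @broken);
while (<>) {                             # if the suspect code-point is itself a
                                         # line-break, read larger chunks instead
    push @broken, [$1, $2] while /(\w+)\Q$suspect\E(\w+)/g;
    $words{$_}++ for grep { !/\Q$suspect\E/ } /(\w+)/g;
}
for my $b (@broken) {
    my ($left, $right) = @$b;
    my $joined = $left . $right;
    if ($words{$joined} && !$words{$left} && !$words{$right}) {
        print "elide:   $left$suspect$right -> $joined\n";
    } else {
        print "inspect: $left$suspect$right\n";    # leave for checking by hand
    }
}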

As in the simpler recoding cases, you may well be left with a residuum of unclear cases for investigation by hand and/or referral to a native speaker, but doing as much as you can automatically is obviously to everyone's advantage when dealing with eText of any size.

Finally, you must take care to construct a histogram of your recoded eText as a check that you haven't slipped up anywhere. Constructing a word list can be a useful sanity check at this point also. You should view it sorted both by size (bugs often show up as very short or very long 'words') and by frequency and size (all frequent words are worth a look, and frequent long words or infrequent short words are ipso facto suspicious).
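
The word list itself is a few lines of Perl; the sketch below prints it sorted by length and then by descending frequency, which puts most of the interesting cases near the top and bottom of the output.

#!/usr/bin/perl -w
use strict;
my %freq;
while (<>) {
    # \w is ASCII-only here; widen the pattern to suit your character set
    $freq{$_}++ for /(\w+)/g;
}
for my $w (sort { length($a) <=> length($b) || $freq{$b} <=> $freq{$a} } keys %freq) {
    printf "%8d %s\n", $freq{$w}, $w;
}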

3.2. Document structure issues

Unless your eText is extremely unusual, it will have at least some structure. Almost every eText has words organised into sentences and larger units: paragraphs, quotations, captions, headings, etc. The original may also include more explicitly presentational information such as font size and face changes, indentation, etc., some of which may reflect structural or semantic distinctions you wish to preserve.

Two primary questions arise in response to the structure:

  1. How will you notate the structural aspects of your eText?
  2. How will you process your originals to yield versions marked up with the desired notation?

3.2.1. Deciding what to notate explicitly

The matter of when to leave things as they are and when to delimit structure explicitly is by no means obvious. Below a certain level you will undoubtedly stick with the orthographic conventions of your source language, e.g. words are notated with their letters adjacent to one another, in left-to-right order, with spaces in between. Above that level you will want to use some conventions to identify structural units, but it is less clear whether orthographic conventions, even if they are available, are appropriate. Two factors will determine the point at which orthographic convention is no longer appropriate.

  1. If you need to include annotation systematically for elements of structure at a given level, then it is best not to use a (likely to be idiosyncratic) orthographic convention, but rather to make use of one of several more formal approaches. In some cases this may even operate at the level of words or sub-word units, if for example every word in your eText is annotated with its part of speech.
  2. At some point the orthographic convention becomes too hard to reliably interpret mechanically. Human readers can easily distinguish figure captions, section titles and single line paragraphs when they are set off from surrounding text by horizontal and/or vertical white space, but mechanical processing is unlikely to do as well.

The rule of thumb is that, provided they can be done reliably, difficult programming jobs should be done once, by you, the corpus creator, rather than being left to the many users of your corpus to do many times.

3.2.1.1. A note on the vexed matter of sentences

Sentences are a difficult matter. In all but the most straightforward of eText, reliably and uncontroversially determining where sentences start and end is impossible. Only if you are very sure that you will not be fooled by abbreviations, quotations, tables or simple errors in the source text should you attempt to mechanically provide explicit formal indication of sentence structure. If you must have a go, at least do not replace the orthographic signals which accompany sentence boundaries in your eText, but add sentence boundary information in a way which your users can easily ignore or remove altogether, and then be left to make their own mistakes.
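
With a markup language of the kind discussed in the next section, for example, one can wrap each putative sentence in a tag while leaving the original punctuation and spacing untouched, so that a sceptical user can strip the tags and recover the source exactly (the element and attribute names here are purely illustrative):

<s n="1">The meeting began at 10 a.m. in Room 6.</s> <s n="2">It ended at noon.</s>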

3.2.2. How is explicit structural information recorded?

Once you've decided what aspects of your eText to delimit and label, the question of how to do this in the most user-friendly and reusable way arises. There are three main alternatives, with some subsidiary choices:

  1. Design your own idiosyncratic annotation syntax;
  2. Use a database:
    1. Use a widely available spreadsheet;
    2. Use a widely available non-specialised database;
    3. Use a database designed to store eText.
  3. Use a standard markup language such as SGML or XML:
    1. Use a public DTD such as TEI or CES (see below);
    2. Design your own DTD.

Although (1) was the option of choice until fairly recently, there is no longer any justification for the fragility and non-portability which this approach engenders, and it will not be considered further here.

The database approach has some things to recommend it. The simplest version, in which the lowest level of structure occupies a column of single spreadsheet cells, has the advantage that the necessary software (e.g. Microsoft Excel, Lotus ???) is very widely available. This approach also makes the calculation of simple statistics over the corpus very straightforward, and most spreadsheet packages have built-in visualisation tools as well. It does not scale particularly well, nor does it lend itself to batch processing or easy transfer to other formats, but it may be a good starting point for small eText.

I know of only one example, MARSEC, which attempted to use an ordinary relational database for corpus storage, in their case for spoken language transcripts. For a small corpus this is also at least in principle possible, although once again portability and scalability are likely to be problems.

The third database option has recently been widely adopted within that portion of the text processing community in the United States which is funded by ARPA, in particular those involved in the TREC and MUC information retrieval and message understanding programmes. A database architecture for eText storage and structural annotation called the Tipster architecture has been specified and implemented. This architecture is also at the heart of a framework for supporting the modular development of text-processing systems called GATE, developed at the University of Sheffield.

SGML is the Standard Generalized Markup Language, ISO 8879. XML is the eXtensible Markup Language, a simplified version of SGML originally targeted at providing flexible document markup for the World Wide Web. They both provide a low-level grammar of annotation, which specifies how markup is to be distinguished from text. They also provide for the definition of the structure of families of related documents, or document types. The former says e.g. that to annotate a region of text it should be bracketed with start and end tags, in turn delimited with angle brackets:

<quotation>To be or not to be.</quotation>

Qualifications may be added to tags using attributes:

<quotation author='WS'>. . .</quotation>

Document type definitions (DTDs) are essentially context-free grammars of allowed tag structures. They also determine which attributes are allowed on which tags. Software is available to assist in designing DTDs, creating documents using a given DTD, and validating that a document conforms to a DTD.
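
For instance, a minimal DTD covering the quotation example above might read as follows (illustrative only; real corpus DTDs such as the TEI's are a great deal richer):

<!ELEMENT anthology (quotation+)>
<!ELEMENT quotation (#PCDATA)>
<!ATTLIST quotation author CDATA #IMPLIED>

This says that an anthology element consists of one or more quotations, that a quotation contains character data, and that a quotation may optionally carry an author attribute.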

This is not the place for a detailed introduction to SGML or the Tipster architecture, nor would it be appropriate to enter into a detailed comparison of the relative strengths and weaknesses of the Tipster architecture as compared to e.g. XML for local representation and manipulation of eText, which is a contentious matter already addressed in a number of scholarly papers. There is however little if any debate that for interchange SGML or XML is the appropriate representation: both the LDC and ELRA now use SGML for all their new distributions. Furthermore the Tipster architecture mandates support for export and import of SGML-annotated material. Accordingly, and in keeping with our own practice, in the balance of this chapter we will assume the use of SGML or XML.

What DTD should you use? Is your eText so special that you should define your own grammar for it, with all the attendant design and documentation work this entails? Or can you use one of the existing DTDs which are already available? These fall into two broad categories: DTDs designed for those authoring electronic documents, e.g. DOCBOOK and RAINBOW, and those designed for marking up existing text in order to produce generally useful eText for research purposes, e.g. TEI and CES. The Text Encoding Initiative (TEI) has published a rich modular set of DTDs for marking up text for scholarly purposes. All the tags and attributes are extensively documented, making life easier for both producer and consumer of eText. In addition to the core DTD, there are additional components for drama, verse, dictionaries and many other genres. The Corpus Encoding Standard (CES) is an extension of the TEI intended specifically for use in marking up large existing eText, as opposed to producing eText versions of pre-existing non-electronic documents.

Should you use SGML or XML? In the short term the TEI and CES DTDs are only available in SGML form, so using them means using SGML. But an XML version of the TEI DTD is expected to be available soon (watch the TEI mailing list), and it costs almost nothing to make the individual documents which make up your eText XML-conformant. Since all valid XML documents are valid SGML documents, you will still be able to validate your eText against the TEI DTD, while being well-placed to use the rapidly growing inventory of free XML software. If you are designing your own DTD, you probably know enough to make the decision unaided, but in the absence of a strong reason for using full SGML (e.g. you really need SUBDOC, or CONREF, or content model inclusions) I would recommend sticking with XML. In my view the apparent benefit of most of SGML's minimisation features, which are missing from XML, is illusory in the domain of eText: omitted (or short) tags, SHORTREF and DATATAG are almost always more trouble than they're worth, particularly when markup is largely both produced and exploited (semi-)automatically.

3.2.3. Uptranslation

At the beginning of this section, two issues were introduced: what means to use to annotate structure, and how to actually get that annotation added, using the markup system you've chosen, to your eText. We turn now to the second issue, called uptranslation in the SGML context.

The basic principle of formal markup is that all structure which is unambiguously annotated should be annotated consistently, using the documented formal mechanism. This means any pre-existing more or less ad-hoc markup, whether formatting codes, white space or whatever, has to be replaced with e.g. XML tags and attributes. Depending on the nature and quality of the pre-existing markup, this may be straightforward, or it may be extremely difficult.

In the case of small texts, this process can be done by hand, but with any substantial body of material, it will have to be automated. Proprietary tools designed for this task are available at a price (e.g. SGML Hammer, ???: See one of the SGML tools listings). The no-cost alternative is to use a text-oriented scripting language: sed, awk and perl have historically been the tools of choice, and these are now available for WIN32 machines as well as under UN*X.

The basic paradigm is one of step-wise refinement, writing small scripts to translate some regularity in the existing markup into tags and attributes. As in the case of character set normalisation discussed above, heuristics and by-hand post-processing may be necessary if the source material is inconsistent or corrupt.
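
For concreteness, here is what one such step might look like in Perl, turning a very common regularity (blank-line-separated blocks, with blocks containing no lower-case letters serving as headings) into tags. The conventions assumed are of course those of one particular source, not a general recipe, and the element names are illustrative.

#!/usr/bin/perl -w
use strict;
$/ = "";                             # paragraph mode: read blank-line-delimited blocks
while (my $block = <>) {
    $block =~ s/\s+$//;              # trim trailing white space
    # (a real script would also escape markup-significant characters such as & and <)
    if ($block =~ /[A-Z]/ && $block !~ /[a-z]/) {
        print "<head>$block</head>\n";
    } else {
        print "<p>$block</p>\n";
    }
}

Setting $/ to the empty string puts perl into paragraph mode, an instance of the chunking trick mentioned below.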

What is absolutely crucial is to record every step which is taken, so that the uptranslation sequence can be repeated whenever necessary. This is necessary so that when e.g. at stage eight you discover that a side-effect of stage three was to destroy a distinction which only now emerges as significant, you can go back, change the script used for stage three and run the process forward from there again without much wasted effort. This in turn means that when by-hand post-processing is done, if possible the results should be recorded in the form of context-sensitive patches (the output of diff -C), which can be re-applied if the uptranslation has to be re-done. It is impossible to overestimate the value of sticking to this disciplined approach: in processing substantial amounts of eText you are bound to make mistakes or simply change your mind, and only if the overhead in starting over from the beginning is kept as low as possible will you do it when you should. Note also that since it's the infrequent cases which tend to be the sources of difficulty, in large eText you will usually find that problems are independent of one another, so your patches and scripts which are not directly involved with the problem are likely to continue to work.

Another tactic which often proves helpful is to be as profligate in your use of intermediate versions as the file storage limits you are working with allow. Using shell scripts to process all files in one directory through the script for a given stage in the uptranslation, saving the results in another directory, and so on for each stage, functions again to reduce the cost of backing up and redoing the process from an intermediate position.
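
One way to organise this is a small driver which pushes every file in the directory for stage N through the script for stage N and deposits the results in the directory for stage N+1; the directory layout and script names below are illustrative, not a prescription.

#!/usr/bin/perl -w
use strict;
my ($stage) = @ARGV;
die "usage: $0 stage-number\n" unless defined $stage;
my $in     = "stage$stage";
my $out    = "stage" . ($stage + 1);
my $script = "scripts/stage$stage.pl";
mkdir $out, 0755 unless -d $out;
opendir(my $dh, $in) or die "cannot open $in: $!\n";
for my $f (grep { -f "$in/$_" } readdir($dh)) {
    system("perl $script < $in/$f > $out/$f") == 0
        or die "stage $stage failed on $f\n";
}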

At the risk of descending into hackery, don't forget that although shell scripts, as well as most simple text processing tools, are line-oriented, perl is not, and by judicious setting of $/ you can step through your data in more appropriate chunks.

Another parallel with character set normalisation is that validating the intermediate forms of your eText as you progressively uptranslate it is a valuable diagnostic tool.

To sum up: Uptranslation is a software development task, not a text-editing task, and all the paraphernalia of software engineering can and should be brought to bear on it.

4. Distribution

Having put all that work in, you really ought to make the results available to the wider community. This section briefly introduces the main issues which you will need to confront: what medium to use for the distribution, what tools (if any) should accompany the data, what documentation to provide, how to manage the legal issues, and exactly what to include in the distribution.

4.1. Media issues

For the last few years there has only been one plausible option for the distribution of large eText, namely CD-ROM. A CD holds over 600MB of data, which is large enough for fairly substantial collections of text. Small amounts of textual material (say under 25MB) are probably better compressed (see below) and made available via the Internet (but see below regarding licences). The drawback of CDs is that the economics and pre-production hassle vary considerably with the number you need to distribute.

In large quantities (i.e. hundreds) getting CDs pressed by a pressing plant can be quite inexpensive, as the marginal cost per disc is very small (a few pounds) and the cost of a master disc from which the copies are pressed is falling as the market is quite competitive. Exactly how much mastering costs depends on how much work you do ahead of time: if you send a pre-mastered CD image and all the artwork needed for label and insert (if any) it can be as low as 1000 pounds (??), but this can rise quickly the more work you leave for the producer. If you think your eText is of sufficient interest to warrant a large press run, you may well be better off negotiating with the LDC or ELRA to handle pre-production and pressing for you.

In small numbers you can use write-once CDs and do your own production using a relatively cheap ??? drive, but this gets both expensive and tedious for more than a dozen or so discs.

Finally note that although the standard for CD-ROMs (ISO-9660, sometimes referred to as High Sierra) is compatible with all three major platforms (UN*X, WIN32 and Macintosh), it is by nature a lowest-common-denominator format, so that e.g. symbolic links and long file names are best avoided.

In a number of ways CDs are not very satisfactory: they are too big for modest eText, too small for large eText, particularly when accompanied by digitised speech (a single CD will only hold about two hours of uncompressed high-quality stereo digitised speech) and slow to access. It may be that new technology such as Zip and SuperDrive will provide an alternative for modest eText, and DVD may be an improvement for some larger eText, but the situation for very large datasets does not look likely to improve soon.

4.2. Tools and platforms

Time was when distributing eText meant either distributing the tools to access it along with the eText itself, or condemning users to build their own. Fortunately the last few years have seen a rapid increase in the availability of free software for processing SGML, and most recently XML, for both WIN32 and UN*X platforms, so the use of one of these markup languages pretty much obviates the necessity of distributing tools. Of course special-purpose tools which take advantage of e.g. XML markup can add considerably to the value of eText, so if you've built them you should certainly consider including them. For example, along with the SGML-marked-up transcripts and the digitised audio files, our own HCRC Map Task Corpus CD-ROMs included software which links the two, so that while browsing the transcripts the associated audio is available at the click of a button.

It must be said that the situation is not as good on the Macintosh. Large-scale processing of eText is not well-supported, although free XML processing tools are beginning to be ported to the Macintosh, including our own LT XML tool suite and API.

4.2.1. (De)Compression

One of the more vexed questions associated with distribution is whether or not to compress your eText, and if so how. Particularly if the size of the uncompressed eText is just above the limit for some candidate distribution medium, the question is sure to arise. My inclination is not to compress, unless it makes a big difference in ease of distribution, e.g. by getting your total size down to what will fit on a single CD-ROM.

4.2.2. Validation

Assuming you are using XML or SGML to annotate your eText, you should certainly use a DTD, and you should certainly confirm that your eText is valid per that DTD. For SGML the state-of-the-art, bang up-to-date validator is free: James Clark's SP, available for WIN32 or UN*X. For XML the list is larger and still changing rapidly, so the best bet is to check one of the tools summary pages.

4.2.3. Search and Retrieval

What's the use of structure if you can't exploit it? (Actually, it's worth using rigorously defined markup even if you can't or won't exploit it, just for the consistency it guarantees and the documentation it provides.) A number of free tools now exist to allow structure to figure in simple searches of XML or SGML annotated eText (e.g. LT XML, sgrep: see the tools summary pages). There is clearly scope for structure-sensitivity in concordancing and other more advanced text-statistical tools, but as yet such tools have not emerged from the experimental stage.

4.3. Documentation

Another benefit of using the TEI DTD (or the TEI-based CES DTD) is that it requires a carefully-thought-out and detailed header, which records not only basic bibliographic information, but also relevant information about encoding, markup and uptranslation procedures used, if any. Any eText consumer needs to know all this, as well as the meaning of any markup whose semantics is not already published (i.e. in the TEI and/or CES specifications). Note also that bibliographic information falls into two categories: information about the original source of the eText, and information about the eText itself.
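
As an indication of the kind of information involved, a skeletal TEI header looks something like the following (the element names are the TEI's; the bracketed content is of course placeholder only):

<teiHeader>
  <fileDesc>
    <titleStmt><title>[title of the eText]</title></titleStmt>
    <publicationStmt><p>[who distributes it, and under what licence]</p></publicationStmt>
    <sourceDesc><p>[bibliographic details of the original source]</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>[character set, markup and uptranslation procedures used]</p></encodingDesc>
</teiHeader>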

4.4. Licence agreements

If you have had to sign licence agreements with the copyright owners of any of the sources of your eText, you may well be obliged thereby to in turn require agreement to licence terms from anyone to whom you distribute that eText. Even if you haven't, you must make clear to anyone who comes across your eText the terms under which they may make use of it, if any. The contribution you have made in producing a body of eText is copyright the minute you publish it in any way, so even if you wish anyone to be free to use it in any way they choose (and it is within your rights to grant this) you need to grant them an explicit licence to do so.

So at the very least you need to incorporate a copyright notice in your eText, either granting a licence, or stating clearly that unlicensed use is not allowed, together with the means, if any, whereby a licence can be obtained. In either case, most people also include some form of disclaimer of responsibility for errors etc., although the legal status of such notices is not entirely clear. Here's one we've used (borrowed from who knows where):

LTG makes no representations about the suitability of this data for any purpose. It is provided "as is" without express or implied warranty. LTG disclaims all warranties with regard to this data, including all implied warranties of merchantability and fitness, in no event shall LTG be liable for any special, indirect or consequential damages or any damages whatsoever, action of contract, negligence or other tortious action, arising out of or in connection with the use of this data.

When you actually need a licence agreement, which a user acknowledges as governing their access to the eText, you should at least look to existing precedent from the experts, the LDC or ELRA, and if necessary and you have the patience, talk to a lawyer.

For eText of major significance, particularly if you have a legal obligation as a result of licence terms you have agreed to, you may well choose to require physical signature of a paper licence agreement and/or payment before releasing copies. For eText where this would be overkill, and which is small enough to be distributed via the Internet, we have been content with an interactive licensing process, either with or without an e-mail step to get positive identification. Most intermediate-level introductions to web-site construction contain enough information and examples to put one of these together.

4.5. What to distribute

This may seem obvious: you distribute the eText, right? Not quite, or rather, not only. Along with the eText itself, its documentation and copyright notice, your distribution package should if at all possible include the raw original sources, in as close as possible a form to that which you yourself started with (note that here is the obvious place where compression may be both necessary and appropriate). You should never be so sure of your recoding and uptranslation, or so confident that your markup is what others need and want, as to restrict them to your version of the sources alone. Replication of results is not a paradigm often observed in our branch of science, but the advent of data-intensive processing and theorising makes it much more appropriate, and in my view it should apply even at the humble level we've been discussing in this chapter.

5. Resource pointers

Here's a list of pointers to the resources referred to above. In keeping with the nature of the business, they're almost all pointers to World Wide Web pages, and as such subject to change without notice. With luck there's enough identifying text so that a search engine will be able to locate their current incarnation by the time you read this.