Summary of Availability of Spoken Language Resources, with Emphasis on Non-American English Materials

6 June 1995
Henry S. Thompson
Being the results of an e-mail and net survey carried out in May of 1995.
Note that I have NOT included here the bulk of the (American English for the most part) LDC backlist, q.v.

===========================================================
First hand information (i.e. from the producer of the data)
===========================================================

Map Task (HCRC/LDC)
   http://www.cogsci.ed.ac.uk/elsnet/resources.html
   ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html
   
   The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked
   sampled audio and transcriptions of a total of about 18 hours of
   spontaneous speech that was recorded from 128 two-person
   conversations according to a detailed experimental design.

Corpus of Spoken American English (Dept of Linguistics, UC Santa Barbara):
   John DuBois <dubois@humanitas.ucsb.edu>
   We hope to have the first CD-ROM (with 22.05 kHz 16-bit stereo .WAV
   audio and transcription in Windows for PC's) sometime this summer.
   It will contain just 10 transcripts averaging 25 minutes each, or
   about 5% of the eventual one million words of material.

The Groningen Speech Corpus (SPEX)
   http://www.cogsci.ed.ac.uk/elsnet/resources.html
   The Groningen Speech Corpus was collected by A.M. Sulter, MD and
   Prof. H.K. Schutte as part of a research project funded by NWO
   (Netherlands Organization for Scientific Research). The 4 CD-ROMs
   contain over 20 hours of speech. It is a corpus of read speech
   material in Dutch, recorded on PCM tape under fairly good
   conditions.  238 speakers READING texts, sentences, words, numbers
   and 3 vowels.  Price: 750 ECU (academic use), 3000 ECU (industrial use).

Dutch Polyphone (SPEX):
   spex@spex.nl
   5000 speakers reading 50 items (digits, phonetically rich
   sentences), transliterated.

Speechstyles (SPEX)
   spex@spex.nl
   129 speakers, spontaneous speech (monologues), semi-spontaneous
   speech (picture descriptions), reading. All transliterated, and
   provided with NIST SPHERE headers. Price: about 750 ECU
   (academic), 3000 ECU (industrial).
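
   A minimal sketch, in Python, of reading a NIST SPHERE header of the
   kind mentioned above.  A SPHERE file starts with an ASCII header: a
   "NIST_1A" line, a line giving the header size in bytes, then
   "name -type value" fields up to "end_head".  The function name and
   the sample header below are made up for illustration; which fields
   the Speechstyles files actually carry is not stated here, so check
   against a real file.

```python
def parse_sphere_header(data: bytes):
    """Return (header_size, fields) for the SPHERE header at the start of data."""
    magic, _, rest = data.partition(b"\n")
    if magic.strip() != b"NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    size_line, _, _ = rest.partition(b"\n")
    header_size = int(size_line.decode("ascii").strip())

    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:      # skip the magic and size lines
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):   # blank line or comment
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:                # malformed field: skip it
            continue
        name, type_code, value = parts
        if type_code == "-i":              # integer field
            fields[name] = int(value)
        elif type_code == "-r":            # real-valued field
            fields[name] = float(value)
        else:                              # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Invented example header, padded to the declared 1024 bytes.
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"channel_count -i 1\n"
        b"sample_n_bytes -i 2\n"
        b"sample_byte_format -s2 01\n"
        b"end_head\n").ljust(1024, b" ")

size, fields = parse_sphere_header(demo)
print(size, fields)
```

   The audio samples start at byte offset header_size, so the same
   value tells a reader how much to skip before decoding.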
   
Dutch Read Text corpus (SPEX):
   spex@spex.nl
   One speaker reading 45 texts (some of them also at a fast speech
   rate).  6 texts are segmented and labelled at the phoneme level.
   Price: 200 ECU (academic), 800 ECU (industrial).

DIRECT (Sao Paulo and Liverpool):
   HELOISA COLLINS <hcollins@bra000.canal-vip.onsp.br>
   
   Development of Research in English for Commerce and Technology, a
   binational project going on in the Catholic University of Sao
   Paulo, Brazil, and the University of Liverpool in England (check
   ftp.liv.ac.uk for the working papers produced so far), has some
   spoken data that might be of interest.  As a member of the research
   team, I've done some work on public presentations (non-academic)
   and am now doing analysis of meetings. A PhD student working under
   my supervision is working on job interviews and another one is
   currently analysing conducted tours. This material is not publicly
   available yet, but we could consider making part of it available on
   an exchange basis.  Languages are English (as native, second and
   foreign language) and Brazilian Portuguese. I haven't got details
   about number of words right now (we work on the basis of the texts
   of complete communicative events), but this might give you a rough
   idea:
    - 4 presentations in English (transcribed)
    - 4 or 5 in Portuguese (not transcribed)
    - 2 in English (not transcribed)
    - 3 meetings in English and 2 in Portuguese (transcribed)
    - 10 conducted tours (transcribed)
    - 10 job interviews (transcriptions almost done)
   We have more stuff, transcribed, that has been collected by other
   members of the project.

   In principle, as I said before, there would be no cost involved,
   since we are more interested in enlarging the corpus and would,
   therefore, prefer to exchange data. We would like texts of complete
   events in the area of general business. In fact, we may be
   interested in anything which is not strictly academic.

MARSEC (Univ. Leeds & Reading):
   http://midwich.reading.ac.uk/research/speechlab/marsec/marsec.html
   The MAchine Readable Spoken English Corpus. 
   A small section of the corpus is available by anonymous ftp from
   the Speech Laboratory at Leeds University: lethe.leeds.ac.uk:/pub/marsec

EUROM0 (University College London):
   M.Huckvale@ucl.ac.uk
   EUROM0 is a CD-ROM containing spoken recordings of digits,
   sentences and passages by 4 speakers in each of 5 European
   languages. 

Japanese (ATR):
   sho@ctr.atr.co.jp
   [Name] ATR Speech Databases for Research
   [Language] Japanese
   [Description]
     Magnetic Tapes and/or CD-ROM.
     20kHz (partially 12kHz) sampling, 16bit digitized.
   [Contents]
     Set A: 8,500 Words Speech Database
	    20 speakers (10 males and 10 females)
     Set B: Phoneme-Balanced 503 Sentences Speech Database
	    10 speakers (6 males and 4 females)
     Set C: Large Size of Speakers Speech Database
     Set D: Text Speech Database
	    2 speakers (1 male and 1 female)
	    12 stories (about 400 sentences)
     Set E: English Speech Database
	    4 speakers (2 males and 2 females), about 5,000 words
     Set F: Sentence Speech Database
	    6 speakers (3 males and 3 females), about 1,100 sentences
   [Costs] Please contact the distribution coordinator.
   [Distribution Coordinator]
     Mr. Shohei TAHARA
     Research Engineering Department
     ATR (Advanced Telecommunications Research Institute) International
     2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
     Telephone: +81 774 95 1192
     Facsimile: +81 774 95 1179
     E-Mail:    sho@ctr.atr.co.jp

German (Vienna University of Technology):
   Gernot Kubin <kubin@sampo.nt.tuwien.ac.at>
   We have recorded a data base of sustained, continuant speech sounds.

   Language: German, 25 continuants.  Number of speakers: 60+,
   male/female, mixed age.  Recording conditions: DAT (48 kHz, 16
   bit), anechoic chamber.  File format: raw 16 bit integer, mono.

   Each file corresponds to an individual continuant sustained over
   approx. 1 second.
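
   A minimal sketch, in Python, of loading one such headerless file.
   The byte order is not stated above, so little-endian is assumed
   here and can be switched; the files are also assumed to keep the
   48 kHz rate of the DAT recordings.  The function names are
   illustrative only.

```python
import array
import struct
import sys

SAMPLE_RATE_HZ = 48_000  # assumed: same rate as the DAT recordings


def read_raw_pcm16(data: bytes, little_endian: bool = True) -> array.array:
    """Decode headerless mono 16-bit signed integer samples."""
    if len(data) % 2:
        raise ValueError("odd byte count: not a whole number of 16-bit samples")
    samples = array.array("h")          # signed short
    assert samples.itemsize == 2        # holds on all common platforms
    samples.frombytes(data)
    # Swap bytes when the file's byte order differs from this machine's.
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()
    return samples


def duration_seconds(samples, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Length of a mono recording in seconds."""
    return len(samples) / sample_rate_hz


# Synthetic stand-ins for real files: four known samples, then 1 s of silence.
tiny = read_raw_pcm16(struct.pack("<4h", 0, 1000, -1000, 32767))
one_second = read_raw_pcm16(bytes(2 * SAMPLE_RATE_HZ))
print(list(tiny), duration_seconds(one_second))
```

   If a decoded continuant sounds like noise, the byte-order guess is
   the first thing to flip.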

Spanish (Univ. Politecnica de Madrid):
   luis@gaps.ssr.upm.es (Luis Hernandez Gomez)
   Corpus:
    - Spanish Headlines from AT&T Bell Lab.
    - 650 sentences
    - Orthographic and phonetic transcription
   Speakers:
    - 25 male speakers. 200 sentences/speaker
    - 25 female speakers. 200 sentences/speaker
   Recording conditions:
    - Recording studio
    - 16 bits, 16 kHz

Northern Ireland Transcribed Corpus of Speech (Queen's University Belfast):   
   J.M.Kirk@qub.ac.uk
   Language: English.  Transcription: orthographic.  105 interviews
   with people from 38 localities in Northern Ireland.  3 age groups:
   children, middle-aged and elderly.  Word tokens: c. 250,000.
   Recordings made late 70s and early 80s.  Contents: for elderly and
   middle-aged: interviews about changes in the pattern of life, many
   recollections, reminiscences and anecdotes.  Lots of questions by
   the fieldworker/interviewer.  Good gender and ethnic balance, too.
   Fits on three HD 3.5" floppy disks.

==========================
No audio, transcripts only
==========================

ECI MCI (HCRC/LDC)
   http://www.cogsci.ed.ac.uk/elsnet/resources.html
   ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html

   ECI has produced Multilingual Corpus I (ECI/MCI) of over 98 million
   words, covering most of the major European languages, as well as
   Turkish, Japanese, Russian, Chinese, Malay and more. The primary
   focus in this effort is on textual material of all kinds, including
   transcriptions of spoken material. The ECI/MCI is now available at
   a price of GBP 23.50 (including GBP 3.50 VAT) for countries within
   the European Union, and GBP 20 for countries outside the European
   Union.

Japanese (ATR):
   sho@ctr.atr.co.jp
   [Name] ATR Dialogue Text Databases for Research
   [Language] Japanese and English
   [Description]
   ATR corpus contains conversations between Japanese speakers through
   telephone and/or keyboard communications.  All conversations are
   transcribed.  Morphological and syntactical tags are given.
   Corresponding English is given.  About half a million words are
   available.
   [Contents]
     Set 1: Telephone Conversation, Conference Registration Task
     Set 2: Keyboard Conversation,  Conference Registration Task
     Set 3: Telephone Conversation, Travel Arrangement Task
     Set 4: Keyboard Conversation,  Travel Arrangement Task
   [Costs] Please contact the distribution coordinator.
   [Distribution Coordinator]
     Mr. Shohei TAHARA, as above

============================
From here on are forthcoming
============================

EUROM-1 (University College London):
   M.Huckvale@ucl.ac.uk
   "CDs are currently in production"

Dutch Eurom1
   spex@spex.nl
   HIFI recordings of 64 speakers, reading passages, sentences and numbers.
   Price: not yet known.

BREF (LIMSI/CNRS, Paris):
   lamel@limsi.fr
   "soon. we were about to make an announcement and then
    had another administrative problem.

    i hope that this will be resolved very soon - we thought
    that all was in place..." -- Lori Lamel, 4/12/95

TRAINS (University of Rochester):
   Peter Heeman <heeman@cs.rochester.edu>
   
   The TRAINS spoken dialogue corpus should be available soon from the
   LDC.  It is a corpus of spoken English in a task-oriented setting.
   There are about six and a half hours of dialogue, comprising about
   55,000 spoken words.  I am not sure about the cost.
   
ShATR (Univ. Sheffield & ATR):
   B.Karlsen@dcs.shef.ac.uk
   http://www.dcs.shef.ac.uk/research/groups/spandh/ShATR.html
   
   At the moment we are finishing the last transcriptions of a very
   special corpus here at Sheffield. The corpus is called ShATR
   (Sheffield-ATR) and it contains a set of high quality recordings of
   multiple speakers speaking simultaneously. There are 4 British
   English speakers and 1 American English speaker. The corpus
   contains 8 channels: one for each speaker (head mounted mic), an
   omnidirectional mic, and the left and right channels of an
   acoustical manikin with artificial ears. The data are in NeXT/Sun
   sound file format, and there are almost 37 min. of speech at 48 kHz
   sampling rate (16 bit linear) for each channel.  Only part of the
   corpus will be made available via ftp; the entire corpus will be
   available for purchase from the LDC (Linguistic Data Consortium, US)
   on CD-ROMs when the corpus is finished. Price is still unknown.
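
   A rough storage estimate for the ShATR audio, sketched in Python
   from the figures quoted above (8 channels, almost 37 minutes each,
   48 kHz, 16 bit linear).  It treats 37 minutes as an upper bound and
   ignores file headers; the helper name is illustrative only.

```python
def pcm_bytes(minutes: float, sample_rate_hz: int,
              bits_per_sample: int, channels: int) -> int:
    """Uncompressed linear PCM size in bytes, excluding any file header."""
    seconds = minutes * 60
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


per_channel = pcm_bytes(37, 48_000, 16, 1)
all_channels = pcm_bytes(37, 48_000, 16, 8)

print(f"per channel:  {per_channel:,} bytes")    # 213,120,000 bytes
print(f"all channels: {all_channels:,} bytes")   # 1,704,960,000 bytes
```

   At roughly 1.7 GB for the full set, the audio is a sizeable
   download, which fits with only part of the corpus being offered
   via ftp.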

BABEL (Reading Univ. and others):
   http://midwich.rdg.ac.uk/
   New European (Copernicus) project based in Reading, making a
   SAM-style database of Bulgarian, Estonian, Hungarian, Polish and
   Romanian.  Some Bulgarian data already available.

=========================================================================
Second hand (i.e. someone says "I believe that [someone else] has [...]")
=========================================================================

University of Victoria Phonetic Database:
   Sampled data files from 45 languages (including some Amerindian
   ones I had never heard of), together with phonetic and orthographic
   transcriptions and software for playing from CD-ROM using PC with
   Soundblaster card. I have played with this, but don't yet own a
   copy. Available for about $470 from Speech Technology Research
   Ltd. in Victoria, fax. 604/477-2540

The Oxford Acoustic Database:
   Produced by Brian Pickering and Burt Rosner, published by Oxford
   University Press; cost somewhere around 100 pounds. I've lost the
   details, but I think there are about 8 well-known languages on
   it.

Fin-DSDB (Helsinki):
   aiivonen@helsinki.fi
   Finnish Digital Speech Database, including an editing and analysing
   program QuickSig designed by Matti Karjalainen and Toomas Altosaar
   (Helsinki Technical University); database designed in collaboration
   with Department of Phonetics, University of Helsinki.

PHONDAT (Kiel University):
   Now on sale from Klaus Kohler's Dept., text in 2 vols. of the Kiel
   working papers (AIPUK 27/8). All German.

SCRIBE (various UK partners):
   
   Now on sale at DRA Malvern. CD-ROMs and time-aligned
   transcriptions. All English.

Spanish spoken material:
   
   There is at least one corpus of spoken Spanish available at the
   Universidad Autonoma de Madrid.

   ftp://lola.lllf.uam.es/pub/corpus/

   There are also some South American corpora there but they are
   probably written texts.

   I have tried on several occasions to download the description of
   the oral corpus but have had nothing but problems, even though the
   corpus itself is fine, so I cannot say much about the sources used.
   The corpus takes up about 7Mb.

CHILDES:
   brian+@andrew.cmu.edu (Brian MacWhinney)
   The Child Language Data Exchange System reportedly has oral
   child-adult conversational material.

ASJ (JIPDEC):
   http://www.itl.atr.co.jp/cocosda/corpora/japanese
   1. Corpus name: ASJ Continuous Speech Corpus for Research 
   2. Producer: Japan Information Processing Development Center
   3. Contents: Vol. 1-3 : ATR 503 PB sentences (read speech)
			   64 speakers (30 males & 34 females)
			   9,600 sentences
		Vol. 4-6 : Various guide task sentences (read speech)
			   36 speakers (18 males & 18 females)
			   12,474 sentences
		Vol. 7   : Simulated dialogues with transcribed texts
			   37 speakers (29 males & 8 females)
			   37 dialogues
   4. A/D condition: 16 kHz sampling rate, 16 bit quantization
   5. Media: CD-ROM (ISO 9660)
   6. Distribution condition: for non-commercial purposes
   7. Price: Yen 3,090/vol + mailing cost
   8. Note: Submission of license agreement form is required
   9. Person in charge:

		K. KATAOKA
		AI and Fuzzy Promotion Center,  
		Japan Information Processing Development Center (JIPDEC)  
		3-5-8 Shibakoen, Minatoku, Tokyo 105, JAPAN

		TEL. +81 3 3432 9390 
		FAX. +81 3 3431 4324  

   Note:
   Only a few copies of volumes 1 to 3 of the ASJ corpus are
   available; a hundred or more copies are available for volumes 4 to 7.
   Some CD-ROM volumes may be reproduced if there are many requests.

JEIDA Noise Database:
   http://www.itl.atr.co.jp/cocosda/corpora/japanese
   2. Producer: Japan Electronic Industry Development Association
   3. Reference: Mr. T. Kitamura, Sunrise Music Inc.
   4. Content: Various environmental noise
   5. Speakers, Repetition: 17 sorts of noise in 17 DAT cassettes
   6. AD conversion condition: 48 kHz, 16 bits
   7. Distribution media/way: 18 DAT cassettes, one of which is a
                              digest tape of 17 sorts of noise
   8. Distribution condition: for non-commercial purposes
   9. Others: Submission of license agreement form is required

   Contact address of Mr. Kitamura:
      4-7-6 Akasaka, Minato, Tokyo 107, Japan
      Sunrise Music Co. Ltd.
      Tel: +81 3 3585 6541
      Fax: +81 3 3585 6748
   
   Cost of dubbing for one set: Yen 72,000.- including tapes.

   Contents are as follows.
    1. Automobile cabin (Medium-size car)
    2. Automobile cabin (Compact car)
    3. Exhibition hall A (In a booth)
    4. Exhibition hall B (In a passage)
    5. Railway station (Near ticket vending machines / In a passage)
    6. Telephone booth (Down town)
    7. Factory (Machinery / Press)
    8. Parcel classification works
    9. Trunk road / Road crossing
   10. Crowded street
   11. New trunkline train
   12. Ordinary train
   13. Computer room A (Minicomputers)
   14. Computer room B (Workstations)
   15. Large air conditioner
   16. Air conditioning fan coil / Ventilation duct
   17. Elevator passage (Hospital / Department store)
   18. Digest tape of Nos. 1 to 17
   
Non-native French (Univ. of London):
   j.dewaele@french.bbk.ac.uk
   Debates, formal and informal interviews with non-native speakers
   (Dutch).  Audio tape and transcriptions on diskettes.