Being the results of an e-mail and net survey carried out in May of 1995.
Note that I have NOT included here the bulk of the (American English for the most part) LDC backlist, q.v.
===========================================================
First hand information (i.e. from the producer of the data)
===========================================================

Map Task (HCRC/LDC)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html
The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked sampled audio and transcriptions of a total of about 18 hours of spontaneous speech, recorded from 128 two-person conversations according to a detailed experimental design.

Corpus of Spoken American English (Dept of Linguistics, UC Santa Barbara): John DuBois <dubois@humanitas.ucsb.edu>
We hope to have the first CD-ROM (with 22.05 kHz 16-bit stereo .WAV audio and transcription in Windows for PCs) sometime this summer. It will contain just 10 transcripts averaging 25 minutes each, or about 5% of the eventual one million words of material.

The Groningen Speech Corpus (SPEX)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
The Groningen Speech Corpus was collected by A.M. Sulter, MD and Prof. H.K. Schutte as part of a research project funded by NWO (Netherlands Organization for Scientific Research). The 4 CD-ROMs contain over 20 hours of speech. It is a corpus of read speech material in Dutch, recorded on PCM tape under fairly good conditions. 238 speakers READING texts, sentences, words, numbers and 3 vowels.
Price: 750 ECU (academic use), 3000 ECU (industrial use).

Dutch Polyphone (SPEX): spex@spex.nl
5000 speakers reading 50 items (digits, phonetically rich sentences), transliterated.

Speechstyles (SPEX): spex@spex.nl
129 speakers: spontaneous speech (monologues), semi-spontaneous speech (picture descriptions), reading. All transliterated, and provided with NIST Sphere headers.
Price: about 750 ECU (academic), 3000 ECU (industrial).

Dutch Read Text corpus (SPEX): spex@spex.nl
One speaker reading 45 texts (some of them also at fast speech rate). 6 texts are segmented and labelled at the phoneme level.
Price: 200 ECU (academic), 800 ECU (industrial).

DIRECT (Sao Paulo and Liverpool): HELOISA COLLINS <hcollins@bra000.canal-vip.onsp.br>
Development of Research in English for Commerce and Technology, a binational project going on in the Catholic University of Sao Paulo, Brazil, and the University of Liverpool in England (check ftp.liv.ac.uk for the working papers produced so far), has some spoken data that might be of interest. As a member of the research team, I've done some work on public presentations (non-academic) and am now doing analysis of meetings. A PhD student working under my supervision is working on job interviews and another one is currently analysing conducted tours. This material is not publicly available yet, but we could consider making part of it available on an exchange basis. Languages are English (as native, second and foreign language) and Brazilian Portuguese. I haven't got details about number of words right now (we work on the basis of the texts of complete communicative events), but this might give you a rough idea:
  4 presentations in English (transcribed)
  4 or 5 in Portuguese (not transcribed)
  2 in English (not transcribed)
  3 meetings in English and 2 in Portuguese (transcribed)
  10 conducted tours (transcribed)
  10 job interviews (transcriptions almost done)
We have more stuff, transcribed, that has been collected by other members of the project. In principle, as I said before, there would be no cost involved, since we are more interested in enlarging the corpus and would, therefore, prefer to exchange data. We would like texts of complete events in the area of general business. In fact, we may be interested in anything which is not strictly academic.

MARSEC (Univ. Leeds & Reading):
http://midwich.reading.ac.uk/research/speechlab/marsec/marsec.html
The MAchine Readable Spoken English Corpus. A small section of the corpus is available for anonymous ftp from The Speech Laboratory at Leeds University.
lethe.leeds.ac.uk:/pub/marsec

EUROM0 (University College London): M.Huckvale@ucl.ac.uk
EUROM0 is a CD-ROM containing spoken recordings of digits, sentences and passages by 4 speakers in each of 5 European languages.

Japanese (ATR): sho@ctr.atr.co.jp
[Name] ATR Speech Databases for Research
[Language] Japanese
[Description] Magnetic tapes and/or CD-ROM. 20 kHz (partially 12 kHz) sampling, 16-bit digitized.
[Contents]
  Set A: 8,500 Words Speech Database
         20 speakers (10 males and 10 females)
  Set B: Phoneme-Balanced 503 Sentences Speech Database
         10 speakers (6 males and 4 females)
  Set C: Large Size of Speakers Speech Database
  Set D: Text Speech Database
         2 speakers (1 male and 1 female), 12 stories (about 400 sentences)
  Set E: English Speech Database
         4 speakers (2 males and 2 females), about 5,000 words
  Set F: Sentence Speech Database
         6 speakers (3 males and 3 females), about 1,100 sentences
[Costs] Please contact the distribution coordinator.
[Distribution Coordinator]
  Mr. Shohei TAHARA
  Research Engineering Department
  ATR (Advanced Telecommunications Research Institute) International
  2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
  Telephone: +81 774 95 1192
  Facsimile: +81 774 95 1179
  E-Mail: sho@ctr.atr.co.jp

German (Vienna University of Technology): Gernot Kubin <kubin@sampo.nt.tuwien.ac.at>
We have recorded a data base of sustained, continuant speech sounds.
  Language: German, 25 continuants.
  Number of speakers: 60+, male/female, mixed age.
  Recording conditions: DAT (48 kHz, 16 bit), anechoic chamber.
  File format: raw 16 bit integer, mono. Each file corresponds to an individual continuant sustained over approx. 1 second.

Spanish (Univ. Politecnica de Madrid): luis@gaps.ssr.upm.es (Luis Hernandez Gomez)
Corpus:
  - Spanish Headlines from AT&T Bell Lab.
  - 650 sentences
  - Orthographic and phonetic transcription
Speakers:
  - 25 male speakers. 200 sentences/speaker
  - 25 female speakers. 200 sentences/speaker
Recording conditions:
  - Recording studio
  - 16 bits, 16 kHz

Northern Ireland Transcribed Corpus of Speech (Queen's University Belfast): J.M.Kirk@qub.ac.uk
Language: English
Transcription: Orthographic
105 interviews with people from 38 localities in Northern Ireland. 3 age groups: children, middle-aged and elderly.
Word tokens: c. 250,000
Recordings made late 70s, early 80s
Contents: for elderly and middle-aged: interviews about changes in the pattern of life, many recollections, reminiscences and anecdotes. Lots of questions by the fieldworker/interviewer. Good gender and ethnic balance, too.
Fits on three HD 3.5" floppy disks

==========================
No audio, transcripts only
==========================

ECI MCI (HCRC/LDC)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html
ECI has produced Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material. The ECI/MCI is now available at a price of GBP 23.50 (including GBP 3.50 VAT) for countries within the European Union, and GBP 20 for countries outside the European Union.

Japanese (ATR): sho@ctr.atr.co.jp
[Name] ATR Dialogue Text Databases for Research
[Language] Japanese and English
[Description] ATR corpus contains conversations between Japanese speakers through telephone and/or keyboard communications. All conversations are transcribed. Morphological and syntactical tags are given. Corresponding English is given. About half a million words are available.
[Contents]
  Set 1: Telephone Conversation, Conference Registration Task
  Set 2: Keyboard Conversation, Conference Registration Task
  Set 3: Telephone Conversation, Travel Arrangement Task
  Set 4: Keyboard Conversation, Travel Arrangement Task
[Costs] Please contact the distribution coordinator.
[Distribution Coordinator] Mr. Shohei TAHARA, as above

============================
From here on are forthcoming
============================

EUROM-1 (University College London): M.Huckvale@ucl.ac.uk
"CDs are currently in production"

Dutch Eurom1: spex@spex.nl
HIFI recordings of 64 speakers, reading passages, sentences and numbers.
Price: not yet known.

BREF (LIMSI/CNRS, Paris): lamel@limsi.fr
"soon. we were about to make an announcement and then had another administrative problem. i hope that this will be resolved very soon - we thought that all was in place..." -- Lori Lamel, 4/12/95

TRAINS (University of Rochester): Peter Heeman <heeman@cs.rochester.edu>
The TRAINS spoken dialogue corpus should be available soon from the LDC. It is a corpus of spoken English in a task-oriented setting. There are about 6 and a half hours of dialogue, comprising about 55,000 words spoken. I am not sure about the cost.

ShATR (Univ. Sheffield & ATR): B.Karlsen@dcs.shef.ac.uk
http://www.dcs.shef.ac.uk/research/groups/spandh/ShATR.html
At the moment we are finishing the last transcriptions of a very special corpus here at Sheffield. The corpus is called ShATR (Sheffield-ATR) and it contains a set of high quality recordings of multiple speakers speaking simultaneously. There are 4 British English speakers and 1 American English speaker. The corpus contains 8 channels: one for each speaker (head mounted mic), an omnidirectional mic, and the left and right channels of an acoustical mannikin with artificial ears. The data are in NeXT/Sun sound file format, and there are almost 37 min. of speech at 48 kHz sampling rate (16 bit linear) for each channel.
Only part of the corpus will be made available via ftp; the entire corpus will be available for purchase from LDC (Linguistic Data Consortium, US) on CD-ROMs when the corpus is finished. Price is still unknown.

BABEL (Reading Univ. and others): http://midwich.rdg.ac.uk/
New European (Copernicus) project based in Reading, making a SAM-style database of Bulgarian, Estonian, Hungarian, Polish and Romanian. Some Bulgarian data already available.

=========================================================================
Second hand (i.e. someone says "I believe that [someone else] has [...]")
=========================================================================

University of Victoria Phonetic Database:
Sampled data files from 45 languages (including some Amerindian ones I had never heard of), together with phonetic and orthographic transcriptions and software for playing from CD-ROM using a PC with a Soundblaster card. I have played with this, but don't yet own a copy. Available for about $470 from Speech Technology Research Ltd. in Victoria, fax. 604/477-2540

The Oxford Acoustic Database:
Produced by Brian Pickering and Burt Rosner, published by Oxford University Press; cost somewhere around 100 pounds. I've lost the details, but I think there are about 8 well-known languages on it.

Fin-DSDB (Helsinki): aiivonen@helsinki.fi
Finnish Digital Speech Database, including an editing and analysing program, QuickSig, designed by Matti Karjalainen and Toomas Altosaar (Helsinki Technical University); database designed in collaboration with the Department of Phonetics, University of Helsinki.

PHONDAT (Kiel University):
Now on sale from Klaus Kohler's Dept., text in 2 vols. of the Kiel working papers (AIPUK 27/8). All German.

SCRIBE (various UK partners):
Now on sale at DRA Malvern. CD-ROMs and time-aligned transcriptions. All English.

Spanish spoken material:
There is at least one corpus of spoken Spanish available at the Universidad Autonoma de Madrid.
ftp://lola.lllf.uam.es/pub/corpus/
There are also some South American corpora there but they are probably written texts. I have tried on several occasions to download the description of the oral corpus but have had nothing but problems, even though the corpus itself is fine, so I cannot say much about the sources used. The corpus takes up about 7Mb.

CHILDES: brian+@andrew.cmu.edu (Brian MacWhinney)
The Child Language Data Exchange System reportedly has oral child-adult conversational material.

ASJ (JIPDEC): http://www.itl.atr.co.jp/cocosda/corpora/japanese
1. Corpus name: ASJ Continuous Speech Corpus for Research
2. Producer: Japan Information Processing Development Corporation
3. Contents:
   Vol. 1-3: ATR 503 PB sentences (read speech)
             64 speakers (30 males & 34 females), 9,600 sentences
   Vol. 4-6: Various guide task sentences (read speech)
             36 speakers (18 males & 18 females), 12,474 sentences
   Vol. 7:   Simulated dialogues with transcribed texts
             37 speakers (29 males & 8 females), 37 dialogues
4. A/D condition: 16 kHz sampling rate, 16 bit quantization
5. Media: CD-ROM (ISO 9660)
6. Distribution condition: for non-commercial purposes
7. Price: Yen 3,090/vol + mailing cost
8. Note: Submission of license agreement form is required
9. Person in charge:
   K. KATAOKA
   AI and Fuzzy Promotion Center,
   Japan Information Processing Development Center (JIPDEC)
   3-5-8 Shibakoen, Minatoku, Tokyo 105, JAPAN
   TEL. +81 3 3432 9390
   FAX. +81 3 3431 4324
Note: Only a few copies of volumes 1 to 3 of the ASJ corpus are available; a hundred or more copies are available for volumes 4 to 7. Some volumes may be reproduced if there are many requests.

JEIDA Noise Database: http://www.itl.atr.co.jp/cocosda/corpora/japanese
2. Producer: Japan Electronic Industry Development Association
3. Reference: Mr. T. Kitamura, Sunrise Music Inc.
4. Content: Various environmental noise
5. Speakers, Repetition: 17 sorts of noise in 17 DAT cassettes
6. AD conversion condition: 48 kHz, 16 bits
7. Distribution media/way: 18 DAT cassettes, one of which is a digest tape of 17 sorts of noise
8. Distribution condition: for non-commercial purposes
9. Others: Submission of license agreement form is required
Contact address of Mr. Kitamura:
   Sunrise Music Co. Ltd.
   4-7-6 Akasaka, Minato, Tokyo 107, Japan
   Tel: +81 3 3585 6541
   Fax: +81 3 3585 6748
Cost of dubbing for one set: Yen 72,000.- including tapes.
Contents are as follows.
    1. Automobile cabin (Medium-size car)
    2. Automobile cabin (Compact car)
    3. Exhibition hall A (In a booth)
    4. Exhibition hall B (In a passage)
    5. Railway station (Near ticket vending machines / In a passage)
    6. Telephone booth (Down town)
    7. Factory (Machinery / Press)
    8. Parcel classification works
    9. Trunk road / Road crossing
   10. Crowded street
   11. New trunkline train
   12. Ordinary train
   13. Computer room A (Minicomputers)
   14. Computer room B (Workstations)
   15. Large air conditioner
   16. Air conditioning fan coil / Ventilation duct
   17. Elevator passage (Hospital / Department store)
   18. Digest tape of Nos. 1 to 17

Non-native French (Univ. of London): j.dewaele@french.bbk.ac.uk
Debates, formal and informal interviews with non-native speakers (Dutch). Audio tape and transcriptions on diskettes.