Anyone interested in obtaining any of these corpora, please leave a comment.
Introducing: The LDC Institute
Membership Year 2004 in Review
LDC2004S04 2002 NIST Speaker Recognition Evaluation (SRE)
LDC2004T11 Arabic Treebank: Part 3 v.1.0
LDC2004S05 ISL Meeting Corpus Speech Part 1
LDC2004T10 ISL Meeting Corpus Transcripts Part 1
In this month's update, the Linguistic Data Consortium (LDC) would like to introduce the LDC Institute, review Membership Year 2004, and announce the availability of four new corpora.
(1) For the past two years, the LDC has hosted the LDC Institute, a seminar series on issues in language data and database creation. The goals of the series are to create a forum to communicate experience in data collection, standards, and annotation, and to work with researchers and others who may be interested in LDC data or who may wish to contribute new data to the archives. Past presentation topics have ranged from information extraction from biomedical texts to the Pennsylvania Sumerian Dictionary project to interfaces for parser and dictionary access.
We would like to invite the LDC community to learn more about this seminar series by visiting the LDC Institute web page.
(2) Each year the LDC strives to provide a rich and diverse array of corpora for LDC members and nonmembers. Membership Year 2004 is shaping up to be no different! In the last few months, we have released nine publications, including treebanks in Arabic and Chinese, English meeting data, and Czech broadcast news. These corpora are:
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2004T05 Chinese Treebank Version 4.0
LDC2004S01 Czech Broadcast News Speech
LDC2004T01 Czech Broadcast News Transcripts
LDC2004S02 ICSI Meeting Speech
LDC2004T04 ICSI Meeting Transcripts
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004T03 Morphologically Annotated Korean Text
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
For further information on each of the above, please visit:
http://www.ldc.upenn.edu/Catalog/ByYear.jsp#2004
(3) The 2002 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition.
The main data for the 2002 NIST Speaker Recognition Evaluation was extracted from Switchboard Cellular Part 2. The extended data task used phases 2 and 3 of Switchboard II. This evaluation also included the first multi-modal task, using data from the FBI voice database. There are a total of 9153 speech files in SPHERE format, for a total of ~156 hours. 2002 NIST Speaker Recognition Evaluation is distributed on 2 DVDs.
For further information, including a link to the 2002 NIST Speaker Recognition Evaluation website, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$1000.
(4) Arabic Treebank: Part 3 v 1.0 is the third part of a corpus of 1,000,000 words of Arabic Treebank, designed to support language research and development of language technology for Modern Standard Arabic. This corpus includes 600 stories from the An Nahar News Agency. There are a total of 340,281 words (counting non-Arabic tokens such as numbers and punctuation) in the 600 files - one story per file. New features of annotation include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.
The corpus contains 293,035 Arabic-only word tokens (prior to the separation of clitics), of which 290,842 (99.25%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 2,193 (0.75%) were items that the morphological parser failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is distributed on 1 CD.
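For readers who want to cross-check the token counts quoted above, the arithmetic can be reproduced in a few lines of Python. This is only an illustrative sketch: the variable names and the `share` helper are our own and are not part of the corpus documentation or any LDC tool.

```python
# Token counts as quoted in the announcement above.
total_tokens = 293_035  # Arabic-only word tokens, before clitic separation
analyzed = 290_842      # tokens given an acceptable analysis and POS tag
failed = 2_193          # tokens the morphological parser failed to analyze


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


# The two groups should account for every Arabic-only token.
assert analyzed + failed == total_tokens

print(f"analyzed: {share(analyzed, total_tokens)}%")
print(f"failed:   {share(failed, total_tokens)}%")
```

The two counts sum to the stated total, and the rounded shares agree with the 99.25% and 0.75% figures given in the text.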
For further information, including online documentation, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$3000.
(5) ISL Meeting Speech Part 1 is the first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes and averages 34 minutes. Word-level orthographic transcriptions are available as ISL Meeting Transcripts Part 1.
ISL Meeting Speech Part 1 includes 105 speech files, for a total of approximately 10 hours of meeting speech. There are a total of 31 unique speakers in the corpus. Meetings involved anywhere from 3 to 9 participants, averaging 5. The corpus contains a significant proportion of non-native English speakers, varying in fluency. ISL Meeting Speech Part 1 is distributed on 2 DVDs.
For further information, including a link to the ISL Meeting Room project page, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$1500.
(6) ISL Meeting Transcripts Part 1 is the corresponding transcription for ISL Meeting Speech Part 1.
Transcriptions were prepared with the TransEdit transcription application. This application was developed for the transcription of multi-channel recordings and displays a synchronized multi-track view of all channels of a meeting, with listening and segmentation functions for each channel separately. ISL Meeting Transcripts Part 1 is distributed by FTP transfer.
For further information, including a sample transcript, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$500.
If you need additional information or would like to inquire about membership in the LDC, please send email to the address below.
----------------------------------------------------------------------------------------------------
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104-2653
Phone: 1 (215) 573-1275
Fax: 1 (215) 573-2175
Email: ldc@ldc.upenn.edu
Web: http://www.ldc.upenn.edu
Beatriz Maeireizo Tokeshi
On May 26, 2004, Beatriz will give a short talk about the poster submitted to ACL 2004, a result of the research done in the CS PhD course 2002 (Research Experience in CS).
ABSTRACT
Natural Language Processing applications often require large amounts of annotated training data, which are expensive to obtain. In this paper we investigate the applicability of Co-training to train classifiers that predict emotions in spoken dialogues. In order to do so, we have first applied the wrapper approach with Forward Selection and Naïve Bayes, to reduce the dimensionality of our feature set. Our results show that Co-training can be highly effective when a good set of features are chosen.
The NLP Group will continue its weekly meetings throughout the summer. This week, we will meet to set up the talk schedule for the rest of the term.
Mark Core
University of Edinburgh
Monday May 10, 10:00
731 LRDC
This work is the first systematic investigation of initiative in
human-human tutorial dialogue. We studied initiative management in two
dialogue strategies: didactic tutoring and Socratic tutoring. We
hypothesized that didactic tutoring would be mostly tutor-initiative while
Socratic tutoring would be mixed-initiative, and that more student
initiative would lead to more learning (i.e., task success for the
tutor). Surprisingly, students had initiative more of the time in the
didactic dialogues (21% of the turns) than in the Socratic dialogues (10%
of the turns), and there was no direct relationship between student
initiative and learning. However, Socratic dialogues were more interactive
than didactic dialogues as measured by percentage of tutor utterances that
were questions and percentage of words in the dialogue uttered by the
student, and interactivity had a positive correlation with learning.
(The above is his EACL 2003 talk. Since that was a short talk,
if time permits he might also present some research that he is
presenting at HLT-NAACL...
Robustness versus Fidelity in Natural Language Understanding
A number of issues arise when trying to scale up natural language
understanding (NLU) tools designed for relatively simple domains (e.g.,
flight information) to domains such as medical advising or tutoring where
deep understanding of user utterances is necessary. Because the subject
matter is richer, the range of vocabulary and grammatical structures is
larger, meaning NLU tools are more likely to encounter out-of-vocabulary
words or extra-grammatical utterances. This is especially true in medical
advising and tutoring where users may not know the correct vocabulary and
use common-sense terms or descriptions instead. Techniques designed to
improve robustness (e.g., skipping unknown words, relaxing grammatical
constraints, mapping unknown words to known words) are effective at
increasing the number of utterances for which an NLU sub-system can produce
a semantic interpretation. However, such techniques introduce additional
ambiguity and can lead to a loss of fidelity (i.e., a mismatch between the
semantic interpretation and what the language producer meant). To control
this trade-off, we propose semantic interpretation confidence scores akin
to speech recognition confidence scores, and describe our initial attempt
to compute such a score in a modularized NLU sub-system.)
----
Short bio:
Mark received his Ph.D. from the University of Rochester under the supervision
of Len Schubert. The subject of his dissertation was dialog parsing; his
dialog parser identified speech repairs as well as the dialogue acts of
utterances. Since 2000, Mark has been a researcher at the University
of Edinburgh, working with Johanna Moore on the BEETLE tutorial dialogue
system. He built a natural language understanding module for BEETLE using
the CARMEL workbench, adding features such as unknown word handling and
semantic-confidence-score calculation. The second area of his research is
dialogue annotation and analysis, looking at phenomena such as initiative,
and dialogue acts and games.
Joel Tetreault
University of Rochester
Friday May 7, 1:30
731 LRDC
In a spoken dialog system, the job of a reference resolution module is to
identify noun phrases and resolve them to entities evoked in the dialogue.
This involves finding antecedents for pronouns such as "that" or "they" and
resolving definite noun phrases such as "the two hospitals" or "the ambulance
here." Though reference is just one part of the overall interpretation of
a sentence, it is a very important piece because failure to resolve the
entities in a sentence correctly can lead to an incorrect interpretation
of a sentence and thus an erroneous response to the user.
Many approaches to reference resolution, specifically pronoun resolution,
have relied heavily on syntactic and surface features. While these
methods perform very well, resolving as many as 80% of
the pronouns in a large corpus correctly, the "20% gap" has been hard
to overcome because these pronouns require additional information on top of
syntactic features for resolution. In this talk I present work that
incorporates discourse structure and semantic features into a pronoun
resolution algorithm to improve performance over two types of corpora: a
newspaper domain (Penn Treebank) and human-human spoken dialogue.
Short Bio:
Joel Tetreault is in the final year of his PhD in Computer Science at the
University of Rochester. He received his bachelor's degree from Harvard
University in 1998 and Master's from Rochester in 2000. His main
interest is Natural Language Processing. He has done work in reference
resolution, discourse processing, spoken dialogue systems, and information
retrieval techniques for detecting affect.