This year's Speech Communication Best Paper Award goes to the paper:
Julia Hirschberg, Diane Litman, Marc Swerts, "Prosodic and Other Cues
to Speech Recognition Failures", Speech Communication, 43(1-2):155-176, 2004.
Congratulations to Mihai on passing his comprehensive exam today!
His reading lists, writeups, and presentation can be found
online.
In this month's newsletter, the LDC would like to announce the availability of a new LDC Online service and the release of three new corpora.
------------------------------------------------------------------------
The LDC is pleased to announce that an improved LDC Online service is now available. LDC Online can be accessed at the following URL:
https://online.ldc.upenn.edu/login.html
Organizations that hold 2005 Membership in the LDC will be able to perform text searches on our entire English Gigaword corpus. This corpus is a comprehensive archive of newswire text data that has been acquired over several years by the LDC. Current members will also be able to access the American English Spoken Lexicon (AESL). AESL contains pronunciations in individual audio files for more than 50,000 of the most common words in English.
Even if your organization is not a current member, you can access LDC Online through a guest account. As a guest, an LDC Online user will be able to access the American English Spoken Lexicon.
We will offer periodic updates to LDC Online to include new corpora and search functions. Please check in with us often as we anticipate this will be an exciting offering.
------------------------------------------------------------------------
ACE 2004 Multilingual Training Corpus
Sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic. ACE 2004 Multilingual Training Corpus is distributed on one CD-ROM.
2005 Subscription Members will automatically receive two copies of this corpus. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$3000.
*
Chinese News Translation Text Part 1
The source Chinese text and its English translations were selected and translated in different LDC projects. A total of about 474K Chinese characters were selected from two sources, namely Xinhua and AFP, and translation services were provided by seven translation agencies. Each Chinese news story was translated once. Chinese News Translation Text Part 1 is distributed via ftp.
2005 Subscription Members will automatically receive two copies of this corpus on CD-ROM. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$1500.
*
Discourse Treebank
2005 Subscription Members will automatically receive two copies of this corpus on CD-ROM. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$200.
------------------------------------------------------------------------
If you need further information, or would like to inquire about membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573 1275.
Linguistic Data Consortium       Phone: (215) 573-1275
University of Pennsylvania       Fax: (215) 573-2175
3600 Market St., Suite 810       ldc@ldc.upenn.edu
Philadelphia, PA 19104           http://www.ldc.upenn.edu
Hirschberg, Litman and Swerts 2004 is currently the hottest article in Speech Communication.
At CS Day on February 18, 2005, Carol Nichols won the best undergraduate award, and Mihai Rotaru won the graduate research competition award. Great job to both of you!
Carol received an Honorable Mention for CRA's Outstanding Undergraduate Award for 2005! Great job, Carol!
We now have a conference page with basic information about important conference dates. The page is at: http://nlp.cs.pitt.edu/dates.htm
I have also marked the blog for two of the conference deadlines (ACL & AAAI). Please feel free to add information to the conference page and also the blog as new information about conferences arrives.
The meeting time for the Fall term has changed to Mondays, 12:30 to 2pm.
Next meeting will be on Sept. 13th.
We will have an organizational meeting Wednesday, September 1, 12:15, Room 6329.
I have created an internal page which will host presentations and related materials for our weekly meetings. The page address is:
http://nlp.cs.pitt.edu/presentations/
We now have a repository for NLP related conference papers.
Please note that due to copyright restrictions, this page is only accessible within the relevant subdomains of pitt.edu (cs, isp, lrdc, etc.).
If you're not able to access the page from your domain (inside Pitt), please contact Behrang.
See the "Teaching Computers to Teach Like Humans" article in the June 7, 2004 Pitt Chronicle!
http://www.discover.pitt.edu/media/pcc/comps_like_humans.html
Also visit Pitt's website (www.pitt.edu) to see a press release from June 3 2004 about our research.
The text of the release is also below.
FOR IMMEDIATE RELEASE
June 3, 2004
Contact: Patricia Lomando White
412-624-9101
laer@pitt.edu
Pitt Researchers Developing Computers That Teach Like Humans
Natural language recognition key to improved tutoring by machines
PITTSBURGH—While new federal education rules emphasizing testing and standards have fueled a tutoring boom, relatively few pupils enjoy access to effective but costly one-on-one teaching. In an effort to spread the intellectual wealth, scientists at the University of Pittsburgh’s Learning Research and Development Center (LRDC) are working to bring individual instruction to all students.
With $2.5 million from the National Science Foundation (NSF), principal investigator (PI) Kurt VanLehn, a Pitt computer science professor and LRDC senior scientist, is working to build less expensive computer tutors as good as their more expensive human counterparts. Looking specifically at the best ways to teach and learn physics, VanLehn and his colleagues are probing both tutor and student behavior.
“The computer tutors available in stores today just tell you if your answer is right or wrong,” VanLehn said. “With a human tutor, though, students can do much more,” including discussing their reading with the tutor and getting help solving longer, more complex problems.
A major difference between human and computer tutors has been that only human tutors understand unconstrained natural language—the conversational, open-ended give-and-take that can often flummox the smartest software.
Today, commercial educational technology involves two response formats: multiple choice and mathematical formulas. If all goes as planned, a tutoring program should be on the market in five to 10 years that can handle open-ended questions and analyze the students’ text or speech responses.
The LRDC team’s basic approach to improving computer tutoring is to simply study and learn from interactions between humans and computer tutors. As more effective dialogue strategies are identified, they will be incorporated into a natural language-based tutoring system.
LRDC’s new tutoring venture builds on a recently completed five-year, $5 million NSF-funded Center for Interdisciplinary Research on Constructive Learning Environments, led by VanLehn. The center developed several prototypes of natural language tutoring systems both at LRDC and at Carnegie Mellon University. The center also developed tools for building more such tutors.
Capitalizing on LRDC’s ability to attract and link researchers from a wide variety of disciplines, the computer tutor study includes researchers specializing in the cognitive psychology of human tutoring, the technology of natural language processing, and the design of effective tutoring systems.
The Co-PIs are Diane J. Litman, a Pitt computer science professor and LRDC research scientist; Michelene Chi, a Pitt psychology professor and LRDC senior scientist; Pamela W. Jordan, an LRDC research associate; and Carolyn P. Rose, a research scientist at Carnegie Mellon.
The group’s grant is administered under NSF’s Information Technology Research program, which supports innovative multidisciplinary research that extends the frontiers of information technology, leads to new and unanticipated technologies, creates revolutionary applications, or provides alternative approaches to complete important activities.
###
6/3/04/tmw
Anyone interested in obtaining any of these corpora, please leave a comment.
Introducing: The LDC Institute
Membership Year 2004 in Review
LDC2004S04 2002 NIST Speaker Recognition Evaluation (SRE)
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2004S05 ISL Meeting Corpus Speech Part 1
LDC2004T10 ISL Meeting Corpus Transcripts Part 1
In this month's update, the Linguistic Data Consortium (LDC) would like to introduce the LDC Institute, review Membership Year 2004, and announce the availability of four new corpora.
*
(1) For the past two years, the LDC has hosted the LDC Institute, a seminar series on issues in language data and database creation. The goals of the series are to create a forum to communicate experience in data collection, standards, and annotation, and to work with researchers and others who may be interested in LDC data or who may wish to contribute new data to the archives. Past presentation topics have ranged from information extraction from biomedical texts to the Pennsylvania Sumerian Dictionary project to interfaces for parser and dictionary access.
We would like to invite the LDC community to learn more about this seminar series by visiting the LDC Institute
*
(2) Each year the LDC strives to provide a rich and diverse array of corpora for LDC members and nonmembers. Membership Year 2004 is shaping up to be no different! In the last few months, we have released 9 publications including treebanks in Arabic and Chinese, English meeting data, and Czech broadcast news. Namely, these corpora are:
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2004T05 Chinese Treebank Version 4.0
LDC2004S01 Czech Broadcast News Speech
LDC2004T01 Czech Broadcast News Transcripts
LDC2004S02 ICSI Meeting Speech
LDC2004T04 ICSI Meeting Transcripts
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004T03 Morphologically Annotated Korean Text
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
For further information on each of the above, please visit:
http://www.ldc.upenn.edu/Catalog/ByYear.jsp#2004
*
(3) The 2002 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition.
The 2002 NIST Speaker Recognition Evaluation main data was extracted from the Switchboard Cellular Part 2. The extended data task used two phases of Switchboard II, phases 2 and 3. This evaluation also included the first multi-modal task, using data from the FBI voice database. There are a total of 9153 speech files in SPHERE format, for a total of ~156 hours. 2002 NIST Speaker Recognition Evaluation is distributed on 2 DVDs.
For further information, including a link to the 2002 NIST Speaker Recognition Evaluation website, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$1000.
*
(4) Arabic Treebank: Part 3 v 1.0 is the third part of a corpus of 1,000,000 words of Arabic Treebank, designed to support language research and development of language technology for Modern Standard Arabic. This corpus includes 600 stories from the An Nahar News Agency. There are a total of 340,281 words (counting non-Arabic tokens such as numbers and punctuation) in the 600 files - one story per file. New features of annotation include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.
The corpus contains 293,035 Arabic-only word tokens (prior to the separation of clitics), of which 290,842 (99.25%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 2,193 (0.75%) were items that the morphological parser failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is distributed on 1 CD.
For further information, including online documentation, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$3000.
*
(5) ISL Meeting Speech Part 1 is the first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes and averages 34 minutes. Word-level orthographic transcriptions are available as ISL Meeting Transcripts Part 1.
ISL Meeting Speech Part 1 includes 105 speech files, for a total of approximately 10 hours of meeting speech. There are a total of 31 unique speakers in the corpus. Meetings involved anywhere from 3 to 9 participants, averaging 5. The corpus contains a significant proportion of non-native English speakers, varying in fluency. ISL Meeting Speech Part 1 is distributed on 2 DVDs.
For further information, including a link to the ISL Meeting Room project page, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$1500.
*
(6) The ISL Meeting Transcripts Part 1 is the corresponding transcription for ISL Meeting Speech Part 1.
Transcriptions were prepared by means of the TransEdit transcription application. This application was developed for the transcription of multi-channel recordings and displays a synchronized multi-track view of all channels of a meeting, with listening and segmentation functions for each channel separately. ISL Meeting Transcripts Part 1 is distributed by ftp transfer.
For further information, including a sample transcript, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10
Institutions that have membership in the LDC for the 2004 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this data for US$500.
*
If you need additional information or would like to inquire about membership in the LDC, please send email to ldc@ldc.upenn.edu.
----------------------------------------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
University of Pennsylvania       Fax: 1 (215) 573-2175
3600 Market St., Suite 810       email: ldc@ldc.upenn.edu
Philadelphia, PA 19104-2653      www: http://www.ldc.upenn.edu
Postdoctoral Research Associate Position in Spoken Dialogue / Intelligent Tutoring Systems
The Natural Language Processing (NLP) group at the University of
Pittsburgh has several GSR positions to fill, beginning Summer or Fall
2004.
Interested graduate students in Computer Science or Intelligent
Systems are invited to peruse our webpages (nlp.cs.pitt.edu), and to
apply directly to one or more of the following NLP faculty members,
each of whom is hiring:
Professor Rebecca Hwa (hwa@cs.pitt.edu), for the project
"Semi-supervised Learning for Multilingual Processing"
(www.cs.pitt.edu/~hwa/semi.htm)
Professor Diane Litman (litman@cs.pitt.edu), for the projects
"Monitoring Student State in Tutorial Spoken Dialogue" and
"Adding Spoken Language to a Text-Based Dialogue Tutor"
(www.cs.pitt.edu/~litman/itspoke.html)
Professor Janyce Wiebe (wiebe@cs.pitt.edu), for the projects
"Improving Subjectivity Analysis to Achieve High-Precision Information
Extraction" and "Opinions in Automatic Question Answering"
(www.cs.pitt.edu/~wiebe/projects.html)
To apply, send a statement of interest and your vita. Please send a separate
application to each faculty member whose project(s) you are interested in.
For full consideration, applications should be received no later than
March 1, 2004.
I received announcements for two new LDC corpora (info below). If you would like the lab to get either one (or both), please post a comment to this message.