March 23, 2005

[LDC] new corpora

In this month's newsletter, the LDC would like to announce the availability of a new LDC Online service and the release of three new corpora.

------------------------------------------------------------------------

The LDC is pleased to announce that an improved LDC Online service is now available. LDC Online can be accessed at the following URL:

https://online.ldc.upenn.edu/login.html

Organizations that hold 2005 Membership in the LDC will be able to perform text searches on our entire English Gigaword corpus. This corpus is a comprehensive archive of newswire text data that the LDC has acquired over several years. Current members will also be able to access the American English Spoken Lexicon (AESL). AESL contains pronunciations in individual audio files for more than 50,000 of the most common words in English.

Even if your organization is not a current member, you can access LDC Online through a guest account. As a guest, an LDC Online user will be able to access the American English Spoken Lexicon.

We will offer periodic updates to LDC Online to include new corpora and search functions. Please check in with us often as we anticipate this will be an exciting offering.

------------------------------------------------------------------------

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form.

Sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic. ACE 2004 Multilingual Training Corpus is distributed on one CD-ROM.

2005 Subscription Members will automatically receive two copies of this corpus. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$3000.

*

Chinese News Translation Text Part 1 supports the development of automatic machine translation systems. The LDC was sponsored to solicit English translations for a single set of Chinese source materials.

The source Chinese text and its English translations were selected and translated in different LDC projects. A total of about 474K Chinese characters were selected from two sources, namely Xinhua and AFP, and translation services were provided by seven translation agencies. Each Chinese news story was translated once. Chinese News Translation Text Part 1 is distributed via ftp.

2005 Subscription Members will automatically receive two copies of this corpus on CD-ROM. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$1500.

*

Discourse Graphbank aims to define a descriptively adequate data structure for representing discourse coherence structures. This project also investigates the impact of discourse coherence structures on other linguistic processes and natural language applications (e.g. anaphor resolution, summarization, information retrieval), and aims to develop and test discourse parsing algorithms. The data consists of 135 texts from AP Newswire and Wall Street Journal, annotated with coherence relations. The source for the data is TIPSTER Complete (LDC93T3A). Discourse Graphbank is distributed via ftp.

2005 Subscription Members will automatically receive two copies of this corpus on CD-ROM. 2005 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for US$200.

------------------------------------------------------------------------

If you need further information, or would like to inquire about membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573 1275.


Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc@ldc.upenn.edu
Philadelphia, PA 19104            http://www.ldc.upenn.edu

Posted by hwa at 03:04 PM

March 18, 2005

[talk] Dan Gildea

Dr. Dan Gildea from the University of Rochester will be giving a talk on his recent work at noon on March 18th.

Syntactic Structure and Statistical Machine Translation

Given that statistical methods have revolutionized both
natural language parsing and machine translation, it may
seem surprising that most current statistically-based
translation systems make no use of syntactic structure.
I will describe work on models of translation that aim
to fill this gap, presenting results for models that
make use of syntactic information provided for one or
both languages, as well as models that infer structure
directly from parallel bilingual text. I will also
describe the use of syntactic information for the
automatic evaluation of machine-produced translations.


Please sign up for a slot to meet with Dan:

9:45 -- 10:00 Rebecca (SENSQ 5421)
10:00 -- 10:30 Behrang (SENSQ 5503)
10:30 -- 11:00 Paul, Swapna, and Jason (SENSQ 5422)
11:00 -- 11:15 Rebecca Part Deux (SENSQ 5421)
11:15 -- 11:45 Daqing and Hua (Cheng) (SENSQ 5111)
11:45 -- 12:00 Talk prep
12:00 -- 1:15 Talk (SENSQ 5317)
1:15 -- 2:45 Lunch (with Rebecca, Lillian, Oren, Bo, Diane?, Mihai)
2:45 -- 3:15 Diane (SENSQ 5105)
3:15 -- 3:45 Amruta and Hua (Ai) (SENSQ 5108)
3:45 -- 4:15 Theresa (SENSQ 5422)
4:15 -- 4:45 Mihai and Beatriz (SENSQ 5420)

Dinner at 6pm (with Jan, Joel, Rebecca)

Posted by hwa at 10:30 AM

March 13, 2005

[NEWS] Hot Article

Hirschberg, Litman and Swerts 2004 is currently the hottest article in Speech Communication.

Posted by nlplab at 09:22 PM