The base directory for the resources listed below:
   /afs/cs.pitt.edu/projects/nlp/
Access is restricted to within Pitt's CS dept, the LRDC, and ISP. See the READMEs for more details. Some software packages have not been tested, and some scripts may need modifications (pathnames, etc.).

Corpora

  • ICSI Meeting Speech and ICSI Meeting Transcripts
    • The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute in Berkeley during the years 2000-2002.
    • The transcripts are located in the icsi_mr_transcr/ subdirectory. The speech data has not been uploaded yet.
  • Arabic Treebank: Part 2 v 2.0
    • This publication is the second part of a 1,000,000-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic. (We don't have Part 1.)
    • Data not uploaded yet.
  • Penn Treebank v3
    • Parsed English sentences from 1989 WSJ stories, Brown (a balanced corpus), ATIS (air travel information), and Switchboard (phone conversation transcripts).
  • CBC and Remedia Reading Comprehension Stories, Questions, and Answers
    • Two reading comprehension corpora used for question answering.
    • See Diane

Software

  • Constituent tree to Dependency tree Converter
    • A program by Rebecca Hwa that turns PTB-style constituent trees into dependency trees.
    • Located in the engc2dep/ subdirectory.
  • EVALB
    • A widely used program for scoring parse trees. Written by Satoshi Sekine.
    • Located in the evalb/ subdirectory.
  • GIZA++ v2
    • A toolkit for training IBM-style statistical machine translation (word alignment) models.
    • Located in the GIZA++_v2/ subdirectory (use run_giza.pl).
  • Dependency Grammatical Relationship Labeler
    • A set of scripts by Adam Lopez that label the grammatical relationship between a head word and its modifier.
    • Located in the label_deprel/ subdirectory.
  • Maximum Entropy Tool Kit
    • A maximum entropy learning package in Java by Jason Baldridge, from the OpenNLP Project (see sourceforge.net).
    • In the maxent/ subdirectory.
  • Maximum Entropy Part-of-speech Tagger and Sentence Boundary Detector
    • Adwait Ratnaparkhi's maximum entropy tools for sentence segmentation (mxterminator) and part-of-speech tagging (mxpost). Both are written in Java.
    • In the maxpost/ subdirectory.
  • Collins Parser for English
    • The Collins parser, trained on sections 02-21 of the WSJ portion of the Penn Treebank.
    • In the ptb_parser/ subdirectory (use the run_mcparser.pl script).
  • Treebank Tool
    • A program by Rebecca Hwa for processing treebanks. It outputs the treebank in various formats for training different parsers.
    • In the proctreebank/ subdirectory.
  • English Tokenizer
    • Tokenizes English in the style of the Penn Treebank (can't --> ca n't, etc.). It combines a rule-based Perl script by various people with a Java program trained on the PTB by Tom Morton.
    • In the tokenization/ subdirectory (run uber-engtok.sh).
  • Festival
    • An open-source speech synthesizer from CMU/Edinburgh.
    • in the itspoke directories
  • Sphinx2
    • An open-source speech recognizer from CMU.
    • in the itspoke directories
  • ESPS/waves
    • A suite of command-line and GUI programs for analyzing audio data.
    • in the itspoke directories
  • Wavesurfer
    • A standalone executable that allows you to record and play audio files and do some analysis of the data (though it is not as full-featured as either ESPS/waves or Praat).
    • in the itspoke directories
  • Weka
    • A machine learning toolkit (data mining software in Java).
    • in the itspoke directories
  • Ripper
    • A rule learner.
    • in the itspoke directories
  • NLTK
    • The Natural Language Toolkit is a suite of Python libraries and programs for symbolic and statistical natural language processing.
    • in the javalab directories
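
Example: Penn Treebank style tokenization

The English Tokenizer entry above mentions the Penn Treebank convention of splitting contractions (can't --> ca n't). The following is a minimal sketch of that convention only; it is not the code behind uber-engtok.sh, and the function name ptb_style_tokenize is hypothetical. It handles a few common clitics and simple punctuation, and it makes no attempt at abbreviations such as "U.S." or at quotation marks.

```python
import re


def ptb_style_tokenize(text):
    """Rough illustration of PTB-style splitting (hypothetical helper)."""
    # Split the negative clitic: "can't" -> "ca n't", "won't" -> "wo n't".
    text = re.sub(r"(?i)\b(\w+)(n't)\b", r"\1 \2", text)
    # Split other common clitics: "she's" -> "she 's", "we'll" -> "we 'll".
    text = re.sub(r"(?i)(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    # Separate simple clause-internal punctuation from the preceding word.
    text = re.sub(r"([,;:!?])", r" \1 ", text)
    # Separate a sentence-final period only (abbreviations are left alone).
    text = re.sub(r"\.\s*$", " .", text)
    return text.split()


if __name__ == "__main__":
    print(ptb_style_tokenize("I can't go."))
    print(ptb_style_tokenize("She's sure they won't."))
```

A trained tokenizer, such as the Java component listed above, would be expected to cover many cases that these few rules miss.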