By Rebecca Hwa and Carol Nichols
ABSTRACT:
Parsing is an important component in many NLP systems. While
recent advances in statistical methods and machine learning have made
it possible to build highly accurate parsers, the success depends on
the quantity and quality of annotated training data, which may not
always be available. Arabic is an interesting case because it is
diglossic: the language exists in two forms, a "prestigious" variety
for formal writing (Modern Standard Arabic, MSA) and colloquial
varieties that are primarily spoken and are not standardized (the
Arabic dialects). There is much ongoing NLP work in building resources for
MSA, but resources and NLP research for Arabic dialects are still in
their infancy. Because there are no parallel written corpora between
any of the dialects and any other language, including MSA, most of the
techniques developed for parsing that exploit supervised machine
learning do not apply.
In this talk, we describe our framework for leveraging existing
resources and tools for MSA in order to parse Arabic dialects. In
particular, we focus on building a bilexicon between MSA and the
Levantine dialect and building a Levantine part-of-speech tagger by
adapting an MSA tagger. We will also present some preliminary
findings in building a Levantine parser from these resources.
This work was conducted as a part of the Parsing Arabic Dialect team
at the 2005 JHU Summer Workshop on Language Engineering.