Unsupervised Learning of Morphology: Survey, Model, Algorithm and Experiments
Licentiate thesis, 2007
This thesis contains work on a specific problem in the field of Language
Technology. The problem can be described as follows:
"Can a computer extract a description of word conjugation in a natural
language using only written text in the language?"
The problem is often referred to as Unsupervised Learning of
Morphology (ULM) and has a wide variety of Language Technology
applications, including Machine Translation, Document Categorization
and Information Retrieval. The ULM problem is also relevant for
linguistic theory, and can serve to boost empirical investigations in
subfields such as Quantitative Linguistics and Linguistic Typology.
The first part of the thesis contains a comprehensive survey of work
done on the ULM problem. All the minor and major lines of work are
mentioned with a reference and a very brief characterization.
Different approaches that have been prevalent in the field as a whole
are highlighted and critically discussed. The general picture
resulting from the survey is that much work has been repeated over and
over, with little exchange and evolution of techniques.
The second part of the thesis describes a simple model of
concatenative affixation, i.e., how stems and affixes are strung
together to form words. The model says that words consist of
high-frequency strings (``affixes'') attached to low-frequency strings
(``stems''), e.g., as in the English play-ing. It is then
shown that, from a set of words constructed according to the model, the
affixes can be extracted with their correct segmentation. The
algorithm for extraction is impressionistically evaluated on a diverse
set of natural languages.
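
The frequency intuition behind the model can be illustrated with a small Python sketch. It is not the extraction algorithm of the thesis: it only counts how many distinct words end in each candidate string, and the function name `suffix_counts`, the `max_len` cutoff and the toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def suffix_counts(words, max_len=4):
    """Count how many distinct words end in each candidate suffix.

    Hypothetical helper: a naive frequency count, not the thesis's algorithm.
    """
    counts = Counter()
    for word in set(words):
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(word) - 1) + 1):
            counts[word[-k:]] += 1
    return counts


# Toy vocabulary (assumed data, not taken from the thesis).
words = [
    "playing", "walking", "talking",
    "played", "walked", "talked",
    "plays", "walks", "talks",
]
counts = suffix_counts(words)
```

In this toy vocabulary "g", "ng" and "ing" all receive the same count, so raw frequency alone cannot say where the affix begins. Choosing the correct segmentation is the part that the sketch leaves out.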
The affix extraction algorithm does not output a full-fledged description of
conjugational patterns -- it only produces a list of affixes. The third and
fourth parts of the thesis show how it can be used in further morphological
analysis.
In the third part, an algorithm is presented that decides
if two given words are conjugations of the same stem. The key part is
the development of a metric for quantifying which endings tend to attach
to the same set of stems. The algorithm has no parameters or human input
and works equally well for languages with widely different morphological
typology. It achieves almost perfect accuracy on word pairs selected from
running text.
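
The idea of measuring which endings attach to the same stems might be sketched as follows. The abstract does not specify the metric, so Jaccard overlap of stem sets is used here purely as a stand-in. The suffix list is also supplied by hand, whereas the thesis's algorithm takes no human input. The names `stems_by_suffix` and `jaccard` are hypothetical.

```python
def stems_by_suffix(words, suffixes):
    """Map each suffix to the set of stems it attaches to in the vocabulary."""
    table = {s: set() for s in suffixes}
    for word in set(words):
        for s in suffixes:
            # Skip empty suffixes and require a non-empty stem.
            if s and len(word) > len(s) and word.endswith(s):
                table[s].add(word[: -len(s)])
    return table


def jaccard(a, b):
    """Jaccard overlap of two sets (a stand-in for the thesis's metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy vocabulary (assumed data, not taken from the thesis).
words = [
    "playing", "walking", "talking",
    "played", "walked", "talked",
    "plays", "walks", "talks", "glass",
]
table = stems_by_suffix(words, ["ing", "ed", "s"])
ing_ed = jaccard(table["ing"], table["ed"])
ing_s = jaccard(table["ing"], table["s"])
```

Here "ing" and "ed" share all their stems, while "s" also picks up the spurious stem "glas" from "glass", which lowers its overlap with "ing".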
In the fourth part, the affix extraction model is exploited for the
written language identification problem, i.e., to decide which natural
language a given text is written in. Existing state-of-the-art
techniques to identify the language of a written text most often use a
3-gram frequency table as the basis for 'fingerprinting' a language. While
this approach performs very well in practice (roughly 99\% accuracy) if
the text to be classified is, say, 100 characters or longer, it
cannot reliably classify shorter input, nor can it
detect whether the input is a concatenation of text from several languages.
Therefore a more fine-grained model is presented which aims at reliable
classification of input as short as one word. In essence, the
language of an unseen word is guessed based on any salient affixes
that appear on it. Many practical applications do not need this fine
level of granularity, but Multilingual Information Retrieval is a
major target area where input is usually only one or a few words.
The algorithm is given a rigorous evaluation on a 32-language parallel
Bible corpus, showing competitive accuracy on short input as well as
multilingual input, and not only for a set of European languages with
similar morphological typology.
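
For comparison, the 3-gram fingerprinting baseline mentioned above can be sketched in Python. This is a generic illustration, not the affix-based method of the thesis. It assumes character trigrams and cosine similarity between frequency profiles, and the function names and the two tiny training samples are invented for the example.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Character-trigram frequency table for a text, padded with spaces."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram frequency tables."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    probe = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))


# Toy training samples (assumed data; real systems train on far more text).
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house",
    "swedish": "hunden och katten sover i huset och barnen leker i skogen",
}
profiles = {lang: trigram_profile(text) for lang, text in samples.items()}
guess_en = identify("the dog is in the house", profiles)
guess_sv = identify("katten sover i skogen", profiles)
```

A single short word yields only a handful of trigrams, so its profile overlaps little with any language profile. That is the short-input weakness described above.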