A naive theory of affixation and an algorithm for extraction
Paper in proceedings, 2006

We present a novel approach to the unsupervised detection of affixes, that is, the extraction of a set of salient prefixes and suffixes from an unlabeled corpus of a language. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes are frequent, i.e., they occur much more often than random segments of the same length, and that (2) words are essentially variable-length sequences of random characters, e.g., a character should not occur in far more words than chance predicts without a reason, such as being part of a very frequent affix. The affix extraction algorithm uses only information from fluctuations in frequencies, runs in linear time, and is free from thresholds and opaque iterations. We demonstrate the usefulness of the approach with case studies on typologically distant languages.
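Assumption (1) in the abstract can be illustrated with a small frequency comparison. The sketch below is not the algorithm from the paper: the function name `suffix_salience`, the `max_len` cutoff, and the choice of baseline (the mean count of all word-final segments of the same length) are hypothetical and chosen only for illustration. It scores each word-final segment by how much more often it occurs across word types than an average final segment of equal length.

```python
from collections import Counter


def suffix_salience(words, max_len=4):
    """Score word-final segments against same-length final segments.

    Illustrative only: the baseline is the mean count of all observed
    final segments of the same length, which is an assumption of this
    sketch and not the measure defined in the paper.
    """
    vocab = set(words)  # count word types, not tokens
    counts = Counter()
    for word in vocab:
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(word) - 1) + 1):
            counts[word[-k:]] += 1

    # Total and number of distinct segments per segment length.
    totals = Counter()
    distinct = Counter()
    for segment, count in counts.items():
        totals[len(segment)] += count
        distinct[len(segment)] += 1

    # Ratio of a segment's count to the mean count for its length.
    return {
        segment: count / (totals[len(segment)] / distinct[len(segment)])
        for segment, count in counts.items()
    }


if __name__ == "__main__":
    sample = ["walking", "talking", "jumping", "walked", "talked", "jumped"]
    scores = suffix_salience(sample)
    for segment in sorted(scores, key=scores.get, reverse=True)[:5]:
        print(segment, round(scores[segment], 2))
```

A score above 1 means the segment ends more word types than a typical final segment of that length. A prefix variant would slice `word[:k]` instead of `word[-k:]`.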

Author

Harald Hammarström

Chalmers, Computer Science and Engineering

HLT-NAACL 2006 - SIGPHON 2006: 8th Meeting of the ACL Special Interest Group on Computational Phonology, Proceedings of the Workshop

79-88

8th Meeting of the ACL Special Interest Group on Computational Phonology, SIGPHON 2006, collocated with the HLT-NAACL 2006
New York City, USA

Subject categories

Language technology (computational linguistics)

Comparative linguistics and general linguistics

Studies of individual languages

More information

Last updated

2021-12-09