Multilingual Grammars and Universal Dependencies
Licentiate thesis, 2016
Abstract syntax trees are an alternative to the syntactic structures commonly found in NLP systems. This representation allows structures to be shared across languages, making it well suited to serve as a translation interlingua. Grammatical Framework (GF) is a grammar formalism that captures cross-linguistic generalizations through the use of abstract syntax. The Resource Grammar Library (GF-RGL) in GF implements multilingual grammars for over 30 languages.
Universal Dependencies (UD) is a parallel effort to use shared structures to analyse sentences in different languages. The set of part-of-speech tags and functions is shared across languages. The linguistic data available from this project is annotated data, i.e. sentences annotated with UD structures in over 40 languages.
The main contribution of this thesis is to bridge these two representations: despite the similar motivation behind these two efforts, the representations they use differ significantly. Hence, we propose a method to convert the abstract syntax trees in GF to the structures used in UD. We find that the correspondence between GF-RGL and UD is significant, and the differences between the two raise interesting questions about the level of abstraction. We also present practical applications of our method: (1) using the GF parser as a dependency parser and (2) bootstrapping UD treebanks from GF treebanks.
Another topic addressed in this thesis is the problem of out-of-vocabulary words that arises in symbolic systems. We address this problem in the context of part-of-speech tagging and statistical dependency parsing. We propose a simple method that uses a distributional thesaurus to replace unknown words, and show through empirical evaluation that our method improves both overall accuracy and accuracy on unknown words. Our method is generic and can be adapted to fit other NLP systems.
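As a rough illustration of the replacement idea, the following is a minimal sketch and not the implementation evaluated in the thesis. It assumes the thesaurus is available as a mapping from each word to a list of neighbours ordered from most to least similar, and that the tagger or parser exposes its known vocabulary as a set. The names `replace_unknown_words`, `VOCABULARY` and `THESAURUS`, and the toy data, are hypothetical.

```python
from typing import Dict, List, Set


def replace_unknown_words(
    tokens: List[str],
    vocabulary: Set[str],
    thesaurus: Dict[str, List[str]],
) -> List[str]:
    """Replace each out-of-vocabulary token with its nearest known neighbour.

    Assumption: thesaurus[word] lists neighbours from most to least similar.
    A token with no known neighbour is left unchanged.
    """
    replaced = []
    for token in tokens:
        if token in vocabulary:
            # Known words are passed through untouched.
            replaced.append(token)
            continue
        # Pick the first (most similar) neighbour that the system knows.
        substitute = next(
            (n for n in thesaurus.get(token, []) if n in vocabulary),
            token,
        )
        replaced.append(substitute)
    return replaced


# Hypothetical toy data, for illustration only.
VOCABULARY = {"the", "dog", "barked", "loudly"}
THESAURUS = {"puppy": ["doggie", "dog", "kitten"]}

if __name__ == "__main__":
    print(replace_unknown_words(["the", "puppy", "barked"], VOCABULARY, THESAURUS))
```

In this sketch the rewritten sentence would be passed to the tagger or parser in place of the original, and the predicted labels mapped back to the original tokens by position, since the replacement preserves sentence length.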