Learning Domain-Specific Grammars from a Small Number of Examples
Paper in proceedings, 2021

In this chapter we investigate the problem of grammar learning from a perspective that diverges from previous approaches. Prevailing approaches usually attempt to infer a grammar directly from example corpora without any additional information, which either requires a large training set or yields poor accuracy. We instead view grammar learning as a problem of grammar restriction, or subgrammar extraction. We start from a large-scale grammar (called a resource grammar) and a small number of example sentences, and find a subgrammar that still covers all the examples. To accomplish this, we formulate the problem as a constraint satisfaction problem and use a constraint solver to find the optimal grammar. We conducted experiments with English, Finnish, German, Swedish, and Spanish, which show that 10–20 examples are often sufficient to learn an interesting grammar for a specific application. We also present two extensions to this basic method: we include negative examples and allow rules to be merged. The resulting grammars can cover specific linguistic phenomena more precisely. Our method, together with the extensions, can be used to provide a grammar learning system for specific applications. This system is easy to use, human-centric, and usable by non-syntacticians. Based on this grammar learning method, we can build applications for computer-assisted language learning and interlingual communication, which rely heavily on the knowledge of language and domain experts who often lack the competence to develop the required grammars themselves.
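The subgrammar extraction idea described above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that each example sentence has already been parsed with the large grammar, and that each parse tree is represented as the set of rules it uses. The rule names, the example data, and the objective "fewest rules" are all hypothetical, and an exhaustive search stands in for the constraint solver.

```python
from itertools import combinations


def smallest_covering_subgrammar(rules, examples):
    """Return a smallest rule set under which every example has a parse.

    rules:    iterable of rule names from the large grammar.
    examples: list with one entry per example sentence; each entry is a
              list of frozensets, one per parse tree, holding the rules
              that tree uses.
    An example is covered when at least one of its trees uses only
    selected rules. Brute force, so only suitable for toy inputs.
    """
    ordered = sorted(rules)
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            chosen = set(subset)
            if all(any(tree <= chosen for tree in trees) for trees in examples):
                return chosen
    return None  # no subset covers every example


# Hypothetical rule names and parses, for illustration only.
RULES = {"Pred", "Pron", "IntransVerb", "TransVerb", "DetNoun", "Adverb"}
EXAMPLES = [
    # Sentence 1 has a single parse.
    [frozenset({"Pred", "Pron", "IntransVerb"})],
    # Sentence 2 is ambiguous: two parses using different rules.
    [
        frozenset({"Pred", "Pron", "TransVerb", "DetNoun"}),
        frozenset({"Pred", "Pron", "IntransVerb", "Adverb"}),
    ],
]

result = smallest_covering_subgrammar(RULES, EXAMPLES)
print(sorted(result))
```

In this toy input the second parse of the ambiguous sentence shares three rules with the first sentence, so choosing it gives a four-rule subgrammar instead of five. A constraint solver would express the same "at least one tree per example" condition as constraints rather than enumerating subsets.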

Constraint satisfaction

Grammar learning

Grammar restriction

Domain-specific grammar

Authors

Herbert Lange

University of Gothenburg

Peter Ljunglöf

University of Gothenburg

Studies in Computational Intelligence

1860-949X (ISSN) 1860-9503 (eISSN)

Vol. 939 105-138
9783030637866 (ISBN)

Natural Language Processing in Artificial Intelligence, NLPinAI 2020 held within the 12th International Conference on Agents and Artificial Intelligence, ICAART 2020
Valletta, Malta

Subject Categories (SSIF 2011)

Other Computer and Information Science

Language Technology (Computational Linguistics)

Computer Science

DOI

10.1007/978-3-030-63787-3_4

More information

Latest update

1/12/2022