Computational learning of construction grammars

JONATHAN DUNN

doi:10.1017/LANGCOG.2016.7

2017, Language & Cognition

Abstract

This paper presents an algorithm for learning the construction grammar of a language from a large corpus. This grammar induction algorithm has two goals: first, to show that construction grammars are learnable without highly specified innate structure; second, to develop a model of which units do or do not constitute constructions in a given dataset. The basic task of construction grammar induction is to identify the minimum set of constructions that represents the language in question with maximum descriptive adequacy. These constructions must (1) generalize across an unspecified number of units while (2) containing mixed levels of representation internally (e.g., both item-specific and schematized representations) and (3) allowing for unfilled and partially filled slots. Additionally, these constructions may (4) contain recursive structure within a given slot that needs to be reduced in order to produce a sufficiently schematic representation. In other words, these constructions are multi-length, multi-level, possibly discontinuous co-occurrences which generalize across internal recursive structures. These co-occurrences are modeled using frequency and the ΔP measure of association, expanded in novel ways to cover multi-unit sequences. This work provides important new evidence for the learnability of construction grammars as well as a tool for the automated corpus analysis of constructions.

A lower-case grammar is the representation of a specific language, while an upper-case Grammar is the ability to learn such a grammar from linguistic input alone with minimal innate structure. Thus, language-specific construction grammars (e.g., analyses in Fillmore, 1988, and Kay & Fillmore, 1999) can be seen as part of a more general Construction Grammar (e.g., Goldberg, 2006; Langacker, 2008). This differs from Chomsky’s various divisions of competence/performance and universal/specific grammar (1965, 1975), however, in that the Grammar does not consist of pre-defined structures/rules/constraints but rather of mechanisms for deriving or learning such structures/rules/constraints from observed language data. This data-driven view can be visualized as in Figure 1, where the Grammar is a link between language observations and generalized language representations (grammars).

Alternate methods for calculating multi-unit association strength include Wei & Li (2013), who start with da Silva & Lopes’ (1999) notion of pseudo-bigrams, in which all sequences longer than two units are reduced to all possible pairwise combinations (e.g., A|BCD, AB|CD, ABC|D for the sequence ABCD). This is similar to the Divided ΔP measures described above. Starting with these pseudo-bigrams, Wei & Li take the average pointwise mutual information score for each pseudo-bigram in the sequence, but refine the average by weighting each pseudo-bigram by its probability in the corpus. This gives more weight in the final measure to the most probable sub-sequences.
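
The pseudo-bigram reduction and a probability-weighted average of pointwise mutual information can be illustrated with a minimal sketch. This is not Wei & Li's implementation: the function names, the toy probabilities, and in particular the choice to weight each split by P(left) × P(right) are assumptions made for illustration, since the exact weighting is not specified in the description above.

```python
import math


def pseudo_bigrams(seq):
    """Return every split of seq into a (left, right) pair of contiguous parts."""
    units = tuple(seq)
    return [(units[:i], units[i:]) for i in range(1, len(units))]


def weighted_mean_pmi(seq, prob):
    """Probability-weighted mean PMI over the pseudo-bigrams of seq.

    prob maps each n-gram (a tuple of units) to its corpus probability.
    ASSUMPTION: each split is weighted by P(left) * P(right); the exact
    weighting used by Wei & Li (2013) is not given in the text above.
    """
    joint = prob[tuple(seq)]
    weighted_sum = 0.0
    weight_total = 0.0
    for left, right in pseudo_bigrams(seq):
        independent = prob[left] * prob[right]
        pmi = math.log2(joint / independent)  # PMI of this split
        weighted_sum += independent * pmi
        weight_total += independent
    return weighted_sum / weight_total


# Hypothetical probabilities, invented purely for illustration.
toy_prob = {
    ("a",): 0.2, ("c",): 0.5,
    ("a", "b"): 0.1, ("b", "c"): 0.25,
    ("a", "b", "c"): 0.1,
}
print(pseudo_bigrams("ABCD"))  # the three splits A|BCD, AB|CD, ABC|D
print(weighted_mean_pmi(("a", "b", "c"), toy_prob))
```

A sequence of n units always yields n − 1 pseudo-bigrams, so the cost of scoring a candidate grows only linearly with its length.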

In both directions the Summed and Mean measures are closely related; the scatter plot shows three distinct degrees of correlation with the correlation diminishing as the sequences in question grow longer (i.e., the sum and the mean are very similar for shorter sequences, which is expected). Thus, this relationship decreases as candidates grow longer. The two methods for comparing sub-sequences within a candidate, the Divided and Reduced measures, show little correlation between their respective Beginning and End variants in both directions (the highest such correlation being 0.230 for the right-to-left Divided measures). The relationship between the Divided and Reduced measures is quite high at the beginning of the sequences (i.e., at the Beginning going left-to-right and at the End going right-to-left), exceeding 0.800 in both cases. However, at the end of the sequences the correlation is much lower (never higher than 0.370). Thus, these variations on the sub-sequence measure do provide unique information in many but not all situations. For all of these measures, it seems to be the case that they grow less correlated as the

The next question is whether the measures make adequate distinctions between potential multi-unit constructions. We approach this question by looking at measures of the distribution of each of these features, in Table 9, calculated as above across only multi-unit potential candidates in the first 20 million sentences in the corpus. The measures show what we would expect: wide ranges of values with means close to zero. This is because most candidates do not show association. Those which do show internal association are outliers, in a sense, and this is what allows them to be identified as actual constructions. The two measures which do not show means close to zero are the summed values, in both directions. This is a result of the fact that only multi-unit candidates are considered here, so that all instances have at least three units. This, of course, influences the mean value but is necessary to allow this measure to be compared directly with the others.

Table 9. Distribution Measures for Each Feature

The ideal construction grammar has at least one construction to account for every linguistic expression in a corpus. In other words, because all linguistic expressions are hypothesized to be formed from an underlying grammatical construction, it should be the case that all attested linguistic expressions can be described by at least one construction in the predicted grammar. Thus, the degree of coverage of a grammar is an important criterion for evaluating a learned construction grammar and, following from this, for evaluating the learning algorithm itself. The measure of coverage is calculated as in (20), in which LE stands for Linguistic Expressions (operationalized in this case as sentences), with c standing for the sub-set covered by a hypothesized construction and n for the sub-set not covered in this way. Thus, this measure is simply the percentage of the test corpus represented by the learned grammar, using sentences as the unit of analysis.
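
The coverage measure can be illustrated with a short sketch. Equation (20) itself is not reproduced in this text, so the code follows the prose description: coverage is taken to be the number of covered sentences divided by the number of covered plus uncovered sentences. The matcher is a deliberately simplified, hypothetical stand-in (contiguous slots filled by either a word form or a part-of-speech tag); the constructions described in the paper may also be discontinuous and involve further levels of representation.

```python
def matches(construction, sentence):
    """Toy matcher (hypothetical): True if the construction's slots align
    with some contiguous span of the sentence. Each token is a
    (word, pos) pair and a slot matches either the word or the tag."""
    n = len(construction)
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        if all(slot in token for slot, token in zip(construction, window)):
            return True
    return False


def coverage(grammar, sentences):
    """Share of sentences described by at least one construction,
    i.e. covered / (covered + not covered)."""
    if not sentences:
        return 0.0
    covered = sum(
        1 for sentence in sentences
        if any(matches(c, sentence) for c in grammar)
    )
    return covered / len(sentences)


# Invented toy grammar and test sentences, for illustration only.
toy_grammar = [("DET", "NOUN"), ("kick", "the", "bucket")]
toy_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("run", "VERB")],
    [("they", "PRON"), ("kick", "VERB"), ("the", "DET"), ("bucket", "NOUN")],
    [("stop", "VERB"), ("now", "ADV")],
]
print(coverage(toy_grammar, toy_sentences))
```

Because a sentence counts as covered as soon as any one construction matches, adding constructions can never lower this score, which is why coverage has to be read alongside grammar size.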

Figure 4. Degree of coverage across test sets of 100k sentences

The coverage experiment shows that larger grammars (e.g., without pruning) have more coverage. However, this increased coverage is not proportional to the size of the grammar. The fully reduced grammar is only 2% of the size of the full grammar, and yet its coverage is only between 5% and 10% lower than that of the much larger grammar. Thus, while some important elements of the grammar have been discarded, the association measure model allows a much smaller grammar to find most of the optimum constructions. This is significant because the problem is to maintain high coverage on unseen test sets without simply positing a very large grammar: the small pruned grammar contains few false positives, even if it misses some true positives.

Table 10. Grammar Agreement Across Corpus Sizes

The results in Table 10 show that stability increases as more data is given to the algorithm. For example, the first sizable increase in agreement is between 10 and 20 million sentences. It is interesting that, even though the subsets have scaled frequency thresholds, the number of candidates decreases as the amount of data increases. This is because the model is more clearly able to separate the grammatical representations from noise as the dataset becomes larger. Given the cap on this experiment, the question of how much data is required for convergence is left open. A further question is whether frequency or association measures have more impact on the amount

An argument for innate structure, advanced by Lidz & Williams (2009), is that learners produce very similar grammars for a language even though subject to different observed input. This results, they argue, from innate constraints. Here we turn this into an empirical question: to what degree do instances of the same grammar induction algorithm (i.e., language learners) agree in their learned grammars when provided mutually exclusive sub-sets of the same size? In other words, how much agreement is there when the algorithm is run on different datasets? If the output grammars largely agree, this is evidence that such innate constraints are not, in fact, required to explain this stability in learned grammars. Figure 5 shows the agreement between the grammars produced on four distinct sub-sets of the corpus, each containing 10 million sentences. Agreement is calculated as the number of shared constructions given the total number of constructions, comparing all subsets to subset 1 for the sake of visualization.

The agreement ranges from the low- to mid-70s. This is quite strong, especially considering the measures of stability by size discussed above (i.e., it would likely be higher if the size of each subset were increased to 20 or 40 million sentences). This means that the algorithm, given entirely different datasets, produced grammars sharing over 70% of their constructions. While by no means perfect, this shows that the grammar induction algorithm is not burdened with a poverty-of-the-stimulus that requires innate structure to produce consistent output across learners. In other words, the hypothesis of innate structure is not required to explain relatively consistent grammars from different language learners.
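
The agreement calculation can be sketched as follows. The text describes agreement as the number of shared constructions given the total number of constructions without stating which total is meant; this sketch assumes the denominator is the size of the reference grammar (subset 1). The function name and the toy grammars are hypothetical and do not come from the paper.

```python
def agreement(reference, other):
    """Share of the reference grammar's constructions that also occur
    in the other grammar.

    ASSUMPTION: the denominator is the size of the reference grammar;
    the text does not specify which total is used."""
    reference, other = set(reference), set(other)
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Invented toy grammars; each construction is a tuple of slot labels.
subset_1 = {("DET", "NOUN"), ("VERB", "DET", "NOUN"), ("ADJ", "NOUN"), ("give", "NOUN", "NOUN")}
subset_2 = {("DET", "NOUN"), ("VERB", "DET", "NOUN"), ("ADJ", "NOUN"), ("PRON", "VERB")}
subset_3 = {("DET", "NOUN"), ("ADJ", "NOUN"), ("ADV", "ADJ")}

for name, grammar in [("subset 2", subset_2), ("subset 3", subset_3)]:
    print(name, agreement(subset_1, grammar))
```

Under this assumption the measure is asymmetric: swapping the reference and the comparison grammar can change the score when the two grammars differ in size. A symmetric alternative would divide by the size of the union of the two grammars.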

References

Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). "The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-crawled Corpora." Language Resources and Evaluation, 43: 209-226.
Blunsom, P. & Cohn, T. (2010). "Unsupervised induction of tree substitution grammars for dependency parsing." In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1,204-1,213.
Bod, R. (2006). "Exemplar-based syntax: How to get productivity from examples." The Linguistic Review, 22: 291-320.
Briscoe, T. (2000). "Grammatical Acquisition: Inductive bias and coevolution of language and the language acquisition device." Language, 76(2): 245-296.
Bryant, J. (2004). "Scalable construction-based parsing and semantic analysis." In Proceedings of the Workshop on Scalable Natural Language Understanding (HLT-NAACL): 33-40.
Bybee, J. (2006). "From usage to grammar: The mind's response to repetition." Language, 82(4): 711-733.
Bybee, J. (2010). Language, Usage, and Cognition. Cambridge, UK: Cambridge University Press.
Chang, N.; De Beule, J.; & Micelli, V. (2012). "Computational construction grammar: Comparing ECG and FCG." In Steels, L. (ed.), Computational Issues in Fluid Construction Grammar. Berlin: Springer. 259-288.
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1975). Logical Structure of Linguistic Theory. Philadelphia: Springer.
Clark, A. (2001). "Unsupervised induction of stochastic context-free grammars using distributional clustering." In Proceedings of the 5th Conference on Natural Language Learning.
da Silva, J. & Lopes, G. (1999). "A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora." In Proceedings of the 6th Meeting on the Mathematics of Language, 369-381.
Daudaravičius, V. & Marcinkevičienė, R. (2004). "Gravity counts for the boundaries of collocations." International Journal of Corpus Linguistics, 9(2): 321-348.
Davies, M. (2010). "The Corpus of Contemporary American English as the first reliable monitor corpus of English." Literary and Linguistic Computing, 25(4): 447-464.
Dennis, S. (2005). "An exemplar-based approach to unsupervised parsing." In Proceedings of the 27th Annual Conference of the Cognitive Science Society: 583-588.
Dunn, J. (2015). "Review of The Semantic Representation of Natural Language." Studies in Language, 39(2): 492-500.
Fillmore, C. (1988). "The Mechanisms of 'Construction Grammar.'" In Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society. 35-55.
Firth, J. (1957). Papers in Linguistics, 1934-1951. Oxford, Oxford University Press.
Forsberg, M.; Johansson, R.; Bäckström, L.; Borin, L.; Lyngfelt, B.; Olofsson, J.; & Prentice, J. (2014). "From construction candidates to constructicon entries: An experiment using semi-automatic methods for identifying constructions in corpora." Constructions and Frames, 6(1): 114-135.
Goldberg, A. (2006). Constructions at work: The nature of generalization in language. Oxford: Oxford University Press.
Goldberg, A. (2009). "The nature of generalization in language." Cognitive Linguistics, 20(1): 93-127.
Goldberg, A.; Casenhiser, D.; & Sethuraman, N. (2004). "Learning argument structure generalizations." Cognitive Linguistics, 15(3): 289-316.
Goldsmith, J. (2001). "Unsupervised learning of the morphology of a natural language." Computational Linguistics, 27(2): 153-198.
Goldsmith, J. (2006). "An algorithm for the unsupervised learning of morphology." Natural Language Engineering, 12(4): 353-371.
Gries, S. (2008). "Dispersions and adjusted frequencies in corpora." International Journal of Corpus Linguistics, 13(4): 403-437.
Gries, S. (2012). "Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics: Some necessary clarifications." Studies in Language, 11(3): 477-510.
Gries, S. (2013). "50-something years of work on collocations: What is or should be next." International Journal of Corpus Linguistics, 18(1): 137-165.
Gries, S. & Mukherjee, J. (2010). "Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes." International Journal of Corpus Linguistics, 15(4): 520-548.
Gries, S. & Stefanowitsch, A. (2004a). "Extending collostructional analysis: A corpus-based perspective on 'alternations'." International Journal of Corpus Linguistics, 9(1): 97-129.
Gries, S. & Stefanowitsch, A. (2004b). "Co-varying lexemes in the into-causative." In Achard, M. & Kemmer, S. (eds.), Language, culture, and mind. Stanford: CSLI. 225-236.
Headden, W.; Johnson, M.; & McClosky, D. (2009). "Improving unsupervised dependency parsing with richer contexts and smoothing." In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, 101-109.
Heinz, J.; de la Higuera, C.; van Zaanen, M. (2016). Grammatical inference for computational linguistics. San Rafael, CA: Morgan & Claypool Publishers.
Hilpert, M. (2008). "New evidence against the modularity of grammar: Constructions, collocations, and speech perception." Cognitive Linguistics, 19(3): 483-503.
Hopper, P. (1987). "Emergent grammar." Proceedings of the 13th Annual Meeting of the Berkeley Linguistics Society, 139-157.
Istvan, N. & Vincze, V. (2014). "VPCTagger: Detecting Verb-Particle constructions with syntax-based methods." In Proceedings of the 10th Workshop on Multiword Expressions, 17-25.
Jelinek, F. (1990). "Self-organizing language modeling for speech recognition." In A. Waibel & K. Lee (eds.), Readings in Speech Recognition. San Mateo, CA: Morgan Kaufmann. 450-506.
Katzir, R. (2014). "A cognitively plausible model for grammar induction." Journal of Language Modelling, 2(2): 213-248.
Kay, P. & Fillmore, C. (1999). "Grammatical constructions and linguistic generalizations: The What's X Doing Y? construction." Language, 75(1): 1-33.
Klein, D. & Manning, C. (2002). "A generative constituent-context model for improved grammar induction." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics: 128-135.
Langacker, R. (1987). Foundations of Cognitive Grammar. Stanford: Stanford University Press.
Langacker, R. (2006). "On the continuous debate about discreteness." Cognitive Linguistics, 17(1): 107-151.
Langacker, R. (2008). Cognitive Grammar: A basic introduction. Oxford: Oxford University Press.
Levison, M.; Lessard, G.; Thomas, C.; Donald, M. (2013). The Semantic Representation of Natural Language. New York: Bloomsbury Publishing.
Lidz, J. & Williams, A. (2009). "Constructions on holiday." Cognitive Linguistics, 20(1): 177-189.
Mareček, D. & Straka, M. (2013). "Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing." In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 281-290.
Nirenburg, S. & Raskin, V. (2004). Ontological Semantics. Cambridge, MA: MIT Press.
Nivre, J.; Hall, J; Nilsson, J.; Chanev, A.; Eryigit, G.; Kubler, S.; Marinov, S.; & Marsi, E. (2007). "MaltParser: A language-independent system for data-driven dependency parsing." Natural Language Engineering, 13(2): 95-135.
O'Donnell, M. & Ellis, N. (2010). "Towards an inventory of English verb argument constructions." In Proceedings of the Workshop on Extracting and Using Constructions in Computational Linguistics (NAACL-HLT): 9-16.
Piao, S.; Bianchi, F.; Dayrell, C.; D'Egidio, A.; & Rayson, P. (2015). "Development of the multilingual semantic annotation system." In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1268-1274.
Schmid, H. (1994). "Probabilistic part-of-speech tagging using decision trees." In Proceedings of the International Conference on New Methods in Language Processing.
Solan, Z.; Horn, D.; Ruppin, E.; Edelman, S. (2005). "Unsupervised learning of natural languages." Proceedings of the National Academy of Sciences, 102(33): 11,629-11,634.
Spitkovsky, V.; Alshawi, H.; & Jurafsky, D. (2013). "Breaking out of local optima with count transforms and model recombination: A study in grammar induction." In Proceedings of 2013 Conference on Empirical Methods in Natural Language Processing, 1983-1995.
Steels, L. (2004). "Constructivist development of grounded construction grammar." In Proceedings of the 42nd Meeting of the Association for Computational Linguistics: 9-16.
Steels, L. (2012). "Design methods for fluid construction grammar." In Steels, L. (ed.), Computational Issues in Fluid Construction Grammar. Berlin: Springer. 3-36.
Stefanowitsch, A. & Gries, S. (2003). "Collostructions: Investigating the interaction between words and constructions." International Journal of Corpus Linguistics, 8(2): 209-243.
Stefanowitsch, A. & Gries, S. (2005). "Covarying lexemes." Corpus Linguistics and Linguistic Theory, 1(1): 1-43.
Tomasello, M. (2003). Constructing a language. Cambridge, MA: Harvard University Press.
Tsao, N. & Wible, D. (2013). "Word similarity using constructions as contextual features." In Proceedings of the Joint Symposium on Semantic Processing: Textual Inference and Structures in Corpora. 51-59.
van de Cruys, T. (2011). "Two multivariate generalizations of pointwise mutual information." In Proceedings of the Workshop on Distributional Semantics and Compositionality, 16-20.
van Zaanen, M. (2000). "ABL: Alignment-based learning." In Proceedings of the 18th International Conference on Computational Linguistics, 961-967.
Vincze, V.; Zsibrita, J.; & Istvan, N. (2013). "Dependency parsing for identifying Hungarian light-verb constructions." In Proceedings of the International Joint Conference on Natural Language Processing, 207-215.
Wei, N. & Li, J. (2013). "A new computing method for extracting contiguous phraseological sequences from academic text corpora." International Journal of Corpus Linguistics, 18(4): 506-535.
Wible, D. & Tsao, N. (2010). "StringNet as a computational resource for discovering and investigating linguistic constructions." In Proceedings of the Workshop on Extracting and Using Constructions in Computational Linguistics (NAACL-HLT): 25-31.
Zadrozny, W.; Szummer, M.; Jarecki, S.; Johnson, D.; & Morgenstern, L. (1994). "NL understanding with a grammar of constructions." In Proceedings of the International Conference on Computational Linguistics: 1,289-1,293.
Zuidema, W. (2006). "What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions." In Proceedings of the 10th Conference on Computational Natural Language Learning: 29-36.
