Keywords: extraction of notions and concepts, collocations, association measures, classification, log-likelihood function, kernel density estimation (KDE)


The paper presents an assessment of the quality of binary classification of word pairs (bigrams) based on various association measures, in which the bigrams are divided into the classes 'concepts and notions' and 'other bigrams'. It is shown that the usual approach, ranking objects by the value of the association measure and then applying threshold filtering (or selecting a fixed number of top elements of the sorted list), yields only the top of the ranking and does not solve the classification problem effectively.

The approach proposed by the authors applies threshold filtering not to the values of the association measure but to the probability that a bigram belongs to the class 'concepts and notions' given the value of the association measure. This probability is calculated from the values of the probability density functions (PDFs) that describe the distribution of the association measure, treated as a random variable, in each of the two classes. The empirical PDFs were constructed from a labeled training sample.
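The probability described above can be illustrated with Bayes' rule applied to two class-conditional densities. The sketch below is only an illustration under stated assumptions: the function names, the normal stand-in densities, and the use of a class prior (`prior_concept`) are hypothetical and are not taken from the paper.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution; used here only as a stand-in PDF."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_concept(x, pdf_concept, pdf_other, prior_concept=0.5):
    """Probability of the class 'concepts and notions' given measure value x.

    Bayes' rule over two class-conditional densities. The prior is an
    assumption of this sketch, not something stated in the source.
    """
    num = pdf_concept(x) * prior_concept
    den = num + pdf_other(x) * (1.0 - prior_concept)
    # If both densities vanish there is no evidence; return 0.0 by convention.
    return num / den if den > 0.0 else 0.0

# Hypothetical class-conditional densities of some association measure.
pdf_c = lambda x: normal_pdf(x, mean=8.0, std=2.0)
pdf_o = lambda x: normal_pdf(x, mean=3.0, std=2.0)

for value in (2.0, 5.5, 9.0):
    print(value, round(posterior_concept(value, pdf_c, pdf_o), 3))
```

A bigram would then be assigned to 'concepts and notions' when this probability, rather than the raw value of the measure, exceeds a threshold.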

The threshold value of the probability is determined by solving a one-dimensional optimization problem that maximizes the ratio of the number of objects identified as 'concepts and notions' to the number of objects classified as 'other bigrams'. The type of statistical distribution of most of the considered association measures is difficult to determine (the χ²-test rejects the null hypothesis for the main known distributions), so the PDFs were approximated by the Parzen-Rosenblatt window method. This significantly improved the quality of the classification (an increase in the F-measure of up to 58% for certain association measures).
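A minimal sketch of this pipeline follows, with several assumptions that are not from the paper: a Gaussian kernel, a bandwidth chosen by Silverman's rule of thumb, class priors estimated from the training sample, a simple grid search over thresholds, and the F-measure on the training sample as the objective (swapped in here for the paper's own optimization criterion). All names and the toy data are hypothetical.

```python
import math
import statistics

def parzen_pdf(sample, bandwidth=None):
    """Parzen-Rosenblatt density estimate with a Gaussian kernel."""
    n = len(sample)
    if bandwidth is None:
        # Silverman's rule of thumb; an assumption of this sketch.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )
    return pdf

def f_measure(predicted, actual):
    """Balanced F-measure for boolean predictions and boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0

def best_threshold(values, labels, steps=99):
    """Grid search for the probability threshold with the highest F-measure."""
    concept = [v for v, y in zip(values, labels) if y]
    other = [v for v, y in zip(values, labels) if not y]
    pdf_c, pdf_o = parzen_pdf(concept), parzen_pdf(other)
    prior = len(concept) / len(values)  # class prior from the sample

    def posterior(x):
        num = pdf_c(x) * prior
        den = num + pdf_o(x) * (1.0 - prior)
        return num / den if den > 0.0 else 0.0

    probs = [posterior(v) for v in values]
    candidates = [(k + 1) / (steps + 1) for k in range(steps)]
    score = lambda t: f_measure([p >= t for p in probs], labels)
    t_best = max(candidates, key=score)
    return t_best, score(t_best)

# Toy labeled sample: association-measure values and class membership.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0, 8.5, 9.0]
labels = [False] * 5 + [True] * 5
threshold, score = best_threshold(values, labels)
print(threshold, score)
```

On real data the two classes overlap, so the threshold would normally be selected on the training sample and the resulting F-measure checked on a separate test sample.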

Correlation analysis of the association measures revealed two clusters: measures focused on the strength of the connection within a collocation, and measures focused on the frequency of occurrence of the collocation. The log-likelihood function and Student's t-test take both factors into account approximately equally.
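To make the two clusters concrete, the sketch below computes several common association measures from bigram counts and compares their rankings with Spearman's rank correlation. It is an illustration only: pointwise mutual information (PMI) and raw frequency are chosen here as examples of a strength-oriented and a frequency-oriented measure, the counts are invented, and the rank correlation shown does not handle ties.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2): strength of the connection."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """Student's t-score for a bigram (observed minus expected frequency)."""
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)

def log_likelihood(f_xy, f_x, f_y, n):
    """Log-likelihood ratio (G^2) over the 2x2 contingency table."""
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, f_x, n - f_x, n - f_x]
    cols = [f_y, n - f_y, f_y, n - f_y]
    return 2.0 * sum(
        o * math.log(o * n / (r * c))
        for o, r, c in zip(observed, rows, cols)
        if o > 0  # empty cells contribute nothing
    )

def spearman(a, b):
    """Spearman rank correlation; assumes no tied values in a or b."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical counts (f_xy, f_x, f_y) in a corpus of N bigrams.
N = 100_000
counts = [(5, 6, 7), (12, 40, 30), (150, 3000, 2500),
          (400, 9000, 8000), (30, 200, 500)]

freqs = [f_xy for f_xy, _, _ in counts]
pmis = [pmi(*c, N) for c in counts]
ts = [t_score(*c, N) for c in counts]
lls = [log_likelihood(*c, N) for c in counts]

print("frequency vs PMI:", spearman(freqs, pmis))
print("frequency vs log-likelihood:", spearman(freqs, lls))
print("PMI vs log-likelihood:", spearman(pmis, lls))
```

In this toy sample frequency and PMI rank the bigrams in opposite order, which is the kind of contrast a correlation analysis over a real corpus would be used to detect.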

It was found that the log-likelihood function used as the association measure, together with the proposed threshold filtering algorithm, achieves a classification with an F-measure equal to one (on the training and test samples used).

