ASSESSMENT OF THE ACCURACY OF NOTION AND CONCEPT EXTRACTION BASED ON MEASURES OF ASSOCIATION
Keywords: extraction of notions and concepts, collocations, measures of association, classification, log-likelihood function, KDE method
The paper presents the results of assessing the quality of binary classification of word pairs (bigrams) based on various measures of association, in which the bigrams were divided into the classes 'concepts and notions' and 'other bigrams'. It is shown that the usual approach, which ranks objects by the value of the association measure and then applies threshold filtering (or selects a fixed number of top-ranked elements of the sorted list), yields only a certain top of the ranking and does not solve the classification problem effectively.
The approach proposed by the authors applies threshold filtering not to the values of the association measure, but to the probability that a bigram belongs to the class 'concepts and notions' given the value of the association measure. This probability is calculated from the values of the probability density functions (PDFs) that describe the distribution of the association measure, treated as a random variable, in each of the two classes. The empirical PDFs were constructed from a labeled training sample.
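One way to write the probability described above is through Bayes' rule. The notation here is an assumption introduced for illustration and does not come from the paper: $x$ is the value of the association measure, $f_C$ and $f_O$ are the class-conditional PDFs for 'concepts and notions' and 'other bigrams', $\pi_C$ and $\pi_O$ are the class priors (for instance, class shares in the training sample), and $p^{*}$ is the probability threshold.

```latex
P(C \mid x) = \frac{\pi_C \, f_C(x)}{\pi_C \, f_C(x) + \pi_O \, f_O(x)},
\qquad
\text{assign the bigram to } C \iff P(C \mid x) \ge p^{*}
```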
Determining the threshold value of the probability reduces to a one-dimensional optimization problem in which the ratio of the number of objects identified as 'concepts and notions' to the number of objects classified as 'other bigrams' is maximized. The type of statistical distribution is difficult to establish for most of the association measures considered (a goodness-of-fit test rejected the null hypothesis for the main known distributions), so the PDFs were approximated by the Parzen-Rosenblatt window method. This made it possible to increase the quality of classification significantly (an increase in the F-measure of up to 58% for certain association measures).
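The following is a minimal sketch of this pipeline under several assumptions that are not taken from the paper: a Gaussian kernel, Silverman's rule-of-thumb bandwidth, priors proportional to class sizes, and a grid search that maximizes the F-measure as a stand-in for the optimization objective described above. All function names and the toy scores are hypothetical.

```python
import math
import statistics


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (an assumption; the paper's choice is not stated)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def parzen_pdf(x, sample, h):
    """Parzen-Rosenblatt density estimate at x with a Gaussian kernel."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm


def posterior(x, pos, neg):
    """Estimated P('concepts and notions' | score x); priors follow class sizes."""
    weighted_pos = len(pos) * parzen_pdf(x, pos, silverman_bandwidth(pos))
    weighted_neg = len(neg) * parzen_pdf(x, neg, silverman_bandwidth(neg))
    total = weighted_pos + weighted_neg
    return weighted_pos / total if total > 0 else 0.0


def f_measure(threshold, pos, neg):
    """F-measure on the labeled sample when filtering by posterior probability."""
    tp = sum(posterior(x, pos, neg) >= threshold for x in pos)
    fp = sum(posterior(x, pos, neg) >= threshold for x in neg)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(pos)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pos, neg, steps=99):
    """One-dimensional grid search over probability thresholds in (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda t: f_measure(t, pos, neg))


# Toy association scores for the two classes (illustrative only).
pos_scores = [8.1, 9.4, 10.2, 11.0, 12.5, 9.9]
neg_scores = [0.5, 1.2, 2.0, 2.8, 3.1, 1.7, 4.0]
threshold = best_threshold(pos_scores, neg_scores)
```

Filtering on the posterior rather than on the raw score lets the decision boundary follow the shape of both class distributions, which is the point of the approach even when neither distribution has a known parametric form.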
A correlation analysis of the measures of association made it possible to distinguish two clusters: measures focused on the strength of connection within a collocation, and measures focused on the frequency of occurrence of the collocation. The log-likelihood function and Student's t-test take both of these factors into account approximately equally.
It was found that using the log-likelihood function as the measure of association, together with the proposed threshold filtering algorithm, makes it possible to achieve a classification with an F-measure equal to one (according to the data obtained for the training and test samples used).
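For reference, a common formulation of the log-likelihood association score for a bigram is the log-likelihood ratio statistic computed from a 2x2 contingency table. The sketch below assumes that formulation; the abstract does not give the exact formula used, and the function and parameter names are hypothetical.

```python
import math


def log_likelihood(c12, c1, c2, n):
    """Log-likelihood ratio (G^2) for a bigram (w1, w2).

    c12: count of the bigram, c1: count of w1 in first position,
    c2: count of w2 in second position, n: total number of bigrams.
    """
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with zero observed count contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# A pair that always occurs together scores higher than one that rarely does.
strong = log_likelihood(10, 10, 10, 1000)
weak = log_likelihood(1, 10, 10, 1000)
```

The score is zero when the observed bigram count equals the count expected under independence, and it grows with both the strength of the association and the amount of evidence.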