Finding the presence of borrowings in scientific works based on Markov chains

Authors

  • Rustam R. Saakyan Vanadzor State University named after H. Tumanyan, 36, ul. Tigran Mets, Vanadzor, 2001, Republic of Armenia https://orcid.org/0009-0001-4088-6411
  • Irina A. Shpekht Academy of Marketing and Social Information Technologies - IMSIT, 5, Zipovskaya ul., Krasnodar, 350010, Russian Federation https://orcid.org/0009-0001-4088-6411
  • Gevorg A. Petrosyan Vanadzor State University named after H. Tumanyan, 36, ul. Tigran Mets, Vanadzor, 2001, Republic of Armenia https://orcid.org/0000-0003-1286-5223

DOI:

https://doi.org/10.21638/11701/spbu10.2023.104

Abstract

The study aims to develop optimal approaches to searching for borrowings in scientific works. The article discusses the stages of this search — preprocessing, rough filtering of texts, searching for similar texts, and searching for borrowings — focusing on approaches and techniques that can be effectively implemented at each stage. For preprocessing, these include converting text characters from uppercase to lowercase, removing punctuation marks, and removing stop words. For rough filtering, they are filters by topic and by word frequency. For finding similar texts, they are calculating the importance of words in the context of a text and representing words as vectors in a multidimensional space to obtain a proximity measure. Finally, for finding borrowings, they are searching for exact matches and paraphrases and measuring the similarity of expressions. The scientific novelty lies in the authors' proposal to use Markov chains to measure text similarity at the second and third stages of the search for borrowings. A worked example demonstrates the technique of using Markov chains for text representation: finding the most frequently occurring words and building a graph of a Markov chain of words, together with the prospects of using Markov chains of texts for rough filtering and for finding similar texts.
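As a rough illustration of the pipeline the abstract describes — a hypothetical sketch, not the authors' implementation — the preprocessing stage and the construction of a word-level Markov chain (a graph of transition probabilities between consecutive words) might look like this; the stop-word list and the Jaccard-style similarity over transition edges are illustrative assumptions:

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}  # illustrative subset

def preprocess(text):
    """Preprocessing stage: lowercase, strip punctuation, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def markov_chain(words):
    """Count transitions between consecutive words, normalised to probabilities."""
    counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    return {w: {nxt: c / sum(nbrs.values()) for nxt, c in nbrs.items()}
            for w, nbrs in counts.items()}

def chain_similarity(c1, c2):
    """Jaccard overlap of directed transition edges — one crude proximity measure
    that could serve the rough-filtering stage (an assumption, not the paper's formula)."""
    e1 = {(a, b) for a in c1 for b in c1[a]}
    e2 = {(a, b) for a in c2 for b in c2[a]}
    return len(e1 & e2) / len(e1 | e2) if e1 | e2 else 0.0

words = preprocess("The cat sat on the mat. The cat ate the fish.")
chain = markov_chain(words)
print(Counter(words).most_common(2))  # → [('cat', 2), ('sat', 1)] — most frequent words
print(chain["cat"])                   # → {'sat': 0.5, 'ate': 0.5} — transitions from "cat"
```

Two texts whose chains share many transition edges would pass the rough filter and proceed to the finer-grained borrowing search.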

Keywords:

search for borrowings, algorithms for finding borrowings, Markov chains, originality checker software


References

Borrowings in scientific publications and recommendations for citations. Moscow, Plekhanov Russian University of Economics Press, 2022. Available at: https://www.rea.ru/ru/org/managements/orgnirupr/Pages/Заимствования.aspx (accessed: September 1, 2022).

Agrawal R. Must known techniques for text preprocessing in NLP. Analytics Vidhya, 2022. Available at: https://www.analyticsvidhya.com/blog/2021/06/must-known-techniques-for-text-preprocessing-in-nlp/ (accessed: September 1, 2022).

Camacho-Collados J., Pilehvar M. T. On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis. arXiv preprint, 2018. Available at: https://arxiv.org/pdf/1707.01780.pdf (accessed: September 1, 2022).

Minaee Sh., Kalchbrenner N., Cambria E., Nikzad N., Chenaghlu M., Gao J. Deep learning based text classification: a comprehensive review. arXiv preprint, 2020. Available at: https://arxiv.org/pdf/2004.03705.pdf (accessed: September 1, 2022).

Mikolov T., Chen K., Corrado G., Dean J. Efficient estimation of word representations in vector space. arXiv preprint, 2013. Available at: https://arxiv.org/pdf/1301.3781.pdf (accessed: September 1, 2022).

Le Q. V., Mikolov T. Distributed representations of sentences and documents. arXiv preprint, 2014. Available at: https://arxiv.org/pdf/1405.4053.pdf (accessed: September 1, 2022).

Yang Zh., Jin Sh., Huang Y., Zhang Y., Li H. Automatically generate steganographic text based on Markov model and Huffman coding. arXiv preprint, 2018. Available at: https://arxiv.org/ftp/arxiv/papers/1811/1811.04720.pdf (accessed: September 1, 2022).

Thelin R. Build a deep learning text generator project with Markov chains. Educative, 2022. Available at: https://www.educative.io/blog/deep-learning-text-generation-markov-chains (accessed: September 1, 2022).

Papadopoulos A., Roy P., Pachet F. Avoiding plagiarism in Markov sequence generation. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, 2014, pp. 2731–2737. Available at: https://www.francoispachet.fr/wp-content/uploads/2021/01/papadopoulos-14a.pdf (accessed: September 1, 2022).

Published

2023-04-27

How to Cite

Saakyan, R. R., Shpekht, I. A., & Petrosyan, G. A. (2023). Finding the presence of borrowings in scientific works based on Markov chains. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 19(1), 43–50. https://doi.org/10.21638/11701/spbu10.2023.104

Issue

Section

Applied Mathematics