Metric binary trees, and nested cluster hierarchy building

Authors

  • Andrey V. Orekhov St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0000-0001-7641-956X
  • Igor V. Vasiliev St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0009-0000-3679-8809

DOI:

https://doi.org/10.21638/spbu10.2024.405

Abstract

Machine learning methods use tree data structures to organize and store information. Each of these structures has particular advantages and can improve the quality of a specific algorithm. A tree whose nodes each have no more than two descendants is called binary; its main advantage is the high efficiency of the search and sorting algorithms implemented on it. In this regard, it is important to note that the dendrograms of hierarchical agglomerative clustering methods are also binary trees, reflecting the taxonomy of the elements of a data set. Any cluster that is not a singleton can be divided into subclusters, which makes it possible to form a hierarchical structure in a metric space (a metric tree) with additional properties. For example, the height of the tree can be set automatically by defining the number of levels on which its nodes are located to coincide with the number of ways of partitioning the sample set into clusters, subclusters, subclusters of subclusters, and so on. This problem can be solved using approximation-estimation tests, whose sensitivity can be varied through a trend coefficient to obtain different clustering variants. In the computational experiments, a synthetic set of points on the Euclidean plane was used, and the results of centroid-based clustering were studied. The Markov moments at which the clustering process stops were determined using the approximation-estimation test. The results obtained in the numerical modeling were verified by varying the step size of the trend coefficient.
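
The stopping rule described in the abstract can be illustrated with a small sketch. The Python fragment below (not taken from the article) performs centroid-based agglomerative clustering of a synthetic two-dimensional sample and stops when the next merge distance deviates sharply from a least-squares linear fit of the recent merge distances. The deviation statistic, the window length, and the threshold expressed through the `trend_coefficient` parameter are illustrative assumptions standing in for the paper's approximation-estimation test.

```python
# A minimal sketch, not the authors' exact algorithm: centroid-based agglomerative
# clustering with a simplified least-squares stopping check on merge distances.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample: three Gaussian blobs on the Euclidean plane.
points = np.vstack([
    rng.normal(loc=c, scale=0.4, size=(40, 2))
    for c in [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
])

def centroid_clustering(points, trend_coefficient=3.0):
    """Merge the two nearest centroids until a simplified stopping test fires."""
    clusters = [[i] for i in range(len(points))]   # clusters as lists of point indices
    centroids = [p for p in points]                # one centroid per cluster
    merge_dists = []                               # minimal inter-centroid distance per step
    while len(clusters) > 1:
        # Closest pair of centroids (O(k^2) scan, adequate for a sketch).
        best_i, best_j, best_d = -1, -1, np.inf
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = float(np.linalg.norm(centroids[i] - centroids[j]))
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # Simplified stopping check (assumption, not the paper's exact statistic):
        # stop when the next merge distance exceeds the least-squares linear
        # forecast of recent distances by trend_coefficient * residual RMS.
        if len(merge_dists) >= 6:
            window = merge_dists[-20:]             # recent history of merge distances
            x = np.arange(len(window))
            slope, intercept = np.polyfit(x, window, 1)
            resid = np.asarray(window) - (slope * x + intercept)
            rms = max(float(np.sqrt(np.mean(resid ** 2))), 1e-12)
            forecast = slope * len(window) + intercept
            if best_d - forecast > trend_coefficient * rms:
                break                              # stopping ("Markov") moment
        # Merge: new cluster and its centroid (mean of all member points).
        merged = clusters[best_i] + clusters[best_j]
        new_centroid = points[merged].mean(axis=0)
        for idx in sorted((best_i, best_j), reverse=True):
            del clusters[idx]
            del centroids[idx]
        clusters.append(merged)
        centroids.append(new_centroid)
        merge_dists.append(best_d)
    return clusters

clusters = centroid_clustering(points)
print("clusters found:", len(clusters))            # typically about 3 for this well-separated sample
```

Lowering the trend coefficient makes the test more sensitive and stops the merging earlier, producing a finer partition; raising it yields coarser clusters. This mirrors how varying the coefficient in the paper produces the different clustering options that form the levels of the nested hierarchy.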

Keywords:

metric tree, agglomerative clustering, Markov moment, least squares method.


References

Mwangi D. 5 common data structures and algorithms used in machine learning. Available at: https://dzone.com/articles/5-common-data-structures-and-algorithms-used-in-ma (accessed: September 19, 2024).

Cormen T. H., Leiserson C. E., Rivest R. L., Stein C. Introduction to algorithms. 4th ed. Cambridge, Massachusetts, US, MIT Press and McGraw-Hill Publ., 2022. 1312 p.

Quinlan J. R. Induction of decision trees. Machine Learning, 1986, vol. 1, pp. 81–106. https://doi.org/10.1007/BF00116251

Uhlmann J. K. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 1991, vol. 40, iss. 4, pp. 175–179. https://doi.org/10.1016/0020-0190(91)90074-R

Samet H. Foundations of multidimensional and metric data structures (The Morgan Kaufmann series in data management systems). San Francisco, US, Morgan Kaufmann Publ. Inc., 2006, 1024 p.

Bozkaya T., Ozsoyoglu Z. M. Indexing large metric spaces for similarity search queries. ACM Trans. Database Systems, 1999, vol. 24, no. 3, pp. 361–404. https://doi.org/10.1145/328939.328959

Brin S. Near neighbor search in large metric spaces. 21st International Conference on Very Large Data Bases (VLDB 1995), September 11–15, 1995. Zurich, Switzerland, 1995, pp. 574–584.

Jambue M. Iyerarkhicheskiy klaster-analiz i sootvetstviya [Hierarchical cluster analysis and correspondences]. Moscow, Finance and Statistics Publ., 1988, 344 p. (In Russian)

Caliński T., Harabasz J. A dendrite method for cluster analysis. Communications in Statistics, 1974, no. 3, pp. 1–27.

Orekhov A. V. Markovskii moment ostanovki aglomerativnogo protsessa klasterizatsii v evklidovom prostranstve [Markov moment for the agglomerative method of clustering in Euclidean space]. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 2019, vol. 15, iss. 1, pp. 76–92. https://doi.org/10.21638/11701/spbu10.2019.106 (In Russian)

Orekhov A. V. Agglomerative method for texts clustering. Lecture Notes in Computer Science, 2019, vol. 11551 LNCS, pp. 19–32. https://doi.org/10.1007/978-3-030-17705-8_2

Everitt B. S. Cluster analysis. Chichester, West Sussex, John Wiley & Sons Ltd. Press, 2011, 330 p.

Duda R. O., Hart P. E., Stork D. G. Pattern classification. New York, Chichester, Wiley Press, 2001, 654 p.

Lance G. N., Williams W. T. A general theory of classificatory sorting strategies. 1. Hierarchical systems. The Computer Journal, 1967, vol. 9, pp. 373–380.

Milligan G. W. Ultrametric hierarchical clustering algorithms. Psychometrika, 1979, vol. 44, pp. 343–346.

Thorndike R. L. Who belongs in the family? Psychometrika, 1953, vol. 18, pp. 267–276. https://doi.org/10.1007/BF02289263

Oldenderfer M. S., Blashfield R. K. Klasterniy analiz [Cluster analysis]. Faktorniy, diskriminantniy i klasterniy analiz [Factor, discriminant and cluster analysis]. Moscow, Finance and Statistics Publ., 1989, pp. 139–209. (In Russian)

Orekhov A. V. Quasi-deterministic processes with monotonic trajectories and unsupervised machine learning. Mathematics, 2021, vol. 9, art. no. 2301. https://doi.org/10.3390/math9182301

Levin B. R. Teoreticheskiye osnovy statisticheskoy radiotekhniki [Theoretical foundations of statistical radio engineering]. Moscow, Radio and Communication Publ., 1989, 656 p. (In Russian)

Bodrunova S. S., Orekhov A. V., Blekanov I. S., Lyudkevich N. S., Tarasov N. A. Topic detection based on sentence embeddings and agglomerative clustering with Markov moment. Future Internet, 2020, vol. 12, art. no. 144. https://doi.org/10.3390/fi12090144

Mazalov V. V. Matematicheskaya teoriya igr i prilozheniya [Mathematical game theory and applications]. St. Petersburg, Lan’ Publ., 2017, 448 p. (In Russian)

Bulinsky A. V., Shiryaev A. N. Teoriya sluchaynykh protsessov [Theory of random processes]. Moscow, Fizmatlit Publ., 2003, 400 p. (In Russian)

Shiryaev A. N. Statisticheskiy posledovatel'nyy analiz. 2-ye izd. [Statistical sequential analysis. 2nd ed.]. Moscow, Nauka Publ., 1976, 272 p. (In Russian)

Granichin O. N., Shalymov D. S., Avros R., Volkovich Z. Randomizirovannyy algoritm nakhozhdeniya kolichestva klasterov [Randomized algorithm for finding the number of clusters]. Automation and Telemechanics, 2011, no. 4, pp. 86–98. (In Russian)

Hierarchical clustering. Available at: https://neerc.ifmo.ru/wiki/index.php?title=Иерархическая_кластеризация (accessed: September 19, 2024). (In Russian)

Rousseeuw P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Comput. Appl. Math., 1987, vol. 20, pp. 53–65. https://doi.org/10.1016/0377-0427(87)90125-7

Kaufman L., Rousseeuw P. J. Finding groups in data: An introduction to cluster analysis. New York, John Wiley & Sons Inc. Publ., 1990, 342 p. https://doi.org/10.1002/9780470316801

Fisher R. A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, vol. 7, pp. 179–188. http://dx.doi.org/10.1111/j.1469-1809.1936.tb02137.x

Snell-Hornby M. Translation studies: An integrated approach. Amsterdam, John Benjamins Publ., 1988, 172 p.

Published

2024-12-30

How to Cite

Orekhov, A. V., & Vasiliev, I. V. (2024). Metric binary trees, and nested cluster hierarchy building. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 20(4), 487–499. https://doi.org/10.21638/spbu10.2024.405

Issue

Vol. 20 No. 4 (2024)

Section

Computer Science