Extending the applicability of the Zipf’s laws to the sequences of byte data

Authors

  • Sergey L. Sergeev St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation
  • Ivan S. Blekanov St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0000-0002-7305-1429
  • Fedor V. Ezhov St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation
  • Nikita A. Tarasov St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation https://orcid.org/0000-0002-9473-6179

DOI:

https://doi.org/10.21638/spbu10.2024.307

Abstract

Zipf’s law has been shown to hold in many domains. From its origin as a statistical phenomenon observed in natural language to its later adaptations in economic, social, and many other fields, it has proved to apply almost universally. In all of these cases, authors discuss the applicability of Zipf’s law in terms of semantically complex structures. We take this notion a step further and show how the law can be applied to data analysis, in particular to sequences of byte data obtained from various sources. We show that, using a basic chunking methodology, Zipf’s law holds for many different types of raw byte sequences. In particular, the law holds in all cases for the “middle point” of the data, where it is present with a degree of certainty of more than 90 %. We conclude by discussing the implications and potential use cases of these findings.
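The kind of analysis the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation. The non-overlapping fixed-size chunking, the 2-byte chunk size, the 10 %–90 % definition of the middle rank range, and the use of R² as a goodness-of-fit measure are all assumptions made here for illustration, as are the function names chunk_frequencies and middle_range_fit. The input is synthetic data built to follow a 1/rank frequency profile, not data from the paper.

```python
from collections import Counter
import math


def chunk_frequencies(data: bytes, chunk_size: int = 2) -> list:
    """Count non-overlapping fixed-size chunks; return counts in descending order."""
    last_start = len(data) - chunk_size
    chunks = (data[i:i + chunk_size] for i in range(0, last_start + 1, chunk_size))
    return sorted(Counter(chunks).values(), reverse=True)


def middle_range_fit(freqs, lo_frac=0.1, hi_frac=0.9):
    """Least-squares line through (log rank, log frequency) over the middle ranks.

    Returns (slope, r_squared). A slope near -1 with a high r_squared is
    consistent with Zipf's law on that range. The rank window is an
    assumption of this sketch, not a definition taken from the paper.
    """
    n = len(freqs)
    lo = int(n * lo_frac)
    hi = min(n, max(int(n * hi_frac), lo + 2))
    if hi - lo < 2:
        raise ValueError("need at least two ranks to fit a line")
    xs = [math.log(rank) for rank in range(lo + 1, hi + 1)]
    ys = [math.log(freqs[rank - 1]) for rank in range(lo + 1, hi + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # A flat line (syy == 0) is fitted exactly, so report a perfect fit.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared


# Synthetic byte sequence: the chunk of rank r occurs round(1000 / r) times.
data = b"".join(bytes([0, r]) * round(1000 / r) for r in range(1, 51))
freqs = chunk_frequencies(data)
slope, r2 = middle_range_fit(freqs)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

On real byte sequences the result would be expected to depend on the chunk size and on how the middle range is chosen, so both parameters are worth varying before drawing conclusions.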

Keywords:

Zipf’s laws, byte data, chunking, frequency analysis


References


Zipf G. K. The psycho-biology of language: An introduction to dynamic philology. London, Routledge Publ., 1999, 356 p.

Zipf G. K. Human behavior and the principle of least effort. Cambridge, Mass., 1965, 573 p.

Mandelbrot B. An informational theory of the statistical structure of language. Communication Theory, 1953, vol. 84, pp. 486–502.

Mandelbrot B. The fractal geometry of nature. New York, W. H. Freeman & Co. Publ., 1982, 468 p.

Lu G., Jin Y., Du D. H. C. Frequency based chunking for data de-duplication. 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE, 2010, pp. 287–296.

Baayen R. H. Word frequency distributions. Dordrecht, Springer Science & Business Media, 2001, 335 p.

Piantadosi S. T. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 2014, vol. 21, no. 5, pp. 1112–1130.

Yu S., Xu C., Liu H. Zipf's law in 50 languages: its structural pattern, linguistic interpretation, and cognitive motivation. arXiv preprint, arXiv: 1807.01855, 2018.

Arshad S., Hu S., Ashraf B. N. Zipf’s law and city size distribution: A survey of the literature and future research agenda. Physica A: Statistical Mechanics and its Applications, 2018, vol. 492, pp. 75–92.

Gao L., Zhou G., Luo J., Huang Y. Word embedding with Zipf’s context. IEEE Access, 2019, vol. 7, pp. 168934–168943.

Baumann A., Kaźmierski K., Matzinger T. Scaling laws for phonotactic complexity in spoken English language data. Language and Speech, 2021, vol. 64, no. 3, pp. 693–704.

Perotti J. I., Billoni O. V. On the emergence of Zipf’s law in music. Physica A: Statistical Mechanics and its Applications, 2020, vol. 549, art. no. 124309.

Kershenbaum A., Demartsev V., Gammon D. E., Geffen E., Gustison M. L., Ilany A., Lameira A. R. Shannon entropy as a robust estimator of Zipf's law in animal vocal communication repertoires. Methods in Ecology and Evolution, 2021, vol. 12, no. 3, pp. 553–564.

Crosier M., Griffin L. D. Zipf's law in image coding schemes. BMVC 2007 — Proceedings of the British Machine Vision Conference, 2007, pp. 1–10.

Kornai A. Zipf's law outside the middle range. Sixth Meeting on Mathematics of Language, 1999, pp. 347–356.

Corral Á., Boleda G., Ferrer-i-Cancho R. Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PLoS ONE, 2015, vol. 10, no. 7, pp. 1–23.

Matsumoto M., Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 1998, vol. 8, no. 1, pp. 3–30.

O’Neill M. E. PCG: A family of simple fast space-efficient statistically good algorithms for random number generation. Available at: https://www.pcg-random.org/ (accessed: May 1, 2024).

Upgrading PCG64 with PCG64DXSM. NumPy v1.24 Manual. Available at: https://numpy.org/doc/stable/reference/random/upgrading-pcg64.html (accessed: May 1, 2024).

Salmon J. K., Moraes M. A., Dror R. O., Shaw D. E. Parallel random numbers: as easy as 1, 2, 3. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 1–12.

SFC64. Small Fast Chaotic PRNG. Available at: https://numpy.org/doc/stable/reference/random/bit_generators/sfc64.html (accessed: May 1, 2024).

Bakulina M. P. Application of the Zipf law to text compression. Journal of Applied and Industrial Mathematics, 2008, vol. 2, no. 4, pp. 477–483.

Mahmood M. A., Hasan K. A. Efficient compression scheme for large natural text using Zipf distribution. 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), 2019, pp. 1–6.

Published

2024-10-31

How to Cite

Sergeev, S. L., Blekanov, I. S., Ezhov, F. V., & Tarasov, N. A. (2024). Extending the applicability of the Zipf’s laws to the sequences of byte data. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 20(3), 391–403. https://doi.org/10.21638/spbu10.2024.307

Section

Computer Science