Detecting textual trends in the OK social network

Евгений Алексеевич Малютин; Дмитрий Юрьевич Бугайченко; Алексей Николаевич Мишенин

doi:10.21638/11701/spbu10.2017.308

Authors

Евгений Алексеевич Малютин, St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation
Дмитрий Юрьевич Бугайченко, St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation
Алексей Николаевич Мишенин, St. Petersburg State University, 7–9, Universitetskaya nab., St. Petersburg, 199034, Russian Federation

Abstract

Social networks are no longer merely an entertainment medium; they have become an information distribution channel that is replacing classical mass media. In this article we describe a scalable trend detection system implemented at the social network OK. The actors of a social network (users and communities) collectively shape a broad agenda, and the content they produce has several specific properties:
• UGC (user-generated content) is difficult to process;
• actors write in many languages, so classical media analysis would require a large number of highly paid specialists;
• modern social networks form a highly connected community that reacts to events quickly, so the system must work in real time;
• spammers use social networks as a platform for promotion and intrusive advertising, so the system must be able to filter out spam.
Standard methods of media analysis are impractical for such content, which creates a natural demand for software that detects and analyzes textual trends. The academic literature offers two main approaches to trend detection: topic modeling (followed by analysis of how topics evolve) and distributional models based on frequency-like properties of individual terms. We reviewed papers that use both approaches, taking the specific features of social networks into account, and chose distributional models as the basis for the system. OK is one of the largest social networks in Russia and the CIS countries. Its actors generate over 100 million characters of text every day, so even basic processing is a serious technical problem that calls for Big Data approaches throughout development. We introduce a lambda architecture built on three main components:
• daily-batch processing component, based on Apache Spark;
• streaming processing component, based on Apache Samza;
• mini-batch processing component, based on Spark Streaming.
The article describes the architecture and technical features of each component in detail. In conclusion, we present the results of operating the system and discuss areas for further research and development. Refs 13. Figs 7. Table 1.
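
The abstract does not give the formulas behind the frequency-based approach, so the following is only a minimal sketch of the general idea, not the system described in the article. It keeps an exponentially weighted mean and variance of each term's count per time window, updated incrementally in the spirit of the Finch report listed in the references, and flags a term when its current count lies several deviations above its own baseline. The class name `TermTrendDetector` and the parameters `alpha`, `threshold`, `min_std`, and `warmup` are hypothetical, and their default values are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TermTrendDetector:
    """Illustrative sketch: flags terms whose count jumps above its moving baseline."""

    def __init__(self, alpha=0.3, threshold=3.0, min_std=0.5, warmup=3):
        self.alpha = alpha          # smoothing factor of the moving average (assumed)
        self.threshold = threshold  # z-score required to call a term trending (assumed)
        self.min_std = min_std      # floor on the deviation, damps noise from rare terms
        self.warmup = warmup        # initial windows used only to build the baseline
        self.windows = 0
        self.mean = defaultdict(float)
        self.var = defaultdict(float)

    def update(self, tokens):
        """Score one time window of tokens, then fold it into the baseline."""
        self.windows += 1
        counts = Counter(tokens)
        trending = {}
        # Visit terms seen now and terms seen earlier, so absent terms decay.
        for term in set(counts) | set(self.mean):
            x = counts.get(term, 0)
            diff = x - self.mean[term]
            std = max(math.sqrt(self.var[term]), self.min_std)
            z = diff / std
            if self.windows > self.warmup and z >= self.threshold:
                trending[term] = z
            # Incremental exponentially weighted mean and variance update.
            incr = self.alpha * diff
            self.mean[term] += incr
            self.var[term] = (1 - self.alpha) * (self.var[term] + diff * incr)
        return trending


if __name__ == "__main__":
    detector = TermTrendDetector()
    baseline = ["weather"] * 10 + ["news"] * 5
    for _ in range(5):
        detector.update(baseline)
    # A term that suddenly appears 20 times stands out against a zero baseline.
    print(detector.update(baseline + ["eclipse"] * 20))
```

A per-term baseline of this kind needs only constant memory per term and one pass over each window, which is one reason frequency-based models suit streaming workloads. A production system would additionally need tokenization, language detection, and spam filtering, none of which is shown here.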

Keywords:

natural language processing, trend detection, big data

References

Lau J. H., Collier N., Baldwin T. On-line trend analysis with topic models: Twitter trends detection topic model online // Proceedings of COLING: technical papers. Mumbai, 2012. P. 1519–1534.

Ahmed A., Xing E. P. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream // Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. 2010. Vol. 20. P. 29.

Schubert E., Weiler M., Kriegel H.-P. SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds // Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014. P. 871–880.

Cvijikj I. P., Michahelles F. Monitoring trends on Facebook // IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC). 2011. P. 895–902.

Finch T. Incremental calculation of weighted mean and variance: technical report. Cambridge, 2009. Vol. 4.

Ester M., Kriegel H.-P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise // KDD. 1996. Vol. 96, N 34. P. 226–231.

Open-source library for language detection. URL: https://github.com/optimaize/languagedetector (accessed: 26.02.2017).

Additional language profile for CIS-languages. URL: https://github.com/denniean/language_profiles (accessed: 26.02.2017).

Leskovec J., Rajaraman A., Ullman J. D. Mining of massive datasets, 2013. URL: http://infolab.stanford.edu/~ullman/mmds.html (accessed: 26.02.2017).

Open-source library for data analysis. URL: https://elki-project.github.io/ (accessed: 26.02.2017).

Scalable stream processing platform. URL: https://kafka.apache.org/ (accessed: 26.02.2017).

Apache Samza: distributed stream processing framework. URL: http://samza.apache.org/ (accessed: 26.02.2017).

Apache Zeppelin: web-dashboard for interactive data analysis. URL: https://zeppelin.apache.org/ (accessed: 26.02.2017).
