Formation of the core of documents in Internet monitoring systems under resource constraints
https://doi.org/10.17586/0021-3454-2022-65-11-826-832
Abstract
This paper considers the development of open-type Internet monitoring systems that draw on an unlimited number of sources while the capacity of the data storage system is limited. The aim is to form a set of documents of the minimum required size (the core of documents) that satisfies the requirements of representativeness and topic variability when monitoring the Internet. To formalize and solve the problem, a set-theoretic model of the document core is developed. The proposed approach is distinguished by a preemptive algorithm that keeps only relevant documents in the database within the available storage volume. Results of an experiment on real data confirm the applicability of the developed model. The approach can be used in a number of practical tasks, in particular for searching the Internet for information (documents, pages) when the a priori information needed for a keyword search is unavailable.
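The abstract does not give the details of the preemptive algorithm, so the following is only an illustrative sketch of the general idea: a store with a fixed capacity in which a new document may displace less relevant ones. The names `Document` and `DocumentCore`, the numeric `relevance` score, and the rule "evict the least relevant document first" are all assumptions made for this example and are not taken from the paper.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Document:
    relevance: float                    # hypothetical score, higher = more relevant
    doc_id: str = field(compare=False)
    size: int = field(compare=False)    # storage units the document occupies


class DocumentCore:
    """Capacity-bounded store; assumed policy: displace the least relevant first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self._heap = []                 # min-heap ordered by relevance

    def add(self, doc):
        """Try to admit doc; return True if it ends up in the core."""
        if doc.size > self.capacity:
            return False
        displaced = []
        while self.used + doc.size > self.capacity:
            if self._heap[0].relevance >= doc.relevance:
                # The newcomer is not more relevant than the next candidate
                # for displacement: undo the tentative evictions and reject it.
                for old in displaced:
                    heapq.heappush(self._heap, old)
                    self.used += old.size
                return False
            victim = heapq.heappop(self._heap)
            self.used -= victim.size
            displaced.append(victim)
        heapq.heappush(self._heap, doc)
        self.used += doc.size
        return True

    def ids(self):
        """Identifiers of the documents currently held, in sorted order."""
        return sorted(d.doc_id for d in self._heap)
```

In this sketch a document is admitted only if every document it would displace is strictly less relevant; otherwise the store is left unchanged. A real system would also need the representativeness and topic-variability criteria named in the abstract, which this example does not model.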
About the Authors
S. V. Kuleshov
Russian Federation
Sergey V. Kuleshov — Dr. Sci., Professor; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Chief Researcher
St. Petersburg
A. A. Zaytseva
Russian Federation
Alexandra A. Zaytseva — PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher
St. Petersburg
A. Yu. Aksenov
Russian Federation
Alexey Yu. Aksenov — PhD; St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher
St. Petersburg
For citations:
Kuleshov S.V., Zaytseva A.A., Aksenov A.Yu. Formation of the core of documents in Internet monitoring systems under resource constraints. Journal of Instrument Engineering. 2022;65(11):826-832. (In Russ.) https://doi.org/10.17586/0021-3454-2022-65-11-826-832