Architectural and technological solutions to systems for collecting and managing unstructured data flows
https://doi.org/10.17586/0021-3454-2025-68-11-919-926
Abstract
The paper considers the problem of processing large volumes of unstructured data obtained from open web sources when storage resources are limited and the share of spam content is growing. The aim of the research is to develop architectural and technological solutions for the effective management of unstructured data flows, including keeping the core of domain-specific documents up to date. Implementation options are proposed for two technologies: proactive data storage and deferred web scanning. Preferential storage makes it possible to manage data in systems with a fixed amount of memory using document importance criteria: creation time, relevance to the subject area, and level of duplication. Deferred analysis is designed to enrich data by supplementing and refining information from open sources without creating a peak load on external resources. A solution is proposed to the problem of maintaining an up-to-date core of documents reflecting the state of the subject area. An architecture for a proactive storage and deferred web scraping system is proposed that allows efficient data management under exponential content growth. The results can be used to improve methods for processing aggregated and synthetic content obtained from open sources.
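The preferential storage idea described in the abstract (a fixed amount of memory, with documents ranked by creation time, relevance to the subject area, and level of duplication) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names Document, importance, and PreferentialStore, the exponential freshness decay with a 30-day half-life, the weights, and the choice to measure capacity as a document count rather than bytes.

```python
import time
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    created_at: float   # Unix timestamp of document creation
    relevance: float    # 0..1, match to the subject area (assumed to be computed elsewhere)
    duplication: float  # 0..1, share of content duplicated elsewhere (assumed)


def importance(doc, now, half_life=30 * 86400.0, weights=(0.4, 0.4, 0.2)):
    """Hypothetical importance score: a weighted sum of the three criteria."""
    age = max(0.0, now - doc.created_at)
    freshness = 0.5 ** (age / half_life)  # halves every `half_life` seconds
    w_fresh, w_rel, w_uniq = weights
    return (w_fresh * freshness
            + w_rel * doc.relevance
            + w_uniq * (1.0 - doc.duplication))


class PreferentialStore:
    """Fixed-capacity store that evicts the least important document."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.docs = {}

    def add(self, doc, now=None):
        """Insert a document; return the evicted doc_id, or None if nothing was evicted."""
        now = time.time() if now is None else now
        self.docs[doc.doc_id] = doc
        if len(self.docs) <= self.capacity:
            return None
        # Scores depend on the current time, so they are recomputed at eviction.
        victim = min(self.docs.values(), key=lambda d: importance(d, now))
        del self.docs[victim.doc_id]
        return victim.doc_id


if __name__ == "__main__":
    now = 1_000_000_000.0
    store = PreferentialStore(capacity=2)
    store.add(Document("a", now, 0.9, 0.0), now)
    store.add(Document("b", now - 60 * 86400, 0.2, 0.9), now)
    evicted = store.add(Document("c", now, 0.5, 0.5), now)
    print(evicted, sorted(store.docs))
```

In this sketch the old, weakly relevant, heavily duplicated document is the one displaced when the store is full. The linear scan at eviction is chosen for clarity; a real system would likely need an index or periodic rescoring.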
About the Authors
S. V. Kuleshov
Russian Federation
Sergey V. Kuleshov — Dr. Sci., Professor of the RAS; St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Scientific Research Automation; Chief Researcher
St. Petersburg
A. A. Zaytseva
Russian Federation
Alexandra A. Zaytseva — PhD, Associate Professor; St. Petersburg Institute for Informatics and Automation of the RAS, Laboratory of Scientific Research Automation; Senior Researcher
St. Petersburg
For citations:
Kuleshov S.V., Zaytseva A.A. Architectural and technological solutions to systems for collecting and managing unstructured data flows. Journal of Instrument Engineering. 2025;68(11):919-926. (In Russ.) https://doi.org/10.17586/0021-3454-2025-68-11-919-926