Phenomenological Description of Internet Documents Collecting and Processing
https://doi.org/10.17586/0021-3454-2023-66-12-1002-1010
Abstract
The state of the Internet as a repository of information resources is analyzed from the point of view of a bot, that is, a program that collects data for monitoring resources, populating a search engine, or other commercial or research purposes. The problem is described through a set of phenomena that arise when documents are collected on the Internet; these phenomena must be taken into account when developing monitoring systems or search engines. A number of features that arise during web scraping, harvesting, and other uses of bots to collect data on the Internet are presented. The problems posed by subdomains, recursive subdomains, dynamically loaded content, search engine optimization of text content, and others are described. It is shown that collecting data from Internet resources is not only a technological task but, to a greater extent, a knowledge-intensive one, and that, because research in this area is still active, no “out-of-the-box” solution exists. The article will be useful to researchers in the field of Internet development, search engine developers, specialists in data retrieval and Internet technologies, and specialists in the creation and support of Internet resources and in Internet marketing.
About the Authors
S. V. Kuleshov
Russian Federation
Sergey V. Kuleshov — Dr. Sci., Professor RAS; St. Petersburg Federal Research Center of the RAS,
St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher
A. A. Zaytseva
Russian Federation
Alexandra A. Zaytseva — PhD; St. Petersburg Federal Research Center of the RAS
For citations:
Kuleshov S.V., Zaytseva A.A. Phenomenological Description of Internet Documents Collecting and Processing. Journal of Instrument Engineering. 2023;66(12):1002-1010. (In Russ.) https://doi.org/10.17586/0021-3454-2023-66-12-1002-1010