Journal of Instrument Engineering

Phenomenological Description of Internet Documents Collecting and Processing

https://doi.org/10.17586/0021-3454-2023-66-12-1002-1010

Abstract

The state of the Internet as a repository of information resources is analyzed from the point of view of a bot, that is, a program that collects data for monitoring resources, populating a search engine, or other commercial or research purposes. An approach is proposed that describes the problem under study through a set of phenomena that arise when collecting documents on the Internet. These phenomena must be taken into account when developing monitoring systems or search engines. A number of features that arise during web scraping, harvesting, and other uses of bots to collect data on the Internet are presented. The problems associated with subdomains, recursive subdomains, dynamically loaded content technologies, search engine optimization of text content, and others are described. It is shown that the task of collecting data from Internet resources is not only technological but, to a greater extent, knowledge-intensive, and, since research is still in an active phase, no “out-of-the-box” solution exists for it. The article will be useful to researchers in the field of Internet development, search engine developers, specialists in data retrieval and Internet technologies, as well as specialists in the creation and support of Internet resources and in Internet marketing.

About the Authors

S. V. Kuleshov
St. Petersburg Federal Research Center of the RAS
Russian Federation

Sergey V. Kuleshov — Dr. Sci., Professor RAS; St. Petersburg Federal Research Center of the RAS,
St. Petersburg Institute for Informatics and Automation of the RAS, Research Automation Laboratory; Senior Researcher



A. A. Zaytseva
St. Petersburg Federal Research Center of the RAS
Russian Federation

Alexandra A. Zaytseva — PhD; St. Petersburg Federal Research Center of the RAS



References

1. Berners-Lee T. Information Management: A Proposal, CERN, March 1989, May 1990.

2. RFC 1945, https://datatracker.ietf.org/doc/html/rfc1945.

3. Barnet B. Memory Machines: The Evolution of Hypertext, Anthem Press, 2013.

4. Olston C. and Najork M. Information Retrieval, 2010, no. 3(4), pp. 175–246.

5. Najork M., Heydon A. High-Performance Web Crawling in Handbook of Massive Data Sets. Massive Computing, Springer, 2002, vol. 4, https://doi.org/10.1007/978-1-4615-0005-6_2.

6. Laliwala Z., Shaikh A. Web Crawling and Data Mining with Apache Nutch, Packt Publishing, 2013.

7. Nasraoui O. ACM SIGKDD Explorations Newsletter, 2008, DOI: https://doi.org/10.1145/1540276.1540281.

8. Chakrabarti S. Mining the Web: Discovering knowledge from hypertext data, Elsevier, 2003.

9. Castillo C. ACM SIGIR Forum, 2005, DOI: https://doi.org/10.1145/1067268.1067287.

10. Boeing G., Waddell P. Journal of Planning Education and Research, 2017, no. 4(37), DOI:10.2139/ssrn.2781297.

11. Practical Web Scraping for Data Science, Apress, Berkeley, CA, https://doi.org/10.1007/978-1-4842-3582-9_6.

12. Bloch J. Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications, 2006, pp. 506–507.

13. Robillard M.P. et al. IEEE Transactions on Software Engineering, 2012, no. 5(39), pp. 613–637.

14. Ofoeda J., Boateng R., Effah J. International Journal of Enterprise Information Systems (IJEIS), 2019, no. 3(15), pp. 76–95.

15. Qi L. et al. IEEE transactions on big data, 2020, no. 3(8), pp. 685–698.

16. https://eais.rkn.gov.ru/. (in Russ.)

17. HTML::LinkExtor - Extract links from an HTML document, http://search.cpan.org/dist/HTML.-Parser/lib/HTML/LinkExtor.pm.

18. http://habrahabr.ru/post/185816/. (in Russ.)

19. http://seopult.ru/subscribe.html?id=76. (in Russ.)

20. http://habrahabr.ru/post/23456/. (in Russ.)

21. http://habrahabr.ru/post/130258/. (in Russ.)

22. http://socio.escience.ifmo.ru/content/files/file/network+centered.pdf. (in Russ.)

23. http://download.yandex.ru/company/techno/YandexTech_1.pdf. (in Russ.)

24. http://habrahabr.ru/post/123671/. (in Russ.)

25. HtmlUnit – JavaScript Tutorial, https://htmlunit.sourceforge.io/javascript-howto.html.

26. https://timeweb.com/ru/community/articles/poddomeny-chto-eto-takoe-i-zachem-oni-nuzhny. (in Russ.)

27. RFC1035: Domain Names – Implementation and Specification. Network Working Group, November 1987, http://www.faqs.org/rfcs/rfc1035.htm.

28. https://habr.com/ru/company/click/blog/478758/. (in Russ.)

29. A Standard for Robot Exclusion, http://www.robotstxt.org/orig.html.

30. Kuleshov S., Zaytseva A., Aksenov A. Natural Language Search and Associative-Ontology Matching Algorithms Based on Graph Representation of Texts in Intelligent Systems Applications in Software Engineering. Advances in Intelligent Systems and Computing, Springer, Cham, 2019, vol. 1046, DOI 10.1007/978-3-030-30329-7_26.

31. Mikhailov S.N., Kuleshov S.V. Izvestiya Yugo-Zapadnogo gosudarstvennogo universiteta (Proceedings of the Southwest State University), 2013, no. 6-2(51), pp. 40–43. (in Russ.)

32. Zaytseva A.A., Kuleshov S.V., Mikhailov S.N. SPIIRAS Proceedings, 2014, no. 37, pp. 144–155. (in Russ.)

33. Moskalenko A.A., Laponina O.R., Sukhomlin V.A. Modern Information Technology and IT-education, 2019, no. 2(15), pp. 413–420. (in Russ.)

34. Ignatiev A.G., Lindre Yu.A. Aktual'nyye trendy regulirovaniya Interneta: ot otkrytogo prostranstva bezgranichnoy svobody k regional'noy i stranovoy fragmentatsii (Current Trends in Internet Regulation: from an Open Space of Unlimited Freedom to Regional and Country Fragmentation), Moscow, 2023, 30 p., EDN EHZLLW. (in Russ.)

35. Kulikova A.V. Indeks bezopasnosti, 2015, no. 1(21), pp. 115–120, EDN XBFPKZ. (in Russ.)


For citations:


Kuleshov S.V., Zaytseva A. Phenomenological Description of Internet Documents Collecting and Processing. Journal of Instrument Engineering. 2023;66(12):1002-1010. (In Russ.) https://doi.org/10.17586/0021-3454-2023-66-12-1002-1010


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 0021-3454 (Print)
ISSN 2500-0381 (Online)