Contents
Introduction
The Internet began as a network of just 4 computers. Now, there are around 50 billion connected devices
In 2020, there were around 1.8 billion websites on the internet. Finding a website without knowing its address is much more difficult now.
Search Engine Indexing
A web crawler, often called a "spider", crawls the web and creates an index of the sites it finds, using the PageRank algorithm to give each page a weight.
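To illustrate the idea, here is a minimal PageRank sketch in Python. The link graph, damping factor, and iteration count are illustrative assumptions, not Google's production values.

# A minimal sketch of the PageRank idea (not Google's production algorithm):
# each page's weight is shared among the pages it links to, reduced by a damping factor.

def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}  # start with equal weight everywhere
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = rank[page] / len(outgoing)  # split this page's weight among its links
                for target in outgoing:
                    new_rank[target] = new_rank.get(target, 0) + damping * share
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B and C, B links to C, C links back to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))

Pages with more (and better-weighted) incoming links end up with higher scores, which is the intuition behind ranking one page above another.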
Websites are constantly being created, and others removed from the web
This means that the index must be kept up to date, so that search results reflect the current state of the web
Google's crawler is known as "Googlebot"
When a person searches using a search engine, they are not searching the whole web; that would be too slow and impractical. Instead, they search the index of the pages.
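A toy example of why this is fast, assuming a tiny made-up set of pages: an inverted index maps each word to the pages containing it, so a search is a dictionary lookup rather than a scan of every page on the web.

# A toy inverted index: maps each word to the set of pages containing it.
from collections import defaultdict

pages = {
    "example.com": "search engines index the web",
    "demo.org": "spiders crawl the web constantly",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

# A query intersects the page sets for each of its words.
query = "web crawl"
results = set.intersection(*(index[word] for word in query.split()))
print(results)  # {'demo.org'}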
Robots
Not all parts of a site are available to index. A file known as robots.txt exists to prevent spiders from crawling where they shouldn't, for example in locations where a login is necessary.
For example, ChatGPT keeps their robots.txt here: https://chatgpt.com/robots.txt