Search Engine Indexing

Introduction

The Internet started out as a network of just four computers. Now, there are around 50 billion connected devices.

In 2020, there were around 1.8 billion websites on the internet. Finding a website without knowing its address is far more difficult now.

Search Engine Indexing

A web crawler, often called a "spider", crawls the web and builds an index of the sites it finds, using an algorithm such as PageRank to give each page a weight based on the links pointing to it.
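The link-weighting idea can be sketched in a few lines of Python. This is a minimal, simplified version of PageRank over a made-up four-page web (the page names and link structure are invented for illustration), not how a real search engine implements it.

```python
# links maps each (hypothetical) page to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly share each page's rank among the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start with equal rank
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
# Page "c" ends up with the highest weight: three pages link to it.
```

The weights sum to 1, so a page's rank can be read as the share of attention it receives from the rest of the web.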

Websites are constantly being created, and others removed from the web.

This means that the index must be kept up to date, to ensure that current sites appear in search results.

Google's crawler is known as "Googlebot".

When a person types a query into a search engine, they are not searching the whole web; that would be too slow and impractical. Instead, they search the index of the pages.
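Searching an index rather than the web itself can be illustrated with a small inverted index: a mapping from each word to the set of pages that contain it. The pages and their contents below are made up for the example.

```python
# Hypothetical crawled pages and their text.
pages = {
    "example.com": "search engines index the web",
    "crawler.net": "a web crawler visits pages",
    "news.org": "todays news headlines",
}

# Build the inverted index: word -> set of pages containing it.
index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(query):
    """Answer a query from the index alone, never touching the pages."""
    results = None
    for word in query.split():
        postings = index.get(word, set())
        # Keep only pages that contain every query word so far.
        results = postings if results is None else results & postings
    return results or set()
```

Here `search("web")` returns both pages mentioning "web", while `search("web crawler")` narrows the answer by intersecting the two word lists, which is far cheaper than scanning every page at query time.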

Robots

Not all parts of a site are available to index. A file known as robots.txt tells spiders which parts of a site they should not crawl, for example locations where a login is necessary.

For example, ChatGPT keeps its robots.txt here: https://chatgpt.com/robots.txt
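A well-behaved crawler checks these rules before fetching a page. Python's standard library includes a robots.txt parser; the rules below are a made-up example (not ChatGPT's actual file), disallowing a hypothetical /account/ area.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents (invented for illustration).
rules = """\
User-agent: *
Disallow: /account/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A crawler asks before fetching each URL:
print(parser.can_fetch("MyBot", "https://example.com/"))          # allowed
print(parser.can_fetch("MyBot", "https://example.com/account/"))  # blocked
```

In practice the parser would load the live file with `parser.set_url(...)` and `parser.read()`; parsing a string here keeps the sketch self-contained.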