"Robots": Control what search engines index

TL;DR

in the <head> section of a page's HTML, you can use the <meta name="robots"> tag to control whether a search engine should index the page
<meta name="robots" content="index"> allows the resource to be indexed
<meta name="robots" content="noindex"> keeps the resource out of the index
the X-Robots-Tag response header may also define whether a resource should be indexed or not
- it is especially useful for non-HTML documents (as they cannot define a robots HTML tag)
add X-Robots-Tag in nginx: add_header X-Robots-Tag "index";
add X-Robots-Tag in Apache (requires mod_headers): Header set X-Robots-Tag "index" or Header append X-Robots-Tag "index"
for these rules to work, the resource (a page or another file on the server, such as a PDF) must not be excluded from crawling via the robots.txt

robots.txt

The robots.txt must be in the root path of your site. Mine is at https://miriam-mueller.com/robots.txt. These files are simple: they contain plain text in a specified format that most bots on the internet understand. There is no mechanism enforcing that bots follow the rules you put in the robots.txt, but most “friendly” bots respect them.
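To make the format a little more concrete, here is a minimal sketch that checks crawl rules with Python's standard-library urllib.robotparser. The robots.txt content, the bot name "ExampleBot" and the example.com URLs are made up for illustration; they are not taken from the file linked above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the URLs are assumptions for this sketch.
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report.pdf"))  # False
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/"))  # True
```

A well-behaved crawler runs a check like this before each request; nothing forces a bot to do so.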

Do not use robots.txt to control indexing

If you use the “Disallow” rule in the robots.txt, for example, a search crawler will not access the disallowed page. That does not mean that the page will not be indexed. If the bot encounters a link to the page you disallowed but cannot access it, it cannot read a robots meta-tag or X-Robots-Tag header, and the page might end up being indexed anyway.
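The reasoning above can be written down as a tiny decision function. This is a deliberately simplified model, not a description of how any particular search engine decides; the function name and its three boolean parameters are invented for this sketch.

```python
def may_end_up_indexed(crawl_allowed: bool,
                       has_noindex: bool,
                       linked_from_elsewhere: bool) -> bool:
    """Simplified model of the robots.txt pitfall described above."""
    if crawl_allowed:
        # The crawler can fetch the page, so it can read and honour noindex.
        return not has_noindex
    # Blocked by robots.txt: the noindex rule is never read. If another page
    # links to this one, the URL can still be discovered and indexed.
    return linked_from_elsewhere


# A page with noindex that is also disallowed in robots.txt and linked from
# another site: the noindex rule is invisible to the crawler.
print(may_end_up_indexed(crawl_allowed=False, has_noindex=True, linked_from_elsewhere=True))  # True

# The same page with crawling allowed: noindex is read and respected.
print(may_end_up_indexed(crawl_allowed=True, has_noindex=True, linked_from_elsewhere=True))  # False
```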

Robots meta-tag configuration explained

When using the robots meta-tag, there are two pairs of values you need to understand: index and noindex, and follow and nofollow. Index and noindex control whether a resource should be indexed. Follow and nofollow indicate whether the search engine crawler should follow the links on that resource.

Configuration	Consequence	Use Case
`<meta name="robots" content="index">`	The page is indexed	you want the resource to be indexed
`<meta name="robots" content="noindex">`	The page is not indexed	you do not want the resource to be indexed
`<meta name="robots" content="noindex, follow">`	The page is not indexed; the crawler follows links	the page consists mainly of internal links: the crawler should follow them, but the page itself is of little use to users and thus should not be indexed
`<meta name="robots" content="index, nofollow">`	The page is indexed; the crawler does not follow links	you run a forum, for example, and do not want the bot to follow links that users post in the comments
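As an illustration of how a crawler might read these configurations, the following sketch extracts the robots meta-tag with Python's standard-library html.parser and turns it into an index/follow decision. The class and function names are invented for this example. It assumes that a missing value means "index" and "follow", and it only handles the four values from the table.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_policy(html: str) -> dict:
    """Return the index/follow decision; absent values default to True (assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))  # {'index': False, 'follow': True}
```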

Fine-tune link following

Instead of using the robots meta-tag, you can also specify for a single link whether it should be followed. This is done via the “rel” attribute. In 2019, an update changed the rel attribute values and how search engine crawlers treat them.

<a href="https://something.com" rel="nofollow">
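To show how per-link rules can be read, here is a small sketch that lists the links of a page together with whether they carry rel="nofollow". It uses only Python's standard library; the class name, the sample HTML and the second URL are made up for illustration.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects (href, has_nofollow) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel holds space-separated tokens, so split before comparing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


# Sample markup; the second URL is an assumption for this sketch.
html = (
    '<a href="https://something.com" rel="nofollow">user link</a>'
    '<a href="https://example.com/about">about</a>'
)
parser = LinkRelParser()
parser.feed(html)
print(parser.links)
# [('https://something.com', True), ('https://example.com/about', False)]
```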

Nofollow - a recommendation

When first introduced, nofollow meant what it says: do not follow. The above-mentioned change in 2019, as outlined in the Google Search Central article “Evolving ‘nofollow’”, led to the current status quo: search algorithms treat all rel attributes as hints, so the linked content may still be processed.

X-Robots-Tag response header

This header is not part of any specification, but it is the de-facto standard method for telling a search engine crawler whether to index a resource or not. Syntax:

X-Robots-Tag: <indexing-rule>
X-Robots-Tag: <indexing-rule>, …, <indexing-ruleN>
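The following sketch shows how the comma-separated form above could be evaluated on the receiving side. It is a simplification that only handles plain rule names such as "noindex" and ignores any more elaborate rule forms. The function names and the sample headers are invented for this example.

```python
def parse_x_robots_tag(header_value: str) -> set:
    """Split an X-Robots-Tag value into a set of lower-cased rules."""
    return {rule.strip().lower() for rule in header_value.split(",") if rule.strip()}


def is_indexable(headers: dict) -> bool:
    """Return False if any X-Robots-Tag header contains noindex.

    Header names are compared case-insensitively; a response without the
    header is treated as indexable (assumption for this sketch).
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in parse_x_robots_tag(value):
            return False
    return True


# Hypothetical response headers for a PDF that should stay out of the index.
pdf_headers = {"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}
print(parse_x_robots_tag(pdf_headers["X-Robots-Tag"]) == {"noindex", "nofollow"})  # True
print(is_indexable(pdf_headers))  # False
print(is_indexable({"Content-Type": "text/html"}))  # True
```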