TL;DR
- in the <head> section of a page’s HTML you can use the tag <meta name="robots"> to control whether a search engine should index the page
  - <meta name="robots" content="index"> will lead to the resource being indexed
  - <meta name="robots" content="noindex"> will lead to the resource not being indexed
- the X-Robots-Tag response header may also define whether a resource should be indexed or not
- it is especially useful for non-HTML documents (as they cannot define a robots HTML tag)
- add X-Robots-Tag in nginx: add_header X-Robots-Tag "index";
- add X-Robots-Tag in apache: Header set X-Robots-Tag "index" or Header append X-Robots-Tag "index"
- in order for the rules to work, the resource (i.e. a page or a file on the server such as a PDF) must not be excluded from crawling via the robots.txt
robots.txt
The robots.txt must be in the root path of your site. Mine is at https://miriam-mueller.com/robots.txt. It is a simple text file in a specified format that most bots on the internet understand. There is no mechanism enforcing that bots follow the rules you put in the robots.txt, but most “friendly” bots respect it.
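As an illustration of the format, the rules can be evaluated with Python's standard-library urllib.robotparser. The rules, the bot name ExampleBot and the example.com URLs below are made up for this sketch; they are not taken from the robots.txt linked above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot may crawl everything except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a bot that respects robots.txt may crawl the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/private/report.pdf"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

This only models a bot that chooses to respect the file; nothing in it prevents a bot from ignoring the rules.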
Do not use robots.txt to control indexing
If you use the “Disallow” rule in the robots.txt, for example, a search crawler will not access the disallowed page. That does not mean that the page will not be indexed. If the bot encounters a link to the disallowed page but cannot access it, it cannot read the robots meta-tag or the X-Robots-Tag header, and the page might end up being indexed anyway.
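The reasoning can be written down as a deliberately simplified model. The function and its parameters are hypothetical and only encode the argument of this section; real search engines weigh many more signals.

```python
def may_end_up_indexed(disallowed_in_robots_txt: bool,
                       has_noindex: bool,
                       linked_from_elsewhere: bool) -> bool:
    """Simplified model of the reasoning above, not real search engine logic."""
    if disallowed_in_robots_txt:
        # The crawler never fetches the page, so a noindex rule stays unseen.
        # The URL can still be discovered, and indexed, through links.
        return linked_from_elsewhere
    # The crawler can fetch the page and therefore sees and honours noindex.
    return not has_noindex


if __name__ == "__main__":
    # Disallowed and noindex, but linked from another page: may still be indexed.
    print(may_end_up_indexed(True, True, True))
    # Crawlable and noindex: the rule is read and the page stays out of the index.
    print(may_end_up_indexed(False, True, True))
```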
Robots meta-tag configuration explained
When using the robots meta-tag, there are two main pairs of settings you need to understand: index and noindex, and follow and nofollow. Index and noindex control whether a resource should be indexed. Follow and nofollow indicate whether the search engine crawler should follow the links on that resource.
See also: mdn <meta name="robots">
| Configuration | Consequence | Use Case |
|---|---|---|
| <meta name="robots" content="index"> | The page is indexed | you want the resource to be indexed |
| <meta name="robots" content="noindex"> | The page is not indexed | you do not want the resource to be indexed |
| <meta name="robots" content="noindex, follow"> | The page is not indexed; the crawler follows links | the page mainly consists of internal links: you want the crawler to follow them, but the page itself is of little use to users and thus should not be indexed |
| <meta name="robots" content="index, nofollow"> | The page is indexed; the crawler does not follow links | you are running a forum, for example, and you do not want the bot to follow links users post in the comments |
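The table can be turned into a small check. The sketch below uses Python's standard-library html.parser; the class and function names are hypothetical, and it assumes that a page without an explicit noindex or nofollow is treated as index, follow.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_policy(html: str) -> dict[str, bool]:
    """Return whether the page may be indexed and whether links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Assumption: anything not explicitly forbidden is allowed.
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, follow"></head>'
    print(robots_policy(page))
```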
Fine-tune link following
Instead of using the robots meta-tag, you can also specify for a single link whether it should be followed. This is done via the “rel” attribute. In 2019, the rel attribute values and the way search engine crawlers treat them were updated.
<a href="https://something.com" rel="nofollow">
See also: mdn - HTML attribute: rel
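To see which links on a page carry the hint, the rel attribute can be read per link. This is a minimal sketch using Python's standard-library html.parser; the class and function names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects (href, rel tokens) for every <a> element that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, set[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # rel is a space-separated list of tokens, e.g. "nofollow noopener".
            rel_tokens = set((attributes.get("rel") or "").lower().split())
            self.links.append((href, rel_tokens))


def links_hinted_nofollow(html: str) -> list[str]:
    """Return the hrefs of all links whose rel attribute contains nofollow."""
    parser = LinkRelParser()
    parser.feed(html)
    return [href for href, rel in parser.links if "nofollow" in rel]


if __name__ == "__main__":
    sample = (
        '<a href="https://something.com" rel="nofollow">external</a>'
        '<a href="/about">internal</a>'
    )
    print(links_hinted_nofollow(sample))
```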
Nofollow - a recommendation
When first introduced, nofollow meant what it spells: do not follow. The above-mentioned change in 2019, as outlined in the Google Search Central article “Evolving ‘nofollow’”, led to the current status quo: search algorithms treat all rel attribute values as hints, so the linked content may still be processed.
X-Robots-Tag response header
This header is not part of any specification, but it is the de facto standard method for telling a search engine crawler whether to index a resource. Syntax:
X-Robots-Tag: <indexing-rule>
X-Robots-Tag: <indexing-rule>, …, <indexing-ruleN>
See also: mdn X-Robots-Tag header
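Reading the header on the client side follows directly from the syntax. The sketch below is an assumption-laden illustration: the function names are hypothetical, it takes the raw header values as a list of strings (a response may carry the header more than once), it does not handle rules scoped to a specific bot, and it assumes a resource may be indexed unless noindex is present.

```python
def parse_x_robots_tag(header_values: list[str]) -> set[str]:
    """Collect the indexing rules from one or more X-Robots-Tag header values."""
    rules: set[str] = set()
    for value in header_values:
        # Each header value is a comma-separated list of indexing rules.
        for rule in value.split(","):
            rule = rule.strip().lower()
            if rule:
                rules.add(rule)
    return rules


def should_index(header_values: list[str]) -> bool:
    # Assumption: without an explicit noindex rule the resource may be indexed.
    return "noindex" not in parse_x_robots_tag(header_values)


if __name__ == "__main__":
    print(parse_x_robots_tag(["noindex, nofollow"]))
    print(should_index(["index"]))
```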
Further reading
- Google Search Central - Introduction to robots.txt
- Google Search Central - Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
German: