RFC 9309 | Robots Exclusion Protocol (REP) | September 2022 |
Koster, et al. | Standards Track |
This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for the protocol, instructions for handling errors, and instructions for caching.¶
This is an Internet Standards Track document.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.¶
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc9309.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
This document applies to services that provide resources that clients can access through URIs as defined in [RFC3986]. For example, in the context of HTTP, a browser is a client that displays the content of a web page.¶
Crawlers are automated clients. Search engines, for instance, have crawlers that recursively traverse links, as defined in [RFC8288], for indexing.
It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT] that crawlers are requested to honor when accessing URIs.¶
These rules are not a form of access authorization.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The protocol language consists of rule(s) and group(s) that the service makes available in a file named "robots.txt" as described in Section 2.3:¶
Below is an Augmented Backus-Naur Form (ABNF) description, as described in [RFC5234].¶
robotstxt = *(group / emptyline)
group = startgroupline                ; We start with a user-agent
                                      ;  line
       *(startgroupline / emptyline)  ; ... and possibly more
                                      ;  user-agent lines
       *(rule / emptyline)            ; followed by rules relevant
                                      ;  for the preceding
                                      ;  user-agent lines

startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

rule = *WS ("allow" / "disallow") *WS ":"
      *WS (path-pattern / empty-pattern) EOL

; parser implementors: define additional lines you need (for
; example, Sitemaps).

product-token = identifier / "*"
path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
empty-pattern = *WS

identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
comment = "#" *(UTF8-char-noctl / WS / "#")
emptyline = EOL
EOL = *WS [comment] NL ; end-of-line may have
                       ; optional trailing comment
NL = %x0D / %x0A / %x0D.0A
WS = %x20 / %x09

; UTF8 derived from RFC 3629, but excluding control characters

UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, "#"
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
         %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
         %xF4 %x80-8F 2UTF8-tail
UTF8-tail = %x80-BF
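As an illustration of how the grammar might be applied line by line, the following Python sketch classifies a single robots.txt line. It is a loose reading of the "startgroupline" and "rule" productions, not a validating parser. The names LINE_RE and classify_line are hypothetical and do not come from this document.

```python
import re

# Hypothetical helper, not part of the specification: a loose reading of the
# "startgroupline" and "rule" productions (key, ":", value, optional spaces).
LINE_RE = re.compile(
    r"^[ \t]*(?P<key>[A-Za-z-]+)[ \t]*:[ \t]*(?P<value>.*?)[ \t]*$"
)

def classify_line(line: str):
    """Return (kind, value); kind is 'user-agent', 'allow', 'disallow',
    'empty', or 'other'."""
    # A "#" starts a comment that runs to the end of the line.
    content = line.split("#", 1)[0].rstrip("\r\n")
    if not content.strip(" \t"):
        return ("empty", "")
    match = LINE_RE.match(content)
    if match is None:
        return ("other", content.strip(" \t"))
    # Quoted strings in ABNF are case-insensitive, so the key is lowercased.
    key = match.group("key").lower()
    if key in ("user-agent", "allow", "disallow"):
        return (key, match.group("value"))
    # Anything else (for example, a sitemap record) is left to the caller.
    return ("other", content.strip(" \t"))
```

For instance, classify_line("Disallow:/") yields ("disallow", "/"), and a line that holds only a comment is reported as "empty".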
Crawlers set their own name, which is called a product token, to find relevant groups. The product token MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"). The product token SHOULD be a substring of the identification string that the crawler sends to the service. For example, in the case of HTTP [RFC9110], the product token SHOULD be a substring in the User-Agent header. The identification string SHOULD describe the purpose of the crawler. Here's an example of a User-Agent HTTP request header with a link pointing to a page describing the purpose of the ExampleBot crawler, which appears as a substring in the User-Agent HTTP header and as a product token in the robots.txt user-agent line:¶
Note that the product token (ExampleBot) is a substring of the User-Agent HTTP header.¶
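The two requirements on the product token (permitted characters and being a substring of the identification string) could be checked as sketched below. The function names and the sample User-Agent value are illustrative assumptions. The substring test is case sensitive here, which is a choice of this sketch rather than something the text mandates.

```python
import re

# Letters, underscores, and hyphens only, as required for a product token.
TOKEN_RE = re.compile(r"[A-Za-z_-]+")

def is_valid_product_token(token: str) -> bool:
    """True if the token uses only the permitted characters."""
    return TOKEN_RE.fullmatch(token) is not None

def token_in_user_agent(token: str, user_agent: str) -> bool:
    """True if the token appears as a substring of the identification string."""
    return token in user_agent

# Illustrative identification string (an assumption for this sketch).
SAMPLE_UA = "Mozilla/5.0 (compatible; ExampleBot/0.1; https://www.example.com/bot.html)"
```

With these definitions, "ExampleBot" is a valid token and is found in SAMPLE_UA, while "ExampleBot/0.1" is rejected because "/", "." and digits are not permitted in a product token.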
Crawlers MUST use case-insensitive matching to find the group that matches the product token and then obey the rules of the group. If there is more than one group matching the user-agent, the matching groups' rules MUST be combined into one group and parsed according to Section 2.2.2.¶
If no matching group exists, crawlers MUST obey the group with a user-agent line with the "*" value, if present.¶
If no group matches the product token and there is no group with a user-agent line with the "*" value, or no groups are present at all, no rules apply.¶
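The selection logic of the preceding three paragraphs can be sketched as follows. The data layout (a list of (user-agents, rules) pairs) and the name select_rules are assumptions made for this example. "Matching" is implemented as case-insensitive equality between the product token and a user-agent value, which is one plausible reading.

```python
def select_rules(groups, product_token):
    """Return the combined rules that apply to product_token.

    groups: list of (agents, rules) pairs, where agents is a list of
    user-agent values and rules is a list of (kind, pattern) tuples.
    """
    token = product_token.lower()
    specific, wildcard = [], []
    found_specific = False
    for agents, rules in groups:
        names = {agent.lower() for agent in agents}
        if token in names:
            # Several matching groups are combined into one.
            found_specific = True
            specific.extend(rules)
        if "*" in names:
            wildcard.extend(rules)
    # Fall back to the "*" group only when no specific group matched.
    # An empty result means no rules apply.
    return specific if found_specific else wildcard
```

Note that a matching group without rules yields an empty list rather than the "*" rules: the crawler obeys its own (empty) group.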
The "allow" and "disallow" lines indicate whether accessing a URI that matches the corresponding path is allowed or disallowed.
To evaluate if access to a URI is allowed, a crawler MUST match the paths in "allow" and "disallow" rules against the URI. The matching SHOULD be case sensitive. The matching MUST start with the first octet of the path. The most specific match found MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group MAY be deduplicated. If an "allow" rule and a "disallow" rule are equivalent, then the "allow" rule SHOULD be used. If no match is found amongst the rules in a group for a matching user-agent or there are no rules in the group, the URI is allowed. The /robots.txt URI is implicitly allowed.¶
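A minimal sketch of this evaluation is shown below, assuming that the rules have already been selected for the crawler and that paths are compared as literal prefixes (special characters are not handled here). The name is_allowed is hypothetical. Treating an empty pattern as matching nothing is an assumption of the sketch.

```python
def is_allowed(rules, path):
    """Decide access for path given (kind, pattern) rules of one group."""
    if path == "/robots.txt":
        return True  # implicitly allowed
    best_len = -1
    allowed = True  # no matching rule means the URI is allowed
    for kind, pattern in rules:
        # Matching starts at the first octet of the path (prefix match).
        if not pattern or not path.startswith(pattern):
            continue
        length = len(pattern.encode("utf-8"))  # specificity in octets
        # Longest match wins; on a tie, "allow" is preferred.
        if length > best_len or (length == best_len and kind == "allow"):
            best_len = length
            allowed = (kind == "allow")
    return allowed
```

Duplicate rules need no special handling in this form, since a repeated rule has the same length and the same outcome.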
Octets in the URI and robots.txt paths outside the range of the ASCII coded character set, and those in the reserved range defined by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to comparison.¶
If a percent-encoded ASCII octet is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by [RFC3986] or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.¶
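One way to bring a URI path and a rule path into a comparable form is sketched below. It percent-encodes octets outside the ASCII range and decodes percent-encoded unreserved characters, leaving other escapes in place. The name normalize_path is hypothetical. Uppercasing the remaining escapes is an added normalization assumed for this sketch. Raw reserved characters are left untouched, because the sketch cannot tell delimiters from data.

```python
import re

# Unreserved characters of RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_path(path: str) -> str:
    """Return path in a form suitable for octet-wise comparison."""
    # Step 1: percent-encode every octet outside the ASCII range.
    encoded = "".join(
        chr(byte) if byte < 0x80 else "%{:02X}".format(byte)
        for byte in path.encode("utf-8")
    )

    # Step 2: decode escapes of unreserved characters; keep all other
    # escapes, uppercased so that "%3a" and "%3A" compare equal.
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()

    return PCT_RE.sub(fix, encoded)
```

For example, "/foo/bar/%62%61%7A" normalizes to "/foo/bar/baz", while "%3A" and "%2F" stay encoded because ":" and "/" are reserved characters.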
For example:¶
The crawler SHOULD ignore "disallow" and "allow" rules that are not in any group (for example, any rule that precedes the first user-agent line).¶
Implementors MAY bridge encoding mismatches if they detect that the robots.txt file is not UTF-8 encoded.¶
Crawlers MUST support the following special characters:¶
If crawlers match special characters verbatim in the URI, crawlers SHOULD use "%" encoding. For example:¶
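A possible way to match path patterns that use special characters is sketched here. It assumes the usual meanings: "*" stands for zero or more of any character, and a trailing "$" marks the end of the match pattern ("#" starts a comment and is removed before this step). The function names are hypothetical. A percent-encoded special character such as "%2A" is treated as literal text.

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each "*" match any sequence.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""), re.DOTALL)

def pattern_matches(pattern: str, path: str) -> bool:
    """True if pattern matches path, starting at the first character."""
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, "/*.gif$" matches "/images/a.gif" but not "/images/a.gif?x=1", and "/example/" matches any path that begins with "/example/".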
Crawlers MAY interpret other records that are not part of the robots.txt protocol -- for example, "Sitemaps" [SITEMAPS]. Crawlers MAY be lenient when interpreting other records. For example, crawlers may accept common misspellings of the record.¶
Parsing of other records MUST NOT interfere with the parsing of explicitly defined records in Section 2. For example, a "Sitemaps" record MUST NOT terminate a group.¶
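The following sketch shows one way to build groups so that other records neither start nor terminate a group. The name parse_groups and the (agents, rules) layout are assumptions of this example. Rules that appear before the first user-agent line are dropped, and lines without a ":" are skipped.

```python
import re

def parse_groups(text: str):
    """Parse robots.txt text into a list of (agents, rules) pairs."""
    groups = []
    agents, rules = None, None
    in_agent_run = False  # True while reading consecutive user-agent lines
    for raw in re.split(r"\r\n|\r|\n", text):
        line = raw.split("#", 1)[0].strip(" \t")  # drop trailing comment
        if ":" not in line:
            continue  # empty or unparseable line
        key, value = line.split(":", 1)
        key, value = key.strip(" \t").lower(), value.strip(" \t")
        if key == "user-agent":
            if not in_agent_run:
                # A user-agent line after rules starts a new group.
                agents, rules = [], []
                groups.append((agents, rules))
                in_agent_run = True
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_agent_run = False
            if rules is not None:  # rules outside any group are ignored
                rules.append((key, value))
        # Any other record (for example, a sitemap line) is skipped and
        # leaves both in_agent_run and the current group untouched.
    return groups
```

Because other records do not touch the grouping state, a sitemap line placed between two user-agent lines, or between two rules, leaves the surrounding group intact.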
The rules MUST be accessible in a file named "/robots.txt" (all lowercase) in the top-level path of the service. The file MUST be UTF-8 encoded (as defined in [RFC3629]) and have the Internet Media Type "text/plain" (as defined in [RFC2046]).
As per [RFC3986], the URI of the robots.txt file is:¶
"scheme:[//authority]/robots.txt"¶
For example, in the context of HTTP or FTP, the URI is:¶
https://www.example.com/robots.txt
ftp://ftp.example.com/robots.txt
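Deriving the robots.txt location from an arbitrary URI of the same service could look like the sketch below, which uses Python's standard urllib.parse module. The function name robots_txt_uri is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_uri(uri: str) -> str:
    """Return scheme:[//authority]/robots.txt for the service behind uri."""
    parts = urlsplit(uri)
    # Keep scheme and authority; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For instance, "https://www.example.com/a/b?c=d" maps to "https://www.example.com/robots.txt".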
If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.¶
A server may respond to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. Crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).
If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority.¶
If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt file is unavailable.¶
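The redirect handling described above might be sketched as follows. The fetch callable is a hypothetical stand-in for a real HTTP client and returns a (status, location, body) tuple, so the example runs without network access. The set of redirect status codes beyond 301 and 302 is an assumption of the sketch.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5
REDIRECT_CODES = (301, 302, 303, 307, 308)  # assumption beyond 301/302

def fetch_robots(start_uri, fetch):
    """Follow up to MAX_REDIRECTS redirects; return (status, body).

    fetch(uri) -> (status, location_or_None, body) is supplied by the caller.
    Returns (None, None) after more than five consecutive redirects.
    """
    uri = start_uri
    for _ in range(MAX_REDIRECTS + 1):  # initial request plus five redirects
        status, location, body = fetch(uri)
        if status in REDIRECT_CODES and location:
            uri = urljoin(uri, location)  # may point to another authority
            continue
        return status, body
    return None, None
```

Whatever body is returned, its rules apply to the authority of start_uri, not to the authority the redirects ended on.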
If the robots.txt file is unreachable due to server or network errors, this means the robots.txt file is undefined and the crawler MUST assume complete disallow. For example, in the context of HTTP, server errors are identified by status codes in the 500-599 range.¶
If the robots.txt file is undefined for a reasonably long period of time (for example, 30 days), crawlers MAY assume that the robots.txt file is unavailable as defined in Section 2.3.1.3 or continue to use a cached copy.¶
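How a crawler might turn the outcome of a fetch into a policy is sketched below for the HTTP case. The function name, the policy labels, and the use of None for a network error are assumptions of this example. The 400-499 branch follows the "unavailable" status of Section 2.3.1.3, which is referenced but not reproduced in the text above; treating it as permission to access any resource is this sketch's reading of that section.

```python
UNDEFINED_GRACE_DAYS = 30  # the example period mentioned in the text

def access_policy(status, undefined_days=0):
    """Map a fetch outcome to 'parse', 'allow_all', or 'disallow_all'.

    status: HTTP status code, or None for a network error.
    undefined_days: how long the file has been continuously undefined.
    """
    if status is not None and 200 <= status <= 299:
        return "parse"  # follow the parseable rules
    if status is not None and 400 <= status <= 499:
        return "allow_all"  # assumption: "unavailable" per Section 2.3.1.3
    # Network errors, 5xx, and (as an assumption) anything unexpected
    # leave the file undefined.
    if undefined_days >= UNDEFINED_GRACE_DAYS:
        # After a long outage the crawler MAY treat the file as unavailable;
        # continuing with a cached copy is the other permitted option.
        return "allow_all"
    return "disallow_all"
```

A 503 response therefore leads to complete disallow at first, and only an outage that lasts for the whole grace period relaxes this.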
Crawlers MUST try to parse each line of the robots.txt file. Crawlers MUST use the parseable rules.¶
The Robots Exclusion Protocol is not a substitute for valid content security measures. Listing paths in the robots.txt file exposes them publicly and thus makes the paths discoverable. To control access to the URI paths in a robots.txt file, users of the protocol should employ a valid security measure relevant to the application layer on which the robots.txt file is served -- for example, in the case of HTTP, HTTP Authentication as defined in [RFC9110].¶
To protect their systems against attacks, implementors of robots.txt parsing and matching logic should take the following considerations into account:
This document has no IANA actions.¶
The following example shows:¶
User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/

User-Agent: foobot
Disallow:/
Allow:/example/page.html
Allow:/example/allowed.gif

User-Agent: barbot
User-Agent: bazbot
Disallow: /example/page.html

User-Agent: quxbot

EOF

The following example shows that when two rules match a URI, the longest one is used. In the following case, /example/page/disallowed.gif MUST be used for the URI example.com/example/page/disallowed.gif.

User-Agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif