Internet Engineering Task Force (IETF)                         M. Koster
Request for Comments: 9309
Category: Standards Track                                      G. Illyes
ISSN: 2070-1721                                                H. Zeller
                                                              L. Sassman
                                                              Google LLC
                                                           September 2022


                       Robots Exclusion Protocol
Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1994 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.
   Specifically, it adds definition language for the protocol,
   instructions for handling errors, and instructions for caching.
Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc9309.
Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.
Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Specification
     2.1.  Protocol Definition
     2.2.  Formal Syntax
       2.2.1.  The User-Agent Line
       2.2.2.  The "Allow" and "Disallow" Lines
       2.2.3.  Special Characters
       2.2.4.  Other Records
     2.3.  Access Method
       2.3.1.  Access Results
         2.3.1.1.  Successful Access
         2.3.1.2.  Redirects
         2.3.1.3.  "Unavailable" Status
         2.3.1.4.  "Unreachable" Status
         2.3.1.5.  Parsing Errors
     2.4.  Caching
     2.5.  Limits
   3.  Security Considerations
   4.  IANA Considerations
   5.  Examples
     5.1.  Simple Example
     5.2.  Longest Match
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses
1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in [RFC3986].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers to recursively traverse links for indexing as defined in
   [RFC8288].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules
   originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT]
   that crawlers are requested to honor when accessing URIs.

   These rules are not a form of access authorization.
1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol Definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named "robots.txt" as described in
   Section 2.3:

   Rule:  A line with a key-value pair that defines how a crawler may
      access URIs.  See Section 2.2.2.

   Group:  One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See Section 2.2.1.  The last group may have no rules,
      which means it implicitly allows everything.
2.2.  Formal Syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in [RFC5234].

    robotstxt = *(group / emptyline)

    group = startgroupline                ; We start with a user-agent
                                          ;  line
           *(startgroupline / emptyline)  ; ... and possibly more
                                          ;  user-agent lines
           *(rule / emptyline)            ; followed by rules relevant
                                          ;  for the preceding
                                          ;  user-agent lines

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: define additional lines you need (for
    ; example, Sitemaps).

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL

    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment

    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC 3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, "#"
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF
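
   The following non-normative Python sketch illustrates one way the
   grammar above might be reduced to a stream of key-value pairs before
   group and rule processing.  The function and variable names are
   illustrative only and are not defined by this specification.

   import re
   from typing import Iterator, Tuple

   # Keys are identifiers per the ABNF; values keep the raw product
   # token or path pattern.  Unknown keys are passed through so that
   # callers may interpret other records (Section 2.2.4).
   _LINE = re.compile(r"^\s*([A-Za-z_-]+)\s*:\s*(.*?)\s*$")

   def parse_lines(robotstxt: str) -> Iterator[Tuple[str, str]]:
       for raw in robotstxt.splitlines():
           line = raw.split("#", 1)[0]   # strip end-of-line comment
           found = _LINE.match(line)
           if found is None:
               continue                  # empty or unparseable line
           yield found.group(1).lower(), found.group(2)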
2.2.1.  The User-Agent Line

   Crawlers set their own name, which is called a product token, to
   find relevant groups.  The product token MUST contain only uppercase
   and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and
   hyphens ("-").  The product token SHOULD be a substring of the
   identification string that the crawler sends to the service.  For
   example, in the case of HTTP [RFC9110], the product token SHOULD be
   a substring in the User-Agent header.  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of a
   User-Agent HTTP request header with a link pointing to a page
   describing the purpose of the ExampleBot crawler, which appears as a
   substring in the User-Agent HTTP header and as a product token in
   the robots.txt user-agent line:

   +==========================================+========================+
   | User-Agent HTTP header                   | robots.txt user-agent  |
   |                                          | line                   |
   +==========================================+========================+
   | User-Agent: Mozilla/5.0 (compatible;     | user-agent: ExampleBot |
   | ExampleBot/0.1;                          |                        |
   | https://www.example.com/bot.html)        |                        |
   +------------------------------------------+------------------------+

     Figure 1: Example of a User-Agent HTTP header and robots.txt
             user-agent line for the ExampleBot product token

   Note that the product token (ExampleBot) is a substring of the
   User-Agent HTTP header.
   Crawlers MUST use case-insensitive matching to find the group that
   matches the product token and then obey the rules of the group.  If
   there is more than one group matching the user-agent, the matching
   groups' rules MUST be combined into one group and parsed according
   to Section 2.2.2.

   +========================================+========================+
   | Two groups that match the same product | Merged group           |
   | token exactly                          |                        |
   +========================================+========================+
   | user-agent: ExampleBot                 | user-agent: ExampleBot |
   | disallow: /foo                         | disallow: /foo         |
   | disallow: /bar                         | disallow: /bar         |
   |                                        | disallow: /baz         |
   | user-agent: ExampleBot                 |                        |
   | disallow: /baz                         |                        |
   +----------------------------------------+------------------------+

      Figure 2: Example of how to merge two robots.txt groups that
                      match the same product token

   If no matching group exists, crawlers MUST obey the group with a
   user-agent line with the "*" value, if present.

   +==================================+======================+
   | Two groups that don't explicitly | Applicable group for |
   | match ExampleBot                 | ExampleBot           |
   +==================================+======================+
   | user-agent: *                    | user-agent: *        |
   | disallow: /foo                   | disallow: /foo       |
   | disallow: /bar                   | disallow: /bar       |
   |                                  |                      |
   | user-agent: BazBot               |                      |
   | disallow: /baz                   |                      |
   +----------------------------------+----------------------+

     Figure 3: Example of no matching groups other than the "*" for
                       the ExampleBot product token

   If no group matches the product token and there is no group with a
   user-agent line with the "*" value, or no groups are present at all,
   no rules apply.
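
   The following non-normative Python sketch shows one way to collect
   groups from parsed lines and to select the rules that apply to a
   given product token: groups naming the same token are merged,
   matching is case insensitive, and the "*" group is the fallback.
   The names (collect_groups, select_rules) are illustrative only.

   from typing import Dict, Iterable, List, Tuple

   Rule = Tuple[str, str]   # ("allow" | "disallow", path pattern)

   def collect_groups(
           lines: Iterable[Tuple[str, str]]) -> Dict[str, List[Rule]]:
       groups: Dict[str, List[Rule]] = {}
       active: List[str] = []    # product tokens of the open group
       in_agent_run = False      # reading consecutive user-agent lines
       for key, value in lines:
           if key == "user-agent":
               if not in_agent_run:
                   active = []   # a new group starts here
               token = value.lower()
               active.append(token)
               groups.setdefault(token, [])
               in_agent_run = True
           elif key in ("allow", "disallow"):
               in_agent_run = False
               for token in active:
                   # Rules outside any group are ignored; groups with
                   # the same token are merged into one rule list.
                   groups[token].append((key, value))
           # Other records (Section 2.2.4) neither open nor close a
           # group.
       return groups

   def select_rules(groups: Dict[str, List[Rule]],
                    product_token: str) -> List[Rule]:
       # Case-insensitive lookup, falling back to "*"; if neither
       # exists, no rules apply.
       return groups.get(product_token.lower(), groups.get("*", []))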
2.2.2.  The "Allow" and "Disallow" Lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a crawler MUST match the
   paths in "allow" and "disallow" rules against the URI.  The matching
   SHOULD be case sensitive.  The matching MUST start with the first
   octet of the path.  The most specific match found MUST be used.  The
   most specific match is the match that has the most octets.
   Duplicate rules in a group MAY be deduplicated.  If an "allow" rule
   and a "disallow" rule are equivalent, then the "allow" rule SHOULD
   be used.  If no match is found amongst the rules in a group for a
   matching user-agent or there are no rules in the group, the URI is
   allowed.  The /robots.txt URI is implicitly allowed.

   Octets in the URI and robots.txt paths outside the range of the
   ASCII coded character set, and those in the reserved range defined
   by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior
   to comparison.

   If a percent-encoded ASCII octet is encountered in the URI, it MUST
   be unencoded prior to comparison, unless it is a reserved character
   in the URI as defined by [RFC3986] or the character is outside the
   unreserved character range.  The match evaluates positively if and
   only if the end of the path from the rule is reached before a
   difference in octets is encountered.

   For example:

   +==================+=======================+=======================+
   | Path             | Encoded Path          | Path to Match         |
   +==================+=======================+=======================+
   | /foo/bar?baz=quz | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
   +------------------+-----------------------+-----------------------+
   | /foo/bar?baz=    | /foo/bar?baz=         | /foo/bar?baz=         |
   | https://foo.bar  | https%3A%2F%2Ffoo.bar | https%3A%2F%2Ffoo.bar |
   +------------------+-----------------------+-----------------------+
   | /foo/bar/        | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   | U+E38384         |                       |                       |
   +------------------+-----------------------+-----------------------+
   | /foo/            | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   | bar/%E3%83%84    |                       |                       |
   +------------------+-----------------------+-----------------------+
   | /foo/            | /foo/bar/%62%61%7A    | /foo/bar/baz          |
   | bar/%62%61%7A    |                       |                       |
   +------------------+-----------------------+-----------------------+

      Figure 4: Examples of matching percent-encoded URI components
   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementors MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.
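
   The longest-match requirement above can be implemented directly.
   The following non-normative Python sketch assumes that the URI path
   and the rule paths have already been percent-encoded and unencoded
   as described in this section and that the rule paths contain no
   special characters (see Section 2.2.3); the name is illustrative.

   from typing import List, Tuple

   Rule = Tuple[str, str]   # ("allow" | "disallow", path)

   def is_allowed(rules: List[Rule], uri_path: str) -> bool:
       if uri_path == "/robots.txt":
           return True              # /robots.txt is implicitly allowed
       best_len = -1
       best_allowed = True          # no match at all: the URI is allowed
       for verdict, path in rules:
           if not path:
               continue             # empty pattern matches nothing
           if uri_path.startswith(path):   # case-sensitive match that
               length = len(path)          # starts at the first octet
               allowed = (verdict == "allow")
               # The most specific (longest) match wins; an "allow"
               # rule wins over an equivalent "disallow" rule.
               if length > best_len or (length == best_len and allowed):
                   best_len = length
                   best_allowed = allowed
       return best_allowed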
2.2.3.  Special Characters

   Crawlers MUST support the following special characters:

   +===========+===================+==============================+
   | Character | Description       | Example                      |
   +===========+===================+==============================+
   | #         | Designates a line | allow: / # comment in line   |
   |           | comment.          |                              |
   |           |                   | # comment on its own line    |
   +-----------+-------------------+------------------------------+
   | $         | Designates the    | allow: /this/path/exactly$   |
   |           | end of the match  |                              |
   |           | pattern.          |                              |
   +-----------+-------------------+------------------------------+
   | *         | Designates 0 or   | allow: /this/*/exactly       |
   |           | more instances of |                              |
   |           | any character.    |                              |
   +-----------+-------------------+------------------------------+

        Figure 5: List of special characters in robots.txt files

   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +============================+====================================+
   | Percent-encoded Pattern    | URI                                |
   +============================+====================================+
   | /path/file-with-a-%2A.html | https://www.example.com/path/      |
   |                            | file-with-a-*.html                 |
   +----------------------------+------------------------------------+
   | /path/foo-%24              | https://www.example.com/path/foo-$ |
   +----------------------------+------------------------------------+

                 Figure 6: Example of percent-encoding
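
   The following non-normative Python sketch shows one way "*" and "$"
   might be honored by translating a path pattern into a regular
   expression; the function name is illustrative only.

   import re

   def pattern_to_regex(path_pattern: str) -> re.Pattern:
       # "$" at the end of the pattern anchors the end of the match;
       # "*" matches zero or more of any character; everything else is
       # matched verbatim.
       anchored = path_pattern.endswith("$")
       core = path_pattern[:-1] if anchored else path_pattern
       regex = "".join(".*" if ch == "*" else re.escape(ch)
                       for ch in core)
       return re.compile(regex + ("$" if anchored else ""))

   # "/this/*/exactly" matches "/this/path/exactly", while
   # "/this/path/exactly$" does not match "/this/path/exactly.html".
   assert pattern_to_regex("/this/*/exactly").match("/this/path/exactly")
   assert not pattern_to_regex("/this/path/exactly$").match(
       "/this/path/exactly.html")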
2.2.4.  Other Records

   Crawlers MAY interpret other records that are not part of the
   robots.txt protocol -- for example, "Sitemaps" [SITEMAPS].  Crawlers
   MAY be lenient when interpreting other records.  For example,
   crawlers may accept common misspellings of the record.

   Parsing of other records MUST NOT interfere with the parsing of
   explicitly defined records in Section 2.  For example, a "Sitemaps"
   record MUST NOT terminate a group.
2.3.  Access Method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lowercase) in the top-level path of the service.  The file MUST be
   UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type
   "text/plain" (as defined in [RFC2046]).

   As per [RFC3986], the URI of the robots.txt file is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt
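
   The following non-normative Python sketch derives the robots.txt URI
   for an arbitrary target URI; the function name is illustrative only.

   from urllib.parse import urlsplit, urlunsplit

   def robots_txt_uri(target_uri: str) -> str:
       # "scheme:[//authority]/robots.txt" per Section 2.3.
       parts = urlsplit(target_uri)
       return urlunsplit((parts.scheme, parts.netloc,
                          "/robots.txt", "", ""))

   # robots_txt_uri("https://www.example.com/foo/bar?baz=quz")
   #   == "https://www.example.com/robots.txt"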
2.3.1.  Access Results

2.3.1.1.  Successful Access

   If the crawler successfully downloads the robots.txt file, the
   crawler MUST follow the parseable rules.

2.3.1.2.  Redirects

   It's possible that a server responds to a robots.txt fetch request
   with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP.
   The crawlers SHOULD follow at least five consecutive redirects, even
   across authorities (for example, hosts in the case of HTTP).

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt file is unavailable.

2.3.1.3.  "Unavailable" Status

   "Unavailable" means the crawler tries to fetch the robots.txt file
   and the server responds with status codes indicating that the
   resource in question is unavailable.  For example, in the context of
   HTTP, such status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the crawler, then the crawler MAY access any
   resources on the server.

2.3.1.4.  "Unreachable" Status

   If the robots.txt file is unreachable due to server or network
   errors, this means the robots.txt file is undefined and the crawler
   MUST assume complete disallow.  For example, in the context of HTTP,
   server errors are identified by status codes in the 500-599 range.

   If the robots.txt file is undefined for a reasonably long period of
   time (for example, 30 days), crawlers MAY assume that the robots.txt
   file is unavailable as defined in Section 2.3.1.3 or continue to use
   a cached copy.
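
   The following non-normative Python sketch summarizes how an HTTP
   crawler might map fetch outcomes onto Sections 2.3.1.1 through
   2.3.1.4.  The return values are illustrative sentinels, not part of
   this specification; a real crawler would also treat network errors
   that yield no status code as unreachable (Section 2.3.1.4).

   def classify_fetch(status_code: int, redirects_followed: int) -> str:
       if 200 <= status_code < 300:
           return "parse"            # successful access: obey the rules
       if 300 <= status_code < 400:
           if redirects_followed < 5:
               return "follow-redirect"  # follow at least five
                                         # consecutive redirects
           return "allow-all"        # may treat as unavailable (2.3.1.2)
       if 400 <= status_code < 500:
           return "allow-all"        # "unavailable" status (2.3.1.3)
       return "disallow-all"         # server error: assume complete
                                     # disallow (2.3.1.4)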
2.3.1.5.  Parsing Errors

   Crawlers MUST try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in [RFC9111].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt file is unreachable.

2.5.  Limits

   Crawlers SHOULD impose a parsing limit to protect their systems; see
   Section 3.  The parsing limit MUST be at least 500 kibibytes [KiB].
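
   A minimal, non-normative Python sketch of the caching and parsing-
   limit guidance above follows; the constants and class name are
   illustrative only.

   import time

   MAX_CACHE_AGE = 24 * 60 * 60   # SHOULD NOT use a cached copy longer
   PARSE_LIMIT = 500 * 1024       # parsing limit MUST be >= 500 KiB

   class CachedRobots:
       def __init__(self, body: bytes):
           self.fetched_at = time.time()
           # Processing only the first 500 KiB is the minimal
           # conforming parsing limit.
           self.body = body[:PARSE_LIMIT]

       def is_fresh(self, unreachable: bool = False) -> bool:
           # A stale copy may still be used while the robots.txt file
           # is unreachable (Sections 2.3.1.4 and 2.4).
           if unreachable:
               return True
           return time.time() - self.fetched_at < MAX_CACHE_AGE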
3.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing paths in the robots.txt file exposes
   them publicly and thus makes the paths discoverable.  To control
   access to the URI paths in a robots.txt file, users of the protocol
   should employ a valid security measure relevant to the application
   layer on which the robots.txt file is served -- for example, in the
   case of HTTP, HTTP Authentication as defined in [RFC9110].

   To protect against attacks against their system, implementors of
   robots.txt parsing and matching logic should take the following
   considerations into account:

   Memory management:  Section 2.5 defines the lower limit of bytes
      that must be processed, which inherently also protects the parser
      from out-of-memory scenarios.

   Invalid characters:  Section 2.2 defines a set of characters that
      parsers and matchers can expect in robots.txt files.  Out-of-
      bound characters should be rejected as invalid, which limits the
      available attack vectors that attempt to compromise the system.

   Untrusted content:  Implementors should treat the content of a
      robots.txt file as untrusted content, as defined by the
      specification of the application layer used.  For example, in the
      context of HTTP, implementors should follow the Security
      Considerations section of [RFC9110].

4.  IANA Considerations

   This document has no IANA actions.
5.  Examples

5.1.  Simple Example

   The following example shows:

   *:  A group that's relevant to all user agents that don't have an
      explicitly defined matching group.  It allows access to the URLs
      with the /publications/ path prefix, and it restricts access to
      the URLs with the /example/ path prefix and to all URLs with a
      .gif suffix.  The "*" character designates any character,
      including the otherwise-required forward slash; see Section 2.2.

   foobot:  A regular case.  A single user agent followed by rules.
      The crawler only has access to two URL path prefixes on the
      site -- /example/page.html and /example/allowed.gif.  The rules
      of the group are missing the optional space character, which is
      acceptable as defined in Section 2.2.

   barbot and bazbot:  A group that's relevant for more than one user
      agent.  The crawlers are not allowed to access the URLs with the
      /example/page.html path prefix but otherwise have unrestricted
      access to the rest of the URLs on the site.

   quxbot:  An empty group at the end of the file.  The crawler has
      unrestricted access to the URLs on the site.

   User-Agent: *
   Disallow: *.gif$
   Disallow: /example/
   Allow: /publications/

   User-Agent: foobot
   Disallow:/
   Allow:/example/page.html
   Allow:/example/allowed.gif

   User-Agent: barbot
   User-Agent: bazbot
   Disallow: /example/page.html

   User-Agent: quxbot

   EOF

5.2.  Longest Match

   The following example shows that in the case of two rules, the
   longest one MUST be used for matching.  In the following case,
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   User-Agent: foobot
   Allow: /example/page/
   Disallow: /example/page/disallowed.gif
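
   Using the non-normative is_allowed() sketch from Section 2.2.2, the
   longest-match behavior of this example can be checked as follows:

   rules = [("allow", "/example/page/"),
            ("disallow", "/example/page/disallowed.gif")]
   assert is_allowed(rules, "/example/page/index.html")
   assert not is_allowed(rules, "/example/page/disallowed.gif")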
6.  References

6.1.  Normative References

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996,
              <https://www.rfc-editor.org/info/rfc2046>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.
   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for
              Syntax Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8288]  Nottingham, M., "Web Linking", RFC 8288,
              DOI 10.17487/RFC8288, October 2017,
              <https://www.rfc-editor.org/info/rfc8288>.

   [RFC9110]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "HTTP Semantics", STD 97, RFC 9110,
              DOI 10.17487/RFC9110, June 2022,
              <https://www.rfc-editor.org/info/rfc9110>.
   [RFC9111]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "HTTP Caching", STD 98, RFC 9111,
              DOI 10.17487/RFC9111, June 2022,
              <https://www.rfc-editor.org/info/rfc9111>.
6.2.  Informative References

   [KiB]      "Kibibyte", Simple English Wikipedia, the free
              encyclopedia, 17 September 2020,
              <https://simple.wikipedia.org/wiki/Kibibyte>.

   [ROBOTSTXT]
              "The Web Robots Pages (including /robots.txt)", 2007,
              <https://www.robotstxt.org/>.

   [SITEMAPS] "What are Sitemaps? (Sitemap protocol)", April 2020,
              <https://www.sitemaps.org/index.html>.
Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Google LLC
   Brandschenkestrasse 110
   CH-8002 Zürich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   Google LLC
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   United States of America
   Email: henner@google.com

   Lizzi Sassman
   Google LLC
   Brandschenkestrasse 110
   CH-8002 Zürich
   Switzerland
   Email: lizzi@google.com