The Robots Exclusion Protocol is one of the critical components of the web today. Commonly known as robots.txt, it is a standard that allows websites to regulate the behavior of web robots when crawling and indexing their pages.
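As a brief illustration, a typical robots.txt file pairs a User-agent line with one or more path rules. The file below is a hypothetical example, not one quoted from Google or the draft:

```
# Applies to all crawlers
User-agent: *
Disallow: /private/
Allow: /private/public/

# Applies only to a crawler identifying as "ExampleBot"
User-agent: ExampleBot
Disallow: /
```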
However, 25 years after Dutch software engineer Martijn Koster wrote the first standard for web crawlers in 1994, robots.txt remains an ‘unofficial’ Internet standard. This has led to differing interpretations of the protocol over time.
This posed a significant challenge to website owners, since conflicting interpretations of the REP made it difficult to write rules correctly. To eliminate this problem, Google now wants to formalize the specification for robots.txt.
In a statement, Google said:
“We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we’ve documented how the REP is used on the modern web, and submitted it to the IETF.”
Formalizing Robots Exclusion Protocol
According to Google, the proposal it submitted to the Internet Engineering Task Force (IETF) reflects more than 20 years of experience using the Robots Exclusion Protocol.
The company said that it doesn’t change the fundamental rules written by Koster in 1994. Instead, the draft defines previously undefined scenarios for robots.txt parsing and matching, including:
- Any URI-based transfer protocol can use robots.txt. It is no longer limited to HTTP and can be used for FTP or CoAP as well.
- Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not kept open for too long, alleviating unnecessary strain on servers.
- A maximum caching time of 24 hours (or the cache directive value, if available) gives website owners the flexibility to update their robots.txt whenever they want, while ensuring crawlers don’t overload websites with robots.txt requests. In the case of HTTP, for example, Cache-Control headers could be used to determine the caching time.
- The specification now provides that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
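To make the parsing and matching rules above concrete, here is a minimal sketch of how a crawler might apply two of them: truncating input at 500 KiB and resolving Allow/Disallow rules. This is an illustrative simplification, not Google's parser; the longest-match-wins tie-breaking shown is a common parser behavior and an assumption, not a quote from the draft.

```python
MAX_ROBOTS_SIZE = 500 * 1024  # draft: parse at least the first 500 kibibytes


def parse_robots(raw: bytes, user_agent: str = "*") -> list:
    """Return (directive, path) rules that apply to the given user agent."""
    text = raw[:MAX_ROBOTS_SIZE].decode("utf-8", errors="replace")
    rules, applies = [], False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = value == "*" or value.lower() == user_agent.lower()
        elif applies and field in ("allow", "disallow"):
            rules.append((field, value))
    return rules


def is_allowed(rules, path: str) -> bool:
    """Longest matching rule wins; paths with no matching rule are allowed."""
    best_len, allowed = -1, True
    for directive, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            if len(rule_path) > best_len:
                best_len, allowed = len(rule_path), directive == "allow"
    return allowed
```

For example, with a file containing `Disallow: /private/` and `Allow: /private/public/`, a request for `/private/public/page.html` matches both rules, and the longer Allow rule wins.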
The draft also includes an updated augmented Backus–Naur form (ABNF) to better define the REP syntax. Google said the draft was uploaded to the IETF so it could receive feedback from developers “who care about the basic building blocks of the internet.”
If you have any comments or suggestions regarding Google’s move to formalize the Robots Exclusion Protocol, you may share them via Twitter or through the Webmaster Community.