Spider and Robot Exclusion
Internet robots are programs run on servers that scan through documents on the World Wide Web (WWW). They work by starting at a given page and then following the links on that page to other pages and sites. Because these robots operate without specific instructions about which pages to visit, they are also called spiders or wanderers. The process of wandering across WWW pages is sometimes called crawling, because the program works its way through pages by following hyperlinks. WWW spiders and robots exist for varying reasons. Typically their purpose is to gather information from the World Wide Web, either information about the pages themselves or information from their content. Some spiders, such as those used by search engines, gather both.
Robots often visit a huge number of WWW servers as they crawl the web. Sometimes, however, they are not wanted or welcome on a particular server. This can happen for a wide variety of reasons, but a couple are especially common. Sometimes a robot's behavior makes it unwelcome: some robots bombard a server with too many requests at once, slowing down page loading for human users, while others request the same file repeatedly, which is redundant and can also slow the server down. Other times, the server contains documents and pages that robots have no need to visit, such as relatively private information or deep files that are only used by the server behind the scenes. The problem of robots reaching unwanted parts of servers led to a standard for excluding them, and the standard solution is both simple and powerful.
Robots are excluded from a server through a file that contains the server's access policy. This file is named "robots.txt" and must be located in the web server's public root directory. The name was chosen because "robots.txt" fits the file naming conventions of all common operating systems. It is also easy to install on a server, easy to remember, and unlikely to interfere with existing WWW server operations.
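Because the file always sits at the top of the server, a robot can find it from any page URL by keeping only the scheme and host and appending "/robots.txt". The short Python sketch below shows the idea; the page URL is made up for illustration.

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(page_url):
        # The exclusion file lives at the server's root, so only the
        # scheme and host of the page URL are kept.
        parts = urlsplit(page_url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    print(robots_txt_url("http://www.example.com/about/index.html"))
    # prints: http://www.example.com/robots.txt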
The contents of the "robots.txt" file must follow a certain format. The file contains records separated by one or more blank lines. Comments are written the same way as in UNIX shells: a "#" character causes the rest of that line to be ignored by a robot. Each record consists of one or more User-agent lines followed by Disallow lines; any header a robot does not recognize is simply ignored. A User-agent line supplies the name of the robot whose access restrictions are being specified, and if more than one User-agent is listed in a record, the record applies to all of the robots named. To give rules for any robot that is not named elsewhere, the value "*" is used; such a record is allowed only once in the "robots.txt" file.
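As an illustration, a small "robots.txt" might look like the following. The first record applies to the two robots it names, and the second record, whose User-agent is "*", is the default policy for every other robot. The robot names and paths here are invented for the example.

    # Keep these two crawlers out of the temporary area
    User-agent: ExampleBot
    User-agent: OtherBot
    Disallow: /tmp/

    # Default policy for all other robots
    User-agent: *
    Disallow: /private/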
The final piece of the "robots.txt" file is the Disallow field. The value of this field is a partial URL that the robot may not visit. It can be a full path or just the beginning of one: any URL whose path starts with the given value is off limits. For example, Disallow: /about/index.html disallows only that page, while Disallow: /about blocks any path that begins with /about, including every file in the /about/ directory. An empty Disallow value, or an empty "robots.txt" file, means the robot may access any location on the server.
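A robot written in Python can apply such rules with the standard library's urllib.robotparser module. The sketch below parses a policy built from the Disallow example in the text; the robot name and URLs are assumptions made for illustration. A real crawler would instead call set_url() and read() to download the live file from the server.

    import urllib.robotparser

    # A hypothetical policy using the Disallow example from the text
    policy = [
        "User-agent: *",
        "Disallow: /about",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(policy)

    # can_fetch() reports whether the named robot may visit a given URL
    print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))        # True
    print(parser.can_fetch("ExampleBot", "http://www.example.com/about/index.html"))  # False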
Written By: Edson Farnell