class Robots
Robots < Reference < Object
Overview
Parses robots.txt files for the perusal of a single user-agent.
The behaviour implemented is guided by the following sources, though as there is no widely accepted standard, it may differ from other implementations. If you consider its behaviour to be in error, please contact the author.
- http://www.robotstxt.org/orig.html - the original, now imprecise and outdated, version
- http://www.robotstxt.org/norobots-rfc.txt - a much more precise, but also outdated, version
- http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237 - a few hints at modern protocol extensions
This parser only considers lines starting with (case-insensitively): Useragent:, User-agent:, Allow:, Disallow:, Sitemap:
The file is divided into sections, each of which contains one or more User-agent: lines, followed by one or more Allow: or Disallow: rules.
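For illustration, a hypothetical file using only the recognised directives (the paths and hostname are placeholders) might look like this:

```
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/
Allow: /private/help

User-agent: *
Disallow: /

Sitemap: http://example.com/sitemap.xml
```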
The first section containing a User-agent: line that matches the robot's user-agent is the only section relevant to that robot. Sections are checked in the order in which they appear in the file.
(The * character is taken to mean "any number of any characters" during matching of user-agents)
Within that section, the first Allow: or Disallow: rule that matches the path is taken as authoritative. If no rule in a section matches, access is Allowed.
(The order of matching is as in the RFC; Google matches all Allows and then all Disallows, while Bing applies the most specific rule. Other interpretations doubtless exist.)
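The selection logic described above can be sketched as follows. This is written in Ruby (whose syntax is very close to Crystal's); the method names and data layout are illustrative only and are not this library's API.

```ruby
# Does a User-agent: pattern match this robot's user-agent?
# "*" matches any number of any characters; comparison is case-insensitive.
def agent_matches?(pattern, agent)
  source = pattern.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new(source, Regexp::IGNORECASE).match?(agent)
end

# Does a rule pattern match the path? Here simplified to an
# anchored prefix match with "*" wildcards.
def path_matches?(pattern, path)
  source = pattern.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + source).match?(path)
end

# sections: array of [agent_patterns, rules] pairs, in file order;
# rules: array of [:allow or :disallow, pattern] pairs, in file order.
def allowed?(sections, agent, path)
  # Only the first section whose User-agent matches is relevant.
  section = sections.find { |agents, _| agents.any? { |a| agent_matches?(a, agent) } }
  return true unless section

  # The first matching rule in that section is authoritative;
  # if nothing matches, access defaults to Allowed.
  _, rules = section
  verdict, _ = rules.find { |_, pattern| path_matches?(pattern, path) }
  verdict.nil? || verdict == :allow
end
```

For example, with a Googlebot-specific section followed by a catch-all `*` section, Googlebot is judged only by its own section even if the catch-all is stricter.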
When matching URLs, all % encodings are normalised (except for /?=&, which have meaning), and "*" matches any number of any character.
If a pattern ends with a $, then the pattern must match the entire path, or the entire path with query string.
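A minimal sketch of this matching step, again in Ruby as a stand-in for Crystal and not this library's actual code: percent escapes are decoded unless the decoded character is one of /?=&, and a rule pattern is compiled to a regex where "*" becomes ".*" and a trailing "$" anchors the match at the end of the string (matching against the path with the query string appended is omitted here for brevity).

```ruby
# Decode %XX escapes, except those whose decoded character is one of
# /?=& (which carry meaning); those are kept encoded, uppercased.
def normalize(path)
  path.gsub(/%([0-9A-Fa-f]{2})/) do
    ch = $1.to_i(16).chr
    "/?=&".include?(ch) ? "%#{$1.upcase}" : ch
  end
end

# Compile a rule pattern into a regex: anchored at the start (prefix
# match), "*" matching anything, and a trailing "$" anchoring the end.
def rule_to_regex(pattern)
  anchored = pattern.end_with?("$")
  core = anchored ? pattern[0..-2] : pattern
  source = core.split("*", -1).map { |part| Regexp.escape(normalize(part)) }.join(".*")
  Regexp.new("\\A" + source + (anchored ? "\\z" : ""))
end
```

So `/fish` matches `/fishheads` (prefix), `/fish$` does not, and `/fish*.php$` matches `/fishheads/catfish.php`.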
Defined in:
robots.cr
robots/version.cr
Constant Summary
- VERSION = "0.1.0"
Constructors
Instance Method Summary
-
#allowed?(uri)
Given a URI object, or a string representing one, determine whether this robots.txt would allow access to the path.
- #body : String
- #rules : Array(Tuple(String, Array(Rule)))
- #sitemaps : Array(String)
- #user_agent : String
Constructor Detail
Instance Method Detail
#allowed?(uri)
Given a URI object, or a string representing one, determine whether this robots.txt would allow access to the path.