class App::Parser::HTML::Basic
- App::Parser::HTML::Basic
- App::Parser::HTML
- App::Parser
- Reference
- Object
Overview
Basic HTML parser.
The parser is not aware of HTML syntax; it only follows basic HTML parsing rules. Because it does not qualify as a true HTML parser, its low-level functions are defined locally in this class rather than in the App::Parser::HTML base class.
The implementation uses two or three LibC functions. This is not necessary; it was an experiment to see how well it would work.
Defined in:
app/parser/html/basic.cr
Constant Summary
- DATE_FORMATS =
  {
    {Regex.new("(?<year>\\d{4})([-\\/])(?<month>\\d{2})\\2(?<day>\\d{2})"), nil},
    {Regex.new("(?<month>[A-Z][a-z]{2}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%b-%d"},
    {Regex.new("(?<month>[A-Z][a-z]{3,}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%^b-%d"}
  }
  Regexes for the known/supported date formats, each paired with an optional format string. Keep the entries in order of decreasing usage/occurrence in practice.
- TYPE_JSON =
  Regex.new("([\"'])application/(?:ld+)?json\\1")
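As a rough illustration of how the first DATE_FORMATS entry could be applied, the sketch below matches the numeric pattern and builds a Time from its named captures. The constant NUMERIC_DATE and the helper first_numeric_date are hypothetical names introduced only for this example; how the class actually consumes DATE_FORMATS is not shown on this page.

```crystal
# Copy of the first DATE_FORMATS pattern, repeated here so the example is self-contained.
NUMERIC_DATE = Regex.new("(?<year>\\d{4})([-\\/])(?<month>\\d{2})\\2(?<day>\\d{2})")

# Hypothetical helper (not part of the documented API): returns the first
# numeric date found in `text` as a UTC Time, or nil when nothing matches.
# Time.utc raises ArgumentError for impossible dates such as month 13.
def first_numeric_date(text : String) : Time?
  md = NUMERIC_DATE.match(text)
  return nil unless md
  Time.utc(md["year"].to_i, md["month"].to_i, md["day"].to_i)
end

# Matches: both separators are "-".
p first_numeric_date(%q{<meta content="2021-03-04T10:00:00Z">})

# Does not match: the backreference \2 requires the same separator twice.
p first_numeric_date("published 2021-03/04") # => nil
```

The backreference is what rejects mixed separators such as 2021-03/04, so only consistently written dates are accepted.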
Class Method Summary
- .find_closed(body, tags, from, force_container? = false)
  Finds byte offset of first /tags found within body.
- .find_open(body, tags, from)
  Finds byte offset of first tags found within body.
- .find_range(body, tags_open, tags_closed, from = 0, force_container? = false)
  Returns byte range [b,e] in body of the content between the first tags_open and tags_closed pair found.
- .get_range(body, tags_open, tags_closed, offset = 0, force_container? = false)
  Finds any one of tags_open and tags_closed pairs and returns the string contained between them.
Instance Method Summary
- #find_first_date_string(body)
- #parse_head(body)
  Identifies offset for ... and tries to find published date in it.
- #parse_scripts(body)
- #run_strategy(body)
  Implements our default strategy for identifying the publish (or equivalent) date in pages.
- #worker
  Implements parse worker.
Instance methods inherited from class App::Parser
run, worker
Constructor methods inherited from class App::Parser
new(processor : App::Processor, parse_tasks : Channel(NamedTuple(idx: Int32, url: String, title: String, gt: Time::Span, response: HTTP::Client::Response)), capacity)
Class Method Detail
.find_closed(body, tags, from, force_container? = false)
Finds the byte offset of the first /tags found within body. from controls the starting offset; it is usually set to the result of find_open() + 1.
str = %q{  <body bgcolor="black">...</body>}
p! find_closed(str, {"</body", "</BODY"}, 3, true) # => 27
.find_open(body, tags, from)
Finds the byte offset of the first tags found within body. from controls the starting offset.
str = %q{  <body bgcolor="black">...</body>}
p! find_open(str, {"<body", "<BODY"}, 0) # => 2
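The search described above can be approximated in plain Crystal. The following is only a sketch under stated assumptions: it uses String#byte_index instead of the LibC calls mentioned in the overview, the name sketch_find_tag is hypothetical, and the force_container? flag is not modelled because its semantics are not documented here.

```crystal
# Hypothetical stand-in for .find_open / .find_closed: returns the smallest
# byte offset, at or after `from`, at which any of the given tag spellings
# occurs in `body`, or nil when none of them is present.
def sketch_find_tag(body : String, tags, from : Int32 = 0) : Int32?
  # byte_index returns nil for a missing tag; compact_map drops those.
  tags.compact_map { |tag| body.byte_index(tag, from) }.min?
end

html = "....<head anything>....</head>"

open_at = sketch_find_tag(html, {"<head", "<HEAD"})
p open_at # => 4

# Start the closing-tag search just past the opening tag's offset.
if open_at
  p sketch_find_tag(html, {"</head", "</HEAD"}, open_at + 1) # => 23
end

p sketch_find_tag(html, ["<nothing"]) # => nil
```

Passing both spellings of a tag lets the search stay byte-based without first building an upcased or downcased copy of the body.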
.find_range(body, tags_open, tags_closed, from = 0, force_container? = false)
Returns the byte range [b,e] in body of the content between the first tags_open and tags_closed pair found.
While tags_* are arrays and can be used to search for different tags, their main purpose is to cover the lowercase and uppercase versions of the same tag. This is faster than invoking upcase/downcase in program code.
The range is byte-based and includes both the HTML tag arguments (if any) and the body (if any).
html = "....<head anything>....</head>"
find_range(html, {"<head", "<HEAD"}, {"</head", "</HEAD"}, 0, true) # => {4,23}
find_range(html, ["<nothing"], ["</nothing"]) # => nil
.get_range(body, tags_open, tags_closed, offset = 0, force_container? = false)
Finds any one of tags_open and tags_closed pairs and returns the string contained between them.
This method is similar to .find_range, but instead of returning the byte offsets it returns the String contained in them.
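The relationship between the two methods might be sketched as follows. Both sketch_find_range and sketch_get_range are hypothetical helpers written for this example; they ignore force_container?, and they assume the returned string spans exactly the byte range [b,e), which this page does not confirm.

```crystal
# Hypothetical stand-in for .find_range: byte offsets of the first opening
# tag and of the first closing tag after it, or nil when either is missing.
def sketch_find_range(body : String, tags_open, tags_closed, from : Int32 = 0) : Tuple(Int32, Int32)?
  b = tags_open.compact_map { |t| body.byte_index(t, from) }.min?
  return nil unless b
  e = tags_closed.compact_map { |t| body.byte_index(t, b + 1) }.min?
  return nil unless e
  {b, e}
end

# Hypothetical stand-in for .get_range: same search, but returns the bytes
# from b up to (not including) e as a String. Assumption: the real method
# may trim the opening tag differently.
def sketch_get_range(body : String, tags_open, tags_closed, offset : Int32 = 0) : String?
  range = sketch_find_range(body, tags_open, tags_closed, offset)
  return nil unless range
  b, e = range
  body.byte_slice(b, e - b)
end

html = "....<head anything>....</head>"
p sketch_find_range(html, {"<head", "<HEAD"}, {"</head", "</HEAD"}) # => {4, 23}
p sketch_get_range(html, {"<head", "<HEAD"}, {"</head", "</HEAD"})  # => "<head anything>...."
p sketch_find_range(html, ["<nothing"], ["</nothing"])              # => nil
```

Because the offsets are byte-based, byte_slice is used rather than character indexing, which keeps the result correct for bodies containing multi-byte characters.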
Instance Method Detail
#run_strategy(body)
Implements our default strategy for identifying the publish (or equivalent) date in pages.