class App::Parser::HTML::Basic

Overview

Basic HTML parser.

The parser is not HTML syntax aware; it only adheres to basic HTML parsing rules. Because it does not qualify as a true HTML parser, its low-level functions are defined locally here instead of in the App::Parser::HTML base class.

The implemented design uses 2 or 3 LibC functions. This is not necessary; it was an experiment to see how it would work.

Defined in:

app/parser/html/basic.cr

Constant Summary

DATE_FORMATS = {
  {Regex.new("(?<year>\\d{4})([-\\/])(?<month>\\d{2})\\2(?<day>\\d{2})"), nil},
  {Regex.new("(?<month>[A-Z][a-z]{2}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%b-%d"},
  {Regex.new("(?<month>[A-Z][a-z]{3,}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%^b-%d"}
}

Regexes for the known/supported date formats, each paired with an optional format string. Keep the entries in order of decreasing frequency of occurrence in practice.
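A minimal sketch of how such a table might be scanned, assuming the first matching entry wins and that callers read the named captures year, month and day. The helper name first_date_match is hypothetical and the table is copied into a local constant only so the snippet runs on its own; how the parser actually consumes the format string is not shown here.

```crystal
# Local copy of the table so the sketch is self-contained.
FORMATS = {
  {Regex.new("(?<year>\\d{4})([-\\/])(?<month>\\d{2})\\2(?<day>\\d{2})"), nil},
  {Regex.new("(?<month>[A-Z][a-z]{2}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%b-%d"},
  {Regex.new("(?<month>[A-Z][a-z]{3,}) (?<day>\\d{1,2}),? (?<year>\\d{4})"), "%Y-%^b-%d"},
}

# Hypothetical helper: tries each regex in order and returns the captures of
# the first one that matches, together with its optional format string.
def first_date_match(text : String)
  FORMATS.each do |entry|
    if md = entry[0].match(text)
      return {md["year"], md["month"], md["day"], entry[1]}
    end
  end
  nil
end

numeric = first_date_match("Published 2021-03-05")
named = first_date_match("Mar 5, 2021")
p numeric
p named
```

Because the entries are tried in order, placing the most common format first keeps the typical case to a single regex match.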

TYPE_JSON = Regex.new("([\"'])application/(?:ld+)?json\\1")

Class Method Summary

Instance Method Summary

Instance methods inherited from class App::Parser

run, worker

Constructor methods inherited from class App::Parser

new(processor : App::Processor, parse_tasks : Channel(NamedTuple(idx: Int32, url: String, title: String, gt: Time::Span, response: HTTP::Client::Response)), capacity)

Class Method Detail

def self.find_closed(body, tags, from, force_container? = false)

Finds the byte offset of the first closing tag from tags found within body. from controls the starting offset; it is usually set to the result of find_open() + 1.

str = %q{  <body bgcolor="black">...</body>}
p! find_closed(str, {"</body", "</BODY"}, 3, true) # => 27

def self.find_open(body, tags, from)

Finds the byte offset of the first of tags found within body. from controls the starting offset.

str = %q{  <body bgcolor="black">...</body>}
p! find_open(str, {"<body", "<BODY"}, 0) # => 2
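As a rough illustration of the kind of search described for find_open and find_closed, the sketch below uses the standard String#byte_index rather than LibC calls. The helper name first_tag_offset is hypothetical, and the force_container? behaviour is not modelled.

```crystal
# Hypothetical helper: smallest byte offset at which any of `tags` occurs in
# `body`, searching from byte offset `from`. Returns nil if none is found.
def first_tag_offset(body : String, tags, from : Int32 = 0) : Int32?
  tags.compact_map { |tag| body.byte_index(tag, from) }.min?
end

str = %q{  <body bgcolor="black">...</body>}
open_at = first_tag_offset(str, {"<body", "<BODY"})
close_at = first_tag_offset(str, {"</body", "</BODY"}, 3)
p open_at  # => 2
p close_at # => 27
```

Listing both case variants in the tuple avoids building an upcased or downcased copy of the whole body before searching.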

def self.find_range(body, tags_open, tags_closed, from = 0, force_container? = false)

Returns the byte range [b,e] in body spanned by the first tags_open and tags_closed pair found, or nil if no such pair is found.

Although tags_open and tags_closed are collections and can be used to search for different tags, their main purpose is to supply the lowercase and uppercase versions of the same tag. This is faster than invoking upcase/downcase in program code.

The range is byte-based and includes both the HTML tag's attributes (if any) and its body (if any).

html = "....<head anything>....</head>"
find_range(html, {"<head", "<HEAD"}, {"</head", "</HEAD"}, 0, true) # => {4,23}
find_range(html, ["<nothing"], ["</nothing"]) # => nil
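The behaviour shown in this example can be approximated with two byte searches. This is only a sketch under stated assumptions: sketch_range is a hypothetical name, it relies on String#byte_index instead of the LibC-based routines, and it ignores force_container?.

```crystal
# Hypothetical helper, not the library's implementation: returns
# {open_offset, close_offset} in bytes, or nil when either tag is missing.
def sketch_range(body : String, tags_open, tags_closed, from : Int32 = 0)
  b = tags_open.compact_map { |t| body.byte_index(t, from) }.min?
  return nil unless b
  start = b + 1 # search for the closing tag after the opening one
  e = tags_closed.compact_map { |t| body.byte_index(t, start) }.min?
  return nil unless e
  {b, e}
end

html = "....<head anything>....</head>"
found = sketch_range(html, {"<head", "<HEAD"}, {"</head", "</HEAD"})
missing = sketch_range(html, ["<nothing"], ["</nothing"])
p found   # => {4, 23}
p missing # => nil
```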

def self.get_range(body, tags_open, tags_closed, offset = 0, force_container? = false)

Finds the first pair formed by one of tags_open and one of tags_closed, and returns the string contained between them.

This method is similar to .find_range, but instead of returning the byte offsets it returns the String they delimit.
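The relation between the two methods can be pictured with String#byte_slice. The offsets below are the ones from the find_range example; whether get_range trims the opening tag from its result is not stated, so this sketch simply slices the raw byte range.

```crystal
html = "....<head anything>...."  + "</head>"
b, e = 4, 23 # byte offsets taken from the find_range example

# byte_slice takes a start offset and a byte count, not an end offset.
content = html.byte_slice(b, e - b)
p content # => "<head anything>...."
```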


Instance Method Detail

def find_first_date_string(body)

def parse_head(body)

Identifies offset for ... and tries to find published date in it.

def parse_scripts(body)

def run_strategy(body)

Implements the default strategy for identifying the publish (or equivalent) date in pages.

def worker

Implements the parse worker. Usually spawned in dedicated Fibers.
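A generic sketch of the fiber-per-worker pattern, assuming a buffered channel of tasks that is closed when no more work is coming. The channel element types and the byte-size "work" are placeholders chosen for illustration; the real worker receives the NamedTuple tasks listed in the constructor.

```crystal
tasks = Channel(String).new(4)   # placeholder task type
results = Channel(Int32).new(4)  # placeholder result type

# The worker fiber drains the task channel until it is closed and empty.
spawn do
  while body = tasks.receive?
    results.send(body.bytesize)
  end
  results.close
end

tasks.send("<html></html>")
tasks.close

size = results.receive
p size # => 13
```

Closing the task channel lets the worker loop end on its own, since receive? returns nil once the channel is closed and drained.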
