module HTML5
Overview
The HTML5 module implements an HTML5-compliant Tokenizer and Parser.
The relevant specifications include:
https://html.spec.whatwg.org/multipage/syntax.html and
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
Tokenization is done by creating a Tokenizer for an IO. It is the caller's
responsibility to ensure that the provided IO yields UTF-8 encoded HTML.
The tokenization algorithm implemented by this package is not a line-by-line
transliteration of the relatively verbose state-machine in the WHATWG
specification. A more direct approach is used instead, where the program
counter implies the state, such as whether it is tokenizing a tag or a text
node. Specification compliance is verified by checking expected and actual
outputs over a test suite rather than aiming for algorithmic fidelity.
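The idea that the program counter implies the state can be illustrated with a toy scanner. The sketch below is not this module's tokenizer: `toy_tokens` and its tuple output are hypothetical names made up for illustration, and it ignores attributes, comments, entities, and every other detail of real HTML. It only shows that "inside a tag" versus "inside text" can be encoded by which branch is executing rather than by a state variable.

```crystal
# Toy illustration only; NOT the HTML5 module's implementation.
# Splits input into {:tag, name} and {:text, content} pairs.
def toy_tokens(src : String) : Array(Tuple(Symbol, String))
  tokens = [] of Tuple(Symbol, String)
  pos = 0
  while pos < src.size
    if src[pos] == '<'
      # Being in this branch *is* the "tokenizing a tag" state.
      close = src.index('>', pos) || src.size
      tokens << {:tag, src[pos + 1...close]}
      pos = close + 1
    else
      # Being in this branch *is* the "tokenizing text" state.
      open = src.index('<', pos) || src.size
      tokens << {:text, src[pos...open]}
      pos = open
    end
  end
  tokens
end

toy_tokens("<p>hi</p>").each do |kind, text|
  puts "#{kind}: #{text}"
end
```

An explicit state machine would instead keep a state variable and dispatch on it for every character; the control-flow version trades that uniformity for shorter, more direct code.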
Parsing is done by calling HTML5.parse with either a String containing HTML
or an IO instance. HTML5.parse returns the document root as an HTML5::Node instance.
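A minimal sketch of both entry points follows. It assumes the shard is loaded with `require "html5"`, which is inferred from the html5.cr source file name rather than stated in this page, and it inspects only the class of the result because the Node API is documented elsewhere.

```crystal
require "html5"

html = "<p>Hello, <b>world</b>!</p>"

# Parse from a String.
doc = HTML5.parse(html)

# Parse the same markup from an IO; the caller must supply UTF-8 bytes.
doc_from_io = HTML5.parse(IO::Memory.new(html))

# Both calls are documented to return the document root as an HTML5::Node.
puts doc.class
puts doc_from_io.class
```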
Extended Modules
Defined in:
html5.cr
html5/const.cr
html5/css/selector.cr
html5/doctype.cr
html5/entity.cr
html5/escape.cr
html5/foreign.cr
html5/insertion_mode.cr
html5/node.cr
html5/parser.cr
html5/streaming.cr
html5/token.cr
html5/xpath/xpath.cr
Constant Summary
-
VERSION =
{{ (`shards version \"/srv/crystaldoc.info/github-naqvis-crystal-html5-v0.7.0/src\"`).chomp.stringify.downcase }}
Class Method Summary
-
.each_token(io : IO, &block : Token -> ) : Nil
Iterates over each token in the HTML input without building a parse tree.
-
.each_token(html : String, &block : Token -> ) : Nil
Iterates over each token in the HTML string without building a parse tree.
-
.parse(io : IO, **opts)
parse returns the parse tree for the HTML from the given io into an
HTML5::Node.
-
.parse(html : String, **opts)
parse returns the parse tree for the HTML from the given html string into an
HTML5::Node.
-
.parse_fragment(io : IO, context : Node | Nil = nil, **opts)
parse_fragment parses a fragment of HTML5 and returns the nodes that were found.
-
.parse_fragment(html : String, context : Node | Nil = nil, **opts)
-
.reparent_children(dst, src : Node)
reparents all of src's child nodes to dst
-
.stream(io : IO, handler : StreamingHandler, **opts)
Parses HTML from an IO and emits SAX-style streaming events to the given handler.
-
.stream(html : String, handler : StreamingHandler, **opts)
Parses HTML from a String and emits SAX-style streaming events to the given handler.
-
.token_iterator(io : IO) : Iterator(Token)
Returns an Iterator over the tokens in the HTML input.
-
.token_iterator(html : String) : Iterator(Token)
Returns an Iterator over the tokens in the HTML string.
Instance Method Summary
-
#escape_string(s : String) : String
escape_string escapes special characters like "<" to become "&lt;".
-
#unescape_string(s : String) : String
unescape_string unescapes entities like "&lt;" to become "<".
Class Method Detail
Iterates over each token in the HTML input without building a parse tree.
This is the lightest-weight streaming option: it tokenizes the HTML and yields
each Token as it is produced. No tree is built and no memory is accumulated beyond
the tokenizer's internal buffer, so it runs in constant memory regardless of input size.
Tokens are yielded in document order: start tags, text, end tags, comments, doctypes. Because no tree construction takes place, the token stream reflects the raw markup. It does not include the implicit tags or tree corrections that the full parser would apply.
Example
# Extract all text content
HTML5.each_token(io) do |token|
  print token.data if token.type.text?
end

# Find all image sources
HTML5.each_token(html_string) do |token|
  if token.type.start_tag? && token.data == "img"
    token.attr.each do |a|
      puts a.val if a.key == "src"
    end
  end
end
Iterates over each token in the HTML string without building a parse tree.
parse returns the parse tree for the HTML from the given io into an HTML5::Node.
It implements the HTML5 parsing algorithm
https://html.spec.whatwg.org/multipage/syntax.html#tree-construction,
which is very complicated. The resultant tree can contain implicitly created
nodes that have no explicit tag in the input.
The input is assumed to be UTF-8 encoded.
parse returns the parse tree for the HTML from the given html string into an HTML5::Node.
parse_fragment parses a fragment of HTML5 and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.
It has the same intricacies as HTML5.parse.
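A minimal sketch of a fragment parse without a context element is shown below. It assumes the shard is loaded with `require "html5"` and that the returned nodes form a collection that responds to `each`; neither detail is spelled out above. Passing a context element would additionally require a Node obtained elsewhere, for example from a previous parse.

```crystal
require "html5"

# No context element is passed, so the default (nil) context applies.
nodes = HTML5.parse_fragment("<li>one</li><li>two</li>")

# Assumption: the result is an enumerable collection of HTML5::Node.
nodes.each do |node|
  puts node.class
end
```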
reparents all of src's child nodes to dst
Parses HTML from an IO and emits SAX-style streaming events to the given handler.
The parser builds the full DOM tree internally (required for correct HTML5 parsing), but the handler receives events incrementally as nodes are constructed. This is useful for processing large documents where you want to react to elements as they appear without waiting for the entire document to be parsed.
Returns the complete document Node tree (same as HTML5.parse).
Example
class LinkExtractor
  include HTML5::StreamingHandler

  getter links = [] of String

  def on_element_open(tag : String, attrs : Array(HTML5::Attribute), namespace : String)
    if tag == "a"
      attrs.each do |attr|
        links << attr.val if attr.key == "href"
      end
    end
  end
end

extractor = LinkExtractor.new
doc = HTML5.stream(io, extractor)
puts extractor.links
Parses HTML from a String and emits SAX-style streaming events to the given handler.
Returns an Iterator over the tokens in the HTML input.
Example
HTML5.token_iterator(io).each do |token|
  puts token.data if token.type.start_tag?
end
Returns an Iterator over the tokens in the HTML string.
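Because the result is a standard Iterator(Token), it can be combined with the lazy Iterator methods from the Crystal standard library. The sketch below assumes the shard is loaded with `require "html5"` and that a start-tag token's data holds the tag name, as in the each_token example.

```crystal
require "html5"

html = "<ul><li>a</li><li>b</li><li>c</li></ul>"

# Lazily collect the names of the first two start tags. Tokens are pulled
# on demand, so nothing past the second match needs to be consumed.
names = HTML5.token_iterator(html)
  .select(&.type.start_tag?)
  .map(&.data)
  .first(2)
  .to_a

puts names.inspect
```

Unlike each_token, which pushes every token into a block, the iterator form lets the consumer decide when to stop.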
Instance Method Detail
escape_string escapes special characters like "<" to become "&lt;". It
escapes only five such characters: <, >, &, ' and ".
unescape_string(escape_string(s)) == s always holds, but the converse isn't
always true.
unescape_string unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than escape_string escapes. For example, "&aacute;" unescapes to "á", as does "&#225;" and "&#xE1;". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.
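The round-trip property can be checked directly. The sketch below assumes the shard is loaded with `require "html5"` and that these instance methods are also callable on the HTML5 module itself (for example through `extend self`); this page does not state either detail explicitly.

```crystal
require "html5"

# Assumption: HTML5.escape_string / HTML5.unescape_string are callable
# on the module itself.
s = %(<a href="x">Tom & Jerry</a>)
escaped = HTML5.escape_string(s)
puts escaped

# Round trip: unescaping an escaped string restores the original.
round_trip_ok = HTML5.unescape_string(escaped) == s
puts round_trip_ok

# The converse can fail: a named entity unescapes to a character that
# escape_string does not rewrite, so the entity form is not restored.
entity = "&aacute;"
converse_ok = HTML5.escape_string(HTML5.unescape_string(entity)) == entity
puts converse_ok
```

Going by the documented behaviour, the first comparison should hold and the second should not, since escape_string only rewrites <, >, &, ' and ".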