module HTML5
Overview
The HTML5 module implements an HTML5-compliant Tokenizer and Parser.
The relevant specifications include:
https://html.spec.whatwg.org/multipage/syntax.html and
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
Tokenization is done by creating a Tokenizer for an IO. It is the caller's
responsibility to ensure that the provided IO yields UTF-8 encoded HTML.
The tokenization algorithm implemented by this package is not a line-by-line
transliteration of the relatively verbose state-machine in the WHATWG
specification. A more direct approach is used instead, where the program
counter implies the state, such as whether it is tokenizing a tag or a text
node. Specification compliance is verified by checking expected and actual
outputs over a test suite rather than aiming for algorithmic fidelity.
Parsing is done by calling HTML5.parse with either a String containing HTML
or an IO instance. HTML5.parse returns the document root as an HTML5::Node
instance.
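As a minimal sketch of the two entry points described above, the following parses the same markup once from a String and once from an IO. It assumes the shard is installed and loadable as `require "html5"` (inferred from the `html5.cr` file name, not confirmed here), and it passes no options because the accepted `**opts` keys are not documented in this section.

```crystal
# Assumption: the shard that defines this module is available as "html5".
require "html5"

html = "<p>Hello, <b>world</b>!</p>"

# Parse from a String containing HTML.
doc_from_string = HTML5.parse(html)

# Parse from an IO. The caller is responsible for the IO yielding UTF-8.
doc_from_io = HTML5.parse(IO::Memory.new(html))

# Both calls are documented to return the document root as an HTML5::Node.
puts doc_from_string.is_a?(HTML5::Node)
puts doc_from_io.is_a?(HTML5::Node)
```

`IO::Memory` is used only to keep the sketch self-contained; any other IO, such as an opened file, should be usable in the same position.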
Extended Modules
Defined in:
html5.cr
html5/const.cr
html5/css/selector.cr
html5/doctype.cr
html5/entity.cr
html5/escape.cr
html5/foreign.cr
html5/insertion_mode.cr
html5/node.cr
html5/parser.cr
html5/token.cr
html5/xpath/xpath.cr
Constant Summary
- VERSION = "0.5.0"
Class Method Summary
- .parse(io : IO, **opts)
  parse returns the parse tree for the HTML from the given io as an HTML5::Node.
- .parse(html : String, **opts)
  parse returns the parse tree for the HTML from the given html string as an HTML5::Node.
- .parse_fragment(io : IO, context : Node | Nil = nil, **opts)
  parse_fragment parses a fragment of HTML5 and returns the nodes that were found.
- .parse_fragment(html : String, context : Node | Nil = nil, **opts)
- .reparent_children(dst, src : Node)
  Reparents all of src's child nodes to dst.
Instance Method Summary
- #escape_string(s : String) : String
  escape_string escapes special characters like "<" to become "&lt;".
- #unescape_string(s : String) : String
  unescape_string unescapes entities like "&lt;" to become "<".
Class Method Detail
parse returns the parse tree for the HTML from the given io as an HTML5::Node.
It implements the HTML5 parsing algorithm
(https://html.spec.whatwg.org/multipage/syntax.html#tree-construction),
which is very complicated. The resultant tree can contain implicitly created
nodes that have no explicit tag in the input.
The input is assumed to be UTF-8 encoded.
parse returns the parse tree for the HTML from the given html string as an HTML5::Node.
parse_fragment parses a fragment of HTML5 and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.
It has the same intricacies as HTML5.parse.
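A short sketch of calling parse_fragment without a context element follows. It assumes the shard loads as `require "html5"` and that the returned collection of nodes responds to `size` (for example, an Array of HTML5::Node); neither detail is confirmed by this section. Because the same tree-construction rules apply as for HTML5.parse, the returned nodes may include implicitly created elements, so no particular count is assumed.

```crystal
# Assumption: the shard that defines this module is available as "html5".
require "html5"

fragment = "<li>one</li><li>two</li>"

# context defaults to nil, so no enclosing element is assumed.
nodes_from_string = HTML5.parse_fragment(fragment)

# The IO overload takes the same arguments.
nodes_from_io = HTML5.parse_fragment(IO::Memory.new(fragment))

# Assumption: the result behaves like an Array and so responds to #size.
puts nodes_from_string.size
puts nodes_from_io.size
```

When the fragment is the InnerHTML of an existing element, that element would be passed as the second argument instead of relying on the nil default.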
reparent_children reparents all of src's child nodes to dst.
Instance Method Detail
escape_string escapes special characters like "<" to become "&lt;". It
escapes only five such characters: <, >, &, ' and ".
unescape_string(escape_string(s)) == s always holds, but the converse isn't
always true.
unescape_string unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than escape_string escapes. For example, "&aacute;" unescapes to "á", as do "&#225;" and "&#xE1;". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.
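The round-trip property stated above can be checked with a small sketch. It assumes the shard loads as `require "html5"` and that the module extends itself, so the two methods are callable as `HTML5.escape_string` and `HTML5.unescape_string`; the sample strings are illustrative only.

```crystal
# Assumption: the shard that defines this module is available as "html5",
# and the module methods are callable directly on HTML5.
require "html5"

original = %(<a href="x">Tom & Jerry's</a>)

# Escaping followed by unescaping is documented to return the input.
escaped = HTML5.escape_string(original)
round_trip = HTML5.unescape_string(escaped)
puts round_trip == original

# The converse direction: unescape first, then escape again.
# A named entity for a character outside the five escaped ones
# is not expected to be restored by escape_string.
entity = "&aacute;"
decoded = HTML5.unescape_string(entity)
puts HTML5.escape_string(decoded) == entity
```

The second comparison illustrates why the converse is not guaranteed: unescape_string accepts more entities than escape_string ever produces.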