module HTML5

Overview

The HTML5 module implements an HTML5-compliant Tokenizer and Parser. The relevant specifications are https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization.

Tokenization is done by creating a Tokenizer for an IO. It is the caller's responsibility to ensure that the provided IO supplies UTF-8 encoded HTML. The tokenization algorithm implemented by this package is not a line-by-line transliteration of the relatively verbose state machine in the WHATWG specification. A more direct approach is used instead, where the program counter implies the state, such as whether it is tokenizing a tag or a text node. Specification compliance is verified by checking expected and actual outputs over a test suite rather than by aiming for algorithmic fidelity.

Parsing is done by calling HTML5.parse with either a String containing HTML or an IO instance. HTML5.parse returns the document root as an HTML5::Node instance.
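As a minimal sketch of the two entry points described above, the following parses the same markup once from a String and once from an IO. The `require "html5"` path and the sample markup are assumptions for illustration; only `HTML5.parse` and `HTML5::Node` come from this page.

```crystal
require "html5" # assumed require path for this shard

# Illustrative markup; any UTF-8 encoded HTML would do.
html = "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"

# Parse from a String.
doc_from_string = HTML5.parse(html)

# Parse from an IO. The caller is responsible for supplying UTF-8 data.
doc_from_io = HTML5.parse(IO::Memory.new(html))

# Both calls return the document root as an HTML5::Node.
puts doc_from_string.class
puts doc_from_io.class
```

Wrapping the string in an `IO::Memory` is just a convenient stand-in here for a file or socket.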

Extended Modules

Defined in:

html5.cr
html5/const.cr
html5/css/selector.cr
html5/doctype.cr
html5/entity.cr
html5/escape.cr
html5/foreign.cr
html5/insertion_mode.cr
html5/node.cr
html5/parser.cr
html5/token.cr
html5/xpath/xpath.cr

Constant Summary

VERSION = "0.5.0"

Class Method Detail

def self.parse(io : IO, **opts)

parse returns the parse tree for the HTML read from the given io as an HTML5::Node.

It implements the HTML5 parsing algorithm (https://html.spec.whatwg.org/multipage/syntax.html#tree-construction), which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in the passed io's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in the passed io's data can be silently dropped, with no corresponding node in the resulting tree.

The input is assumed to be UTF-8 encoded.
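The implicit-node behavior can be observed by parsing markup that names only a single element and then walking the resulting tree. This is a sketch, not a reference: `dump` is a hypothetical helper, and the `data`, `first_child`, and `next_sibling` accessors on HTML5::Node are assumptions that do not appear on this page.

```crystal
require "html5" # assumed require path for this shard

# Hypothetical helper: prints each node's data, indented by depth.
# Assumes HTML5::Node exposes data, first_child and next_sibling.
def dump(node : HTML5::Node, depth = 0)
  puts "#{"  " * depth}#{node.data.inspect}"
  child = node.first_child
  while child
    dump(child, depth + 1)
    child = child.next_sibling
  end
end

# The input mentions only a <p> element and never closes it.
doc = HTML5.parse("<p>Hello")
dump(doc)
```

Under the HTML5 tree-construction rules, the printed tree would be expected to include html, head, and body elements even though the input never lists them.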


def self.parse(html : String, **opts)

parse returns the parse tree for the HTML in the given html string as an HTML5::Node.


def self.parse_fragment(io : IO, context : Node | Nil = nil, **opts)

parse_fragment parses a fragment of HTML5 and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.

It has the same intricacies as HTML5.parse.
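A minimal sketch of fragment parsing, using the String overload with the default nil context. It assumes the returned collection is enumerable and that each node exposes a `data` accessor; neither detail is stated on this page. The nodes actually returned depend on the context, so no output is shown.

```crystal
require "html5" # assumed require path for this shard

# Parse a fragment without a context element (context defaults to nil).
nodes = HTML5.parse_fragment("<p>one</p><p>two</p>")

# Assumes the result is enumerable and that nodes expose data.
nodes.each do |node|
  puts node.data
end
```

When the fragment is the InnerHTML of an existing element, that element would be passed as the second argument instead.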


def self.parse_fragment(html : String, context : Node | Nil = nil, **opts)

def self.reparent_children(dst, src : Node)

reparent_children reparents all of src's child nodes to dst.



Instance Method Detail

def escape_string(s : String) : String

escape_string escapes special characters like "<" to become "&lt;". It escapes only five such characters: <, >, &, ' and ". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.


def unescape_string(s : String) : String

unescape_string unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than escape_string escapes. For example, "&aacute;" unescapes to "á", as does "&#225;" and "&#xE1;". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.
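The round-trip property and its failing converse can be sketched as follows. The example assumes the two methods are callable on the module itself (as the Extended Modules section suggests) and that the shard is required as `html5`; the sample strings are illustrative.

```crystal
require "html5" # assumed require path for this shard

# Illustrative input containing all five escaped characters.
raw = %(<a href="x">Tom & Jerry's</a>)

escaped = HTML5.escape_string(raw)
puts escaped

# Documented guarantee: unescaping an escaped string restores the original.
puts HTML5.unescape_string(escaped) == raw

# The converse can fail: a named entity decodes to a character that
# escape_string leaves alone, so re-escaping does not restore the entity.
decoded = HTML5.unescape_string("&aacute;")
puts HTML5.escape_string(decoded) == "&aacute;"
```

Per the descriptions above, the first comparison should hold for any input, while the second is expected to fail because "á" is not one of the five characters that escape_string rewrites.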

