module HTML5

Overview

The HTML5 module implements an HTML5-compliant Tokenizer and Parser. The relevant specifications are https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization.

Tokenization is done by creating a Tokenizer for an IO. It is the caller's responsibility to ensure that the provided IO yields UTF-8 encoded HTML. The tokenization algorithm implemented by this package is not a line-by-line transliteration of the relatively verbose state machine in the WHATWG specification. A more direct approach is used instead, where the program counter implies the state, such as whether it is tokenizing a tag or a text node. Specification compliance is verified by checking expected and actual outputs over a test suite rather than by aiming for algorithmic fidelity.

Parsing is done by calling HTML5.parse with either a String containing HTML or an IO instance. HTML5.parse returns the document root as an HTML5::Node instance.
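
As a minimal sketch of the two entry points described above, the following parses the same markup once from a String and once from an IO. The sample markup and variable names are illustrative assumptions; only HTML5.parse and the standard library's IO::Memory are used.

```crystal
require "html5"

# Illustrative input; any UTF-8 encoded HTML would do.
html = "<!DOCTYPE html><html><body><p>Hello</p></body></html>"

# Parse from a String.
doc_from_string = HTML5.parse(html)

# Parse from an IO. The caller must ensure the bytes are UTF-8 encoded.
doc_from_io = HTML5.parse(IO::Memory.new(html))

# Both calls return the document root node.
puts doc_from_string.class
puts doc_from_io.class
```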

Extended Modules

Defined in:

html5.cr
html5/const.cr
html5/css/selector.cr
html5/doctype.cr
html5/entity.cr
html5/escape.cr
html5/foreign.cr
html5/insertion_mode.cr
html5/node.cr
html5/parser.cr
html5/streaming.cr
html5/token.cr
html5/xpath/xpath.cr

Constant Summary

VERSION = {{ (`shards version \"/srv/crystaldoc.info/github-naqvis-crystal-html5-v0.7.0/src\"`).chomp.stringify.downcase }}

Class Method Summary

Instance Method Summary

Class Method Detail

def self.each_token(io : IO, &block : Token -> ) : Nil

Iterates over each token in the HTML input without building a parse tree.

This is the lightest-weight streaming option: it tokenizes the HTML and yields each Token as it is produced. No tree is built, and no memory is accumulated beyond the tokenizer's internal buffer, so it runs in constant memory regardless of input size.

Tokens are yielded in document order: start tags, text, end tags, comments, doctypes. Because no tree construction takes place, the token stream reflects the raw markup. It does not include the implicit tags or tree corrections that the full parser would apply.

Example

# Extract all text content
HTML5.each_token(io) do |token|
  print token.data if token.type.text?
end
# Find all image sources
HTML5.each_token(html_string) do |token|
  if token.type.start_tag? && token.data == "img"
    token.attr.each do |a|
      puts a.val if a.key == "src"
    end
  end
end

def self.each_token(html : String, &block : Token -> ) : Nil

Iterates over each token in the HTML string without building a parse tree.
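
To illustrate that the token stream reflects the raw markup, here is a small sketch that collects start-tag names from deliberately incomplete HTML. The input string and the start_tags array are illustrative assumptions; the token accessors are the ones used in the example above.

```crystal
require "html5"

# Deliberately incomplete markup: no <html>, <head>, or <body>,
# and the <p> element is never closed.
html = "<p>Hello <b>world</b>"

start_tags = [] of String
HTML5.each_token(html) do |token|
  start_tags << token.data if token.type.start_tag?
end

# Only tags that literally appear in the input are reported;
# the tokenizer does not add the implicit elements a full parse would create.
puts start_tags.inspect
```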


def self.parse(io : IO, **opts)

parse reads HTML from the given io and returns its parse tree as an HTML5::Node.

It implements the HTML5 parsing algorithm https://html.spec.whatwg.org/multipage/syntax.html#tree-construction, which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in the passed io's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in the passed io's data can be silently dropped, with no corresponding node in the resulting tree.

The input is assumed to be UTF-8 encoded.


def self.parse(html : String, **opts)

parse returns the parse tree for the given html string as an HTML5::Node.


def self.parse_fragment(io : IO, context : Node | Nil = nil, **opts)

parse_fragment parses a fragment of HTML5 and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.

It has the same intricacies as HTML5.parse.


def self.parse_fragment(html : String, context : Node | Nil = nil, **opts)

def self.reparent_children(dst, src : Node)

reparent_children moves all of src's child nodes to dst.


def self.stream(io : IO, handler : StreamingHandler, **opts)

Parses HTML from an IO and emits SAX-style streaming events to the given handler.

The parser builds the full DOM tree internally (required for correct HTML5 parsing), but the handler receives events incrementally as nodes are constructed. This is useful for processing large documents where you want to react to elements as they appear without waiting for the entire document to be parsed.

Returns the complete document Node tree (same as HTML5.parse).

Example

class LinkExtractor
  include HTML5::StreamingHandler
  getter links = [] of String

  def on_element_open(tag : String, attrs : Array(HTML5::Attribute), namespace : String)
    if tag == "a"
      attrs.each do |attr|
        links << attr.val if attr.key == "href"
      end
    end
  end
end

extractor = LinkExtractor.new
doc = HTML5.stream(io, extractor)
puts extractor.links

def self.stream(html : String, handler : StreamingHandler, **opts)

Parses HTML from a String and emits SAX-style streaming events to the given handler.
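
The String overload can be used to observe which elements the parser opens for a given input. The sketch below uses a hypothetical TagRecorder handler (not part of the library) that overrides only on_element_open, with the same signature as in the example above. Whether events are also emitted for implicitly created elements such as html, head, and body is an assumption here, so no particular output is claimed.

```crystal
require "html5"

# Hypothetical handler that records the name of every element the parser opens.
class TagRecorder
  include HTML5::StreamingHandler
  getter tags = [] of String

  def on_element_open(tag : String, attrs : Array(HTML5::Attribute), namespace : String)
    tags << tag
  end
end

recorder = TagRecorder.new

# The input contains only an unclosed <p>; stream still returns the full document tree.
doc = HTML5.stream("<p>Hello", recorder)

puts recorder.tags.inspect
```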


def self.token_iterator(io : IO) : Iterator(Token)

Returns an Iterator over the tokens in the HTML input.

Example

HTML5.token_iterator(io).each do |token|
  puts token.data if token.type.start_tag?
end

def self.token_iterator(html : String) : Iterator(Token)

Returns an Iterator over the tokens in the HTML string.
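
Because the return value is a standard Iterator, it should compose with the lazy combinators from Crystal's standard library. A minimal sketch, assuming token.data is a String as in the examples above; the input and the first_two name are illustrative:

```crystal
require "html5"

html = "<ul><li>a</li><li>b</li><li>c</li></ul>"

# Lazily keep start tags, map them to their names, and take the first two.
first_two = HTML5.token_iterator(html)
  .select(&.type.start_tag?)
  .map(&.data)
  .first(2)
  .to_a

puts first_two.inspect
```

Compared with each_token, the iterator form makes it straightforward to stop early without consuming the rest of the input.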



Instance Method Detail

def escape_string(s : String) : String

escape_string escapes special characters like "<" to become "&lt;". It escapes only five such characters: <, >, &, ' and ". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.


def unescape_string(s : String) : String

unescape_string unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than escape_string escapes. For example, "&aacute;" unescapes to "á", as do "&#225;" and "&#xE1;". unescape_string(escape_string(s)) == s always holds, but the converse isn't always true.
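
The round-trip property can be checked with a short sketch. It assumes both methods are callable directly on the module as HTML5.escape_string and HTML5.unescape_string, which this page does not confirm; the sample string is illustrative, and the exact escaped form is printed rather than assumed.

```crystal
require "html5"

# Sample text containing all five characters that escape_string handles.
original = %(<a href="x">Tom & Jerry's</a>)

# Assumption: the instance methods are exposed on the module itself.
escaped = HTML5.escape_string(original)
restored = HTML5.unescape_string(escaped)

puts escaped
puts restored == original
```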

