Package: golang.org/x/net/html

package html

Import Path
	golang.org/x/net/html (on go.dev)

Dependency Relation
	imports 9 packages, and imported by one package

Involved Source Files

	    const.go
	  d doc.go
		Package html implements an HTML5-compliant tokenizer and parser.
		
		Tokenization is done by creating a Tokenizer for an io.Reader r. It is the
		caller's responsibility to ensure that r provides UTF-8 encoded HTML.
		
			z := html.NewTokenizer(r)
		
		Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(),
		which parses the next token and returns its type, or an error:
		
			for {
				tt := z.Next()
				if tt == html.ErrorToken {
					// ...
					return ...
				}
				// Process the current token.
			}
		
		There are two APIs for retrieving the current token. The high-level API is to
		call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs
		allow optionally calling Raw after Next but before Token, Text, TagName, or
		TagAttr. In EBNF notation, the valid call sequence per token is:
		
			Next {Raw} [ Token | Text | TagName {TagAttr} ]
		
		Token returns an independent data structure that completely describes a token.
		Entities (such as "&lt;") are unescaped, tag names and attribute keys are
		lower-cased, and attributes are collected into a []Attribute. For example:
		
			for {
				if z.Next() == html.ErrorToken {
					// Returning io.EOF indicates success.
					return z.Err()
				}
				emitToken(z.Token())
			}
		
		The low-level API performs fewer allocations and copies, but the contents of
		the []byte values returned by Text, TagName and TagAttr may change on the next
		call to Next. For example, to extract an HTML page's anchor text:
		
			depth := 0
			for {
				tt := z.Next()
				switch tt {
				case html.ErrorToken:
					return z.Err()
				case html.TextToken:
					if depth > 0 {
						// emitBytes should copy the []byte it receives,
						// if it doesn't process it immediately.
						emitBytes(z.Text())
					}
				case html.StartTagToken, html.EndTagToken:
					tn, _ := z.TagName()
					if len(tn) == 1 && tn[0] == 'a' {
						if tt == html.StartTagToken {
							depth++
						} else {
							depth--
						}
					}
				}
			}
		
		Parsing is done by calling Parse with an io.Reader, which returns the root of
		the parse tree (the document element) as a *Node. It is the caller's
		responsibility to ensure that the Reader provides UTF-8 encoded HTML. For
		example, to process each anchor node in depth-first order:
		
			doc, err := html.Parse(r)
			if err != nil {
				// ...
			}
			var f func(*html.Node)
			f = func(n *html.Node) {
				if n.Type == html.ElementNode && n.Data == "a" {
					// Do something with n...
				}
				for c := n.FirstChild; c != nil; c = c.NextSibling {
					f(c)
				}
			}
			f(doc)
		
		The relevant specifications include:
		https://html.spec.whatwg.org/multipage/syntax.html and
		https://html.spec.whatwg.org/multipage/syntax.html#tokenization
		
		# Security Considerations
		
		Care should be taken when parsing and interpreting HTML, whether full documents
		or fragments, within the framework of the HTML specification, especially with
		regard to untrusted inputs.
		
		This package provides both a tokenizer and a parser. The tokenizer implements
		the tokenization stage of the WHATWG HTML parsing specification; the parser
		implements both the tokenization and tree construction stages. While the
		tokenizer parses and normalizes individual HTML tokens, only the parser
		constructs the DOM tree from the tokenized HTML, as described in the tree
		construction stage of the specification, dynamically modifying or extending
		the document's DOM tree.
		
		If your use case requires semantically well-formed HTML documents, as defined by
		the WHATWG specification, the parser should be used rather than the tokenizer.
		
		In security contexts, if trust decisions are being made using the tokenized or
		parsed content, the input must be re-serialized (for instance by using Render or
		Token.String) in order for those trust decisions to hold, as the process of
		tokenization or parsing may alter the content.

	    doctype.go
	    entity.go
	    escape.go
	    foreign.go
	    node.go
	    parse.go
	    render.go
	    token.go
Code Examples

	Parse
		package main
		
		import (
			"fmt"
			"log"
			"strings"
		
			"golang.org/x/net/html"
		)
		
		func main() {
			s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
			doc, err := html.Parse(strings.NewReader(s))
			if err != nil {
				log.Fatal(err)
			}
			var f func(*html.Node)
			f = func(n *html.Node) {
				if n.Type == html.ElementNode && n.Data == "a" {
					for _, a := range n.Attr {
						if a.Key == "href" {
							fmt.Println(a.Val)
							break
						}
					}
				}
				for c := n.FirstChild; c != nil; c = c.NextSibling {
					f(c)
				}
			}
			f(doc)
		}


Package-Level Type Names (total 7)

	 type Attribute (struct)
		An Attribute is an attribute namespace-key-value triple. Namespace is
		non-empty for foreign attributes like xlink, Key is alphabetic (and hence
		does not contain escapable characters like '&', '<' or '>'), and Val is
		unescaped (it looks like "a<b" rather than "a&lt;b").
		
		Namespace is only used by the parser, not the tokenizer.

		Fields (total 3)
			Key string
			Namespace string
			Val string

	 type Node (struct)
		A Node consists of a NodeType and some Data (tag name for element nodes,
		content for text) and is part of a tree of Nodes. Element nodes may also
		have a Namespace and contain a slice of Attributes. Data is unescaped, so
		that it looks like "a<b" rather than "a&lt;b". For element nodes, DataAtom
		is the atom for Data, or zero if Data is not a known tag name.
		
		An empty Namespace implies a "http://www.w3.org/1999/xhtml" namespace.
		Similarly, "math" is short for "http://www.w3.org/1998/Math/MathML", and
		"svg" is short for "http://www.w3.org/2000/svg".

		Fields (total 10)
			Attr []Attribute
			Data string
			DataAtom atom.Atom
			FirstChild *Node
			LastChild *Node
			Namespace string
			NextSibling *Node
			Parent *Node
			PrevSibling *Node
			Type NodeType
		Methods (total 3)
			(*Node) AppendChild(c *Node)
				AppendChild adds a node c as a child of n.
				
				It will panic if c already has a parent or siblings.

			(*Node) InsertBefore(newChild, oldChild *Node)
				InsertBefore inserts newChild as a child of n, immediately before oldChild
				in the sequence of n's children. oldChild may be nil, in which case newChild
				is appended to the end of n's children.
				
				It will panic if newChild already has a parent or siblings.

			(*Node) RemoveChild(c *Node)
				RemoveChild removes a node c that is a child of n. Afterwards, c will have
				no parent and no siblings.
				
				It will panic if c's parent is not n.

		As Outputs Of (at least 4)
			func Parse(r io.Reader) (*Node, error)
			func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
			func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
			func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)
		As Inputs Of (at least 6)
			func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
			func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
			func Render(w io.Writer, n *Node) error
			func (*Node).AppendChild(c *Node)
			func (*Node).InsertBefore(newChild, oldChild *Node)
			func (*Node).RemoveChild(c *Node)

	 type NodeType uint32 (basic type)
		A NodeType is the type of a Node.

		As Types Of (total 7)
			const CommentNode
			const DoctypeNode
			const DocumentNode
			const ElementNode
			const ErrorNode
			const RawNode
			const TextNode

	 type ParseOption (func)
		ParseOption configures a parser.

		As Outputs Of (at least one exported)
			func ParseOptionEnableScripting(enable bool) ParseOption
		As Inputs Of (at least 2)
			func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
			func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)

	 type Token (struct)
		A Token consists of a TokenType and some Data (tag name for start and end
		tags, content for text, comments and doctypes). A tag Token may also contain
		a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b"
		rather than "a&lt;b"). For tag Tokens, DataAtom is the atom for Data, or
		zero if Data is not a known tag name.

		Fields (total 4)
			Attr []Attribute
			Data string
			DataAtom atom.Atom
			Type TokenType
		Methods (only one)
			( Token) String() string
				String returns a string representation of the Token.

		Implements (at least 2)
			 Token : fmt.Stringer
			 Token : github.com/ChrisTrenkamp/goxpath/tree.Result
		As Outputs Of (at least one exported)
			func (*Tokenizer).Token() Token

	 type Tokenizer (struct)
		A Tokenizer returns a stream of HTML Tokens.

		Methods (total 11)
			(*Tokenizer) AllowCDATA(allowCDATA bool)
				AllowCDATA sets whether or not the tokenizer recognizes <![CDATA[foo]]> as
				the text "foo". The default value is false, which means to recognize it as
				a bogus comment "<!-- [CDATA[foo]] -->" instead.
				
				Strictly speaking, an HTML5 compliant tokenizer should allow CDATA if and
				only if tokenizing foreign content, such as MathML and SVG. However,
				tracking foreign-contentness is difficult to do purely in the tokenizer,
				as opposed to the parser, due to HTML integration points: an <svg> element
				can contain a <foreignObject> that is foreign-to-SVG but not foreign-to-
				HTML. For strict compliance with the HTML5 tokenization algorithm, it is the
				responsibility of the user of a tokenizer to call AllowCDATA as appropriate.
				In practice, if using the tokenizer without caring whether MathML or SVG
				CDATA is text or comments, such as tokenizing HTML to find all the anchor
				text, it is acceptable to ignore this responsibility.

			(*Tokenizer) Buffered() []byte
				Buffered returns a slice containing data buffered but not yet tokenized.

			(*Tokenizer) Err() error
				Err returns the error associated with the most recent ErrorToken token.
				This is typically io.EOF, meaning the end of tokenization.

			(*Tokenizer) Next() TokenType
				Next scans the next token and returns its type.

			(*Tokenizer) NextIsNotRawText()
				NextIsNotRawText instructs the tokenizer that the next token should not be
				considered as 'raw text'. Some elements, such as script and title elements,
				normally require the next token after the opening tag to be 'raw text' that
				has no child elements. For example, tokenizing "<title>a<b>c</b>d</title>"
				yields a start tag token for "<title>", a text token for "a<b>c</b>d", and
				an end tag token for "</title>". There are no distinct start tag or end tag
				tokens for the "<b>" and "</b>".
				
				This tokenizer implementation will generally look for raw text at the right
				times. Strictly speaking, an HTML5 compliant tokenizer should not look for
				raw text if in foreign content: <title> generally needs raw text, but a
				<title> inside an <svg> does not. Another example is that a <textarea>
				generally needs raw text, but a <textarea> is not allowed as an immediate
				child of a <select>; in normal parsing, a <textarea> implies </select>, but
				one cannot close the implicit element when parsing a <select>'s InnerHTML.
				Similarly to AllowCDATA, tracking the correct moment to override raw-text-
				ness is difficult to do purely in the tokenizer, as opposed to the parser.
				For strict compliance with the HTML5 tokenization algorithm, it is the
				responsibility of the user of a tokenizer to call NextIsNotRawText as
				appropriate. In practice, like AllowCDATA, it is acceptable to ignore this
				responsibility for basic usage.
				
				Note that this 'raw text' concept is different from the one offered by the
				Tokenizer.Raw method.

			(*Tokenizer) Raw() []byte
				Raw returns the unmodified text of the current token. Calling Next, Token,
				Text, TagName or TagAttr may change the contents of the returned slice.
				
				The token stream's raw bytes partition the byte stream (up until an
				ErrorToken). There are no overlaps or gaps between two consecutive tokens'
				raw bytes. One implication is that the byte offset of the current token is
				the sum of the lengths of all previous tokens' raw bytes.

			(*Tokenizer) SetMaxBuf(n int)
				SetMaxBuf sets a limit on the amount of data buffered during tokenization.
				A value of 0 means unlimited.

			(*Tokenizer) TagAttr() (key, val []byte, moreAttr bool)
				TagAttr returns the lower-cased key and unescaped value of the next unparsed
				attribute for the current tag token and whether there are more attributes.
				The contents of the returned slices may change on the next call to Next.

			(*Tokenizer) TagName() (name []byte, hasAttr bool)
				TagName returns the lower-cased name of a tag token (the `img` out of
				`<IMG SRC="foo">`) and whether the tag has attributes.
				The contents of the returned slice may change on the next call to Next.

			(*Tokenizer) Text() []byte
				Text returns the unescaped text of a text, comment or doctype token. The
				contents of the returned slice may change on the next call to Next.

			(*Tokenizer) Token() Token
				Token returns the current Token. The result's Data and Attr values remain
				valid after subsequent Next calls.

		As Outputs Of (at least 2)
			func NewTokenizer(r io.Reader) *Tokenizer
			func NewTokenizerFragment(r io.Reader, contextTag string) *Tokenizer

	 type TokenType uint32 (basic type)
		A TokenType is the type of a Token.

		Methods (only one)
			( TokenType) String() string
				String returns a string representation of the TokenType.

		Implements (at least 2)
			 TokenType : fmt.Stringer
			 TokenType : github.com/ChrisTrenkamp/goxpath/tree.Result
		As Outputs Of (at least one exported)
			func (*Tokenizer).Next() TokenType
		As Types Of (total 7)
			const CommentToken
			const DoctypeToken
			const EndTagToken
			const ErrorToken
			const SelfClosingTagToken
			const StartTagToken
			const TextToken


Package-Level Functions (total 10)

	 func EscapeString(s string) string
		EscapeString escapes special characters like "<" to become "&lt;". It
		escapes only five such characters: <, >, &, ' and ".
		UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
		always true.

	 func NewTokenizer(r io.Reader) *Tokenizer
		NewTokenizer returns a new HTML Tokenizer for the given Reader.
		The input is assumed to be UTF-8 encoded.

	 func NewTokenizerFragment(r io.Reader, contextTag string) *Tokenizer
		NewTokenizerFragment returns a new HTML Tokenizer for the given Reader, for
		tokenizing an existing element's InnerHTML fragment. contextTag is that
		element's tag, such as "div" or "iframe".
		
		For example, how the InnerHTML "a<b" is tokenized depends on whether it is
		for a <p> tag or a <script> tag.
		
		The input is assumed to be UTF-8 encoded.

	 func Parse(r io.Reader) (*Node, error)
		Parse returns the parse tree for the HTML from the given Reader.
		
		It implements the HTML5 parsing algorithm
		(https://html.spec.whatwg.org/multipage/syntax.html#tree-construction),
		which is very complicated. The resultant tree can contain implicitly created
		nodes that have no explicit <tag> listed in r's data, and nodes' parents can
		differ from the nesting implied by a naive processing of start and end
		<tag>s. Conversely, explicit <tag>s in r's data can be silently dropped,
		with no corresponding node in the resulting tree.
		
		The input is assumed to be UTF-8 encoded.

	 func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
		ParseFragment parses a fragment of HTML and returns the nodes that were
		found. If the fragment is the InnerHTML for an existing element, pass that
		element in context.
		
		It has the same intricacies as Parse.

	 func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)
		ParseFragmentWithOptions is like ParseFragment, with options.

	 func ParseOptionEnableScripting(enable bool) ParseOption
		ParseOptionEnableScripting configures the scripting flag.
		https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting
		
		By default, scripting is enabled.

	 func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)
		ParseWithOptions is like Parse, with options.

	 func Render(w io.Writer, n *Node) error
		Render renders the parse tree n to the given writer.
		
		Rendering is done on a 'best effort' basis: calling Parse on the output of
		Render will always result in something similar to the original tree, but it
		is not necessarily an exact clone unless the original tree was 'well-formed'.
		'Well-formed' is not easily specified; the HTML5 specification is
		complicated.
		
		Calling Parse on arbitrary input typically results in a 'well-formed' parse
		tree. However, it is possible for Parse to yield a 'badly-formed' parse tree.
		For example, in a 'well-formed' parse tree, no <a> element is a child of
		another <a> element: parsing "<a><a>" results in two sibling elements.
		Similarly, in a 'well-formed' parse tree, no <a> element is a child of a
		<table> element: parsing "<p><table><a>" results in a <p> with two sibling
		children; the <a> is reparented to the <table>'s parent. However, calling
		Parse on "<a><table><a>" does not return an error, but the result has an <a>
		element with an <a> child, and is therefore not 'well-formed'.
		
		Programmatically constructed trees are typically also 'well-formed', but it
		is possible to construct a tree that looks innocuous but, when rendered and
		re-parsed, results in a different tree. A simple example is that a solitary
		text node would become a tree containing <html>, <head> and <body> elements.
		Another example is that the programmatic equivalent of "a<head>b</head>c"
		becomes "<html><head><head/><body>abc</body></html>".

	 func UnescapeString(s string) string
		UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a
		larger range of entities than EscapeString escapes. For example, "&aacute;"
		unescapes to "á", as does "&#225;" and "&#xE1;".
		UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
		always true.


Package-Level Variables (only one)

	  var ErrBufferExceeded error
		ErrBufferExceeded means that the buffering limit was exceeded.


Package-Level Constants (total 14)

	const CommentNode NodeType = 4
	const CommentToken TokenType = 5
		A CommentToken looks like <!--x-->.

	const DoctypeNode NodeType = 5
	const DoctypeToken TokenType = 6
		A DoctypeToken looks like <!DOCTYPE x>.

	const DocumentNode NodeType = 2
	const ElementNode NodeType = 3
	const EndTagToken TokenType = 3
		An EndTagToken looks like </a>.

	const ErrorNode NodeType = 0
	const ErrorToken TokenType = 0
		ErrorToken means that an error occurred during tokenization.

	const RawNode NodeType = 6
		RawNode nodes are not returned by the parser, but can be part of the
		Node tree passed to func Render to insert raw HTML (without escaping).
		If so, this package makes no guarantee that the rendered HTML is secure
		(from e.g. Cross Site Scripting attacks) or well-formed.

	const SelfClosingTagToken TokenType = 4
		A SelfClosingTagToken tag looks like <br/>.

	const StartTagToken TokenType = 2
		A StartTagToken looks like <a>.

	const TextNode NodeType = 1
	const TextToken TokenType = 1
		TextToken means a text node.


The pages are generated with Golds v0.6.7. (GOOS=linux GOARCH=amd64)
Golds is a Go 101 project developed by Tapir Liu.