Package: github.com/rivo/uniseg

package uniseg

Import Path
	github.com/rivo/uniseg (on go.dev)

Dependency Relation
	imports one package, and imported by one package

Involved Source Files

	    doc.go
		Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
		string width calculation for monospace fonts. Unicode Text Segmentation conforms
		to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
		Line Breaking conforms to Unicode Standard Annex #14
		(https://unicode.org/reports/tr14/).
		
		In short, using this package, you can split a string into grapheme clusters
		(what people would usually refer to as a "character"), into words, and into
		sentences. Or, in its simplest case, this package allows you to count the number
		of characters in a string, especially when it contains complex characters such
		as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
		other languages. Additionally, you can use it to implement line breaking (or
		"word wrapping"), that is, to determine where text can be broken over to the
		next line when the width of the line is not big enough to fit the entire text.
		Finally, you can use it to calculate the display width of a string for monospace
		fonts.
		
		# Getting Started
		
		If you just want to count the number of characters in a string, you can use
		[GraphemeClusterCount]. If you want to determine the display width of a string,
		you can use [StringWidth]. If you want to iterate over a string, you can use
		[Step], [StepString], or the [Graphemes] type (more convenient but less
		performant). These provide all available information: grapheme clusters,
		word boundaries, sentence boundaries, line breaks, and monospace character
		widths. The specialized functions [FirstGraphemeCluster],
		[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
		[FirstSentence], and [FirstSentenceInString] can be used if only one type of
		information is needed.
		
		# Grapheme Clusters
		
		Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
		character. But its string representation actually has 14 bytes, so counting
		bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't,
		either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function
		utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) will both return 4.
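		
		The numbers above can be checked with the standard library alone. The
		following is a minimal sketch: the emoji is spelled out with escape sequences
		so that its four code points are visible, and the helper name
		byteAndRuneCount is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag is U+1F3F3 (white flag), U+FE0F (variation selector-16),
// U+200D (zero width joiner), and U+1F308 (rainbow).
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// byteAndRuneCount is a hypothetical helper that returns the UTF-8 byte
// length and the number of code points of s.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	nBytes, nRunes := byteAndRuneCount(rainbowFlag)
	fmt.Println(nBytes, nRunes) // 14 4
}
```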
		
		The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
		The [Graphemes] type and a variety of functions in this package allow you to
		split strings into their grapheme clusters.
		
		# Word Boundaries
		
		Word boundaries are used in a number of different contexts. The most familiar
		ones are selection (double-click mouse selection), cursor movement ("move to
		next word" control-arrow keys), and the dialog option "Whole Word Search" for
		search and replace. This package provides methods for determining word
		boundaries.
		
		# Sentence Boundaries
		
		Sentence boundaries are often used for triple-click or some other method of
		selecting or iterating through blocks of text that are larger than single words.
		They are also used to determine whether words occur within the same sentence in
		database queries. This package provides methods for determining sentence
		boundaries.
		
		# Line Breaking
		
		Line breaking, also known as word wrapping, is the process of breaking a section
		of text into lines such that it will fit in the available width of a page,
		window or other display area. This package provides methods to determine the
		positions in a string where a line must be broken, may be broken, or must not be
		broken.
		
		# Monospace Width
		
		Monospace width, as referred to in this package, is the width of a string in a
		monospace font. This is commonly used in terminal user interfaces or text
		displays or editors that don't support proportional fonts. A width of 1
		corresponds to a single character cell. The C function [wcswidth()] and its
		implementations in other programming languages are in widespread use for the
		same purpose. However, there is no standard for the calculation of such
		widths, and this package differs from wcswidth() in a number of ways,
		presumably to generate more visually pleasing results.
		
		To start, we assume that every code point has a width of 1, with the following
		exceptions:
		
		  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
		    and ZWJ have a width of 0.
		  - U+2E3A, Two-Em Dash, has a width of 3.
		  - U+2E3B, Three-Em Dash, has a width of 4.
		  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
		    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
		    have a width of 1.)
		  - Code points with grapheme cluster break property Regional Indicator have a
		    width of 2.
		  - Code points with grapheme cluster break property Extended Pictographic have
		    a width of 2, unless their Emoji Presentation flag is "No", in which case
		    the width is 1.
		
		For Hangul grapheme clusters composed of conjoining Jamo and for Regional
		Indicators (flags), all code points except the first one have a width of 0. For
		grapheme clusters starting with an Extended Pictographic, any additional code
		point will force a total width of 2, except if the Variation Selector-15
		(U+FE0E) is included, in which case the total width is always 1. Grapheme
		clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.
		
		Note that whether these widths appear correct depends on your application's
		rendering engine, the extent to which it conforms to the Unicode Standard, and
		its choice of font.
		
		[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html

	    eastasianwidth.go
	    emojipresentation.go
	    grapheme.go
	    graphemeproperties.go
	    graphemerules.go
	    line.go
	    lineproperties.go
	    linerules.go
	    properties.go
	    sentence.go
	    sentenceproperties.go
	    sentencerules.go
	    step.go
	    width.go
	    word.go
	    wordproperties.go
	    wordrules.go

Package-Level Type Names (only one)

	 type Graphemes (struct)
		Graphemes implements an iterator over Unicode grapheme clusters, or
		user-perceived characters. While iterating, it also provides information
		about word boundaries, sentence boundaries, line breaks, and monospace
		character widths.
		
		After constructing the iterator via [NewGraphemes] for a given string "str",
		[Graphemes.Next] is called for every grapheme cluster in a loop until it
		returns false. Inside the loop, information about the grapheme cluster as
		well as boundary information and character width is available via the various
		methods.
		
		Using this type to iterate over a string is convenient but it is much slower
		than using this package's [Step] or [StepString] functions or any of the
		other specialized functions starting with "First".

		Methods (total 10)
			(*Graphemes) Bytes() []byte
				Bytes returns a byte slice which corresponds to the current grapheme cluster.
				If the iterator is already past the end or [Graphemes.Next] has not yet been
				called, nil is returned.

			(*Graphemes) IsSentenceBoundary() bool
				IsSentenceBoundary returns true if a sentence ends after the current
				grapheme cluster.

			(*Graphemes) IsWordBoundary() bool
				IsWordBoundary returns true if a word ends after the current grapheme
				cluster.

			(*Graphemes) LineBreak() int
				LineBreak returns whether the line can be broken after the current grapheme
				cluster. A value of [LineDontBreak] means the line must not be broken, a
				value of [LineMustBreak] means the line must be broken, and a value of
				[LineCanBreak] means the line may or may not be broken.

			(*Graphemes) Next() bool
				Next advances the iterator by one grapheme cluster and returns false if no
				clusters are left. This function must be called before the first cluster is
				accessed.

			(*Graphemes) Positions() (int, int)
				Positions returns the interval of the current grapheme cluster as byte
				positions into the original string. The first returned value "from" indexes
				the first byte and the second returned value "to" indexes the first byte that
				is not included anymore, i.e. str[from:to] is the current grapheme cluster of
				the original string "str". If [Graphemes.Next] has not yet been called, both
				values are 0. If the iterator is already past the end, both values are 1.

			(*Graphemes) Reset()
				Reset puts the iterator into its initial state such that the next call to
				[Graphemes.Next] sets it to the first grapheme cluster again.

			(*Graphemes) Runes() []rune
				Runes returns a slice of runes (code points) which corresponds to the current
				grapheme cluster. If the iterator is already past the end or [Graphemes.Next]
				has not yet been called, nil is returned.

			(*Graphemes) Str() string
				Str returns a substring of the original string which corresponds to the
				current grapheme cluster. If the iterator is already past the end or
				[Graphemes.Next] has not yet been called, an empty string is returned.

			(*Graphemes) Width() int
				Width returns the monospace width of the current grapheme cluster.

		As Outputs Of (at least one exported)
			func NewGraphemes(str string) *Graphemes


Package-Level Functions (total 16)

	 func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, width, newState int)
		FirstGraphemeCluster returns the first grapheme cluster found in the given
		byte slice according to the rules of [Unicode Standard Annex #29, Grapheme
		Cluster Boundaries]. This function can be called continuously to extract all
		grapheme clusters from a byte slice, as illustrated in the example below.
		
		If you don't know the current state, for example when calling the function
		for the first time, you must pass -1. For consecutive calls, pass the state
		and rest slice returned by the previous call.
		
		The "rest" slice is the sub-slice of the original byte slice "b" starting
		after the last byte of the identified grapheme cluster. If the length of the
		"rest" slice is 0, the entire byte slice "b" has been processed. The
		"cluster" byte slice is the sub-slice of the input slice containing the
		identified grapheme cluster.
		
		The returned width is the width of the grapheme cluster for most monospace
		fonts where a value of 1 represents one character cell.
		
		Given an empty byte slice "b", the function returns nil values.
		
		While slightly less convenient than using the [Graphemes] type, this function
		has much better performance and makes no allocations. It lends itself well to
		large byte slices.
		
		[Unicode Standard Annex #29, Grapheme Cluster Boundaries]: http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

	 func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, width, newState int)
		FirstGraphemeClusterInString is like [FirstGraphemeCluster] but its input and
		outputs are strings.

	 func FirstLineSegment(b []byte, state int) (segment, rest []byte, mustBreak bool, newState int)
		FirstLineSegment returns the prefix of the given byte slice after which a
		decision to break the string over to the next line can or must be made,
		according to the rules of [Unicode Standard Annex #14]. This is used to
		implement line breaking.
		
		Line breaking, also known as word wrapping, is the process of breaking a
		section of text into lines such that it will fit in the available width of a
		page, window or other display area.
		
		The returned "segment" may not be broken into smaller parts, unless no other
		breaking opportunities present themselves, in which case you may break by
		grapheme clusters (using the [FirstGraphemeCluster] function to determine the
		grapheme clusters).
		
		The "mustBreak" flag indicates whether you MUST break the line after the
		given segment (true), for example after newline characters, or you MAY break
		the line after the given segment (false).
		
		This function can be called continuously to extract all non-breaking sub-sets
		from a byte slice, as illustrated in the example below.
		
		If you don't know the current state, for example when calling the function
		for the first time, you must pass -1. For consecutive calls, pass the state
		and rest slice returned by the previous call.
		
		The "rest" slice is the sub-slice of the original byte slice "b" starting
		after the last byte of the identified line segment. If the length of the
		"rest" slice is 0, the entire byte slice "b" has been processed. The
		"segment" byte slice is the sub-slice of the input slice containing the
		identified line segment.
		
		Given an empty byte slice "b", the function returns nil values.
		
		Note that in accordance with [UAX #14 LB3], the final segment will end with
		"mustBreak" set to true. You can choose to ignore this by checking if the
		length of the "rest" slice is 0 and calling [HasTrailingLineBreak] or
		[HasTrailingLineBreakInString] on the last rune.
		
		Note also that this algorithm may break within grapheme clusters. This is
		addressed in Section 8.2 Example 6 of UAX #14. To avoid this, you can use
		the [Step] function instead.
		
		[Unicode Standard Annex #14]: https://www.unicode.org/reports/tr14/
		[UAX #14 LB3]: https://www.unicode.org/reports/tr14/#Algorithm

	 func FirstLineSegmentInString(str string, state int) (segment, rest string, mustBreak bool, newState int)
		FirstLineSegmentInString is like [FirstLineSegment] but its input and outputs
		are strings.

	 func FirstSentence(b []byte, state int) (sentence, rest []byte, newState int)
		FirstSentence returns the first sentence found in the given byte slice
		according to the rules of [Unicode Standard Annex #29, Sentence Boundaries].
		This function can be called continuously to extract all sentences from a byte
		slice, as illustrated in the example below.
		
		If you don't know the current state, for example when calling the function
		for the first time, you must pass -1. For consecutive calls, pass the state
		and rest slice returned by the previous call.
		
		The "rest" slice is the sub-slice of the original byte slice "b" starting
		after the last byte of the identified sentence. If the length of the "rest"
		slice is 0, the entire byte slice "b" has been processed. The "sentence" byte
		slice is the sub-slice of the input slice containing the identified sentence.
		
		Given an empty byte slice "b", the function returns nil values.
		
		[Unicode Standard Annex #29, Sentence Boundaries]: http://unicode.org/reports/tr29/#Sentence_Boundaries

	 func FirstSentenceInString(str string, state int) (sentence, rest string, newState int)
		FirstSentenceInString is like [FirstSentence] but its input and outputs are
		strings.

	 func FirstWord(b []byte, state int) (word, rest []byte, newState int)
		FirstWord returns the first word found in the given byte slice according to
		the rules of [Unicode Standard Annex #29, Word Boundaries]. This function can
		be called continuously to extract all words from a byte slice, as illustrated
		in the example below.
		
		If you don't know the current state, for example when calling the function
		for the first time, you must pass -1. For consecutive calls, pass the state
		and rest slice returned by the previous call.
		
		The "rest" slice is the sub-slice of the original byte slice "b" starting
		after the last byte of the identified word. If the length of the "rest" slice
		is 0, the entire byte slice "b" has been processed. The "word" byte slice is
		the sub-slice of the input slice containing the identified word.
		
		Given an empty byte slice "b", the function returns nil values.
		
		[Unicode Standard Annex #29, Word Boundaries]: http://unicode.org/reports/tr29/#Word_Boundaries

	 func FirstWordInString(str string, state int) (word, rest string, newState int)
		FirstWordInString is like [FirstWord] but its input and outputs are strings.

	 func GraphemeClusterCount(s string) (n int)
		GraphemeClusterCount returns the number of user-perceived characters
		(grapheme clusters) for the given string.

	 func HasTrailingLineBreak(b []byte) bool
		HasTrailingLineBreak returns true if the last rune in the given byte slice is
		one of the hard line break code points defined in LB4 and LB5 of [UAX #14].
		
		[UAX #14]: https://www.unicode.org/reports/tr14/#Algorithm

	 func HasTrailingLineBreakInString(str string) bool
		HasTrailingLineBreakInString is like [HasTrailingLineBreak] but for a string.

	 func NewGraphemes(str string) *Graphemes
		NewGraphemes returns a new grapheme cluster iterator.

	 func ReverseString(s string) string
		ReverseString reverses the given string while observing grapheme cluster
		boundaries.

	 func Step(b []byte, state int) (cluster, rest []byte, boundaries int, newState int)
		Step returns the first grapheme cluster (user-perceived character) found in
		the given byte slice. It also returns information about the boundary between
		that grapheme cluster and the one following it as well as the monospace width
		of the grapheme cluster. There are three types of boundary information: word
		boundaries, sentence boundaries, and line breaks. This function is therefore
		a combination of [FirstGraphemeCluster], [FirstWord], [FirstSentence], and
		[FirstLineSegment].
		
		The "boundaries" return value can be evaluated as follows:
		
		  - boundaries&MaskWord != 0: The boundary is a word boundary.
		  - boundaries&MaskWord == 0: The boundary is not a word boundary.
		  - boundaries&MaskSentence != 0: The boundary is a sentence boundary.
		  - boundaries&MaskSentence == 0: The boundary is not a sentence boundary.
		  - boundaries&MaskLine == LineDontBreak: You must not break the line at the
		    boundary.
		  - boundaries&MaskLine == LineMustBreak: You must break the line at the
		    boundary.
		  - boundaries&MaskLine == LineCanBreak: You may or may not break the line at
		    the boundary.
		  - boundaries >> ShiftWidth: The width of the grapheme cluster for most
		    monospace fonts where a value of 1 represents one character cell.
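		
		The bit layout in the list above can be unpacked as in the following
		sketch. To keep it free of dependencies, it declares local copies of the
		constant values documented on this page; real code would use
		uniseg.MaskLine, uniseg.MaskWord, uniseg.MaskSentence, and
		uniseg.ShiftWidth. The boundaryInfo type and the decode function are
		hypothetical names.

```go
package main

import "fmt"

// Local copies of the constant values listed in this package's documentation.
const (
	lineDontBreak = 0
	lineCanBreak  = 1
	lineMustBreak = 2
	maskLine      = 3
	maskWord      = 4
	maskSentence  = 8
	shiftWidth    = 4
)

// boundaryInfo is a hypothetical struct holding the decoded fields.
type boundaryInfo struct {
	word, sentence bool
	line, width    int
}

// decode splits a "boundaries" value into its four components.
func decode(boundaries int) boundaryInfo {
	return boundaryInfo{
		word:     boundaries&maskWord != 0,
		sentence: boundaries&maskSentence != 0,
		line:     boundaries & maskLine,
		width:    boundaries >> shiftWidth,
	}
}

func main() {
	// A made-up value: width 2, word boundary, optional line break.
	b := 2<<shiftWidth | maskWord | lineCanBreak
	fmt.Printf("%+v\n", decode(b))
}
```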
		
		This function can be called continuously to extract all grapheme clusters
		from a byte slice, as illustrated in the examples below.
		
		If you don't know which state to pass, for example when calling the function
		for the first time, you must pass -1. For consecutive calls, pass the state
		and rest slice returned by the previous call.
		
		The "rest" slice is the sub-slice of the original byte slice "b" starting
		after the last byte of the identified grapheme cluster. If the length of the
		"rest" slice is 0, the entire byte slice "b" has been processed. The
		"cluster" byte slice is the sub-slice of the input slice containing the
		first identified grapheme cluster.
		
		Given an empty byte slice "b", the function returns nil values.
		
		While slightly less convenient than using the [Graphemes] type, this function
		has much better performance and makes no allocations. It lends itself well to
		large byte slices.
		
		Note that in accordance with [UAX #14 LB3], the final segment will end with
		a mandatory line break (boundaries&MaskLine == LineMustBreak). You can choose
		to ignore this by checking if the length of the "rest" slice is 0 and calling
		[HasTrailingLineBreak] or [HasTrailingLineBreakInString] on the last rune.
		
		[UAX #14 LB3]: https://www.unicode.org/reports/tr14/#Algorithm

	 func StepString(str string, state int) (cluster, rest string, boundaries int, newState int)
		StepString is like [Step] but its input and outputs are strings.

	 func StringWidth(s string) (width int)
		StringWidth returns the monospace width for the given string, that is, the
		number of same-size cells to be occupied by the string.


Package-Level Constants (total 7)

	const LineCanBreak = 1 // You may or may not break the line here.
		These constants define whether a given text may be broken into the next line.
		If the break is optional (LineCanBreak), you may choose to break or not based
		on your own criteria, for example, if the text has reached the available
		width.

	const LineDontBreak = 0 // You may not break the line here.
		These constants define whether a given text may be broken into the next line.
		If the break is optional (LineCanBreak), you may choose to break or not based
		on your own criteria, for example, if the text has reached the available
		width.

	const LineMustBreak = 2 // You must break the line here.
		These constants define whether a given text may be broken into the next line.
		If the break is optional (LineCanBreak), you may choose to break or not based
		on your own criteria, for example, if the text has reached the available
		width.

	const MaskLine = 3
		The bit masks used to extract boundary information returned by [Step].

	const MaskSentence = 8
		The bit masks used to extract boundary information returned by [Step].

	const MaskWord = 4
		The bit masks used to extract boundary information returned by [Step].

	const ShiftWidth = 4
		The number of bits to shift the boundary information returned by [Step] to
		obtain the monospace width of the grapheme cluster.


The pages are generated with Golds v0.6.7. (GOOS=linux GOARCH=amd64)
Golds is a Go 101 project developed by Tapir Liu.