Source File
doc.go
Belonging Package
github.com/rivo/uniseg
/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people would usually refer to as a "character"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken over to the
next line when the width of the line is not big enough to fit the entire text.
Finally, you can use it to calculate the display width of a string for monospace
fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). This will provide you with all information: grapheme clusters,
word boundaries, sentence boundaries, line breaks, and monospace character
widths. The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️‍🌈")) will not work as expected.
Counting runes won't, either: The flag has 4 Unicode code points, thus 4 runes.
The stdlib expressions utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈"))
will both return 4.

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package will allow you
to split strings into their grapheme clusters.

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides methods for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides methods for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window or other display area. This package provides methods to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose.
However, there is no standard for the calculation of such widths, and this
package differs from wcswidth() in a number of ways, presumably to generate
more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's
render engine, the extent to which it conforms to the Unicode Standard, and its
choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg
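
The byte and rune counts quoted in the Grapheme Clusters section can be checked with the standard library alone. The following is a minimal sketch, not part of the package: the constant rainbowFlag and the helper byteAndRuneCount are hypothetical names introduced here for illustration, and the emoji is written as escape sequences so its four code points are explicit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag spells out the four code points of the rainbow flag emoji:
// U+1F3F3 (white flag), U+FE0F (Variation Selector-16), U+200D (zero width
// joiner), and U+1F308 (rainbow). The name is hypothetical.
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// byteAndRuneCount is a hypothetical helper (not part of uniseg). It returns
// the UTF-8 byte length of s and the number of code points in s.
func byteAndRuneCount(s string) (byteLen, runeLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneCount(rainbowFlag)
	fmt.Println("bytes:", b)                                   // prints "bytes: 14"
	fmt.Println("runes:", r)                                   // prints "runes: 4"
	fmt.Println("runes via []rune:", len([]rune(rainbowFlag))) // prints "runes via []rune: 4"
}
```

Neither number matches the single character a reader sees. According to the package documentation above, GraphemeClusterCount returns 1 for this string.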