normalizeUnicode
Normalize Unicode Characters

Description

Convert a vector of strings using one of several defined types of Unicode normalization.

Usage

normalizeUnicode(x, form = "NFC")

Arguments

x a vector of strings.
form a character string specifying the type of Unicode normalization to be used. Should be one of the strings "NFC", "NFD", "NFKC", "NFKD", "NFKC_CF" or "NFKC_Casefold".

Details

Unicode allows multiple character sequences to represent the same string. For example, the string "capital A with two dots" can be represented as a single character "\u00C4", or as the two characters "A\u0308". The Unicode standard defines multiple ways to "normalize" a Unicode string so different ways of representing a given string map to the same "canonical form" (see http://unicode.org/reports/tr15/). Normalizing Unicode strings is necessary in order to consistently compare or sort strings in languages with accented characters.
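The difference between the two representations can be inspected with base R alone. The sketch below is illustrative only: the variable names are made up for this example, and the final step assumes that normalizeUnicode() from this package is available in the session, so it is guarded by exists().

```r
# Two representations of "capital A with two dots"
precomposed <- "\u00C4"    # single code point U+00C4
decomposed  <- "A\u0308"   # "A" followed by combining diaeresis U+0308

# The strings look alike when printed but differ code point by code point
cp_precomposed <- utf8ToInt(precomposed)
cp_decomposed  <- utf8ToInt(decomposed)
same_before    <- identical(precomposed, decomposed)

# Assumption: normalizeUnicode() is available in the session.
# After normalizing both inputs to the same form, they can be compared directly.
if (exists("normalizeUnicode")) {
  same_after <- identical(normalizeUnicode(precomposed, "NFC"),
                          normalizeUnicode(decomposed, "NFC"))
  print(same_after)
}
```

Comparing the raw strings reports a mismatch even though a reader would consider them the same text, which is the situation normalization is meant to resolve.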
The forms "NFKC_CF" and "NFKC_Casefold" (which are equivalent) are described in http://unicode.org/reports/tr31/.
Each string is converted to UTF-8 before normalization, and the resulting strings all have the UTF-8 encoding.

Value

A vector of strings, with each element of x converted according to the specified normalization form. Attributes from x are copied to the output value.

See Also

Encoding.

Examples

all.equal(normalizeUnicode('\u212B','NFC'), '\u00C5')
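
A slightly longer sketch along the same lines, using a named vector so that the documented copying of attributes and the UTF-8 encoding of the result can be checked. The vector x and its names are hypothetical, chosen for illustration, and the call to normalizeUnicode() assumes the package is loaded, so it is guarded by exists().

```r
# Three representations of "A with ring above" (names are illustrative)
x <- c(angstrom   = "\u212B",    # ANGSTROM SIGN
       a_ring     = "\u00C5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
       decomposed = "A\u030A")   # "A" followed by COMBINING RING ABOVE

# Code points of each input, computed with base R only
code_points <- lapply(x, utf8ToInt)

# Assumption: normalizeUnicode() is available in the session
if (exists("normalizeUnicode")) {
  nfc <- normalizeUnicode(x, "NFC")
  print(names(nfc))            # attributes of x are documented to be copied
  print(Encoding(nfc))         # results are documented to be UTF-8
  print(length(unique(nfc)))   # number of distinct strings after normalization
}
```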
Package terrUtils version 6.1.4-13