normalizeUnicode
Normalize Unicode Characters
Description
Convert a vector of strings using one of several
defined types of Unicode normalization.
Usage
normalizeUnicode(x, form = "NFC")
Arguments
x
    a vector of strings.
form
    a character string specifying the type of Unicode normalization
    to be used. Should be one of the strings "NFC", "NFD", "NFKC",
    "NFKD", "NFKC_CF" or "NFKC_Casefold".
Details
Unicode allows multiple character sequences to represent the same
string. For example, the character "capital A with diaeresis" (Ä) can
be represented as the single code point "\u00C4", or as the two code
points "A\u0308" (the letter A followed by a combining diaeresis). The
Unicode standard defines multiple ways to "normalize" a Unicode string
so that different ways of representing a given string map to the same
"canonical form" (see http://unicode.org/reports/tr15/). Normalizing
Unicode strings is necessary in order to consistently compare or sort
strings in languages with accented characters.
Each string is converted to UTF-8 before normalization, and the
resulting strings are all encoded in UTF-8.
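The two representations described above can be inspected with base R alone. The sketch below is illustrative rather than part of the original documentation: the variable names are made up, and the final call assumes that normalizeUnicode is available in the session, so it is guarded.

```r
# Two encodings of "capital A with diaeresis" (variable names are illustrative).
precomposed <- "\u00C4"   # single code point U+00C4
decomposed  <- "A\u0308"  # "A" followed by combining diaeresis U+0308

# Without normalization the two strings do not compare equal.
identical(precomposed, decomposed)   # FALSE
nchar(precomposed, type = "chars")   # 1
nchar(decomposed, type = "chars")    # 2

# The underlying code points differ.
utf8ToInt(precomposed)               # 196
utf8ToInt(decomposed)                # 65 776

# Assumption: the package providing normalizeUnicode() is attached.
# After normalizing both strings to the same form they should match.
if (exists("normalizeUnicode")) {
  print(identical(normalizeUnicode(precomposed, "NFC"),
                  normalizeUnicode(decomposed, "NFC")))
}
```

Normalizing both sides to the same form before comparing is what makes the comparison independent of how the input happened to be composed.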
Value
A vector of strings, with each element of x converted according
to the specified normalization form. Attributes from x are
copied to the output value.
Examples
all.equal(normalizeUnicode('\u212B','NFC'), '\u00C5')
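A few further calls may help show how the forms differ. This is a sketch under two assumptions: that the package providing normalizeUnicode is already attached, and that the function follows the Unicode definitions of each form. The inputs are illustrative and are not part of the original examples.

```r
# Assumption: normalizeUnicode() is available (its package is attached).
if (exists("normalizeUnicode")) {
  # NFD is the decomposed form: a precomposed character becomes a base
  # letter followed by combining marks.
  nfd <- normalizeUnicode("\u00C4", "NFD")
  print(utf8ToInt(nfd))

  # The compatibility forms (NFKC, NFKD) additionally replace
  # compatibility characters such as the "fi" ligature (U+FB01).
  nfkc <- normalizeUnicode("\uFB01", "NFKC")
  print(nfkc)

  # Attributes of x (here, names) are documented as being copied to the
  # result, and the two encodings of the same character should agree
  # after normalization.
  x <- c(a = "\u00C4", b = "A\u0308")
  y <- normalizeUnicode(x, "NFC")
  print(names(y))
  print(identical(y[["a"]], y[["b"]]))
}
```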