Normalize Unicode Characters

Arguments

x	a vector of strings.
form	a character string specifying the Unicode normalization form to use. Must be one of "NFC", "NFD", "NFKC", "NFKD", "NFKC_CF" or "NFKC_Casefold".

Details

Unicode allows multiple character sequences to represent the same string. For example, the character "capital A with diaeresis" (Ä) can be represented as the single code point "\u00C4", or as the two code points "A\u0308" (the letter A followed by a combining diaeresis). The Unicode standard defines several ways to "normalize" a string so that the different representations of a given string map to the same "canonical form" (see http://unicode.org/reports/tr15/). Normalizing Unicode strings is necessary to compare or sort strings consistently in languages with accented characters.
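The equivalence described above can be checked in any language with a Unicode normalization routine. The sketch below uses Python's standard `unicodedata` module rather than the function documented here; `normalize_all` is a hypothetical helper introduced only for illustration, taking a list of strings and a form name in the same spirit as the x and form arguments.

```python
import unicodedata


def normalize_all(strings, form="NFC"):
    """Hypothetical helper: normalize every string in a list to one form."""
    return [unicodedata.normalize(form, s) for s in strings]


composed = "\u00C4"      # capital A with diaeresis as one code point
decomposed = "A\u0308"   # letter A followed by a combining diaeresis

# The two spellings render identically but compare unequal as raw strings.
raw_equal = composed == decomposed

# After normalization to a common form, both spellings are identical.
nfc = normalize_all([composed, decomposed], "NFC")
nfd = normalize_all([composed, decomposed], "NFD")

print(raw_equal, nfc[0] == nfc[1], nfd[0] == nfd[1])
print([len(s) for s in nfc], [len(s) for s in nfd])
```

NFC yields the single composed code point for both inputs, while NFD yields the two-code-point decomposed sequence for both.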

The forms "NFKC_CF" and "NFKC_Casefold" are equivalent; they are described in http://unicode.org/reports/tr31/.
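As a rough illustration of what a compatibility form combined with case folding does, the sketch below again uses Python's standard library instead of the function documented here. Python's `unicodedata` does not expose NFKC_Casefold directly, so `approx_nfkc_casefold` is a hypothetical approximation: it applies NFKC, case-folds, and applies NFKC again, and it does not remove default-ignorable code points as the real mapping does.

```python
import unicodedata


def approx_nfkc_casefold(s):
    """Hypothetical approximation of NFKC_Casefold (assumption: it skips
    the removal of default-ignorable code points)."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    # Case folding can undo normalization, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


ligature = "\uFB01LE"   # LATIN SMALL LIGATURE FI followed by "LE"
plain = "File"

# Canonical forms such as NFC leave the compatibility ligature untouched.
nfc_keeps_ligature = unicodedata.normalize("NFC", "\uFB01") == "\uFB01"

a = approx_nfkc_casefold(ligature)
b = approx_nfkc_casefold(plain)
print(nfc_keeps_ligature, a, b, a == b)
```

Under these assumptions, both inputs map to the same lowercase ASCII string, which is the kind of caseless, compatibility-insensitive matching these forms are intended for.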

Each string is converted to UTF-8 before normalization, and the resulting strings are all encoded as UTF-8.

Value

A vector of strings, with each element of x converted according to the specified normalization form. Attributes from x are copied to the output value.
