Character Mapping
Character maps are used to normalize text data before comparisons are performed. Normalization consists of one or more of:
• Fold letter case: Applies the Unicode Consortium rules for letter case folding to map all alphabetic characters to a common letter case.
• Fold diacritics: Applies the Unicode Consortium rules for stripping letters of their diacritic marks, along with other character normalization.
• Special character mappings: Allows a particular character or class of characters to be mapped to another character. Two character classes are currently defined: whitespace and punctuation. These classes follow the Unicode Consortium definitions, except in the ASCII range, where all non-alphanumeric characters other than the standard whitespace characters are considered punctuation.
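The character class rule can be illustrated with a short Python sketch. This is not the product's implementation: the function name char_class, the choice of which ASCII characters count as "standard" whitespace, and the use of Python's unicodedata categories for non-ASCII characters are all assumptions made for illustration.

```python
import unicodedata

# Assumption: "standard whitespace" in the ASCII range means these six characters.
ASCII_WHITESPACE = " \t\n\r\v\f"

def char_class(ch: str) -> str:
    """Classify one character as 'whitespace', 'punctuation', or 'other'."""
    if ord(ch) < 0x80:
        # ASCII exception: every non-alphanumeric character that is not
        # standard whitespace is treated as punctuation.
        if ch in ASCII_WHITESPACE:
            return "whitespace"
        return "other" if ch.isalnum() else "punctuation"
    # Outside ASCII, defer to Unicode properties.
    if ch.isspace():
        return "whitespace"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "other"

print(char_class("$"))       # ASCII symbol, counted as punctuation here
print(char_class("\u20ac"))  # euro sign: non-ASCII symbol, not punctuation
```

Under these assumptions the dollar sign is classed as punctuation even though Unicode categorizes it as a currency symbol, which is the practical effect of the ASCII exception.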
The precedence of mappings, from lowest to highest, is:
1. letter case folding
2. diacritic folding and character normalization
3. character class mapping
4. explicitly defined mappings
Thus it is possible to override letter case folding, normalization, or character class mappings using explicit character maps. For example, to create a character map that folds all letters to a common letter case except for 'A', and maps all punctuation to blank except for the ampersand, add special mappings that map 'A' to 'A' and '&' to '&'.
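The precedence rules and the override example above can be modelled with a minimal Python sketch. It is an illustration only and does not use or reproduce the product's API: the names map_char, normalize, strip_diacritics, and is_punctuation are hypothetical, diacritic folding is approximated with NFKD decomposition, and the two folding steps are simply composed rather than ranked against each other.

```python
import unicodedata

BLANK = "\u0020"
ASCII_WHITESPACE = " \t\n\r\v\f"  # assumed set of standard whitespace

def strip_diacritics(ch: str) -> str:
    # Approximation: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def is_punctuation(ch: str) -> bool:
    if ord(ch) < 0x80:
        return not ch.isalnum() and ch not in ASCII_WHITESPACE
    return unicodedata.category(ch).startswith("P")

def map_char(ch: str, explicit: dict) -> str:
    # Highest precedence: explicitly defined mappings.
    if ch in explicit:
        return explicit[ch]
    # Next: character class mapping (whitespace and punctuation to blank).
    if ch.isspace() or is_punctuation(ch):
        return BLANK
    # Lowest: diacritic folding, then letter case folding.
    return strip_diacritics(ch).casefold()

def normalize(text: str, explicit: dict) -> str:
    return "".join(map_char(ch, explicit) for ch in text)

# Explicit identity mappings keep 'A' and '&' unchanged.
print(repr(normalize("A&B, Caf\u00e9", {"A": "A", "&": "&"})))  # 'A&b  cafe'
# Without them, 'A' is folded and '&' becomes a blank.
print(repr(normalize("A&B, Caf\u00e9", {})))                    # 'a b  cafe'
```

Checking the explicit mappings first is what lets an identity mapping such as 'A' to 'A' shield a character from the lower-precedence folding and class rules.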
Character maps must be created and assigned a name before they can be used. Once created, they cannot be deleted or updated.
The predefined character maps are:
• DVK_CMAP_STDNAME is the standard character map, applied by default. It maps all forms of whitespace and all punctuation characters except the ampersand to the blank character (code point 0x0020), folds letter case, and folds diacritics (normalizes characters). For the specific list of punctuation code points mapped to the blank character, see Punctuation and Whitespace Code Points Mapped by Built-in Character Maps.
• DVK_CMAP_PUNCTNAME is a punctuation-sensitive character map. It is the same as DVK_CMAP_STDNAME except that punctuation characters are not mapped (they remain as themselves).
To create your own custom character map, use the lkt_create_charmap function. To list all existing character maps, or to check whether a particular character map exists, use the lkt_charmap_list function.