String Representation
TERR and open-source R differ in how they encode strings.
Open-source R allows strings to be represented in any of four encodings:
"unknown" (the default character encoding for the system)
"latin1"
"UTF-8"
"bytes"
TERR currently creates all strings as "unknown" by default. Note that in TERR the "unknown" encoding is hard-wired to use UTF-8, rather than following the system's default character encoding. The functions Encoding and iconv can be used to construct strings with other encodings.
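The following is a minimal sketch of how Encoding and iconv might be used to inspect and change a string's encoding. The sample string and variable names are illustrative only. The byte values in the comments are what open-source R produces; the value reported by Encoding(x) for a freshly created string may differ between TERR and open-source R, so no result is shown for it.

```r
# Build a string containing one non-ASCII character (e with acute accent).
x <- "caf\u00E9"

# Query the declared encoding of the string.
Encoding(x)

# Convert the UTF-8 bytes to latin1. The accented character becomes
# the single byte e9 instead of the two UTF-8 bytes c3 a9.
y <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(y)   # 63 61 66 e9

# Encoding<- relabels a string without changing its bytes.
z <- y
Encoding(z) <- "bytes"
Encoding(z)    # "bytes"
```

iconv changes the underlying bytes, while assigning to Encoding only changes the label attached to them.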
TERR allows Unicode characters to be added to a string using an escape sequence such as "\u30A4" or "\U{30A4}", either of which creates a string containing a single Japanese character. Alternatively, Unicode characters can be typed or pasted directly into a string literal. Exactly which characters can be typed or printed depends on how the console is set up (described below).
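As a short illustration of the two escape forms, the sketch below builds the same one-character string both ways and checks that the results match. The variable names are hypothetical. The code avoids printing the character itself, since what appears on screen depends on the console.

```r
# Two spellings of the same code point, U+30A4.
a <- "\u30A4"
b <- "\U{30A4}"

# Both forms produce the same single-character string.
identical(a, b)   # TRUE
nchar(a)          # 1

# The code point as a decimal integer (hexadecimal 30A4).
utf8ToInt(a)      # 12452
```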
The TERR string-manipulation functions (substring, nchar, paste, and so on) correctly handle UTF-8 strings. Functions for searching strings with regular expressions (regexpr, grep, strsplit, and so on) also correctly handle UTF-8 strings, whether they appear as the data or as the pattern.
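The sketch below shows what character-aware handling looks like in practice, using a hypothetical three-character Japanese string. The results in the comments are those given by open-source R for UTF-8 strings; the text above indicates that TERR treats such strings in the same character-based way.

```r
# Three katakana characters, each occupying three bytes in UTF-8.
s <- "\u30A4\u30F3\u30C9"

# Lengths are counted in characters unless bytes are requested.
nchar(s)                   # 3
nchar(s, type = "bytes")   # 9

# Substring positions refer to characters, not bytes.
tail2 <- substring(s, 2, 3)

# paste combines UTF-8 and ASCII text freely.
joined <- paste(s, "abc", sep = "-")

# A non-ASCII pattern searched in non-ASCII data.
pos <- regexpr("\u30F3", s)          # match starts at character 2
parts <- strsplit(s, "\u30F3")[[1]]  # the pieces before and after the match
```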
TERR implements several functions for converting between strings and integers or raw bytes: intToUtf8, utf8ToInt, charToRaw, and rawToChar.
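A minimal round-trip sketch of these four functions follows, using illustrative values. The explicit Encoding assignment after rawToChar is an assumption carried over from open-source R, where the result of rawToChar is not marked as UTF-8; it may be unnecessary in TERR, where "unknown" strings are already treated as UTF-8.

```r
# Build a string from Unicode code points: "H", "i", and U+30A4.
s <- intToUtf8(c(72, 105, 0x30A4))

# Recover the code points from the string.
codes <- utf8ToInt(s)    # 72 105 12452

# View the UTF-8 bytes: two ASCII bytes, then three bytes for U+30A4.
bytes <- charToRaw(s)    # 48 69 e3 82 a4

# Rebuild a string from the raw bytes and label it as UTF-8.
back <- rawToChar(bytes)
Encoding(back) <- "UTF-8"
nchar(back)              # 3
```

intToUtf8 and utf8ToInt work at the level of code points, while charToRaw and rawToChar work at the level of encoded bytes, which is why the same string has three code points but five bytes.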