String Representation

TERR and open-source R differ in how they encode strings.

Open-source R allows strings to be represented in any of four encodings:

"unknown" (the default character encoding for the system)
"latin1"
"UTF-8"
"bytes"

TERR currently creates all strings with the "unknown" encoding by default; in TERR, "unknown" is hard-wired to mean UTF-8. To construct strings with other encodings, use the functions Encoding and iconv.
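As a minimal sketch of re-encoding with iconv, the following converts a short string to Latin-1 and compares the underlying bytes. The variable names and the sample word are illustrative only, and the value reported by Encoding may differ between TERR and open-source R.

```r
# A string with one non-ASCII character (c with cedilla, U+00E7),
# written as an escape so that the source text itself stays ASCII.
x <- "fa\u00E7ade"

# Re-encode the text from UTF-8 to Latin-1.
y <- iconv(x, from = "UTF-8", to = "latin1")

# Inspect the declared encoding of each string.
Encoding(x)
Encoding(y)

# Same text, different bytes: the cedilla takes two bytes in UTF-8
# (c3 a7) and one byte in Latin-1 (e7).
charToRaw(x)
charToRaw(y)
```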

In TERR, you can add a Unicode character to a string with an escape sequence: for example, "\u30A4" or "\U{30A4}" creates a string containing a single Japanese character. Alternatively, you can type or paste Unicode characters directly into a string. Exactly which characters can be typed or printed depends on how the console is set up (described below).
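A short illustration of the two escape forms follows. The variable names are arbitrary, and how the character is displayed depends on the console.

```r
# Katakana letter I (U+30A4), written with each escape form.
a <- "\u30A4"
b <- "\U{30A4}"

identical(a, b)   # TRUE: both escapes denote the same character
nchar(a)          # 1: a single character
charToRaw(a)      # e3 82 a4: three bytes in UTF-8
```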

The TERR string-manipulation functions (substring, nchar, paste, and so on) correctly handle UTF-8 strings. Functions for searching strings with regular expressions (regexpr, grep, strsplit, and so on) correctly handle UTF-8 strings, whether as the data or as the pattern.
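The sketch below shows what character-aware handling looks like for a three-character katakana string. The sample string and variable names are illustrative, not taken from the TERR documentation.

```r
# Three katakana characters (U+30A4, U+30ED, U+30CF); nine bytes in UTF-8.
s <- "\u30A4\u30ED\u30CF"

nchar(s)                         # 3: counts characters, not bytes
nchar(s, type = "bytes")         # 9
mid <- substring(s, 2, 2)        # the second character, U+30ED
both <- paste(s, s, sep = "")    # six characters

# A non-ASCII pattern matched against non-ASCII data.
pos <- regexpr("\u30ED", s)              # match begins at character 2
parts <- strsplit(s, "\u30ED")[[1]]      # the pieces on either side
hit <- grep("\u30CF$", s)                # 1: the first element matches
```

Positions and lengths are reported in characters, so the results do not depend on how many bytes each character occupies.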

TERR implements several functions for converting between strings and integers or raw bytes: intToUtf8, utf8ToInt, charToRaw, and rawToChar.
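As a rough sketch, the four functions pair up as round trips: utf8ToInt and intToUtf8 work with Unicode code points, while charToRaw and rawToChar work with the underlying bytes. The variable names are illustrative.

```r
# String -> code point -> string
cp <- utf8ToInt("\u30A4")        # 12452, that is, 0x30A4
ch <- intToUtf8(cp)              # the one-character string again

# String -> bytes -> string
bytes <- charToRaw("\u30A4")     # e3 82 a4
back <- rawToChar(bytes)         # a string holding the same three bytes

# intToUtf8 also accepts a vector of code points.
word <- intToUtf8(c(72, 105))    # "Hi"
```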