Encoding
String Encodings of a Character Vector

Description

Reads or sets the string encodings for the elements of a character vector.

Usage

Encoding(x)
Encoding(x) <- value

Arguments

x a character vector.
value a character vector specifying the new string encodings to be assigned to the elements of x. Each encoding is specified as one of the following four strings:
  • "unknown"
  • "latin1"
  • "UTF-8"
  • "bytes"
Any other string is treated the same as "unknown".
  • If value has length zero, an error is generated.
  • If value is shorter than x, it is repeated as necessary.
  • If value is longer than x, its extra elements are ignored.
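
The length rules above can be sketched as follows. This is a minimal illustration, not part of the original reference: the variable names (x, e1, e2) are hypothetical, and the strings are deliberately built with a non-ASCII byte because open-source R never marks pure-ASCII strings with a declared encoding.

```r
# Three strings built from raw bytes; each contains the byte 0xE9,
# which is a single accented character when read as latin1.
x <- c(rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9))),
       rawToChar(as.raw(c(0xE9, 0x74, 0xE9))),
       rawToChar(as.raw(c(0x6E, 0xE9))))

# value is shorter than x: it is repeated, so the third element
# receives the first encoding again.
Encoding(x) <- c("latin1", "bytes")
e1 <- Encoding(x)
print(e1)

# value is longer than x: the fourth element is ignored.
Encoding(x) <- c("latin1", "latin1", "latin1", "bytes")
e2 <- Encoding(x)
print(e2)
```

A zero-length value, such as character(0), is the one case that is rejected outright.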

Details

In memory, a string is just a sequence of bytes. A string encoding specifies how these bytes are interpreted as characters. Single-byte encodings (such as "latin1") interpret each byte as one character, whereas multibyte encodings (such as "UTF-8") use more than one byte for some characters and can therefore represent many more characters.
Each element of a character vector has one of the four encodings described below. Different elements in the same character vector can have different encodings. The function Encoding retrieves the encodings for the elements of a character vector. These encodings are interpreted as follows:
"unknown" This is the default encoding used for most strings. In TIBCO Enterprise Runtime for R, the "unknown" encoding is defined to use the "UTF-8" character set, whereas open-source R uses the "native" encoding for this computer (see getNativeEncoding). This encoding is also used for representing strings with any arbitrary encoding, which can be created by iconv.
"latin1" Strings encoded with the "latin1" (or "ISO-8859-1") character set, where each character is represented with one byte.
"UTF-8" Strings encoded with the "UTF-8" character set, where each character is represented with a sequence of from one to four bytes. This encoding can represent a wide range of Unicode characters.
"bytes" Strings to be treated as a sequence of raw bytes, without any special encoding. Some string functions generate an error when given a "bytes" string; for example, nchar(some.bytes.string, type='chars'). Some functions that have a useBytes argument, such as grep, act as if useBytes=TRUE when processing a string with "bytes" encoding. Strings with "bytes" encoding are printed a little differently: non-alphabetic bytes are printed as \xXX escape sequences.
The function Encoding<- sets the encodings for the string elements, without changing the bytes used to represent these strings (which can be examined with charToRaw). Therefore, assigning an encoding that does not match the underlying bytes can make a string display and behave incorrectly. Use iconv to convert the bytes from one string encoding to another. Encoding<- changes no attributes of x, so it can be used to change the encodings of elements of a matrix.
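
The difference between marking a string and converting it can be sketched as follows. This is a minimal illustration under the behavior described above; the variable names are hypothetical, and the exact error message captured in res is implementation-specific.

```r
# Start from three raw bytes; 0xC4 is a capital A with diaeresis in latin1.
x <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
before <- charToRaw(x)

# Marking the string as latin1 leaves the bytes untouched.
Encoding(x) <- "latin1"
after <- charToRaw(x)
unchanged <- identical(before, after)
print(unchanged)

# iconv, by contrast, produces new bytes for the target encoding.
y <- iconv(x, from = "latin1", to = "UTF-8")
y_bytes <- charToRaw(y)
print(y_bytes)

# Marking a copy as 'bytes': counting bytes works, whereas counting
# characters is expected to fail, so that call is wrapped in tryCatch.
b <- x
Encoding(b) <- "bytes"
n_bytes <- nchar(b, type = "bytes")
res <- tryCatch(nchar(b, type = "chars"),
                error = function(e) conditionMessage(e))
print(res)

# Attributes such as dim survive the assignment, so a matrix stays a matrix.
m <- matrix(rep(rawToChar(as.raw(c(0x61, 0xC4, 0x61))), 4), nrow = 2)
Encoding(m) <- "latin1"
print(dim(m))
```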

Value

Encoding(x) returns a character vector the same length as x, giving the string encoding of each element of x. This result does not include any of the attributes of x.
Encoding(x) <- value returns value.

String Comparison

The functions that compare strings, such as == and identical, compare the characters of the strings, rather than the bytes. If two strings do not have the same encoding, they are converted to sequences of Unicode characters, and then these sequences are compared.
This behavior might be counter-intuitive: even though two strings might differ according to Encoding and charToRaw, the function identical might report that they are equal. One consequence is that, in most cases, the programmer does not need to worry about differing string encodings. That is, code can compare a string with a constant string "foo" without concern for how the string was encoded.
One exception to this comparison rule is for strings with encoding 'bytes'. You can use functions such as == or identical to compare the bytes of two strings with 'bytes' encoding, but these functions treat a string with encoding 'bytes' as unequal to (and less than) any string with a different encoding.
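
The comparison rules above, including the 'bytes' exception, can be sketched as follows. This is a minimal illustration assuming the behavior described in this section; the variable names are hypothetical.

```r
# The same three characters stored in two different encodings.
x <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(x) <- "latin1"
y <- rawToChar(as.raw(c(0x61, 0xC3, 0x84, 0x61)))
Encoding(y) <- "UTF-8"

# Compared as characters the strings match, although their bytes differ.
same_chars <- identical(x, y)
same_bytes <- identical(charToRaw(x), charToRaw(y))
print(c(same_chars, same_bytes))

# A 'bytes' copy of x holds exactly the same bytes as x, yet it is
# treated as unequal to the latin1-marked original.
b <- x
Encoding(b) <- "bytes"
bytes_vs_latin1 <- identical(b, x)

# Two 'bytes' strings are compared byte by byte.
b2 <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(b2) <- "bytes"
bytes_vs_bytes <- identical(b, b2)
print(c(bytes_vs_latin1, bytes_vs_bytes))
```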

Differences between TIBCO Enterprise Runtime for R and Open-source R

General warning: TIBCO Enterprise Runtime for R might not choose the same string encoding as open-source R in all cases. Indeed, open-source R does not use the same string encoding consistently between different platforms and locales. TIBCO Enterprise Runtime for R was designed to be more consistent between different platforms.

See Also

charToRaw, getConsoleEncoding, getNativeEncoding, iconv, grep, nchar, getValidUtf8.

Examples

# create a latin1-encoded string
x <- rawToChar(as.raw(c(0x61,0xC4,0x61)))
Encoding(x) <- 'latin1'
# create a UTF-8-encoded string
y <- rawToChar(as.raw(c(0x61,0xC3,0x84,0x61)))
Encoding(y) <- 'UTF-8'
# read encodings
Encoding(c(x,y))
# both strings are considered equal, though they have different bytes
x == y
Package base version 6.0.0-69