String Encodings of a Character Vector

x

a character vector.

value

a character vector specifying the new string encodings to be assigned to the elements of x. Each encoding is specified as one of the following four strings:

"unknown"
"latin1"
"UTF-8"
"bytes"

Any other string is treated the same as "unknown".

If value has length zero, an error is generated.
If value is shorter than x, it is repeated as necessary.
If value is longer than x, its extra elements are ignored.
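As a rough illustration of the rules above, the following sketch assigns values of different lengths to short character vectors. The sample strings are invented for the example; they contain non-ASCII bytes because open-source R leaves pure-ASCII strings marked as "unknown". The results noted in the comments are what open-source R is expected to produce.

```r
# Three strings written as Latin-1 bytes (sample data for illustration).
x <- c("fa\xE7ile", "\xE9t\xE9", "na\xEFve")

# value is shorter than x, so it is repeated: latin1, bytes, latin1.
Encoding(x) <- c("latin1", "bytes")
enc_recycled <- Encoding(x)

# value is longer than x, so the extra element is ignored.
y <- c("fa\xE7ile", "\xE9t\xE9")
Encoding(y) <- c("latin1", "latin1", "UTF-8")
enc_truncated <- Encoding(y)

# An unrecognized encoding name is treated the same as "unknown".
z <- "fa\xE7ile"
Encoding(z) <- "latin1"
Encoding(z) <- "no-such-encoding"
enc_other <- Encoding(z)

# A zero-length value generates an error.
zero_length <- tryCatch({ Encoding(z) <- character(0); "no error" },
                        error = function(e) "error")
```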

Details

Each element of a character vector has one of the four encodings, described below. Different elements in the same character vector can have different encodings. The function Encoding retrieves the encodings for the elements of a character vector. These encodings are interpreted as follows:

	"unknown"	This is the default encoding used for most strings. In TIBCO Enterprise Runtime for R, the 'unknown' encoding is defined to use the "UTF-8" character set, whereas open-source R uses the "native" encoding for this computer (see getNativeEncoding). This encoding is also used for representing strings with any arbitrary encoding, which can be created by iconv.
	"latin1"	Strings encoded with the "latin1" (or "ISO-8859-1") character set, where each character is represented with one byte.
	"UTF-8"	Strings encoded with the "UTF-8" character set, where each character is represented with a sequence of from one to four bytes. This encoding can represent a wide range of Unicode characters.
	"bytes"	Strings to be considered as a sequence of raw bytes, without any special encoding. Some string functions generate an error when given a 'bytes' string, for example nchar(some.bytes.string, type='chars'). Functions that have a useBytes argument, such as grep, act as if useBytes=TRUE when processing a string with 'bytes' encoding. Strings with 'bytes' encoding are printed a little differently: non-alphabetic bytes are printed as \xXX escape sequences.
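The byte counts behind these descriptions can be checked with nchar. This is a minimal sketch using an invented sample word; it assumes that iconv marks its result as "UTF-8", which is the default behavior in open-source R.

```r
# The same word, first as Latin-1 bytes and then converted to UTF-8.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
u <- iconv(x, from = "latin1", to = "UTF-8")

bytes_latin1 <- nchar(x, type = "bytes")  # one byte per character
bytes_utf8   <- nchar(u, type = "bytes")  # c-cedilla needs two bytes in UTF-8
chars_utf8   <- nchar(u, type = "chars")  # still six characters

# Mark a copy as raw bytes; the bytes themselves are unchanged.
xb <- x
Encoding(xb) <- "bytes"
bytes_raw <- nchar(xb, type = "bytes")

# Counting characters in a 'bytes' string is expected to fail.
chars_raw <- tryCatch(nchar(xb, type = "chars"),
                      error = function(e) "error")

print(xb)  # the E7 byte is shown as a \xXX escape sequence
```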

The function Encoding<- sets the encodings for the string elements without changing the bytes used to represent these strings (which can be examined with charToRaw). Therefore, a string can display or behave incorrectly if it is assigned an encoding that does not match its bytes. Use iconv to convert the bytes from one string encoding to another. Encoding<- changes no attributes of x, so it can be used to change the encodings of the elements of a matrix.
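The difference between relabeling a string and converting it can be seen by inspecting the bytes with charToRaw. The sketch below uses invented sample strings and is not tied to any particular platform.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
raw_latin1 <- charToRaw(x)

# Relabeling: the encoding mark changes, the bytes do not.
relabeled <- x
Encoding(relabeled) <- "bytes"
raw_relabeled <- charToRaw(relabeled)

# Converting: iconv rewrites the bytes for the target encoding.
converted <- iconv(x, from = "latin1", to = "UTF-8")
raw_converted <- charToRaw(converted)

# Attributes such as dim are kept when Encoding<- is applied to a matrix.
m <- matrix(c("fa\xE7ile", "\xE9t\xE9", "na\xEFve", "\xFCber"), nrow = 2)
Encoding(m) <- "latin1"
dim_after <- dim(m)
```

Here raw_latin1 and raw_relabeled hold the same six bytes, while raw_converted holds seven bytes because the c-cedilla is stored as C3 A7 in UTF-8.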

The functions that compare strings, such as ==, identical, all.equal, match, duplicated, and sort, compare the characters of the strings rather than the bytes. If two strings do not have the same encoding, they are converted to sequences of Unicode characters, and then these sequences are compared.

This behavior might be counter-intuitive: even though two strings might differ according to Encoding and charToRaw, the function identical might report that they are equal. One consequence is that the programmer does not need to worry about different string encodings in most cases. That is, code can compare a string with a constant string "foo" without concern for how the string was encoded.
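A small sketch of this behavior, using an invented sample word stored once as "latin1" and once as "UTF-8":

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
u <- iconv(x, from = "latin1", to = "UTF-8")

# The two strings differ in both encoding mark and bytes ...
same_encoding <- identical(Encoding(x), Encoding(u))    # FALSE
same_bytes    <- identical(charToRaw(x), charToRaw(u))  # FALSE

# ... but they compare as equal, because the characters are the same.
equal_op    <- x == u          # TRUE
equal_ident <- identical(x, u) # TRUE
position    <- match(x, u)     # 1
```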

One exception to this comparison rule is for strings with encoding 'bytes'. You can use functions such as == or identical to compare the bytes of two strings with 'bytes' encoding, but these functions treat a string with encoding 'bytes' as unequal to (and less than) any string with a different encoding.
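The following sketch compares 'bytes' strings with each other and with a 'latin1' string. The sample strings are invented. The mixed comparison is wrapped in tryCatch because, as described in this topic, some versions of open-source R generate an error in that case; no particular result is assumed for it.

```r
a <- "fa\xE7ile"; Encoding(a) <- "bytes"
b <- "fa\xE7ile"; Encoding(b) <- "bytes"
d <- "fa\xE8ile"; Encoding(d) <- "bytes"

# Two 'bytes' strings are compared byte by byte.
same_bytes_ident <- identical(a, b)  # same bytes
same_bytes_eq    <- a == b           # same bytes
diff_bytes_eq    <- a == d           # one byte differs

# A 'bytes' string compared with a string in another encoding.
# Depending on the engine and version, this yields FALSE or an error.
x <- "fa\xE7ile"; Encoding(x) <- "latin1"
mixed <- tryCatch(a == x, error = function(e) NA)
```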

General warning: TIBCO Enterprise Runtime for R might not choose the same string encoding as open-source R in all cases. Indeed, open-source R does not use the same string encoding consistently between different platforms and locales. TIBCO Enterprise Runtime for R was designed to be more consistent between different platforms.

TIBCO Enterprise Runtime for R always uses the UTF-8 character set for the "unknown" encoding, whereas open-source R uses the native string encoding for the particular OS where it is running, perhaps because it relies on native OS routines that depend on that encoding.
In open-source R, Encoding<- does not change a string's encoding in some cases. For example, if a string consists entirely of ASCII characters, its encoding remains "unknown". In TIBCO Enterprise Runtime for R, Encoding<- always changes the encoding.
In open-source R, comparing a string with 'bytes' encoding to a string with 'latin1' or 'UTF-8' encoding gives an error. Comparing xb (a 'bytes' string) with xu (an 'unknown' string) does not give an error, but the results are inconsistent: in some cases, xb==xu, xb<xu, and xb>xu all return FALSE. TIBCO Enterprise Runtime for R is more consistent: any 'bytes' string is unequal to, and sorts as less than, any non-'bytes' string.
TIBCO Enterprise Runtime for R supports automatically converting string encodings on input and output to the console. See getConsoleEncoding. Open-source R does not support this.
TIBCO Enterprise Runtime for R provides the function getNativeEncoding for reading the name of the "native" encoding. Open-source R does not support this.
It is possible to construct strings with the "UTF-8" encoding that are not valid UTF-8 byte sequences. In open-source R, attempting to manipulate these strings can give errors such as "invalid multibyte string". TIBCO Enterprise Runtime for R internally converts any string to a valid UTF-8 string when doing string manipulation, so this error should not occur in TIBCO Enterprise Runtime for R. Note that TIBCO Enterprise Runtime for R does not actually change the bytes in an invalid UTF-8 string (as viewed with charToRaw), but rather converts the invalid UTF-8 bytes to a valid UTF-8 sequence when needed (see getValidUtf8).
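Two of the differences listed above can be probed with a short script. This is only a sketch: the sample strings are invented, and validUTF8 is a base function of open-source R (version 3.3.0 and later) whose availability in TIBCO Enterprise Runtime for R is an assumption, so the call is guarded with exists.

```r
# A pure-ASCII string: open-source R is described above as keeping
# "unknown", while TIBCO Enterprise Runtime for R applies the new mark.
a <- "abc"
Encoding(a) <- "latin1"
ascii_mark <- Encoding(a)

# Latin-1 bytes mislabeled as UTF-8 do not form a valid UTF-8 sequence.
bad <- "fa\xE7ile"
Encoding(bad) <- "UTF-8"
raw_bad <- charToRaw(bad)  # still the original six Latin-1 bytes

# validUTF8 is assumed to exist only in open-source R; NA otherwise.
is_valid <- if (exists("validUTF8")) validUTF8(bad) else NA
```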

Encoding(x)	returns a character vector the same length as x, giving the string encoding of each element of x. This result does not include any of the attributes of x.
Encoding(x) <- value	returns value.
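As an illustration of these return values, the sketch below uses a named character vector with invented sample strings. The names are an attribute of x and are therefore not expected on the result of Encoding(x).

```r
x <- c(first = "fa\xE7ile", second = "\xE9t\xE9")
Encoding(x) <- "latin1"   # the names stay on x itself
enc <- Encoding(x)        # plain character vector, same length as x

# The value of the assignment expression is the right-hand side.
y <- c("fa\xE7ile", "\xE9t\xE9")
assigned <- (Encoding(y) <- "latin1")
```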
