Encoding
String Encodings of a Character Vector
Description
Reads or sets the string encodings for the elements of a character vector.
Usage
Encoding(x)
Encoding(x) <- value
Arguments
x | a character vector. |
value |
a character vector specifying the new string encodings
to be assigned to the elements of x.
Each encoding is specified as one of the following four strings:
- "unknown"
- "latin1"
- "UTF-8"
- "bytes"
Any other string is treated the same as "unknown".
- If value has length zero, an error is generated.
- If value is shorter than x, it is repeated as necessary.
- If value is longer than x, its extra elements are ignored.
|
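The length rules for value can be illustrated with a short sketch. The example strings each contain a non-ASCII byte, because open-source R leaves pure-ASCII strings at "unknown" (see the differences section); the variable names are chosen for illustration only.

```r
# Three strings that each contain the non-ASCII byte 0xC4.
x <- c(rawToChar(as.raw(c(0x61, 0xC4))),
       rawToChar(as.raw(c(0xC4, 0x62))),
       rawToChar(as.raw(0xC4)))

# A single value is repeated for every element of x.
Encoding(x) <- "latin1"
enc_single <- Encoding(x)

# A shorter value is recycled: latin1, bytes, latin1.
Encoding(x) <- c("latin1", "bytes")
enc_recycled <- Encoding(x)

# A longer value is truncated: the fourth entry is ignored.
Encoding(x) <- c("latin1", "latin1", "latin1", "bytes")
enc_truncated <- Encoding(x)

# A zero-length value signals an error, caught here for inspection.
zero_length <- tryCatch({ Encoding(x) <- character(0); "no error" },
                        error = function(e) "error")
```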
Details
In memory, a string is just a sequence of bytes.
A string encoding specifies how these bytes are interpreted as characters.
Some encodings (such as "latin1") interpret each byte as a single character,
whereas other encodings (such as "UTF-8") use multiple bytes to specify some
characters (therefore they can represent many more characters).
Each element of a character vector has one of the four encodings, described below.
Different elements in the same character vector can have different encodings.
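As a rough illustration of the difference between bytes and characters, the sketch below stores the same character (A with diaeresis, U+00C4) once as a single "latin1" byte and once as a two-byte "UTF-8" sequence. The variable names are illustrative only.

```r
# U+00C4 as one latin1 byte.
a_latin1 <- rawToChar(as.raw(0xC4))
Encoding(a_latin1) <- "latin1"

# U+00C4 as a two-byte UTF-8 sequence.
a_utf8 <- rawToChar(as.raw(c(0xC3, 0x84)))
Encoding(a_utf8) <- "UTF-8"

bytes_latin1 <- length(charToRaw(a_latin1))    # 1 byte
bytes_utf8   <- length(charToRaw(a_utf8))      # 2 bytes
chars_utf8   <- nchar(a_utf8, type = "chars")  # 1 character

# One character vector can hold elements with different encodings.
mixed <- c(a_latin1, a_utf8)
Encoding(mixed)
```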
The function Encoding retrieves the encodings for the elements
of a character vector.
These encodings are interpreted as follows:
| "unknown" | This is the default encoding used for most
strings. In TIBCO Enterprise Runtime for R, the 'unknown' encoding is defined to use the
"UTF-8" character set, whereas open-source R uses the "native" encoding
for this computer (see getNativeEncoding).
This encoding is also used for representing strings with any arbitrary
encoding, which can be created by iconv. |
| "latin1" | Strings encoded with the "latin1" (or "ISO-8859-1")
character set, where each character is represented with one byte. |
| "UTF-8" | Strings encoded with the "UTF-8" character set, where
each character is represented with a sequence of from one to four
bytes. This encoding can represent a wide range of Unicode characters. |
| "bytes" | Strings to be considered as a sequence of raw bytes,
without any special encoding. Some string functions will cause an
error when given a 'bytes' string, such as
nchar(some.bytes.string, type='chars'). Some functions such as
grep with a useBytes argument will act as if
useBytes=TRUE when processing a string with 'bytes' encoding.
Strings with 'bytes' encoding are printed a little differently:
non-alphabetic bytes are printed as \xXX escape sequences. |
|
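The "bytes" behavior described above can be sketched as follows. The error from nchar is caught with tryCatch so that the example runs to completion; the exact error message and printed form may differ between engines, so neither is shown.

```r
# A three-byte string marked as raw bytes.
b <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(b) <- "bytes"

# Counting bytes works for a 'bytes' string.
n_bytes <- nchar(b, type = "bytes")

# Counting characters is expected to signal an error, caught here.
chars_result <- tryCatch(nchar(b, type = "chars"),
                         error = function(e) "error")

# Printing shows the non-ASCII byte 0xC4 as an escape sequence.
print(b)
```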
The function Encoding<- sets the encodings for the string elements
without changing the bytes used to represent these strings
(which can be examined with charToRaw).
It is therefore possible to make a string display and behave incorrectly
by assigning it an encoding that does not match its bytes.
Use iconv to convert the bytes from one string encoding to another.
Encoding<- does not change any attributes of x,
so it can be used to change the encodings of the elements of a matrix.
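The difference between relabeling a string with Encoding<- and converting it with iconv can be sketched as follows. This is a minimal illustration; the variable names are not part of any API.

```r
# Start with the latin1 bytes for "a", U+00C4, "a".
x <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(x) <- "latin1"
original_bytes <- charToRaw(x)

# Relabeling changes only the declared encoding; the bytes stay the same,
# so the string is now an invalid UTF-8 sequence.
relabeled <- x
Encoding(relabeled) <- "UTF-8"
charToRaw(relabeled)

# Converting with iconv rewrites the bytes for the target encoding.
converted <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(converted)   # 61 c3 84 61
Encoding(converted)    # "UTF-8"
```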
Value
Encoding(x) | returns a character vector the same length as x,
giving the string encoding of each element of x.
This result does not include any of the attributes of x. |
Encoding(x) <- value | returns value. |
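The treatment of attributes can be checked with a small matrix. In this sketch, the matrix keeps its dimensions after the assignment, while the result of Encoding is a plain vector.

```r
# A 2 x 2 character matrix whose elements contain the byte 0xC4.
m <- matrix(rep(rawToChar(as.raw(0xC4)), 4), nrow = 2)

# The assignment expression itself evaluates to the assigned value.
returned <- (Encoding(m) <- "latin1")

dim(m)              # dimensions are preserved: 2 2
enc <- Encoding(m)  # plain character vector of length 4
is.null(dim(enc))   # TRUE: the result carries no dim attribute
```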
String Comparison
The functions that compare strings, such as ==, identical, all.equal,
match, duplicated, and sort, compare the characters of the strings
rather than the bytes. If two strings do not have the same encoding,
they are converted to sequences of Unicode characters, and then these
sequences are compared.
This behavior might be counter-intuitive: even though two strings might
differ according to Encoding and charToRaw, the function identical might
report that they are equal. One result of this is that the programmer
does not need to worry about differing string encodings in most cases.
That is, code can compare a string with a constant string "foo" without
concern for how the string was encoded.
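A minimal sketch of this comparison rule, using the same two strings as the Examples section: they differ in bytes and in declared encoding, yet the comparison functions treat them as the same string.

```r
# "a", U+00C4, "a" encoded as latin1 (3 bytes) and as UTF-8 (4 bytes).
x <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(x) <- "latin1"
y <- rawToChar(as.raw(c(0x61, 0xC3, 0x84, 0x61)))
Encoding(y) <- "UTF-8"

# The bytes and the declared encodings differ ...
same_bytes <- identical(charToRaw(x), charToRaw(y))  # FALSE
same_enc   <- identical(Encoding(x), Encoding(y))    # FALSE

# ... but character-level comparison treats the strings as equal.
same_string <- identical(x, y)       # TRUE
equal_op    <- x == y                # TRUE
pos         <- match(x, y)           # 1
dup         <- duplicated(c(x, y))   # FALSE TRUE
```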
One exception to this comparison rule is for strings with encoding
'bytes'. You can use functions such as == or identical
to compare the bytes of two strings with 'bytes' encoding, but these
functions treat a string with encoding 'bytes' as unequal to (and less
than) any string with a different encoding.
Differences between TIBCO Enterprise Runtime for R and Open-source R
General warning: TIBCO Enterprise Runtime for R might not
choose the same string encoding as open-source R in all cases. Indeed,
open-source R does not use the same string encoding consistently between
different platforms and locales. TIBCO Enterprise Runtime for R was
designed to be more consistent between different platforms.
- TIBCO Enterprise Runtime for R always uses the UTF-8 character set for the "unknown"
encoding, whereas open-source R uses the native string encoding for
the particular OS where it is running, perhaps because it relies on
native OS routines that depend on that encoding.
- In open-source R, Encoding<- will not change a string's encoding in
some cases. For example, if a string consists entirely of ASCII
characters, its encoding appears to remain "unknown". In
TIBCO Enterprise Runtime for R, Encoding<- will always change the encoding.
- In open-source R, comparing a string with 'bytes' encoding to a string with
'latin1' or 'UTF-8' encoding gives an error. Comparing xb (a
'bytes' string) with xu (an 'unknown' string) does not give an
error, but it behaves inconsistently: in some cases, xb==xu, xb<xu,
and xb>xu all return FALSE. TIBCO Enterprise Runtime for R is more consistent: any
'bytes' string is not equal to and sorts as less than any non-'bytes'
string.
- TIBCO Enterprise Runtime for R supports automatically converting string encodings on input and
output to the console. See getConsoleEncoding. Open-source R does not support this.
- TIBCO Enterprise Runtime for R provides the function getNativeEncoding for
reading the name of the "native" encoding. Open-source R does not support this.
- It is possible to construct strings with the "UTF-8" encoding that are
not valid UTF-8 byte sequences. In open-source R, attempting to manipulate these
strings can give errors such as "invalid multibyte string". TIBCO Enterprise Runtime for R
internally converts any string to a valid UTF-8 string when doing
string manipulation, so this error should not occur in TIBCO Enterprise Runtime for R. Note
that TIBCO Enterprise Runtime for R does not actually change the bytes in an invalid UTF-8
string (as viewed with charToRaw), but rather converts
the invalid UTF-8 bytes to a valid UTF-8 sequence when needed (see
getValidUtf8).
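Two of the differences above can be probed with a short, engine-neutral sketch. The result of the first check depends on which engine runs the code, so no single expected output is shown; per the text, open-source R is expected to report "unknown" and TIBCO Enterprise Runtime for R "latin1".

```r
# An ASCII-only string: the reported encoding after assignment is
# engine-dependent (see the list above).
s <- "abc"
Encoding(s) <- "latin1"
enc_ascii <- Encoding(s)

# A string marked "UTF-8" whose bytes are not a valid UTF-8 sequence
# (0xC4 is a lead byte that is not followed by a continuation byte).
bad <- rawToChar(as.raw(c(0x61, 0xC4, 0x61)))
Encoding(bad) <- "UTF-8"

# Marking the string does not alter its bytes.
bad_bytes <- charToRaw(bad)
```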
See Also
Examples
# create a latin1-encoded string
x <- rawToChar(as.raw(c(0x61,0xC4,0x61)))
Encoding(x) <- 'latin1'
# create a UTF-8-encoded string
y <- rawToChar(as.raw(c(0x61,0xC3,0x84,0x61)))
Encoding(y) <- 'UTF-8'
# read encodings
Encoding(c(x,y))
# both strings are considered equal, though they have different bytes
x == y