iconv
Convert Character Vector between Encodings
Description
Converts the strings in a character vector from one string encoding to
another.
Usage
iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
iconvlist()
Arguments
x |
a character vector. It can also be a list of raw vectors and
NULL values (see the toRaw argument).
|
from |
a string specifying the string encoding used for reading the elements
of x. If this is the empty string "" or the string
"native.enc", the native encoding is used (see
getNativeEncoding).
|
to |
a string specifying the string encoding that the elements of
x should be converted to.
- If to = "" (an empty string -- the default), or if to = "native.enc", then
the native encoding is used (see getNativeEncoding).
- If to = "ASCII//TRANSLIT", then iconv
attempts to transliterate non-ASCII characters to the appropriate ASCII
sequences.
- If to = "ASCII//IGNORE", then iconv drops any non-ASCII characters.
- If to = "ASCII//TRANSLIT//IGNORE", then iconv tries transliterating non-ASCII
characters and drops any that cannot be converted.
|
sub |
a string or NA value controlling what happens if some of the
characters cannot be converted into the "to" encoding.
- If sub is NA, then any string containing characters that cannot be
converted is converted to NA.
- If sub is a string, then this string is
substituted for any character that cannot be converted.
- If sub is the string "Unicode", then any character that cannot be converted
is turned into a string specifying its Unicode code point, such as
"<U+30A4>".
|
mark |
a logical value.
- If mark is TRUE (the default), then the
converted strings are marked with encoding "latin1" or
"UTF-8" if the to encoding is one of these encodings
(or an alias for these encodings, as determined by
getEncodingAliases).
- If mark is FALSE,
or if the to encoding is not one of these encodings, the
output strings have encoding "unknown".
|
toRaw |
a logical value.
- If toRaw is TRUE, then the converted
strings are output as a list containing raw vectors (with the bytes
from the converted strings) and NULL values (representing NA
strings), and the mark argument is ignored.
- If toRaw is
FALSE, then the converted strings are output as a vector of strings.
|
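Because the default sub = NA turns any string that cannot be converted into NA, it is often useful to find out which elements failed. The following is a minimal sketch rather than part of the reference itself; the variable names are illustrative, and it assumes the input strings are valid UTF-8.

```r
# Illustrative input: an ASCII string, a string with a non-ASCII character, and an NA.
x <- c("plain", "caf\u00E9", NA)

# With the default sub = NA, a string that cannot be converted becomes NA.
conv <- iconv(x, from = "UTF-8", to = "ASCII")

# Elements that were not NA on input but are NA on output failed to convert.
failed <- is.na(conv) & !is.na(x)
x[failed]
```

Comparing against is.na(x) keeps genuine missing values from being reported as conversion failures.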
Details
iconv converts a character vector from one string encoding to
another. It does not use the encoding information associated with the
elements of x, as set by Encoding. Instead, it interprets the raw bytes
of the elements of x according to the from encoding, and converts these
to the to encoding.
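As a hedged illustration of the paragraph above, the sketch below attaches a deliberately misleading encoding mark to a string and shows that the conversion still follows the from argument. The byte values and variable names are chosen for illustration only.

```r
# Build the latin1 bytes for "fa?ade" with c-cedilla (e7 in latin1) in place of "?".
bytes <- as.raw(c(0x66, 0x61, 0xe7, 0x61, 0x64, 0x65))
x <- rawToChar(bytes)

# Deliberately attach a misleading encoding mark.
Encoding(x) <- "UTF-8"

# The bytes are read as latin1 because of 'from', regardless of the mark.
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)  # c-cedilla becomes the two UTF-8 bytes c3 a7
```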
Note: iconv ignores upper and lower case differences in encoding names,
as well as all characters other than letters and digits. Thus, even
though the value returned by iconvlist includes "LATIN1", you could
specify the same encoding with "latin1" or "Latin1" or "latin-1".
The function isValidEncoding returns TRUE for any encoding string that
is accepted by iconv.
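The name matching described in this note can be checked with a short sketch such as the one below. It is illustrative only: the punctuated spelling is wrapped in tryCatch because some implementations may reject it, so the code makes no assumption about whether that call succeeds.

```r
x <- "a\u00C4b"

# Spell the same encodings with different letter case.
lower <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)
upper <- iconv(x, from = "utf-8", to = "LATIN1", toRaw = TRUE)
identical(lower, upper)

# A spelling with punctuation may be rejected by some implementations,
# so guard the call instead of assuming it succeeds.
hyphen <- tryCatch(
  iconv(x, from = "UTF-8", to = "latin-1", toRaw = TRUE),
  error = function(e) NULL
)
is.null(hyphen)  # TRUE if the punctuated name was rejected with an error
```

Using toRaw = TRUE here compares only the converted bytes, so differences in encoding marks do not affect the comparison.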
Some conversions cause an error because the converted value
has embedded zero bytes, which are not permitted in strings. For
example, the string "abc" cannot be converted from "latin1" to
"UTF-16" because in "latin1" each character in "abc" is represented
by a single byte, but in "UTF-16" each character is represented by two
bytes, one of which is zero. String encodings such as "UTF-16" are
primarily useful when converting encodings from or to external files
via functions such as read.table.
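A minimal sketch of that file-based use is shown below. It assumes that write.table and read.table accept a fileEncoding argument, as they do in open-source R, and it uses "UTF-16LE" so that no byte-order mark is involved; the data frame and file name are illustrative.

```r
df <- data.frame(name = c("abc", "xyz"), n = c(1L, 2L), stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".txt")

# Write the table as UTF-16LE text; the conversion happens as the file is written.
write.table(df, tmp, row.names = FALSE, quote = FALSE, fileEncoding = "UTF-16LE")

# The file on disk contains zero bytes, which a string could not hold.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
any(bytes == as.raw(0))

# Read it back, declaring the file's encoding so that it is converted on input.
back <- read.table(tmp, header = TRUE, fileEncoding = "UTF-16LE",
                   stringsAsFactors = FALSE)
unlink(tmp)
```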
If the argument toRaw is TRUE, then the converted strings are
output as a list of raw vectors (with the bytes from the converted
strings) and NULL values (representing NA strings). If the
argument x is a list of raw vectors and NULL values, it
is interpreted in the same way. Using this alternative form for
representing strings, it is possible to manipulate strings in "UTF-16"
encoding with embedded zero bytes. Note that such a list does not have
an associated encoding, like a string does.
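The round trip through the raw-list form might look like the following sketch. The input vector is illustrative, and "UTF-16LE" is used instead of "UTF-16" so that no byte-order mark is added to the output.

```r
x <- c("abc", NA)

# Convert to UTF-16LE, keeping the result as raw bytes instead of strings.
r <- iconv(x, from = "latin1", to = "UTF-16LE", toRaw = TRUE)
r[[1]]           # raw bytes with a zero byte after each ASCII character
is.null(r[[2]])  # the NA string is represented by NULL

# Pass the list back in as x to convert the bytes to an ordinary string again.
back <- iconv(r, from = "UTF-16LE", to = "UTF-8")
back
```

Since the list carries no encoding of its own, the from argument of the second call must name the encoding that produced the bytes.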
Value
iconv | returns a character vector with the same
attributes as x (dim, and so on), where all elements have
been converted from the from string encoding to the to
string encoding. If toRaw is TRUE, it returns a list
of raw vectors and NULL values. |
iconvlist | returns a character vector listing all of the
encoding names accepted by iconv. |
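To illustrate that attributes are carried over, the sketch below converts a small character matrix and inspects the result. The matrix contents and dimension names are illustrative.

```r
m <- matrix(c("a", "b", "c", "d"), nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# Convert every element; the result keeps the shape and names of the input.
out <- iconv(m, from = "latin1", to = "UTF-8")
dim(out)
dimnames(out)
```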
Differences between TIBCO Enterprise Runtime for R and Open-source R
- In open-source R, iconv does not accept the string "native.enc" to
specify the native encoding, although other functions do.
- Like TIBCO Enterprise Runtime for R, open-source R ignores upper/lower case differences in encoding names,
but open-source R does not ignore characters other than letters and digits. Thus,
open-source R accepts "latin1" but not "latin-1".
- In open-source R, the to argument must exactly match the strings
"latin1" or "UTF-8" for these string encodings
to be marked when mark=TRUE. In TIBCO Enterprise Runtime for R, we also detect aliases
of these encodings.
- In open-source R, if an unconvertible character is found and sub is not
NA, the string sub is substituted once for every byte
of the original character. TIBCO Enterprise Runtime for R substitutes it only once
for the entire unconvertible character.
- In TIBCO Enterprise Runtime for R, the sub string can contain at most 10
characters; a longer string generates an error.
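The substitution difference listed above can be observed with a small check like the one below. This is a sketch only: it counts how many copies of sub replace a single character that occupies two bytes in UTF-8, and the variable names are illustrative.

```r
x <- "a\u00C4b"   # the middle character is two bytes in UTF-8

# Replace the unconvertible character with a one-character sub string.
res <- iconv(x, from = "UTF-8", to = "ASCII", sub = "?")

# Subtract the two ASCII characters to count the substitutions made.
n_sub <- nchar(res) - 2L
n_sub  # 1 means once per character, 2 means once per byte
```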
See Also
Examples
x <- "a\u00C4b"
Encoding(x) # "UTF-8"
charToRaw(x) # prints [1] 61 c3 84 62
y <- iconv(x, from="UTF-8", to="latin1")
charToRaw(y) # prints [1] 61 c4 62
z <- iconv(x, from="UTF-8", to="CP437")
charToRaw(z) # prints [1] 61 8e 62
a <- iconv(x, from="latin1", to="UTF-8")
charToRaw(a) # prints [1] 61 c3 83 c2 84 62
# even though x is UTF-8, iconv interprets each byte as latin1
# c3 -> c3 83
# 84 -> c2 84
iconv(x, from="UTF-8", to="ASCII")
# NA - since 2nd character can't be converted to ASCII
iconv(x, from="UTF-8", to="ASCII", sub="<?>")
# "a<?>b" - sub substituted for unconvertible character
iconv(x, from="UTF-8", to="ASCII//TRANSLIT")
# "aAb" - 2nd char (A-with-2-dots) transliterated to "A"
iconv(x, from="UTF-8", to="ASCII", sub="Unicode")
# "a<U+00C4>b" - 2nd char converted to string with Unicode codepoint
Encoding(iconv(x, from="UTF-8", to="latin1"))
# "latin1"
Encoding(iconv(x, from="UTF-8", to="latin1", mark=FALSE))
# "unknown"
iconv("abc", from="latin1", to="UTF-16")
# gives error: embedded nul in string
iconv("abc", from="latin1", to="UTF-16", toRaw=TRUE)
# [[1]]
# [1] fe ff 00 61 00 62 00 63