iconv
Convert Character Vector between Encodings

Description

Converts the strings in a character vector from one string encoding to another.

Usage

iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
iconvlist()

Arguments

x a character vector. It can also be a list of raw vectors and NULL values (see the toRaw argument).
from a string specifying the string encoding used for reading the elements of x. If this is the empty string "" or the string "native.enc", the native encoding is used (see getNativeEncoding).
to a string specifying the string encoding that the elements of x should be converted to.
  • If to = "" (an empty string -- the default), or if to = "native.enc", then the native encoding is used (see getNativeEncoding).
  • If to = "ASCII//TRANSLIT", then iconv attempts to transliterate non-ASCII characters to the appropriate ASCII sequences.
  • If to = "ASCII//IGNORE", then iconv drops any non-ASCII characters.
  • If to = "ASCII//TRANSLIT//IGNORE", then iconv tries transliterating non-ASCII characters and drops any that cannot be converted.
sub a string or NA value controlling what happens if some of the characters cannot be converted into the "to" encoding.
  • If sub is NA, then any string containing characters that cannot be converted is converted to NA.
  • If sub is a string, then this string is substituted for any character that cannot be converted.
  • If sub is the string "Unicode", then any character that cannot be converted is turned into a string specifying its Unicode code point, such as "<U+30A4>".
mark a logical value.
  • If mark is TRUE (the default), then the converted strings are marked with encoding "latin1" or "UTF-8" if the to encoding is one of these encodings (or an alias for these encodings, as determined by getEncodingAliases).
  • If mark is FALSE, or if the to encoding is not one of these encodings, the output strings have encoding "unknown".
toRaw a logical value.
  • If toRaw is TRUE, then the converted strings are output as a list containing raw vectors (with the bytes from the converted strings) and NULL values (representing NA strings), and the mark argument is ignored.
  • If toRaw is FALSE, then the converted strings are output as a vector of strings.
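
As a minimal sketch of how the default sub = NA can be put to use, the following flags the elements of a vector that cannot be represented in ASCII. The variable names (x, converted, has_non_ascii) are illustrative only and do not come from this help page.

```r
# Illustrative input: one pure-ASCII string and one containing U+00C4.
x <- c("abc", "a\u00C4b")

# With sub = NA (the default), any element containing a character that
# cannot be converted to the target encoding becomes NA.
converted <- iconv(x, from = "UTF-8", to = "ASCII")

# TRUE for each element that contains at least one non-ASCII character.
has_non_ascii <- is.na(converted)
has_non_ascii
```

Supplying a string for sub instead keeps every element and replaces only the characters that cannot be converted.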

Details

iconv converts a character vector from one string encoding to another. It does not use the encoding information associated with the elements of x, as set by Encoding. Instead, it interprets the raw bytes of the elements of x according to the from encoding, and converts these to the to encoding.
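The point about ignoring the encoding mark can be illustrated with a deliberately mislabeled string. This is only a sketch: the single byte 0xE9 and the variable names are chosen for illustration.

```r
# Build a one-byte string holding 0xE9, which is e-with-acute in latin1.
x <- rawToChar(as.raw(0xE9))

# Deliberately mislabel the string; iconv does not consult this mark.
Encoding(x) <- "UTF-8"

# The byte is interpreted as latin1 because of the from argument,
# then re-encoded as UTF-8 (two bytes for this character).
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)
```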
Note: iconv ignores upper and lower case differences in encoding names, as well as all characters other than letters and digits. Thus, even though the value returned by iconvlist includes "LATIN1", you could specify the same encoding with "latin1" or "Latin1" or "latin-1". The function isValidEncoding returns TRUE for any encoding string that is accepted by iconv.
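A short sketch of the name matching described in the note above. Only case differences are exercised in the conversion itself; the isValidEncoding call is guarded with exists() as a precaution, on the assumption that the function may not be present in every engine.

```r
x <- "a\u00C4b"

# The same conversion requested with differently cased encoding names.
y1 <- iconv(x, from = "UTF-8", to = "latin1")
y2 <- iconv(x, from = "utf-8", to = "LATIN1")

# Both calls should produce the same bytes.
identical(charToRaw(y1), charToRaw(y2))

# isValidEncoding is named in this help page; the call is guarded in
# case it is not available in the running engine.
if (exists("isValidEncoding")) {
  isValidEncoding("latin-1")
}
```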
Some conversions cause an error because the converted value has embedded zero bytes, which are not permitted in strings. For example, the string "abc" cannot be converted from "latin1" to "UTF-16" because in "latin1" each character in "abc" is represented by a single byte, but in "UTF-16" each character is represented by two bytes, one of which is zero. String encodings such as "UTF-16" are primarily useful when converting encodings from or to external files via functions such as read.table.
If the argument toRaw is TRUE, then the converted strings are output as a list of raw vectors (with the bytes from the converted strings) and NULL values (representing NA strings). If the argument x is a list of raw vectors and NULL values, it is interpreted in the same way. Using this alternative form for representing strings, it is possible to manipulate strings in "UTF-16" encoding with embedded zero bytes. Note that, unlike a string, such a list does not have an associated encoding.
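The raw-list form can be sketched as a round trip. The example assumes that "UTF-16BE" is among the names accepted by iconv in the running engine (check iconvlist() if in doubt); it is used here instead of "UTF-16" so that the byte order is fixed and no byte-order mark is involved.

```r
x <- c("abc", NA)

# Convert to big-endian UTF-16 as raw vectors. "UTF-16BE" is an assumed
# encoding name; it is not taken from this help page.
r <- iconv(x, from = "latin1", to = "UTF-16BE", toRaw = TRUE)

r[[1]]           # bytes of "abc", two per character
is.null(r[[2]])  # the NA element is represented by NULL

# Pass the list back in as x to recover ordinary strings.
back <- iconv(r, from = "UTF-16BE", to = "latin1")
back
```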

Value

iconv returns a character vector with the same attributes as x (dim, and so on), where all elements have been converted from the from string encoding to the to string encoding. If toRaw is TRUE, it returns a list of raw vectors and NULL values.
iconvlist returns a character vector listing all of the encoding names accepted by iconv.
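
A brief sketch of both return values, using a small character matrix to show that attributes such as dim carry over. The object names are illustrative.

```r
# A 2 x 2 character matrix; the dim attribute should survive conversion.
m <- matrix(c("a", "b", "c", "d"), nrow = 2)
converted <- iconv(m, from = "latin1", to = "UTF-8")
dim(converted)

# iconvlist() returns the accepted encoding names as a character vector.
# The exact contents depend on the engine and platform.
encodings <- iconvlist()
head(encodings)
```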

Differences between TIBCO Enterprise Runtime for R and Open-source R

See Also

Encoding, getNativeEncoding, getEncodingAliases, isValidEncoding, read.table.

Examples

x <- "a\u00C4b"
Encoding(x) # "UTF-8"
charToRaw(x) # prints [1] 61 c3 84 62
y <- iconv(x, from="UTF-8", to="latin1")
charToRaw(y) # prints [1] 61 c4 62
z <- iconv(x, from="UTF-8", to="CP437")
charToRaw(z) # prints [1] 61 8e 62

a <- iconv(x, from="latin1", to="UTF-8")
charToRaw(a) # prints [1] 61 c3 83 c2 84 62
# even though x is UTF-8, iconv interprets each byte as latin1
# c3 -> c3 83
# 84 -> c2 84

iconv(x, from="UTF-8", to="ASCII")
# NA - since 2nd character can't be converted to ASCII
iconv(x, from="UTF-8", to="ASCII", sub="<?>")
# "a<?>b" - sub substituted for unconvertible character
iconv(x, from="UTF-8", to="ASCII//TRANSLIT")
# "aAb" - 2nd char (A-with-2-dots) transliterated to "A"
iconv(x, from="UTF-8", to="ASCII", sub="Unicode")
# "a<U+00C4>b" - 2nd char converted to string with Unicode code point

Encoding(iconv(x, from="UTF-8", to="latin1")) # "latin1"
Encoding(iconv(x, from="UTF-8", to="latin1", mark=FALSE)) # "unknown"

iconv("abc", from="latin1", to="UTF-16")
# gives error: embedded nul in string
iconv("abc", from="latin1", to="UTF-16", toRaw=TRUE)
# [[1]]
# [1] fe ff 00 61 00 62 00 63

Package base version 6.0.0-69