validUTF8
Check if a Character Vector is Validly Encoded
Description
Verifies that a character vector is composed of validly encoded bytes.
Usage
validUTF8(x)
validEnc(x)
Arguments
x | a character vector. |
Details
Not all byte sequences are valid UTF-8 byte sequences. For example,
a single byte greater than 0x7F can never form a UTF-8 byte sequence
on its own, because in UTF-8 such bytes occur only within multi-byte
characters. In TIBCO Enterprise Runtime for R, it is possible to
construct strings that are marked with the "UTF-8" encoding but whose
bytes are not a valid UTF-8 byte sequence.
- validUTF8 tests whether the elements of a string vector have
valid UTF-8 byte sequences.
- validEnc tests whether the elements of a string vector are
valid according to their declared encoding. Any string with encoding
"latin1" or "bytes" is valid, because these encodings allow any byte
sequence. Strings with encoding "unknown" or "UTF-8" are valid only
if they contain a valid UTF-8 byte sequence.
Value
validUTF8 | returns a logical vector of the same length as the input,
with TRUE for each string whose bytes form a valid UTF-8
byte sequence. |
validEnc | returns a logical vector of the same length as the input,
with TRUE for each string whose bytes are valid
according to its declared encoding. |
See Also
Examples
x <- c("aa", "aa\u30A4", "\xFF")
Encoding(x) <- "UTF-8"
validUTF8(x) ## [1] TRUE TRUE FALSE
validEnc(x) ## [1] TRUE TRUE FALSE
Encoding(x) <- "bytes"
validUTF8(x) ## [1] TRUE TRUE FALSE
validEnc(x) ## [1] TRUE TRUE TRUE
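
## A further sketch, not part of the original examples: build strings from
## raw bytes to show a truncated multi-byte sequence. The names good, bad,
## and y are illustrative only. 0xC3 opens a two-byte UTF-8 sequence, so it
## is valid only when a continuation byte (0x80 to 0xBF) follows it.
good <- rawToChar(as.raw(c(0xC3, 0xA9)))  # complete two-byte sequence (U+00E9)
bad <- rawToChar(as.raw(0xC3))            # lead byte with no continuation byte
y <- c(good, bad)
Encoding(y) <- "UTF-8"
validUTF8(y) ## [1] TRUE FALSE
validEnc(y) ## [1] TRUE FALSE
## Per the Details section, any byte sequence is valid as "latin1".
Encoding(y) <- "latin1"
validEnc(y) ## [1] TRUE TRUE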