Check if a Character Vector is Validly Encoded

validUTF8

Description

Tests whether each element of a character vector consists of validly encoded bytes.

Usage

validUTF8(x)
validEnc(x)

Arguments

x	a character vector.

Details

Not every byte sequence is a valid UTF-8 byte sequence. For example, a single byte greater than 0x7F cannot form a valid UTF-8 byte sequence on its own, because UTF-8 reserves bytes above 0x7F for use within multi-byte characters. In TIBCO Enterprise Runtime for R, it is possible to construct strings whose declared encoding is "UTF-8" but whose bytes are not a valid UTF-8 byte sequence.
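As a minimal sketch of the point above, the following builds one string holding a lone 0xE9 byte and one holding the two-byte sequence 0xC3 0xA9 (the UTF-8 encoding of U+00E9), then checks both. It assumes that string literals accept \x byte escapes, as they do in open-source R; the variable names are illustrative only.

```r
# A lone byte above 0x7F cannot stand on its own in UTF-8.
lone_byte <- "\xe9"
# 0xC3 0xA9 is the two-byte UTF-8 encoding of U+00E9.
two_bytes <- "\xc3\xa9"

# Inspect the raw bytes held by each string.
charToRaw(lone_byte)   # e9
charToRaw(two_bytes)   # c3 a9

# Only the second string is a valid UTF-8 byte sequence.
validUTF8(c(lone_byte, two_bytes))   # FALSE TRUE
```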

validUTF8 tests whether each element of a character vector is a valid UTF-8 byte sequence, regardless of its declared encoding.
validEnc tests whether each element of a character vector is valid according to its declared encoding. Any string with encoding "latin1" or "bytes" is valid, because these encodings allow any byte sequence. A string with encoding "unknown" or "UTF-8" is valid only if it contains a valid UTF-8 byte sequence.
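The difference between the two functions can be sketched by giving the same bytes two different declared encodings. This example assumes the base function Encoding<- is used to set the declared encoding and that \x byte escapes are accepted in string literals; the variable names are hypothetical.

```r
# The same four bytes (c, a, f, 0xE9), declared under two encodings.
as_latin1 <- "caf\xe9"
Encoding(as_latin1) <- "latin1"
as_utf8 <- "caf\xe9"
Encoding(as_utf8) <- "UTF-8"

# validEnc judges each string against its declared encoding:
# "latin1" accepts any bytes, "UTF-8" does not accept a lone 0xE9.
validEnc(c(as_latin1, as_utf8))    # TRUE FALSE

# validUTF8 looks only at the bytes, so both strings fail.
validUTF8(c(as_latin1, as_utf8))   # FALSE FALSE
```

A string that validUTF8 rejects can therefore still pass validEnc when its declared encoding is "latin1" or "bytes".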

Value

validUTF8	returns a logical vector of the same length as x, with TRUE for each string whose bytes form a valid UTF-8 byte sequence.
validEnc	returns a logical vector of the same length as x, with TRUE for each string whose bytes are valid according to its declared encoding.