Match Patterns in Strings

pattern	a character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments ignore.case, perl, fixed and useBytes.
text	a vector of character strings in which to search.
ignore.case	a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE.
perl	a logical value. If TRUE, the pattern is interpreted as a Perl-compatible regular expression. If FALSE (the default), the pattern is interpreted as a POSIX extended regular expression.
fixed	a logical value. If TRUE, the pattern is not treated as a regular expression; rather, it is treated as a literal sequence of characters. If both fixed=TRUE and ignore.case=TRUE, the value of ignore.case is ignored. If both fixed=TRUE and perl=TRUE, the value of perl is ignored and reset to FALSE, and the pattern is treated as a fixed sequence of characters.
useBytes	a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes. If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding), then useBytes is set to TRUE and a warning is generated.
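
As a rough illustration of how ignore.case and fixed change the interpretation of the pattern, the following sketch uses grepl, which is assumed here to accept the same arguments as described above. The input strings are made up for the example.

```r
# Illustrative input; these strings are not from the original text.
x <- c("Apple pie", "apple tart", "APPLE", "grape")

# Default: case-sensitive regular expression match.
case_sensitive <- grepl("apple", x)

# ignore.case=TRUE treats uppercase and lowercase as equivalent.
case_insensitive <- grepl("apple", x, ignore.case = TRUE)

# As a regular expression, '.' matches any single character.
y <- c("a.b", "axb")
as_regex <- grepl("a.b", y)

# With fixed=TRUE, '.' is a literal period.
as_literal <- grepl("a.b", y, fixed = TRUE)

print(case_sensitive)    # FALSE  TRUE FALSE FALSE
print(case_insensitive)  #  TRUE  TRUE  TRUE FALSE
print(as_regex)          #  TRUE  TRUE
print(as_literal)        #  TRUE FALSE
```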

Details

Back References

Any atom enclosed in parentheses 'remembers' the characters that it matched, and these characters can be matched again using a 'back reference', an escape sequence of the form '\1', '\2', etc. The digit specifies which parenthesized group in the pattern is meant, counting from the beginning. Thus, the pattern '(a+)b\1' will match the entire string 'aaabaaa', since '(a+)' matches three 'a' characters before the 'b', and '\1' matches the same three 'a' characters after the 'b'.
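
The back reference example above can be checked with a short sketch. Note that the backslash must be doubled inside an R string literal. The call uses perl=TRUE only so that the result is predictable across implementations; the second input string is an added illustration.

```r
# '(a+)' captures three 'a' characters; '\\1' must match the same text again.
m <- regexpr("(a+)b\\1", "aaabaaa", perl = TRUE)
start <- as.integer(m)             # position where the match begins
len <- attr(m, "match.length")     # number of characters matched

# No 'a' follows the 'b', so the back reference has nothing to match.
no_match <- grepl("(a+)b\\1", "aaab", perl = TRUE)

print(start)     # 1
print(len)       # 7
print(no_match)  # FALSE
```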

Collating Elements

A bracket expression may contain a collating element of the form '[.x.]', where 'x' may be a single character. This can be used to add characters to a bracket expression that are not normally permitted, such as '[[.^.]abc]' to match any of the characters 'a', 'b', 'c', or '^'.

A collating element may also contain a symbolic name for a character, such as '[.newline.]' or '[.ESC.]'.

Equivalence Classes

A bracket expression may contain an equivalence class of the form '[=x=]', where 'x' is a character or symbolic name for a character, like a collating element. This form matches any characters that match the specified character, ignoring case and accents. Thus, '[[=a=]]' will match 'a' or 'A' or either of these with accents, tildes, etc.

This will not work the same under different locales: in a locale where 'a' is considered to be a distinct letter from 'a-with-two-dots', '[[=a=]]' will not match the second one.

Escape Sequences Matching a Single Character

The following escape sequences match a single character:

'\a'	(bell)
'\e'	(escape)
'\f'	(form feed)
'\n'	(line feed)
'\r'	(carriage return)
'\t'	(tab)
'\v'	(vertical tab)

The following escape sequences match a specified character with a given code point:

'\cX'	(control-X, where X is any character; specifies the corresponding control character with code point 1-31)
'\xdd'	(matches the character with hexadecimal code point 0xdd)
'\x{dddd}'	(matches the character with hexadecimal code point 0xdddd)
'\0ddd'	(matches the character with octal code point 0ddd)
'\N{name}'	(matches the character with the specified symbolic name, such as 'newline')
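
A minimal sketch of two of these escapes follows. It uses only '\t' and '\xdd', and passes perl=TRUE so that the pattern escapes are handled by the matcher rather than relying on implementation-specific behavior. The sample strings are illustrative.

```r
# The text contains a real tab character (R converts "\t" in the literal).
txt <- "name\tvalue"

# The pattern "\\t" reaches the matcher as the two characters '\t'.
tab_pos <- as.integer(regexpr("\\t", txt, perl = TRUE))

# 0x41 is the code point of 'A'.
hex_pos <- as.integer(regexpr("\\x41", "xAy", perl = TRUE))

print(tab_pos)  # 5
print(hex_pos)  # 2
```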

Escape Sequences Matching a Character Class

The following escape sequences are abbreviations for certain character classes:

'\d'	(equivalent to [[:digit:]])
'\l'	(equivalent to [[:lower:]])
'\s'	(equivalent to [[:space:]])
'\u'	(equivalent to [[:upper:]])
'\w'	(equivalent to [[:word:]])
'\D'	(equivalent to [^[:digit:]], thus everything but a digit)
'\L'	(equivalent to [^[:lower:]])
'\S'	(equivalent to [^[:space:]])
'\U'	(equivalent to [^[:upper:]])
'\W'	(equivalent to [^[:word:]])
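
The abbreviations can be tried with gsub, which is assumed here to take the same pattern arguments as described above. This sketch uses only '\d', '\s' and '\W'; it avoids '\l', '\u', '\L' and '\U', which are not accepted everywhere. The input strings are illustrative.

```r
# Replace every digit with '#'.
digits_masked <- gsub("\\d", "#", "a1b22", perl = TRUE)

# Collapse each run of whitespace into a single space.
spaces_collapsed <- gsub("\\s+", " ", "a  b\t c", perl = TRUE)

# Drop everything that is not alphabetic, numeric, or an underscore.
word_chars_only <- gsub("\\W", "", "x_1, y-2!", perl = TRUE)

print(digits_masked)     # "a#b##"
print(spaces_collapsed)  # "a b c"
print(word_chars_only)   # "x_1y2"
```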

The following escape sequences specify a character class by name:

'\pX'	(matches the character class with single-character name X, so '\pd' is the same as '[[:d:]]')
'\p{XXX}'	(matches the character class with name XXX, so '\p{digit}' is the same as '[[:digit:]]')
'\PX'	(matches all characters except those in the character class with single-character name X, so '\Pd' is the same as '[^[:d:]]')
'\P{XXX}'	(matches all characters except those in the character class with name XXX, so '\P{digit}' is the same as '[^[:digit:]]')

Word Boundaries

The start and end of a word are matched by the special patterns '[[:<:]]' and '[[:>:]]', respectively, where a 'word' is a sequence of 1 or more alphanumerics and underscores. The start and end of a word can also be matched by the escape sequences '\<' and '\>'.

'\b' matches any word boundary (either the start or the end), and '\B' matches anywhere except at a word boundary.
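
The following sketch shows '\b' and '\B' in use. It passes perl=TRUE, and the sample text is made up for the example.

```r
# '\\bcat\\b' matches 'cat' only as a whole word, so 'concat' is untouched.
whole_words <- gsub("\\bcat\\b", "dog", "cat concat cat", perl = TRUE)

# '\\Bcat' matches 'cat' only where it does NOT start at a word boundary,
# which here is the 'cat' inside 'concat'.
inner_pos <- as.integer(regexpr("\\Bcat", "cat concat", perl = TRUE))

print(whole_words)  # "dog concat dog"
print(inner_pos)    # 8
```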

Buffer Boundaries

The following escape sequences match the beginning or end of the entire string.

'\`' or '\A'	matches the start of the buffer.
'\'' or '\z'	matches the end of the buffer.
'\Z'	matches zero or more newline characters at the end of the buffer.
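
A short sketch of '\A' and '\z' follows. It uses perl=TRUE because those two forms are the ones most widely accepted in that mode; the input string is illustrative.

```r
# '\\A' anchors the match at the start of the whole string.
starts_with_ab <- grepl("\\Aab", "abc", perl = TRUE)
starts_with_bc <- grepl("\\Abc", "abc", perl = TRUE)

# '\\z' anchors the match at the very end of the string.
ends_with_bc <- grepl("bc\\z", "abc", perl = TRUE)
ends_with_ab <- grepl("ab\\z", "abc", perl = TRUE)

print(c(starts_with_ab, starts_with_bc))  # TRUE FALSE
print(c(ends_with_bc, ends_with_ab))      # TRUE FALSE
```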

Quoting Escape

The escape sequence '\Q' specifies the beginning of a sequence of characters to be 'quoted'. The characters following it until the end of the string, or the escape sequence '\E', are taken literally. Thus, the pattern '\Q[^a]\E' will match the substring '[^a]' in the string 'abc[^a]def'.
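
The quoting example above can be sketched as follows, again with doubled backslashes in the R string literal and with perl=TRUE. The comparison with fixed=TRUE is an added illustration: for a pattern that is entirely literal, both approaches are expected to find the same position.

```r
txt <- "abc[^a]def"

# Between '\\Q' and '\\E' the characters '[', '^' and ']' lose their special meaning.
m <- regexpr("\\Q[^a]\\E", txt, perl = TRUE)
quoted_pos <- as.integer(m)
quoted_len <- attr(m, "match.length")
quoted_text <- regmatches(txt, m)

# fixed=TRUE treats the whole pattern literally.
fixed_pos <- as.integer(regexpr("[^a]", txt, fixed = TRUE))

print(quoted_pos)   # 4
print(quoted_len)   # 4
print(quoted_text)  # "[^a]"
print(fixed_pos)    # 4
```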

Perl-Compatible Regular Expressions

The Perl language supports an extended version of regular expressions, accepting many forms in addition to the ones described above. For more details on Perl-compatible regular expressions, see http://perldoc.perl.org/perlre.html.

Unmatched Right Parentheses

When fixed=FALSE, parentheses are considered to be part of the pattern language and must be preceded by a (doubled) backslash to be taken literally.

Unmatched (and unescaped) parentheses usually result in an error.

An exception: If perl=FALSE and fixed=FALSE, an unmatched right parenthesis will be matched literally. Thus, regexpr('b)', 'ab)c', perl=FALSE) will match. regexpr('b)', 'ab)c', perl=TRUE) generates an error.
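
The following sketch illustrates escaping parentheses and the perl=TRUE error case described above. The helper variable names and the sample string 'f(x)' are illustrative; tryCatch and suppressWarnings are used only to capture the error without stopping the script.

```r
# Escaped parentheses are matched literally (note the doubled backslashes).
escaped <- grepl("\\(x\\)", "f(x)")

# With fixed=TRUE, parentheses are not part of the pattern language at all.
literal <- grepl("(x)", "f(x)", fixed = TRUE)

# An unmatched right parenthesis with perl=TRUE is expected to raise an error.
outcome <- tryCatch(
  {
    suppressWarnings(regexpr("b)", "ab)c", perl = TRUE))
    "matched without error"
  },
  error = function(e) "error"
)

print(escaped)  # TRUE
print(literal)  # TRUE
print(outcome)  # "error"
```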

NOTE Over time, we may modify TIBCO Enterprise Runtime for R so that it is closer to open-source R's behavior, and remove items from the list of differences below. In many cases, the TIBCO Enterprise Runtime for R matcher handles POSIX-standard forms that open-source R does not handle. We might want to detect these cases and generate an error.

In open-source R, the algorithms for POSIX and Perl-style regular expressions are different from those in TIBCO Enterprise Runtime for R. When perl=FALSE, the TRE library (http://laurikari.net/tre/) is used for POSIX regular expression matching. When perl=TRUE, the PCRE library (http://www.pcre.org) is used for Perl-compatible regular expression matching.
In open-source R, '.' does not match '\n' when perl=TRUE, but it does in TIBCO Enterprise Runtime for R. Thus, regexpr('b.', 'ab\nc', perl=TRUE) does not match in open-source R, but does in TIBCO Enterprise Runtime for R.
In TIBCO Enterprise Runtime for R, the '[:print:]' character class matches several characters such as '\n' that open-source R does not. Thus, regexpr('[[:print:]]', '\b\n\b') does not match in open-source R, but does in TIBCO Enterprise Runtime for R.
Open-source R does not recognize the character classes '[:d:]', '[:l:]', '[:s:]', '[:unicode:]', '[:u:]', '[:word:]' or '[:w:]', and will generate an error if they are given in a pattern. Thus, regexpr('[[:u:]]', 'aBc') matches in TIBCO Enterprise Runtime for R, but gives an error in open-source R.
Open-source R accepts a repeat qualifier of the form '{,max}', where the minimum value is taken to be zero. TIBCO Enterprise Runtime for R generates an error if this form is used. Open-source R's handling of this form is inconsistent. If perl=TRUE, the form is accepted, but the expression never seems to match. If perl=FALSE, the form is accepted but interpreted incorrectly: 'ab{,2}' acts like 'ab{0,3}' rather than 'ab{0,2}'. Thus, regexpr('ab{,2}', 'abc', perl=TRUE) does not match in open-source R, and regexpr('ab{,2}', 'abbbc', perl=FALSE) matches the first four characters in the string. Either call gives an error in TIBCO Enterprise Runtime for R.
In open-source R with perl=TRUE, the character class names ('[:alnum:]', '[:alpha:]', etc.) do not match Unicode characters beyond the latin1 character set. Thus, in open-source R regexpr('[[:alpha:]]', '1\u30A42', perl=TRUE) does not match, but regexpr('[[:alpha:]]', '1\u30A42', perl=FALSE) does. In TIBCO Enterprise Runtime for R, both of these will match.
In general, there are many small differences between open-source R and TIBCO Enterprise Runtime for R in exactly how the character class names ('[:alnum:]', '[:alpha:]', etc.) classify particular Unicode characters beyond the latin1 character set.
There are also differences between open-source R on Linux and Windows on how the character class names ('[:alnum:]', '[:alpha:]', etc.) are interpreted. When comparing TIBCO Enterprise Runtime for R to open-source R, we usually compare it to open-source R running on Linux.
Open-source R with perl=FALSE accepts and ignores the construct '(*)', whereas TIBCO Enterprise Runtime for R generates an error. Thus, regexpr('a(*)', 'abc', perl=FALSE) matches in open-source R, and produces an error in TIBCO Enterprise Runtime for R. regexpr('a(*)', 'abc', perl=TRUE) produces an error in both open-source R and TIBCO Enterprise Runtime for R.
Open-source R does not recognize collating elements. Therefore, regexpr('[[.^.]]', 'xxx^yyy') matches in TIBCO Enterprise Runtime for R, and generates an error in open-source R.
Open-source R does not recognize equivalence classes. Therefore, regexpr('[[=a=]]', 'xxx\u00C4yyy') matches in TIBCO Enterprise Runtime for R, and generates an error in open-source R.
Open-source R does not recognize the following escape sequences: '\v', '\cX', '\0ddd', '\N{name}'.
Open-source R does not recognize the following abbreviations for character classes: '\l', '\u', '\L', '\U'. Open-source R also does not recognize the forms '\pX', '\p{XXX}', '\PX', '\P{XXX}'.
Open-source R does not accept the patterns '[[:<:]]' and '[[:>:]]' to match the start and end of a word. Thus, regexpr('[[:<:]]a', 'bab aba') gives an error in open-source R, but matches in TIBCO Enterprise Runtime for R.
Open-source R does not accept '\<' and '\>' to match the start and end of a word when perl=TRUE. Thus, in open-source R regexpr('\\<a', 'bab aba', perl=FALSE) matches, but regexpr('\\<a', 'bab aba', perl=TRUE) does not match. Both of these produce a match in TIBCO Enterprise Runtime for R.
Open-source R does not accept '\`' or '\'' to match the beginning or end of the buffer.
Open-source R only accepts '\A' or '\z' when perl=TRUE.
Open-source R only accepts '\Z' when perl=TRUE, but interprets it the same as '\z'.

Character Classes

The following character class names can be used within a bracket expression:

'[:alnum:]'	(alphabetic or numeric digit)
'[:alpha:]'	(alphabetic)
'[:blank:]'	(any whitespace except for line separators)
'[:cntrl:]'	(control characters)
'[:digit:]' or '[:d:]'	(numeric digit)
'[:graph:]'	(graphical)
'[:lower:]' or '[:l:]'	(lower-case alphabetic)
'[:print:]'	(printable)
'[:punct:]'	(punctuation)
'[:space:]' or '[:s:]'	(any whitespace)
'[:unicode:]'	(any Unicode character above 0xFF)
'[:upper:]' or '[:u:]'	(upper-case alphabetic)
'[:word:]' or '[:w:]'	(any alphabetic or numeric digit, or underscore)
'[:xdigit:]'	(hexadecimal digit)
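
As a brief sketch, the long-form class names can be used inside a bracket expression as shown below. Only the long names are used here, since the short forms such as '[:d:]' are not accepted everywhere. gsub and grepl are assumed to take the same pattern arguments, and the inputs are illustrative.

```r
# Collapse each run of whitespace (spaces, tabs, newlines) into one space.
collapsed <- gsub("[[:space:]]+", " ", "a \t b\n c")

# Test whether each string consists only of hexadecimal digits.
is_hex <- grepl("^[[:xdigit:]]+$", c("1aF", "xyz"))

# Remove punctuation characters.
no_punct <- gsub("[[:punct:]]", "", "a,b.c!")

print(collapsed)  # "a b c"
print(is_hex)     # TRUE FALSE
print(no_punct)   # "abc"
```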
