Match Patterns in Strings

pattern	a character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments ignore.case, perl, fixed and useBytes.
text	a vector of character strings in which to search.
ignore.case	a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE.
perl	a logical value. If FALSE (the default), the pattern is interpreted as a POSIX extended regular expression (handled by the TRE library, http://laurikari.net/tre/). If TRUE, the pattern is interpreted as a Perl-compatible regular expression (handled by the PCRE library, http://www.pcre.org).
fixed	a logical value. If TRUE, the pattern is not treated as a regular expression; rather, it is treated as a literal sequence of characters. If both fixed=TRUE and ignore.case=TRUE, the value of ignore.case is ignored. If both fixed=TRUE and perl=TRUE, the value of perl is ignored and reset to FALSE, and the pattern is treated as a fixed sequence of characters.
useBytes	a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes. If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding), then useBytes is set to TRUE.
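
As a minimal sketch of how these arguments interact, the same call can be repeated with different flags. The input strings below are made up for illustration.

```r
# Two hypothetical input strings.
text <- c("Price: 3.50", "price: 3x50")

# Default: the pattern is a regular expression, so '.' matches any character.
pos_regex <- regexpr("3.5", text)

# fixed = TRUE: '.' is a literal period, so only the first string matches.
pos_fixed <- regexpr("3.5", text, fixed = TRUE)

# ignore.case = TRUE: "PRICE" matches both "Price" and "price".
pos_nocase <- regexpr("PRICE", text, ignore.case = TRUE)

as.integer(pos_regex)   # 8  8
as.integer(pos_fixed)   # 8 -1
as.integer(pos_nocase)  # 1  1
```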

Details

If fixed=FALSE, the pattern argument specifies a regular expression. Certain punctuation characters are interpreted specially, as described below. Other characters in the pattern match the same character in the text. Case is significant unless ignore.case is TRUE.

The following sections describe the POSIX standard extended regular expressions. The definition is recursive.

atom

Can be a single character other than one of the special characters '.[{()\*+?|^$', in which case it matches itself.
Can be a period ('.'), which matches any character.
Can be '^' or '$', which match the start or end (respectively) of the entire string.
Can be an escape sequence consisting of a backslash ('\'), followed by one or more characters. The supported escape sequences are described below.
As an example, the escape sequence '\n' matches a linefeed character. An undefined escape sequence just matches the character following the backslash, so '\\' matches the backslash character itself. Note that when typing a string containing a backslash, it must be doubled. Therefore, one would include the escape sequence '\n' within a string by typing "aaa\\nbbb".
Can also be a bracket expression (see below) or a (possibly empty) regular expression enclosed in parentheses, in which case it matches what the bracket expression or regular expression matches.
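
The atom forms above can be tried out as follows. This is an illustrative sketch with made-up strings; note that each backslash is doubled inside the quoted pattern.

```r
x <- "a.c abc"

# '.' matches any character, so both "a.c" and "abc" are found.
dot_any <- gregexpr("a.c", x)[[1]]

# An escaped period matches only a literal '.'.
dot_lit <- gregexpr("a\\.c", x)[[1]]

# '^' and '$' anchor the match to the start or end of the string.
at_start <- regexpr("^abc", c("abcdef", "xabc"))
at_end   <- regexpr("abc$", c("abcdef", "xabc"))

as.integer(dot_any)   # 1 5
as.integer(dot_lit)   # 1
as.integer(at_start)  # 1 -1
as.integer(at_end)    # -1 2
```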

bracket expression

A list of characters or character ranges (two characters separated by a hyphen, '-') enclosed in square brackets. It matches any character in the list. If the list starts with a circumflex, '^', then it matches any character not in the remainder of the list. To include a '-' in a bracket list, make it the last entry.

Can contain a 'character class' of the form '[:name:]' where name specifies the set of characters that match it. The name can be one of the following:

'[:alnum:]'	(alphabetic or numeric digit)
'[:alpha:]'	(alphabetic)
'[:blank:]'	(any whitespace except for line separators)
'[:cntrl:]'	(control characters)
'[:digit:]'	(numeric digit)
'[:graph:]'	(graphical)
'[:lower:]'	(lower-case alphabetic)
'[:print:]'	(printable)
'[:punct:]'	(punctuation)
'[:space:]'	(any whitespace)
'[:upper:]'	(upper-case alphabetic)
'[:xdigit:]'	(hexadecimal digit)
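
A short sketch of bracket expressions and character classes follows. The input string is hypothetical, and regmatches (a base R helper not described on this page) is used only to extract the matched text.

```r
x <- "Order #42-B, qty 7"

# A character class must itself sit inside a bracket expression: [[:digit:]].
digits <- regmatches(x, gregexpr("[[:digit:]]+", x))[[1]]

# Upper-case letters only.
upper <- regmatches(x, gregexpr("[[:upper:]]", x))[[1]]

# A leading '^' negates the list: anything that is not alphanumeric or a space.
other <- regmatches(x, gregexpr("[^[:alnum:] ]", x))[[1]]

# A '-' placed last in the list is taken literally.
run <- regexpr("[0-9-]+", x)

digits                     # "42" "7"
upper                      # "O" "B"
other                      # "#" "-" ","
as.integer(run)            # 8
attr(run, "match.length")  # 3, the substring "42-"
```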

piece of a regular expression

An atom, possibly followed by a repeat quantifier:
- an asterisk ('*', 0 or more repeats).
- a plus sign ('+', 1 or more repeats).
- a question mark ('?', 0 or 1 repeats).
- a bound, '{min,max}' or '{count}' or '{min,}'.
  The bound {min,max} means between min and max repeats. If max is missing it is taken to be infinity. If there is no comma, then it matches exactly the given count of repeats. For example, '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.
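
The quantifiers can be compared side by side on a small set of strings. This is only an illustrative sketch; the test strings are made up.

```r
x <- c("ac", "abc", "abbc", "abbbc")

# TRUE where the pattern matches somewhere in the string.
star     <- regexpr("ab*c", x) > 0      # 0 or more 'b'
plus     <- regexpr("ab+c", x) > 0      # 1 or more 'b'
optional <- regexpr("ab?c", x) > 0      # 0 or 1 'b'
exact    <- regexpr("ab{2}c", x) > 0    # exactly 2 'b'
at_least <- regexpr("ab{2,}c", x) > 0   # 2 or more 'b'
between  <- regexpr("ab{1,2}c", x) > 0  # 1 to 2 'b'

star      # TRUE  TRUE  TRUE  TRUE
plus      # FALSE TRUE  TRUE  TRUE
optional  # TRUE  TRUE  FALSE FALSE
exact     # FALSE FALSE TRUE  FALSE
at_least  # FALSE FALSE TRUE  TRUE
between   # FALSE TRUE  TRUE  FALSE

# '+' behaves like the bound '{1,}'.
plus_as_bound <- regexpr("ab{1,}c", x) > 0
```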

branch

A branch is a sequence of pieces, concatenated. A 'regular expression' is a sequence of branches separated by vertical bars, '|'. The regular expression matches if any branch in it matches.
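
As a brief sketch of alternation with made-up inputs: a vertical bar separates whole branches, and parentheses can be used to limit its scope to part of the pattern.

```r
# Either branch may match.
animals <- regexpr("cat|dog", c("cat", "hotdog", "bird"))

# Parentheses restrict the alternation to a single character position.
spelling <- regexpr("gr(a|e)y", c("gray", "grey", "groy"))

as.integer(animals)   # 1 4 -1
as.integer(spelling)  # 1 1 -1
```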

Back References

Any atom enclosed in parentheses 'remembers' the characters that it matched, and these characters can be matched again using a 'back reference', an escape sequence of the form '\1', '\2', and so on. The digit identifies the parenthesized group, counting opening parentheses from the beginning of the pattern. Thus, the pattern '(a+)b\1' matches the entire string 'aaabaaa', because '(a+)' matches three 'a' characters before the 'b', and '\1' matches these three 'a' characters after the 'b'.
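
The back reference example above can be checked directly. In this sketch the second input string is an added, hypothetical non-matching case.

```r
# The backslash of '\1' is doubled inside the quoted pattern.
m <- regexpr("(a+)b\\1", c("aaabaaa", "aaab"))

as.integer(m)            # 1 -1
attr(m, "match.length")  # 7 -1
```

The second string fails because nothing follows the 'b' for '\1' to match.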

Escape Sequences Matching a Single Character

The following escape sequences match a single character:

'\a'	(bell)
'\e'	(escape)
'\f'	(form feed)
'\n'	(line feed)
'\r'	(carriage return)
'\t'	(tab)
'\v'	(vertical tab)

The following escape sequences match a specified character with a given code point:

'\xdd'	(matches the character with hexadecimal code point 0xdd)
'\x{dddd}'	(matches the character with hexadecimal code point 0xdddd)
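
A small sketch of these escapes in use follows. The strings are made up, and the hexadecimal example sets perl=TRUE as an assumption, simply to stay on the Perl-compatible engine where '\x41' is known to denote 'A'.

```r
# "\\n" in the pattern is the regex escape \n; "\n" in the text is a real linefeed.
nl <- regexpr("\\n", "ab\ncd")

# Likewise for a tab.
tab <- regexpr("\\t", "a\tb")

# Code point 0x41 is the letter 'A'.
hex <- regexpr("\\x41", "xxA", perl = TRUE)

as.integer(nl)   # 3
as.integer(tab)  # 2
as.integer(hex)  # 3
```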

Escape Sequences Matching a Character Class

The following escape sequences are abbreviations for certain character classes:

'\d'	(equivalent to [[:digit:]])
'\s'	(equivalent to [[:space:]])
'\w'	(equivalent to [[:alnum:]_], thus common word characters)
'\D'	(equivalent to [^[:digit:]], thus everything but a digit)
'\S'	(equivalent to [^[:space:]])
'\W'	(equivalent to [^[:alnum:]_])
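
The class escapes can be illustrated on a hypothetical string as follows; regmatches (a base R helper not described on this page) is used only to show the matched text.

```r
x <- "id_7 = 42"

digits <- regmatches(x, gregexpr("\\d+", x))[[1]]  # runs of digits
words  <- regmatches(x, gregexpr("\\w+", x))[[1]]  # runs of word characters
tokens <- regmatches(x, gregexpr("\\S+", x))[[1]]  # runs of non-whitespace

digits  # "7" "42"
words   # "id_7" "42"
tokens  # "id_7" "=" "42"

# '\d' gives the same result as the bracket form it abbreviates.
digits_class <- regmatches(x, gregexpr("[[:digit:]]+", x))[[1]]
```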

Word Boundaries

The start and end of a word can be matched by the escape sequences '\<' and '\>', where a 'word' is a sequence of 1 or more alphanumerics and underscores.

'\b' matches any word boundary (either the start or the end), and '\B' matches anywhere except at a word boundary.
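
A sketch of the word-boundary escapes, using a made-up string in which "cat" appears as a whole word, as a suffix, and as a prefix:

```r
x <- "cat concat catalog"

whole  <- gregexpr("\\bcat\\b", x)[[1]]  # "cat" as a complete word
starts <- gregexpr("\\<cat", x)[[1]]     # "cat" at the start of a word
ends   <- gregexpr("cat\\>", x)[[1]]     # "cat" at the end of a word

as.integer(whole)   # 1
as.integer(starts)  # 1 12
as.integer(ends)    # 1 8
```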

Quoting Escape

The escape sequence '\Q' specifies the beginning of a sequence of characters to be 'quoted'. The characters following it until the end of the string, or the escape sequence '\E', are taken literally. Thus, the string '\Q[^a]\E' will match the string 'abc[^a]def'.
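
The quoting example can be sketched as below. Setting perl=TRUE here is an assumption made for this illustration, since '\Q' and '\E' are standard in Perl-compatible patterns.

```r
# Between \Q and \E the characters '[', '^' and ']' lose their special meaning.
q <- regexpr("\\Q[^a]\\E", "abc[^a]def", perl = TRUE)

as.integer(q)            # 4
attr(q, "match.length")  # 4, the literal substring "[^a]"
```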

Perl Compatible Regular Expressions

The Perl language supports an extended version of regular expressions, accepting many forms in addition to the ones described above. For more details on Perl-compatible regular expressions, see http://perldoc.perl.org/perlre.html.

Unmatched Right Parentheses

When fixed=FALSE, parentheses are considered to be part of the pattern language and must be preceded by a (doubled) backslash to be taken literally.

Unmatched (and unescaped) parentheses usually result in an error.

An exception: If perl=FALSE and fixed=FALSE, an unmatched right parenthesis will be matched literally. Thus, regexpr('b)', 'ab)c', perl=FALSE) will match. regexpr('b)', 'ab)c', perl=TRUE) generates an error.

regexpr

returns a numeric vector, with one element for each element of text, giving the position in the character string of the first substring matching the regular expression. A value of minus one (-1) indicates that no match was found.

An attribute, "match.length", is a numeric vector giving the length of the longest possible matching substring starting at that position, or minus one (-1) for no match.

Note that a "match.length" value can be zero when matching a regular expression such as "^".
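
As an illustrative sketch with made-up strings, the positions and the "match.length" attribute can be combined with substring to recover the matched text.

```r
text <- c("abc123def", "no digits", "7up")
m <- regexpr("[0-9]+", text)

pos <- as.integer(m)            # 4 -1  1
len <- attr(m, "match.length")  # 3 -1  1

# Extract the matched text where a match was found.
hit <- pos > 0
found <- substring(text[hit], pos[hit], pos[hit] + len[hit] - 1)
found                           # "123" "7"

# A zero-length match: '^' matches at position 1 but consumes no characters.
z <- regexpr("^", "abc")
attr(z, "match.length")         # 0
```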

gregexpr

produces all of the matches for the regular expression in each string, rather than only the first one.

It returns a list with one entry for each element of text. Each entry has the format of the output of regexpr, a numeric vector with the starting positions of each match within the string, with an attribute, "match.length", giving the length of each match.

If there are no matches in a string, the output entry is the value minus one (-1) with a "match.length" value of minus one (-1).
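
A minimal sketch of the list structure returned by gregexpr, using two made-up strings, one with several matches and one with none:

```r
g <- gregexpr("[0-9]+", c("a1b22c333", "none"))

length(g)                     # 2, one entry per element of text

as.integer(g[[1]])            # 2 4 7
attr(g[[1]], "match.length")  # 1 2 3

as.integer(g[[2]])            # -1
attr(g[[2]], "match.length")  # -1
```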

When you use a Perl regular expression containing parenthesized "capture groups", either unnamed, such as the "([0-9]+)" in "([0-9]+) *dollars", or named, such as the "(?<amount>[0-9]+)" in "(?<amount>[0-9]+) *dollars", then the following attributes giving information about each matched capture group are added to the output.

	Attribute	Description
	capture.start	an integer matrix giving the start position of each capture group, with a column for each capture group and a row for each match.
	capture.length	an integer matrix giving the length of each capture group, with a column for each capture group and a row for each match.
	capture.names
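
The capture attributes can be inspected as in the following sketch. The input string is hypothetical, and the pattern is the named-group example from the text above.

```r
x <- "5 dollars and 12 dollars"
g <- gregexpr("(?<amount>[0-9]+) *dollars", x, perl = TRUE)[[1]]

cs <- attr(g, "capture.start")   # one row per match, one column per group
cl <- attr(g, "capture.length")

as.integer(g)                    # 1 15, the starts of the whole matches
as.integer(cs)                   # 1 15, the starts of the 'amount' group
as.integer(cl)                   # 1 2
attr(g, "capture.names")         # "amount"

# Recover the captured text from the start and length matrices.
amounts <- substring(x, cs[, 1], cs[, 1] + cl[, 1] - 1)
amounts                          # "5" "12"
```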
