regexpr
Match Patterns in Strings

Description

Searches character strings for substrings that match a regular expression.

Usage

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, 
     fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, 
     fixed = FALSE, useBytes = FALSE)

Arguments

pattern a character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments ignore.case, perl, fixed and useBytes.
text a vector of character strings in which to search.
ignore.case a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE.
perl a logical value. If TRUE, the pattern is interpreted as a Perl-compatible regular expression. If FALSE (the default), the pattern is interpreted as a POSIX extended regular expression.
fixed a logical value. If TRUE, the pattern is not treated as a regular expression; rather, it is treated as a literal sequence of characters.
  • If both fixed=TRUE and ignore.case=TRUE, the value of ignore.case is ignored.
  • If both fixed=TRUE and perl=TRUE, the value of perl is ignored and reset to FALSE, and the pattern is treated as a fixed sequence of characters.
useBytes a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes. If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding), then useBytes is set to TRUE and a warning is generated.
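As a rough illustration of how these arguments interact, the sketch below searches one small vector three ways. The vector x and the result names are invented for this example.

```r
# Hypothetical input, chosen only for illustration
x <- c("a.b", "axb", "A.B")

# fixed = TRUE: the period is a literal character, not "any character"
p_fixed <- regexpr(".", x, fixed = TRUE)
as.vector(p_fixed)    # 2 -1 2

# Default (regular expression): "." matches any single character
p_regex <- regexpr("a.b", x)
as.vector(p_regex)    # 1 1 -1

# ignore.case = TRUE: "A.B" now matches as well
p_nocase <- regexpr("a.b", x, ignore.case = TRUE)
as.vector(p_nocase)   # 1 1 1
```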

Details

If fixed=FALSE, the pattern argument specifies a regular expression. Certain punctuation characters are interpreted specially, as described below. Other characters in the pattern match the same character in the text. Case is significant unless ignore.case is TRUE.
Here we describe the POSIX standard extended regular expressions. The definition is recursive.
An 'atom' may be a single character other than one of the special characters '.[{()\*+?|^$', in which case it matches itself.
An atom may be a period, '.' which matches any character.
An atom may be '^' or '$' which match the start or end (respectively) of the entire string.
An atom may be an escape sequence consisting of a backslash '\' followed by one or more characters. The supported escape sequences are described below. As an example, the escape sequence '\n' matches a linefeed character. An undefined escape sequence just matches the character following the backslash, so '\\' matches the backslash character itself. Note that when typing an R string containing a backslash, the backslash must be doubled. Therefore, one would include the escape sequence '\n' within a string by typing "aaa\\nbbb".
An atom may also be a bracket expression (see below) or a (possibly empty) regular expression enclosed in parentheses, in which case it matches what the bracket expression or regular expression matches.
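The backslash-doubling rule for escape sequences can be checked with a short sketch. The pattern and the sample string "v1.2" are made up for illustration.

```r
# The R string "\\." holds two characters: a backslash and a period
pat <- "\\."
nchar(pat)                         # 2

# As a regular expression, \. matches a literal period
m_escaped <- regexpr(pat, "v1.2")
as.vector(m_escaped)               # 3

# Without the escape, "." matches any character, so the match is at position 1
m_any <- regexpr(".", "v1.2")
as.vector(m_any)                   # 1
```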
A 'bracket expression' is a list of characters or character ranges (2 characters separated by a hyphen, '-') enclosed in square brackets. It matches any character in the list. If the list starts with a circumflex, '^', then it matches any character not in the remainder of the list. To include a '-' in a bracket list, make it the last entry.
A bracket expression may also contain a 'character class' of the form '[:name:]' where name specifies the set of characters that match it. The name can be one of the following:
'[:alnum:]' (alphabetic or numeric digit)
'[:alpha:]' (alphabetic)
'[:blank:]' (any whitespace except for line separators)
'[:cntrl:]' (control characters)
'[:digit:]' or '[:d:]' (numeric digit)
'[:graph:]' (graphical)
'[:lower:]' or '[:l:]' (lower-case alphabetic)
'[:print:]' (printable)
'[:punct:]' (punctuation)
'[:space:]' or '[:s:]' (any whitespace)
'[:unicode:]' (any Unicode character above 0xFF)
'[:upper:]' or '[:u:]' (upper-case alphabetic)
'[:word:]' or '[:w:]' (any alphabetic or numeric digit, or underscore)
'[:xdigit:]' (hexadecimal digit)
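A brief sketch of character classes inside bracket expressions follows. It uses only the long class names, on the assumption that the short forms such as '[:d:]' may not be recognized by every R-compatible engine; the sample strings are invented.

```r
s <- "abc123def"

# One or more digits: the class name goes inside a bracket expression
m <- regexpr("[[:digit:]]+", s)
as.vector(m)                                              # 4
attr(m, "match.length")                                   # 3
digits <- substring(s, m, m + attr(m, "match.length") - 1)
digits                                                    # "123"

# A leading '^' negates the list: first character that is not alphanumeric
non_alnum <- regexpr("[^[:alnum:]]", "abc_1-2")
as.vector(non_alnum)                                      # 4 (the underscore)
```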
A 'piece' of a regular expression is an atom, possibly followed by a repeat quantifier: an asterisk ('*', 0 or more repeats), a plus sign ('+', 1 or more repeats), a question mark ('?', 0 or 1 repeats), or a bound, '{min,max}' or '{count}' or '{min,}'. The bound {min,max} means between min and max repeats. If max is missing it is taken to be infinity. If there is no comma then it matches exactly the given count of repeats. E.g., '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.
Finally, a 'branch' is a sequence of pieces, concatenated, and a 'regular expression' is a sequence of branches separated by vertical bars, '|'. The regular expression matches if any branch in it matches.
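The quantifiers and the vertical bar described above can be tried out as in the following sketch. The sample strings are hypothetical.

```r
# A bound: between 2 and 3 repeats of 'a'; the match takes as many as allowed
m_bound <- regexpr("a{2,3}", "caaaat")
as.vector(m_bound)                 # 2
attr(m_bound, "match.length")      # 3

# Two branches separated by '|': either one may match
m_alt <- regexpr("cat|dog", c("hotdog", "cat", "bird"))
as.vector(m_alt)                   # 4 1 -1

# '?' makes the preceding atom optional
m_opt <- regexpr("colou?r", c("color", "colour"))
attr(m_opt, "match.length")        # 5 6
```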
Back References
Any atom enclosed in parentheses 'remembers' the characters that it matched, and these characters can be matched again using a 'back reference', an escape sequence of the form '\1', '\2', etc. The digit specifies the parentheses in the pattern, counting from the beginning. Thus, the pattern '(a+)b\1' will match the entire string 'aaabaaa', since '(a+)' matches three 'a' characters before the 'b', and '\1' matches these three 'a' characters after the 'b'.
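The back-reference example above can be written as a call in the following way. Note the doubled backslash needed to place '\1' in an R string; the result name m_backref is invented.

```r
# "(a+)b\\1" is the regular expression (a+)b\1
m_backref <- regexpr("(a+)b\\1", "aaabaaa")
as.vector(m_backref)                # 1
attr(m_backref, "match.length")     # 7, the entire string
```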
Collating Elements
A bracket expression may contain a collating element of the form '[.x.]', where 'x' may be a single character. This can be used to add characters to a bracket expression that are not normally permitted, such as '[[.^.]abc]' to match any of the characters 'a', 'b', 'c', or '^'.
A collating element may also contain a symbolic name for a character, such as '[.newline.]' or '[.ESC.]'.
Equivalence Classes
A bracket expression may contain an equivalence class of the form '[=x=]', where 'x' is a character or symbolic name for a character, like a collating element. This form matches any characters that match the specified character, ignoring case and accents. Thus, '[[=a=]]' will match 'a' or 'A' or either of these with accents, tildes, etc.
This will not work the same under different locales: in a locale where 'a' is considered to be a distinct letter from 'a-with-two-dots', '[[=a=]]' will not match the second one.
Escape Sequences Matching a Single Character
The following escape sequences match a single character:
'\a' (bell)
'\e' (escape)
'\f' (form feed)
'\n' (line feed)
'\r' (carriage return)
'\t' (tab)
'\v' (vertical tab)
The following escape sequences match a specified character with a given code point:
'\cX' (control-X, where X is a character that specifies the corresponding control character with code point 1-31)
'\xdd' (matches the character with hexadecimal code point 0xdd)
'\x{dddd}' (matches the character with hexadecimal code point 0xdddd)
'\0ddd' (matches the character with octal code point 0ddd)
'\N{name}' (matches the character with the specified symbolic name, such as 'newline')
Escape Sequences Matching a Character Class
The following escape sequences are abbreviations for certain character classes:
'\d' (equivalent to [[:digit:]])
'\l' (equivalent to [[:lower:]])
'\s' (equivalent to [[:space:]])
'\u' (equivalent to [[:upper:]])
'\w' (equivalent to [[:word:]])
'\D' (equivalent to [^[:digit:]], thus everything but a digit)
'\L' (equivalent to [^[:lower:]])
'\S' (equivalent to [^[:space:]])
'\U' (equivalent to [^[:upper:]])
'\W' (equivalent to [^[:word:]])
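A small sketch of the class abbreviations follows. It is limited to '\d', '\s', and '\W', on the assumption that these are the most widely supported of the forms listed above; the sample strings are invented.

```r
# "\\d+" is the regular expression \d+ : one or more digits
m_digits <- regexpr("\\d+", "abc 42")
as.vector(m_digits)                 # 5
attr(m_digits, "match.length")      # 2

# \W: first character that is not a letter, digit, or underscore
m_nonword <- regexpr("\\W", "foo_bar baz")
as.vector(m_nonword)                # 8 (the space; the underscore is a word character)

# \s: first whitespace character
m_space <- regexpr("\\s", "foo_bar baz")
as.vector(m_space)                  # 8
```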
The following escape sequences specify a character class by name:
'\pX' (matches the character class with single-character name X, so '\pd' is the same as '[[:d:]]').
'\p{XXX}' (matches the character class with name XXX, so '\p{digit}' is the same as '[[:digit:]]').
'\PX' (matches all characters except the character class with single-character name X, so '\Pd' is the same as '[^[:d:]]').
'\P{XXX}' (matches all characters except the character class with name XXX, so '\P{digit}' is the same as '[^[:digit:]]').
Word Boundaries
The start and end of a word are matched by the special patterns '[[:<:]]' and '[[:>:]]', respectively, where a 'word' is a sequence of 1 or more alphanumerics and underscores. The start and end of a word can also be matched by the escape sequences '\<' and '\>'.
'\b' matches any word boundary (either the start or the end), and '\B' matches anywhere except at a word boundary.
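As an illustration of word boundaries, the sketch below finds 'cat' only where it stands as a whole word. The sample string is hypothetical, and only the '\b' form is shown.

```r
s <- "cat concat cat"

# \bcat\b: 'cat' with a word boundary on both sides
g_words <- gregexpr("\\bcat\\b", s)[[1]]
as.vector(g_words)                  # 1 12 (the 'cat' inside 'concat' is skipped)

# Without the boundaries, all three occurrences are found
g_all <- gregexpr("cat", s)[[1]]
as.vector(g_all)                    # 1 8 12
```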
Buffer Boundaries
The following escape sequences match the beginning or end of the entire string.
'\`' or '\A' matches the start of the buffer.
'\'' or '\z' matches the end of the buffer.
'\Z' matches zero or more newline characters at the end of the buffer.
Quoting Escape
The escape sequence '\Q' specifies the beginning of a sequence of characters to be 'quoted'. The characters following it until the end of the string, or the escape sequence '\E', are taken literally. Thus, the pattern '\Q[^a]\E' will match the substring '[^a]' in the string 'abc[^a]def'.
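The quoting example above might be run as follows. Passing perl = TRUE here is an assumption made for portability, since some R-compatible engines may recognize '\Q' and '\E' only in Perl mode; according to the text above it should not be required.

```r
# "\\Q[^a]\\E" is the regular expression \Q[^a]\E : the literal text [^a]
m_quoted <- regexpr("\\Q[^a]\\E", "abc[^a]def", perl = TRUE)
as.vector(m_quoted)                 # 4
attr(m_quoted, "match.length")      # 4

# fixed = TRUE gives the same result for a wholly literal pattern
m_fixed <- regexpr("[^a]", "abc[^a]def", fixed = TRUE)
as.vector(m_fixed)                  # 4
```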
Perl-compatible Regular Expressions
The Perl language supports an extended version of regular expressions, accepting many forms in addition to the ones described above. For more details of Perl-compatible regular expressions, please visit http://perldoc.perl.org/perlre.html.
Unmatched Right Parentheses
When fixed=FALSE, parentheses are considered to be part of the pattern language and must be preceded by a (doubled) backslash to be taken literally.
Unmatched (and unescaped) parentheses usually result in an error.
An exception: If perl=FALSE and fixed=FALSE, an unmatched right parenthesis will be matched literally. Thus, regexpr('b)', 'ab)c', perl=FALSE) will match. regexpr('b)', 'ab)c', perl=TRUE) generates an error.
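The sketch below shows three ways of handling parentheses in a pattern: escaped, literal via fixed = TRUE, and unescaped (grouping). The sample string "f(x)" is invented, and the unmatched-parenthesis cases are deliberately left out.

```r
# Escaped parentheses match literally: "\\(x\\)" is the expression \(x\)
m_escaped <- regexpr("\\(x\\)", "f(x)")
as.vector(m_escaped)                # 2
attr(m_escaped, "match.length")     # 3

# fixed = TRUE treats the whole pattern literally, so no escaping is needed
m_literal <- regexpr("(x)", "f(x)", fixed = TRUE)
as.vector(m_literal)                # 2

# Unescaped, the parentheses only group, so just the 'x' is matched
m_group <- regexpr("(x)", "f(x)")
as.vector(m_group)                  # 3
attr(m_group, "match.length")       # 1
```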
Value
regexpr returns a numeric vector with one element for each element of text, giving the position in the character string of the first substring matching the regular expression. A value of -1 indicates that no match was found. An attribute, "match.length", is a numeric vector giving the length of the longest possible matching substring starting at that position, or -1 for no match. Note that a "match.length" value can be zero when matching a regular expression such as "^".
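A minimal sketch of inspecting this return value, including the no-match and zero-length cases, with invented input:

```r
x <- c("abc", "xyz")
m <- regexpr("b", x)
as.vector(m)                        # 2 -1  (no 'b' in "xyz")
attr(m, "match.length")             # 1 -1

# "^" matches at the start of the string but consumes no characters
z <- regexpr("^", "abc")
as.vector(z)                        # 1
attr(z, "match.length")             # 0
```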
gregexpr produces all of the matches for the regular expression in each string, rather than only the first one. It returns a list with one entry for each element of text. Each entry has the format of the output of regexpr, a numeric vector with the starting positions of each match within the string, with an attribute, "match.length", giving the length of each match. If there are no matches in a string, the output entry is the value minus one with a "match.length" value of minus one.
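The list structure returned by gregexpr can be examined as in this sketch, where the input strings are hypothetical:

```r
g <- gregexpr("[0-9]+", c("a1b22", "none"))
length(g)                           # 2, one entry per element of text

# First string: two matches, "1" at position 2 and "22" at position 4
as.vector(g[[1]])                   # 2 4
attr(g[[1]], "match.length")        # 1 2

# Second string: no match
as.vector(g[[2]])                   # -1
attr(g[[2]], "match.length")        # -1
```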
When using a Perl regular expression containing parenthesized "capture groups", either unnamed, such as the "([0-9]+)" in "([0-9]+) *dollars", or named, such as the "(?<amount>[0-9]+)" in "(?<amount>[0-9]+) *dollars", then attributes giving information about each matched capture group are added to the output. These attributes are called capture.start, capture.length, and capture.names. The first two are integer matrices with a column for each capture group and a row for each match.
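The named capture group from the paragraph above could be extracted roughly as follows. The attribute names capture.start and capture.length are the ones used in the Examples section; the sample text is invented.

```r
txt <- "pay 25 dollars"
m <- regexpr("(?<amount>[0-9]+) *dollars", txt, perl = TRUE)
as.vector(m)                        # 5, start of "25 dollars"
attr(m, "match.length")             # 10

# Each capture attribute is a matrix with one column per group
cs <- attr(m, "capture.start")
cl <- attr(m, "capture.length")
attr(m, "capture.names")            # "amount"

# Pull out only the text matched by the group named "amount"
amount <- substring(txt, cs[, "amount"], cs[, "amount"] + cl[, "amount"] - 1)
amount                              # "25"
```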
Differences between TIBCO Enterprise Runtime for R and Open-source R
NOTE Over time, we may modify TIBCO Enterprise Runtime for R so it is closer to open-source R's behavior, and remove items from this list. In many cases, the TIBCO Enterprise Runtime for R matcher handles POSIX-standard forms that open-source R does not handle. We might want to detect these cases and generate an error.
See Also
strsplit, gsub, grep, substring, Encoding
Examples
x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w

# Extract the numbers:
as.numeric(substring(x, w, w+attr(w, "match.length")-1))

# Extract the capitalized words
w1 <- regexpr("[A-Z][a-z]*", x)
substring(x, w1, w1+attr(w1, "match.length")-1)
# Do the same with sub. Note that \\n in
# the replacement string refers to the n'th parenthesized
# subexpression in the pattern.
sub("(.*)([A-Z][a-z]*)(.*)", "\\2", x)

# get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s)
substring(s, r, r + attr(r, "match.length") - 1)
regmatches(s, r) # like above substring, but omits non-matched strings

# find the ATAT... sequences in two strings
gregexpr("(AT){2,}", c("GATATATCATCATATC", "ATATG"))

# perl capture groups and the perl (?:...) non-capturing group
# "\u20AC" is the unicode euro currency symbol
txt1 <- c("5 for $3.75", "\u20AC27 OBO")
m1 <- regexpr(
    "(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
    txt1, perl=TRUE)
# show the currency symbols
regmatches(txt1, with(attributes(m1),
    structure(capture.start[,"currency"], match.length=capture.length[,"currency"])))
# show the currency amounts
regmatches(txt1, with(attributes(m1),
    structure(capture.start[,"amount"], match.length=capture.length[,"amount"])))

txt2 <- c("$3.75 in US$, \u20AC3 in euro", "$30 for 10")
m2 <- gregexpr(
    "(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
    txt2, perl=TRUE)
regmatches(txt2, m2) # show entire matches

m2amount <- lapply(m2, function(m) with(attributes(m),
    structure(capture.start[,"amount"], match.length=capture.length[,"amount"])))
regmatches(txt2, m2amount) # show just the amounts, no currency symbol

Package base version 4.0.0-28