regexpr
Match Patterns in Strings

Description

Searches for matches of a regular expression in character strings.

Usage

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, 
        fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, 
         fixed = FALSE, useBytes = FALSE)

Arguments

pattern a character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments ignore.case, perl, fixed and useBytes.
text a vector of character strings in which to search.
ignore.case a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE.
perl a logical value.
  • If FALSE (the default), the pattern is interpreted as a POSIX extended regular expression (handled by the TRE library, http://laurikari.net/tre/).
  • If TRUE, the pattern is interpreted as a Perl-compatible regular expression (handled by the PCRE library, http://www.pcre.org).
fixed a logical value. If TRUE, the pattern is not treated as a regular expression; rather, it is treated as a literal sequence of characters.
  • If both fixed=TRUE and ignore.case=TRUE, the value of ignore.case is ignored.
  • If both fixed=TRUE and perl=TRUE, the value of perl is ignored and reset to FALSE, and the pattern is treated as a fixed sequence of characters.
useBytes a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes. If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding), then useBytes is set to TRUE.
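
The effect of fixed and ignore.case can be seen in a small sketch. The input vector x is made up for illustration and is not part of the original page; positions are converted with as.integer only to drop the attributes for display.

```r
# Hypothetical input, for illustration only.
x <- c("abc", "a.c", "ABC")

# As a regular expression, "." matches any single character.
re_pos <- as.integer(regexpr("a.c", x))                          # 1 1 -1

# With fixed = TRUE, "." is a literal period.
fixed_pos <- as.integer(regexpr("a.c", x, fixed = TRUE))         # -1 1 -1

# With ignore.case = TRUE, "ABC" matches as well.
nocase_pos <- as.integer(regexpr("a.c", x, ignore.case = TRUE))  # 1 1 1
```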

Details

If fixed=FALSE, the pattern argument specifies a regular expression. Certain punctuation characters are interpreted specially, as described below. Other characters in the pattern match the same character in the text. Case is significant unless ignore.case is TRUE.
The following sections describe the POSIX standard extended regular expressions. The definition is recursive.
atom
bracket expression
piece of a regular expression
branch
Back References
Any atom enclosed in parentheses 'remembers' the characters that it matched, and these characters can be matched again using a 'back reference', an escape sequence of the form '\1', '\2', and so on. The digit specifies the parentheses in the pattern, counting from the beginning. Thus, the pattern '(a+)b\1' matches the entire string 'aaabaaa', because '(a+)' matches three 'a' characters before the 'b', and '\1' matches these three 'a' characters after the 'b'.
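
The back reference example above can be checked directly. This is a minimal sketch; note that inside an R string literal the backslash must be doubled, so '\1' is written "\\1".

```r
# "(a+)" captures "aaa", then "\\1" must match the same three characters again.
m <- regexpr("(a+)b\\1", "aaabaaa")
bref_start <- as.integer(m)          # 1
bref_len <- attr(m, "match.length")  # 7, the entire string
```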
Escape Sequences Matching a Single Character
The following escape sequences match a single character:
'\a' (bell)
'\e' (escape)
'\f' (form feed)
'\n' (line feed)
'\r' (carriage return)
'\t' (tab)
'\v' (vertical tab)
The following escape sequences match a specified character with a given code point:
'\xdd' (matches the character with hexadecimal code point 0xdd)
'\x{dddd}' (matches the character with hexadecimal code point 0xdddd)
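
A short sketch of these single-character escapes follows. It is run with perl = TRUE, an assumption made here because the PCRE documentation describes these escapes explicitly; the input strings are invented for illustration.

```r
# "\\t" in the pattern matches the tab character stored in the text.
tab_pos <- as.integer(regexpr("\\t", "a\tb", perl = TRUE))           # 2

# 0x41 is the code point of "A".
hex_pos <- as.integer(regexpr("\\x41", "xAy", perl = TRUE))          # 2
hex_brace_pos <- as.integer(regexpr("\\x{41}", "xAy", perl = TRUE))  # 2
```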
Escape Sequences Matching a Character Class
The following escape sequences are abbreviations for certain character classes:
'\d' (equivalent to [[:digit:]])
'\s' (equivalent to [[:space:]])
'\w' (equivalent to [[:alnum:]_], thus common word characters)
'\D' (equivalent to [^[:digit:]], thus everything but a digit)
'\S' (equivalent to [^[:space:]])
'\W' (equivalent to [^[:alnum:]_])
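
As an illustration of the class abbreviations, the following sketch locates the first match of several of them in one made-up string (the string s is hypothetical).

```r
s <- "id_7 = 42"

first_digit <- as.integer(regexpr("\\d+", s))   # 4, the "7"
first_space <- as.integer(regexpr("\\s", s))    # 5

# "\\w+" includes the underscore, so it spans "id_7".
w <- regexpr("\\w+", s)
word_start <- as.integer(w)                     # 1
word_len <- attr(w, "match.length")             # 4

# "\\D+" stops at the first digit, so it spans "id_".
d <- regexpr("\\D+", s)
nondigit_len <- attr(d, "match.length")         # 3
```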
Word Boundaries
The start and end of a word can be matched by the escape sequences '\<' and '\>', where a 'word' is a sequence of 1 or more alphanumerics and underscores.
'\b' matches any word boundary (either the start or the end), and '\B' matches anywhere except at a word boundary.
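
The difference a word boundary makes can be sketched as follows; the text "concat cat" is a made-up example.

```r
txt <- "concat cat"

# Without boundaries, the first "cat" is the tail of "concat".
plain_pos <- as.integer(regexpr("cat", txt))         # 4

# With start-of-word and end-of-word anchors, only the standalone word matches.
whole_pos <- as.integer(regexpr("\\<cat\\>", txt))   # 8

# "\\b" accepts either kind of boundary.
either_pos <- as.integer(regexpr("\\bcat\\b", txt))  # 8
```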
Quoting Escape
The escape sequence '\Q' specifies the beginning of a sequence of characters to be 'quoted'. The characters following it, up to the end of the pattern or the escape sequence '\E', are taken literally. Thus, the pattern '\Q[^a]\E' matches the substring '[^a]' in the string 'abc[^a]def'.
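
A minimal sketch of the quoting escape, run with perl = TRUE (an assumption made here because PCRE documents '\Q' and '\E' explicitly):

```r
# Between "\\Q" and "\\E" the bracket expression is taken literally.
q <- regexpr("\\Q[^a]\\E", "abc[^a]def", perl = TRUE)
quoted_start <- as.integer(q)          # 4
quoted_len <- attr(q, "match.length")  # 4, the characters "[^a]"
```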
Perl Compatible Regular Expressions
The Perl language supports an extended version of regular expressions, accepting many forms in addition to the ones described above. For more details on Perl-compatible regular expressions, see http://perldoc.perl.org/perlre.html.
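
As one illustration of a Perl-only form, the sketch below uses a lookahead assertion, '(?=...)', which requires that the following text matches without including it in the match. The inputs are invented for illustration.

```r
# "foo" matches only when immediately followed by "bar";
# "bar" itself is not part of the match.
la <- regexpr("foo(?=bar)", c("foobar", "foobaz"), perl = TRUE)
la_pos <- as.integer(la)            # 1 -1
la_len <- attr(la, "match.length")  # 3 -1
```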
Unmatched Right Parentheses
When fixed=FALSE, parentheses are considered to be part of the pattern language and must be preceded by a (doubled) backslash to be taken literally.
Unmatched (and unescaped) parentheses usually result in an error.
An exception: If perl=FALSE and fixed=FALSE, an unmatched right parenthesis will be matched literally. Thus, regexpr('b)', 'ab)c', perl=FALSE) will match. regexpr('b)', 'ab)c', perl=TRUE) generates an error.
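
The usual ways to match literal parentheses can be sketched as follows. The string txt is a made-up example; this sketch covers only escaped and fixed patterns, not the unmatched-parenthesis exception.

```r
txt <- "a(b)c"

# Escaped parentheses (doubled backslashes in the R string) are literal.
esc <- regexpr("\\(b\\)", txt)
esc_pos <- as.integer(esc)           # 2
esc_len <- attr(esc, "match.length") # 3

# With fixed = TRUE no escaping is needed.
lit_pos <- as.integer(regexpr("(b)", txt, fixed = TRUE))  # 2

# Unescaped, the parentheses only group, so just "b" is matched.
grp <- regexpr("(b)", txt)
grp_pos <- as.integer(grp)           # 3
grp_len <- attr(grp, "match.length") # 1
```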

Value

regexpr returns a numeric vector, with one element for each element of text, giving the position in the character string of the first substring matching the regular expression. Minus ones (-1) indicate that no match was found.

An attribute, "match.length", is a numeric vector giving the length of the longest possible matching substring starting at that position, or minus one (-1) for no match.

Note that a "match.length" value can be zero when matching a regular expression such as "^".
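
The shape of the regexpr result, including the zero-length case, can be sketched as follows (the input vector is hypothetical):

```r
x <- c("10 Sept", "Oct 9th", "none")
m <- regexpr("[0-9]+", x)
m_pos <- as.integer(m)            # 1 5 -1
m_len <- attr(m, "match.length")  # 2 1 -1

# "^" matches the empty string at the start, so the length is zero.
z <- regexpr("^", "abc")
z_pos <- as.integer(z)            # 1
z_len <- attr(z, "match.length")  # 0
```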

gregexprproduces all of the matches for the regular expression in each string, rather than only the first one.

It returns a list with one entry for each element of text. Each entry has the format of the output of regexpr, a numeric vector with the starting positions of each match within the string, with an attribute, "match.length", giving the length of each match.

If there are no matches in a string, the output entry is the value minus one (-1) with a "match.length" value of minus one (-1).
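
A minimal sketch of the list returned by gregexpr, using made-up input with one string that matches twice and one that does not match:

```r
g <- gregexpr("[0-9]+", c("a1b22", "none"))

# First string: two matches, "1" at position 2 and "22" at position 4.
g1_pos <- as.integer(g[[1]])            # 2 4
g1_len <- attr(g[[1]], "match.length")  # 1 2

# Second string: no match, reported as -1 with match.length -1.
g2_pos <- as.integer(g[[2]])            # -1
g2_len <- attr(g[[2]], "match.length")  # -1
```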

When you use a Perl regular expression containing parenthesized "capture groups", either unnamed, such as the "([0-9]+)" in "([0-9]+) *dollars", or named, such as the "(?<amount>[0-9]+)" in "(?<amount>[0-9]+) *dollars", then the following attributes giving information about each matched capture group are added to the output.
Attribute Description
capture.start an integer matrix of capture group start positions, with a column for each match group and a row for each match.
capture.length an integer matrix of capture group lengths, with a column for each match group and a row for each match.
capture.names
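
The capture attributes can be used to pull out a named group, as in this sketch. The input string is invented, and indexing the matrix columns by group name is an assumption based on the usage shown in the Examples section.

```r
txt <- "pay 25 dollars"
m <- regexpr("(?<amount>[0-9]+) *dollars", txt, perl = TRUE)

cs <- attr(m, "capture.start")   # one row per element of txt, one column per group
cl <- attr(m, "capture.length")
cn <- attr(m, "capture.names")   # "amount"

# Extract the text of the named group from its start and length.
amount <- substring(txt, cs[1, "amount"], cs[1, "amount"] + cl[1, "amount"] - 1)  # "25"
```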

See Also

strsplit, gsub, grep, substring, Encoding

Examples

x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w

# Extract the numbers:
as.numeric(substring(x, w, w+attr(w, "match.length")-1))

# Extract the capitalized words
w1 <- regexpr("[A-Z][a-z]*", x)
substring(x, w1, w1+attr(w1, "match.length")-1)
# Do the same with sub. Note that \\n in
# the replacement string refers to the n'th parenthesized
# subexpression in the pattern.
sub("(.*)([A-Z][a-z]*)(.*)", "\\2", x)

# get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s)
substring(s, r, r + attr(r, "match.length") - 1)
regmatches(s, r) # like above substring, but omits non-matched strings

# find the ATAT... sequences in two strings
gregexpr("(AT){2,}", c("GATATATCATCATATC", "ATATG"))

# perl capture groups and the perl (?:...) non-capturing group
# "\u20AC" is the unicode euro currency symbol
txt1 <- c("5 for $3.75", "\u20AC27 OBO")
m1 <- regexpr(
    "(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
    txt1, perl=TRUE)
# show the currency symbols
regmatches(txt1, with(attributes(m1),
    structure(capture.start[,"currency"], match.length=capture.length[,"currency"])))
# show the currency amounts
regmatches(txt1, with(attributes(m1),
    structure(capture.start[,"amount"], match.length=capture.length[,"amount"])))

Package base version 6.1.1-7