regexpr
Match Patterns in Strings
Description
Searches for matches of a regular expression in character strings.
Usage
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
Arguments
pattern |
a character string specifying the pattern to search for.
The interpretation of the pattern is controlled by the logical-valued
arguments ignore.case, perl, fixed and useBytes.
|
text |
a vector of character strings in which to search.
|
ignore.case |
a logical value. If TRUE, uppercase and lowercase characters are
considered equivalent when matching. The default is FALSE.
|
perl |
a logical value. If TRUE, the pattern is interpreted as a
Perl-compatible regular expression. If FALSE (the default), the
pattern is interpreted as a POSIX extended regular expression.
|
fixed |
a logical value. If TRUE, the pattern is not treated
as a regular expression; rather, it is treated as a literal sequence of
characters.
- If both fixed=TRUE and ignore.case=TRUE,
the value of ignore.case is ignored.
- If both fixed=TRUE and perl=TRUE,
the value of perl is ignored and reset to FALSE,
and the pattern is treated as a fixed sequence of characters.
|
useBytes |
a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes.
If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding),
then useBytes is set to TRUE and a warning is generated.
|
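As a minimal sketch of how fixed and ignore.case change the interpretation of a pattern (the variable names are illustrative only, and the calls are assumed to behave the same in open-source R):

```r
# A period is a regular-expression metacharacter unless fixed = TRUE.
as_regex   <- regexpr(".", "a.b")                # '.' matches any character
as_literal <- regexpr(".", "a.b", fixed = TRUE)  # '.' matches only a period
as.integer(as_regex)    # 1
as.integer(as_literal)  # 2

# ignore.case = TRUE makes uppercase and lowercase equivalent.
no_case <- regexpr("abc", "xxABC", ignore.case = TRUE)
as.integer(no_case)            # 3
attr(no_case, "match.length")  # 3
```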
Details
If fixed=FALSE, the pattern argument specifies a regular expression.
Certain punctuation characters are interpreted specially, as described below.
Other characters in the pattern match the same character in the text.
Case is significant unless ignore.case is TRUE.
Here we describe the POSIX standard extended regular expressions.
The definition is recursive.
An 'atom' may be a single character other than
one of the special characters '.[{()\*+?|^$',
in which case it matches itself.
An atom may be a period, '.', which matches any character.
An atom may be '^' or '$', which match the start or end (respectively)
of the entire string.
An atom may be an escape sequence consisting of a backslash '\'
followed by one or more characters.
The supported escape sequences are described below.
As an example, the escape sequence '\n' matches a linefeed character.
An undefined escape sequence just matches the character following the backslash,
so '\\' matches the backslash character itself.
Note that when typing an S string containing a backslash, it must be doubled.
Therefore, one would include the escape sequence '\n' within a string by typing "aaa\\nbbb".
An atom may also be a bracket expression
(see below) or a (possibly empty) regular expression enclosed in
parentheses, in which case it matches what the bracket expression
or regular expression matches.
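The backslash doubling described above can be sketched as follows. The examples use only escapes of special characters, which are assumed to behave the same in open-source R; the variable names are illustrative.

```r
# The string literal "\\." holds two characters: a backslash and a period.
nchar("\\.")                          # 2

# As a pattern, backslash-period matches a literal period.
escaped_dot <- regexpr("\\.", "a.b")
as.integer(escaped_dot)               # 2

# Four backslashes in source code give a two-backslash pattern,
# which matches a single backslash in the text.
text_with_backslash <- "a\\b"         # the three characters a, \, b
escaped_backslash <- regexpr("\\\\", text_with_backslash)
as.integer(escaped_backslash)         # 2
```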
A 'bracket expression' is a list of characters or character ranges
(2 characters separated by a hyphen, '-') enclosed in square brackets.
It matches any character in the list.
If the list starts with a circumflex, '^', then it matches any
character not in the remainder of the list.
To include a '-' in a bracket list, make it the last entry.
A bracket expression may also contain a 'character class'
of the form '[:name:]' where name specifies the set of characters that match it.
The name can be one of the following:
'[:alnum:]' | (alphabetic or numeric digit) |
'[:alpha:]' | (alphabetic) |
'[:blank:]' | (any whitespace except for line separators) |
'[:cntrl:]' | (control characters) |
'[:digit:]' or '[:d:]' | (numeric digit) |
'[:graph:]' | (graphical) |
'[:lower:]' or '[:l:]' | (lower-case alphabetic) |
'[:print:]' | (printable) |
'[:punct:]' | (punctuation) |
'[:space:]' or '[:s:]' | (any whitespace) |
'[:unicode:]' | (any Unicode character above 0xFF) |
'[:upper:]' or '[:u:]' | (upper-case alphabetic) |
'[:word:]' or '[:w:]' | (any alphabetic or numeric digit, or underscore) |
'[:xdigit:]' | (hexadecimal digit)
|
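A short sketch of bracket expressions and character classes follows. It uses only the standard POSIX class names, since the short forms such as '[:d:]' are described later on this page as not accepted by open-source R; the variable names are illustrative.

```r
# A character class inside a bracket expression, repeated one or more times.
digits <- regexpr("[[:digit:]]+", "abc123def")
as.integer(digits)             # 4
attr(digits, "match.length")   # 3

# A leading circumflex negates the list: first non-alphabetic character.
not_alpha <- regexpr("[^[:alpha:]]", "ab-cd")
as.integer(not_alpha)          # 3

# A hyphen placed last in the list is taken literally.
hyphen_last <- regexpr("[xyz-]", "ab-cd")
as.integer(hyphen_last)        # 3
```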
A 'piece' of a regular expression is an atom, possibly followed
by a repeat quantifier: an asterisk ('*', 0 or more repeats),
a plus sign ('+', 1 or more repeats),
a question mark ('?', 0 or 1 repeats),
or a bound, '{min,max}' or '{count}' or '{min,}'.
The bound {min,max} means between min and max repeats.
If max is missing it is taken to be infinity.
If there is no comma then it matches exactly the given count of repeats.
E.g., '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.
Finally, a 'branch' is a sequence of pieces, concatenated, and
a 'regular expression' is a sequence of branches separated by
vertical bars, '|'. The regular expression matches if any
branch in it matches.
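The quantifiers and alternation described above can be illustrated with a few small calls. This is a sketch with illustrative variable names, using only forms that both this page and open-source R are understood to accept.

```r
# A bound {min,max}: 'a' followed by two or three 'b' characters.
bounded <- regexpr("ab{2,3}", "abbbbc")
as.integer(bounded)               # 1
attr(bounded, "match.length")     # 4 (one 'a' plus three 'b's)

# An exact count {count}.
exact <- regexpr("b{2}", "abbbbc")
as.integer(exact)                 # 2
attr(exact, "match.length")       # 2

# '?' makes the preceding atom optional.
optional <- regexpr("colou?r", c("color", "colour"))
attr(optional, "match.length")    # 5 6

# Branches separated by '|': either alternative may match.
alternation <- regexpr("cat|dog", "hotdog")
as.integer(alternation)           # 4
```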
Back References
Any atom enclosed in parentheses 'remembers' the characters that it matched,
and these characters can be matched again using a 'back reference',
an escape sequence of the form '\1', '\2', etc.
The digit specifies the parentheses in the pattern, counting from the beginning.
Thus, the pattern '(a+)b\1' will match the entire string 'aaabaaa',
since '(a+)' matches three 'a' characters before the 'b',
and '\1' matches these three 'a' characters after the 'b'.
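The back reference example above can be checked directly. Note that the backslash must be doubled in the string literal. The second call is an additional illustration, not taken from the text, that finds a doubled lowercase letter.

```r
# '(a+)b\1' matches the whole of 'aaabaaa'.
whole <- regexpr("(a+)b\\1", "aaabaaa")
as.integer(whole)               # 1
attr(whole, "match.length")     # 7

# '([a-z])\1' matches any lowercase letter followed by itself.
doubled <- regexpr("([a-z])\\1", c("letter", "abc"))
as.integer(doubled)             # 3 -1
```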
Collating Elements
A bracket expression may contain a collating element of the form '[.x.]',
where 'x' may be a single character.
This can be used to add characters to a bracket expression that are not normally permitted,
such as '[[.^.]abc]' to match any of the characters 'a', 'b', 'c', or '^'.
A collating element may also contain a symbolic name for a character, such as '[.newline.]' or '[.ESC.]'.
Equivalence Classes
A bracket expression may contain an equivalence class of the form '[=x=]',
where 'x' is a character or symbolic name for a character, like a collating element.
This form matches any characters that match the specified character, ignoring case and accents.
Thus, '[[=a=]]' will match 'a' or 'A' or either of these with accents, tildes, etc.
This will not work the same under different locales: in a locale where 'a' is considered to be
a distinct letter from 'a-with-two-dots', '[[=a=]]' will not match the second one.
Escape Sequences Matching a Single Character
The following escape sequences match a single character:
'\a' | (bell) |
'\e' | (escape) |
'\f' | (form feed) |
'\n' | (line feed) |
'\r' | (carriage return) |
'\t' | (tab) |
'\v' | (vertical tab)
|
The following escape sequences match a specified character with a given code point:
'\cX' | (control-X, where the character X
specifies the corresponding control character with code point 1-31) |
'\xdd' | (matches the character with hexadecimal code point 0xdd) |
'\x{dddd}' | (matches the character with hexadecimal code point 0xdddd) |
'\0ddd' | (matches the character with octal code point 0ddd) |
'\N{name}' | (matches the character with the specified symbolic name, such as 'newline')
|
Escape Sequences Matching a Character Class
The following escape sequences are abbreviations for certain character classes:
'\d' | (equivalent to [[:digit:]]) |
'\l' | (equivalent to [[:lower:]]) |
'\s' | (equivalent to [[:space:]]) |
'\u' | (equivalent to [[:upper:]]) |
'\w' | (equivalent to [[:word:]]) |
'\D' | (equivalent to [^[:digit:]], thus everything but a digit) |
'\L' | (equivalent to [^[:lower:]]) |
'\S' | (equivalent to [^[:space:]]) |
'\U' | (equivalent to [^[:upper:]]) |
'\W' | (equivalent to [^[:word:]]) |
|
The following escape sequences specify a character class by name:
'\pX' | (matches the character class with single-character name X, so '\pd' is the same as '[[:d:]]'). |
'\p{XXX}' | (matches the character class with name XXX, so '\p{digit}' is the same as '[[:digit:]]'). |
'\PX' | (matches all characters except the character class with single-character name X,
so '\Pd' is the same as '[^[:d:]]'). |
'\P{XXX}' | (matches all characters except the character class with name XXX,
so '\P{digit}' is the same as '[^[:digit:]]').
|
Word Boundaries
The start and end of a word are matched by the special patterns
'[[:<:]]' and '[[:>:]]', respectively,
where a 'word' is a sequence of 1 or more alphanumerics and underscores.
The start and end of a word can also be matched by the escape sequences
'\<' and '\>'.
'\b' matches any word boundary (either the start or the end),
and '\B' matches anywhere except at a word boundary.
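A brief sketch of the word-boundary escapes follows. The choice of perl setting for each call is an assumption made so that the examples also run in open-source R, which this page describes as accepting '\<' only when perl=FALSE; the variable names are illustrative.

```r
# '\<' matches at the start of a word: the 'a' that begins "aba".
word_start <- regexpr("\\<a", "bab aba")
as.integer(word_start)      # 5

# '\b' matches at either the start or the end of a word.
any_boundary <- regexpr("\\ba", "bab aba", perl = TRUE)
as.integer(any_boundary)    # 5

# Whole-word search: the "cat" inside "concat" is skipped.
whole_word <- gregexpr("\\bcat\\b", "cat concat cat", perl = TRUE)[[1]]
as.integer(whole_word)      # 1 12
```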
Buffer Boundaries
The following escape sequences match the beginning or end of the entire string.
'\`' or '\A' | matches the start of the buffer. |
'\'' or '\z' | matches the end of the buffer. |
'\Z' | matches zero or more newline characters at the end of the buffer.
|
Quoting Escape
The escape sequence '\Q' specifies the beginning of a sequence of characters to be 'quoted'.
The characters following it until the end of the string, or the escape sequence '\E',
are taken literally.
Thus, the pattern '\Q[^a]\E' will match the substring '[^a]' in the string 'abc[^a]def'.
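The quoting escape can be compared with fixed=TRUE and with the unquoted pattern. In this sketch perl=TRUE is an assumption chosen so that the call also runs in open-source R; the variable names are illustrative.

```r
# Quoted: the characters between \Q and \E are taken literally.
quoted <- regexpr("\\Q[^a]\\E", "abc[^a]def", perl = TRUE)
as.integer(quoted)              # 4
attr(quoted, "match.length")    # 4

# fixed = TRUE gives the same result for a fully literal pattern.
literal <- regexpr("[^a]", "abc[^a]def", fixed = TRUE)
as.integer(literal)             # 4

# Unquoted, '[^a]' is a bracket expression: the first character that is not 'a'.
unquoted <- regexpr("[^a]", "abc[^a]def")
as.integer(unquoted)            # 2
```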
Perl-Compatible Regular Expressions
The Perl language supports an extended version of regular expressions,
accepting many forms in addition to the ones described above.
For more details on Perl-compatible regular expressions,
see
http://perldoc.perl.org/perlre.html.
Unmatched Right Parentheses
When fixed=FALSE, parentheses are
considered to be part of the pattern language and
must be preceded by a (doubled) backslash to be taken
literally.
Unmatched (and unescaped) parentheses usually result in an error.
An exception:
If perl=FALSE and fixed=FALSE,
an unmatched right parenthesis will be matched literally.
Thus, regexpr('b)', 'ab)c', perl=FALSE) will match.
regexpr('b)', 'ab)c', perl=TRUE) generates an error.
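A portable way to match a literal parenthesis is to escape it, which is a sketch that should not depend on the perl setting. The last call wraps the unescaped perl=TRUE case in tryCatch; according to the text above it is expected to raise an error, so no particular output is claimed for it. The variable names are illustrative.

```r
# Escaped right parenthesis: matched literally in both modes.
escaped_posix <- regexpr("b\\)", "ab)c")
escaped_perl  <- regexpr("b\\)", "ab)c", perl = TRUE)
as.integer(escaped_posix)   # 2
as.integer(escaped_perl)    # 2

# Unescaped and unmatched with perl = TRUE: trap the error instead of stopping.
unescaped_perl <- tryCatch(
  suppressWarnings(regexpr("b)", "ab)c", perl = TRUE)),
  error = function(e) conditionMessage(e)
)
```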
Value
regexpr returns a numeric vector with one element for each element of text,
giving the position in the character string of the
first substring matching the regular expression.
Minus ones indicate no match was found.
An attribute, "match.length",
is a numeric vector giving the length of the longest possible matching substring
starting at that position, or minus one for no match.
Note that a "match.length" value can be zero
when matching a regular expression such as "^".
gregexpr produces all of the matches for the regular expression
in each string, rather than only the first one.
It returns a list with one entry for each element of text.
Each entry has the format of the output of regexpr,
a numeric vector with the starting positions of each match within the string,
with an attribute,
"match.length",
giving the length of each match.
If there are no matches in a string,
the output entry is the value minus one
with a "match.length" value of minus one.
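The shape of the two return values can be seen in a small sketch (the input strings and variable names are illustrative):

```r
x <- c("abc123", "no digits", "7 and 42")

# regexpr: one position per input string, -1 where nothing matches.
first <- regexpr("[0-9]+", x)
as.integer(first)                        # 4 -1 1
attr(first, "match.length")              # 3 -1 1

# gregexpr: a list with one entry per input string, holding every match.
all_matches <- gregexpr("[0-9]+", x)
length(all_matches)                      # 3
as.integer(all_matches[[3]])             # 1 7
attr(all_matches[[3]], "match.length")   # 1 2
as.integer(all_matches[[2]])             # -1

# A pattern such as "^" matches with length zero.
zero_width <- regexpr("^", "abc")
attr(zero_width, "match.length")         # 0
```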
When using a Perl regular expression containing parenthesized "capture groups",
either unnamed, such as the "([0-9]+)" in "([0-9]+) *dollars",
or named, such as the "(?<amount>[0-9]+)" in "(?<amount>[0-9]+) *dollars",
then attributes giving information about each matched capture group are added to the output.
These attributes are called capture.start, capture.length, and capture.names.
The first two are integer matrices with a column for each match group and a row
for each match.
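A minimal sketch of reading the capture-group attributes follows, assuming the attribute names capture.start, capture.length, and capture.names that the Examples section of this page uses. The input string and variable names are illustrative.

```r
txt <- "pay 25 dollars"
m <- regexpr("(?<amount>[0-9]+) *dollars", txt, perl = TRUE)

as.integer(m)                  # 5: the whole match starts at the digit '2'
attr(m, "match.length")        # 10: the length of "25 dollars"
attr(m, "capture.names")       # "amount"

# Pull out just the text matched by the named group.
group_start  <- attr(m, "capture.start")[, "amount"]
group_length <- attr(m, "capture.length")[, "amount"]
amount <- substring(txt, group_start, group_start + group_length - 1)
amount                         # "25"
```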
Differences between TIBCO Enterprise Runtime for R and Open-source R
NOTE Over time, we may modify TIBCO Enterprise Runtime for R so it is closer to open-source R's behavior, and remove items from this list.
In many cases, the TIBCO Enterprise Runtime for R matcher handles POSIX-standard forms that open-source R does not handle.
We might want to detect these cases and generate an error.
- In open-source R, the algorithms used for POSIX and Perl-style regular expressions are different from those in TIBCO Enterprise Runtime for R.
When perl=FALSE, open-source R uses the TRE library (http://laurikari.net/tre/) for POSIX regular expression matching.
When perl=TRUE, open-source R uses the PCRE library (http://www.pcre.org) for Perl-compatible regular expression matching.
- In open-source R, '.' does not match '\n' when perl=TRUE, but it does in TIBCO Enterprise Runtime for R.
Thus, regexpr('b.', 'ab\nc', perl=TRUE)
does not match in open-source R, but does in TIBCO Enterprise Runtime for R.
- In TIBCO Enterprise Runtime for R, the '[:print:]' character class matches several characters such as '\n' that open-source R does not.
Thus, regexpr('[[:print:]]', '\b\n\b')
does not match in open-source R, but does in TIBCO Enterprise Runtime for R.
- open-source R does not recognize the character classes
'[:d:]', '[:l:]', '[:s:]', '[:unicode:]', '[:u:]', '[:word:]' or '[:w:]',
and will generate an error if they are given in a pattern.
Thus, regexpr('[[:u:]]', 'aBc') matches in TIBCO Enterprise Runtime for R, but gives an error in open-source R.
- open-source R accepts a repeat quantifier of the form '{,max}', where the minimum value is taken to be zero.
TIBCO Enterprise Runtime for R generates an error if this form is used.
Actually, open-source R's implementation of this is a little strange.
If perl=TRUE, the form is accepted, but the expression never seems to match.
If perl=FALSE, the form is accepted, but is interpreted incorrectly:
'ab{,2}' acts like 'ab{0,3}', rather than 'ab{0,2}'.
Thus regexpr('ab{,2}', 'abc', perl=TRUE) does not match in open-source R,
and regexpr('ab{,2}', 'abbbc', perl=FALSE) matches the first four characters in the string.
Either one gives an error in TIBCO Enterprise Runtime for R.
- In open-source R with perl=TRUE,
the character class names ('[:alnum:]', '[:alpha:]', etc.)
do not match Unicode characters beyond the latin1 character set.
Thus, in open-source R regexpr('[[:alpha:]]', '1\u30A42', perl=TRUE) does not match,
but regexpr('[[:alpha:]]', '1\u30A42', perl=FALSE) does.
In TIBCO Enterprise Runtime for R, both of these will match.
- In general, there are many small differences between open-source R and TIBCO Enterprise Runtime for R
in exactly how the character class names ('[:alnum:]', '[:alpha:]', etc.)
classify particular Unicode characters beyond the latin1 character set.
- There are also differences between open-source R on Linux and on Windows in
how the character class names ('[:alnum:]', '[:alpha:]', etc.)
are interpreted.
When comparing TIBCO Enterprise Runtime for R to open-source R, we usually compare it to open-source R running on Linux.
- open-source R with perl=FALSE accepts and ignores the construct '(*)', whereas TIBCO Enterprise Runtime for R generates an error.
Thus, regexpr('a(*)', 'abc', perl=FALSE) matches in open-source R, and produces an error in TIBCO Enterprise Runtime for R.
regexpr('a(*)', 'abc', perl=TRUE) produces an error in both open-source R and TIBCO Enterprise Runtime for R.
- open-source R does not recognize collating elements.
Therefore, regexpr('[[.^.]]', 'xxx^yyy') matches in TIBCO Enterprise Runtime for R, and generates an error in open-source R.
- open-source R does not recognize equivalence classes.
Therefore, regexpr('[[=a=]]', 'xxx\u00C4yyy') matches in TIBCO Enterprise Runtime for R, and generates an error in open-source R.
- open-source R does not recognize the following escape sequences:
'\v', '\cX', '\0ddd', '\N{name}'.
- open-source R does not recognize the following abbreviations for character classes:
'\l', '\u', '\L', '\U'.
Open-source R also does not recognize the forms '\pX', '\p{XXX}', '\PX', '\P{XXX}'.
- open-source R does not accept the patterns '[[:<:]]' and '[[:>:]]' to match the start and end of a word.
Thus, regexpr('[[:<:]]a', 'bab aba') gives an error in open-source R, but matches in TIBCO Enterprise Runtime for R.
- open-source R does not accept '\<' and '\>' to match the start and end of a word when perl=TRUE.
Thus, in open-source R regexpr('\<a', 'bab aba', perl=FALSE) matches, but
regexpr('\<a', 'bab aba', perl=TRUE) does not match.
Both of these produce a match in TIBCO Enterprise Runtime for R.
- open-source R does not accept '\`' or '\'' to match the beginning or end of the buffer.
open-source R only accepts '\A' or '\z' when perl=TRUE.
open-source R only accepts '\Z' when perl=TRUE, but interprets it the same as '\z'.
See Also
Examples
x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w
# Extract the numbers:
as.numeric(substring(x, w, w+attr(w, "match.length")-1))
# Extract the capitalized words
w1 <- regexpr("[A-Z][a-z]*", x)
substring(x, w1, w1+attr(w1, "match.length")-1)
# Do the same with sub. Note that \\n in
# the replacement string refers to the n'th parenthesized
# subexpression in the pattern.
sub("(.*)([A-Z][a-z]*)(.*)", "\\2", x)
# get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s)
substring(s, r, r + attr(r, "match.length") - 1)
regmatches(s, r) # like above substring, but omits non-matched strings
# find the ATAT... sequences in two strings
gregexpr("(AT){2,}", c("GATATATCATCATATC", "ATATG"))
# perl capture groups and the perl (?:...) non-capturing group
# "\u20AC" is the unicode euro currency symbol
txt1 <- c("5 for $3.75", "\u20AC27 OBO")
m1 <- regexpr(
"(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
txt1, perl=TRUE)
# show the currency symbols
regmatches(txt1,
with(attributes(m1), structure(capture.start[,"currency"],
match.length=capture.length[,"currency"])))
# show the currency amounts
regmatches(txt1,
with(attributes(m1), structure(capture.start[,"amount"],
match.length=capture.length[,"amount"])))
txt2 <- c("$3.75 in US$, \u20AC3 in euro", "$30 for 10")
m2 <- gregexpr(
"(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
txt2, perl=TRUE)
regmatches(txt2, m2) # show entire matches
m2amount <- lapply(m2, function(m)with(attributes(m),
structure(capture.start[,"amount"], match.length=capture.length[,"amount"])))
regmatches(txt2, m2amount) # show just the amounts, no currency symbol