regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern | a character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments ignore.case, perl, fixed and useBytes. |
text | a vector of character strings in which to search. |
ignore.case | a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE. |
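perl | a logical value. If TRUE, the pattern is interpreted as a Perl-style regular expression, which allows constructs such as named capture groups and the (?:...) non-capturing group. |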
fixed | a logical value. If TRUE, the pattern is not treated as a regular expression; rather, it is treated as a literal sequence of characters. |
useBytes | a logical value. If TRUE then the pattern or text strings are treated as a simple sequence of bytes. If this is FALSE and any of the pattern or text strings have 'bytes' encoding (see Encoding), then useBytes is set to TRUE. |
As an example, the escape sequence '\n' matches a linefeed character. An undefined escape sequence just matches the character following the backslash, so '\\' matches the backslash character itself. Note that when typing a string containing a backslash, it must be doubled. Therefore, one would include the escape sequence '\n' within a string by typing "aaa\\nbbb".
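The doubling rule can be checked directly. The following is a minimal sketch; the sample strings and variable names are illustrative only and do not come from the original page.

```r
# The string literal "aaa\\nbbb" holds eight characters: aaa, a backslash, n, bbb
nchar("aaa\\nbbb")

# "a\\.b" reaches the regular expression engine as a\.b (a literal dot)
m <- regexpr("a\\.b", c("a.b", "axb"))
as.integer(m)    # 1 -1

# Matching one literal backslash takes four backslashes in the source text
bs <- regexpr("\\\\", "C:\\temp")
as.integer(bs)   # 3

# "\\n" reaches the engine as the escape sequence \n and matches a linefeed
nl <- regexpr("\\n", "aaa\nbbb")
as.integer(nl)   # 4
```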
'[:alnum:]' | (alphabetic or numeric digit) |
'[:alpha:]' | (alphabetic) |
'[:blank:]' | (any whitespace except for line separators) |
'[:cntrl:]' | (control characters) |
'[:digit:]' | (numeric digit) |
'[:graph:]' | (graphical) |
'[:lower:]' | (lower-case alphabetic) |
'[:print:]' | (printable) |
'[:punct:]' | (punctuation) |
'[:space:]' | (any whitespace) |
'[:upper:]' | (upper-case alphabetic) |
'[:xdigit:]' | (hexadecimal digit) |
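As a brief illustration, a class name from the table is placed inside a bracket expression, which is why the brackets appear doubled. The input strings below are made up for this sketch.

```r
x <- c("abc123def", "no digits", "  42")

# First run of digits in each string
d <- regexpr("[[:digit:]]+", x)
as.integer(d)                          # 4 -1 3
as.integer(attr(d, "match.length"))    # 3 -1 2

# A class can be combined with other characters in one bracket expression
id <- regexpr("[[:alpha:]_][[:alnum:]_]*", "9 lives_left")
as.integer(id)                         # 3
as.integer(attr(id, "match.length"))   # 10
```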
The bound {min,max} means between min and max repeats. If max is missing it is taken to be infinity. If there is no comma, then it matches exactly the given count of repeats. For example, '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.
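The three forms of a bound can be compared on one string. This is only a sketch; the string and the variable names are hypothetical.

```r
x <- "baaaad"

exact    <- regexpr("a{2}",   x)  # exactly two repeats
at_least <- regexpr("a{2,}",  x)  # two or more repeats
between  <- regexpr("a{1,3}", x)  # one to three repeats

as.integer(attr(exact,    "match.length"))  # 2
as.integer(attr(at_least, "match.length"))  # 4
as.integer(attr(between,  "match.length"))  # 3

# '+' and '{1,}' describe the same repetition
plus  <- regexpr("a+",    x)
bound <- regexpr("a{1,}", x)
identical(as.integer(plus), as.integer(bound))  # TRUE
```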
'\a' | (bell) |
'\e' | (escape) |
'\f' | (form feed) |
'\n' | (line feed) |
'\r' | (carriage return) |
'\t' | (tab) |
'\v' | (vertical tab) |
'\xdd' | (matches the character with hexadecimal code point 0xdd) |
'\x{dddd}' | (matches the character with hexadecimal code point 0xdddd) |
'\d' | (equivalent to [[:digit:]]) |
'\s' | (equivalent to [[:space:]]) |
'\w' | (equivalent to [[:alnum:]_], thus common word characters) |
'\D' | (equivalent to [^[:digit:]], thus everything but a digit) |
'\S' | (equivalent to [^[:space:]]) |
'\W' | (equivalent to [^[:alnum:]_]) |
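A short sketch of the shorthand classes follows. Because each backslash must be doubled inside a string literal, '\d' is typed as "\\d". The sample strings are assumptions made for illustration.

```r
x <- "order 66 now"

# \d+ : first run of digits
d <- regexpr("\\d+", x)
as.integer(d)                         # 7
as.integer(attr(d, "match.length"))   # 2

# \w+ : every run of word characters (letters, digits, underscore)
w <- gregexpr("\\w+", "hi there_you")[[1]]
as.integer(w)                         # 1 4
as.integer(attr(w, "match.length"))   # 2 9

# \D : first character that is not a digit
nd <- regexpr("\\D", "123abc")
as.integer(nd)                        # 4
```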
regexpr | returns a numeric vector, with one element for each element of text, giving the position in the character string of the first substring matching the regular expression. A value of minus one (-1) indicates that no match was found. An attribute, "match.length", is a numeric vector giving the length of the longest possible matching substring starting at that position, or minus one (-1) for no match. Note that a "match.length" value can be zero when matching a regular expression such as "^". |
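The return value described above can be inspected as follows. This is a minimal sketch with made-up input, showing the -1 convention, a zero-length match, and one way to extract the matched text.

```r
x <- c("apple", "banana", "cherry")

m <- regexpr("an", x)
as.integer(m)                          # -1 2 -1
as.integer(attr(m, "match.length"))    # -1 2 -1

# "^" matches at position 1 with a match length of zero
z <- regexpr("^", x)
as.integer(z)                          # 1 1 1
as.integer(attr(z, "match.length"))    # 0 0 0

# Extract the matched text only where a match was found
found <- m > 0
len   <- attr(m, "match.length")
hits  <- substring(x[found], m[found], m[found] + len[found] - 1)
hits                                   # "an"
```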
gregexpr | produces all of the matches for the regular expression in each string, rather than only the first one. It returns a list with one entry for each element of text. Each entry has the format of the output of regexpr, a numeric vector with the starting positions of each match within the string, with an attribute, "match.length", giving the length of each match. If there are no matches in a string, the output entry is the value minus one (-1) with a "match.length" value of minus one (-1). |
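The list structure returned by gregexpr can be explored with a small sketch such as the one below. The input strings and the match-counting helper are illustrative assumptions.

```r
g <- gregexpr("an", c("banana", "cherry"))

length(g)                                  # 2, one entry per input string

# All matches in "banana"
as.integer(g[[1]])                         # 2 4
as.integer(attr(g[[1]], "match.length"))   # 2 2

# No match in "cherry"
as.integer(g[[2]])                         # -1
as.integer(attr(g[[2]], "match.length"))   # -1

# Count matches per string, treating -1 as zero matches
n_matches <- sapply(g, function(m) sum(m > 0))
n_matches                                  # 2 0
```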
Attribute | Description |
capture.start | an integer matrix of match group start positions, with a column for each match group and a row for each match. |
capture.length | an integer matrix of match group lengths, with a column for each match group and a row for each match. |
capture.names | a character vector giving the names of the match groups. |
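The capture attributes can be read directly from a result obtained with perl=TRUE. The sketch below assumes a single input string and two hypothetical named groups, key and val.

```r
txt <- "set x=42"
m <- regexpr("(?<key>[a-z]+)=(?<val>[0-9]+)", txt, perl = TRUE)

as.integer(m)                  # 5, start of the whole match "x=42"

cs <- attr(m, "capture.start")   # one row per match, one column per group
cl <- attr(m, "capture.length")
attr(m, "capture.names")         # "key" "val"

as.integer(cs)                 # 5 7
as.integer(cl)                 # 1 2

# Extract the text captured by the group named "val"
st  <- as.integer(cs[, "val"])
val <- substring(txt, st, st + as.integer(cl[, "val"]) - 1)
val                            # "42"
```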
x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w
# Extract the numbers:
as.numeric(substring(x, w, w + attr(w, "match.length") - 1))
# Extract the capitalized words:
w1 <- regexpr("[A-Z][a-z]*", x)
substring(x, w1, w1 + attr(w1, "match.length") - 1)
# Do the same with sub. Note that \\n in the replacement string
# refers to the n'th parenthesized subexpression in the pattern.
sub("(.*)([A-Z][a-z]*)(.*)", "\\2", x)
# Get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s)
substring(s, r, r + attr(r, "match.length") - 1)
regmatches(s, r) # like the substring call above, but omits non-matched strings
# Find the ATAT... sequences in two strings
gregexpr("(AT){2,}", c("GATATATCATCATATC", "ATATG"))
# Perl capture groups and the Perl (?:...) non-capturing group.
# "\u20AC" is the Unicode euro currency symbol.
txt1 <- c("5 for $3.75", "\u20AC27 OBO")
m1 <- regexpr(
  "(?<currency>\\$|\u20AC)(?<amount>(?<units>\\d+)(?:\\.(?<cents>\\d\\d))?)",
  txt1, perl = TRUE)
# Show the currency symbols
regmatches(txt1, with(attributes(m1),
  structure(capture.start[, "currency"],
            match.length = capture.length[, "currency"])))
# Show the currency amounts
regmatches(txt1, with(attributes(m1),
  structure(capture.start[, "amount"],
            match.length = capture.length[, "amount"])))