strcapture
Capture Text Matched by Parts of Regular Expressions

Description

Capture the text matched by all the parenthesized capture expressions in a regular expression.

Usage

strcapture(pattern, x, proto, perl = FALSE, useBytes = FALSE)

Arguments

pattern A string representing a regular expression. It should contain parenthesized 'capture expressions'.
x A vector of character strings in which to search for matches to pattern.
proto A list or data.frame with an entry for each parenthesized capture expression in pattern.
perl A logical value. If TRUE, interpret the regular expression using the Perl Compatible Regular Expressions (PCRE) library. Otherwise, interpret it according to the more traditional TRE library.
useBytes A logical value. This argument is passed to regexpr or regexec.
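
As an illustrative sketch of the proto argument (the input strings and the column names Year and Month are made up for this example), a list with one entry per capture expression can supply both the names and the types of the result columns:

```r
# Hypothetical input: year-month strings.
x <- c("2020-01", "1999-12")

# Two capture expressions, so proto has two entries. The integer
# prototypes request that the captured text be converted to integer.
ym <- strcapture("([0-9]{4})-([0-9]{2})", x,
                 proto = list(Year = 0L, Month = 0L))
ym
```

A data.frame prototype such as data.frame(Year = 0L, Month = 0L) should behave the same way here.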

Details

A 'capture expression' in a regular expression is a pattern enclosed in parentheses. E.g., the regular expression ^.([[:alpha:]]+)([[:digit:]]).*$ contains two capture expressions: the first will capture the first sequence of alphabetic characters preceding a sequence of digits and the second will capture the digits after those alphabetic characters. Perl regular expressions have a richer syntax for capture expressions. E.g., a question mark and colon at the start of a parenthesized expression, as in (?:pattern)+, mean that the parentheses are used only for grouping, not for capturing, so this example will match one or more repeats of pattern, but the matched text will not be captured by this function.
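
The difference between grouping and capturing can be sketched as follows. The input strings and the column name Number are hypothetical, and the treatment of the non-matching element as NA is an assumption based on how strcapture behaves in GNU R:

```r
# Hypothetical input: repeated "ab" units, a dash, then a number.
x <- c("abab-12", "ab-7", "none")

# (?:ab)+ only groups the repeated unit; ([0-9]+) is the single
# capture expression, so proto needs exactly one entry.
res <- strcapture("^(?:ab)+-([0-9]+)$", x,
                  proto = data.frame(Number = 0L),
                  perl = TRUE)
res
```

Because the (?:...) syntax belongs to Perl regular expressions, the sketch sets perl = TRUE.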

Value

A data.frame with column number, names, and types taken from the proto argument and as many rows as there are entries in x. Each row contains the captured values from the corresponding element of x.
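
A minimal sketch of how the result follows proto and x. The key=value strings and the column names Key and Value are invented for illustration; stringsAsFactors = FALSE is set so that the character prototype is not turned into a factor by older versions of data.frame:

```r
# Hypothetical input; the last element does not match the pattern.
x <- c("width=10", "height=2.5", "no pair here")

# Zero-row prototype: only the column names and types matter.
proto <- data.frame(Key = character(), Value = numeric(),
                    stringsAsFactors = FALSE)

kv <- strcapture("^([[:alpha:]]+)=([0-9.]+)$", x, proto = proto)

# One row per element of x, one column per entry of proto.
str(kv)
```

In GNU R, the row for an element that does not match is filled with NA. That behavior is assumed here rather than stated by this help page.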

References

See the help file for regexpr for a description of regular expressions. Perl regular expressions are explained in http://perldoc.perl.org/perlretut.html.

See Also

regexpr, regexec.

Examples

strcapture("([A-Z][[:alpha:]]*\\.?) *([A-Z][[:alpha:]]+)[^[:digit:]]+([[:digit:]]*)",
    c("Avi Beckham is 10", "Ch. Danzig, 12", "Erin Fields (?)"),
    proto=data.frame(Given="", Surname="", Age=0))
# Allow the SSN parts to be separated by 2 dashes or by 2 spaces; ignore surrounding characters.
strcapture(perl = TRUE,
    "(?|(?:(\\d{3})-(\\d{2})-(\\d{4}))|(?:(\\d{3}) (\\d{2}) (\\d{4})))",
    proto=data.frame(Area="", Group="", Serial=""),
    c("SSN 014-40-4152", "023 42 2045", "121 22 4721 (*)"))
Package utils version 6.1.1-7