strcapture
Capture Text Matched by Parts of Regular Expressions
Description
Capture the text matched by each parenthesized capture expression in a regular expression.
Usage
strcapture(pattern, x, proto, perl = FALSE, useBytes = FALSE)
Arguments
pattern |
A string representing a regular expression. It should contain parenthesized 'capture expressions'.
|
x |
A vector of character strings in which to search for matches to pattern.
|
proto |
A list or data.frame with an entry for each parenthesized capture expression in pattern.
|
perl |
A logical value. If TRUE, interpret the regular expression using the Perl-compatible regular expression (PCRE) library.
Otherwise interpret it using the more traditional TRE library.
|
useBytes |
A logical value. This argument is passed to regexpr or regexec.
|
Details
A 'capture expression' in a regular expression is a pattern enclosed in parentheses.
E.g., the regular expression ^.([[:alpha:]]+)([[:digit:]]).*$ contains two
capture expressions: the first captures the run of alphabetic characters that starts at the second
character of the string, and the second captures the digit immediately after those alphabetic characters.
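The two capture expressions described above can be tried out with a short sketch. It assumes the strcapture that ships in R's utils package (R 3.4.0 or later), which has the same signature as shown under Usage. The column names Letters and Digit and the input strings are made up for illustration.

```r
# Prototype: one entry per capture expression, in order.
# Letters and Digit are hypothetical column names.
proto <- data.frame(Letters = character(), Digit = integer(),
                    stringsAsFactors = FALSE)

res <- strcapture("^.([[:alpha:]]+)([[:digit:]]).*$",
                  c("#abc7xyz", "no digits here"),
                  proto)

# The first string matches: "#" is consumed by the leading dot,
# "abc" by the first capture expression, and "7" by the second.
# The second string does not match the pattern at all.
print(res)
```

In the utils implementation, an element of x that does not match the pattern yields a row of NA values rather than an error.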
Perl regular expressions have a richer syntax for capture expressions. E.g., a question mark and colon
at the start of a parenthesized expression, as in (?:pattern)+,
mean that the parentheses are used only for grouping, not for capturing, so this example
will match one or more repeats of pattern, but the matched text will not be captured by this function.
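A minimal sketch of the difference between grouping and capturing, again assuming the strcapture in R's utils package. The inner (?:ab) group only groups, so the pattern has two capture expressions and proto needs two entries. The column names Repeats and Number are hypothetical.

```r
# Two capturing groups: the outer ((?:ab)+) and ([0-9]+).
# The inner (?:ab) is non-capturing, so it needs no entry in proto.
proto <- data.frame(Repeats = character(), Number = integer(),
                    stringsAsFactors = FALSE)

res <- strcapture("((?:ab)+)([0-9]+)",
                  c("ababab42", "ab7"),
                  proto,
                  perl = TRUE)

print(res)
```

If the inner group were written as (ab) instead, the pattern would contain three capture expressions and would no longer agree with a two-entry proto.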
Value
A data.frame whose number of columns, column names, and column types are taken from the proto argument,
with as many rows as there are entries in x.
Each row contains the captured values from the corresponding element of
x.
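The shape of the return value can be checked directly. This is a sketch under the assumption that the strcapture in R's utils package is used. The Key and Value column names and the input strings are invented for the example.

```r
# Hypothetical prototype: a character column and a numeric column.
proto <- data.frame(Key = character(), Value = numeric(),
                    stringsAsFactors = FALSE)

x <- c("width=10", "height=25", "no match")
res <- strcapture("([[:alpha:]]+)=([[:digit:]]+)", x, proto)

# One row per element of x, and one column per entry of proto.
dim(res)

# Column names and types follow proto, not the pattern.
sapply(res, class)
```

Captured text is always extracted as character and then converted to the type of the corresponding proto entry, so a numeric entry in proto produces a numeric column.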
References
See the help file for
regexpr for a description of regular expressions.
Perl regular expressions are explained in
http://perldoc.perl.org/perlretut.html.
See Also
Examples
# Capture a given name, a surname, and an age (if present) from each string.
strcapture("([A-Z][[:alpha:]]*\\.?) *([A-Z][[:alpha:]]+)[^[:digit:]]+([[:digit:]]*)",
           c("Avi Beckham is 10", "Ch. Danzig, 12", "Erin Fields (?)"),
           proto = data.frame(Given = "", Surname = "", Age = 0))
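# Accept an SSN whose parts are separated by two dashes or by two spaces; ignore surrounding characters.
# The Perl branch-reset group (?|...) makes both alternatives fill the same three captures.
strcapture(perl = TRUE,
           "(?|(?:(\\d{3})-(\\d{2})-(\\d{4}))|(?:(\\d{3}) (\\d{2}) (\\d{4})))",
           proto = data.frame(Area = "", Group = "", Serial = ""),
           c("SSN 014-40-4152", "023 42 2045", "121 22 4721 (*)"))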