agrep
Approximate String Matching (Fuzzy Matching)

Description

Searches for approximate matches to a text pattern, specified as a character string or a regular expression.

Usage

agrep(pattern, x, max.distance = 0.1, costs = NULL,
      ignore.case = FALSE, value = FALSE, fixed = TRUE, useBytes = FALSE) 
agrepl(pattern, x, max.distance = 0.1, costs = NULL,
       ignore.case = FALSE, fixed = TRUE, useBytes = FALSE) 

Arguments

pattern a non-empty character string specifying the pattern to search for. It is interpreted as a literal sequence of characters if fixed is TRUE (the default). Otherwise, it is interpreted as a regular expression.
x a vector of character strings in which to search.
max.distance the maximum allowed match distances. It can be an integer, a fraction, or a list with the following five match distance components.
  Match distance component   Description
  "cost"                     maximum allowed costs
  "insertions"               maximum allowed insertions
  "deletions"                maximum allowed deletions
  "substitutions"            maximum allowed substitutions
  "all"                      total errors for all (insertions, deletions, and substitutions)
See the details section for more information.

costs the integer cost of each inserted, deleted, or substituted character. It can be a numeric vector or a list with partially-matched names "insertions", "deletions", and "substitutions". Unspecified costs default to 1, so the default NULL value means that all three costs ("insertions", "deletions", and "substitutions") are 1.
ignore.case a logical value. If TRUE, uppercase and lowercase characters are considered equivalent when matching. The default is FALSE.
value a logical value. If TRUE, agrep returns the matched elements of x. If FALSE (the default), agrep returns the indices of the matched elements of x.
fixed a logical value. If TRUE (the default), the pattern is interpreted as a literal sequence of characters. If FALSE, the pattern is interpreted as a regular expression.
useBytes a logical value.
  • If TRUE, the x and pattern strings are interpreted as a simple sequence of bytes.
  • If FALSE (the default), the x and pattern strings are interpreted as a simple sequence of characters.
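
The following sketch shows the forms that max.distance and costs can take, as described in the argument list above. The vector x and the variable names are illustrative only, and no particular result is claimed. The comment on fractions assumes the behavior of R's agrep, where a fraction is taken relative to the pattern length.

```r
# Illustrative input; not taken from a real data set.
x <- c("a live b", "a life b")

# max.distance as a whole number: an absolute limit on the match distance.
by_count <- agrep("lifye", x, max.distance = 2)

# max.distance as a fraction (assumed to be relative to the pattern length,
# as in R's agrep).
by_fraction <- agrep("lifye", x, max.distance = 0.1)

# max.distance as a list: limit individual kinds of edits.
by_list <- agrep("lifye", x,
                 max.distance = list(insertions = 1, substitutions = 0))

# costs as a list: substitutions cost 2, while insertions and deletions
# keep the default cost of 1. The total cost is capped at 1.
by_cost <- agrep("life", x,
                 max.distance = list(cost = 1),
                 costs = list(substitutions = 2))
```

Each call returns the indices of the matching elements of x because value is left at its default of FALSE.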

Details

Approximate pattern matching allows matches to be close to the searched pattern under some measure of closeness.
agrep() uses TRE, a portable POSIX-compliant pattern-matching library that supports approximate (fuzzy) matching. If fixed is FALSE, the pattern is interpreted as a POSIX-extended regular expression, like grep when called with perl=FALSE.
agrep uses the edit-distance measure (also known as the Levenshtein distance), where characters can be inserted, deleted, or substituted in the searched text to get an exact match. Each insertion, deletion, or substitution adds to the distance, or cost, of the match. The cost of each insertion, deletion, or substitution can be set with the costs argument. agrep reports matches whose total cost does not exceed the threshold specified by max.distance.
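
To make the cost model concrete, here is a minimal sketch of a weighted edit distance computed by dynamic programming. The function edit_cost is a hypothetical helper written for this illustration; it is not part of agrep and does not use TRE. It also compares two whole strings, whereas agrep looks for the pattern approximately anywhere inside each element of x.

```r
# Hypothetical helper: cheapest total cost of turning string a into string b
# using insertions, deletions, and substitutions with the given costs.
edit_cost <- function(a, b,
                      costs = c(insertions = 1, deletions = 1,
                                substitutions = 1)) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  n <- length(s)
  m <- length(t)
  # d[i + 1, j + 1] is the cheapest cost of turning the first i characters
  # of a into the first j characters of b.
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * costs[["deletions"]]
  d[1, ] <- (0:m) * costs[["insertions"]]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub <- if (s[i] == t[j]) 0 else costs[["substitutions"]]
      d[i + 1, j + 1] <- min(d[i, j + 1] + costs[["deletions"]],
                             d[i + 1, j] + costs[["insertions"]],
                             d[i, j] + sub)
    }
  }
  d[n + 1, m + 1]
}

edit_cost("life", "live")    # one substitution (f -> v)
edit_cost("lifye", "life")   # one deletion (y)
```

Raising the substitution cost above 2 makes a deletion followed by an insertion the cheaper way to replace a character, which is why the three costs interact with the limits set in max.distance.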
Match distance components
Value
If value=FALSE, agrep returns a numeric vector indicating which elements of x matched pattern. (The return value numeric(0) indicates that there are no matches.) If value=TRUE, agrep returns the matching elements of x. (If the matching elements are not character data, they are converted to character data.)
agrepl returns a logical vector indicating which elements of x matched pattern. This return value can be used as a subscript to retrieve the matching elements of x.
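
As a brief sketch of the subscripting described above, the logical vector from agrepl can select the matching elements directly. The vector x is illustrative only; for a plain character vector the selection is expected to contain the same elements as agrep with value=TRUE.

```r
# Illustrative input; not taken from a real data set.
x <- c("a live b", "a life b", "unrelated")

hits <- agrepl("life", x)                        # one logical value per element of x
by_mask <- x[hits]                               # subscript with the logical vector
by_value <- agrep("life", x, value = TRUE)       # ask agrep for the elements directly
```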
References
http://laurikari.net/tre, the official TRE website.
Also see http://en.wikipedia.org/wiki/TRE_(computing)
See Also
grep.
Examples
agrep("life", "a live b")
agrep("life", c("a live b", "a life b"))
agrep("life", c("a live b", "a life b"), max = list(sub = 0))
agrep("lifye", c("a live b", "a life b"), max = list(all = 0.1))
agrep("lifye", c("a live b", "a life b"), max = list(all = 2))

agrep("liyfe", c("a live", "A", "a Live"), max = 2)
agrep("liyfe", c("a live", "A", "a Live"), max = list(all = 0.2))
agrep("liyfe", c("a live", "A", "a LiVE"), max = 2, value = TRUE)
agrep("liyfe", c("a live", "A", "a LIVE"), max = 2, ignore.case = TRUE)

sName <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
           "Colorado", "Louisiana", "Delaware", "Florida", "Georgia")
agrep("ia$", sName, value=TRUE, fixed = FALSE)
## pattern is interpreted as a regular expression.
## returns all names that end in "ia" approximately; it is not equivalent to
## grep(), which returns the states that end in "ia" exactly!

agrep("ia$", sName, value=TRUE, fixed = FALSE, max=0)
## with max=0, returns the same values as grep

agrep("ia$", sName, value=TRUE)
## pattern is interpreted as a literal character string, so the result is also
## different from the example above.

Package base version 6.1.1-7