# A Beginner's Guide to String Pattern Matching in R with Regular Expressions, Part 1-1

A regular expression is a pattern that describes a group of strings.

## 1. Escaping Characters

\ : Escapes a metacharacter in a regular expression. The metacharacters are:

$ * + . ? [ ] ^ { } | ( ) \

Because \ itself must be escaped inside an R string, R needs a double backslash to escape these metacharacters, as in \\?.

## 2. Special Metacharacters

• \\t : Tab
• \\n : New line
• \\v : Vertical tab
• \\f : Form feed
• \\r : Carriage return

## 3. Quantifiers

Quantifiers specify how many times the preceding pattern should occur.

• * : matches at least 0 times.
• + : matches at least 1 time.
• ? : matches at most 1 time.
• {n} : matches exactly n times.
• {n,} : matches at least n times.
• {,m} : matches at most m times ({0,m} is the more portable spelling).
• {n,m} : matches between n and m times.

Exercise:

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

grep("ac*b", strings, value = TRUE)      # zero or more "c"

grep("ac+b", strings, value = TRUE)      # one or more "c"

grep("ac?b", strings, value = TRUE)      # zero or one "c"

grep("ac{2}b", strings, value = TRUE)    # exactly two "c"

grep("ac{2,}b", strings, value = TRUE)   # two or more "c"

grep("ac{2,3}b", strings, value = TRUE)  # two or three "c"
```

## 4. Position Anchors

• ^ : Start of the string.
• $ : End of the string.
• \\b : Empty string at either edge of a word.
• \\B : Empty string, not at the edge of a word.
• \\< : Beginning of a word
• \\> : End of a word

```r
strings <- c("abcd", "cdab", "cabd", "c abd", "*ab")

grep("^ab", strings, value = TRUE)    # "ab" at the start of the string

grep("ab$", strings, value = TRUE)    # "ab" at the end of the string

grep("\\bab", strings, value = TRUE)  # "ab" starting at a word edge

grep("\\<a", strings, value = TRUE)   # "a" at the beginning of a word

grep("c\\>", strings, value = TRUE)   # "c" at the end of a word
```
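
As a further illustration of the anchors in this section, here is a minimal sketch that trims surrounding whitespace and replaces a whole word. The input string and the variable names are invented for this example; only base R's `gsub()` and `grepl()` are used.

```r
# Hypothetical input, made up for illustration
x <- "  cat concatenate cat  "

# ^ and $ pin the whitespace runs to the two ends of the string
trimmed <- gsub("^\\s+|\\s+$", "", x)

# \\b on both sides leaves the "cat" inside "concatenate" untouched
replaced <- gsub("\\bcat\\b", "dog", trimmed)

# \\B asks for a position that is not a word edge
inside <- grepl("\\Bcat\\B", c("concatenate", "cat", "catalog"))
```

Here `trimmed` should be "cat concatenate cat", `replaced` should be "dog concatenate dog", and `inside` should be TRUE only for "concatenate".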

## 5. Characters and Operators

• . : Any single character except \n

• [...] : a permitted character list. Use - inside the brackets to specify a range of characters.

• [^...] : an excluded character list. Match any characters except those inside the square brackets.

• | : an OR operator, matches the pattern on either side of the |.

```r
strings <- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 12")

grep("ab.", strings, value = TRUE)      # "ab" followed by any character

grep("ab[c-e]", strings, value = TRUE)  # "ab" followed by c, d or e

grep("ab[^c]", strings, value = TRUE)   # "ab" followed by anything but c

grep("\\^ab", strings, value = TRUE)    # a literal "^" followed by "ab"

grep("abc|abd", strings, value = TRUE)  # "abc" or "abd"
```
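
The escaping rule from section 1 matters most with the `.` and `|` operators, which are easy to type by accident when a literal character is meant. The sketch below contrasts the escaped and unescaped forms; the file names are invented for this example, and `fixed = TRUE` is a standard `grep()` argument that switches regular expression interpretation off.

```r
# Hypothetical file names, made up for illustration
files <- c("report.csv", "report_csv", "notes.txt", "a|b.csv", "tab.csv")

# Unescaped "." matches any character, so "report_csv" is matched too
loose <- grep("report.csv", files, value = TRUE)

# "\\." in R source is the regex \. and matches a literal dot only
strict <- grep("report\\.csv", files, value = TRUE)

# Inside [...] a dot is already literal
bracket <- grep("report[.]csv", files, value = TRUE)

# fixed = TRUE treats the whole pattern as plain text
literal_bar <- grep("a|b", files, value = TRUE, fixed = TRUE)

# Without fixed = TRUE, "|" means OR: any string containing "a" or "b"
either <- grep("a|b", files, value = TRUE)
```

When many metacharacters need escaping at once, `fixed = TRUE` is usually easier to read than a pattern full of double backslashes.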

## 6. Character Classes

• [[:digit:]] or \\d or [0-9] : digits 0 1 2 3 4 5 6 7 8 9
• \\D or [^0-9] : non-digits

• [[:lower:]] or [a-z] : lower-case letters

• [[:upper:]] or [A-Z] : upper-case letters

• [[:alpha:]] or [[:lower:][:upper:]] or [A-Za-z] : alphabetic characters (not [A-z], which in ASCII order also spans [ \ ] ^ _ and `)

• [[:alnum:]] or [[:alpha:][:digit:]] or [A-Za-z0-9] : alphanumeric characters

• \\w or [[:alnum:]_] or [A-Za-z0-9_] : word characters

• \\W or [^A-Za-z0-9_] : non-word characters

• [[:xdigit:]] or [0-9A-Fa-f] : hexadecimal digits (base 16) 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

• [[:blank:]] : space and tab

• [[:space:]] or \\s : space characters: tab, newline, vertical tab, form feed, carriage return, space

• \\S : not space characters

• [[:punct:]] : punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

• [[:graph:]] or [[:alnum:][:punct:]] : graphical (human readable) characters

• [[:print:]] : printable characters: [[:alnum:]], [[:punct:]] and space

• [[:cntrl:]] : control characters, such as \n or \r

Exercise:

```r
strings <- c("Ab12", "ab12", "BA12", "A 12b", "B!", "d", "  ab")

grep("^[[:upper:]]", strings, value = TRUE)     # starts with an upper-case letter

grep("^[A-Z]", strings, value = TRUE)           # the same, written as a range

grep("^[A-Z].", strings, value = TRUE)          # upper-case letter, then any character

grep("^[A-Z]\\s", strings, value = TRUE)        # upper-case letter, then whitespace

grep("^[[:alpha:]]$", strings, value = TRUE)    # exactly one letter

grep("^[A-z]$", strings, value = TRUE)          # [A-z] also spans [ \ ] ^ _ ` in ASCII order

grep("^[[:alpha:]]{2}", strings, value = TRUE)  # starts with two letters

grep("^[A-z]{2}", strings, value = TRUE)        # same caveat as above; [A-Za-z] is stricter

grep("\\s{2,}", strings, value = TRUE)          # two or more whitespace characters in a row

grep("\\s+", strings, value = TRUE)             # at least one whitespace character

grep("[[:punct:]]", strings, value = TRUE)      # contains punctuation
```
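
Character classes are also handy for extracting or cleaning text, not only for filtering with `grep()`. The following is a minimal sketch with invented inputs; it uses base R's `regmatches()`, `gregexpr()` and `gsub()`. The last two lines pass `perl = TRUE` purely as an assumption of this example, so that the `[A-z]` range is evaluated in code point order.

```r
# Hypothetical inputs, made up for illustration
x <- c("Order 66 shipped", "no digits here", "a1b22")

# Pull out every run of digits; gregexpr() finds all matches per string
digits <- regmatches(x, gregexpr("[[:digit:]]+", x))

# Strip punctuation with a POSIX class
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")

# Collapse any run of whitespace (spaces, tabs, newlines) to one space
squeezed <- gsub("[[:space:]]+", " ", "a \t b\n\nc")

# [A-z] is wider than it looks: the range also covers "_"
wide   <- grepl("^[A-z]+$", "snake_case", perl = TRUE)
narrow <- grepl("^[[:alpha:]]+$", "snake_case", perl = TRUE)
```

`digits` is a list with one character vector per input string, which is empty where nothing matched. `wide` comes out TRUE and `narrow` FALSE, which is why `[[:alpha:]]` or `[A-Za-z]` is the safer way to ask for letters only.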

## 7. Grouping and String Replacement

(...) groups part of a pattern. The text matched by the Nth group can be referred back to as \\N (\\1, \\2, and so on), both later in the same pattern and in a replacement string.

Exercise:

```r
strings <- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 1212", "cdab 12")
pattern <- "(ab) 12"
replacement <- "\\1 34"   # \\1 re-inserts the text captured by (ab)

sub(pattern, replacement, strings)    # replaces the first match in each string

gsub(pattern, replacement, strings)   # replaces every match in each string

library(stringr)                      # needs the stringr package installed
str_replace_all(strings, pattern, replacement)
```
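
To make the role of back-references more concrete, here is a small sketch with invented inputs. It shows groups being reordered in a replacement, the difference between `sub()` and `gsub()`, and a back-reference used inside the pattern itself. Only base R functions are used.

```r
# Hypothetical input, made up for illustration
name <- "Lovelace, Ada"

# Two groups are captured, then emitted in reverse order
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", name)

# sub() rewrites only the first match, gsub() rewrites every match
first_only <- sub("(a)(b)", "\\2\\1", "abab")
every      <- gsub("(a)(b)", "\\2\\1", "abab")

# Inside the pattern, \\1 must repeat whatever group 1 matched,
# so this keeps the words that contain a doubled letter
doubled <- grep("([[:alpha:]])\\1", c("book", "bok", "llama"), value = TRUE)
```

`swapped` should be "Ada Lovelace", `first_only` "baab", `every` "baba", and `doubled` should keep "book" and "llama".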

## 8. Case Conversions

By default, pattern matching is case sensitive in R. Turn it off with ignore.case = TRUE.

Alternatively, the tolower() and toupper() functions convert a string to lower or upper case before matching.

Example:

```r
data <- c("World", "world", "WORLD")

pattern <- "world"

grep(pattern, data, value = TRUE)                      # case-sensitive

grep(pattern, data, value = TRUE, ignore.case = TRUE)  # case-insensitive
```
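
The two approaches described in this section can be compared side by side. This is a minimal sketch with invented inputs, using only base R's `grepl()`, `tolower()` and `toupper()`.

```r
# Hypothetical inputs, made up for illustration
greetings <- c("Hello World", "hello world", "HELLO WORLD", "hi there")

# Option 1: let the matcher ignore case
by_flag <- grepl("world", greetings, ignore.case = TRUE)

# Option 2: normalise the text first, then match case-sensitively
by_lower <- grepl("world", tolower(greetings))

# Both routes flag the same strings; greetings itself is not modified
same <- identical(by_flag, by_lower)

# toupper() is the mirror image of tolower()
shout <- toupper("hello world")
```

Normalising with `tolower()` can be convenient when the same text is matched against several patterns, while `ignore.case = TRUE` keeps the change local to a single call.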

Continue to Part 3.