Before reading this post, read the previous part.

A regular expression is a pattern that describes a group of strings.

1. Escaping Characters

\ : Escapes metacharacters in a regular expression, i.e.

$ * + . ? [ ] ^ { } | ( ) \

As \ itself needs to be escaped in an R string, R requires a double backslash to escape these metacharacters, like \\?.
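To make the double backslash concrete, here is a small sketch. The sample strings are made up for illustration; fixed = TRUE is a standard grep() argument that skips regex interpretation entirely.

```r
strings <- c("a.b", "axb", "cost: $5", "why?")

# Unescaped, "." is a metacharacter and matches any character.
any_char <- grep("a.b", strings, value = TRUE)

# "\\." in R source reaches the regex engine as \. and matches a literal dot.
literal_dot <- grep("a\\.b", strings, value = TRUE)

# The same applies to "$", which otherwise anchors the end of the string.
literal_dollar <- grep("\\$", strings, value = TRUE)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed.
literal_question <- grep("?", strings, fixed = TRUE, value = TRUE)

print(any_char)          # "a.b" "axb"
print(literal_dot)       # "a.b"
print(literal_dollar)    # "cost: $5"
print(literal_question)  # "why?"
```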


2. Special Metacharacters

  • \\t : Tab

  • \\n : New line

  • \\v : Vertical tab

  • \\f : Form feed

  • \\r : Carriage return
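As a quick illustration of two of these, the sketch below replaces tabs and splits on newlines. The sample string is an assumption made up for the example.

```r
# One string holding two tab-separated rows, separated by a newline.
x <- "name\tvalue\nfoo\t42"

# "\\t" reaches the regex engine as \t, which matches a tab character.
with_pipes <- gsub("\\t", " | ", x)

# "\\n" matches the newline, so strsplit() yields one element per row.
rows <- strsplit(x, "\\n")[[1]]

print(with_pipes)    # "name | value\nfoo | 42"
print(length(rows))  # 2
```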


3. Quantifiers

Quantifiers specify how many times the preceding pattern should occur.

  • * : matches at least 0 times.
  • + : matches at least 1 time.
  • ? : matches at most 1 time.
  • {n} : matches exactly n times.
  • {n,} : matches at least n times.
  • {,m} : matches at most m times. Writing it as {0,m} is safer, since not every regex engine accepts an omitted lower bound.
  • {n,m} : matches between n and m times.

Exercise:

    strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

    # "c" zero or more times between "a" and "b"
    grep("ac*b", strings, value = TRUE)

    # "c" one or more times
    grep("ac+b", strings, value = TRUE)

    # "c" at most once
    grep("ac?b", strings, value = TRUE)

    # "c" exactly twice
    grep("ac{2}b", strings, value = TRUE)

    # "c" at least twice
    grep("ac{2,}b", strings, value = TRUE)

    # "c" two or three times
    grep("ac{2,3}b", strings, value = TRUE)

4. Position Anchors

  • ^ : Start of the string.
  • $ : End of the string.
  • \\b : Empty string at either edge of a word.
  • \\B : Empty string, not at the edge of a word.
  • \\< : Beginning of a word
  • \\> : End of a word

    strings <- c("abcd", "cdab", "cabd", "c abd", "*ab")

    # "ab" at the start of the string
    grep("^ab", strings, value = TRUE)

    # "ab" at the end of the string
    grep("ab$", strings, value = TRUE)

    # "ab" starting at a word edge
    grep("\\bab", strings, value = TRUE)

    # a word beginning with "a"
    grep("\\<a", strings, value = TRUE)

    # a word ending with "c"
    grep("c\\>", strings, value = TRUE)
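
Anchors matter most when replacing text. The following sketch, using a made-up sentence, compares a whole-word replacement with an unanchored one and shows ^ and $ used together.

```r
text <- "cat concat cat"

# \\b on both sides restricts the match to the whole word "cat".
whole_word <- gsub("\\bcat\\b", "dog", text)

# Without anchors, the "cat" inside "concat" is replaced as well.
anywhere <- gsub("cat", "dog", text)

# ^ and $ together require the entire string to match.
exact <- grepl("^cat$", c("cat", "cats", "a cat"))

print(whole_word)  # "dog concat dog"
print(anywhere)    # "dog condog dog"
print(exact)       # TRUE FALSE FALSE
```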

5. Characters and Operators

  • . : Any single character; with perl = TRUE this excludes the newline \n

  • [...] : a permitted character list. Use - inside the brackets to specify a range of characters.

  • [^...] : an excluded character list. Match any characters except those inside the square brackets.

  • | : an OR operator, matches patterns on either side of the |.

    strings <- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 12")

    # "ab" followed by any character
    grep("ab.", strings, value = TRUE)

    # "ab" followed by c, d or e
    grep("ab[c-e]", strings, value = TRUE)

    # "ab" followed by anything but c
    grep("ab[^c]", strings, value = TRUE)

    # a literal "^" followed by "ab"
    grep("\\^ab", strings, value = TRUE)

    # either "abc" or "abd"
    grep("abc|abd", strings, value = TRUE)


6. Character Classes

  • [[:digit:]] or \\d or [0-9] : digits 0 1 2 3 4 5 6 7 8 9
  • \\D or [^0-9] : non-digits

  • [[:lower:]] or [a-z] : lower-case letters

  • [[:upper:]] or [A-Z] : upper-case letters

  • [[:alpha:]] or [[:lower:][:upper:]] or [A-Za-z] : alphabetic characters (avoid [A-z], which in ASCII also covers [ \ ] ^ _ `)

  • [[:alnum:]] or [[:alpha:][:digit:]] or [A-Za-z0-9] : alphanumeric characters

  • \\w or [[:alnum:]_] or [A-Za-z0-9_] : word characters

  • \\W or [^A-Za-z0-9_] : non-word characters

  • [[:xdigit:]] or [0-9A-Fa-f] : hexadecimal digits (base 16) 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

  • [[:blank:]] : space and tab

  • [[:space:]] or \\s : space characters: tab, newline, vertical tab, form feed, carriage return, space

  • \\S : not space characters

  • [[:punct:]] : punctuation characters

    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

  • [[:graph:]] or [[:alnum:][:punct:]] : graphical (human readable) characters

  • [[:print:]] : printable characters, i.e. [[:alnum:]], [[:punct:]] and the space character

  • [[:cntrl:]] : control characters, like \n or \r etc.


Exercise:

    strings = c("Ab12", "ab12", "BA12", "A 12b", "B!", "d", "  ab")

    # starts with an upper-case letter (two equivalent spellings)
    grep("^[[:upper:]]", strings, value=TRUE)
    grep("^[A-Z]", strings, value=TRUE)

    # upper-case letter followed by any character
    grep("^[A-Z].", strings, value=TRUE)

    # upper-case letter followed by a space character
    grep("^[A-Z]\\s", strings, value=TRUE)

    # exactly one letter
    grep("^[[:alpha:]]$", strings, value=TRUE)
    grep("^[A-z]$", strings, value=TRUE)

    # starts with two letters
    grep("^[[:alpha:]]{2}", strings, value=TRUE)
    grep("^[A-z]{2}", strings, value=TRUE)

    # two or more space characters, then one or more
    grep("\\s{2,}", strings, value=TRUE)
    grep("\\s+", strings, value=TRUE)

    # any punctuation character
    grep("[[:punct:]]", strings, value=TRUE)
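
A range written as [A-z] deserves a closer look, because in ASCII a few punctuation characters sit between Z and a. The sketch below uses perl = TRUE so that the range is compared by character code regardless of locale; the test characters are made up for illustration.

```r
chars <- c("a", "Z", "_", "^", "1")

# [A-z] runs from code 65 to 122, which includes [ \ ] ^ _ and `.
wide_range <- grepl("^[A-z]$", chars, perl = TRUE)

# Two explicit ranges cover the letters only.
letters_only <- grepl("^[A-Za-z]$", chars, perl = TRUE)

# The POSIX class is the safest spelling.
posix_alpha <- grepl("^[[:alpha:]]$", chars)

print(wide_range)    # TRUE TRUE TRUE TRUE FALSE
print(letters_only)  # TRUE TRUE FALSE FALSE FALSE
print(posix_alpha)   # TRUE TRUE FALSE FALSE FALSE
```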

7. Grouping and String Replacement

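(...) groups part of a pattern. Each group can be referred back to with \\N, where N is the group's number (\\1 for the first group, \\2 for the second, and so on), both later in the pattern and in the replacement string. sub() replaces only the first match in each string, while gsub() replaces all matches.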

Exercise:

    strings <- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 1212", "cdab 12")
    pattern = "(ab) 12"
    # \\1 inserts whatever the first group matched
    replacement = "\\1 34"

    # replace the first match in each string
    sub(pattern, replacement, strings)

    # replace every match in each string
    gsub(pattern, replacement, strings)

    # stringr equivalent of gsub() (requires the stringr package)
    library(stringr)
    str_replace_all(strings, pattern, replacement)
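
Two more uses of groups, sketched with made-up inputs: reordering captured pieces in the replacement, and referring back to a group inside the pattern itself.

```r
# Two groups captured, then emitted in reverse order.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

# (.)\\1 matches any character immediately followed by itself.
doubled <- grepl("(.)\\1", c("book", "cat", "llama"))

# sub() stops after the first match, gsub() keeps going.
first_only <- sub("a", "A", "banana")
every_one <- gsub("a", "A", "banana")

print(swapped)     # "world hello"
print(doubled)     # TRUE FALSE TRUE
print(first_only)  # "bAnana"
print(every_one)   # "bAnAnA"
```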

8. Case Conversions

By default, pattern matching in R is case-sensitive. Pass ignore.case = TRUE to make it case-insensitive.

Alternatively, tolower() and toupper() functions can convert everything to lower or upper case.

Example:

    data <- c("World", "world", "WORLD")

    pattern <- "world"

    # case-sensitive by default
    grep(pattern, data, value=TRUE)

    # case-insensitive
    grep(pattern, data, value=TRUE, ignore.case = TRUE)
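
The tolower() route can be sketched the same way. Here the lower-cased copy is used only for matching, so the original spelling is kept in the result; the variable names are chosen for illustration.

```r
data <- c("World", "world", "WORLD", "Word")

# Normalise a copy to lower case, then match against that copy.
lowered <- tolower(data)
matched <- data[grepl("world", lowered)]

# toupper() works the same way in the other direction.
shouted <- toupper(data)

print(lowered)  # "world" "world" "world" "word"
print(matched)  # "World" "world" "WORLD"
print(shouted)  # "WORLD" "WORLD" "WORLD" "WORD"
```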

Continue to Part 3.

