jDataLab

Formal text is a mixture of words and punctuation, while online conversational text also comes with symbols, emoticons and misspellings. Before performing analysis or building a learning model, data wrangling is a critical step that turns raw text into an appropriate format. Text can be considered a collection of documents, and a document can be parsed into strings. In text cleaning, we find strings, find and remove them, or find and replace them by writing search patterns as regular expressions (commonly abbreviated to regex or regexp).

Text analysis is a broad term for processing text and natural language documents to extract structure and meaningful descriptions. Most original documents have no explicit structure, and they may contain elements that carry little or no information, such as stop words, punctuation and white space characters. Before analysing textual data, clean the documents and parse them into a structured or semi-structured collection that supports computer-aided analysis.
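
A minimal sketch of such a cleaning pass in base R, using the gsub() and tolower() functions covered later in this post. The sample documents, the three-word stop list and the helper name clean_doc are made up for illustration; a real stop-word list would be much longer.

```r
# Hypothetical sample documents, not taken from any real data set.
docs <- c("  Hello, World!!  ", "The cat sat on the mat.", "R is   fun :)")

# A tiny illustrative stop-word list (an assumption for this sketch).
stop_words <- c("the", "is", "on")

clean_doc <- function(doc) {
  doc <- tolower(doc)                   # normalise case
  doc <- gsub("[[:punct:]]", "", doc)   # drop punctuation characters
  doc <- gsub("\\s+", " ", doc)         # collapse runs of white space
  doc <- trimws(doc)                    # strip leading and trailing space
  tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
  tokens[!tokens %in% stop_words]       # remove stop words
}

# One character vector of tokens per document.
cleaned <- lapply(docs, clean_doc)
print(cleaned)
```

The result is a list with one token vector per document, which is one simple form of the semi-structured collection described above.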

R Functions for Pattern Matching

1. Finding strings: grep

grep(pattern, string) returns, by default, an integer vector of indices. For each element of the character vector string that the regular expression pattern matches, the result contains that element's index.

To return the matching elements themselves instead of their indices, set value = TRUE.

Example:

    strings <- c("abcd", "cdab", "cabd", "c abd")

    grep("ab", strings)
    grep("ab", strings, value = FALSE)
    grep("ab", strings, value = TRUE)

Exercise:

    # Create a variable, messages. Assign four string values to the variable.
    messages <- c("apple", "pear", "banana", "orange")

    # Run grep to print values in messages if it contains a

2. Finding and replacing patterns: sub and gsub

gsub(pattern, replacement, string) returns the modified string after replacing every occurrence of pattern in string with replacement.

sub(pattern, replacement, string) replaces only the first occurrence of pattern in each element of string.

Example

Run the following snippet.

    fruits <- c("apple", "orange", "pineapple")

    # Specify a string pattern
    pattern <- "a"

    # Specify a replacement value
    replacement <- "A"

    # Run gsub to replace all 'a' occurrences with 'A'
    gsub(pattern, replacement, fruits)

    # Run sub to replace the first 'a' occurrence with 'A'
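
When every element contains only one match, sub and gsub return the same result, so the difference is easier to see on a string with repeated matches. A minimal sketch, using the made-up string "banana":

```r
fruit <- "banana"

# sub replaces only the first match in the string.
first_only <- sub("a", "A", fruit)

# gsub replaces every match in the string.
all_matches <- gsub("a", "A", fruit)

print(first_only)   # "bAnana"
print(all_matches)  # "bAnAnA"
```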


3. Finding and replacing patterns: stringr::str_replace and stringr::str_replace_all

The function str_replace_all(string, pattern, replacement) from the R package stringr returns the modified string after replacing every match of pattern in string with replacement. Note that the string comes first in the argument list, unlike in sub and gsub.

stringr::str_replace replaces only the first match in each string.

Example

    fruits <- c("apple", "orange", "pineapple")

    pattern <- "apple"

    replacement <- ""

    library(stringr)

    str_replace_all(fruits, pattern, replacement)

    # Write R code to replace the first occurrence of "apple"
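
The two stringr functions only differ when a pattern occurs more than once in the same string. A small sketch contrasting them on the pattern "p", assuming the stringr package is installed; the variable names first_only and all_matches are chosen for illustration:

```r
# Assumes stringr is installed, e.g. via install.packages("stringr").
library(stringr)

fruits <- c("apple", "orange", "pineapple")

# Only the first "p" in each element is replaced.
first_only <- str_replace(fruits, "p", "P")

# Every "p" in each element is replaced.
all_matches <- str_replace_all(fruits, "p", "P")

print(first_only)   # "aPple" "orange" "Pineapple"
print(all_matches)  # "aPPle" "orange" "PineaPPle"
```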

Case Conversion

Pattern matching in R is case sensitive by default. To ignore case in base R functions such as grep, sub and gsub, pass ignore.case = TRUE.

Alternatively, tolower() and toupper() functions can convert everything to lower or upper case.

Example:

    data <- c("World", "world", "WORLD")

    pattern <- "world"

    grep(pattern, data, value=TRUE)

    grep(pattern, data, value=TRUE, ignore.case = TRUE)
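
The tolower() route mentioned above can be sketched as follows: convert the data to lower case first, match against the converted copy, then use the resulting indices to recover the original values. The variable names lowered and matches are chosen for illustration.

```r
data <- c("World", "world", "WORLD")

# Convert a copy to lower case so a lower-case pattern matches every variant.
lowered <- tolower(data)
print(lowered)          # "world" "world" "world"

# Match against the lower-cased copy, then index back into the original data.
matches <- grep("world", lowered)
print(data[matches])    # "World" "world" "WORLD"

# toupper() works the same way in the other direction.
print(toupper(data))    # "WORLD" "WORLD" "WORLD"
```

Matching on a converted copy keeps the original capitalisation intact, which matters if the text is displayed or analysed further.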

Continue to Part 2.
