Formal text is a mixture of words and punctuation, while online conversational text also contains symbols, emoticons and misspellings. Before performing analysis or building a learning model, data wrangling is a critical step that prepares raw text data in an appropriate format. Text can be considered a collection of documents, and a document can be parsed into strings. In text cleaning, search patterns are defined as regular expressions (shortened to regex or regexp) and used to “find and remove” or “find and replace” strings.

Text analysis is a broad term for processing text and natural language documents to extract structure and meaningful descriptions. Original text documents are unstructured and may contain elements that carry little information, such as stop words, punctuation and white space characters. Before starting a text analysis project, it is therefore necessary to clean the documents and parse them into a structured or semi-structured collection that enables computer-aided analysis.
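As a rough illustration of the cleaning and parsing steps described above, here is a minimal sketch in base R. The two sample documents and the tiny stop word list are made up for this example; a real project would likely use a fuller stop word list.

```r
# Two hypothetical raw documents (illustrative data, not from the article)
docs <- c("  The cat sat on the mat!!  ", "Is this   a TEST, or not?")

# A tiny, hand-picked stop word list (an assumption for this sketch)
stop_words <- c("the", "a", "is", "on", "or", "this")

clean <- tolower(docs)                    # normalize case
clean <- gsub("[[:punct:]]", "", clean)   # find and remove punctuation
clean <- gsub("\\s+", " ", clean)         # collapse runs of white space
clean <- trimws(clean)                    # drop leading/trailing white space

# Parse each document into word tokens, then drop the stop words
tokens <- strsplit(clean, " ", fixed = TRUE)
tokens <- lapply(tokens, function(words) words[!words %in% stop_words])

print(tokens)
```

The result is a list with one character vector of tokens per document, which is one possible semi-structured form to hand to later analysis steps.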

R Functions for Pattern Matching

1. Finding strings: grep

grep(pattern, string) returns, by default, a vector of indices. For each element of the vector string that the regular expression pattern matches, it returns that element’s index.

To return the matching element values instead of their indices, set value = TRUE.

Example:

    # Create a vector variable and assign four string values to the variable
    strings <- c("abcd", "cdab", "cabd", "c abd")

    # Find string values containing 'ab', return indices (the default)
    grep("ab", strings)

    # Same call with the default written out explicitly, return indices
    grep("ab", strings, value = FALSE)

    # Find string values containing 'ab', return values
    grep("ab", strings, value = TRUE)
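Because the pattern is a regular expression rather than a plain substring, anchors and other metacharacters can narrow a match. The sketch below is one way to try this out with the same four strings; the variable names are chosen only for illustration.

```r
strings <- c("abcd", "cdab", "cabd", "c abd")

# '^' anchors the match at the start of the string
starts_with_ab <- grep("^ab", strings, value = TRUE)

# '$' anchors the match at the end of the string
ends_with_ab <- grep("ab$", strings, value = TRUE)

# fixed = TRUE treats the pattern as a literal string, so '.' is not a wildcard
has_dot <- grep(".", c("a.b", "ab"), fixed = TRUE)

# grepl returns one TRUE/FALSE per element instead of indices
contains_space <- grepl(" ", strings)

print(starts_with_ab)
print(ends_with_ab)
print(has_dot)
print(contains_space)
```

grepl is often more convenient than grep when the result is used to filter a vector or a data frame column, since its output has the same length as the input.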

Exercise:

    # Create a variable, messages. Assign four string values to the variable.
    messages <- c("apple", "pear", "banana", "orange")

    # Run grep to print values in messages if it contains a 'g'

Solution:

    grep("g", messages, value = TRUE)

2. Finding and replacing patterns: sub and gsub

gsub(pattern, replacement, string) returns the modified string after replacing every occurrence of pattern in string with replacement.

sub(pattern, replacement, string) replaces only the first occurrence of pattern in each element of string.

Example:

Run the following snippet.

    fruits <- c("apple", "orange", "pineapple")

    # Specify a string pattern
    pattern <- "a"

    # Specify a replacement value
    replacement <- "A"

    # Run gsub to replace all 'a' occurrences with 'A'
    gsub(pattern, replacement, fruits)

    # Run sub to replace the first 'a' occurrence with 'A'
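To make the difference between the two functions concrete, here is a small side-by-side sketch on a word with several matches. The sample words are chosen only for illustration.

```r
word <- "banana"

all_replaced   <- gsub("a", "A", word)  # replaces every match
first_replaced <- sub("a", "A", word)   # replaces the first match only

# Both functions are vectorized: sub replaces the first match in EACH element
words <- c("banana", "papaya")
first_each <- sub("a", "A", words)

print(all_replaced)
print(first_replaced)
print(first_each)
```

Note that "first" means first within each element, not first across the whole vector.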


3. Finding and replacing patterns: stringr::str_replace and stringr::str_replace_all

The function str_replace_all(string, pattern, replacement) from the R package stringr returns the modified string after replacing every match of pattern in string with replacement. Unlike gsub, it takes the string as its first argument.

stringr::str_replace replaces only the first match in each string.

Example:

    fruits <- c("apple", "orange", "pineapple")

    pattern <- "apple"

    replacement <- ""

    library(stringr)

    str_replace_all(fruits, pattern, replacement)

    # Write a statement to replace the first occurrence of "@"
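The following sketch compares the two stringr functions with their base R counterparts. It assumes the stringr package is installed; the requireNamespace guard is added here only so the snippet skips quietly when it is not.

```r
if (requireNamespace("stringr", quietly = TRUE)) {
  fruits <- c("apple", "orange", "pineapple")

  # stringr takes the string first, then the pattern, then the replacement
  first_p <- stringr::str_replace(fruits, "p", "P")      # first match per string
  all_p   <- stringr::str_replace_all(fruits, "p", "P")  # every match

  # Base R takes the pattern first and the string last
  base_first <- sub("p", "P", fruits)
  base_all   <- gsub("p", "P", fruits)

  print(first_p)
  print(all_p)
  print(identical(first_p, base_first))
  print(identical(all_p, base_all))
}
```

Putting the string first makes the stringr functions easier to chain with a pipe, which may be a reason to prefer them in longer cleaning pipelines.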

Continue to Part 2.