jDataLab

This post gives an example of text cleaning in R. For background on regular expressions and R's string manipulation commands, see A Beginner Guide to String Pattern Matching in R by Regular Expression.


Write R code to clean the following raw collection, which contains five elements (four strings and one NA).

  This is a   data science book.
  He is very happy today <U+1F600>
Do you carefully read every word on your statement and every notice your bank every sent you?
.........waiting........
NA

The cleaned collection should have punctuation, the stop words (this, is, a, he, you, your, and) and extra whitespace characters removed, with leading and trailing spaces trimmed from each string. For reference, printing the raw collection in R before cleaning produces the following output:

> collection
[1] "  This is a   data science book."                                     
[2] "  He is very happy today <U+1F600>"                               
[3] "Do you carefully read every word on your statement and every notice your bank every sent you?"
[4] ".........waiting........"             
[5] NA   
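
One possible way to carry out the cleaning described above is sketched below in base R. It is a minimal sketch and not necessarily the intended solution: the names stop_words, stop_pattern and cleaned are made up for illustration, and it assumes that stop words should be matched as whole words regardless of case and that NA should be left as NA.

```r
# Raw collection from the exercise: four strings and one missing value
collection <- c(
  "  This is a   data science book.",
  "  He is very happy today <U+1F600>",
  "Do you carefully read every word on your statement and every notice your bank every sent you?",
  ".........waiting........",
  NA
)

# Stop words listed in the exercise (hypothetical variable name)
stop_words <- c("this", "is", "a", "he", "you", "your", "and")

# Whole-word pattern: \b(this|is|a|he|you|your|and)\b
stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Step 1: remove punctuation characters
cleaned <- gsub("[[:punct:]]", "", collection)

# Step 2: remove stop words (assumption: case-insensitive, whole words only)
cleaned <- gsub(stop_pattern, "", cleaned, ignore.case = TRUE, perl = TRUE)

# Step 3: collapse runs of whitespace into a single space
cleaned <- gsub("\\s+", " ", cleaned)

# Step 4: trim leading and trailing whitespace (trimws() needs R >= 3.2.0)
cleaned <- trimws(cleaned)

print(cleaned)
```

Under these assumptions the first element becomes "data science book", the fourth becomes "waiting", and NA stays NA, since gsub() and trimws() pass missing values through. Note that stripping punctuation from the escaped emoji <U+1F600> leaves the letters and digits of the code point behind; if that residue is unwanted, the whole token could be removed in a separate step before Step 1.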