Cleaning ASCII text files using Regular Expressions

ASCII text files can contain unnecessary units of characters that eventually are introduced during a conversion process, such as PDF-to-text conversion or HTML-to-text conversion. These characters are often seen as noise because they are one of the major roadblocks for data processing. This recipe cleans several noises from ASCII text data using Regular Expressions.

How to do it...

  1. Create a method named cleanText(String) that takes the text to be cleaned in the String format:
            public String cleanText(String text){ 
    
  2. Add the following lines in your method, return the cleaned text, and close the method. The first line strips off non-ASCII characters. The line next to it replaces continuous white spaces ...

Get Java Data Science Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.