Sylvère Richard thinks this is interesting:
W+
From "Improving our tokenization" in Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua. Publisher: Packt Publishing. Released: April 2017.
Note: the correct regexp is \W+
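To illustrate why the missing backslash matters, here is a minimal sketch in Python. It is not the book's code (the book's examples target Spark), and the `tokenize` helper and the sample sentence are hypothetical names chosen only for this illustration. The pattern `\W+` matches runs of non-word characters, whereas `W+` matches only the literal letter "W".

```python
import re


def tokenize(text, pattern=r"\W+"):
    """Split text on the given regex, lowercase the pieces, drop empty strings.

    Hypothetical helper for illustration; not taken from the book.
    """
    return [tok.lower() for tok in re.split(pattern, text) if tok]


sample = "Hello, World! Spark-based tokenization."

# Correct pattern: splits on runs of non-word characters (punctuation, spaces).
good = tokenize(sample)

# Pattern as printed in the highlight: splits only on the literal letter 'W'.
bad = tokenize(sample, r"W+")

print(good)
print(bad)
```

With the corrected pattern, the sample splits into individual words; with `W+`, it is cut only at the capital "W" in "World", leaving punctuation and spaces inside the resulting pieces.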