Normalizing a string

First, the word is to be normalized into NFKC form. In Chapter 2, Linear Regression–House Price Prediction, this was introduced, but I then mentioned that LingSpam basically provides normalized datasets. In real-world data, which Twitter is, data is often dirty. Hence, we need to be able to compare them on an apples-to-apples basis.

To show this, let's write a side program:

 package mainimport ( "fmt" "unicode""golang.org/x/text/transform" "golang.org/x/text/unicode/norm" )func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) }func main() { str1 := "cafe" str2 := "café" str3 := "cafe\u0301" fmt.Println(str1 == str2) fmt.Println(str2 == str3) t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFKC) str1a, ...

Get Go Machine Learning Projects now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.