First, the word is normalized into NFKC form. Normalization was introduced in Chapter 2, Linear Regression–House Price Prediction, where I also mentioned that LingSpam provides datasets that are already normalized. Real-world data, such as data from Twitter, is often dirty, so we need to normalize words before we can compare them on an apples-to-apples basis.
To show this, let's write a side program:
package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) }

func main() {
	str1 := "cafe"
	str2 := "café"
	str3 := "cafe\u0301"
	fmt.Println(str1 == str2)
	fmt.Println(str2 == str3)
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFKC)
	str1a, ...
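To see why the string comparisons in the program above can fail, it may help to look at the code points directly. The following is a minimal sketch that uses only the standard library. The helper names `stripMarks` and `codePoints` are hypothetical and not part of the original program, and the sketch performs no Unicode normalization itself. It assumes that `str2` holds the precomposed é (U+00E9), which is written here as an escape to make that explicit.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks (hypothetical helper) drops every nonspacing mark
// (Unicode category Mn) from s. It does not decompose precomposed
// characters, so U+00E9 passes through unchanged.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

// codePoints (hypothetical helper) renders s as a space-separated
// list of code points such as U+0063.
func codePoints(s string) string {
	parts := make([]string, 0, len(s))
	for _, r := range s {
		parts = append(parts, fmt.Sprintf("%U", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	str1 := "cafe"
	str2 := "caf\u00e9"  // assumption: precomposed é, a single code point
	str3 := "cafe\u0301" // e followed by a combining acute accent

	fmt.Println(codePoints(str2)) // U+0063 U+0061 U+0066 U+00E9
	fmt.Println(codePoints(str3)) // U+0063 U+0061 U+0066 U+0065 U+0301
	fmt.Println(str2 == str3)     // false: same rendering, different code points

	fmt.Println(stripMarks(str3) == str1) // true: the combining mark is removed
	fmt.Println(stripMarks(str2) == str1) // false: é was never decomposed
}
```

Stripping marks alone handles the decomposed form but not the precomposed one, which is why the chain in the program above applies `norm.NFD` before removing the `Mn` runes.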