Word2Vec with Spark MLlib on the 20 Newsgroups dataset

Training a Word2Vec model in Spark is relatively simple. We will pass in an RDD where each element is a sequence of terms. We can use the RDD of tokenized documents we have already created as input to the model.

object Word2VecMllib {  def main(args: Array[String]) {  val sc = new SparkContext("local[2]", "Word2Vector App")  val path = "./data/20news-bydate-train/alt.atheism/*"  val rdd = sc.wholeTextFiles(path)  val text = rdd.map { case (file, text) => text }  val newsgroups = rdd.map { case (file, text) =>                 file.split("/").takeRight(2).head }  val newsgroupsMap =           newsgroups.distinct.collect().zipWithIndex.toMap  val dim = math.pow(2, 18).toInt var tokens = text.map(doc => TFIDFExtraction.tokenize(doc)) ...

Get Machine Learning with Spark - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.