Training a Word2Vec model in Spark is relatively simple. We will pass in an RDD where each element is a sequence of terms. We can use the RDD of tokenized documents we have already created as input to the model.
import org.apache.spark.SparkContext

object Word2VecMllib {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "Word2Vector App")
    val path = "./data/20news-bydate-train/alt.atheism/*"
    val rdd = sc.wholeTextFiles(path)
    // keep only the document text, discarding the file paths
    val text = rdd.map { case (file, text) => text }
    // the newsgroup label is the parent directory name of each file
    val newsgroups = rdd.map { case (file, text) =>
      file.split("/").takeRight(2).head
    }
    val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
    val dim = math.pow(2, 18).toInt
    // tokenize each document into a sequence of terms
    val tokens = text.map(doc => TFIDFExtraction.tokenize(doc))
    ...
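To make the fitting step concrete, the following is a minimal, self-contained sketch of training a Word2Vec model on an RDD of token sequences with MLlib's `Word2Vec` class. The tiny in-memory corpus, the parameter values, and the object name `Word2VecSketch` are illustrative assumptions, not part of the example above; in practice you would pass in the `tokens` RDD built from the newsgroup documents.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "Word2Vec Sketch")

    // Illustrative stand-in for the tokens RDD: an RDD[Seq[String]]
    // where each element is one tokenized document.
    val corpus = sc.parallelize(Seq(
      Seq("spark", "mllib", "word2vec", "model"),
      Seq("spark", "rdd", "tokenized", "documents")
    ))

    val word2vec = new Word2Vec()
      .setVectorSize(100) // dimensionality of the learned word vectors
      .setMinCount(1)     // keep rare terms; needed for this tiny corpus
      .setSeed(42)        // fix the seed for reproducibility

    val model = word2vec.fit(corpus)

    // Query the terms closest to "spark" in the learned vector space.
    model.findSynonyms("spark", 3).foreach { case (term, cosine) =>
      println(s"$term: $cosine")
    }

    sc.stop()
  }
}
```

The `fit` method expects exactly the shape described in the text, an `RDD[Seq[String]]`, so the `tokens` RDD can be passed in directly once it is built.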