The "Extra" in Extra Trees comes from the idea that it is extremely randomized. While the tree splits in a Random Forest classifier are effectively deterministic (each node searches for the best threshold over its candidate features), the split thresholds in an Extra Trees classifier are drawn at random. This changes the bias-variance trade-off, which matters for high-dimensional data such as ours (where every word is effectively a dimension, or feature). The following snippet shows the classifier in action:
from sklearn.ensemble import ExtraTreesClassifier as XTC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

xtc_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', XTC())])
xtc_clf.fit(X=X_train, y=y_train)
xtc_acc, xtc_predictions = imdb_acc(xtc_clf)
xtc_acc  # 0.75024
As you can see, this change works in our favor here, but this is ...
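To see the bias-variance effect outside our IMDB pipeline, here is a minimal sketch on synthetic high-dimensional data (the dataset and all parameters are illustrative, not from the text): Random Forest optimizes each split threshold, while Extra Trees samples thresholds at random, which tends to reduce variance at the cost of a little bias.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for high-dimensional data: many features, few informative.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=20, random_state=0)

# Same ensemble size for both; only the split-selection strategy differs.
for Clf in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Clf(n_estimators=100, random_state=0),
                             X, y, cv=5)
    print(Clf.__name__, round(scores.mean(), 3))
```

Whether the randomized splits help depends on the dataset; as in our case, it is worth benchmarking both on a held-out set rather than assuming one will win.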