
title: Natural Language Processing Lab
type: lab
duration: "1:25"
creator:
    name: Francesco Mosconi
    city: SF

Natural Language Processing Lab


In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which is available through scikit-learn. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular dataset for experiments in text applications of machine learning, such as text classification and text clustering.

We will restrict the analysis to 4 of these groups and attempt to predict which group each document belongs to from its text.
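
As a starting point, loading a four-group subset might look like the sketch below. The four group names in `CATEGORIES` and the helper name `load_four_groups` are assumptions made for illustration; the lab text does not say which four groups it uses. The helper wraps scikit-learn's `fetch_20newsgroups`, which downloads the data on first use, so nothing is fetched until the function is called.

```python
from sklearn.datasets import fetch_20newsgroups

# Assumption: these four group names are illustrative only.
# Any four of the 20 newsgroup names could be substituted.
CATEGORIES = [
    "alt.atheism",
    "comp.graphics",
    "sci.space",
    "talk.religion.misc",
]


def load_four_groups(subset="train"):
    """Return the chosen newsgroups as a scikit-learn Bunch.

    The first call downloads the dataset, so it needs network access.
    """
    return fetch_20newsgroups(
        subset=subset,            # "train", "test", or "all"
        categories=CATEGORIES,    # keep only the four groups above
        shuffle=True,
        random_state=42,
        # Strip metadata that can make the task artificially easy.
        remove=("headers", "footers", "quotes"),
    )
```

Calling `data = load_four_groups()` gives access to the raw documents in `data.data`, the integer labels in `data.target`, and the group names in `data.target_names`.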

This is a typical example of text classification, where a data scientist's task is to train a model that assigns text to predefined categories. Other examples include sentiment analysis and topic assignment.



  1. Data inspection: let's first explore the data and see how it is organized
  2. Bag of Words model: let's build a simple model
  3. Hashing and TF-IDF: let's beef up our model with some more powerful techniques
  4. Classifier comparison: what's our best model?
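
The modeling steps above can be sketched on a tiny hand-written corpus before moving to the real data. This is a minimal illustration, not the lab's solution: the four toy documents, their labels, and the choice of `LogisticRegression` as the classifier are all assumptions. It shows how the three vectorizers (`CountVectorizer` for bag of words, `HashingVectorizer`, and `TfidfVectorizer`) can be swapped inside the same pipeline.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumption: a toy corpus standing in for the newsgroup posts.
train_docs = [
    "The shuttle launch reached orbit around the moon",
    "NASA plans a new rocket launch to orbit Mars",
    "Rendering the image with a 3D graphics card",
    "The polygon rendering algorithm draws each image pixel",
]
train_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

# One vectorizer per technique named in the list above.
vectorizers = {
    "bag of words": CountVectorizer(),
    # Hashing needs no vocabulary; n_features fixes the output width.
    "hashing": HashingVectorizer(n_features=2**10),
    "tf-idf": TfidfVectorizer(),
}

# Fit the same classifier on top of each representation.
fitted = {}
for name, vectorizer in vectorizers.items():
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(train_docs, train_labels)
    fitted[name] = model

# Each fitted pipeline accepts raw text directly.
for name, model in fitted.items():
    print(name, model.predict(["a rocket in orbit"]))
```

On the real dataset, the same loop could score each pipeline on a held-out test split, which is one way to approach the classifier comparison step.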


  • Other classifiers: how do other classifiers perform?
  • Explore NLTK in more detail
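
One possible way to combine the two items above is to plug an NLTK stemmer into scikit-learn's vectorizer and try a couple of alternative classifiers. The sketch below is only an illustration under stated assumptions: the helper `stem_tokens`, the toy documents, and the choice of `MultinomialNB` and `LinearSVC` are not taken from the lab. It uses NLTK's `PorterStemmer`, which needs no extra corpus download.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()


def stem_tokens(text):
    """Hypothetical helper: lowercase, keep alphabetic tokens, stem each."""
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]


# Assumption: a toy corpus standing in for the newsgroup posts.
docs = [
    "The shuttle launch reached orbit around the moon",
    "NASA plans a new rocket launch to orbit Mars",
    "Rendering the image with a 3D graphics card",
    "The polygon rendering algorithm draws each image pixel",
]
labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

# Two alternative classifiers to compare on the same stemmed features.
classifiers = {
    "naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
}

predictions = {}
for name, clf in classifiers.items():
    # token_pattern=None because the custom tokenizer replaces it.
    vectorizer = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None)
    model = make_pipeline(vectorizer, clf)
    model.fit(docs, labels)
    predictions[name] = model.predict(["two rockets launched into orbit"])

for name, pred in predictions.items():
    print(name, pred)
```

Stemming maps inflected forms such as "rockets" and "rocket" to the same feature, which can shrink the vocabulary. Whether it improves accuracy on the newsgroup data is something to measure rather than assume.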


Starter Code

Solution Code

Additional resources