---
title: Natural Language Processing Lab
type: lab
duration: "1:25"
creator:
    name: Francesco Mosconi
    city: SF
---
In this lab we will further explore Scikit-Learn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by Scikit-Learn. This is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular dataset for experiments in text applications of machine learning, such as text classification and text clustering.
We will restrict the analysis to 4 groups and will attempt to predict which group each document belongs to from its text.
This is a typical example of text classification, where a data scientist's task is to train a model that assigns text to predefined categories. Other examples include sentiment analysis and topic assignment.
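As a starting point, loading a four-group subset might look like the sketch below. The four category names and the helper `load_four_groups` are assumptions made for illustration, since the lab does not say which groups to pick; `fetch_20newsgroups` is the Scikit-Learn loader and downloads the data on first use.

```python
# Hypothetical choice of four groups; the lab does not specify which ones to use.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_four_groups(subset="train"):
    """Return the chosen subset of 20 Newsgroups (downloads on first call)."""
    # Imported lazily so the module loads even before scikit-learn is needed.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(
        subset=subset,          # "train", "test", or "all"
        categories=CATEGORIES,  # keep only the four selected groups
        shuffle=True,
        random_state=42,        # fixed seed so the document order is reproducible
    )
```

Calling `load_four_groups()` returns an object whose `data` attribute holds the raw document strings, `target` the integer labels, and `target_names` the group names.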
- Data inspection: let's first explore the data and see how it is organized
- Bag of Words model: let's build a simple model
- Hashing and TF-IDF: let's beef up our model with some more powerful techniques
- Classifier comparison: what's our best model?
- Other classifiers: how do other classifiers perform?
- Explore NLTK in more detail
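To make the first steps of this outline concrete, here is a minimal pure-Python sketch of bag-of-words counts, the hashing trick, and TF-IDF weighting. The toy corpus, the whitespace tokenizer, the `zlib.crc32` hash, and the plain `log(N / df)` weighting are all assumptions chosen for readability; Scikit-Learn's vectorizers use their own tokenization, hash function, and smoothed formulas, so the numbers will differ.

```python
import math
import zlib
from collections import Counter

# Toy corpus, made up for illustration (not taken from 20 Newsgroups).
docs = [
    "the rocket launch was delayed",
    "the team won the hockey game",
    "the rocket reached orbit",
]


def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers do more."""
    return text.lower().split()


# Bag of words: one term-count mapping per document, word order discarded.
bags = [Counter(tokenize(d)) for d in docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for bag in bags for term in bag)


def tfidf(bag, n_docs):
    """Scale raw counts by log(n_docs / df), a textbook TF-IDF variant."""
    return {term: count * math.log(n_docs / df[term]) for term, count in bag.items()}


weights = [tfidf(bag, len(docs)) for bag in bags]


def hashed_counts(text, n_buckets=8):
    """Hashing trick: map each token to a fixed-size vector without a vocabulary."""
    vec = [0] * n_buckets
    for tok in tokenize(text):
        # crc32 is a stand-in for a stable hash; collisions are possible by design.
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec
```

In this sketch a word such as "the", which appears in every document, receives a TF-IDF weight of zero, while a word unique to one document is weighted most heavily.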