title: Natural Language Processing Lab
type: lab
duration: "1:25"
creator:
  name: Francesco Mosconi
  city: SF

Natural Language Processing Lab

Introduction

In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which scikit-learn provides. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 Newsgroups collection has become a popular dataset for experiments in text applications of machine learning, such as text classification and text clustering.

We will restrict the analysis to 4 groups and will attempt to predict which group each document belongs to from its text alone.
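As a starting point for the data inspection step, here is a minimal sketch of loading a four-group slice and counting documents per group. The lab text does not name the four groups, so the `CATEGORIES` list below is an assumption chosen for illustration, and `load_four_groups` and `documents_per_category` are hypothetical helper names. The first call to `fetch_20newsgroups` downloads the data.

```python
from collections import Counter

# Assumption: the lab does not say which four groups to use. These are
# real 20 Newsgroups category names, picked here only as an example.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_four_groups(subset="train"):
    """Fetch the chosen slice of the dataset (downloads it on first use)."""
    # Imported inside the function so the counting helper below can be
    # used even where scikit-learn is not installed.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        shuffle=True,
        random_state=42,
    )


def documents_per_category(targets, target_names):
    """Map each category name to the number of documents with that label."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}
```

With the data downloaded, `data = load_four_groups()` followed by `documents_per_category(data.target, data.target_names)` shows how balanced the four groups are, and `data.data[0]` shows the raw text of one post.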

This is a typical example of text classification, where a data scientist's task is to train a model that assigns each text to one of a set of predefined categories. Other examples include sentiment analysis and topic assignment.
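To make "train a model that assigns text to categories" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the standard library only. It is an illustration of the idea, not the lab's solution: the toy corpus and the labels `space` and `graphics` are invented, and in the lab itself you would normally reach for a scikit-learn estimator instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus invented for illustration (not taken from 20 Newsgroups).
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the renderer draws shaded polygons on screen",
    "opengl shaders render textured polygons",
]
labels = ["space", "space", "graphics", "graphics"]

model = train_naive_bayes(docs, labels)
print(predict(model, "a satellite in lunar orbit"))    # space
print(predict(model, "render polygons with shaders"))  # graphics
```

The same train-then-predict shape carries over to the real dataset; only the size of the vocabulary and the number of classes change.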

Exercise

Requirements

  1. Data inspection: let's first explore the data and see how it is organized
  2. Bag of Words model: let's build a simple model
  3. Hashing and TF-IDF: let's beef up our model with some more powerful techniques
  4. Classifier comparison: what's our best model?
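Before using the library versions, it can help to see what the three feature representations named above compute. The following is a hand-rolled sketch under simple assumptions: the function names and the toy documents are hypothetical, the hash is `zlib.crc32`, and the TF-IDF weight is the plain `count * log(N / df)` form. scikit-learn's `CountVectorizer`, `HashingVectorizer`, and `TfidfVectorizer` cover the same ideas, but `TfidfVectorizer` by default smooths the idf term and normalises each row, so its numbers will differ from these.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) with one column per distinct word."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        row = [0] * len(vocab)
        for word, count in Counter(tokenize(d)).items():
            row[index[word]] = count
        vectors.append(row)
    return vocab, vectors


def hashed_counts(doc, n_features=16):
    """Hashing trick: pick each word's column with a hash, keeping no vocabulary."""
    row = [0] * n_features
    for word in tokenize(doc):
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        row[zlib.crc32(word.encode("utf-8")) % n_features] += 1
    return row


def tf_idf(vectors):
    """Reweight counts by log(N / df) so words found in every document score 0."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in vectors if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in vectors
    ]


# Toy documents invented for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = bag_of_words(docs)
weights = tf_idf(vectors)
print(vocab)       # ['cat', 'dog', 'ran', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1]
print(weights[0])  # the last entry ('the') is 0.0: it appears in every document
print(hashed_counts(docs[0]))
```

The trade-off to notice: the bag-of-words table needs a pass over the corpus to build its vocabulary, while the hashed version has a fixed width and needs no vocabulary, at the cost of occasional collisions between words that land in the same column.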

Bonus:

  • Other Classifiers: What's the performance of other classifiers?
  • Explore NLTK in more detail

Code

Starter Code

Solution Code

Additional resources