
Title: MRJob
Type: lab
Creator: Francesco Mosconi

MRJob Lab


In the previous lab we used a virtual machine to run MapReduce jobs on native Hadoop. As you may have noticed, that process is cumbersome and complicated.

Luckily, we don't have to work that way: our friends at Yelp developed MRJob, an open-source Python library that wraps Hadoop Streaming.

This is already installed in your VM, but you can also install it locally if you prefer, using:

pip install mrjob


This lab will teach you to use MRJob, a Python library for writing and running MapReduce jobs.

Instructor note: this lab must be run on the VM that students installed in the previous hour. The VM is packaged with all the necessary data and libraries, so students should connect to it over ssh and run everything from inside the machine. Students who are not familiar with Vim can configure Sublime Text to access the VM via SFTP and edit the scripts there.

Except for Exercise 1, all the exercises can also be run on a laptop in local mode if some students have trouble connecting to or using the VM. All they need is pip install mrjob.


  • Exercise 1: Running MapReduce locally and on a Hadoop cluster
  • Exercise 2: Adding a combiner
  • Exercise 3: Multi-step jobs
  • Exercise 4: Setup and teardown of tasks
  • Exercise 5: Counters


  • Putting it all together: find top 15 most frequent words for all books
  • Use NLTK to recognize a book from the most frequent words

Starter code


Solution Code

Additional Resources