title: MRJob lab
type: lab
duration: 1:25
creator: Francesco Mosconi (SF)

MRJob Lab

Introduction

In the previous lab we used a virtual machine to run MapReduce jobs on native Hadoop. As you may have noticed, that process is cumbersome and complicated.

Luckily we don't have to do that, because our friends at Yelp developed MRJob, a great open source Python library that wraps Hadoop Streaming.

MRJob is already installed on your VM, but you can also install it locally if you prefer, using:

pip install mrjob

Exercise

This lab will teach you to use MRJob, a very powerful Python library for MapReduce.

Instructor note: this lab is meant to be run on the VM that students installed in the previous hour. The VM is packaged with all the necessary data and libraries, so students should connect to it using ssh and then run everything from inside the machine. Students who are not familiar with Vim can configure Sublime Text to access the VM via SFTP and modify the scripts there.

Except for Exercise 1, all exercises can also be run on a laptop in local mode if some students have trouble connecting to or using the VM. All they need is pip install mrjob.

Requirements

  • Exercise 1: Running MapReduce locally and on a Hadoop cluster
  • Exercise 2: Adding a combiner
  • Exercise 3: Multi-step jobs
  • Exercise 4: Setup and teardown of tasks
  • Exercise 5: Counters
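
Before writing the jobs, it can help to see the data flow the first exercises build on. The following is a minimal plain-Python sketch of the map, combine, and reduce phases for a word count. It does not use MRJob or Hadoop: the function names, the `group_by_key` helper (a stand-in for the shuffle/sort between phases), and the sample input are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def combiner(word, counts):
    """Combine phase: pre-sum the counts produced within a single input split."""
    yield word, sum(counts)


def reducer(word, counts):
    """Reduce phase: sum the partial counts for one word across all splits."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Hypothetical stand-in for the shuffle/sort step between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run_word_count(splits):
    """Run map and combine on each split, then reduce over all splits."""
    combined = []
    for lines in splits:
        mapped = [pair for line in lines for pair in mapper(line)]
        for word, counts in group_by_key(mapped):
            combined.extend(combiner(word, counts))

    result = {}
    for word, counts in group_by_key(combined):
        for key, total in reducer(word, counts):
            result[key] = total
    return result


# Two made-up input splits, as if two mappers each read part of a file.
splits = [["the cat sat", "the dog"], ["the end"]]
word_counts = run_word_count(splits)
print(word_counts)
```

The combiner and the reducer do the same work here. The difference is where they run: the combiner only sees the output of one split, so it reduces how much data has to be moved before the final reduce.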

Bonus:

  • Putting it all together: find top 15 most frequent words for all books
  • Use NLTK to recognize a book from the most frequent words
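
For the first bonus task, the final step comes down to selecting the N largest counts from a set of (word, count) pairs. Here is a small plain-Python sketch of that selection, independent of MRJob. The `top_n_words` helper, its tie-breaking rule (alphabetical order among equal counts), and the sample counts are assumptions made for illustration, not part of the lab's solution code.

```python
def top_n_words(word_counts, n=15):
    """Return the n most frequent (word, count) pairs.

    Sorted by count descending; ties are broken alphabetically
    (an assumption made here so the output is deterministic).
    """
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up counts, standing in for the output of a word count job.
sample_counts = {"whale": 5, "the": 12, "sea": 5, "ship": 2}
top_three = top_n_words(sample_counts, n=3)
print(top_three)
```

In a multi-step job, one common way to do this is to send every (count, word) pair to a single reducer in the second step, so that one task sees all the counts and can pick the top N.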

Starter code

Starter Code

Solution Code

Additional Resources