
Spark Lab 1

Creator: Francesco Mosconi

Virtual Machine Required

Note: This lab requires additional prep in order to run successfully:

  1. Download and install the Virtual Machine.
    • Note: This is a big file. Please reserve time to download it and troubleshoot the installation.


In this lab, we will use Spark to process the Bay Area Bikeshare data. We will explore both the streaming and the SQL APIs for Spark to investigate regional bike share usage habits.

We will do this using the Virtual Machine we created earlier this week. The first steps to get started are:

cd dsi-bigdata-vm
vagrant up
vagrant ssh

And then, once inside, run:

Important: If your machine is already running and you have already started the Hadoop services, you may want to stop all services first to free some memory.

Once you've started Spark in local mode, you should be able to access Jupyter at this address:

In order to run the starter code on the VM, you will need to upload it using the Jupyter browser upload function.



  • parse data: split CSV lines
  • filter: for Caltrain station
  • Spark Map Reduce: Find out number of trips per hour and per day
    • trips by day - hour (mapper)
    • trips by day - hour (reducer)
  • Spark Map Reduce: Find out number of trips per hour
    • trips by hour (mapper)
    • trips by hour (reducer)
  • collect!
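
The steps above can be sketched in plain Python before moving them into Spark. This is a minimal sketch, not the lab solution: the column positions, the timestamp format, the choice of weekday as the "day" key, and the sample rows are all assumptions made for illustration, so check them against the header of your own trip file. `reduce_by_key` is a hypothetical local stand-in for Spark's `reduceByKey`, included only so the block runs without a cluster.

```python
from datetime import datetime

# Assumed layout of the trip CSV (hypothetical; verify against your file's header).
START_DATE_COL = 2
START_STATION_COL = 3
DATE_FORMAT = "%m/%d/%Y %H:%M"  # assumed timestamp format, e.g. "8/29/2013 14:13"


def parse_line(line):
    """Split one CSV line into fields (assumes no quoted commas)."""
    return line.strip().split(",")


def is_caltrain(fields):
    """Keep trips whose start station name mentions Caltrain."""
    return "Caltrain" in fields[START_STATION_COL]


def day_hour_mapper(fields):
    """Emit ((weekday, hour), 1); weekday is 0 for Monday through 6 for Sunday."""
    start = datetime.strptime(fields[START_DATE_COL], DATE_FORMAT)
    return ((start.weekday(), start.hour), 1)


def hour_mapper(fields):
    """Emit (hour, 1) for one trip."""
    start = datetime.strptime(fields[START_DATE_COL], DATE_FORMAT)
    return (start.hour, 1)


def count_reducer(a, b):
    """Combine two partial counts for the same key."""
    return a + b


def reduce_by_key(pairs, reducer):
    """Local stand-in for RDD.reduceByKey: fold values that share a key."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = reducer(totals[key], value)
        else:
            totals[key] = value
    return totals


# Made-up sample rows in the assumed layout, not real trip records.
sample_lines = [
    "1,600,8/29/2013 8:05,San Francisco Caltrain (Townsend at 4th),70,8/29/2013 8:15,Market at 4th,76,100,Subscriber,94107",
    "2,300,8/29/2013 8:40,San Francisco Caltrain (Townsend at 4th),70,8/29/2013 8:45,Market at 4th,76,101,Subscriber,94107",
    "3,900,8/29/2013 17:20,Market at 4th,76,8/29/2013 17:35,San Francisco Caltrain (Townsend at 4th),70,102,Customer,94103",
    "4,450,8/30/2013 8:10,San Francisco Caltrain (Townsend at 4th),70,8/30/2013 8:18,Market at 4th,76,103,Subscriber,94107",
]

caltrain_trips = [f for f in map(parse_line, sample_lines) if is_caltrain(f)]
trips_by_day_hour = reduce_by_key(map(day_hour_mapper, caltrain_trips), count_reducer)
trips_by_hour = reduce_by_key(map(hour_mapper, caltrain_trips), count_reducer)

print(sorted(trips_by_day_hour.items()))
print(sorted(trips_by_hour.items()))
```

In the notebook, the same functions could be chained on an RDD, roughly as `sc.textFile(path).map(parse_line).filter(is_caltrain).map(hour_mapper).reduceByKey(count_reducer).collect()`, where `path` is a placeholder for wherever the data lives on the VM. If the file has a header row, it needs to be filtered out before the mappers run.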


  • Repeat the task using Spark SQL
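
For the SQL version, the counting logic collapses into a single `GROUP BY` query. The sketch below runs that query on `sqlite3` from the Python standard library as a stand-in, so it can be tried without Spark; it is not Spark SQL itself. The table name `trips`, the column names `start_station` and `start_hour`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Count Caltrain trips per starting hour; table and column names are hypothetical.
QUERY = """
    SELECT start_hour, COUNT(*) AS trips
    FROM trips
    WHERE start_station LIKE '%Caltrain%'
    GROUP BY start_hour
    ORDER BY start_hour
"""

# Made-up rows: (start station, hour of day the trip started).
rows = [
    ("San Francisco Caltrain (Townsend at 4th)", 8),
    ("San Francisco Caltrain (Townsend at 4th)", 8),
    ("Market at 4th", 17),
    ("San Francisco Caltrain (Townsend at 4th)", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_station TEXT, start_hour INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
trips_by_hour_sql = conn.execute(QUERY).fetchall()
conn.close()

print(trips_by_hour_sql)
```

In Spark, a query of this shape would typically be run by registering a DataFrame of trips as a temporary table and passing the query string to the SQL context's `sql` method. How the hour column is derived from the raw timestamp is left to the exercise.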

Starter code

Starter Code

Solution Code

Additional Resources