
Project 2: Billboard Hits + Data Munging


This week we learned how to craft appropriate problem statements, work with pivot tables, and started learning about Pandas. Additionally, we learned how to clean data and use dummy variables. Now, we're going to apply some of these newly acquired skills in our second project. Given a dirty dataset, you will do some exploratory data analysis. You will create a Jupyter writeup hosted on GitHub that provides a dataset overview with visualizations, statistical analysis, and data cleaning methodologies.
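As a quick refresher on the dummy-variable technique mentioned above, here is a minimal pandas sketch. The toy frame and column names are illustrative only; the real Billboard dataset will have different columns.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset's columns will differ.
df = pd.DataFrame({
    "track": ["Song A", "Song B", "Song C"],
    "genre": ["rock", "pop", "rock"],
})

# Dummy-encode the categorical column; drop_first=True avoids
# perfect collinearity if the dummies feed into a regression.
dummies = pd.get_dummies(df["genre"], prefix="genre", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())  # ['track', 'genre', 'genre_rock']
```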

As an aside, you should get used to hearing this statistic: 80% of data analysis is spent on the process of cleaning and preparing the data, also known as data munging (Dasu and Johnson 2003). When you're looking at raw data, preparing for an interview, or starting a new project, keep that in mind. Good models cannot produce good predictions without good data.

Project Summary

On next week's episode of the 'Are You Entertained?' podcast, we're going to be analyzing the latest generation's guilty pleasure: the music of the '00s. Our Data Scientists have pored over Billboard chart data to analyze what made hits soar to the top of the charts, and how long they stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?

For this project, we'll be posting a companion writeup with visualizations that will offer insights into our conclusions.

Goal: Jupyter technical notebook including plotting and statistical analysis, with the process and results translated into a companion blog post.


Your work must:

  • Write a high-quality problem statement

  • State the risks and assumptions of your data

  • Import data using the Pandas Library

  • Perform exploratory data analysis on the information

  • Use Tableau and/or Python plotting modules to generate visualizations

  • Identify correlations (and possible causal relationships) in the data

  • Evaluate your hypothesis using statistical analysis

  • Present results in a polished blog post format of at least 500 words (& 1-2 graphics!)

  • Bonus: Write a short white paper on the philosophy of 'Clean Data' of no less than 500 words. Link to it in your Jupyter notebook.
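To make the cleaning-and-testing requirements above concrete, here is a minimal end-to-end sketch. Everything in it is an assumption: the inline CSV stands in for the real Billboard file, and the column names (`x1st.week`, `genre`) and the Rock-vs-Pop hypothesis are purely illustrative.

```python
import io
import pandas as pd
from scipy import stats

# Inline toy sample standing in for the real Billboard CSV
# (column names are guesses, not the actual schema).
raw = io.StringIO(
    "track,genre,x1st.week\n"
    "Song A,Rock,87\n"
    "Song B,Rock,91\n"
    "Song C,Pop,62\n"
    "Song D,Pop,70\n"
    "Song E,Rock,*\n"  # '*' marks a missing value in the raw file
)
df = pd.read_csv(raw)

# Cleaning: coerce the chart-position column to numeric;
# unparseable tokens like '*' become NaN.
df["x1st.week"] = pd.to_numeric(df["x1st.week"], errors="coerce")

# Illustrative hypothesis test: do Rock and Pop debut positions differ?
rock = df.loc[df["genre"] == "Rock", "x1st.week"].dropna()
pop = df.loc[df["genre"] == "Pop", "x1st.week"].dropna()
t_stat, p_value = stats.ttest_ind(rock, pop, equal_var=False)
print(df["x1st.week"].isna().sum(), round(p_value, 4))
```

In your own notebook you would replace the inline sample with `pd.read_csv(...)` on the provided file and state your hypothesis explicitly before testing it.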

Necessary Deliverables

  • Materials must be in a clearly labeled Jupyter notebook.
  • Materials must be submitted via a GitHub PR to the instructor's repo.
  • Materials must be submitted by the end of week 2.
    • Note: Blog post must be published on a blogging platform, and submitted via a URL pasted into your Jupyter notebook.
    • Note: Bonus white paper may be submitted as a text file. Bonus nerd points for using LaTeX ;)

Starter code

Instructor Note: The solution code is linked here


Suggested Ways to Get Started

  • Read in your dataset
  • Try out a few NumPy commands to describe your data
  • Write pseudocode before you write actual code. Thinking through the logic of something helps.
  • Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
  • Document everything.
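The first two steps above can be sketched as follows. Since the project file isn't included here, a toy frame stands in for it; the column names are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the project dataset (names are illustrative).
df = pd.DataFrame({
    "artist": ["Destiny's Child", "Santana", "Madonna"],
    "peak_position": [1, 1, 29],
    "weeks_on_chart": [28, 30, 12],
})

# First-pass EDA: shape, dtypes, and NumPy-backed summary statistics.
print(df.shape)        # (rows, columns)
print(df.dtypes)
print(df.describe())   # count, mean, std, min, quartiles, max per numeric column
```

`df.describe()` only summarizes the numeric columns by default; pass `include="all"` to cover the string columns too.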

Useful Resources

Project Feedback + Evaluation

Attached here is a complete rubric for this project.

Your instructors will score each of your technical requirements using the scale below:

Score | Expectations
----- | ------------
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_
**3** | _Exceeds expectations, you wonderful creature, you!_

This will serve as a helpful overall gauge of whether you met the project goals, but the more important scores are the individual ones above, which can help you identify where to focus your efforts for the next project!