This week we learned how to craft appropriate problem statements, work with pivot tables, and started learning about Pandas. Additionally, we learned how to clean data and use dummy variables. Now, we're going to apply some of these newly acquired skills in our second project. Given a dirty dataset, you will do some exploratory data analysis. You will create a Jupyter writeup hosted on GitHub that provides a dataset overview with visualizations, statistical analysis, and data cleaning methodologies.
As an aside, you should get used to hearing this statistic: 80% of data analysis is spent on cleaning and preparing the data, a process often called data munging or wrangling (Dasu and Johnson 2003). When you're looking at raw data, preparing for an interview, or starting a new project, keep that in mind. Good models cannot produce good predictions without good data.
On next week's episode of the 'Are You Entertained?' podcast, we're going to be analyzing the latest generation's guilty pleasure: the music of the '00s. Our Data Scientists have pored over Billboard chart data to analyze what made hits soar to the top of the charts, and how long they stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?
For this project, we'll be posting a companion writeup with visualizations that will offer insights into our conclusions.
Goal: A technical Jupyter notebook including plotting and statistical analysis, with the process and results translated into a companion blog post.
Your work must:
- Write a high-quality problem statement
- State the risks and assumptions of your data
- Import data using the Pandas library
- Perform exploratory data analysis on the data
- Use Tableau and/or Python plotting modules to generate visualizations
- Identify correlations in the data (and note where causation can or cannot be inferred)
- Evaluate your hypothesis using statistical analysis
- Present results in a polished blog post of at least 500 words (& 1-2 graphics!)
- Write a short white paper on the philosophy of 'Clean Data' of no less than 500 words. Link to it in your Jupyter notebook.
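To make the import, cleaning, dummy-variable, and statistical-analysis steps above concrete, here is a minimal sketch in Pandas. The column names and data are hypothetical stand-ins, not the actual Billboard dataset:

```python
import pandas as pd
from scipy import stats

# Hypothetical raw data with the kinds of problems a dirty dataset has:
# numbers stored as strings, missing values, and a categorical column.
raw = pd.DataFrame({
    "weeks_on_chart": ["12", "5", None, "20", "8", "15"],
    "genre": ["rock", "pop", "pop", "rock", "rap", "pop"],
    "peak_position": [3, 30, 40, 7, 55, 2],
})

# Clean: coerce to numeric, then drop rows that can't be recovered.
raw["weeks_on_chart"] = pd.to_numeric(raw["weeks_on_chart"], errors="coerce")
clean = raw.dropna(subset=["weeks_on_chart"])

# Encode the categorical column as dummy variables.
clean = pd.get_dummies(clean, columns=["genre"], prefix="genre")

# A simple statistical check of a hypothesis: do songs that peaked in the
# top 10 stay on the chart longer than the rest? (Welch's t-test)
top10 = clean.loc[clean["peak_position"] <= 10, "weeks_on_chart"]
rest = clean.loc[clean["peak_position"] > 10, "weeks_on_chart"]
t_stat, p_value = stats.ttest_ind(top10, rest, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With real data you would load the file with `pd.read_csv` instead of building the DataFrame inline; the cleaning and testing steps stay the same.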
- Materials must be in a clearly labeled Jupyter notebook.
- Materials must be submitted via a Github PR to the instructor's repo.
- Materials must be submitted by the end of week 2.
- Note: Blog post must be published on a blogging platform, and submitted via a URL pasted into your Jupyter notebook.
- Note: Bonus white paper may be submitted as a text file. Bonus nerd points for using LaTeX ;)
- Starter code has been provided in the form of a Jupyter notebook. Please complete all project work in this notebook.
Instructor Note: The solution code is linked here
Suggested Ways to Get Started
- Read in your dataset
- Try out a few Pandas and NumPy commands to describe your data
- Write pseudocode before you write actual code. Thinking through the logic of something helps.
- Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
- Document everything.
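A quick first-look sketch for the "read in your dataset" and "describe your data" steps above. The in-memory CSV and its columns are hypothetical; with your real dataset you would pass a file path to `pd.read_csv` instead:

```python
import io
import pandas as pd

# Stand-in for your real CSV file; replace the StringIO with a path,
# e.g. pd.read_csv("billboard.csv"). Columns here are made up.
csv_data = io.StringIO(
    "track,artist,peak_position,weeks_on_chart\n"
    "Song A,Artist 1,1,20\n"
    "Song B,Artist 2,15,8\n"
    "Song C,Artist 3,3,14\n"
)
df = pd.read_csv(csv_data)

# First-look commands worth running on any new dataset.
print(df.shape)          # (rows, columns)
print(df.dtypes)         # catch columns read in as the wrong type
print(df.describe())     # summary statistics for numeric columns
print(df.isna().sum())   # missing values per column
```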
Project Feedback + Evaluation
Your instructors will score each of your technical requirements using the scale below:
Score | Expectations
----- | ------------
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_
**3** | _Exceeds expectations, you wonderful creature, you!_
This will serve as a helpful overall gauge of whether you met the project goals, but the more important scores are the individual ones above, which can help you identify where to focus your efforts for the next project!