Welcome to Data Science
 Welcome
 Your Team
 Course Overview
 Course Schedule
 Projects
 Tech Requirements
 Classroom Tools
 Student Expectations
 Office Hours
 Student Feedback
Course Overview
Welcome to the part-time Data Science course at General Assembly! We are building a global community of lifelong learners who are excited about using data to solve real-world problems.
In this program, we will use Python to explore datasets, build predictive models, and communicate data-driven insights. Specifically, you will learn how to:
 Define many of the approaches and considerations that data scientists use to solve real-world problems.
 Perform exploratory data analysis with powerful programmatic tools in Python.
 Build and refine basic machine learning models to predict patterns from data sets.
 Communicate data-driven insights to peers and stakeholders in order to inform business decisions.
What You Will Learn
 Statistical Analysis with Python:
 Perform visual and statistical analysis on data using Python and its associated libraries and tools.
 Data-Driven Decision-Making:
 Define and determine the tradeoffs involving feature selection, model accuracy, and data quality.
 Machine Learning & Modeling Techniques:
 Explore supervised learning techniques, including classification, regression, and decision trees.
 Visualizations & Presentations:
 Create visualizations and interactive notebooks to present to industry stakeholders.
Class Zoom Link
https://generalassembly.zoom.us/j/129188830
Class
Previous Class Recordings
Access previous class recordings here
Finding DataSets with Google
Try Google's new dataset search, recently released in beta.
Python Version
The curriculum materials for this course are written in Python 3.7
Your Instructional Team
Instructor: Steven Longstreet
I.A.: Adi Bronshtein
I.A.: Kelvin Sumlin
Curriculum Structure
General Assembly's Data Science part-time materials are organized into four units.

| Unit | Title | Topics Covered | Length |
| --- | --- | --- | --- |
| Unit 1 | Data Foundations | Python Syntax, Development Environment | Lessons 1-4 |
| Unit 2 | Working with Data | Stats Review, Visualization, & EDA | Lessons 5-9 |
| Unit 3 | Data Science Modeling | Regression, Classification, & KNN | Lessons 10-14 |
| Unit 4 | Data Science Applications | Decision Trees, NLP, & Flex Topics | Lessons 15-19 |
Lesson Schedule
Here is the schedule we will be following for our part-time data science course:

| Date | Lesson | Unit Number | Session Number |
| --- | --- | --- | --- |
| 1/8 | Welcome to Data Science | Unit 1 | Session 1 |
| 1/10 | Your Development Environment | Unit 1 | Session 2 |
| 1/15 | Python Foundations | Unit 1 | Session 3 |
| 1/17 | Exploratory Data Analysis in Pandas | Unit 1 | Session 4 |
| 1/24 | [FLEX: Project Workshop + Presentations] | Unit 2 | Session 5 |
| 1/29 | Data Visualization in Python | Unit 2 | Session 6 |
| 1/31 | Statistics in Python | Unit 2 | Session 7 |
| 2/5 | Experiments & Hypothesis Testing | Unit 2 | Session 8 |
| 2/7 | Linear Regression | Unit 2 | Session 9 |
| 2/12 | Logistic Regression | Unit 3 | Session 10 |
| 2/14 | Train-Test Split & Bias-Variance | Unit 3 | Session 11 |
| 2/21 | KNN / Classification | Unit 3 | Session 12 |
| 2/26 | Clustering | Unit 3 | Session 13 |
| 2/28 | Decision Trees | Unit 3 | Session 14 |
| 3/5 | Intro to Natural Language Processing | Unit 4 | Session 15 |
| 3/7 | Intro to Time Series | Unit 4 | Session 16 |
| 3/12 | Working With Data: APIs | Unit 4 | Session 17 |
| 3/14 | Flex: Neural Nets & Assessment | Unit 4 | Session 18 |
| 3/19 | Final Project Presentations, Part 1 | Unit 4 | Session 19 |
| 3/21 | Final Project Presentations, Part 2 | Unit 4 | Session 20 |
Project Structure
This course will ask you to complete a series of projects in order to practice and apply the skills covered in class.
Unit Projects
At the end of each unit, you'll work on short structured projects. These activities will test your understanding of that unit’s most important concepts with in-class practice and instructor support.
For those of you who want to go above and beyond, we’ve also included stretch options, bonus activities, and other opportunities for further reading and practice.
Final Project
You'll also complete a final project that asks you to apply your skills to a real-world or business problem of your choice.
The capstone is an opportunity for you to demonstrate your new skills and tackle a pressing issue relevant to your life, industry, or organization. You’ll create a hypothesis, analyze internal data, and generate a working model, prototype, solution, or recommendation.
You will get structured guidance and designated time to work throughout the course. Final project deliverables include:
 Proposal: Describe your chosen problem and identify relevant data sets (confirming access, as needed).
 Brief: Share a summary of your initial analysis and your next steps with your instructional team.
 Report: Submit a cleanly formatted Jupyter notebook (or other files) documenting your code and process for technical/peer stakeholders.
 Presentation: Present a summary of your business problem, approach, and recommendation to an audience of non-technical executive stakeholders.
Project Breakdown
 Project 1: Python Technical Code Challenges (Due 1/24)
 Project 2: Exploratory Data Analysis (Due 2/7)
 Project 3: Modeling Practice (Due 2/28)
 Project 4: Final Project
 Part 1: Proposal + Dataset (Due 1/24)
 Part 2: Initial EDA Brief (Due 2/7)
 Part 3: Technical Report (Due 3/19)
 Part 4: Presentation (Due 3/19)
Technology Requirements
Hardware
 8 GB RAM (at least)
 10 GB free hard drive space (after installing Anaconda)
Software
 Download and Install Anaconda with Python 3.7.
Note: Anaconda provides support for two different versions of Python. Make sure to install the "Python 3.7" version.
PC only
 Install Git Bash
 Two things to reduce confusion:

 Many first-time Git users find Vim to be confusing. If you do not wish to learn Vim commands, select an alternative text editor such as Nano.

 If you select "Use Git and optional Unix tools from the Windows Command Prompt" you will be able to use the command prompt for all Git-related tasks in this course.
Browser
 Google Chrome
Miscellaneous
 Text editor (we recommend Atom)
Slack
We'll use Slack as our class communications platform. Slack is a messaging platform where you can chat with your peers and instructors. We will use it to share information about the course, discuss lessons, and submit projects. Our Slack homepage is wave5virtualettth.
Pro Tip: If you've never used Slack before, check out these resources:
Expectations
 Participate in classroom discussions & complete class feedback forms
 Be respectful to others
 Miss no more than 2 classes
 Complete all unit assignments
 Complete the final project
Office Hours
Every week, your instructional team will hold office hours where you can get in touch to ask questions about anything relating to the course. This is a great opportunity to follow up on questions or ask for more details about any topics covered so far.
Office Hours Expectations:
 Office hours are available in 15-minute reservations on a first-come, first-served basis. If additional times are needed, please let us know and we'll do our best to accommodate you.
 Let the Instructional Team member know via Slack, before the meeting, what you'd like to discuss. If multiple people are having similar issues, we may set up a group session or bring some additional material to class.
 Students are encouraged to come prepared with specific questions and the things they’ve tried so far, so the instructional team can provide progressive suggestions and make the most of your time.
Note: If a question or issue is related to a project or to coding, consider posting it on Slack so the entire class, instructional team included, can see it and assist.
| Instructional Associate | Office Hours | Method |
| --- | --- | --- |
| Adi | Tu 5:00-6:00 PM ET & Th 5:00-6:00 PM ET | Use this link or contact via Slack |
| Kelvin | Th 10:00-11:00 AM ET & Sun 8:00-10:00 PM ET | Contact via Slack |
| Ryan | Mon 6:00-7:00 PM ET & Sat 11:00 AM-1:00 PM ET | Use http://rylativity.youcanbook.me/ or contact via Slack |
Additional times, including time with Steven, are available by appointment, schedules permitting. The team is also generally available via Slack. If you need to meet outside of office hours, let us know.
Student Feedback
Throughout the course, you'll be asked to provide feedback about your experience. This feedback is extremely important, as it helps us provide you with a better learning experience.
The exit ticket can be found by clicking here.
Class Resources
Class 1: What is Data Science
Python Resources:
 Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
 DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
 Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
 A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
 Python for Informatics: A very beginner-oriented book, with associated slides and videos.
 Python Tutor: Allows you to visualize the execution of Python code.
 My code isn't working is a great flowchart explaining how to debug Python errors.
 PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
Advanced Python Material:
 Want to understand Python's comprehensions? Think in Excel or SQL may be helpful if you are still confused by list comprehensions.
 If you want to understand Python at a deeper level: Ned Batchelder's Loop Like A Native, Python Names and Values, Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python and Python Epiphanies are excellent presentations.
 Everything is an object in Python
Resources:
 For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
 For some thoughts on what it's like to be a data scientist, read these short posts from WinVector and Datascope Analytics.
 Introduction to Statistical Learning
 Data Science vs Statistics
 15 Books every Data Scientist Should Read
 50+ Free Data Science Books
 Building Data Science Teams
 Doing Data Science
 Getting Started with Data Science
 Quora has a data science topic FAQ with lots of interesting Q&A.
 Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
 Stack Overflow  Developer Survey Results 2017
 Nate Silver on the Art and Science of Prediction
 Data science business application diagram
 Three waves of AI
Material for Next Class:
 Setting up Python for machine learning: scikit-learn and IPython Notebook. This video includes an overview of Jupyter Notebook, which is used in the homework assignment.
 Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
 Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
Class 2: Git, Github, and the Command Line
Class Resources: Set your Git username and email
Command Line Resources:
 The Master Cheat Sheet site is a good source for cheat sheets on topics such as Git and the command line.
 Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
 The Linux command line
 If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
 If you want to do more at the command line with CSV files, try out csvkit, which can be installed via 'pip'.
Git and Markdown Resources:
 Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
 GitHub for Beginners
 If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
 If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
 GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
 Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
 Introducing GitHub is a nice intro to GitHub that reads quickly
 Version Control with Git
 Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
 How to remove .DS_Store from GitHub
 The simple Git guide
Class 3: Python Foundations
For more information on this topic, check out the following resources:
 Python Code Academy
 Learn Python the Hard Way
 Python Data Types and Variables
 Python: IF, ELIF, ELSE
 Python Loops
 Python Control Flow
 Function Practice
 Practice Problems with solutions
 Understanding functions vs methods
 Intro to Loops
 44 practice exercises for loops
 Python list comprehensions: Explained Visually
 Merging List comprehension explanation with exercises
 List Comprehension Exercises
Class Challenge Exercises
Some extra exercises to practice on!
Starter Exercises
 Capture all of the numbers from 1-1000 that are divisible by 7
 Capture all of the numbers from 1-1000 that have a 3 in them
 Count the number of spaces in a string
 Remove any vowels in a string
 Capture all of the words in a string that are less than 4 letters long
Challenge Exercises
 Use a dictionary comprehension to count the length of each word in a sentence.
 Use a nested list comprehension to find all of the numbers from 1-500 that are divisible by any single digit except 1
 Use a nested list/dictionary comprehension to find the highest single digit that each number in the range 1-1000 is divisible by
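As a syntax refresher before attempting the exercises above, here is a minimal sketch of the three comprehension forms they rely on. The inputs and tasks are made up for illustration and are deliberately not solutions to the exercises.

```python
# Illustrative input; not taken from the exercises.
sentence = "the quick brown fox jumps over the lazy dog"

# List comprehension with a filter: even numbers from 1 to 20.
evens = [n for n in range(1, 21) if n % 2 == 0]

# Dictionary comprehension: map each distinct word to its first letter.
first_letters = {word: word[0] for word in sentence.split()}

# Nested list comprehension: a 3x3 multiplication table as a list of rows.
table = [[row * col for col in range(1, 4)] for row in range(1, 4)]

print(evens)
print(first_letters)
print(table)
```

The same `[expression for item in iterable if condition]` shape applies to each exercise; only the expression and the condition change.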
Class 4 Exploratory data analysis with Pandas
Class Resources:
 MovieLens 100k movie ratings (data dictionary, website)
 Alcohol consumption by country (article)
 Reports of UFO sightings (website)
Pandas Resources:
 Pandas Cheat Sheet
 Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
 To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
 If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
 This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
 This is a nice, short tutorial on pivot tables in Pandas.
Class 5 Visualizations
For more information on this topic, check out the following resources:
 SAS Data Viz Guide
 Professor Shafer's Guide to Viz Attributes
 Tableau's Guide to Data Viz
 Documentation for Matplotlib, Seaborn, and Pandas Plotting
 Harvard's Data Science course includes an excellent lecture on Visualization Goals, Data Types, and Statistical Graphs (83 minutes), for which the slides are also available.
 Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
 For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
 To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
 Read Overview of Python Visualization Tools for a useful comparison of Matplotlib, Pandas, Seaborn, ggplot, Bokeh, Pygal, and Plotly.
 To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are nice one-page references, and the interactive R Graph Catalog has handy filtering capabilities.
 This PowerPoint presentation from Columbia's Data Mining class contains lots of good advice for properly using different types of visualizations.
 Jake VanderPlas Presentation: The Python Visualization Landscape PyCon 2017
Seaborn Resources:
 To get started with Seaborn for visualization, the official website has a series of detailed tutorials and an example gallery.
 Data visualization with Seaborn is a quick tour of some of the popular types of Seaborn plots.
 Visualizing Google Forms Data with Seaborn and How to Create NBA Shot Charts in Python are both good examples of Seaborn usage on real-world data.
Class 6 Flex lesson  Reinforcement & In Class Practice
Class 7: Statistics in Python
Additional Resources
For more information on this topic, check out the following resources:
 Scikit Learn Documentation:
 Useful Wikipedia Pages:
 Bessel's Correction: Sample variance is incredibly complicated once you look into it, but that makes it one of the simplest examples of meaningful bias and variance.
 Mean Squared Error: Many fields obsess about unbiased estimators; in machine learning we obsess about MSE. Includes more examples of the sample variance estimator.
 Think Stats E-Book
 A great tour of self-guided resources to learn stats relevant to data science
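To make the Bessel's correction item above concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration. `pvariance` divides the summed squared deviations by n, while `variance` divides by n - 1.

```python
import statistics

# Made-up sample, used only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

# Population formula: divide the summed squared deviations by n.
biased = statistics.pvariance(sample)

# Sample formula with Bessel's correction: divide by n - 1 instead.
unbiased = statistics.variance(sample)

# The two estimates differ exactly by the factor n / (n - 1).
ratio = unbiased / biased

print(biased, unbiased, ratio)
```

Dividing by n systematically underestimates the variance of the population a sample came from, and the n - 1 divisor removes that bias. The gap shrinks as n grows.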
Statistical References:
 Fantastic book on mathematics for machine learning: Mathematics for Machine Learning
 A great start: Elements of Statistical Learning
 Bayesian Data Analysis, by Andrew Gelman
 Machine Learning: a Probabilistic Perspective
 Pattern Recognition and Machine Learning
 And of course my personal favorite
Statistics Resources:
 Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
 Read Anscombe's Quartet, and Why Summary Statistics Don't Tell the Whole Story for a classic example of why visualization is useful.
 What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.
 Khan Academy Statistics and Probability: Good refresher if you need it.
 ThinkStats: Good statistics book with Python code in NumPy and Pandas.
 Bias of an estimator: More on the bias of the sample variance estimator.
 Understanding the Bias-Variance Tradeoff: A deep topic that we will dive into later in the course; worth a preview.
 Understanding when to standardize vs when to normalize
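For the last item above, the two rescalings can be sketched in a few lines. This example assumes "standardize" means z-scoring (subtract the mean, divide by the standard deviation) and "normalize" means min-max scaling to the range 0 to 1; the linked resource may define the terms differently. The values are made up.

```python
import statistics

# Made-up feature values, used only for illustration.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Standardization (z-score): result has mean 0 and standard deviation 1.
mean = statistics.mean(values)
std = statistics.pstdev(values)
standardized = [(v - mean) / std for v in values]

# Min-max normalization: result is squeezed into the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(standardized)
print(normalized)
```

Standardized values are unbounded and centered on zero, while min-max values are bounded but sensitive to a single extreme point, since that point sets `lo` or `hi`.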
Class 8 Experiments & Hypothesis Testing
For more information on this topic, check out the following resources:
 Survivorship Bias: Abraham Wald and the Statistical Research Group
 An Introduction to Statistical Learning
 The more advanced book: Elements of Statistical Learning
 Spurious Correlations
 Wikipedia pages on ANOVA, Welch's t-test, Mann-Whitney test
 For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read this lesson on linear regression.
 Here is a useful explanation of confidence intervals from Quora.
 Hypothesis Testing: The Basics provides a nice overview of the topic, and John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
Class 9: Linear Regression
Linear Regression Resources:
 Ben Lorica: Six reasons why I recommend scikit-learn
 To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
 This introduction to linear regression is more detailed and mathematically thorough, and includes lots of good advice.
 Analytics Vidhya's Compilation of Linear Regression Blogs
 Data School's "Friendly Introduction to Linear Regression" using Python
 This is a relatively quick post on the assumptions of linear regression.
 Setosa has an interactive visualization of linear regression.
 For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read my DAT7 lesson on linear regression.
 Here is a useful explanation of confidence intervals from Quora.
 Hypothesis Testing: The Basics provides a nice overview of the topic, and John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
 Earlier this year, a major scientific journal banned the use of p-values:
 Scientific American has a nice summary of the ban.
 This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
 Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
 Science Isn't Broken includes a neat tool that allows you to "p-hack" your way to "statistically significant" results.
 Accurately Measuring Model Prediction Error compares adjusted R-squared, AIC and BIC, train/test split, and cross-validation.
Regularization Resources:
 What is a norm: l0-norm, l1-norm, l2-norm, …, l-infinity norm
 An Introduction to Statistical Learning has useful videos on ridge regression (13 minutes), and lasso regression (15 minutes).
 Caltech's Learning From Data course has a great video introducing regularization (8 minutes) that builds upon their video about the bias-variance tradeoff.
 Scikit-learn examples for Lasso and Ridge Regression
 Scikit-learn documentation for Lasso, Ridge, and Elastic Net Regression
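The norms named in the first item above can be computed directly. The sketch below is a plain-Python illustration with hypothetical helper names and a made-up vector; it is not taken from any of the linked resources.

```python
import math

def l0_norm(v):
    # Count of non-zero entries (not a true norm, but commonly called one).
    return sum(1 for x in v if x != 0)

def l1_norm(v):
    # Sum of absolute values.
    return sum(abs(x) for x in v)

def l2_norm(v):
    # Euclidean length.
    return math.sqrt(sum(x * x for x in v))

def linf_norm(v):
    # Largest absolute entry.
    return max(abs(x) for x in v)

# Made-up coefficient vector, used only for illustration.
coefficients = [3.0, 0.0, -4.0]

print(l0_norm(coefficients), l1_norm(coefficients),
      l2_norm(coefficients), linf_norm(coefficients))
```

These connect to regularization as follows: lasso adds a penalty proportional to the l1 norm of the coefficients, and ridge adds a penalty proportional to the squared l2 norm.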
Class 10 Logistic Regression
For more information on this topic, check out the following resources:
 Sklearn Logistic Regression Documentation
 Data School: Logistic Regression In-Depth
 Logistic Regression for Machine Learning
 Video: Andrew Ng on Logistic Regression
Logistic Regression Resources:
 Better understand Confusion Matrices
 To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
 For a more mathematical explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
 For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
 The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
 Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
 What is the C hyperparameter for Logistic Regression
 What is the difference between the sigmoid and softmax function
 How multinomial logistic regression works
 Great paper on "A Comparison of Logistic Regression, k-Nearest Neighbor, and Decision Tree Induction for Campaign Management" (https://pdfs.semanticscholar.org/24b0/2fc8b438d9a432ad72111ef2d80f8c148c1c.pdf)
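For the sigmoid versus softmax item above, a small sketch may help. The function bodies follow the standard definitions; the input scores are made up, and the max-subtraction inside `softmax` is a common numerical-stability step added here rather than something the linked resource is known to use.

```python
import math

def sigmoid(z):
    # Maps one real-valued score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    # Maps a list of scores to probabilities that sum to 1.
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores, used only for illustration.
z = 1.5
two_class = softmax([z, 0.0])
three_class = softmax([1.0, 2.0, 3.0])

print(sigmoid(z), two_class, three_class)
```

With two classes and the second score fixed at 0, the first softmax output equals `sigmoid(z)`, which is why binary logistic regression uses the sigmoid and the multinomial version uses softmax.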
Class 11 Train/Test Split & Bias-Variance Tradeoff
For more information on this topic, check out the following resources:
 Understanding the Bias-Variance Tradeoff compares adjusted R-squared, AIC and BIC, train/test split, and cross-validation.
 University of Washington Machine Learning Course Slides
 An Intuitive Explanation of Overfitting
 Caltech's Learning From Data course has a great video introducing regularization (8 minutes) that builds upon their video about the bias-variance tradeoff.
 Approaches to feature selection for machine learning in Python
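As a rough illustration of the train/test split idea behind this class, the sketch below holds out a random fraction of rows using only the standard library. The helper name `split_rows`, the 25% test fraction, and the seed are assumptions made for this example. In practice, scikit-learn's `train_test_split` does this job.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up "dataset" of 20 row ids, used only for illustration.
rows = list(range(20))
train_rows, held_out_rows = split_rows(rows)

print(len(train_rows), len(held_out_rows))
```

The model is fit on `train_rows` only and scored on `held_out_rows`, so the score reflects performance on data the model has not seen.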
Class 12 K Nearest Neighbors (KNN), Classifiers, Preprocessing and GridSearch
For more information on this topic, check out the following resources:
 Data School: Machine Learning With KNN
 KNN: Dangerously Simple
 KNN From Scratch
 Detailed Intro to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
 Stanford's Machine Learning Course: KNN
KNN Resources:
 For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
 KNN supports distance metrics other than Euclidean distance, such as Mahalanobis distance, which takes the scale of the data into account.
 This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
 Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
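To illustrate the point above that KNN is not tied to Euclidean distance, here is a minimal from-scratch sketch with a pluggable distance function. It swaps in Manhattan distance as the alternative because it needs no matrix algebra; Mahalanobis distance, mentioned above, would additionally need the inverse covariance matrix of the data. All names and data points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(points, labels, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up two-cluster training data, used only for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, (1.2, 1.4))
near_b = knn_predict(points, labels, (6.4, 6.2), distance=manhattan)

print(near_a, near_b)
```

scikit-learn's `KNeighborsClassifier` exposes the same choice through its `metric` parameter.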
scikit-learn Resources:
 scikit-learn's machine learning map may help you to choose the "best" model for your task.
 Choosing a Machine Learning Classifier is a short and highly readable comparison of several classification models, Comparing supervised learning algorithms is a model comparison table that I created, and Supervised learning superstitions cheat sheet is a more thorough comparison (with links to lots of useful resources).
 Machine Learning Done Wrong, Machine Learning Gremlins (31 minutes), Clever Methods of Overfitting, and Common Pitfalls in Machine Learning all offer thoughtful advice on how to avoid common mistakes in machine learning.
 Practical machine learning tricks from the KDD 2011 best industry paper and Andrew Ng's Advice for applying machine learning include slightly more advanced advice than the resources above.
 An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
 API design for machine learning software: Experiences from the scikit-learn project
 Machine learning: artificial intelligence bias
Fundamental Statistics
 causaldatascience: Great series about DAGs, association, and causation.
 Khan Academy Statistics and Probability: Still useful for basic topics.
Class 13 Clustering
Clustering Resources:
 K-means: documentation, visualization 1, visualization 2
 DBSCAN: documentation, visualization
 For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides.
 scikit-learn's user guide compares many different types of clustering.
 This PowerPoint presentation from Columbia's Data Mining class provides a good introduction to clustering, including hierarchical clustering and alternative distance metrics.
 An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes) and hierarchical clustering (15 minutes).
 This is an excellent interactive visualization of hierarchical clustering.
 This is a nice animated explanation of mean shift clustering.
 The K-modes algorithm can be used for clustering datasets of categorical features without converting them to numerical values. Here is a Python implementation.
 Here are some fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
Additional Resources
 Scikit-learn Clustering Methods
 K-Means Clustering (video)
 Clustering Overview
 Cluster Analysis and K-Means (PDF)
 K-Means Wikipedia Article
Class 14: Decision Trees and Random Forests
Decision Trees Resources
 Introduction to Statistical Learning, Chapter 8 (Tree-Based Methods): This book provides a fantastic introduction to machine learning models and the statistics behind them. The visuals and explanations are really easy to understand. PDF available to download on the website.
 scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
 For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
 If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
 The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
 Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
 Induction of Decision Trees
 Top 10 algorithms in data mining
Ensembling Resources:
 scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
 MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
 Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
 Interpretable vs Powerful Predictive Models: Why We Need Them Both is a short post on how the tactics useful in a Kaggle competition are not always useful in the real world.
 Not Even the People Who Write Algorithms Really Know How They Work argues that the decreased interpretability of state-of-the-art machine learning models has a negative impact on society.
 For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
 Large Scale Decision Forests: Lessons Learned is an excellent post from Sift Science about their custom implementation of Random Forests.
 Unboxing the Random Forest Classifier describes a way to interpret the inner workings of Random Forests beyond just feature importances.
 Understanding Random Forests: From Theory to Practice is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.
 Global model tuned to user preference over time
Class Resources
 Chapter 9, Elements of Statistical Learning: This book is by most of the same authors as the previous book, but goes into more detail. PDF available to download on the website.
 Chapter 8, Applied Predictive Modeling: While this book features R code, its discussions of different predictive models and sampling methodologies are hard to beat.
Class 15: Natural Language Processing (NLP) 1
NLP Resources:
 If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
 This slide deck defines many of the key NLP terms.
 Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
 A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
 spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
 If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
 When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
 Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
 Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
 Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
 DC Natural Language Processing is an active Meetup group in our local area.
 BM25 (a better tf-idf) takes document length into account when determining term importance.
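To show what "takes document length into account" means for BM25, here is a minimal sketch of one common Okapi BM25 formulation with a smoothed IDF. The function name, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions for illustration; library implementations differ in details.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        # Longer-than-average documents get a larger penalty in the denominator.
        length_norm = 1.0 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus: the first two documents each mention "cats" exactly once.
docs = [
    "cats purr".split(),
    "cats purr and sleep all day on the warm sofa".split(),
    "dogs bark".split(),
]
scores = bm25_scores(["cats"], docs)

print(scores)
```

Both of the first two documents contain the query term once, but the shorter one scores higher because the term makes up a larger share of it. Plain tf-idf on raw counts would score them equally.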
Natural Language Processing (NLP) Part 2
Naive Bayes Resources:
 Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
 For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
 For an intuitive explanation of Naive Bayes classification, read this post on airport security.
 For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
 When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
 These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
 Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
 If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
 Yelp has found that Naive Bayes is more effective than Mechanical Turk workers at categorizing businesses.
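The advice above to prefer a Gaussian model for continuous features can be illustrated with a small from-scratch sketch. This is not scikit-learn's GaussianNB implementation; the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the data are hypothetical, chosen only to show the idea of modeling each feature within each class as a normal distribution.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors plus per-feature means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps variances strictly positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for x, m, var in zip(row, means, variances):
            # Log of the normal density for this feature value.
            total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up continuous data: [height_cm, weight_kg] for two hypothetical classes.
X = [[150.0, 50.0], [155.0, 55.0], [160.0, 58.0],
     [180.0, 80.0], [185.0, 88.0], [190.0, 90.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, [158.0, 54.0])
```

MultinomialNB, by contrast, assumes count-like features, which is why it suits word counts better than measurements such as these.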
Class 16: Time Series Analysis
Class Resources
 If you are interested in more resources, check out the following:
 For more information on the .dt accessor, see the Pandas datetime documentation here.
 For additional review of these concepts, see some inspiration from the Python Data Science Handbook.
 There are lots of additional tutorials on ARIMA models out there; here is a good one.
Additional Resources
 Facebook Prophet is a phenomenal package allowing powerful, fast and efficient time series analysis.
 Overview of time series applications
Class 17: Working with Data - APIs and Web Scraping
Class Resources:
 APIs (code)
 Web scraping (code)
 Autocomplete in Spyder
Web Scraping Resources:
 The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
 For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
 For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
 For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
 robotstxt.org has a concise explanation of how to write (and read) the robots.txt file.
 import.io and Kimono claim to allow you to scrape websites without writing any code.
 How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
 Be Suspicious Of Online Movie Ratings, Especially Fandango’s is an interesting example of applied web scraping from FiveThirtyEight.
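Before scraping a site, it is worth checking what its robots.txt file permits. The sketch below uses Python's standard-library `urllib.robotparser` on a made-up robots.txt; the user agent name `my-scraper` and the example.com URLs are placeholders, and no network request is made.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration; a real one is served from the site's root.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so nothing is downloaded here.
parser.parse(robots_txt.splitlines())

can_fetch_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
can_fetch_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
```

For a live site, `set_url()` followed by `read()` would fetch the real file instead of parsing a string.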
API Resources:
 Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
 API Integration in Python provides a very readable introduction to REST APIs.
 Microsoft's Face Detection API, which powers HowOld.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
Selenium Resources:
 What is Selenium
 Chromedriver download
 Selenium with Python Documentation
 Selenium Webdriver Python Tutorial For Web Automation
Additional Resources
Databases
Databases and SQL
 This GA slide deck provides a brief introduction to databases and SQL. The Python script from that lesson demonstrates basic SQL queries, as well as how to connect to a SQLite database from Python and how to query it using Pandas.
 The repository for this SQL Bootcamp contains an extremely well-commented SQL script that is suitable for walking through on your own.
 This GA notebook provides a shorter introduction to databases and SQL that helpfully contrasts SQL queries with Pandas syntax.
 SQLZOO, Mode Analytics, Khan Academy, Codecademy, Datamonkey, and Code School all have online beginner SQL tutorials that look promising. Code School also offers an advanced tutorial, though it's not free.
 w3schools has a sample database that allows you to practice SQL from your browser. Similarly, Kaggle allows you to query a large SQLite database of Reddit Comments using their online "Scripts" application.
 What Every Data Scientist Needs to Know about SQL is a brief series of posts about SQL basics, and Introduction to SQL for Data Scientists is a paper with similar goals.
 10 Easy Steps to a Complete Understanding of SQL is a good article for those who have some SQL experience and want to understand it at a deeper level.
 SQLite's article on Query Planning explains how SQL queries "work".
 A Comparison Of Relational Database Management Systems gives the pros and cons of SQLite, MySQL, and PostgreSQL.
 If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
 Blaze is a Python package enabling you to use Pandas-like syntax to query data living in a variety of data storage systems.
 A Data Engineer's Guide to Non-Traditional Data Storages
 SQL Style Guide
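As a quick illustration of connecting to a SQLite database from Python and running a basic query, here is a minimal sketch using the standard-library `sqlite3` module. The `orders` table, its columns, and its rows are made up for this example and are not from the GA lesson script.

```python
import sqlite3

# An in-memory database keeps the example self-contained (no file is created).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

# Parameterized inserts: the ? placeholders avoid building SQL by string concatenation.
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)
conn.commit()

# Aggregate spending per customer, highest first.
rows = conn.execute(
    "SELECT customer, SUM(total) AS spent "
    "FROM orders GROUP BY customer ORDER BY spent DESC"
).fetchall()
conn.close()
```

If Pandas is installed, the same query string and an open connection can be passed to `pandas.read_sql_query` to get the result as a DataFrame instead of a list of tuples.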
Advanced scikit-learn
scikit-learn Resources:
 This is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use.
 Practical Data Science in Python is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
 Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of tutorials and examples, a library of machine learning tools and extensions, a new book, and a semi-active blog.
 scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching functions and asking questions.
 If you forget how to use a particular scikit-learn function that we have used in class, remember that this repository is fully searchable!
Pipelines
 Helper functions: Pipeline, GridSearchCV
 To learn how to use GridSearchCV and RandomizedSearchCV for parameter tuning, watch How to find the best model parameters in scikit-learn (28 minutes) or read the associated notebook.
 Pipeline and FeatureUnion: combining estimators
 Feature Union with Heterogeneous Data Sources
Tidy Data
 Good Data Management Practices for Data Analysis briefly summarizes the principles of "tidy data".
 Hadley Wickham's paper explains tidy data in detail and includes lots of good examples.
 Example of a tidy dataset: Bob Ross
 Examples of untidy datasets: NFL ticket prices, airline safety, Jets ticket prices, Chipotle orders
 If your coworkers tend to create spreadsheets that are unreadable by computers, they may benefit from reading these tips for releasing data in spreadsheets. (There are some additional suggestions in this answer from Cross Validated.)
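To show what moving from an untidy to a tidy layout looks like in practice, here is a small pure-Python sketch. The teams, years, and prices are made up, and `to_tidy` is a hypothetical helper written for this example, not a function from any of the linked resources.

```python
# Untidy ("wide") layout: one column per year, so the year variable is hidden in the header.
wide_rows = [
    {"team": "Jets", "2013": 110.0, "2014": 105.0},
    {"team": "Giants", "2013": 112.0, "2014": 123.0},
]

def to_tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy

# Tidy ("long") layout: each row is one observation, each column is one variable.
tidy_rows = to_tidy(wide_rows, "team", "year", "price")
```

On a DataFrame, `pandas.melt` performs the same wide-to-long reshaping.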
Regular Expressions Resources:
 Google's Python Class includes an excellent introductory lesson on regular expressions (which also has an associated video).
 Python for Informatics has a nice chapter on regular expressions. (If you want to run the examples, you'll need to download mbox.txt and mbox-short.txt.)
 Breaking the Ice with Regular Expressions is an interactive Code School course, though only the first "level" is free.
 If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
 5 Tools You Didn't Know That Use Regular Expressions demonstrates how regular expressions can be used with Excel, Word, Google Spreadsheets, Google Forms, text editors, and other tools.
 Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis, and Emojineering explains how Instagram uses regular expressions to detect emoji in hashtags.
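As a taste of how regular expressions are used for data extraction, the sketch below pulls sender addresses out of lines formatted like a mail archive, using Python's built-in `re` module. The sample lines and addresses are made up, and the pattern is deliberately simple rather than a full email validator.

```python
import re

# Made-up lines in the style of an mbox mail file.
lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "Subject: meeting notes",
    "From bob.smith@example.org Fri Jan  4 18:10:48 2008",
    "From: alice@example.com",
]

# Match lines that start with "From " (not "From:") and capture the address:
# \S+ means one or more non-whitespace characters on each side of the @.
pattern = re.compile(r"^From (\S+@\S+)")

senders = []
for line in lines:
    match = pattern.match(line)
    if match:
        senders.append(match.group(1))
```

Here `senders` ends up holding the two envelope senders, while the `From:` header line is skipped because the pattern requires a space after `From`.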
Feature Selection
Dimensionality Reduction Resources:
 A more thorough and friendly introduction to PCA
 Chapters 6 and 10 in ISLR cover feature selection and dimensionality reduction in an accessible manner.
 A slightly more in-depth discussion of the kernel trick.
 Introduction to Principal Component Analysis