Course information

Welcome to Data Science

  1. Welcome
  2. Your Team
  3. Course Overview
  4. Course Schedule
  5. Projects
  6. Tech Requirements
  7. Classroom Tools
  8. Student Expectations
  9. Office Hours
  10. Student Feedback

Course Overview

Welcome to the part-time Data Science course at General Assembly! We are building a global community of lifelong learners who are excited about using data to solve real-world problems.

In this program, we will use Python to explore datasets, build predictive models, and communicate data-driven insights. Specifically, you will learn how to:

  • Define many of the approaches and considerations that data scientists use to solve real-world problems.
  • Perform exploratory data analysis with powerful programmatic tools in Python.
  • Build and refine basic machine learning models to predict patterns from datasets.
  • Communicate data-driven insights to peers and stakeholders in order to inform business decisions.

What You Will Learn

  • Statistical Analysis with Python:
    • Perform visual and statistical analysis on data using Python and its associated libraries and tools.
  • Data-Driven Decision-Making:
    • Define and determine the trade-offs involving feature selection, model accuracy, and data quality.
  • Machine Learning & Modeling Techniques:
    • Explore supervised learning techniques, including classification, regression, and decision trees.
  • Visualizations & Presentations:
    • Create visualizations and interactive notebooks to present to industry stakeholders.

Class Zoom Link

https://generalassembly.zoom.us/j/994261580

Class Exit Ticket

https://www.surveymonkey.com/r/QL8586V?Cohort_ID=BAH041-ONLINE-DS-1

Previous Class Recordings

Access previous class recordings here

Finding Datasets with Google

Try Google's new dataset search, recently released in beta.

Python Version

The curriculum materials for this course are written in Python 3.6.


Your Instructional Team

Instructor: Steven Longstreet

I.A.: John Whitesell

I.A.: Ed Salinas


Curriculum Structure

General Assembly's part-time Data Science materials are organized into four units.

| Unit | Title | Topics Covered | Length |
| --- | --- | --- | --- |
| Unit 1 | Data Foundations | Python Syntax, Development Environment | Lessons 1-4 |
| Unit 2 | Working with Data | Stats Review, Visualization, & EDA | Lessons 5-9 |
| Unit 3 | Data Science Modeling | Regression, Classification, & KNN | Lessons 10-14 |
| Unit 4 | Data Science Applications | Decision Trees, NLP, & Flex Topics | Lessons 15-19 |

Lesson Schedule

Here is the schedule we will follow for our part-time Data Science course:

| Date | Lesson | Unit Number | Session Number |
| --- | --- | --- | --- |
| 9/25 | Welcome to Data Science | Unit 1 | Session 1 |
| 9/27 | Your Development Environment | Unit 1 | Session 2 |
| 10/2 | Python Foundations | Unit 1 | Session 3 |
| 10/4 | Exploratory Data Analysis in Pandas | Unit 1 | Session 4 |
| --- | --- | --- | --- |
| 10/9 | [FLEX: Project Workshop + Presentations] | Unit 2 | Session 5 |
| 10/11 | Data Visualization in Python | Unit 2 | Session 6 |
| 10/16 | Statistics in Python | Unit 2 | Session 7 |
| 10/18 | Experiments & Hypothesis Testing | Unit 2 | Session 8 |
| 10/23 | Linear Regression | Unit 2 | Session 9 |
| --- | --- | --- | --- |
| 10/25 | Logistic Regression | Unit 3 | Session 10 |
| 10/30 | Train-Test Split & Bias-Variance | Unit 3 | Session 11 |
| 11/1 | KNN / Classification | Unit 3 | Session 12 |
| 11/6 | Clustering | Unit 3 | Session 13 |
| 11/8 | Decision Trees | Unit 3 | Session 14 |
| --- | --- | --- | --- |
| 11/13 | Intro to Natural Language Processing | Unit 4 | Session 15 |
| 11/15 | Intro to Time Series | Unit 4 | Session 16 |
| 11/27 | Working With Data: APIs | Unit 4 | Session 17 |
| 11/29 | Flex - Neural Nets & Assessment | Unit 4 | Session 18 |
| 12/4 | Final Project Presentations - Part 1 | Unit 4 | Session 19 |
| 12/6 | Final Project Presentations - Part 2 | Unit 4 | Session 20 |

Project Structure

This course will ask you to complete a series of projects in order to practice and apply the skills covered in class.

Unit Projects

At the end of each Unit, you'll work on short structured projects. These activities will test your understanding of that unit’s most important concepts with in-class practice and instructor support.

For those of you who want to go above and beyond, we’ve also included stretch options, bonus activities, and other opportunities for further reading and practice.

Final Project

You'll also complete a final project that asks you to apply your skills to a real-world or business problem of your choice.

The capstone is an opportunity for you to demonstrate your new skills and tackle a pressing issue relevant to your life, industry, or organization. You’ll create a hypothesis, analyze internal data, and generate a working model, prototype, solution, or recommendation.

You will get structured guidance and designated time to work throughout the course. Final project deliverables include:

  • Proposal: Describe your chosen problem and identify relevant data sets (confirming access, as needed).
  • Brief: Share a summary of your initial analysis and your next steps with your instructional team.
  • Report: Submit a cleanly formatted Jupyter notebook (or other files) documenting your code and process for technical/peer stakeholders.
  • Presentation: Present a summary of your business problem, approach, and recommendation to an audience of non-technical executive stakeholders.

Project Breakdown

  1. Project 1: Python Technical Code Challenges (Due 10/9)
  2. Project 2: Exploratory Data Analysis (Due 10/25)
  3. Project 3: Modeling Practice (Due 11/15)
  4. Project 4: Final Project
    • Part 1: Proposal + Dataset (Due 10/9)
    • Part 2: Initial EDA Brief (Due 10/25)
    • Part 3: Technical Report (Due 12/6)
    • Part 4: Presentation (Due 12/6)

Project Schedule

  • Project 1: Due @ End of Unit 1 (Due 10/9)
  • Project 2: Due @ End of Unit 2 (Due 10/25)
  • Project 3: Due @ End of Unit 3 (Due 11/15)
  • Project 4 (Final):
    • Proposal + Dataset: Due @ End of Unit 1 (Due 10/9)
    • Initial EDA Brief: Due @ End of Unit 2 (Due 10/25)
    • Technical Report: Due @ End of Unit 4 (Due 12/6)
    • Presentation: Due @ End of Unit 4 (Due 12/6)

Technology Requirements

Hardware

  1. At least 8 GB of RAM
  2. 10 GB of free hard drive space (after installing Anaconda)

Software

  1. Download and Install Anaconda with Python 3.6.

Note: Anaconda provides support for two different versions of Python. Make sure to install the "Python 3.6" version.
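
After installing, one quick way to confirm which interpreter you are running is a short check like the sketch below. It assumes that any version at or above 3.6 is acceptable, which is our reading of the requirement above rather than an official rule. The function name `meets_requirement` is made up for illustration.

```python
import sys

REQUIRED = (3, 6)  # assumed minimum (major, minor) version for the course


def meets_requirement(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given version is at least the required (major, minor)."""
    return tuple(version_info[:2]) >= required


current = "{}.{}.{}".format(*sys.version_info[:3])
if meets_requirement():
    print("Python {} looks new enough for the course materials.".format(current))
else:
    print("Python {} is older than 3.6; check your Anaconda install.".format(current))
```

Run it from the same terminal or notebook you plan to use in class, since a machine can have more than one Python installed.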

PC only

  • Install Git Bash
  • Two things to reduce confusion:
    • Many first-time Git users find Vim confusing. If you do not wish to learn Vim commands, select an alternative text editor such as Nano.
    • If you select "Use Git and optional Unix tools from the Windows Command Prompt", you will be able to use the command prompt for all Git-related tasks in this course.

Browser

  • Google Chrome

Miscellaneous

  • Text editor (we recommend Atom)

Slack

We'll use Slack as our class communications platform. Slack is a messaging platform where you can chat with your peers and instructors. We will use Slack to share information about the course, discuss lessons, and submit projects. Our Slack homepage is Reston-mw-july2018.

Pro Tip: If you've never used Slack before, check out these resources:


Expectations

  • Participate in classroom discussions & complete class feedback forms
  • Be respectful to others
  • Miss no more than 2 classes
  • Complete all unit assignments
  • Complete the final project

Office Hours

Every week, your instructional team will hold office hours where you can get in touch to ask questions about anything relating to the course. This is a great opportunity to follow up on questions or ask for more details about any topics covered so far.

  • Instructor's Office Hours - as needed
  • John Whitesell Office Hours - Friday 5-8 PM EST, Sunday 8 AM-2 PM EST. https://johnfwhitesell.youcanbook.me/
  • Ed Salinas Office Hours - Tuesday nights (after class) and Sunday afternoons (3-6 PM EST) for questions about projects or course content; I can be booked here for Zoom sessions.

The team is also generally available via Slack. If you need to meet outside of office hours, let us know.


Student Feedback

Throughout the course, you'll be asked to provide feedback about your experience. This feedback is extremely important, as it helps us provide you with a better learning experience.

The exit ticket is available at this link: https://bit.ly/2IWVHQw


Class Resources

Class 1: What is Data Science

Python Resources:

  • Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
  • DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
  • Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
  • A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
  • Python for Informatics: A very beginner-oriented book, with associated slides and videos.
  • Python Tutor: Allows you to visualize the execution of Python code.
  • My code isn't working is a great flowchart explaining how to debug Python errors.
  • PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.

Advanced Python Material:

Resources:

Material for Next Class:


Class 2: Git, GitHub, and the Command Line

Class Resources: Set your Git username and email

Command Line Resources:

Git and Markdown Resources:


Class 3: Python Foundations

For more information on this topic, check out the following resources:

Class Challenge Exercises

Some extra exercises to practice on!

Starter Exercises

  1. Capture all of the numbers from 1-1000 that are divisible by 7
  2. Capture all of the numbers from 1-1000 that have a 3 in them
  3. Provide the count of number of spaces in a string
  4. Remove any vowels in a string
  5. Capture all of the words in a string less than 4 letters
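
If you get stuck, here is one possible approach to the first three starter exercises, using list comprehensions and a built-in string method. It is a sketch rather than the only right answer, and the sample sentence is a made-up example.

```python
# Exercise 1: numbers from 1 to 1000 (inclusive) that are divisible by 7.
divisible_by_7 = [n for n in range(1, 1001) if n % 7 == 0]

# Exercise 2: numbers from 1 to 1000 whose decimal digits include a 3.
contains_3 = [n for n in range(1, 1001) if "3" in str(n)]

# Exercise 3: count the spaces in a string (hypothetical sample sentence).
sentence = "data science is fun"
space_count = sentence.count(" ")

print(divisible_by_7[:5])  # [7, 14, 21, 28, 35]
print(contains_3[:5])      # [3, 13, 23, 30, 31]
print(space_count)         # 3
```

Try exercises 4 and 5 on your own with the same pattern: loop over the characters or words, and keep only the ones that pass a condition.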

Challenge Exercises

  1. Use a dictionary comprehension to count the length of each word in a sentence.
  2. Use a nested list comprehension to find all of the numbers from 1-500 that are divisible by any single digit except 1
  3. Use a nested list/dictionary comprehension to find the highest single digit any number in the range 1-1000 is divisible by
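
As a starting point for the first challenge, a dictionary comprehension can map each word to its length. This is a minimal sketch with a made-up sentence; it splits on whitespace only, so punctuation would stay attached to words.

```python
sentence = "the quick brown fox jumps"  # hypothetical sample sentence

# Build {word: length} in one pass over the words.
word_lengths = {word: len(word) for word in sentence.split()}

print(word_lengths)
# {'the': 3, 'quick': 5, 'brown': 5, 'fox': 3, 'jumps': 5}
```

Note that dictionary keys are unique, so a word that appears twice in the sentence ends up as a single entry.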

Class 4: Exploratory Data Analysis with Pandas

Class Resources:

Pandas Resources:

For more information on this topic, check out the following resources:


Class 5: Visualizations

For more information on this topic, check out the following resources:

Seaborn Resources:


Class 6: Flex Lesson - Reinforcement & In-Class Practice


Class 7: Statistics in Python

Additional Resources

For more information on this topic, check out the following resources:

Statistical References:

Statistics Resources:

Class 8: Experiments & Hypothesis Testing

For more information on this topic, check out the following resources:


Class 9: Linear Regression

Linear Regression Resources:

Regularization Resources:


Class 10: Logistic Regression

For more information on this topic, check out the following resources:

Logistic Regression Resources:


Class 11: Train-Test Split & Bias-Variance Trade-Off

For more information on this topic, check out the following resources:


Class 12: K-Nearest Neighbors (KNN), Classifiers, Preprocessing, and GridSearch

For more information on this topic, check out the following resources:

KNN Resources:

scikit-learn Resources:

Fundamental Statistics

Class 13: Clustering

Clustering Resources:

Additional Resources


Class 14: Decision Trees and Random Forests

Decision Trees Resources

Ensembling Resources:

Class Resources

  • CHAPTER 9 - Elements of Statistical Learning - This book is by most of the same authors as the previous book, but goes into more detail. PDF available to download on the website.
  • CHAPTER 8 - Applied Predictive Modeling - While this book features R code, its discussion of different predictive models and sampling methodologies is hard to beat.

Class 15: Natural Language Processing (NLP) 1

NLP Resources:


Natural Language Processing (NLP) Part 2

Naive Bayes Resources:

  • Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
  • Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
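
To make the Gaussian Naive Bayes idea above more concrete, here is a minimal plain-Python sketch of what such a classifier does with continuous features: it estimates a mean and variance per feature for each class, then picks the class with the highest log posterior. This is an illustration only, not scikit-learn's GaussianNB implementation; the function names and the toy data are made up.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance, plus a tiny constant to avoid dividing by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            # Log of the normal density with this class's mean and variance.
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mean) ** 2 / (2 * var)
        return total

    return max(model, key=log_posterior)


# Hypothetical toy data: [height_cm, weight_kg] for two made-up classes.
X = [[150.0, 50.0], [155.0, 54.0], [160.0, 58.0],
     [175.0, 75.0], [180.0, 82.0], [185.0, 88.0]]
y = ["small", "small", "small", "large", "large", "large"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [152.0, 52.0]))  # small
print(predict_gaussian_nb(model, [182.0, 84.0]))  # large
```

Working in log space avoids multiplying many tiny probabilities together. For real projects, prefer the GaussianNB class in scikit-learn over a hand-rolled version like this one.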

Class 16: Time Series Analysis

Class Resources

  • If you are interested in more resources, check out the following:
    • In Pandas' datetime library, search for more information on .dt here.
    • For additional review of these concepts, see some inspiration from the Python Data Science Handbook.
    • There are lots of additional tutorials on ARIMA models out there; here is a good one.

Additional Resources


Class 17: Working with Data - APIs and Web Scraping

Class Resources:

Web Scraping Resources:

API Resources:

Selenium Resources:

Additional Resources

Databases

Databases and SQL


Advanced scikit-learn

scikit-learn Resources:

  • This is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use.
  • Practical Data Science in Python is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
  • Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of tutorials and examples, a library of machine learning tools and extensions, a new book, and a semi-active blog.
  • scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching functions and asking questions.
  • If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!
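
For a sense of what feature scaling computes, the sketch below shows two common types of scaling in plain Python: standardization (zero mean, unit standard deviation) and min-max scaling (onto the interval 0 to 1). It is an illustration under simple assumptions, with made-up function names and data; in practice you would likely reach for scikit-learn's StandardScaler or MinMaxScaler instead.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale values linearly onto the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


incomes = [30000, 45000, 60000, 75000, 90000]  # hypothetical feature column

print(standardize(incomes))
print(min_max_scale(incomes))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Scaling matters most for distance-based models such as KNN, where a feature measured in large units would otherwise dominate the distance calculation.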

Pipelines


Tidy Data

Regular Expressions Resources:

Feature Selection

Dimensionality Reduction Resources: