Kaggle Competition Part 1
Instructor Note: This is part 1 of a multi-part assignment. You will need to break the class up into teams at your discretion.
Welcome to your first week of work at the Centers for Disease Control and Prevention (CDC). Time to get to work!
Due to the recent epidemic of West Nile Virus in the Windy City, we've had the Department of Public Health set up a surveillance and control system. We're hoping it will let us learn something from the mosquito population as we collect data over time. Pesticides are a necessary evil in the fight for public health and safety, not to mention expensive! We need to derive an effective plan to deploy pesticides throughout the city, and that is exactly where you come in!
As it's your first week on the job, let's get your development environment set up. Amongst your orientation group, we'll need you to get started by doing the following exercises. Also, see Cathy in HR about getting your benefits set up. We have a GREAT health plan!
The dataset, along with description, can be found here: https://www.kaggle.com/c/predict-west-nile-virus/
Here are your orientation assignments for your first day on the job:
- Set up a GitHub repository
- Explore the data
- Brainstorm a project roadmap
- Create a Trello board with tickets assigned to individual members of your team to keep the project organized
- Using np.corrcoef (or other correlation methods), explore correlations in the data. Document your findings.
- Commit all of your notes to the GitHub repo in a 'Research' directory
- Create a GitHub repository for the group. Each member should be added as a contributor.
- Retrieve the dataset and upload it into a directory named
- Generate a .py or .ipynb file that imports the available data.
- Describe the data. What does it represent? What types are present? What does each feature's distribution look like? Discuss these questions, and your own, with your partners, and document your conclusions.
- What kind of cleaning is needed? Document any potential issues that will need to be resolved.
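As a starting point for the import, describe, and cleaning items above, here is a minimal sketch using pandas. Everything named in it is an assumption rather than part of the assignment: `DATA_DIR` is a hypothetical placeholder for whatever directory your team agrees on, `train.csv` is assumed to be one of the files in the Kaggle download, and the placeholder tokens passed to `count_placeholders` are only examples of how missing values are sometimes encoded as strings.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: substitute the directory your team agreed on.
DATA_DIR = Path("data")


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


def count_placeholders(df: pd.DataFrame, tokens=("M", "-", "")) -> pd.Series:
    """Count cells per column whose stripped text equals a placeholder token.

    The default tokens are illustrative guesses, not documented values.
    """
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return as_text.isin(list(tokens)).sum()


# Tiny synthetic frame so the sketch runs without the Kaggle files.
demo = pd.DataFrame({"count": [1, 2, None], "reading": ["M", " M", "3"]})
print(summarize(demo))
print(count_placeholders(demo))

# Assumed file name; the block is skipped if the file is not there yet.
train_path = DATA_DIR / "train.csv"
if train_path.exists():
    train = pd.read_csv(train_path)
    print(train.shape)
    print(summarize(train))
    print(train.describe(include="all").T)
```

Running `summarize` and `count_placeholders` on every file you load gives you a quick table to paste into your notes, which is a reasonable first draft of the "potential issues" list.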
Note: Exploratory data analysis (EDA) is an important facet of data science, and depending on the role you fill, it is likely where you will spend most of your time. Knowing your data, and understanding the state of its integrity, is what makes or breaks a project. Remember: Good Model, Good Data, Good Predictions.
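For the correlation exploration in the orientation list, one possible approach is sketched below. It uses pandas `DataFrame.corr`, which computes pairwise Pearson coefficients (the same quantity `np.corrcoef` returns). The helper name `top_correlations` and the synthetic columns are hypothetical. With the real data you would pass your own frame and target column, for example the `WnvPresent` label if that is what the target is called in your copy of the training file.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, target: str, n: int = 5) -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    numeric = df.select_dtypes(include="number")  # non-numeric columns are ignored
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)


# Tiny synthetic frame so the sketch runs without the Kaggle files.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "x": x,
        "y": 2 * x + rng.normal(scale=0.1, size=200),  # strongly tied to x
        "noise": rng.normal(size=200),                 # unrelated to y
        "label": ["a"] * 200,                          # non-numeric, skipped
    }
)
print(top_correlations(demo, target="y"))
```

Pearson correlation only captures linear relationships, so a weak coefficient does not rule out a useful feature. Treat the ranking as a prompt for plots and discussion, not as a final feature selection.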
The Scientific Method
- Start up a new document and describe the following:
- What is our problem statement?
- What can we learn from the data in order to make an educated hypothesis?
- What is our hypothesis?
- Define your deliverable: what is the end result?
- Break that deliverable up into its components, and then go further down the rabbit hole until you have actionable items. Document these however you wish (GitHub, a project management tool, Post-it notes), whatever works for your team.
- Begin deciding priorities for each task. These are subject to change, but it's good to get an initial consensus. Order these priorities however you would like.
Once again, welcome to the CDC. We have high expectations for you!