Started mr challenge

markpopovich committed Mar 13, 2019
0 parents commit c977b309c094a2d71616086fe6310803def3929c
Showing with 389 additions and 0 deletions.
  1. +382 −0 Make_regression_challenge.ipynb
  2. +7 −0 README.md
@@ -0,0 +1,382 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"from sklearn.datasets import make_regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make Regression Challenge\n",
"\n",
"The `sklearn.datasets` module includes a function called `make_regression`. From its documentation, we learn that the function generates a random regression problem: a linear combination of randomly generated features plus optional Gaussian noise.\n",
"\n",
"```\n",
"The output is generated by applying a (potentially biased) random linear\n",
"regression model with `n_informative` nonzero regressors to the previously\n",
"generated input and some gaussian centered noise with some adjustable\n",
"scale.\n",
"```\n",
"\n",
"There are several arguments that give us control of how the regression problem is generated. \n",
"\n",
"```\n",
"n_samples : int, optional (default=100)\n",
"    The number of samples.\n",
"\n",
"n_features : int, optional (default=100)\n",
"    The number of features.\n",
"\n",
"n_informative : int, optional (default=10)\n",
"    The number of informative features, i.e., the number of features used\n",
"    to build the linear model used to generate the output.\n",
"\n",
"n_targets : int, optional (default=1)\n",
"    The number of regression targets, i.e., the dimension of the y output\n",
"    vector associated with a sample. By default, the output is a scalar.\n",
"\n",
"bias : float, optional (default=0.0)\n",
"    The bias term in the underlying linear model.\n",
"\n",
"effective_rank : int or None, optional (default=None)\n",
"    if not None:\n",
"        The approximate number of singular vectors required to explain most\n",
"        of the input data by linear combinations. Using this kind of\n",
"        singular spectrum in the input allows the generator to reproduce\n",
"        the correlations often observed in practice.\n",
"    if None:\n",
"        The input set is well conditioned, centered and gaussian with\n",
"        unit variance.\n",
"\n",
"tail_strength : float between 0.0 and 1.0, optional (default=0.5)\n",
"    The relative importance of the fat noisy tail of the singular values\n",
"    profile if `effective_rank` is not None.\n",
"\n",
"noise : float, optional (default=0.0)\n",
"    The standard deviation of the gaussian noise applied to the output.\n",
"\n",
"shuffle : boolean, optional (default=True)\n",
"    Shuffle the samples and the features.\n",
"\n",
"coef : boolean, optional (default=False)\n",
"    If True, the coefficients of the underlying linear model are returned.\n",
"\n",
"random_state : int, RandomState instance or None, optional (default=None)\n",
"    If int, random_state is the seed used by the random number generator;\n",
"    If RandomState instance, random_state is the random number generator;\n",
"    If None, the random number generator is the RandomState instance used\n",
"    by `np.random`.\n",
"```\n",
"\n",
"Let's all generate our regression problem with the same parameters, including `random_state`, so that everyone's data is identical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fixed arguments and random_state so every run produces the same dataset.\n",
"mr = make_regression(n_samples=10000,\n",
"                     n_features=100,\n",
"                     n_informative=20,\n",
"                     n_targets=1,\n",
"                     bias=0.0,\n",
"                     effective_rank=10,\n",
"                     tail_strength=0.5,\n",
"                     noise=0.25,\n",
"                     shuffle=True,\n",
"                     random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Further down in the documentation, we can see what the function returns. \n",
"\n",
"```\n",
"Returns\n",
"-------\n",
"X : array of shape [n_samples, n_features]\n",
"    The input samples.\n",
"\n",
"y : array of shape [n_samples] or [n_samples, n_targets]\n",
"    The output values.\n",
"```\n",
" \n",
"We see it returns an X array and a y array. Let's look at `mr` to see what this means. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# type \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`mr` is a tuple."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# len \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It contains 2 elements, which matches the Returns section of the documentation. We would expect one element to be the X array and the other to be the y array. Let's check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# type of both elements\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# shape of both elements\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both elements are arrays, as expected. The first has 10000 rows and 100 columns, and the second is a one-dimensional array with 10000 entries, one target value per row. We can put both of those directly into DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# X and y DataFrames\n",
"\n",
"\n",
"# X and y shapes\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# X head\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# y head\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, we have our dataset loaded. Now we can start solving the problem. Specifically, let's use the tools we have learned so far to work through it step by step. \n",
"\n",
"Below is a `to do` list of tasks we should accomplish at the very outset. \n",
"\n",
"```\n",
"To do: \n",
"    - Investigate descriptive statistics and check datatypes.\n",
"    - Look at correlation of features and target.\n",
"    - Visualize the data. We could try subsets of pairplots with this dataset to see if we observe clear linear relationships.\n",
"    - Train test split the dataset to prepare for modeling. \n",
"    - Build a basic MLR using all the features, establish baseline R2 and Adj R2. \n",
"    - Try an MLR using only features with high correlation to target.\n",
"        - Check R2 score, Adj R2 score\n",
"        - Investigate p-values: are any of our features not statistically significant, or are we doing okay already?\n",
"    - Optional: Implement some basic feature selection algorithms from the [sklearn feature selection library](https://scikit-learn.org/stable/modules/feature_selection.html)\n",
"    - Visualize predicted vs true values\n",
"    - When we are satisfied: Print our established function and then check against the true function.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Investigate descriptive statistics and check datatypes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Look at correlation of features and target."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Visualize the data. \n",
"\n",
"We could try subsets of pairplots with this dataset to see if we observe clear linear relationships."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Train test split the dataset to prepare for modeling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Build a basic MLR using all the features, establish baseline R2 and Adj R2. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Try an MLR using only features with high correlation to target.\n",
"\n",
"- Check R2 score, Adj R2 score\n",
"- Investigate p-values: are any of our features not statistically significant, or are we doing okay already?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Optional: Implement some basic feature selection algorithms from the [sklearn feature selection library](https://scikit-learn.org/stable/modules/feature_selection.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Visualize predicted vs true values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### When we are satisfied: Print our established function and then check against the true function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:dsi]",
"language": "python",
"name": "conda-env-dsi-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,7 @@
### Make Regression Challenge

In this notebook, we will use the built-in sklearn dataset creation function, `make_regression`.

This function can be imported using `from sklearn.datasets import make_regression`.
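
As a minimal sketch of what the notebook's first cells do, the snippet below calls `make_regression` with the same arguments as the notebook and inspects the result. The column names (`x0`, `x1`, ...) and the variable names `X_arr` and `y_arr` are assumptions made here for readability; the notebook leaves those cells for you to fill in.

```python
import pandas as pd
from sklearn.datasets import make_regression

# Same arguments as the notebook's make_regression cell.
mr = make_regression(n_samples=10000, n_features=100, n_informative=20,
                     n_targets=1, bias=0.0, effective_rank=10,
                     tail_strength=0.5, noise=0.25, shuffle=True,
                     random_state=42)

# With coef left at its default (False), the result is an (X, y) tuple.
print(type(mr), len(mr))
X_arr, y_arr = mr
print(X_arr.shape, y_arr.shape)

# Hypothetical column names, chosen here only to make the frame readable.
X = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(X_arr.shape[1])])
y = pd.Series(y_arr, name="y")
print(X.head())
print(y.head())
```

Because `random_state` is fixed, rerunning the call with the same scikit-learn version should reproduce the same arrays.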

Let's see if we can solve this regression problem using the skills we have learned so far this week.
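
One possible starting point for the baseline step on the notebook's to-do list is sketched below: a train/test split followed by a multiple linear regression on all features, with R2 and adjusted R2 computed on the held-out data. The 25% test size and the split seed are assumptions, not choices made in the notebook, and this is only one of several reasonable ways to set up the baseline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same dataset as the notebook (identical arguments and random_state).
X, y = make_regression(n_samples=10000, n_features=100, n_informative=20,
                       n_targets=1, bias=0.0, effective_rank=10,
                       tail_strength=0.5, noise=0.25, shuffle=True,
                       random_state=42)

# Assumed split: 75% train, 25% test, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Baseline multiple linear regression using every feature.
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)

# Adjusted R2 penalizes R2 for the number of predictors p given n samples.
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R2: {r2:.4f}  Adjusted R2: {adj_r2:.4f}")
```

Later steps, such as refitting on only the highly correlated features, can be compared against these two numbers.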
