
Add solutions.

ljhopkins2 committed Dec 2, 2019
1 parent 4d22365 commit 66e70239c937b377bd1845b77b8d7024d7d7a730
Showing with 527 additions and 0 deletions.
  1. +364 −0 solutions/intro_to_pickle_solutions.ipynb
  2. +163 −0 solutions/read_a_pickle_solutions.ipynb
@@ -0,0 +1,364 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import picklep\n",
"\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.linear_model import LogisticRegressionCV\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Pickling\n",
"\n",
"---\n",
"\n",
"## Learning objectives\n",
"\n",
"- Learn what serialization and deserialization are\n",
"- Learn what \"pickling\" is in Python\n",
"- Review using `with` statements to safely handle file operations\n",
"- Pickle and unpickle sklearn models in Python\n",
"\n",
"---\n",
"\n",
"## What is pickling?\n",
"\n",
"If you're talking about food, pickling is a method of preserving food for the future. If you're talking about Python, pickling is a method of preserving **objects** for the future, including functions and classes. Since sklearn models are instances of classes, that means they can be pickled.\n",
"\n",
"To pickle an object, it needs to be **serialized**. Serialization is when we transform an object into byte streams. (Byte streams are collections of bytes. One byte is made up of eight zeros or ones.) To unpickle an object so that it can be used in Python again, it needs to be **deserialized**.\n",
"\n",
"If you've ever saved your progress in a video game, you've already serialized data without knowing it! A save file is your serialized save state. When you load the save, you deserialize the data so you can resume the game right where you were before you quit.\n",
"\n",
"### Warning:\n",
"\n",
"Just like you can't open a [Pokemon: Red](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Red_and_Blue) savefile in [Pokemon: Sun](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Sun_and_Moon), you have to unpickle an object in the same version of Python that you pickled it in. \n",
"\n",
"Pickle objects can contain malevolent code. Never unpickle an object you don't trust!\n",
"\n",
"## Why pickle?\n",
"\n",
"Pickling makes a lot of sense any time you have a model you want to work with that you don't want to refit.\n",
"\n",
"If you have a model that took twelve hours to fit, you might want to analyze its residuals, work with its coefficients, or make predictions off of it. But without Pickle, you'd need to refit the model every time you restarted your notebook. Pickling the model allows you to load the fitted model _without_ needing to re-run the code where you fit it.\n",
"\n",
"Note: pickling does **not** compress your model, meaning that some pickled models can end up being fairly large file sizes. (Think of K-nearest neighbors, which requires **every** data point to be stored inside the model.)\n",
"\n",
"---\n",
"\n",
"## Pickling a simple datatype\n",
"\n",
"Before we pickle a full model, let's demonstrate pickling on a simple list.\n",
"\n",
"Create a list called `things_to_pickle` that contains some strings:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"things_to_pickle = [\"cucumbers\", \"pigs\\' feet\", \"beets\", \"a peck of peppers\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Write the pickled list to disk\n",
"\n",
"Let's review [this link](https://www.pythonforbeginners.com/files/with-statement-in-python) to go over why `with` is such a good tool for file operations.\n",
"\n",
"Let's use `with` to write the list to disk as a `.pkl` file. We'll need to use `open`, pass in a file name, and also tell Python we're **writing** to the file, and writing as **bytes**. The pickle method we'll use is called `dump`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"with open('data/things_to_pickle.pkl','wb') as pickle_out:\n",
" pickle.dump(things_to_pickle, pickle_out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Open the pickled list\n",
"\n",
"Let's use `with` to open the pickled file and save it as a new variable, `list_from_pickle`. Remember to tell Python that we're **reading** from the file, and that we're reading in **bytes**. The pickle method we'll use is called `load`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"with open('data/things_to_pickle.pkl', 'rb') as pickle_in:\n",
" list_from_pickle = pickle.load(pickle_in)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cucumbers', \"pigs' feet\", 'beets', 'a peck of peppers']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list_from_pickle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, so good!\n",
"\n",
"---\n",
"\n",
"## Pickle a fitted pipeline\n",
"\n",
"Here, we'll fit a pipeline to the Trump-Clinton corpus, then pickle the model.\n",
"\n",
"Then, we'll import the pickle in a new notebook to demonstrate how the model has been saved. Some code has been provided for you.\n",
"\n",
"### Import the data\n",
"\n",
"Here, the data is imported, and some elementary cleaning is performed:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>handle</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>The question in this election: Who can put the...</td>\n",
" <td>HillaryClinton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>If we stand together, there's nothing we can't...</td>\n",
" <td>HillaryClinton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Both candidates were asked about how they'd co...</td>\n",
" <td>HillaryClinton</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text handle\n",
"0 The question in this election: Who can put the... HillaryClinton\n",
"3 If we stand together, there's nothing we can't... HillaryClinton\n",
"4 Both candidates were asked about how they'd co... HillaryClinton"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('data/trump_clinton_tweets.csv')\n",
"df = df[df['is_retweet'] == False][['text', 'handle']]\n",
"df.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up train and test"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"X = df['text']\n",
"y = df['handle'].map(lambda x: 1 if x == 'realDonaldTrump' else 0)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiate, fit, and score a pipeline"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.9937077604288045, 0.9217330538085255)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe = Pipeline([\n",
" ('cv', CountVectorizer(min_df=3)),\n",
" ('lr', LogisticRegressionCV(cv=3, max_iter=1000))\n",
"])\n",
"\n",
"pipe.fit(X_train, y_train)\n",
"pipe.score(X_train, y_train), pipe.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Export the fitted pipeline to `my_pickles` as `pipeline.pkl`:\n",
"\n",
"Just like before, we'll use a `with` statement:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"with open('data/pipeline.pkl', 'wb') as pickle_out:\n",
" pickle.dump(pipe, pickle_out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, open the notebook called `read_a_pickle`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
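
The dump/load round trip that the notebook above walks through can be sketched outside the notebook as follows. This is a minimal sketch, not part of the committed solution: it writes to a temporary directory rather than the notebook's `data/` folder (an assumption made so it runs anywhere), and the names `payload`, `round_trip_ok`, and `is_same_object` are hypothetical additions for illustration.

```python
import os
import pickle
import tempfile

# Sample data mirroring the notebook's list; any picklable object works here.
things_to_pickle = ["cucumbers", "pigs' feet", "beets", "a peck of peppers"]

# Serialize to an in-memory bytes object to see what a "byte stream" is.
payload = pickle.dumps(things_to_pickle)

# Write to and read back from a temporary file (an assumption; the notebook
# uses 'data/things_to_pickle.pkl'). `with` closes each file handle even if
# an error occurs inside the block.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "things_to_pickle.pkl")
    with open(path, "wb") as pickle_out:
        pickle.dump(things_to_pickle, pickle_out)
    with open(path, "rb") as pickle_in:
        list_from_pickle = pickle.load(pickle_in)

# The loaded list is an equal but distinct copy of the original.
round_trip_ok = list_from_pickle == things_to_pickle
is_same_object = list_from_pickle is things_to_pickle
print(type(payload).__name__, round_trip_ok, is_same_object)
# prints: bytes True False
```

The same pattern should apply to the fitted pipeline, since `pickle.dump` and `pickle.load` take any picklable object; only the object passed in and the file name change.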