Skip to content
No description, website, or topics provided.
Jupyter Notebook
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
code
data
.DS_Store
Project 5_ Predicting Affluence.pdf
README.md
irs_data_dictionary.docx

README.md

Utilizing Yelp Price Estimates to Identify Neighborhood Affluency

By: Vanessa Alvarado, Leland Roberts, Danielle Mizrachi, Jessica Ertel

Problem Statement

The organization New Light Technologies (NLT) provides solutions to government, commercial, and non-profit clients. NLT is a team of dedicated technologists, scientists, engineers, designers, and strategists working on some of the most interesting, challenging, and important assignments in the world, ranging from disaster response to enabling growing telecommunications networks to providing healthcare to Americans.

New Light Technologies was contacted by The Wildflower Cafe, a high end restaurant looking to strategically expand their business to new areas of Connecticut. In order to identify the best location, they're interested in obtaining an estimate of affluence based on the existing number and type of restaurants across different towns and cities. To tackle this request, NLT is utilizing big data related to commercial activity and cost of product and services as an indicator for affluence.

Yelp is a business directory and crowd-sourced review platform that ranks businesses on a 4-point pricing scale ($, $$, $$$, $$$$). The NLT team decided to adopt a unique approach and use Yelp cost estimates to determine the affluence of particular cities and towns in Connecticut. Our success can be validated if we are able to answer the following:

  1. Can Yelp price estimates for restaurants in Connecticut accurately predict median household income, mean household income, per capita income or adjusted gross income?
  2. Can we confidently advise The Wildflower Cafe which city to expand their business?

Table of Contents

  1. Data Gathering
    1. Yelp
  2. Data Cleaning
    1. Yelp
    2. IRS
    3. CT Gov
  3. Combine Data
  4. Exploratory Data Analysis
  5. Modelling

Data Dictionary

Feature Type Description
zipcode object 5-digit zip code for cities and towns in Connecticut
income_bucket int Size of adjusted gross income, binned into six categories*
n_returns int Number of returns filed during the 12-month period of January 1, 2018 to December 31, 2018
agi int Adjusted gross income
total_income int Total income amount
city object City in Connecticut
price float Average price category of restaurants in each city
num_restaurants float The number of restaurants in each city (from Yelp data)
y float Weighted average of each IRS income category weighted by the number of returns in each city
1 float Percent of restaurants in a given city that fall into the $ price category
2 float Percent of restaurants in a given city that fall into the $$ price category
3 float Percent of restaurants in a given city that fall into the $$$ price category
4 float Percent of restaurants in a given city that fall into the $$$$ price category
Median Household Income int Median household income as defined by CT gov
Mean Household Income int Mean household income as defined by CT gov
Per Capita Income int Per capita income as defined by CT gov

*Binned adjusted income categories (per IRS determinations)
1 = $1 under $25,000
2 = $25,000 under $50,000
3 = $50,000 under $75,000
4 = $75,000 under $100,000
5 = $100,000 under $200,000
6 = $200,000 or more

Data Files

Executive Summary

We cleaned and analyzed the three datasets then modeled using the following process:

Data Acquisition

The Internal Revenue Service (IRS) produces state income tax statistics by zip code. We used the most recent data available from Connecticut from the year 2017. See the IRS documentation for more details.

The Connecticut Department of Economic and Community Development produces housing and income data on Connecticut, including per capita income, median household income and median family income at the state, county and town level. The most recent data available is from the year 2015.

Yelp data was gathered using the Yelp API. We developed a function that allowed us to hit the daily limit of 5,000 API calls per 24 hours and gathered a data on restaurants within each of Connecticut's zipcodes. This did result in the gathering of duplicate locales, but after dropping duplicates we were left with a total of 3,729 rows.

Data Cleaning

The Internal Revenue Service (IRS)

  • Convert numerical columns from string to float datatype
  • Convert zip codes to string and ensure all have 6 digits
  • Map numerical values to gross income bins as designated by IRS
  • Drop rows for zip code totals
  • Add a column that identifies city based on zip code
  • Drop rows with zip code 00000 and 99999

The Connecticut Department of Economic and Community Development

  • Remove commas from numeric values
  • Bucket median and mean household income (using same categories as IRS dataset)

Yelp

  • Drop duplicates
  • Map price values to numeric values (1-4)
  • Update incorrectly formatted city names to match IRS/CT Gov data
  • Aggregated restaurants by city

Feature Engineering

We came up with two separate sets of features to use in our model:

  1. Average price category of restaurants in each city
  2. We calculated a normalized count per category per city. We combined the engineered features with the IRS and CT Gov data, resulting in four datasets for modeling. We also had to engineer our predicted value, which involved creating a weighted average of each IRS income category weighted by the number of returns in each city.

Modeling

Our baseline model had a high score of 78% with very imbalanced classes. When the weighted averages were calculated, the highest (over $200,000) and lowest (under $25,000) income classes were not represented. We Gridsearched and fit a Logistic Regression model. Our best estimator had a train score of 77.7% and test score of 79.5%. Our model predicted income class 3 (the majority class) for every city.

Our next approach involved using both regression and classification models on the CT Gov dataset. This dataset also had imbalanced classes and therefore did not result in better scores. As a result we decided to use linear regression models using per capita income and achieved our best performing model with a train r-squared score of .19 and test score of .16.

Conclusions and Limitations

The biggest limitation we faced was the size of our dataset. We were unable to pull additional data from Yelp because of their API limit and we chose to aggregate restaurants by city, leaving us with only 160 observations to work with. This is not nearly enough data!

The number of zip codes we gathered from each source varied. Yelp collects restuarants from more zip codes than the IRS or CT Gov sources contained because it sorts towns into subsections. We were forced to drop this added level of specificity from the analysis when modeling with the CT Gov data, which did not provide data on the same number of zipcodes. Additionally, when scraping data from Yelp by zipcode, the API pulled restaurants within a certain radius of that zipcode. This resulted in gathering many duplicates and uncertainty around completeness, especially since the API capped at 50. The reason for the lack of data was not due to the API limit, but rather due to the limited amount of cities in Connecticut that we could include in our analysis. Therefore, we cannot confidently advise on where The Wildflower Cafe should expand their business.

How does Yelp define their pricing scale and is this consistent across different parts of the U.S.?
Evaluating the distribution of median household income in light of the poor modeling results led us to some fundamental questions: Is the price range for each category consistent across different regions? How and who actually classifies restaurant prices? In other words, does $$$ mean the same in a low and high income city? The answer is astounding! The dollar sign classifications are defined by customers themselves, as part of a survey that is collected when a customer posts a review of the restaurant. The fact that this information is crowd sourced and somewhat subjective is interesting, yet problematic for our investigation. Using Yelp pricing scales might be a better indication of how residents of a certain city rate the cost of product and services in terms of their personal affluence. From this analysis, it is not an accurate predictor of median household income, mean household income, per capita income or adjusted gross income.

Future Work

For future work, we would gather additional data from different sources and further investigate the data that is available by the IRS. We would expand our search radius to inlcude the North East region, East Coast, or potentially include the U.S. in its entirety. To gather that data we would utilize the U.S Census API to aggregate median, mean and per-capita income. Once we have that data in hand we would have to increase our radius to attain a larger dataset of Yelp data. Having a more robust dataset would allow us to do more complex modelling.

You can’t perform that action at this time.