Modelling UK house prices based on features readily available on open source databases
UK House Prices.ipynb


The goal of this project is to accurately predict property prices within England and Wales with models based on features readily available on open source databases.

The first goal is to retrieve data from the Land Registry's website and analyse house prices using the limited features it makes available:

  - Postcode
  - Property type
  - Tenure
  - Build type

Once this initial analysis is complete, the next step is to pull in more features to improve and strengthen the model (in no particular order):

  - Number of bedrooms
  - Square footage
  - Proximity and abundance of transport links
  - Local wealth (Referencing information)
  - Crime rates in the area
  - Council tax banding
  - Density of properties
  - Property performance (e.g. EPC certificates)
  - Rent prices in the area
  - Green space
  - Average rainfall and/or other weather indicators


property prices -

postcode co-ordinates -

Potential Sources

train station connectivity -


geo mapping -

pricing model -

Initial Discovery

We can pull in transaction records of properties from the Land Registry's website relatively easily. The records are presented as below:

We then format these columns for further analysis, for example converting dates to datetime and deriving month values.
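The date formatting step could look something like the sketch below. The sample rows and the column names `price` and `date` are assumptions made for illustration, not the notebook's actual schema.

```python
import pandas as pd

# Hypothetical sample; the column names "price" and "date" are assumptions.
df = pd.DataFrame({
    "price": [250000, 410000, 185000],
    "date": ["2018-01-15 00:00", "2018-03-02 00:00", "2019-11-30 00:00"],
})

# Parse the raw strings into datetime values; unparseable entries become NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Derive calendar fields for grouping and plotting.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# A monthly period (e.g. 2018-01) is convenient for time-series aggregation.
df["year_month"] = df["date"].dt.to_period("M")
```

Using `errors="coerce"` keeps malformed dates visible as `NaT` rather than raising, so they can be counted before deciding how to handle them.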

As we will be looking at whether and how a property's location affects its price, we will need to run further transformations on the postcode.

To start this process, we're interested in the regions that properties belong to. I found a resource which breaks down the postcode prefix into districts and UK regions:

To merge the two, we extract the prefix from each property postcode with `df['postcode_pre'] = df['postcode'].str.upper().str.extract("([A-Z]{1,2})")`, giving us the prefixes.
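As a minimal sketch of the prefix extraction and region merge: the `regions` lookup table here is a hypothetical stand-in for the resource mentioned above, and its column names are assumptions. The regex is anchored with `^` and `expand=False` is passed so that the result is a Series. Both are small variations on the one-liner above, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical property records; one postcode is missing on purpose.
df = pd.DataFrame({"postcode": ["sw1a 1aa", "M1 1AE", "B33 8TH", None]})

# Hypothetical prefix-to-region lookup (stand-in for the external resource).
regions = pd.DataFrame({
    "postcode_pre": ["SW", "M", "B"],
    "region": ["London", "North West", "West Midlands"],
})

# Take the leading one or two letters of the upper-cased postcode.
df["postcode_pre"] = (
    df["postcode"].str.upper().str.extract(r"^([A-Z]{1,2})", expand=False)
)

# A left merge keeps every property, even when no region is found.
df = df.merge(regions, on="postcode_pre", how="left")
```

Rows with a missing postcode end up with a missing prefix and region, which makes them easy to count afterwards.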

Next we add the longitude and latitude for each postcode.
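One way to attach coordinates is a left merge on a normalised postcode key, sketched below. The `coords` table, its column names, the coordinate values and the `normalise` helper are all illustrative assumptions. Normalising the key is a design choice made here, on the assumption that spacing and casing may differ between the two files.

```python
import pandas as pd

# Hypothetical property records with inconsistent postcode formatting.
df = pd.DataFrame({
    "postcode": ["SW1A 1AA", "m1 1ae", "B33 8TH"],
    "price": [250000, 410000, 185000],
})

# Hypothetical postcode-to-coordinate table; values are placeholders.
coords = pd.DataFrame({
    "postcode": ["SW1A1AA", "M1 1AE"],
    "latitude": [51.501, 53.478],
    "longitude": [-0.142, -2.235],
})

def normalise(postcodes: pd.Series) -> pd.Series:
    """Upper-case and remove all whitespace so formatting does not block the join."""
    return postcodes.str.upper().str.replace(r"\s+", "", regex=True)

df["postcode_key"] = normalise(df["postcode"])
coords["postcode_key"] = normalise(coords["postcode"])

# validate="many_to_one" raises if the coordinate table has duplicate keys,
# which would otherwise silently duplicate property rows.
df = df.merge(
    coords[["postcode_key", "latitude", "longitude"]],
    on="postcode_key",
    how="left",
    validate="many_to_one",
)
```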

Missing Values

A number of postcodes are missing from the dataframe, and close to 90% of those missing values have the property type 'other'. As this property type is somewhat ambiguous and makes up the majority of missing postcodes, we will drop all observations with this property type.
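The check and the drop could be sketched as follows. The sample data is made up, and the column name `property_type` and the code `"O"` for 'other' are assumptions about how the data is encoded.

```python
import pandas as pd

# Hypothetical sample; "O" is assumed to be the code for property type 'other'.
df = pd.DataFrame({
    "postcode": ["SW1A 1AA", None, None, "M1 1AE", None],
    "property_type": ["F", "O", "O", "T", "D"],
})

# Share of each property type among rows with a missing postcode.
missing = df[df["postcode"].isna()]
share_by_type = missing["property_type"].value_counts(normalize=True)

# Drop every observation of the ambiguous 'other' type.
df_clean = df[df["property_type"] != "O"].copy()
```

Any rows of other property types that still lack a postcode remain in `df_clean` and need to be handled separately.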

In addition, the longitude/latitude CSV failed to join on 64,321 rows. We will need to investigate why such a large number of properties are missing longitude and latitude values.
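A possible starting point for that investigation is pandas' `indicator=True` merge option, which labels each row by where its key was found. The tables and column names below are hypothetical, and grouping unmatched rows by postcode prefix is only one of several angles worth trying.

```python
import pandas as pd

# Hypothetical property records, including an unknown and a missing postcode.
df = pd.DataFrame({"postcode": ["SW1A 1AA", "M1 1AE", "ZZ99 9ZZ", None]})

# Hypothetical coordinate table; values are placeholders.
coords = pd.DataFrame({
    "postcode": ["SW1A 1AA", "M1 1AE"],
    "latitude": [51.501, 53.478],
    "longitude": [-0.142, -2.235],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
merged = df.merge(coords, on="postcode", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# How many rows failed to join, and how many of those had no postcode at all?
n_unmatched = len(unmatched)
n_missing_postcode = int(unmatched["postcode"].isna().sum())

# Which postcode prefixes account for the remaining unmatched rows?
unmatched_areas = (
    unmatched["postcode"].str.extract(r"^([A-Z]{1,2})", expand=False).value_counts()
)
```

Separating rows with no postcode from rows whose postcode is simply absent from the coordinate file helps narrow down whether the gap is in the source data or in the lookup.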


I ran some basic visualisations to get a sense of the data and found that property prices are heavily skewed to the right. As such, I applied a mask to include only properties with a value of less than 1 million pounds.
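A minimal sketch of the skew check and the price mask, using made-up prices and an assumed `price` column:

```python
import pandas as pd

# Hypothetical prices with one very expensive outlier.
df = pd.DataFrame({"price": [120000, 185000, 250000, 410000, 975000, 3200000]})

# Sample skewness; a positive value indicates a long right tail.
skewness = df["price"].skew()

# Boolean mask keeping only properties priced under 1 million pounds.
mask = df["price"] < 1_000_000
df_under_1m = df[mask]
```

A log transform of the price is a common alternative to a hard cut-off when the tail itself is of interest, though that is not what is done here.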