Data, python and visualisation enthusiast. Keen cyclist. Husband and father of three children.
postgis
postgres
ireland
property
qgis
]
Published Mar 15, 2020
It’s been a few weeks since my last post. Previously I focused on the introduction of PostGIS concepts, and the walking distance from the Luas. Here I’ll analyse a property transactions dataset for County Wicklow in Ireland. I’ll introduce Docker and how it can be used locally to store geo data, analyse it with Python Pandas/Matplotlib using spatial functions, and visualise with QGIS.
My wife and I owned a property in County Wicklow, Ireland from 2005 to 2019. In 2014, we became accidental landlords when be relocated to County Wexford. We’d followed the property market closely before selling in late 2019. We had purchased the property in 2005 close to the peak of the Celtic Tiger.
Since 2010, each property sale in the Republic of Ireland has been published on the Property Price Register. This dataset includes address and price information. In Wicklow at the time of writing had 12,698 residential transactions listed. I downloaded all records and to view these on a map or carry out spatial analysis I needed to geocode them. Geocoding is a process to convert addresses to locations (lat/lon). There are a number of freemium and commercial tools available for this including Autoaddress, Google, and HERE to name just a few. I’ve used Google’s geocoder as there’s a $200 free monthly credit which allowed me to geocode the 12k records free of charge.
Docker allows you to build, run and share applications with containers. Installation details can be found on the Docker Documentation page. An example image available on Docker hub is a Postgres/Postgis Image. Using this you can have a database running within a container on your local machine.
After installing Docker. Create a folder (e.g. Wicklow_PPR) with a Dockerfile that has the following content:
version: '3.2'
services:
db:
image: kartoza/postgis
environment:
- POSTGRES_USER=admin
- POSTGRES_PASS=superSecretPassword!
- POSTGRES_DBNAME=db
- POSTGRES_MULTIPLE_EXTENSIONS=plpgsql,postgis,hstore,postgis_topology,pgrouting,cube
- ALLOW_IP_RANGE=0.0.0.0/0
ports:
- 25432:5432
command: sh -c "echo \"host all all 0.0.0.0/0 md5\" >> /etc/postgresql/10/main/pg_hba.conf && /start-postgis.sh"
Then is as easy as running a docker-compose command which will create a container with a postgis-enabled postgres instance. Once spun up you can connect to it using your preferred SQL client or using the cli:
# Docker Compose
docker-compose up -d --build db
# Access the database using the cli
psql --host localhost --port 25432 --username admin --dbname ddb
Pandas and Matplotlib are two popular Python packages for carrying out data analysis and visualising results. You can import various file types into Pandas including csv, xls, json etc. creating a so called dataframe. You can manipulate the dataframe like you would a sheet in Excel or table in SQL - selecting, filtering, joining, pivoting, filling missing data, aggregating etc.
QGIS is an open source Geographic Information System available available to download and install on all major operating systems. There’s extensive documentation on using it and on GIS in general.
QGIS can be used amongst other things to display points and polygons on top of base maps. The base map I’ve used here comes from OpenStreetmap. Details of using OSM Tile Servers can be found here. The standard osm basemap includes roads and rail networks, and water features.
I ended up with two tables in the Postgis-enabled Postgres database. There’s a one-to-one relationship between the the address field in the PPR table and the input_address field of the geocode table.
Using just the PPR dataset we can see the price trends over time.
I’ve use a Box plot to
illustrate the trend (See code below). The median price (green line) has
steadily improved from 2013 onwards. Unfortunately, I don’t have data from the
Celtic Tiger period (1995-2007) to compare. The quantity of transactions
has also increased as more confidence in the market has returned and as more
accidental landlords are now able to exit more easily. The PPR dataset
also includes dimensions for Vat Exclusive and Not full market value
(reasons not listed). New property transactions are exempt from VAT at
13.5%.
The size of the boxes were adjusted to reflect the number of transactions in the that year (Sample size in Boxplots).
Vat exclusive | Not full market value | Count |
---|---|---|
No | No | 9,659 |
Yes | No | 2,392 |
No | Yes | 477 |
Yes | Yes | 150 |
Taking a closer look at the geocoding results we can use the ‘location_type’ field. The table below shows that over 65% of geocoding processes resulted in an exact location been returned.
Some notes on Google’s Geocoder, I provided a bounding box:
{
"southwest": "51.0,-11.0",
"northeast": "56.0,-5.7"
}
and a region (“ie”) to ensure Irish results were preferred (More details can be found here). I also filtered out any record that did not return with Ireland as the value of country. More details of the code to geocode addresses will be the focus of another post.
location_type | Count |
---|---|
APPROXIMATE | 1,368 |
GEOMETRIC_CENTER | 2,280 |
RANGE_INTERPOLATED | 744 |
ROOFTOP | 8,247 |
Here are the transactions plotted on a map coloured by ‘location_type’ field. The APPROXIMATE results are often in the countryside where addresses are often non-unique.
The analysis gets more interesting when we combine both the PPR and Geocoding results together. The price field is a continuous variable ranging from €11,000 to €1,850,000. To display this on a map it is convenient to convert to a dimension of price bins. Here we can easily see the most expensive properties are location around the Bray/Greystones area.
The bin sizes were automatically defined with QGIS using a graduated scheme of 10 evenly sized buckets:
There is some obvious clusters of property transactions within County Wicklow based around the population centres such as Bray, Greystones, Wicklow Town, Arklow, Baltinglass, Blessington etc. Using a k-means algorithm following an example on stackexchange I clustered the data into 15 groups (arbitrarily set number) and I gave each group a label (largest town name). This allowed me to repeat the time series analysis but one level deeper.
CREATE OR REPLACE VIEW wicklow_clusters AS
SELECT
kmean,
count(*),
ST_MinimumBoundingCircle(ST_Union(geom)) AS circle,
sqrt(ST_Area(ST_MinimumBoundingCircle(ST_Union(geom))) / pi()) AS radius,
ST_SetSRID(ST_Extent(geom), 4326) as bbox,
ST_CollectionExtract(ST_Collect(geom),1) AS multipoint,
ST_ConvexHull(ST_Collect(geom)) AS convexhull
FROM
(
SELECT
ST_ClusterKMeans(geom, 15) OVER() AS kmean,
ST_Centroid(geom) AS geom
FROM geocode g
INNER JOIN ppr s ON s.id = g.nk
WHERE
source = 'ppr'
AND s.address5 = 'Wicklow'
AND g.administrative_area_level_1 = 'County Wicklow'
) tsub
GROUP BY kmean
ORDER BY count DESC;
CREATE TABLE wicklow_cluster_names (
id integer,
name varchar(64)
);
INSERT INTO wicklow_cluster_names(id, name)
VALUES
( 0, 'Baltinglass'),
( 1, 'Wicklow Town'),
( 2, 'Kiltegan'),
( 3, 'Shillelagh'),
( 4, 'Tinahely'),
( 5, 'Dunlavin'),
( 6, 'Aughrim'),
( 7, 'Arklow'),
( 8, 'Roundwood'),
( 9, 'Blessington'),
(10, 'Rathdrum'),
(11, 'Enniskerry'),
(12, 'Newtownmountkennedy'),
(13, 'Bray'),
(14, 'Greystones');
-- Previously owned Greystones property dataset
SELECT
s.id, s.date_of_sale, s.price,
s.not_full_market_value,
s.vat_exclusive, s.address_complete,
g.geom,
wcn.name as cluster_name
FROM geocode g
INNER JOIN ppr s ON s.id = g.nk
JOIN wicklow_clusters wc ON ST_Contains(wc.convexhull, g.geom)
INNER join wicklow_cluster_names wcn ON wcn.id = wc.kmean
WHERE
source = 'ppr'
AND s.address5 = 'Wicklow'
AND administrative_area_level_1 <> 'County Wicklow'
AND wcn.name = 'Greystones'
AND vat_exclusive = 'No' AND not_full_market_value = 'No' -- 2nd Hand
;
Cluster | Count |
---|---|
Baltinglass | 258 |
Wicklow Town | 2,002 |
Kiltegan | 73 |
Shillelagh | 89 |
Tinahely | 270 |
Dunlavin | 191 |
Aughrim | 246 |
Arklow | 1,234 |
Roundwood | 181 |
Blessington | 620 |
Rathdrum | 427 |
Enniskerry | 257 |
Newtownmountkennedy | 848 |
Bray | 2,354 |
Greystones | 3,058 |
Though only a couple of months into 2020, the median price trend is down from previous years and similar to 2017 levels. See the Greystones Box plot above.
Note, this analysis does not take into account details of each property such as its condition, number of bedrooms/bathrooms, square footage, garden size, etc. Thus the trends in price should be taken with a pinch of salt.
I hope this post illustrated some of the things you can do with free resources such as Postgres/PostGIS, Pandas & Matplotlib and QGIS.
requirements.txt file for python packages.
psycopg2
pandas
python-dotenv==0.10.3
matplotlib
import psycopg2
import sys, os
import numpy as np
import pandas as pd
import pandas.io.sql as psql
import matplotlib.pyplot as plt
from dotenv import load_dotenv, find_dotenv
import matplotlib as mpl
# .env file contains variables PGHOST, PGDATABASE, PGUSER, PGPORT and PGPASSWORD
load_dotenv(find_dotenv())
PGHOST = os.getenv('PGHOST')
PGDATABASE = os.getenv('PGDATABASE')
PGUSER = os.getenv('PGUSER')
PGPORT = os.getenv('PGPORT')
PGPASSWORD = os.getenv('PGPASSWORD')
conn_string = "host="+ PGHOST +" port="+ "5432" +" dbname="+ PGDATABASE +" user=" + PGUSER \
+" password="+ PGPASSWORD
conn=psycopg2.connect(conn_string)
cursor = conn.cursor()
def load_data():
sql_command = """
select
id, date_of_sale,
price, vat_exclusive,
not_full_market_value,
date_part('YEAR', date_of_sale)::character varying as year
from public.ppr
where
address5 = 'Wicklow'
and date_of_sale < '2020-01-01'
;
"""
data = pd.read_sql(sql_command, conn)
return (data)
df = load_data()
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
df_pivot = df.pivot(index='id', columns='year', values='price')
df_year = df.groupby('year')
counts = [len(v) for k, v in df_year]
total = float(sum(counts))
cases = len(counts)
widths = [c/total for c in counts]
myFig2 = plt.figure(figsize=[12, 6])
cax = df_pivot.boxplot(column=years, widths=widths)
sample_sizes = ['%s\n$n$=%d'%(k, len(v)) for k, v in df_year]
cax.set_xticklabels(sample_sizes)
cax.get_yaxis().set_major_formatter(
mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ','))
)
axes = plt.axes()
axes.set_ylim(bottom=0, top=1000000)
ireland
postgres
postgis
aws
random
qgis
property
geospatial