Stephen Chan's Blog

Scraping and Geocoding Real Estate Properties for GIS Analysis
4 January 2021

How do you find and buy a house near a beach that isn't flood-prone?

Real estate portals contain comprehensive listings of homes for sale, but they don't provide additional information such as flood hazard, noise levels or proximity to fault zones. You can get all that with GIS of course, but how would you marry them together?

Through a scrape-extract-geocode-export-ingest workflow of course!

Project Code on: GitHub

System Design

[Image: System Diagram]

The software system consists of a headless Chrome browser, driven by Puppeteer, that scrapes real estate property data from domain.com.au's search pages. Results from the paginated search pages are collated.
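As a rough sketch of what the Puppeteer-driven scraping loop looks like (the pagination scheme and the '.listing', '.address' and '.price' selectors here are hypothetical placeholders, not domain.com.au's actual markup):

const puppeteer = require('puppeteer');

async function scrapeListings(searchUrl, maxPages) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const listings = [];

  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    // Hypothetical pagination scheme; the real site's URL structure differs.
    await page.goto(`${searchUrl}&page=${pageNum}`, { waitUntil: 'networkidle2' });

    // Placeholder selectors for illustration only.
    const pageListings = await page.$$eval('.listing', (nodes) =>
      nodes.map((node) => ({
        address: node.querySelector('.address').textContent.trim(),
        price: node.querySelector('.price').textContent.trim(),
      }))
    );
    listings.push(...pageListings);
  }

  await browser.close();
  return listings;
}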

Scraped property addresses are used as input for geocoding with the mappify.io APIs. To stay below the 2,500-requests-per-month free tier (and to speed things up), the API calls are placed behind a simple in-memory / file cache. The geocoding API produces coordinates (latitude and longitude) that are saved along with the scraped data as a flat CSV file.
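A minimal sketch of that cache layer, assuming a hypothetical geocodeViaMappify function that wraps the mappify.io address lookup endpoint (the function name and cache file are illustrative):

const fs = require('fs');

const CACHE_FILE = './geocode-cache.json';
const cache = fs.existsSync(CACHE_FILE)
  ? JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'))
  : {};

async function geocodeCached(address) {
  // Serve from the in-memory / file cache when possible, to stay
  // under the free-tier request quota.
  if (cache[address]) {
    return cache[address];
  }

  const coordinates = await geocodeViaMappify(address); // hypothetical wrapper
  cache[address] = coordinates;
  fs.writeFileSync(CACHE_FILE, JSON.stringify(cache));
  return coordinates;
}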

The resulting CSV file can be ingested by GIS systems (such as QGIS) and overlaid on top of other geospatial data (like a flood hazard map).

Results

The scraped and geocoded data can be ingested into a GIS system, allowing for further analysis alongside other geospatial data (in this case, elevation data).

[Image: Geocoded Real Estate Property Ingested in GIS System]
Digitized Penang JICA Drainage Study Floodmap
6 November 2017

The state of Penang, Malaysia is currently suffering from widespread flooding (5th November, 2017). The current flood comes on the heels of an earlier flood just one and a half months ago in mid September.

As part of the design and maintenance of a sensible urban planning policy, a proper understanding of the area's flood risks is required. Comprehensive flood studies are used to quantitatively analyze and evaluate the risks, hazards and potential human & economic losses associated with potential flood events.

Often, flood studies are lacking in developing parts of the world due to data, expertise and resource constraints. The current tragic flood event has prompted me to search for any data that may help inform the general public of the city's flood risk.

Flood Risk Data

I managed to dig out an old 1991 drainage and flooding study of Penang produced by JICA. The study produced flood maps for the 5, 10, 30, 50 & 100 year ARI events. Unfortunately, the poorly scanned map is not very accessible to the general public, due to roads and features lost in the scanning process as well as unnecessary clutter from the map's design:

[Image: Scanned JICA flood map]

Georeferencing & Data Transformation

To improve the accessibility of this floodmap and help inform the general public, I georeferenced the scanned image in GIS and digitized the 5, 10, 50 and 100 year ARI flood extent boundaries.

[Image: Digitized flood extent boundaries]

Digitized Data on Online Map UI

The digitized data presented in a modern map UI below should help inform the public about the flood risks in Penang.

Words of Warning

The JICA study was published in 1991 and is very dated, especially with regard to inputs such as land use patterns, drainage infrastructure and terrain, given the tremendous amount of growth on Penang Island since then.

As with all flood studies, there is a not-insignificant amount of uncertainty associated with the results. It is therefore prudent to treat any flood map as only a rough guide.

Also, I am not the author of the JICA paper. My contribution is limited to digitizing the flood map results and making them accessible via the internet. Please read the original report to gain an understanding of the modeling process and parameters.

Finally, there is no guarantee of the quality or suitability of the digitized map itself. Use at your own risk.

Dynamic Optimum Furniture Height Diagram
21 October 2017

I stumbled across an interesting diagram from the 1950s showing the optimal dimensions of various furniture items alongside a 5'9" tall person.

Wouldn't it be useful to be able to determine the optimum furniture height for people with heights other than 5'9"?

So I went ahead and made a client-side app to dynamically generate the optimal furniture heights based on the proportions of the original drawing.
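The core calculation is just linear scaling from the 69-inch (5'9") reference figure. A minimal sketch (the furniture items and reference dimensions here are illustrative, not the actual values measured off the 1950s diagram):

const REFERENCE_HEIGHT_INCHES = 69; // the 5'9" person in the original diagram

// Illustrative reference dimensions only; the app uses proportions
// measured from the original drawing.
const referenceDimensions = {
  deskHeight: 29,
  chairSeatHeight: 18,
  kitchenBenchHeight: 36
};

function scaleDimensions(userHeightInches) {
  const ratio = userHeightInches / REFERENCE_HEIGHT_INCHES;
  const scaled = {};
  for (const item in referenceDimensions) {
    scaled[item] = referenceDimensions[item] * ratio;
  }
  return scaled;
}

// e.g. scaleDimensions(63) for a 5'3" person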


Project Code on: GitHub | Live Demo

Check out the embedded version:

Design & Features

The plan was to create a simple application consisting of an input field and calculated dimensions overlaid on the diagram. However, self-imposed feature creep set in. The feature set now includes:

  • Metric & US Customary Unit Modes
  • Automatic User Height Input Unit Conversion
  • Saves User Height between Sessions
  • Print-Friendly CSS & JS Scaling
  • Responsive Design
  • Model-View-Presenter Architecture
  • Inversion of Control Pattern with Dependency Injection
Lessons Learnt

Two notable techniques were used to achieve a precise overlay with diagram scaling and to enable print-friendliness. These will be the subjects of future posts.

Modeling Oregon Solar Uptake with Infectious Disease Model Part 1 (Data Discovery & Processing)
9 April 2017

In the US solar industry, customer acquisition costs are very high (49 cents per watt, equivalent to ~$3,000 for a typical 6-kilowatt residential rooftop system).

This project was born out of the 2017 SunCode challenge, where our team tried to address the acquisition cost issue in the US solar industry by creating a predictive model for residential solar adoption.

Infectious Disease Model

The basic premise of using an infectious disease model to predict solar energy adoption is that Americans (middle-class Americans in particular) are driven by what their neighbors are doing. Essentially, it is "keeping up with the Joneses" that motivates people to adopt solar.

Given that premise, we can model the likelihood of an individual adopting solar by examining who else in the neighborhood has previously adopted it. We'll also consider other variables such as household income and neighborhood political affiliation.
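As a toy sketch of that idea: score each household by its number of already-"infected" neighbors within some radius, plus the other covariates. The weights, field names and radius below are purely illustrative, not fitted values from our model:

// Approximate distance between two points with lat/lng fields
// (good enough over neighborhood scales).
function distanceMeters(a, b) {
  const dLat = (b.lat - a.lat) * 111320;
  const dLng = (b.lng - a.lng) * 111320 * Math.cos(a.lat * Math.PI / 180);
  return Math.sqrt(dLat * dLat + dLng * dLng);
}

// Purely illustrative scoring; real weights would be fitted to the data.
function adoptionScore(household, neighbors, radiusMeters) {
  const infectedNearby = neighbors.filter((n) =>
    n.hasSolar && distanceMeters(household, n) <= radiusMeters
  ).length;

  return 0.4 * infectedNearby +
         0.3 * household.homeValueIndex +            // proxy for income
         0.3 * (household.districtLeansBlue ? 1 : 0); // political affiliation
}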

Data Requirement

Solar Installations

To model the progression of an infectious "solar" disease, we must understand how proximity to an infected individual affects the likelihood of another individual becoming infected.

After some significant effort, we located public domain information on Oregon's solar uptake from 2013 to 2016. This data was then visualized with Google Fusion Tables in a web-based interface.

[Image: Oregon solar installations visualized with Google Fusion Tables]
Income

One of the factors we consider is the income level of individual households. Since income data is difficult to obtain at this granularity, we use house value as a proxy (pulled from the Zillow API).

Political Affiliation

Our hypothesis postulates that residential solar adoption rates are affected by an individual's political affiliation. We use a simple binary variable representing whether the district a location sits in registered majority Democrat or Republican between 2013 and 2016.

Data extraction for this portion was difficult, since the data is only available in PDF format with poorly constructed tables.


The Oregon lower house district boundary data can then be used to classify a particular geographical area's political leaning.

Mapping Electrification Potential using Remote Sensing Data
3 April 2017

Accelerating the identification of electrification opportunities for underserved communities around the world using accurate remote sensing data. The application was developed as an entry to the State Department's Clean Energy Data Science Challenge 2017.

Project Code on: GitHub | Live Demo


The Challenge

This application was created as an entry to the State Department's Clean Energy Data Science Challenge 2017:

More than a billion people globally lack access to electricity and another billion lack access to reliable electricity, greatly impacting education, health, social and economic development. A major barrier to investment in this space is a lack of visibility into many countries’ renewable energy potential.
Your challenge is to find new ways to map this potential. Can you leverage multiple data sources to do, in two days, what could otherwise take many months (and millions of dollars) with traditional methods?
Creating actionable insights through algorithms, programs, and applications could be a game-changer for entrepreneurs, investors, and policymakers around the world and help to improve the lives of millions of people.

Our Team of Seven

  • 2 Business Analysts / Researcher
  • 1 Geospatial Data Engineer
  • 1 Geospatial Data Scientist
  • 1 Generalist Data Engineer
  • 1 Frontend Engineer
  • 1 Fullstack Engineer + GIS Operator (<- me)

The team was formed on the first day, and most of the team members didn't know each other prior to the event.

Design Philosophy

The design philosophy of the application is to utilize accurate remote sensing data to identify electrification opportunities for underserved communities around the world. Using remote sensing data removes the reliance on traditional data sources, which for many regions of the world are non-existent, inaccurate, incomplete or outdated.

Approach

The application utilizes the recently released High Resolution Settlement Layer (HRSL) data (1 arc-second / ~30m resolution) to identify human settlement extents.

The electrification status of these settlements is then determined using NASA's Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime sensor data (~150m resolution).

Combining the two datasets, the application identifies candidate communities with sufficient population density to make electrification potentially viable.
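Conceptually, once both rasters are resampled onto a common grid, the overlay reduces to a per-cell test: populated but dark at night. A simplified sketch (the thresholds are illustrative, not those used in the challenge entry):

// Simplified per-cell screening, assuming both rasters have been
// resampled onto the same grid as 2D arrays.
function findCandidateCells(settlementDensity, nightRadiance) {
  const MIN_DENSITY = 50;   // illustrative population threshold per cell
  const MAX_RADIANCE = 0.5; // illustrative "dark at night" threshold
  const candidates = [];

  for (let row = 0; row < settlementDensity.length; row++) {
    for (let col = 0; col < settlementDensity[row].length; col++) {
      const populated = settlementDensity[row][col] >= MIN_DENSITY;
      const unlit = nightRadiance[row][col] <= MAX_RADIANCE;
      if (populated && unlit) {
        candidates.push({ row, col });
      }
    }
  }
  return candidates;
}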

System Architecture

[Image: System architecture diagram]
Frontend

The frontend has a map-based UI utilizing the Mapbox GL JS library and the Bootstrap framework. The following data layers are added to the visualization:

  • HRSL (Settlement Density)
  • VIIRS Nighttime Radiance
  • Existing HV Electricity Transmission Lines
  • Existing Roads
  • Potential Viable Electrification Sites
Backend

The geospatial analysis results are stored within Mapbox's platform, removing the need to host our own geoserver. Raster data transformation was needed to create raster data layers that suit Mapbox's very specific bit-depth, projection and format specifications.

Lessons

Mapbox raster data has very specific requirements: 8-bit GeoTIFF in WGS 84 / Pseudo-Mercator (EPSG:3857) with no alpha channel. You'll likely need to post-process the output of your data analytics pipeline to fit this specification.

What's next?

Features are currently being incrementally added to the project. Future analysis coverage includes countries such as:

  • Burkina Faso
  • Ghana
  • Haiti
  • Ivory Coast
  • Madagascar
  • Malawi
  • South Africa
  • Sri Lanka
A Fitness App that Motivates by Making it A Competition
9 March 2017

How do we get people off the couch and exercising? Make it a competition so you can one-up your friends. This app allows you to challenge your friends to a competitive race, in real time or asynchronously.

Project Code on: GitHub


System Architecture

[Image: System architecture diagram]
Front End

The iOS frontend is built with React Native following the Google Material Design guidelines. Authentication is handled by Auth0 with the Facebook (and potentially other) identity platform. Choosing React Native as the frontend framework allows us to write the mobile application in JavaScript and will facilitate easier porting to Android in the future.

Back End

The backend is a collection of containerized microservices running under docker-machine. The microservices architecture allows team members to work on the codebase with a minimal amount of overlap, reducing the time needed to resolve code conflicts.

The use of containerized microservices also provides flexibility in how the backend is implemented. nodeJS + Express is used for the microservices responsible for routing, authentication, user management and websocket live race relays. Python and PostgreSQL are used for the runs and challenges services, to facilitate future manipulation work on geospatial data.
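To illustrate the live race relay, here is a bare-bones sketch using the ws library: each runner's position update is broadcast to the other participants in the same race. The message shape is hypothetical, not the app's actual protocol:

const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080 });
const races = {}; // raceId -> set of connected sockets

wss.on('connection', (socket) => {
  socket.on('message', (raw) => {
    // Hypothetical message shape: { raceId, runnerId, lat, lng, timestamp }
    const update = JSON.parse(raw);

    if (!races[update.raceId]) {
      races[update.raceId] = new Set();
    }
    races[update.raceId].add(socket);

    // Relay the position update to everyone else in the same race.
    for (const peer of races[update.raceId]) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(JSON.stringify(update));
      }
    }
  });

  // Drop disconnected runners from all races.
  socket.on('close', () => {
    for (const raceId in races) {
      races[raceId].delete(socket);
    }
  });
});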

Continuous Integration (CI) & Deployment

[Image: CI & deployment workflow diagram]

The development process of this application utilizes a continuous testing, integration & deployment workflow. Merges to the master branch are tested, and if the tests pass, Docker images for each microservice are rebuilt and pushed to Docker Hub.
The EC2 instances then pull the newly built Docker images from Docker Hub. Since we're using Docker containers, little or no residue of previous versions is left on the deployment environment.

Lessons

iOS (and now Android) is really aggressive about putting your application into sleep / hibernation mode, which makes executing application logic (such as monitoring GPS location and running race logic) difficult.

Tracking, Monitoring and Alert System for Incoming Congress Legislations
11 February 2017

With thousands of bills being introduced to Congress each year, how do informed citizens keep track of legislation that is important to them? We created an application to monitor incoming legislation so citizens can be alerted and take action.

Project Frontend Code on GitHub


Motivation

A healthy democracy requires the active participation of each individual within the community. However, it is difficult and time-consuming for individuals to understand and track the thousands of bills that pass through Congress. This difficulty contributes to widespread voter apathy and a suboptimal democratic government.

This application is designed to help lower the difficulty and cost of tracking legislation that is important to each individual. By providing timely alerts on incoming legislation, individuals can react and organize timely action to oppose or support particular bills.

Features

  • Individualized House of Representatives & Senators Social Media + Contact Information for Each User
  • Search Function for Existing Legislations
  • Monitor New Incoming Legislation with Email Alerts for Specific Keywords

System Architecture

[Image: System architecture diagram]
Frontend

The responsive frontend is built on the Bootstrap framework with the React library. Legislation search and up-to-date legislator contact information are provided by calls to the Sunlight Foundation's OpenCongress API.

Backend

The application backend consists of three components: a web server, a legislation monitoring worker and an email notification worker.

The web server is responsible for user authentication, user address geocoding (via the Google Maps API), and responding to client requests for up-to-date monitored legislation results for the dashboard.

The legislation monitoring worker periodically polls the OpenCongress API for new and updated legislation. New bills are associated with keywords by running their titles and text through the Twinword natural language processing API. The bills and their associated keywords are then stored in the database. A signal is subsequently sent to the email notification worker to notify it of new incoming legislation.

The email notification worker looks through the keywords of interest to each individual user and sends an email notification when new or progressing legislation matches the user-specified keywords.
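A simplified sketch of that matching step (the data shapes and the sendEmail helper are hypothetical, for illustration only):

// Hypothetical shapes: user = { email, keywords: [...] },
// bill = { title, keywords: [...] }.
function notifyUsersOfNewBills(users, newBills) {
  for (const user of users) {
    const matches = newBills.filter((bill) =>
      bill.keywords.some((kw) => user.keywords.includes(kw))
    );

    if (matches.length > 0) {
      sendEmail(user.email, matches); // hypothetical mailer helper
    }
  }
}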

What's Next?

To make the tool more effective, several social collaboration features are envisioned:

  • Legislation Text Annotations
  • Discussion Forum
  • Legislation Feedback (upvotes / downvotes)
  • Related News / Online Commentary Discovery
Retrieving Areal Evapotranspiration Data for Water Quality Modeling
10 January 2017

Making it easy for non-GIS savvy engineers to access the data they need for water quality modelling.

Project Code on: GitHub


Motivation

MUSIC is the de facto software for modeling the performance of stormwater quality management systems.

One of the critical inputs to the model is monthly evapotranspiration data, which depends on the location of the project site in question. The evapotranspiration GIS data is publicly available from state agencies, but non-GIS-savvy engineers often cannot access it easily.

So I've built a tool to make it easier and faster for engineers to retrieve the evaporation data for their projects.

Architecture

[Image: System architecture diagram]
Frontend

The frontend consists of a map-based UI built with the Angular & Bootstrap frameworks. It utilizes the Google Maps API for location search and coordinate retrieval. The coordinates are sent as input arguments to the backend to retrieve the evaporation data for a particular project location.

Backend

The backend consists of a nodeJS + Express server acting as both API and web server, hosted on the Heroku cloud platform. The server parses the evaporation ArcGrid GIS data on startup and exposes the data query service via a RESTful API.
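A condensed sketch of the lookup itself: ESRI ArcGrid (ASCII grid) files carry a small header (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) followed by rows of cell values, so a coordinate query reduces to an index calculation. Parsing details are simplified and the grid object shape is assumed:

// Query a parsed ASCII grid: grid = { ncols, nrows, xllcorner,
// yllcorner, cellsize, nodata, values }, where values[row][col]
// stores the top row first, as in the ArcGrid format.
function queryGrid(grid, longitude, latitude) {
  const col = Math.floor((longitude - grid.xllcorner) / grid.cellsize);
  const rowFromBottom = Math.floor((latitude - grid.yllcorner) / grid.cellsize);
  const row = grid.nrows - 1 - rowFromBottom; // flip: file stores top row first

  if (col < 0 || col >= grid.ncols || row < 0 || row >= grid.nrows) {
    return null; // outside grid extent
  }

  const value = grid.values[row][col];
  return value === grid.nodata ? null : value;
}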

What's next?

As part of the quality assurance process, all of the modeling (as well as its inputs) must be checked. Right now there's no easy way to do that other than comparing the values to the evaporation contour charts from the BOM. The next feature will overlay the project position on these charts so that a PDF report can be generated for QA and filing purposes.

Modelling Tweets as Markov Chains
1 January 2017

To celebrate the end of 2016, let's model President-elect Trump's 2016 tweets as a Markov chain!


Project Code on: GitHub | Live Demo


What is a Markov Chain?

Markov chains are stochastic processes that satisfy the Markov property. They are described as "memoryless": the probability of the future state depends entirely and only on the current state, not on the previous states.

Are the words contained within a tweet generated via a stochastic process that satisfies the Markov property? absolutely not

So why do you want to model tweets as a Markov Chain? because it's hilarious

Getting the Tweet Data

Twitter provides an API for reading and writing tweet data. It's also useless for our purpose because Twitter only returns tweets from the last three weeks. Very sad.

But it looks like trumptwitterarchive.com has a database of historical tweets from the President-elect. So I copied all of the tweets from 2016 into a local text file.

Tweet Data Preprocessing

We'll need to pre-process our data so we can produce a sensible-looking transition matrix. String elements such as article links, quotes and URLs are removed using regex.

Modelling Tweets as a Markov Chain

Strictly speaking, we are modelling the process of writing out a tweet as a Markov Chain. Each word of the tweet is represented by a system state. With the knowledge of the present system state (current word) we can obtain the future system state (next word) using the transition matrix.

A Tweet's First Word:

We will need to set an initial condition for our system before the Markov process of writing out the subsequent word can begin. From our dataset, we can determine the probability distribution of the first word (initial condition) of the tweet:

function calculateFirstWordsProbabilities(tweets) {
  let tweetFirstWords = [];
  tweets.forEach(extractFirstWordOfTweet);
  // Collapse the raw first-word list into a probability distribution.
  tweetFirstWordsProbability = convertToProbabilityArray(tweetFirstWords);

  function extractFirstWordOfTweet(tweet) {
    let words = tweet.split(' ');
    tweetFirstWords.push(words[0]);
  }
}

If your dataset is small, you could conceivably just keep an array of all the first words of each tweet (with repeats) and sample from it uniformly. That simplifies the process, since we no longer need to calculate the cumulative distribution function, but it would require a lot of storage for any decent-sized dataset.

Determining the Transition Matrix

Given that the state space is very large (all unique words within our tweet dataset), I've chosen to implement the transition matrix as a hash table, with the key representing the current state and the value an array of future states and their associated probabilities. This minimises storage requirements, as the transition matrix is very sparse.

function populateMarkovFrequencies(tweet) {
  let words = tweet.split(' ');

  for (let i = 0; i < words.length; i++) {
    let word = words[i];
    let isLastWordInTweet = (i === (words.length - 1));

    // First time we've seen this word: start an empty successor list.
    if (markovTransition[word] === undefined) {
      markovTransition[word] = [];
    }

    // Use '\n' as the terminal state marking the end of a tweet.
    if (isLastWordInTweet) {
      markovTransition[word].push('\n');
    } else {
      let nextWordInTweet = words[i + 1];
      markovTransition[word].push(nextWordInTweet);
    }
  }
}

function processMarkovTransition() {
  // Collapse each word's raw successor list into a probability distribution.
  for (let word in markovTransition) {
    markovTransition[word] = convertToProbabilityArray(markovTransition[word]);
  }
}
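The convertToProbabilityArray helper isn't shown above; a minimal sketch of one possible implementation, assuming it collapses a list of observed values into unique entries with relative frequencies:

// Hypothetical helper: turns e.g. ['a', 'b', 'a'] into
// [{ value: 'a', probability: 2/3 }, { value: 'b', probability: 1/3 }].
function convertToProbabilityArray(items) {
  let counts = {};
  items.forEach(function(item) {
    counts[item] = (counts[item] || 0) + 1;
  });

  return Object.keys(counts).map(function(value) {
    return { value: value, probability: counts[value] / items.length };
  });
}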
Done!

All we need to do now is to use the transition matrix to generate our own tweet!
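As a sketch (assuming the { value, probability } array shape from the helper above), generation is just repeated sampling: draw the first word from the first-word distribution, then repeatedly sample the next word from the current word's row of the transition matrix until the terminal '\n' state is reached:

// Sample one value from [{ value, probability }, ...] using the
// cumulative distribution.
function sample(probabilityArray) {
  let r = Math.random();
  let cumulative = 0;
  for (let entry of probabilityArray) {
    cumulative += entry.probability;
    if (r <= cumulative) {
      return entry.value;
    }
  }
  return probabilityArray[probabilityArray.length - 1].value;
}

function generateTweet() {
  let word = sample(tweetFirstWordsProbability);
  let tweet = [];
  while (word !== '\n') {
    tweet.push(word);
    word = sample(markovTransition[word]);
  }
  return tweet.join(' ');
}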

Optimizing Water Supply Networks with Genetic Algorithm (Part II)
17 May 2016

This is Part II of my blog post on the application that helps optimize water supply networks using a genetic algorithm.


The application is written in C/C++.
Project Code on: GitHub

Genetic Algorithm

A genetic algorithm is an optimization method inspired by natural selection. Individual solutions are encoded into "DNA", which is repeatedly mutated, recombined and selected for "fitness" based on system performance.

Initialization

A randomized set of configurations is used to fill the initial population.

Encoding

Possible choices for each component of the system are represented as integers. The bit representations of these integers are concatenated to form a configuration's "DNA". This encoding enables the mimicking of the biological mutation and reproduction processes.
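A small sketch of this encoding (in JavaScript for illustration; the actual application is C/C++, and the bits-per-gene value is illustrative):

// Each gene is an integer choice packed into a fixed number of bits,
// concatenated into one "DNA" bit string.
const BITS_PER_GENE = 4; // supports up to 16 options per component

function encode(choices) {
  return choices
    .map((choice) => choice.toString(2).padStart(BITS_PER_GENE, '0'))
    .join('');
}

function decode(dna) {
  const choices = [];
  for (let i = 0; i < dna.length; i += BITS_PER_GENE) {
    choices.push(parseInt(dna.slice(i, i + BITS_PER_GENE), 2));
  }
  return choices;
}

// encode([3, 12, 7]) -> '001111000111'; decode reverses it.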

Mutation

In the mutation process, random bits of each configuration's encoded "DNA" are "mutated", or flipped, to create new variations of configurations. These small, random variations are intended to explore the solution space in an incremental manner.
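With the DNA as a bit string, mutation is a per-bit coin flip; a sketch (the mutation rate is illustrative):

// Flip each bit independently with a small probability.
function mutate(dna, mutationRate = 0.01) {
  return dna
    .split('')
    .map((bit) => Math.random() < mutationRate ? (bit === '0' ? '1' : '0') : bit)
    .join('');
}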

Reproduction / Crossover

In the reproduction process, configurations that have high "fitness" scores are paired, and their encoded DNA is merged to create a new pair of configurations. A simple single-point crossover approach is used, where the encoded bits are swapped after a pivot point.
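Single-point crossover, sketched on the same bit-string representation as above:

// Swap the tails of two parents after a random pivot point.
function crossover(parentA, parentB) {
  const pivot = 1 + Math.floor(Math.random() * (parentA.length - 1));
  const childA = parentA.slice(0, pivot) + parentB.slice(pivot);
  const childB = parentB.slice(0, pivot) + parentA.slice(pivot);
  return [childA, childB];
}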


Decoding & Fitness Evaluation

The newly mutated and reproduced population is decoded from its DNA representation into system configurations. These configurations are then evaluated for performance (pressure, flow, pump utilization, etc.) as well as construction and operating costs, using the EPANET network analysis toolkit and the cost calculation module. Each configuration is then assigned a "rank" according to its performance.

A crucial difference between a standard GA and the multi-objective GA approach used here is the ranking process. A multi-objective GA ranks individual solutions based on their Pareto-optimality, as opposed to a single performance metric. With this approach, multiple configurations can share the same "rank".
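The core of that ranking is a dominance test: configuration A dominates B if A is at least as good on every objective and strictly better on at least one. A sketch, assuming all objectives are minimized:

// Returns true if objective vector a Pareto-dominates b
// (assuming lower is better for every objective).
function dominates(a, b) {
  let strictlyBetterSomewhere = false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] > b[i]) return false; // worse on some objective
    if (a[i] < b[i]) strictlyBetterSomewhere = true;
  }
  return strictlyBetterSomewhere;
}

// Rank 1 = non-dominated; solutions dominated only by rank-1
// members form rank 2, and so on.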

Population Culling

During the reproduction / crossover process, the total population is effectively doubled. To maintain a constant population size, system configurations with low fitness scores are removed.

New Generation

Each mutate-reproduce-cull cycle represents a new generation of the population of configurations. The process will then repeat until the required number of generations is reached (or a set of performance criteria are satisfied).

Results

After a sufficient number of generations, we should have identified a set of close-to-Pareto-optimal solutions. Designers and decision makers can then evaluate the performance and cost tradeoffs within this limited set of potential configurations.

Optimizing Water Supply Networks with Genetic Algorithm (Part I)
10 May 2016

A safe, reliable source of water is fundamental to the health, economy, and wellbeing of communities. However, the design, construction, operation and maintenance of water supply systems require significant resources. This application uses a genetic algorithm to simultaneously optimize multiple system characteristics and identify Pareto-optimal systems, allowing decision makers to make tradeoff decisions based on an optimal set of options.


The application is written in C/C++.
Project Code on: GitHub

Motivation

Designers, operators and owners of water supply systems need to balance system characteristics such as security, system reliability, construction cost, operating & maintenance costs, energy requirements and greenhouse gas emissions. This myriad of system requirements makes it difficult to optimize existing and proposed water supply systems.

Pareto Optimum Systems

This application identifies Pareto-optimal system configurations for water supply networks. Pareto-optimal configurations cannot be modified to make a particular aspect of the system better without causing another aspect of the system to worsen. The diagram below illustrates the system configurations that are Pareto-optimal with respect to system performance characteristics f1 and f2.

[Image: Pareto-optimal configurations with respect to f1 and f2]

System Architecture

[Image: System architecture diagram]

Initially, the user specifies the water supply network configuration to be optimized, as well as a database of system component costs. This information is fed into the genetic algorithm, which uses the EPANET network analysis toolkit to evaluate system performance characteristics (e.g. supply pressure, flows & flow velocities, as well as pumping cycles & durations). The total system costs are then calculated using the system cost analysis module. The multivariate system performance metrics are then used as the basis for optimization.

I'll explain the workings of the genetic algorithm in Part II.
