International Aid Data Hackathon
Last weekend I took part in a hackathon hosted by an offshoot of Engineers Without Borders Canada. The hackathon focused on using the Canadian government's international aid data in creative ways. iPolitics coverage of the event can be found here: iPolitics Coverage
I worked on an amazing team with two other strong programmers and a smart young economics / international development student. We chose to work in Ruby, because for weekend hacking and ease of use nothing beats Ruby. Our team name was GatorAid.
Introduction to International Aid Data
I’ll try to keep this short: there is an international standard for aid data. The standard is XML-based and is called IATI data, after the initiative behind it (the International Aid Transparency Initiative). The Canadian international development organization was recently required to release all of the projects it funds as open data in the IATI format, totalling approximately 3,000 activities.
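To make the format concrete, here is a simplified activity entry. The element names follow the IATI activity standard, but the identifier and text values are invented for illustration:

```xml
<iati-activity>
  <iati-identifier>CA-3-EXAMPLE001</iati-identifier>
  <title>Water and sanitation project</title>
  <description>Improving access to clean water in Addis Ababa.</description>
  <recipient-country code="ET">Ethiopia</recipient-country>
</iati-activity>
```

Note that the free-text description may mention a city, while the only structured location field is the recipient country.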
In the Canadian aid data, and in many other data sets around the world, activity entries are not geocoded with any more specificity than a country name, despite, by our estimates, 30% of projects being implemented at a regional or even municipal level. This information is lost in the data, making any mapping finer than the country level impossible. We set out to build a web service that could take in an unstructured text description, identify any region, city, or village names, geocode them, and return GPS co-ordinates for mapping.
Code on GitHub: https://github.com/davidrs/iati-geocoder
Our algorithm works as follows:
- Read the project description in as a string.
- Pull out all of the capital letter words.
- Ignore words that are all capital letters; these are usually acronyms.
- Try to keep hyphenated words and runs of capitalized words together (e.g. Saint-Joseph or Addis Ababa).
- Compare these words to a whitelist of cities and regions of the world, courtesy of GeoNames.org.
- Take our remaining list and pass them to Google to be geocoded.
- Throw out any geocoded results that aren’t in the parent country or don’t contain the keyword we sent in.
- If we have only a single successful result, we can be confident this is where the project is. If we have multiple results we can’t be too sure, so we return an array of them. If no results are found, fall back to the country’s co-ordinates.
- Return a JSON object with the original unique ID and GPS co-ordinates.
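Under some simplifying assumptions, the steps above can be sketched in Ruby. The whitelist and the `stub_geocode` lambda are hypothetical stand-ins: the real project loaded place names from GeoNames.org and called Google's geocoder, discarding results outside the parent country.

```ruby
require 'json'

# Hypothetical whitelist; the real project used city/region names from GeoNames.org.
PLACE_WHITELIST = ["Addis Ababa", "Saint-Joseph", "Toronto"]

# Extract capitalized words, keeping hyphenated words and runs of
# capitalized words together, then drop all-caps tokens (acronyms).
def candidate_place_names(description)
  tokens = description.scan(/[A-Z][A-Za-z]*(?:[- ][A-Z][A-Za-z]*)*/)
  tokens.reject { |t| t == t.upcase }
end

# Filter candidates against the whitelist, geocode the survivors, and
# fall back to the country's co-ordinates when nothing geocodes.
# `geocode` stands in for the Google geocoding call described above.
def locate(id, description, country_coords, geocode)
  names = candidate_place_names(description) & PLACE_WHITELIST
  coords = names.map { |n| geocode.call(n) }.compact
  coords = [country_coords] if coords.empty?
  { id: id, coordinates: coords }.to_json
end

# Stubbed geocoder knowing a single city, for demonstration.
stub_geocode = ->(name) { { "Addis Ababa" => [9.03, 38.74] }[name] }
puts locate("CA-1", "Water project near Addis Ababa, funded by CIDA.",
            [9.145, 40.49], stub_geocode)
```

Here "CIDA" is rejected as an all-caps acronym, "Addis Ababa" survives the whitelist and geocodes, and an entry with no recognizable place name would fall back to the country's co-ordinates.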
Any natural language processing challenge is tricky, but given our 30 hours we did pretty well. Before starting development we manually sorted 40 projects into 3 categories: country level; city level, computer-solvable; and city level, not computer-solvable. Approximately 70% were country level, with no further geocoding to do; 25% were city level, where our ideal algorithm would correctly identify locations; and 5% were city level but either lacked sufficient data or would defeat any algorithm of ours, for example: “A project developed in partnership with Ottawa was implemented in Toronto”.
We ran our algorithm on a sample of 100 entries: it yielded a similar percentage distribution and correctly sorted the 40 we had manually classified. There are still minor errors, and the geocoding can be overzealous in the results it returns, so while we wouldn’t be confident automating the entire process, we would confidently use the tool to separate projects that need geocoding from those that don’t — currently a process done by hand. This is a 70% efficiency saving: instead of reviewing all 3,000 entries, you would focus on just the roughly 900 we identify as finer than country-level programs.
P.S. Our team won ‘most innovative’, which was a nice bonus.