Monday, July 01, 2013

Open Data Hackathon Amsterdam

Last Saturday I visited the Open Data Hackathon in Amsterdam, organised by "HackdeOverheid" (http://www.hackdeoverheid.nl). It was my first visit to such an event. I really liked the vibrant atmosphere, but in the days after, some thoughts kept returning to me which I would like to share with you here.

Identifying Open Data
These days it's quite popular among governments to publish data as open data. For sure, any data published is interesting. However, quite some published datasets are aggregated in some way, which makes them less useful for use cases the publisher hadn't anticipated. Sometimes the aggregation is done to facilitate developers, but most of the time it's done for other reasons, such as protecting the privacy of the persons mentioned in the data.
For example, at this hackathon a dataset was presented by SVB (http://www.hackdeoverheid.nl/voornamen-data-beschikbaar-voor-apps): they summarised the given first names over the whole country, which limits the dataset to a single purpose: first-name popularity. If they had aggregated to street/postcode/area level (or not at all), people might have used the dataset to relate name-giving to region or even economic status.
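To make the aggregation point concrete, here is a minimal Python sketch, using made-up records rather than the actual SVB data, showing how a national roll-up discards the geographic dimension that a per-postcode roll-up would keep:

```python
from collections import Counter

# Hypothetical raw records: (first name, four-digit postcode area).
# The real SVB publication only contains national totals.
records = [
    ("Daan", "1011"), ("Daan", "1011"), ("Sophie", "1011"),
    ("Daan", "9711"), ("Emma", "9711"), ("Emma", "9711"),
]

# National aggregation: the only remaining use case is name popularity.
national = Counter(name for name, _ in records)

# Regional aggregation: keeps the geographic dimension usable,
# so names can still be related to region.
regional = Counter((postcode, name) for name, postcode in records)

print(national.most_common(1))      # most popular name nationwide
print(regional[("9711", "Emma")])   # 'Emma' count in postcode area 9711
```

Both aggregations can be derived from the raw records, but not the other way around, which is exactly why raw data (where privacy allows it) is the more valuable publication.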
Which leads me to a suggestion for publishers: please provide us with the raw data, and offer aggregations as a separate download.

Open standards
At the event there were frequent requests for open formats like CSV/JSON/TXT, and people also asked for APIs. But there was not much awareness of open standards. At moments like this I always realise that, as a geo community, we are quite far along in the development and implementation of open standards. The risk of every organisation implementing its own formats and APIs is that a data miner has to develop specific format conversions for each organisation he wants to extract data from. Think of the Dutch municipalities: we have some 250 of them, and if they all developed a specific API on top of their data, it would be very hard to extract similar data from all those APIs.

Quite some people are aware of this risk, which is why the government developed "basisregistraties": specifications for how to store and communicate data on certain thematic areas, to be implemented by, for example, all municipalities. This is quite important for the open data movement, since most of the data available via the "basisregistraties" will be open data. A first example is the "Basisregistratie gebouwen", a dataset (plus a SOAP and WFS API) which contains all buildings in the Netherlands. Ok, this is not a simple JSON REST API, but hey, we're developers, we are not afraid of a little XML.

My colleague pointed me to http://5stardata.info/, where complying with unified data models is indeed not mentioned as a star; instead they point to linked data as the way to go. That might indeed be a better pattern for interacting with data from different origins, but as far as I know it is still quite experimental at most organisations.
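To illustrate how a shared standard like WFS keeps access uniform across organisations, here is a sketch of building a WFS GetFeature request in Python. The endpoint and layer name are hypothetical placeholders, not the actual BAG service:

```python
from urllib.parse import urlencode

# Hypothetical WFS endpoint; any real service URL will differ.
endpoint = "https://example.overheid.nl/wfs"

# Because WFS standardises these parameters, the same request
# structure works against every conforming server.
params = {
    "service": "WFS",
    "version": "1.1.0",
    "request": "GetFeature",
    "typeName": "bag:pand",                 # assumed layer name for buildings
    "outputFormat": "application/json",
    "maxFeatures": "10",
}

url = endpoint + "?" + urlencode(params)
print(url)
```

Swapping in another municipality's endpoint is a one-line change; with 250 bespoke APIs, each would need its own client code.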

GeoJSON in GitHub
Recently the GitHub team added GeoJSON support to GitHub. Uploaded GeoJSON files are displayed as maps on a nice backdrop using LeafletJS. Since then, people have started uploading masses of GeoJSON files, also in preparation for this hackathon. Of course there is a risk that this remains a one-off action and the data will soon be outdated, but if done correctly it could mean a real change in how we're used to publishing data. Imagine:
- An automated process updates the GeoJSON data in GitHub every... In the Git history you can then nicely trace the historical development of the dataset.
- You can fork the dataset, reformat it and publish it again, or even open a pull request for data optimisations.
- To make the data accessible in traditional GIS, you could add a WMS/WFS server which uses the GitHub GeoJSON as its input storage (using OGR).
- In the end people will love Git as storage, will introduce Git servers in their own organisation as the master storage, and will just mirror to GitHub.
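The first step above, an automated process publishing GeoJSON into a Git repository, could look roughly like this minimal Python sketch (the file name and feature content are made up for illustration):

```python
import json

# Hypothetical features produced by some upstream process; a real
# pipeline would generate these from the organisation's source database.
features = [{
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [4.9, 52.37]},
    "properties": {"name": "Amsterdam"},
}]

collection = {"type": "FeatureCollection", "features": features}

# Write the dataset; committing this file on a schedule gives you
# the full historical development of the data in the Git log.
with open("dataset.geojson", "w") as f:
    json.dump(collection, f, indent=2)

# A scheduled job would then commit and push, for example:
#   git add dataset.geojson
#   git commit -m "automated data update"
#   git push origin master
```

Each commit becomes a snapshot of the dataset, which is exactly what makes forking, diffing and pull requests on data possible.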
Related to this, there is a proposal by Max Ogden & OKFN (https://github.com/maxogden/dat) and another proposal by OpenGeo (https://github.com/opengeo/GeoGit). Today I noticed a blog post by Rufus Pollock on the matter (http://blog.okfn.org/2013/07/02/git-and-github-for-data); it is amazing to see the momentum on this theme these days.

OGC vs best practices
The last thought is on OGC versus best-practice standards. These days we see projects like MapBox, CartoDB, LeafletJS and GeoJSON being very popular while dissociating themselves from OGC standards.
For sure they use conventions between the products (epsg:900913, TMS, GeoJSON), but those conventions are the result of best practices in the community, not of a heavily technological and political design process at OGC. These best-practice standards are lightweight, focus on performance, are much easier to implement, are widespread, and offer a more fluent user experience than applications built on OGC standards. OGC should really focus on these lighter standards. We are at a point where data distributors and proprietary spatial software vendors get much pressure from users to also support the best-practice standards, with the result that these standards are widely implemented in both open source and proprietary systems without ever having been adopted by OGC.
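One example of such a community convention is the slippy-map tile scheme built on the Web Mercator projection (informally known as epsg:900913): the entire tile-index calculation, which every one of these products agrees on, fits in a few lines of Python:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Convert a WGS84 lon/lat to XYZ tile indices in Web Mercator.

    This is the de-facto convention popularised by OSM and web mapping
    libraries (note that TMS flips the y axis); it was standardised by
    community practice, not by an OGC specification process.
    """
    n = 2 ** zoom
    xtile = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    ytile = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return xtile, ytile

# Tile containing Amsterdam at zoom level 10:
print(lonlat_to_tile(4.9, 52.37, 10))
```

The simplicity of this scheme, compared to a WMS GetMap negotiation, is a big part of why it caught on: tiles are plain HTTP GETs of predictable URLs, which are trivially cacheable.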


1 comment:

Lex Slaghuis said...

Hi, with regard to the SVB data: raw data is not possible due to privacy issues.

If one had the first name of the oldest person in one file, and the last names in a different file, then the oldest person could be identified. Such examples hint at the privacy issues involved.


However, aggregates over geography could be disclosed, and hopefully SVB will look into that (next to keeping the data up to date).

Also they could extend the range of data to the full historic span of the database.