Wednesday, July 03, 2013

The internet of things and geo services



Today I visited the concluding conference of a linked open data pilot in the Netherlands (http://pilod.nl). In this blog some thoughts on the matter from a (spatial) catalogue perspective. In recent years a lot of energy has been put into publishing open datasets on the web using open geo standards like CSW/WFS (triggered by regulations like INSPIRE, the basisregistraties, etc.). Unfortunately, due to their nature, geo services can hardly be used in the linked open data web. As a geo community I think we should investigate how to support the internet of things from our existing services. It would be a shame if they were left out. Linked data might even bridge the gap that quite a few people experience between the geo community and mainstream ICT.

CSW/WFS to RDF
Most geo data is tagged with metadata these days. In the metadata, references are made to contact point info and to keywords from SKOS thesauri (like GEMET). If we were able to present this metadata as RDF along with the spatial data itself, the data would instantly link to existing resources in the internet of things. GeoNetwork does have a basic RDF endpoint implementation these days (/srv/eng/rdf.search, funded by EEA); the actual challenge is in creating RDF output from the WFS services (and linked shapefiles) out there and linking to that from the RDF metadata. A potential solution would be to extend products like GeoServer and MapServer, which already support a range of output formats for WFS (like GML, KML, SHP, CSV, JSON), with an RDF output format. From the RDF metadata one could then link to a WFS GetFeature request in format application/rdf+xml. Spiders would be able to index the data and use it as linked data to support GeoSPARQL queries.
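As an illustration of what such a link target could look like, a minimal sketch in Python (the endpoint and feature type are made up, and the application/rdf+xml output format is exactly the extension proposed above, not something GeoServer or MapServer offer today):

    # Sketch: building the WFS GetFeature URL that the RDF metadata could
    # link to. Endpoint and feature type are hypothetical; the RDF output
    # format is the proposed extension, not an existing capability.
    import urllib.parse

    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": "topo:buildings",
        "outputFormat": "application/rdf+xml",
    }
    print("http://example.org/wfs?" + urllib.parse.urlencode(params))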

Implementing RDF support could be quite easy if it were a pure XSLT transformation (GML to RDF). However, RDF is far more useful if additional items are added to each GetFeature response document. Items like contact point and keywords (as defined in the service definition) link the record to other resources on the web.
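To make this concrete, a minimal sketch of such a conversion in Python, using lxml and rdflib (the endpoint, identifier scheme and target vocabulary are all hypothetical assumptions):

    # Sketch: convert a WFS GetFeature response (GML) into RDF triples.
    # Endpoint, feature type and target vocabulary are hypothetical.
    import urllib.request
    from lxml import etree
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    WFS_URL = ("http://example.org/wfs?service=WFS&version=1.1.0"
               "&request=GetFeature&typeName=topo:buildings")
    GML = "http://www.opengis.net/gml"
    EX = Namespace("http://example.org/def/buildings#")

    g = Graph()
    doc = etree.fromstring(urllib.request.urlopen(WFS_URL).read())

    for member in doc.iter("{%s}featureMember" % GML):
        feature = member[0]                  # the feature element itself
        fid = feature.get("{%s}id" % GML)    # assumes gml:id is present
        subject = URIRef("http://example.org/id/buildings/%s" % fid)
        g.add((subject, RDF.type, EX.Building))
        # Map each attribute element to a predicate in the vocabulary;
        # this is also where contact point and keywords could be added.
        for attr in feature:
            if attr.text and attr.text.strip():
                name = etree.QName(attr).localname
                g.add((subject, EX[name], Literal(attr.text.strip())))

    print(g.serialize(format="turtle"))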

ISO 19110
A practice described in ISO 19110 (http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39965) but unfortunately not commonly implemented is feature catalogue metadata (for example http://sdi.eea.europa.eu/catalogue/srv/eng/search?uuid=99b05c69-1532-48b8-bfeb-4cb0df539c88). This could become an essential link in creating RDF from WFS/CSW. ISO 19110 describes the attributes used in a dataset. From this metadata, references can be made to common (linked data) vocabularies, which can then (hopefully) be used to convert the GML to optimally linked RDF.
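For example, a feature catalogue entry could be turned into a simple attribute-to-predicate mapping for the GML-to-RDF conversion to use (a sketch; the attribute names and vocabulary choices are hypothetical):

    # Sketch: an attribute-to-vocabulary mapping as it could be derived
    # from an ISO 19110 feature catalogue. All entries are hypothetical.
    ATTRIBUTE_MAP = {
        "name":       "http://xmlns.com/foaf/0.1/name",
        "postcode":   "http://www.w3.org/ns/locn#postCode",
        "built_year": "http://purl.org/dc/terms/date",
        "geometry":   "http://www.opengis.net/ont/geosparql#hasGeometry",
    }

    def predicate_for(attribute):
        # Fall back to a dataset-local predicate when the feature
        # catalogue does not map the attribute to a common vocabulary
        return ATTRIBUTE_MAP.get(
            attribute, "http://example.org/def/buildings#" + attribute)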

Fixed location
The internet of things is based on the premise that each resource on the web has a unique location (URL) that will not change over time. In GIS servers a record or feature is accessed via a GetRecord/GetFeature request. One can imagine a scenario in which you use this type of request as the unique URL for the resource. However, this URL is actually not that fixed. For example, the order of parameters in the request can change, it can be a POST request, and the format parameter is explicitly defined in the URL (instead of in an Accept header, as is generally used in open data content negotiation). Even more challenging is the version parameter of the standard: after a while a new version of the catalogue standard will become available, and the old version might get deprecated. The URL of the request will change accordingly. This could be managed by adding additional links in the referring document. A link could be added for each supported version, format, language, spatial reference system, etc.
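For comparison, with a fixed resource URI the format would be negotiated via the Accept header rather than encoded in the URL; a minimal sketch (the URI is made up):

    # Sketch: content negotiation against a stable resource URI, instead
    # of encoding format and version in a WFS request URL.
    import urllib.request

    req = urllib.request.Request(
        "http://example.org/id/buildings/123",          # hypothetical URI
        headers={"Accept": "application/rdf+xml"})      # negotiate format
    print(urllib.request.urlopen(req).read()[:200])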

Geonetwork as a gateway
Considering the above drawbacks, it might be an idea to extend the functionality of GeoNetwork to act as a gateway between geo and RDF. GeoNetwork has all of the required ISO 19115/ISO 19110 metadata available, plus the links to the actual downloads and services. It could (daily/hourly) harvest all of that data and convert it to RDF quite optimally. Some additional advantages: a search query in GeoNetwork would include a search into the actual data, and it would solve the URI strategy for the geo domain instantly. All Dutch geodata (PDOK) would for example be available at:
nationaalgeoregister.nl/geonetwork/dataset/{namespace}:{uuid}/record/{id}
For sure some exceptions should be made for frequently changing data like sensor services, and the processing should be delegated to other nodes in the cluster.
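A rough sketch of how such a gateway could resolve those URIs (Flask is an arbitrary choice for illustration, and the conversion function is a stub standing in for the harvest-and-convert step described above):

    # Sketch: a tiny gateway that serves harvested/converted records at
    # the stable URI pattern above. The conversion function is a stub.
    from flask import Flask, Response

    app = Flask(__name__)

    def gml_to_rdf(namespace, uuid, fid):
        # Stub: a real gateway would look up the dataset's WFS in the
        # harvested ISO 19139 metadata, fetch the feature and convert it,
        # e.g. with the GML-to-RDF sketch earlier in this post.
        return ("<rdf:RDF xmlns:rdf="
                "\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>")

    @app.route("/geonetwork/dataset/<namespace>:<uuid>/record/<fid>")
    def record(namespace, uuid, fid):
        return Response(gml_to_rdf(namespace, uuid, fid),
                        mimetype="application/rdf+xml")

    if __name__ == "__main__":
        app.run()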

For sure the above approach will have legal limitations (who is responsible for the data/service?). In the long run each organisation will need its own RDF endpoint (and register that endpoint in the ISO 19139 metadata?). But my proposal can offer a temporary solution for organisations that are not yet ready to do a full RDF implementation.

Read more
A search on the web showed there are plenty of initiatives in this area, most importantly the FP7 GeoKnow project (http://geoknow.eu). I'm curious about their results.
Also check "Opportunities and Challenges for using Linked Data in INSPIRE" by Sven Schade and Michael Lutz: http://ceur-ws.org/Vol-691/paper5.pdf


Monday, July 01, 2013

Open Data Hackathon Amsterdam

Last Saturday I visited the Open Data Hackathon in Amsterdam, organised by "HackdeOverheid" (http://www.hackdeoverheid.nl). It was my first visit to such an event. I really liked the vibrant atmosphere, but in the days after, some thoughts on it kept returning to me, which I would like to share with you here.

Identifying OpenData
These days it's quite popular for governments to publish data as open data. For sure any data published is interesting. However, quite a few published datasets are aggregated in some way, which makes them less useful for use cases the publisher hadn't anticipated. Sometimes the aggregation is done to facilitate developers, but most of the time it's done for other reasons, such as protecting the privacy of the persons mentioned in the data.
For example, at this hackathon a dataset was presented by SVB (http://www.hackdeoverheid.nl/voornamen-data-beschikbaar-voor-apps): they summarised the given first names over the whole country, which limits the dataset to a single purpose: first-name popularity. If they had aggregated to street/postcode/area level (or not at all), people might have used the dataset to relate name-giving to region or even economic status.
Which leads me to a suggestion to publishers: please provide us with the raw data, and offer aggregations as a separate download.
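To illustrate why raw data matters: with one row per record, every consumer can pick their own aggregation level. A sketch using pandas (the file and column names are made up):

    # Sketch: aggregating raw first-name records at different levels.
    # The CSV file and its columns are hypothetical.
    import pandas as pd

    raw = pd.read_csv("firstnames_raw.csv")  # one row per person

    by_country = raw.groupby("firstname").size()             # popularity
    by_region = raw.groupby(["region", "firstname"]).size()  # regional use
    print(by_country.sort_values(ascending=False).head())    # national top 5
    print(by_region.sort_values(ascending=False).head())     # regional top 5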

Open standards
At the event there were frequent requests for open formats like CSV/JSON/TXT, and people also asked for APIs. But there was not much awareness of open standards. At such a point I always realise that as a geo community we're quite far along in the development and implementation of open standards. The risk of every organisation implementing its own formats and APIs is that a data miner has to develop a specific format conversion for each organisation he wants to extract data from. Think of the Dutch municipalities: we have hundreds of them, and if they all developed their own API on their data, it would be very hard to extract similar data from all those APIs.

Quite some people are aware of this risk; that's why the government developed the "basisregistraties": specifications of how to store and communicate data in certain thematic areas, to be implemented by, for example, all municipalities. This is quite important for the open data movement, since most of the data available via the basisregistraties will be open data. A first example is the "Basisregistratie gebouwen", a dataset (plus SOAP and WFS APIs) which contains all buildings in the Netherlands. OK, this is not a simple JSON REST API, but hey, we're developers, we are not afraid of a little XML. My colleague pointed me to http://5stardata.info/, where complying with unified data models is indeed not mentioned as a star; they point to linked data as the way to go. That might indeed be a better pattern for interacting with data from different origins, but as far as I know it is still quite experimental at most organisations.

GeoJSON in GitHub
Recently the GitHub team added GeoJSON support to GitHub. Uploaded GeoJSON files are displayed as maps on a nice backdrop using LeafletJS. Since then people have started uploading masses of GeoJSON files, also in preparation for this hackathon. For sure there is the risk that this will be a one-off action and the data will soon be outdated, but if done correctly it could mean a real change in how we're used to publishing data. Imagine:
- An automated process updates the GeoJSON data in GitHub every... (a sketch of this follows below). In the Git history you can then see very nicely the historic development of the dataset.
- You can fork the dataset, reformat it and publish it again, or even open a pull request with data optimisations.
- To make the data accessible in traditional GIS, you could add a WMS/WFS server which uses the GitHub GeoJSON as its input storage (using OGR).
- In the end people will love Git as storage, introduce Git servers in their own organisations as the master storage, and just clone to GitHub.
Related to this, there is a proposal by Max Ogden & OKFN (https://github.com/maxogden/dat) and another by OpenGeo (https://github.com/opengeo/GeoGit). Today I noticed a blog by Rufus Pollock on the matter (http://blog.okfn.org/2013/07/02/git-and-github-for-data); amazing to see the movement on this theme these days.
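As a sketch of the first point in the list above, the automated refresh could be as simple as a scheduled script that rewrites the file and commits it (source URL, repository path, file and branch names are all hypothetical):

    # Sketch: a cron job that refreshes a GeoJSON file in a Git clone and
    # pushes it to GitHub. URL, paths and branch name are hypothetical.
    import os
    import subprocess
    import urllib.request

    SOURCE_URL = ("http://example.org/wfs?service=WFS&version=1.1.0"
                  "&request=GetFeature&typeName=topo:buildings"
                  "&outputFormat=application/json")
    REPO_DIR = "/var/data/open-data-repo"
    TARGET = "buildings.geojson"

    # Fetch the latest data and overwrite the file in the repository
    data = urllib.request.urlopen(SOURCE_URL).read()
    with open(os.path.join(REPO_DIR, TARGET), "wb") as f:
        f.write(data)

    # Commit and push; the Git history then records how the dataset
    # evolved. (git commit exits non-zero when nothing changed; a real
    # job would check for that first.)
    subprocess.check_call(["git", "add", TARGET], cwd=REPO_DIR)
    subprocess.check_call(
        ["git", "commit", "-m", "Automated data refresh"], cwd=REPO_DIR)
    subprocess.check_call(["git", "push", "origin", "master"], cwd=REPO_DIR)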

OGC vs best practices
The last thought is on OGC versus best-practice standards. These days we see projects like MapBox, CartoDB, LeafletJS and GeoJSON being very popular while dissociating from OGC standards.
For sure they use conventions between the products (epsg:900913, TMS, GeoJSON), but those conventions are the result of best practices in the community, not of a highly technical and political design process at OGC. These best-practice standards are lightweight, focus on performance, are much easier to implement, are widespread, and offer a more fluent user experience than any application using OGC standards. OGC should really focus on these lighter standards. We are at a point where data distributors and proprietary spatial software implementors get much pressure from users to also support the best-practice standards, with the result that the best-practice standards are widely implemented in both open source and proprietary systems without ever having been adopted by OGC.