Is GeoJSON a good spatial data format?
A few days ago on Mastodon Eli Pousson asked:
Can anyone suggest examples of files that can contain location info but aren’t often considered spatial data file formats?
He suggested EXIF, Iván Sánchez Ortega followed up with spreadsheets, and being devilish I said GeoJSON.
This led to more discussion, with people asking why I thought that, so I instead of being flippant I thought about it. This blog post is the result of those thoughts which I thought were kind of obvious but from things people have said since may be aren’t that obvious.
I’ve mostly been a developer for most of my career so my main interest in a spatial data format is that:
- it stores my spatial data as I want it to,
- it’s fast to read and to a lesser extent, write.
- It’s easy to manage.
One, seems to be obvious, if I store a point then ask for it back I want to get that point back (to the limit of the precision of the processor’s floating point). If a format can’t manage that then please don’t use it. This is not common but Excel comes to mind as a program that takes good data and trashes it. If it isn’t changing gene names into dates then it’s reordering the dbf file to destroy your shapefile. GeoJSON also can fail at this as the standard says that I must store the data in WGS:84 (lon/lat), which is fine if that is the format that I store my data in already, but suppose I have some high quality OSGB data that is carefully surveyed to fractions of a millimetre and the underlying code does a conversion to WGS:84 in the background and further the developer wanted to save space and limited the number of decimal places to say 6 (OK, that was me) when it gets converted back to OSGB I’m looking at centimetres (or worse) but given the vagaries of floating point representation I may not be able to tell.
Two, comes from being a GeoServer developer, a largish chunk of the time taken to draw a web map (or stream out a WFS file) is taken up by reading the data from the disk. Much of the rest of the time is converting the data into a form that we can draw. Ideally, we only want to read in the features needed for the map the user has requested (actually, ideally we want to not read in most of the data by having it already be in the cache, but that is hard to do). So we like indexed datasets both spatial indexes and attribute indexes can help substantially speed up map drawing. As the size of spatial datasets increases the time taken to fetch the next feature from the store becomes more and more important. An index allows the program to skip to the correct place in the file for either a specific feature or for features that are in a specific place or contain a certain attribute with the requested value. This is a great time saver, imagine trying to look something up in a big book by using the index compared to paging through it reading each page in turn.
After one or more indexes the main thing I look for in a format is a binary format that is easy to read (and write). GeoJSON (and GML) are both problematic here as they are text formats (which is great in a transfer format) and so for every coordinate of every spatial object the computer has to read in a series of digits (and punctuation) and convert that into an actual binary number that it can understand. This is a slow operation (by computer speeds anyway) and if I have a couple of million points in my coastline file then I don’t want to do 4 million slow operations before I even think of drawing something.
Three, I have to interact with users on a fairly regular basis and in a lot of cases these are not spatial
data experts. If a format comes with up to a dozen similarly named files (that are all important) that a GIS
will refuse to process unless you guess which is the important one then it is more of a pain than a help. And
yes shapefile I’m looking at you. If your process still makes use of Shapefiles please, please stop doing that
to your users (and the support team) and switch over to GeoPackages which can store hundreds of data sets
inside a single file, All good GIS products can process them by now, they have been an OGC standard for nearly
10 years. If you don’t think that shapefiles are confusing go and ask your support team how often they have
been sent just the
.shp file (or 11 files but not the
.sbn) or how often they have seen people who have
deleted all the none
.shp files to save disk space.
My other objection to GeoJSON is that I don’t know what the structure (or schema) of the data set is until I have read the entire file. That last record could add several bonus attributes, in fact any (or all) of the records could do that, from a parsers view it is a nightmare. At least GML provides me with a fixed schema and enforces it through out the file.
When I’m storing data (as opposed to transferring it) I use PostGIS, it’s fast and accurate, can store my data in whatever projection I chose and is capable of interfacing with any GIS program I am likely to use, and if I’m writing new code then it provides good, well tested libraries in all the languages I care about so I don’t have to get into the weeds of parsing binary formats. If I fetch a feature from PostGIS it will have exactly the attributes I was expecting no more or less. It has good indexes and a nifty DSL (SQL) that I can use to express my queries that get dealt with by a cool query optimiser that knows way more than I do about how to access data in the database.
If for some reason I need to access my data while I’m travelling or share it with a colleague then I will use a GeoPackage which is a neat little database all packaged up in a single file. It’s not a quick as PostGIS so I wouldn’t use it for millions of records but for most day to day GIS data sets it’s great. You can even store you QGIS styles and project in it to make it a single file project transfer format.
One final point, I sometimes see people preaching that we should go cloud native (and often serverless) by embracing “modern” standards like GeoJSON and COGs. GeoJSON should never be used as a cloud native storage option (unless it’s so small you can read it once and cache it in memory in which case why are you using the cloud) as it is large (yes, I know it compresses well) and slow to parse (and slower still if you compressed it first) and can’t be indexed. So that means you have to copy the whole file from a disk on the far side of a slow internet connection. I don’t care if you have fibre to the door it is still slow compared to the disk in your machine!