This is a story about data, and about how we collect data, but more about the tools we use to analyse that data and how well suited they are to the results we want.
It is a story about finding the limits in analysis and trying to discover and write new ways of analysing data. It is a story about how data can define the tools we use to analyse it.
And it is a story about cats.
I track my cat using a collar-based tracker that records his position every ten seconds to an XML file, which can be taken off the device when he returns home. It is a fairly low-technology solution but Leo is a fairly low-technology cat.
Leo quickly lost his first tracker – he has lost twelve of the sixteen he has had in his two years – but using the second for a period of time showed me how ill suited it was to the task it was performing.
Cat trackers – as a rule – save data as GPX files. GPX files are most commonly used by cyclists and drivers for route planning data. A GPX file describes points, inferring the distance between those points by their position in the dataset. That “M606” precedes “M62”, which in turn precedes “M1”, in the code is how we use a GPX file for route mapping.
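To make that concrete, here is a minimal sketch of reading track points from a GPX file. The element names follow the GPX 1.1 schema; the sample data and the `cat-tracker` creator string are invented for illustration, not taken from Leo's device.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE_GPX = """<?xml version="1.0"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="cat-tracker">
  <trk><trkseg>
    <trkpt lat="53.7960" lon="-1.7594"><time>2014-05-01T09:00:00Z</time></trkpt>
    <trkpt lat="53.7961" lon="-1.7593"><time>2014-05-01T09:00:10Z</time></trkpt>
  </trkseg></trk>
</gpx>"""

NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def read_points(gpx_text):
    """Return (lat, lon, time) tuples in file order - order is the only
    sequencing information a GPX file carries."""
    points = []
    for trkpt in ET.fromstring(gpx_text).findall(".//gpx:trkpt", NS):
        when = datetime.strptime(trkpt.find("gpx:time", NS).text,
                                 "%Y-%m-%dT%H:%M:%SZ")
        points.append((float(trkpt.get("lat")), float(trkpt.get("lon")), when))
    return points

points = read_points(SAMPLE_GPX)
print(len(points))  # 2
```

Note there is nothing in the file that says how the points relate to one another: the route is implied purely by the order they appear in.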
Route mapping is a lot about movement and almost entirely unconcerned with stopping. If we had a GPX file of the M606/M62/M1 journey we could assume that there would be a correlation between the timestamps in the data and the speed of the car.
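That inference is simple enough to sketch: given a position and a timestamp for each point, the speed between consecutive fixes falls out of the great-circle (haversine) distance divided by the time difference. The coordinates below are illustrative values, not tracker output.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two fixes, in metres."""
    EARTH_RADIUS_M = 6371000
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def speeds(points):
    """points are (lat, lon, seconds); returns metres/second between fixes."""
    return [haversine_m(a[0], a[1], b[0], b[1]) / (b[2] - a[2])
            for a, b in zip(points, points[1:])]

# Two fixes 0.001 degrees of latitude apart, ten seconds apart:
# roughly 111 metres, so roughly 11 m/s.
print(speeds([(53.0, -1.0, 0), (53.001, -1.0, 10)]))
```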
As a file format GPX is about movement.
Cats are often still.
When Movement Gets Warm
It is very useful, and often surprising, to see where a cat goes when he or she is alone. The distance covered is remarkable and will give you a newfound respect for your moggy after you slob on the sofa following a ten-yard walk from the car and he has done 10km in a day.
The early tracking of Leo shows a good idea of where he does go. We see a line that draws into the field behind some houses. These short tracks gave some reasonable information but when we began to track Leo for full days the information became less useful.
An all-day wander is a spaghetti junction of movement around the same points. It is not uninteresting, but rather than looking at a day as a route we look at it more like a heat map. This is the failing of GPX analysis tools when used for cats rather than cars.
Heat maps for single car journeys are useless – you never go over the same point twice – but for cats a heat map tells you some interesting information. Where does your cat most like to go? Where does he avoid? Does he or she really go into the neighbour’s garden as often as the neighbour suggests?
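A heat map of this kind can be sketched very simply: bin each fix into a grid cell and count the visits. The cell size and the sample coordinates below are assumptions for illustration, not values taken from Leo's tracker.

```python
from collections import Counter

CELL = 0.0001  # assumed cell size: roughly 10m of latitude per cell

def heat_map(points):
    """Count how many fixes fall into each grid cell."""
    counts = Counter()
    for lat, lon in points:
        cell = (round(lat / CELL), round(lon / CELL))
        counts[cell] += 1
    return counts

# A cat that keeps returning to the same spot lights up one cell.
track = [(53.7960, -1.7594), (53.7960, -1.7594), (53.7961, -1.7593)]
hottest, visits = heat_map(track).most_common(1)[0]
print(visits)  # 2
```

The useful output is not a line but a ranking: the cells your cat visits most, against the cells he never enters at all.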
Let It Linger
Another way GPX data is poor is that it assumes the aim is to move between points, while cats often do not move at all. Often they find a place outside to sleep and can remain still for minutes, even hours, which barely registers in the tracking tools built for cars and bikes.
In Leo’s tracking there is no way of noticing that he might have stopped.
Also, the tracker Leo uses assumes a sight of the sky most of the time. The reasons for this are fairly obvious, but while cars do not go into houses, cats do, and this leaves gaps in the gathered data. The tracker can go offline for hours at a time, and if one traces where this happens it normally occurs within ten seconds of a cat flap.
And so we looked at tracking two events: the linger, and the undercover.
The linger is a time when Leo remains in one position. GPS tracking (at this cost) is not accurate enough to avoid small jumps, so a tolerance for slight movements in his latitude and longitude data has to be allowed. This is also useful for showing times when Leo has slowed, perhaps when stalking something.
A linger has to have some length too. To stop for a second or two is not the same as waiting to investigate something. A linger has to last at least 90 seconds, with a catch that ignores any bad data that may come in the middle of a linger, making it appear as if Leo has teleported 100m east and then returned in ten seconds. One bad data event will be tolerated.
The linger need only be a view of the map where the linger occurred.
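The linger rules above can be sketched as follows: stay within a small positional tolerance for at least 90 seconds, forgiving a single stray fix. The tolerance value is an assumption about GPS jitter, not a figure from the tracker; points are (lat, lon, seconds) tuples.

```python
TOLERANCE = 0.0002   # assumed jitter allowance, in degrees
MIN_LINGER = 90      # a linger must last at least this many seconds

def find_lingers(points):
    """Return (lat, lon, start, end) for each detected linger."""
    lingers = []
    i = 0
    while i < len(points):
        lat0, lon0, t0 = points[i]
        j = i + 1
        bad = 0
        while j < len(points):
            lat, lon, t = points[j]
            if abs(lat - lat0) <= TOLERANCE and abs(lon - lon0) <= TOLERANCE:
                j += 1
            elif bad == 0:        # tolerate one teleport-style bad fix
                bad += 1
                j += 1
            else:
                break
        t_end = points[j - 1][2]
        if t_end - t0 >= MIN_LINGER:
            lingers.append((lat0, lon0, t0, t_end))
        i = j
    return lingers

# 100 seconds of sitting still, with one bad fix in the middle.
base = [(53.7960, -1.7594, t) for t in range(0, 110, 10)]
base[5] = (53.7970, -1.7580, 50)      # the tolerated teleport
base.append((53.8000, -1.7500, 110))  # Leo actually moves off
print(len(find_lingers(base)))  # 1
```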
The undercover is when the dataset has no entries for a period of time. If Leo goes into a house, or under tree cover, and the tracking signal is lost then the gap between data is not obvious when using GPX analysis tools based on route planning.
We need to know where Leo has gone undercover – probably into a cat flap – and where he has come out of being undercover – the same cat flap – and how long he spent away. This will tell us if he has been home for a two-hour nap or raided a neighbour’s house for a two-minute food steal.
This is a simple job of looking for any time when the difference in timestamps on the data is longer than it should be and then noting where this occurred.
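As a sketch: the tracker writes a fix every ten seconds, so any gap much longer than that means the signal was lost. The 30-second threshold below is an assumption, chosen to leave headroom above the normal interval; points are (lat, lon, seconds) tuples.

```python
GAP_THRESHOLD = 30   # assumed: longer than this counts as going undercover

def find_undercover(points):
    """Return (entry_fix, exit_fix, duration) for each gap in the data."""
    gaps = []
    for prev, curr in zip(points, points[1:]):
        duration = curr[2] - prev[2]
        if duration > GAP_THRESHOLD:
            gaps.append((prev, curr, duration))  # where he vanished and reappeared
    return gaps

track = [(53.7960, -1.7594, 0), (53.7960, -1.7594, 10),
         (53.7961, -1.7593, 7210)]   # two hours of silence: a nap indoors
gaps = find_undercover(track)
print(gaps[0][2])  # 7200
```

The entry and exit fixes bracket the gap, so plotting them on a map shows which cat flap – or which neighbour's kitchen – the silence belongs to.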
No surprises. When you are this photogenic you do stop to strike a pose.
What this tells us about data
Data has an increasing importance in our society. On a surface level one understands that – for example – Amazon uses it to match one to a book one might like, and that supermarkets use it to try to understand why people buy more sprouts on a Thursday afternoon, and we are fine with that.
Fine in that we assume that if that data does not lead to anything especially interesting – such as the wrong book – or insightful – such as not knowing why sprouts are sold – then the problems are distant and commercial. If Amazon or Tesco fail someone will fill their place.
But data is being used in financial markets, and to model financial markets, and those models are used to shape societies and even to misshape societies. If the analysis of the data is good – and if the tools used to analyse it are apt – then we can expect good things to follow.
But what about a fundamental assumption such as that route tracking and cat tracking work in the same way? It is a benign issue here, but what if somewhere in a complex model of an economy there is a similar assumption that is flawed? What if we are missing something in our analysis of data because our meta-analysis of how we can analyse data is built on assumptions that are too simple, and not nuanced around the data itself?
We get better at data. We get better at collecting it and better at analysing it, creating analysis which is specific to the data, but we only do this when we find previous analysis unsuitable for our needs, as I have with tracking Leo using GPX files. And while Evgeny Morozov’s To Save Everything, Click Here is a flawed book, his assertion that as a society we are becoming so enamoured with digital solutionism that we begin to accept it unquestioningly is hard to deny.
We need to get better at understanding the limits of the analysis of the datasets we have, and at creating better tools for that analysis.
Or we could sleep for four hours. I know what Leo would do.