Successlessness: August 2008

This post describes the first two "chop" functions that fit into the partitioning framework outlined last post.

def chopToSpeedHistogram(dest, p):
    # create histogram of speeds from nmea written to stdout
    os.system("cat "+sh_escape(dest)+".log"
              + " | awk -F , '{if($1==\"$GPVTG\" && int($8)!=0){count[int($8+0.5)]++}}"
              + " END {for(w in count) printf(\"[%d,%d],\\n\", w, count[w]);}'"
              # sort it
              + " | sort -g -k 1.2"
              # output json of histogram
              + " > "+sh_escape(dest)+".hist")

def chopToHeadingHistogram(dest, p):
    # create histogram of headings from nmea written to stdout (ignore heading when stopped)
    os.system("cat "+sh_escape(dest)+".log"
              + " | awk -F , '{if($1==\"$GPVTG\" && int($8)!=0){count[5.0*int($2/5.0+0.5)]++;}}"
              + " END {for(w in count) printf(\"[%d,%d],\\n\", w, count[w]);}'"
              # sort it
              + " | sort -g -k 1.2"
              # output json of histogram
              + " > "+sh_escape(dest)+".head")

Both functions use awk to create a histogram from the speed (in km/h) and heading (or bearing, in degrees) from the NMEA VTG sentences. The speed is rounded to an integer, and the bearing to the nearest 5 degrees. The data logger records on reading per second, so this gives a measure of how much time was spent at each speed/bearing.

The histogram is output in a "json" array format that can be inserted straight into a webpage where the flot library is used to generate some graphs.

Speed Histogram

The average and standard deviation (shaded at ±0.5σ) are indicated on the graph for two bike rides along the same route, and match pretty closely with that recorded by my bike computer:

	GPS log			Bike computer
Ride 1 (brown)	2hrs 59 min, minus 41 min stopped	63.4km	27.7km/h	2 hrs 16 min	64.01km	28.00km/h
Ride 2 (dark green)	2 hrs 25 min, minus 13 min stopped	63.4km	29.0km/h	2 hrs 10 min	63.85km	29.30km/h

The two rides went in different directions, the first in the "uphill" direction and the second with a bit of a tail wind. I got a flat tire on the first ride too, hence the extra time spent stopped.

Heading Histogram

Up is north and the radius represents the time spent heading in that direction (normalized during the plotting process and "expanded" by taking the square root to show a little more detail.)

My GPS logger is capturing lots of useful information but it's difficult to efficiently capture data for regular activities. Geotagging photos is easy, and manually working with the logs for a special event is possible, but it's not feasible to put in that much work to analyze commutes for example.

The logger creates a separate log file each time it's switched on and off, and while these logs could be sorted into categories for analysis, it's easy to forget to turn it on and off at the start and end of a section of interest and activities are then merged in the logs. In addition, there is often "junk" data at start and end of logs while leaving or arriving at a destination.

I wanted to be able to automatically capture the information about my daily activities by simply switching on the logger and carrying it around with me. I then simply want to plug the logger into the computer and have the logs automatically chopped into segments of interest that can be compared to each other over time.

The rest of this post roughly outlines the Python script I created to perform this task, minus some of the hopefully irrelevant details.

Firstly, I collect the lat/long coordinates of places that I am interested in collecting data while I'm there and traveling between them. These include my home, work, the climbing gym and so on. Each point has a radius within which any readings will be considered to be in that place.

#         id:  name lat         long        radius
places = { 1: ("A", -37.123456, 145.123456, 0.050),
           2: ("B", -37.234567, 145.234567, 0.050),
           3: ("C", -37.345678, 145.345678, 0.050) }
otherid = 4

For each of these places of interest, I then use gpsbabel's radius filter to find all the times where I was within that zone:

# create a list of all raw log files to be processed
from path import path
month = path("/gpslogs/200808")
logs = " ".join(["-i nmea -f %s"%log 
                 for log in sorted((month/"raw").files("GPS_*.log"))])

for (id,(place,lat,lon,radius)) in places.items():
   os.system("gpsbabel "
             # input files
             + logs
             # convert to waypoints
             + " -x transform,wpt=trk,del"
             # remove anything outside place of interest
             + (" -x radius,distance=%.3fK,lat=%.6f,lon=%.6f,nosort"%(radius,lat,lon))
             # convert back to tracks
             + " -x transform,trk=wpt,del"
             # output nmea to stdout
             + " -o nmea -F -"
             # filter to just GPRMC sentences
             + " | grep GPRMC"
             # output to log file
             + (" > %s/processed/place%d.log"%(month,id)))

And all points outside any of the specific places of interest are sent into an "other" file:

os.system("gpsbabel "
          # input files
          + logs
          # convert to waypoints
          + " -x transform,wpt=trk,del"
          # remove anything in a place of interest
          + "".join([" -x radius,distance=%.3fK,lat=%.6f,lon=%.6f,nosort,exclude"%(radius,lat,lon)
                     for (id,(place,lat,lon,radius)) in places.items()])
          # convert back to tracks
          + " -x transform,trk=wpt,del"
          # output nmea to stdout
          + " -o nmea -F -"
          # filter to just GPRMC sentences
          + " | grep GPRMC"
          # output to log file
          + (" > %s/processed/place%d.log" % (month, otherid)))

These files are filtered with grep to contain only minimal data as we only require the timestamps for this part of the process. Specifically only the NMEA GPRMC sentences are kept.

To provide a brief illustration, the following picture shows two log files of data, a blue and a green, between three points of interest:

The above process would create four files, one for each point A, B and C and one for "Other" points that would contain something like the following information, where the horizontal axis represents time:

I then read all those log files back in to create a "time line" that for each timestamp stores my "location" in the sense that it knows whether I was "home", at "work" or somewhere between the two.

# dict of timestamp (seconds since epoch, UTC) to placeid
where = {}
for placeid in places.keys()+[otherid,]:
   for line in (month/"processed"/("place%d.log"%placeid)).lines():
      fields = line.split(",")
      # convert date/time to seconds since epoch (UTC)
      t, d = fields[1], fields[-3]
      ts = calendar.timegm( (2000+int(d[4:6]), int(d[2:4]), int(d[0:2]),
                             int(t[0:2]), int(t[2:4]), int(t[4:6])) )
      where[ts] = placeid

This is then summarised from one value per second to a list of "segments" with a start and end time and a location. Unlogged time segments are also inserted at this point whenever there are no logged readings for 5 minutes or more.

# array of tuples (placeid, start, end, logged)
# placeid = 0 indicates "unknown location", i.e. unlogged
summary = []
current, start, stop, last_ts = 0, 0, 0, None
for ts in sorted(where.keys()):
   # detect and insert "gaps" if space between logged timestamps is greater than 5 minutes
   if last_ts and ts-last_ts > 5*60:
      if current:
         summary.append( [current, start, stop, True] )
      current, start, stop = where[ts], ts, ts
      summary.append( [0, last_ts, ts, False] )
 
   last_ts = ts

   if where[ts] != current:
      if current:
         summary.append( [current, start, stop, True] )
      current, start, stop = where[ts], ts, ts
   else:
      stop = ts
summary.append( [current, start, stop, True] )

(If there's a more "Pythonic" way of writing that kind of code, I'd be interested in knowing it.)

"Spurious" segments are then removed. These show up because when the logger is inside buildings the location jumps around and often out of the 50m radius meaning that, for example, there will be a sequence of Home-Other-Home-Other-Home logs. The "Other" segments that are between two known points of interest and less than 5 minutes long are deleted, as are "Other" segments that sit between a known place of interest and an unlogged segment.

Based on the above graphic, the summary might look something like the following:

start	end	location
10.00am	10.05am	A
10.05am	10.30am	Other
10.30am	10.35am	B
10.35am	11.00am	Other
...

The "Other" segments are then labelled if possible to indicate they were "commutes" between known locations:

start	end	location
10.00am	10.05am	A
10.05am	10.30am	A-B
10.30am	10.35am	B
10.35am	11.00am	B-C
...

Some segments cannot be labeled automatically and are left as "Other". This may be a trip out to a "one-off" location and back again, which can be left as "Other". However, sometimes it is because the logger didn't lock onto the satellites within the 50m radius on the way out of a place of interest and these can be manually fixed up later.

Once a list of "activities" has been obtained, with start and end times, it is easy to use gpsbabel again to split logs based on start and end of time segments:

for (place, start, stop, place_from, place_to, logged) in summary:
    dest = month / "processed" / ("%s-%s"%(time.strftime("%Y%m%d%H%M%S", time.localtime(start)),
                                           time.strftime("%Y%m%d%H%M%S", time.localtime(stop))))

   for (ext, chopFn) in [(".log", chopToLog),
                         (".kml", chopToKml), 
                         (".speed", chopToSpeedVsDistance), 
                         (".alt", chopToAltitudeVsDistance), 
                         (".hist", chopToSpeedHistogram),
                         (".head", chopToHeadingHistogram),
                         (".stops", chopToStopsVsDistance)]:
      if not (dest+ext).exists():
         chopFn(dest, locals())
         # make the file in case it was empty and not created
         (dest+ext).touch()

This generates a bunch of files for each segment, named with the start and end timestamps of the segment and an extension depending on the content. The first "chop" function generates an NMEA format log file that is then processed further by the remaining "chop" functions. The other chop functions will probably be explained in a later post, the first two are:

def chopToLog(dest, p):
    # filter input file entries within times of interest to temp file
    os.system("gpsbabel " + p["logs"]
              + (" -x track,merge,start=%s,stop=%s"
                 % (time.strftime("%Y%m%d%H%M%S", time.gmtime(p["start"])),
                    time.strftime("%Y%m%d%H%M%S", time.gmtime(p["stop"]))))
              + " -o nmea -F "+sh_escape(dest)+".log")

def chopToKml(dest, p):
    # create kml file with reduced resolution
    os.system("gpsbabel -i nmea -f "+sh_escape(dest)+".log"
              + " -x simplify,error=0.01k"
              + " -o kml -F "+sh_escape(dest)+".kml")

def sh_escape(p):
    return p.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

(Again, if there's a better way to handle escaping special characters in shell commands, I would like to know it.)

Using this, I can simply plug in the logger, which launches an autorun script, and the end result are nicely segmented log files that I can map and graph. More about that in another post.

Successlessness

Monday, August 25, 2008

Analysing GPS Logs with Awk

Speed Histogram

Heading Histogram

Thursday, August 21, 2008

Automatically Partitioning GPS Logs with gpsbabel

Monday, August 18, 2008

GMail tip: "To be filtered" label

AMOD AGL3080 GPS data logger

labels

archive

del.icio.us/tompaton

contributors