I've been using my Garmin Forerunner 210 to record my running progress. Like many gadgets it has a built in GPS receiver, allowing the watch to not only record how long my run is, but also where I run.
Running the same route every time can be really boring. To keep things interesting, I take slight detours on my routes. Then a crazy idea popped into my head: can I automatically compare GPS tracks to see where they differ?
Aligning Protein Sequences
Aligning protein sequences is really useful. It allows bioinformaticians to find regions of similarity which may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Sequence 1: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Sequence 2: TCCPSIVARRSNFNVCRLTPEAICATYTGCIIIPGATSPGDYAN
When the above sequences are aligned, gaps are shown to indicate insertions and deletions.
Alignment 1: TTCCPSIVA-RSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Alignment 2: -TCCPSIVARRSNFNVCRL--TPEAICATYTGCIIIPGATSPGDYAN
One of the earliest methods for aligning sequences is the Needleman-Wunsch algorithm, published in 1970.
At the heart of the algorithm is the scoring system, which is used to determine how a pair of residues should align. The higher the score for a given combination, the more it is biased in the resulting alignment.
A gap penalty is also applied during alignment. If the gap penalty is greater than the score for aligning two residues, a gap is introduced in the alignment instead.
You can use any scoring system you like, depending on the problem you're trying to solve. BLOSUM is a similarity matrix commonly used as the scoring system to align protein sequences. It takes into account the liklihood that a residue will mutate into another based on what has already been observed in nature.
Aligning GPS tracks
How many of you skipped the bit about protein sequences? Come on, be honest. It's okay, I don't mind, but you might be confused if you don't read that first.
Hopefully you can already see the connection: instead of aligning residues in protein sequences, we'll be aligning co-ordinates in GPS tracks. The bits that don't align, the gaps, will be the sections of the tracks that differ.
But what scoring system and gap penalty should be used?
When we compare tracks by eye, we look at the distance between points. We can use the same metric to score GPS points in the alignment; however, the Needleman-Wunsch algorithm considers higher values a match.
Higher distances mean the points are further apart and therefore non-matching, so we need to use the negative distance as the score, which results in larger distances having smaller values.
The gap penalty works as the cut-off for when the distance between two points is considered large enough that they don't represent the same location. As the distance is always negative, the gap penalty distance must also be negative.
There is no universal value for this cut-off. It depends on the quality of the data and the context.
If the GPS device is accurate to 5 metres, setting the cut-off to 2 metres will result in a noisy alignment. Likewise, if the tracks being compared are from planes, you might consider them on the same flight path if they are within 200 metres of one another; for tracks of people walking, this distance will be far too large.
So what happens if we use all this information to align two GPS tracks stored in a GPX file and plot them on a map, showing the matched points in blue and the differences in red and orange?
It's not bad, but also not perfect. It's found the major detour at the bottom-left of the map (in red), but there are also a lot of incorrect mismatches spread throughout the tracks (in orange). What's causing this?
The GPS points in the two tracks are not evenly distributed, resulting in points that are far apart being compared when the density in a section of one track differs greatly from the other. If you look at the locations of the orange points in the alignment, they match very closely to the low-density regions in the plot of the first track shown in the introduction.
This is fixed by pre-processing the GPX files to distribute GPS points evenly over the track. The closer together the points, the more accurate the result, but the longer the alignment will take.
So how does it look now? Amazing!
Show me the code
The implementation of the Needleman-Wunsch algorithm, adapted for GPS tracks, is shown below:
def align_tracks(track1, track2, gap_penalty):
""" Needleman-Wunsch algorithm adapted for gps tracks. """
_log.info("Aligning tracks")
def similarity(p1, p2):
d = gpxpy.geo.distance(p1.latitude, p1.longitude, p1.elevation,
p2.latitude, p2.longitude, p2.elevation)
return -d
# construct f-matrix
f = numpy.zeros((len(track1), len(track2)))
for i in range(0, len(track1)):
f[i][0] = gap_penalty * i
for j in range(0, len(track2)):
f[0][j] = gap_penalty * j
for i in range(1, len(track1)):
t1 = track1[i]
for j in range(1, len(track2)):
t2 = track2[j]
match = f[i-1][j-1] + similarity(t1, t2)
delete = f[i-1][j] + gap_penalty
insert = f[i][j-1] + gap_penalty
f[i, j] = max(match, max(delete, insert))
# backtrack to create alignment
a1 = []
a2 = []
i = len(track1) - 1
j = len(track2) - 1
while i > 0 or j > 0:
if i > 0 and j > 0 and \
f[i, j] == f[i-1][j-1] + similarity(track1[i], track2[j]):
a1.insert(0, track1[i])
a2.insert(0, track2[j])
i -= 1
j -= 1
elif i > 0 and f[i][j] == f[i-1][j] + gap_penalty:
a1.insert(0, track1[i])
a2.insert(0, None)
i -= 1
elif j > 0 and f[i][j] == f[i][j-1] + gap_penalty:
a1.insert(0, None)
a2.insert(0, track2[j])
j -= 1
return a1, a2
There are more examples, along with the code to generate your own alignments, in the jonblack/cmpgpx repository on GitHub. It's in the public domain, too, so do what you please with it. Yay!
This was originally posted on my blog.