What do you get when you let a bunch of ML engineers run? Well, in this case, an analysis of their performance and a hierarchical clustering of the split times. In this blog, our ML engineer Keje briefly explains the clustering process based on our results in the Zevenheuvelenloop. Let’s start!

Seven Hills Run data

The Zevenheuvelenloop (Seven Hills Run) is an annual road running race of 15 kilometres held in Nijmegen, the Netherlands. This year ten of us enjoyed this run. After finishing, we were eager to analyse the results. After some slight modifications, the dataset looked like this:

 
Distance      P1_time   Split_1  Splitsec_1  P10_time  Split_10  Splitsec_10
1 kilometer   04:57     4:57     297         03:44     03:44     224
2 kilometer   09:48     4:51     291         07:41     03:57     237
..
14 kilometer  01:06:42  4:34     274         55:49     03:49     229
15 kilometer  1:11:04   4:22     262         59:26     03:37     217
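
The Splitsec columns can be derived from the cumulative times. Below is a minimal sketch of that step, assuming the times are stored as 'mm:ss' or 'h:mm:ss' strings; the function names `to_seconds` and `splits_from_cumulative` are made up for illustration and are not necessarily how the original dataset was prepared.

```python
def to_seconds(clock: str) -> int:
    """Convert a clock string such as 'mm:ss' or 'h:mm:ss' to whole seconds."""
    seconds = 0
    for part in clock.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def splits_from_cumulative(times: list[str]) -> list[int]:
    """Return the split per kilometre in seconds from cumulative clock times."""
    cumulative = [to_seconds(t) for t in times]
    # Pair every passing time with the previous one (0 before the start).
    return [now - before for before, now in zip([0] + cumulative, cumulative)]


# First two passing times of the first runner, taken from the table above.
print(splits_from_cumulative(["04:57", "09:48"]))  # [297, 291]
```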

Analysis of running times

One of us managed to run below the one hour mark. Some of you will probably know who that is! An interesting analysis would be to look at the similarity of running patterns. The first step is to plot each person’s split times in seconds to get a feel for the data. The first visualisation of the split times looks like the one below. We can observe that one person (brown line) has the fastest split times and that one person (red) started fast but had some issues at kilometer six.

 

Running times Machine Learning

Normalising the split times

If we want to cluster, it would be fun to compare running behaviour and how similar it is between runners. For example, do some people sprint at the end, or do some show a steady pattern? And who is a fast climber relative to their average split time? Let’s start by looking at the data after normalising each split time in seconds by the average split time in seconds of that individual. In this case a 1 means the split time equals that person’s average, and a 0.9 means it is 10% faster. The graph on the right now shows that the split times follow a more similar pattern, with one exception. It’s also interesting to see that eight people sprinted at the end whereas two others slowed down. After this quick analysis, it would be interesting to see how each person correlates with the others.
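
The normalisation described above can be sketched in a few lines. The split times in the example are hypothetical and only chosen to make the arithmetic easy to follow; `normalise` is an illustrative name.

```python
from statistics import mean


def normalise(splits_sec):
    """Divide every split by the runner's own average split time."""
    average = mean(splits_sec)
    return [split / average for split in splits_sec]


# Hypothetical splits in seconds (not real race data); the average is 300.
print(normalise([300, 270, 330]))  # [1.0, 0.9, 1.1]
```

A value of 0.9 here means the 270-second kilometer was 10% faster than this runner’s 300-second average.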

Correlation

To determine the correlation, we can use a correlation plot to see visually who correlates with whom.

 

There are some strong correlations! That makes sense, because most of us run faster downhill than uphill. However, there are also some negatively correlated combinations, such as person_1 and person_7. Going back to the line plot, we see that person_1 is the blue line, which starts at a value of 1.05 (5% slower than that person’s average) and gets faster with each kilometer. Person_7 (pink) shows the opposite trend. This explains why we observe a negative correlation.
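
To illustrate why opposite trends give a negative correlation, here is a small sketch using NumPy’s `np.corrcoef`, which computes Pearson correlations between the rows of an array. The three runners and their normalised splits are hypothetical, and the choice of Pearson correlation is an assumption, since the type of correlation behind the plot is not stated.

```python
import numpy as np

# Hypothetical normalised splits: one row per runner, one column per kilometer.
normalised = np.array([
    [1.05, 1.02, 1.00, 0.97, 0.96],  # starts slow, gets faster every kilometer
    [0.96, 0.98, 1.00, 1.02, 1.04],  # starts fast, gets slower every kilometer
    [1.04, 1.03, 0.99, 0.98, 0.96],  # also gets faster every kilometer
])

# Pearson correlation between every pair of runners (rows).
corr = np.corrcoef(normalised)

print(corr[0, 1] < 0)  # True: opposite trends correlate negatively
print(corr[0, 2] > 0)  # True: similar trends correlate positively
```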

Clustering

Clustering is a technique that groups similar datapoints into clusters. Each datapoint in a cluster is more similar to the other datapoints in the same cluster than to datapoints in other clusters. Multiple algorithms are available to cluster data. For this example, we use a hierarchical clustering method because our dataset is limited (running times only) and because hierarchical clustering does not require us to specify the number of clusters beforehand.

There are two forms of hierarchical clustering:

  1. Agglomerative: builds the clusters bottom-up. Each observation starts in its own cluster and is merged with the cluster that is most similar. It stops when there is only one cluster left.
  2. Divisive: a top-down approach where all observations start in one cluster and splits are performed recursively.

In our case, the agglomerative method looks like this:

  1. Compute the normalised split times for each person, so that each runner looks like {person_1}, [1.04, …, 0.98]
  2. Put every person in his/her own bucket and find the clusters that are most similar by calculating the distance (more about distance here)
  3. Visualise the distance by plotting a dendrogram
  4. Repeat until only one cluster remains
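
The steps above can be sketched in plain Python to show what happens at each merge. This is an illustration under assumptions, not the implementation used for the results: Euclidean distance and average linkage are choices made here, the function names are made up, and the runners and their values are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance between all members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge log."""
    clusters = [(name,) for name in points]  # every runner starts in its own bucket
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        merges.append((a, b, average_linkage(a, b, points)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical normalised splits (not the real race data).
runners = {
    "person_a": [1.04, 1.00, 0.96],
    "person_b": [1.03, 1.01, 0.96],
    "person_c": [0.95, 1.00, 1.05],
}
for left, right, distance in agglomerate(runners):
    print(left, right, round(distance, 3))
```

The merge log holds exactly the information a dendrogram draws: which clusters were joined and at what distance.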

A dendrogram is a tree-like diagram that visualises the clusters and the distances at which they merge or split.
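
In practice, a library such as SciPy can build the tree. The sketch below uses `scipy.cluster.hierarchy.linkage` and `dendrogram`. The average linkage method and Euclidean metric are assumptions, and the four runners are hypothetical. Passing `no_plot=True` only computes the layout; leaving it out draws the tree when matplotlib is installed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical normalised splits: one row per runner.
labels = ["person_a", "person_b", "person_c", "person_d"]
normalised = np.array([
    [1.04, 1.00, 0.96],
    [1.03, 1.01, 0.96],
    [0.95, 1.00, 1.05],
    [0.96, 1.00, 1.04],
])

# Each row of Z describes one merge:
# [cluster id 1, cluster id 2, merge distance, size of the new cluster].
Z = linkage(normalised, method="average", metric="euclidean")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right order
```

The third column of `Z` is the height at which two branches join in the dendrogram, so a low join means two runners with very similar patterns.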

 

Results of hierarchical clustering

When we look at the dendrogram, there are some clear clusters. Person_2 and person_9 are the most similar because they have the smallest distance. Person_10 and person_6 follow soon after. Looking back at the normalised line plot, the characteristic of this cluster is that its runners tend to start faster than average and begin to slow down at 5km. Between 5 and 11km they maintain a steady pace, slowing down on climbs and speeding up downhill. From 11km onwards the speed increases, with the fastest split time at the end. Person_7 and person_8 are similar at the end: both slow down as they approach the finish. There is also an outlier, person_4 (red), who slowed down at kilometers 6 and 7 due to an issue.
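
Reading clusters off a dendrogram by eye can also be done programmatically by cutting the tree at a chosen distance. The sketch below uses SciPy’s `fcluster` for this. The data is hypothetical, and the threshold of 0.05, the average linkage and the Euclidean metric are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical normalised splits: two runners who speed up, two who slow down.
normalised = np.array([
    [1.04, 1.00, 0.96],
    [1.03, 1.01, 0.96],
    [0.95, 1.00, 1.05],
    [0.96, 1.00, 1.04],
])

Z = linkage(normalised, method="average", metric="euclidean")

# Cut the tree at distance 0.05: merges above that height are not applied.
cluster_ids = fcluster(Z, t=0.05, criterion="distance")
print(cluster_ids)  # one cluster id per runner
```

With these values the first two runners share one cluster id and the last two share another, because the distance within each pair is far below the threshold and the distance between the pairs is above it.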

 

Next steps

This quick analysis showed us how similarly some of us run. Possible next steps are to enrich the data with heart rate measurements from Strava and with the elevation differences along the route. Even better would be to deploy this clustering method in the cloud and stream the smallest distance in real time, to determine how similar your running is to that of your colleagues!

Thanks for reading! – Keje (ML Engineer @ Enjins)

 
