The Apteco Datathon: 3. How can you win an Olympic medal?

20 Sep 2018  |  by Tom Taylor

In the third of our Apteco Datathon series we continue to deep dive into real-life datasets to analyse and interpret the numbers. In this blog we examine what it takes to be an Olympic athlete – and your best chances of winning a medal.

In our last blog we explored a dataset containing international football matches starting from 1872. Continuing this sporting theme, we created and explored a FastStats system containing 120 years of Olympic data. What can we learn from this data and could it show us how to improve our chances of standing on the podium with an Olympic medal around our neck?

The Dataset

Conveniently for us, a dataset scraped from sports-reference.com had already been put into CSV format and hosted on Kaggle. We combined this with a US government dataset containing various measures of country development. With a bit of tweaking this gave us two tables: Athletes and Events. The fields in each table are:

  • Athletes
    • ID
    • Height
    • Weight
    • Sex
    • Name
  • Events
    • Athlete
    • Athlete age at the event
    • Competing country and NOC code (National Olympic Committee Code e.g. GBR)
    • Team (e.g. Denmark/Sweden or USA)
    • Host city
    • Sport and event
    • Medal
    • Event year

To attach the country data to the relevant country in the Events table, we used the NOC code with a lookup table to find its associated country. This isn’t always appropriate, as mixed-nation teams occasionally compete under a single country code, but this is rare. More often than not, the countries in such a team are also very similar in terms of development (note the Denmark/Sweden example). This gives us a total of 135,571 athletes, with 271,116 unique entries, and information for 205 countries.
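For anyone who'd like to try this lookup outside FastStats, here's a minimal Python sketch of the join. The field names (noc, gdp_per_capita, literacy_pct) and every value are made up for illustration; they aren't the real column names or figures from either dataset.

```python
# Placeholder data: field names and values are assumptions, not the real schema.
noc_to_country = {"GBR": "United Kingdom", "USA": "United States", "DEN": "Denmark"}

country_stats = {
    "United Kingdom": {"gdp_per_capita": 30000, "literacy_pct": 99.0},
    "United States": {"gdp_per_capita": 35000, "literacy_pct": 99.0},
    "Denmark": {"gdp_per_capita": 32000, "literacy_pct": 99.0},
}

events = [
    {"athlete_id": 1, "noc": "GBR", "team": "Great Britain"},
    {"athlete_id": 2, "noc": "DEN", "team": "Denmark/Sweden"},  # mixed team, one code
    {"athlete_id": 3, "noc": "XXX", "team": "Unknown"},         # code not in the lookup
]


def attach_country(event, noc_lookup, stats):
    """Return a copy of the event with country and development fields added."""
    country = noc_lookup.get(event["noc"])    # None when the code is unknown
    enriched = dict(event)
    enriched["country"] = country
    enriched.update(stats.get(country, {}))   # adds nothing for unknown countries
    return enriched


enriched_events = [attach_country(e, noc_to_country, country_stats) for e in events]
```

Note how the Denmark/Sweden entry simply inherits Denmark's figures, which is exactly the approximation described above.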

As there’s no data for the 2018 Winter Olympics in PyeongChang, we’ll only be looking at the summer games.

Diving into the data

To verify the integrity of the data first, let's check it against something we already know: who is the most decorated Olympic athlete? We'll begin by building an aggregation that counts event entries with a bronze, silver or gold medal.

datathon-1.jpg

datathon-2.jpg

Applying this aggregation to a selection and limiting it to the top n results gives us the data grid:

datathon-3.jpg

with Michael Phelps on top. That's reassuring.
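The same sanity check is easy to reproduce in a few lines of Python. This is only a sketch on invented rows (the athlete names and the tuple layout are assumptions), not the FastStats aggregation itself:

```python
from collections import Counter

# Invented (athlete, medal) rows; None means the entry did not win a medal.
entries = [
    ("Athlete A", "Gold"),
    ("Athlete A", "Silver"),
    ("Athlete B", "Bronze"),
    ("Athlete A", None),
    ("Athlete C", None),
    ("Athlete B", "Gold"),
    ("Athlete A", "Gold"),
]

MEDALS = {"Gold", "Silver", "Bronze"}


def top_medallists(rows, n):
    """Count medal-winning entries per athlete and return the top n."""
    counts = Counter(name for name, medal in rows if medal in MEDALS)
    return counts.most_common(n)


print(top_medallists(entries, 2))  # [('Athlete A', 3), ('Athlete B', 2)]
```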

How the games have changed

Evidently the games are a totally different beast from 120 years ago, but how have they changed? Let’s start by looking at event entries over time on a chart.

datathon-4.jpg

From this we can see some dips in 1932, 1956, 1976, and 1980. The 1932 games were hosted in Los Angeles during the Great Depression, and as a result many competitors couldn’t afford to pay for the trip there. In 1956, several countries didn’t compete, for the following reasons:

  • Egypt, Iraq, Cambodia, and Lebanon in response to the Suez Crisis.
  • Netherlands, Spain, and Switzerland in response to the USSR crushing the Hungarian Revolution.
  • People's Republic of China as a result of the Republic of China (Taiwan) being allowed to compete.

In 1976, 29 countries boycotted the games because the IOC had not banned New Zealand after the All Blacks toured South Africa.

In 1980, 65 countries boycotted the games in Moscow in protest at the 1979 Soviet invasion of Afghanistan.

Let’s now have a look at the breakdown of events to see how the popularity of sports has changed.

datathon-5.jpg

In the modern Olympics there’s not been much variation, with athletics in purple being the most popular sport in terms of entries, followed by swimming in blue, and rowing in orange.

However, in 1932 we can see one green spike in popularity for an odd sport… Art competitions! This division of the Olympic Games was hosted from 1912 until 1948 but stopped due to “concerns about amateurism and professionalism”. The IOC also doesn’t recognise the medals awarded as official anymore, but this doesn’t make it any less interesting.

Something more personal

As I was looking into this dataset I realised I had a massive desire to win an Olympic medal. Luckily, with all this data in front of me I had an advantage over all those fools who blindly choose a sport to compete in for fun – I could use the data to identify the sport in which achieving a medal is most probable.

I looked at the summer games since 2000, and using BMI and age created a cube with the mode event type as a statistic. Here's what comes out of this for men:

datathon-6.jpg

If you hover the mouse over a cube entry, a popup shows the relevant event. For example, I weigh 65kg and stand 1.8m tall, so my BMI is 65 / 1.8² ≈ 20.1, and my age is 22. This puts me on event ID 71, which is Athletics Men’s 1,500 metres. Maybe it’s time to start doing a few laps of the track.
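To make the idea behind the cube concrete, here's a rough Python sketch: compute each athlete's BMI, drop them into a (BMI band, age band) cell, and take the most common event in each cell. The band widths (2 BMI points, 5 years) and the sample rows are my own assumptions for illustration, not the bandings used in the FastStats cube.

```python
from collections import Counter, defaultdict


def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2


def cell(weight_kg, height_m, age, bmi_step=2, age_step=5):
    """Return the lower bounds of the (BMI band, age band) for one athlete."""
    bmi_band = int(bmi(weight_kg, height_m) // bmi_step) * bmi_step
    age_band = (age // age_step) * age_step
    return bmi_band, age_band


def modal_event_by_cell(rows):
    """Map each (BMI band, age band) cell to its most frequent event."""
    by_cell = defaultdict(Counter)
    for weight_kg, height_m, age, event in rows:
        by_cell[cell(weight_kg, height_m, age)][event] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cell.items()}


# Invented rows: (weight in kg, height in m, age, event)
rows = [
    (65, 1.80, 22, "Athletics Men's 1,500 metres"),
    (66, 1.81, 23, "Athletics Men's 1,500 metres"),
    (67, 1.82, 24, "Athletics Men's 800 metres"),
    (100, 1.85, 27, "Athletics Men's Shot Put"),
]

cube = modal_event_by_cell(rows)
print(round(bmi(65, 1.8), 1))   # 20.1
print(cube[cell(65, 1.8, 22)])  # Athletics Men's 1,500 metres
```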

Below is the same table for women and an index lookup table so you're able to see what would suit you best too.

datathon-7.jpg

Event ID    Title
71          Athletics Men's 1,500 metres
76          Athletics Men's 10,000 metres
92          Athletics Men's 4 x 400 metres Relay
97          Athletics Men's 5,000 metres
99          Athletics Men's 50 kilometres Walk
102         Athletics Men's 800 metres
116         Athletics Men's Marathon
143         Athletics Women's Discus Throw
146         Athletics Women's High Jump
147         Athletics Women's Javelin Throw
149         Athletics Women's Marathon
152         Athletics Women's Shot Put
183         Boxing Men's Featherweight
184         Boxing Men's Flyweight
189         Boxing Men's Light-Welterweight
298         Diving Men's Platform
315         Equestrianism Mixed Dressage, Individual
316         Equestrianism Mixed Dressage, Team
320         Equestrianism Mixed Jumping, Individual
321         Equestrianism Mixed Jumping, Team
324         Equestrianism Mixed Three-Day Event, Team
336         Fencing Men's epee, Team
337         Fencing Women's Foil, Individual
349         Football Men's Football
374         Gymnastics Men's Parallel Bars
380         Gymnastics Men's Team All-Around
384         Gymnastics Women's Balance Beam
388         Gymnastics Women's Team All-Around
390         Gymnastics Women's Uneven Bars
391         Handball Men's Handball
392         Handball Women's Handball
393         Hockey Men's Hockey
394         Hockey Women's Hockey
398         Judo Men's Extra-Lightweight
433         Rhythmic Gymnastics Women's Group
436         Rowing Men's 17-Man Naval Rowing Boats
439         Rowing Men's Coxed Fours
442         Rowing Men's Coxed Pairs
452         Rowing Women's Coxed Eights
453         Rowing Women's Coxed Fours
492         Sailing Mixed Three Person Keelboat
495         Sailing Mixed Two Person Keelboat
577         Shooting Mixed Trap
583         Shooting Women's Sporting Pistol, 25 metres
627         Swimming Men's 1,500 metres Freestyle
636         Swimming Men's 200 metres Backstroke
638         Swimming Men's 200 metres Butterfly
645         Swimming Men's 4 x 100 metres Medley Relay
646         Swimming Men's 4 x 200 metres Freestyle Relay
651         Swimming Men's 400 metres Freestyle
673         Swimming Women's 4 x 100 metres Freestyle Relay
674         Swimming Women's 4 x 100 metres Medley Relay
676         Swimming Women's 400 metres Freestyle
682         Synchronized Swimming Women's Team
712         Volleyball Women's Volleyball
713         Water Polo Men's Water Polo
715         Weightlifting Men's All-Around Dumbbell Contest
735         Weightlifting Women's Super-Heavyweight
737         Wrestling Men's Bantamweight, Freestyle

While this is quite a fun piece of analysis, there are clearly limitations – not least that the mode will be heavily influenced by the sports in which there have been a lot of athletes, so we see a lot of the more popular sports in the list.

What makes an athlete?

This exercise also doesn't really give us any proper insight into how these variables affect the chance of getting a medal. To determine this we can use a modelling environment, or more precisely the Apteco Modelling profiling tool.

datathon-11.jpg

Both the base and analytic selections contain all summer event entries since 2000, and the analytic selection additionally filters out all entries that didn't win a medal. We looked at how weight, height, sex, sport, age and BMI affect the likelihood of obtaining a medal.

Using the profile tool it appears that the particular sport you choose has the greatest impact on your chances of standing on the podium. Team sports are the most likely route to success, with individual sports – such as triathlon and gymnastics – the least likely.

datathon-12.jpg

datathon-13.jpg

datathon-14.jpg
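As a rough illustration of the kind of comparison a profile makes, the Python sketch below works out, for each category of a variable, what share of the base selection also appears in the analytic (medal-winning) selection, and indexes that against the overall share. This is my own simplified take on penetration and index, using invented rows; I'm not claiming it is how the Apteco profiling tool calculates its scores.

```python
from collections import Counter


def profile(base, analytic, key):
    """Per-category penetration (analytic / base) and index vs the overall rate.

    Assumes both selections are non-empty and analytic is a subset of base.
    """
    base_counts = Counter(row[key] for row in base)
    analytic_counts = Counter(row[key] for row in analytic)
    overall = len(analytic) / len(base)
    result = {}
    for category, n_base in base_counts.items():
        penetration = analytic_counts[category] / n_base
        result[category] = {
            "penetration": penetration,
            "index": 100 * penetration / overall,  # 100 means "same as average"
        }
    return result


# Invented entries: 4 hockey (2 medals) and 6 triathlon (1 medal)
base = (
    [{"sport": "Hockey", "medal": True}] * 2
    + [{"sport": "Hockey", "medal": False}] * 2
    + [{"sport": "Triathlon", "medal": True}] * 1
    + [{"sport": "Triathlon", "medal": False}] * 5
)
analytic = [row for row in base if row["medal"]]

by_sport = profile(base, analytic, "sport")
```

In this toy data Hockey gets an index above 100 and Triathlon an index below 100, the same direction as the team versus individual sport pattern described above.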

Weight and height were the next most influential variables, with taller and heavier athletes more likely to be medal winners. It's also nice to see that, out of our variables, sex is the least influential.

Unfortunately, the success of an athlete isn't always dictated by how much work they put in. This is evident when we look at how the success of an athlete is affected by the country they represent.

datathon-15.jpg

datathon-16.jpg

With this profile including GDP (USD per capita) and literacy (% of population), we can see that athletes in more developed countries are heavily favoured. This is likely to be related to access to greater resources than those in less-developed nations. (Note that the GDP banding seems quite erratic because bin size was prioritised over range size.)
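The erratic-looking bands are what you get when each band is made to hold a similar number of records rather than cover a similar range of values. Here's a small Python sketch of that equal-frequency approach on invented GDP figures; the function is my own illustration and not the banding routine FastStats uses.

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins groups of (almost) equal size.

    Assumes len(values) >= n_bins; any remainder goes to the first bins.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Invented GDP-per-capita values, heavily skewed like real income data tends to be
gdp = [500, 700, 900, 1200, 3000, 8000, 15000, 42000, 60000]

bands = equal_frequency_bins(gdp, 3)
widths = [band[-1] - band[0] for band in bands]
print(bands)   # [[500, 700, 900], [1200, 3000, 8000], [15000, 42000, 60000]]
print(widths)  # [400, 6800, 45000]
```

Every band holds three values, but the band widths differ by two orders of magnitude, which is why the band boundaries look uneven on the chart.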

Conclusion

There isn’t enough space in this blog to detail everything I've discovered, and there’s plenty more to learn from this dataset. Considering how relatively small the dataset is with only 271,116 entries, the data is very rich with information. This made it very enjoyable to build the system and work with it. If I were to continue working with the data I would want to look into the outliers that were discovered in the profiling tool and have a look to see if certain countries are doing better than they are expected to. But then there’s always more to learn from data.


Tom Taylor

This post was written by Tom Taylor. Tom joined Apteco for a summer placement before the final year of his degree and spent two months working on various data and programming projects. This post is the result of a project to gather, clean, analyse and present a dataset using Apteco FastStats®.
