Is my car better than average? Data mining using the FastStats profiling tool

09 Mar 2021  |  by Chris Roe

Introduction

I’ve never been one to buy new cars. I tend to buy my vehicles when they are around 7 years old with around 60,000 miles on the clock. I don’t have a preference for any particular make or model; I purchase whatever is available when I need to buy and meets my requirements at the time. I then keep each car until it costs more to repair it and get it through its annual MOT test of roadworthiness than it would to replace it with a newer car. Typically this has been another 5-7 years and about 50,000 to 60,000 miles.

My current vehicle has just passed the 110,000 mile mark and will be 13 years old in January 2021, the date of its next MOT. Is this a high mileage for this type of vehicle at this age? When should I expect it to come to the end of its natural life?

The UK government data website (1) publishes all of the UK MOT test data since the computerisation of those results in 2005. In this blog post we will use FastStats to look into this data and see if it can tell us whether my car is living on borrowed time!

Data Collection 

Data has been released from 2005 to 2019 for all of the MOT tests that have taken place in the UK. These have been made available as a series of yearly files with information on the vehicles, tests and details on individual test results.

There are 61 million vehicles in the data, and some limited information is stored on the vehicle make and model, first use date and its colour, fuel type and cylinder capacity. The vehicles are not identifiable by registration plate (which is not included in the data) but the same identifier is used for each vehicle each year so that it is possible to track vehicles over time.

For those 61 million vehicles there are an associated 512 million MOT tests. For each test we know the result, the date of the test and the recorded mileage at the time of the test. We also know the Postal Area in which the test took place.

The final data table gives further detailed information on each of the MOT tests. All reasons for failure, and any advisory notices, are recorded against each test. Details for the type, reason, vehicle location and text of the failure are stored. There are 955 million of these MOT test item details in the data.

The website contains an instruction manual (2) with a detailed description of the data, the available attributes and explanatory notes about the data collection and quality.

So, what about my car?

My car is a Mazda 5 with a 57 year plate and a first use date in January 2008. I bought it when it was 6 years old. It has had some issues needing rectification at its last 3 MOT tests. Am I an average driver of this type of car? Is it doing better than expected? How long should I expect to keep it before needing to replace it? Let’s take a closer look…

Given some information from the website (3) it is not difficult to use the test information (test dates and mileage information) to identify your own vehicle in a selection. This will give me the Vehicle Id for my car and I can then save this as a FastStats Selection. I can compare the information from my specific car against the general population of cars of the same type.

Our first job is to find all of the relevant vehicles in the database that are of the same make and model. The Make is a Selector so we can easily find Mazda within the data. The Mazda 5 has been known under the ‘5’ and ‘Premacy’ model names. The Model is a Text field, and it is quite likely that it has been entered at the MOT station, as there are many versions of the model in the text. There are also a number of typographical mistakes within the text. I found a total of 35,895 vehicles that match common variations of the Mazda 5 model. Some of these have a first use date before 1999, the year the model was first released, so I have removed them from the analysis. This takes the number down to a final usable set of 35,855 vehicles, which again is specified as a FastStats Selection.
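As a rough illustration of the selection logic outside FastStats, here is a minimal Python sketch. The list of spellings in `MODEL_PATTERN` and the function name `is_mazda5` are my own assumptions for the example; the actual variants found in the MOT data are not listed in this post.

```python
import re
from datetime import date

# Hypothetical spellings of the model name, including a couple of invented
# typos. The real list of variants in the MOT data is not given in the post.
MODEL_PATTERN = re.compile(r"^(5|premacy|premacey|premcy)\b", re.IGNORECASE)


def is_mazda5(make: str, model: str, first_use: date) -> bool:
    """True for a Mazda whose model text looks like a Mazda 5 / Premacy
    and whose first use date is not before 1999."""
    if make.strip().upper() != "MAZDA":
        return False
    if not MODEL_PATTERN.match(model.strip()):
        return False
    return first_use.year >= 1999
```

Anchoring the pattern at the start of the text and ending it on a word boundary keeps models such as "MX-5" or "555" out of the selection, at the cost of missing any spelling not listed.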

We know that the MOT test is performed annually so we can work out how old in years a car was at the date of the test, by rounding it to the nearest year using an expression as below. This rounding helps to ensure that we mark the age correctly as the MOT can be performed up to a month before it is due.

Age in years at date of test (FastStats expression image)
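The FastStats expression itself is only shown as a screenshot, so the following is a Python sketch of the same idea rather than a copy of it. The function name and the use of 365.25 days per year are assumptions of mine.

```python
from datetime import date


def age_in_years_at_test(first_use: date, test_date: date) -> int:
    """Age of the vehicle at the test, rounded to the nearest whole year."""
    days = (test_date - first_use).days
    return round(days / 365.25)


# Made-up dates: a test taken about a month before the 11th anniversary
# of first use is still counted as the age-11 test.
print(age_in_years_at_test(date(2008, 1, 15), date(2018, 12, 20)))  # 11
```

Rounding to the nearest year, rather than truncating, is what stops an early test from being counted against the previous year.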

Using the two FastStats selections defined above (My car, All Mazda 5s), I can add these to one dimension of the cube as an overlapping QuerySet. I can add the expression above, banded into integer values, as the other dimension. The two interesting statistics initially are the number of vehicles in each category (i.e. the number of vehicles of age X having an MOT test) and the mean mileage of those vehicles at the date of the MOT test. This gives us a cube as shown below.

Cube of Mazda mean mileage
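For readers without FastStats, the two statistics in the cube can be approximated with a few lines of Python. This is only a sketch: the input layout (one tuple per test) and the name `mileage_cube` are assumptions of mine, not the structure of the real data.

```python
from collections import defaultdict
from statistics import mean


def mileage_cube(tests):
    """tests: iterable of (vehicle_id, age_in_years, mileage) tuples.

    Returns {age: (number of distinct vehicles, mean mileage over tests)}.
    """
    vehicles = defaultdict(set)
    mileages = defaultdict(list)
    for vehicle_id, age, mileage in tests:
        vehicles[age].add(vehicle_id)
        mileages[age].append(mileage)
    return {
        age: (len(vehicles[age]), mean(mileages[age]))
        for age in sorted(mileages)
    }
```

Note that the mean is taken over tests, not vehicles, so a car with two tests at the same age contributes two mileages to that age band.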

Some interesting insights that we can gain from the analysis above:

  1. There is only 1 vehicle in the first column for each of the years (3-11) which is expected since my 1 car necessarily has an MOT test annually.
  2. The second column gives us the number of vehicles of each age that have had a test whilst that old. This peaks at 6 years old. Cars first used since 2014 can’t contribute to tests at ages older than that, and some older cars will have been scrapped, so they are not contributing to tests at older ages.
  3. My car’s mileages at the test date are shown in Column 3, and show that my car has had reasonably average mileage throughout most of its lifetime. When it was 9 years old the average mileage ends in “.5” because the car had 2 tests that year.
  4. Column 4 shows the average mileage of all Mazda 5s that had a test whilst that age. It does show that we get to just over 100k average mileage by 13 years old but then it flattens out, which suggests that cars that are driven more extensively are then dropping off (i.e. being scrapped) and those which are being driven less are still working.
  5. My most recent MOT test in January 2020 (not in this data set) had a mileage of 106,200 on its test certificate. The gap between my vehicle and the average Mazda 5 is now growing wider. My car is now above what we might consider to be the terminal mileage of the average Mazda 5! Maybe it is indeed living on borrowed time.

What factors are relevant in distinguishing whether a Mazda 5 is still running?

There are 35,855 Mazda 5s that have had an MOT test at some point. Some of those will no longer be on the road. We can identify those by looking for vehicles that did not have an MOT test in 2019. There were 15,200 of these. We can then use the Profile tool to look at the factors which distinguish discontinued Mazda 5s from those that are still on the road.
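The "no test in 2019" rule above is simple enough to sketch in Python. The input layout and the function name are illustrative assumptions, not the FastStats selection itself.

```python
def split_by_final_year(test_years_by_vehicle, final_year=2019):
    """test_years_by_vehicle: {vehicle_id: iterable of years with a test}.

    Returns (in_use, discontinued) as two sets of vehicle ids, where a
    vehicle counts as discontinued if it has no test in final_year.
    """
    in_use, discontinued = set(), set()
    for vehicle_id, years in test_years_by_vehicle.items():
        if final_year in set(years):
            in_use.add(vehicle_id)
        else:
            discontinued.add(vehicle_id)
    return in_use, discontinued
```

This is only a proxy: a car that was off the road for the year, or whose test fell outside the data, would be labelled as discontinued too.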

We can add some of the interesting fields about the vehicle itself and the test particulars and use these as dimensions on the Profile to see what we can find.

One of the possible fields is the Postcode Area in which the test took place. There are 126 such codes, but we have also grouped them up into 15 Regions. Here is the variable level information returned for the Profile.

Profile of Mazda discontinued vs in use image

Before we delve more deeply, there are clearly some correlations between some of the variables that I have initially added to the Profile.

  1. Postcode Area and Region are entirely correlated due to the manner of their creation. The Postcode Area is slightly more predictive – this is not too surprising as there are fewer records in each of the categories so any random variability will lead to higher percentage differences from expected values. Alternatively it may be due to finer geographic differences that can be picked up more accurately by the variable with more values.
  2. Banded First Use Date and Banded Test Mileage are also somewhat correlated. This is a natural relationship – you add mileage to the car over time.
  3. The age of the car is the best predictor of whether it has been discontinued. There is a natural trend here we would expect – we shall see shortly if it is indeed the case!
  4. Colour is the next best predictor, which initially seems a bit odd. We’ll delve into this in a later section.
  5. The postcode area/region also seems to have a small predictive effect; we’ll look at this more closely in a minute.

Now let us look at each of the individual variables in turn.

The screenshot below shows the code level information for the First Use Date (Years) banded variable. The length of the red bar and the direction in the ‘Penetration’ column gives an indication of whether it is over-represented (to the right) or under-represented (to the left). Unsurprisingly, the older the car the more likely it is to no longer be in use. The ‘par’ year is 2006 where there are nearly the same number of vehicles running/not running – vehicles newer than this are more likely on average to still be in use.

Larger profile of discontinued vs in use Mazda
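To make the idea of over- and under-representation concrete, here is one common way to score each category, sketched in Python. I am not claiming this is the exact formula behind the FastStats Profile columns; the function name and input layout are assumptions for the example.

```python
def profile_index(counts):
    """counts: {category: (discontinued, in_use)}.

    Returns {category: index}, where 100 means the category's share of
    discontinued vehicles equals the overall share, above 100 means
    over-represented and below 100 means under-represented.
    """
    total_discontinued = sum(d for d, _ in counts.values())
    total_vehicles = sum(d + u for d, u in counts.values())
    overall_rate = total_discontinued / total_vehicles
    return {
        category: 100 * (d / (d + u)) / overall_rate
        for category, (d, u) in counts.items()
        if d + u > 0
    }
```

A category with very few vehicles can produce an extreme index by chance, which is why a profile also needs some measure of how much evidence sits behind each value.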

There is a strong (but definitely not absolute!) correlation between mileage and year of first use. In general, the older cars have higher mileage. However, we can see in the banded mileage the ‘par’ mileage is around 60,000 miles (more than that and we’d expect the car to no longer be in use). There is also a clear data error in the top mileage of over 9 million miles!

Profile banded test mileage
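A small Python sketch of mileage banding that also sets aside obviously wrong readings such as the 9 million mile record. The band width and the `max_plausible` cut-off are hypothetical values chosen for illustration; they are not taken from the FastStats system.

```python
def mileage_band(mileage, width=10_000, max_plausible=500_000):
    """Return a band label such as '60,000-69,999'.

    Values below 0 or above max_plausible (a hypothetical cut-off) are
    labelled 'Implausible' instead of being banded.
    """
    if mileage < 0 or mileage > max_plausible:
        return "Implausible"
    lower = (mileage // width) * width
    return f"{lower:,}-{lower + width - 1:,}"


print(mileage_band(106_200))    # 100,000-109,999
print(mileage_band(9_000_000))  # Implausible
```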

Now, let us turn our attention to the less obvious variables.

Here is the category level view of the Colour variable. It appears at first glance that some colours can be useful predictors of whether a Mazda is still in use. It would appear that Yellow and Green Mazda 5s are likely to be not in use, and White, Maroon and Grey are most likely to still be in use. This seems counter-intuitive as colour shouldn’t really have an effect on whether a car works or not! There are also some colours (Bronze or Turquoise for example) where so few were produced that there is not enough evidence to know whether they are significant or not, and this is indicated with the yellow colour.

Colour profile

My first hypothesis at this point is that the vehicle colours produced over the lifetime of the Mazda 5 have changed and that there are fewer Yellow and Green ones being produced now, with White/Maroon colours being a more recent trend. This is indeed the case! The following cube shows the relationship between First Use Year and Colour for our Mazda 5s. The Index statistic gives us a numeric measure (centred around 100) of whether the value in a cell is more or less than we would expect. For the colours mentioned above, the expected trend can be seen.

For example:

  • For White we can see that they have become more common than expected from 2011 onwards.
  • For Green we can see the opposite trend, and that they are less common than expected from 2004 onwards.

Cube Mazda5 colour vs first use date  
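An index centred on 100 for a two-way table is commonly computed as observed count divided by the count expected if the two dimensions were independent. The Python sketch below uses that definition; I am assuming it is close to what the cube's Index statistic does, and the function name and input layout are my own.

```python
def cell_index(table):
    """table: {(row, column): count}.

    Returns {(row, column): index} with index = 100 * observed / expected,
    where expected = row total * column total / grand total.
    """
    row_totals, col_totals = {}, {}
    for (row, col), count in table.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(table.values())
    return {
        (row, col): 100 * count * grand_total / (row_totals[row] * col_totals[col])
        for (row, col), count in table.items()
    }
```

With colour as rows and first use year as columns, a Green cell well above 100 in the early years and well below 100 later would match the trend described above.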

Finally, let us look at the geographic indicators. This shows the area in which the MOT test took place. We can make a reasonable assumption that most people have their MOT test close to their primary residence, so it is a reasonable proxy for where the car is based. I have chosen here to show the Region variable (as it is easier to see all of the categories), and there does appear to be an interesting insight here. Mazdas based in the North are more likely to be out of use, whereas (generally) the further South we travel the more likely it is that they are still in use.

Profile by Region

The differences from the expected values are nowhere near as large as for the other variables, as shown by the Index scores being closer to 100.

Further investigation doesn’t show any particular correlations between mileages/first use dates and test region so this does appear to be a result that doesn’t have any immediately obvious explanation.

If we compare all of the discontinued vehicles against all of the other vehicles produced since 1999, then we see that the regions are in a different order, as shown in the screenshot below, so there is possibly something interesting here with Mazda 5 regions.

Profile by region

Given the factors above we can then see that for my Mazda 5:

  1. The first use date of 2008 is indicative of it still being roadworthy.
  2. The mileage is higher than expected and has a negative contribution to whether it is still running.
  3. The Region and Fuel Type are pretty neutral factors.
  4. Overall we would score my car slightly positively in 2019 – with the expectation that within 2 years it would likely be scoring negatively. Writing this in September 2020, it is now 18 months since that last MOT test in the data, and it will be 2 years at its next test in January 2021 so maybe I do need to start looking now…! (4)

Conclusion

This is a rich, large dataset despite the relatively small number of variables that are provided. In this post we have focussed on a very small number of the records to look at a specific question on a particular make and model. The profile analysis has shown that my car has done at least as well as an average Mazda 5. We have also shown the importance of understanding correlations between variables to ensure that we understand the important predictive factors. Given that its mileage is now higher than the average, and given its age, I should expect that it will become uneconomical to repair it to the MOT standard within the next year or so.

How can I go about choosing my next car – if only I had some relevant data to do this analysis…!

Notes

(1) The dataset described in this blog post has been collected from the UK government website. The URL is: https://data.gov.uk/dataset/e3939ef8-30c7-4ca8-9c7c-ad9475cc9b2f/anonymised-mot-tests-and-results .

(2) A full description of the data, methodology etc can be found at (http://data.dft.gov.uk/anonymised-mot-test/MOT_user_guide_v4.docx).

(3) You can check the MOT history of a vehicle from its licence plate at the following website: https://www.gov.uk/check-mot-history

(4) This blog was written in September 2020, but only prepared for release in January 2021, shortly after my car’s next MOT test, which it sailed through with only minor faults. One more year at least!!


Chris Roe

Developer

Chris spends his time developing new analytics features for Apteco FastStats®. You may also meet him during Apteco training sessions. In addition to this, Chris spends his time building FastStats systems from publicly available data, searching for insights and writing for the Apteco Blog series.
