Monday, 22 December 2008

Dravid's slump in form



I am on vacation in India. One of my nephews and I are vegging out in front of the TV on a Monday afternoon at my in-laws' place, while the rest of the family naps. We are watching India nurdle along at 2 runs per over on the fourth afternoon of the Mohali test against England.

The commentators don't have a whole lot to talk about. We are watching endless replays of Dravid's stumps being shattered by Stuart Broad. Are horrible pictures like this a sign of Rahul Dravid's decline? Or does this just happen sometimes to any batsman, however great? And what to read into his century in the first innings of the Mohali test? The commentators are blathering on and on...for long enough for my inner analyst to want to get beyond the blather...

The commentariat all agree that Dravid is suffering a slump in form. What, unfortunately, has not been properly examined is whether Dravid has really been scoring fewer runs than before, or whether the perceived slump in form is nothing more than randomness playing out. It is entirely possible that Dravid is batting as well as he ever has, and that the dice just haven't rolled his way. The mind is very good at spotting patterns, especially when there aren't any.

This question is inspired by Moneyball (recommended reading for any cricket fan). Moneyball is about how statistical analysis forms the foundation of a winning baseball team, the Oakland Athletics. It reports on persistent sporting myths that statistics have busted. For instance, there is no such thing as a clutch hitter, a batter who does especially well in vital situations. Nor is there any such thing as a hot hand, a streak in basketball when an NBA player is "in the groove" and landing every shot in the basket.

Baseball is now enriched by a Society for American Baseball Research. The American Statistical Association now has a section dedicated to sports statistics. It is a pity that this quality of statistical analysis has not been applied to cricket, despite the richness of the data available. It is also an opportunity for smart young cricket-loving statisticians. Calling S. Rajesh of Cricinfo?

To give the interested (geeky) reader a flavour of what is possible, here is the outline of a statistical analysis that would shed more light on Dravid's form than anything that has appeared in the media so far. None of the techniques described below is very complicated or goes beyond material taught routinely at the undergraduate level. I would be delighted to see this analysis available in the public domain along with a well-documented methodology and explanations, and expect no credit or authorship rights. Also, a disclaimer. I am not a professional statistician; my knowledge of statistics is mainly as a customer to statisticians. Any feedback from readers with more statistical knowledge, especially around time-series techniques that could improve this analysis, is appreciated.

Outline of desired analysis
Step 1: compile the dataset
Each record in the dataset is one of the ~25000 balls Rahul Dravid has faced in test cricket. Each record in the dataset has the following fields: outcome (which takes the values 0-6 and W, all represented as class variables), opponent (Australia, England etc.), bowler, bowler type (pace, military medium, leg spin etc.), location, home away flag (derived from location), innings (which-ith innings of a test match), position played in the batting order (mostly #3), number of balls already faced in the innings, date innings started, a random number (for validation in step 5).
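One way to picture a single record is sketched below. This is only an illustration: the class name, field names and the example values are hypothetical, and the home away flag simply assumes that "home" means a match played in India.

```python
from dataclasses import dataclass
from datetime import date

# Outcomes are class labels, not numbers: runs 0-6 plus "W" for a wicket.
OUTCOMES = ("0", "1", "2", "3", "4", "5", "6", "W")


@dataclass(frozen=True)
class Ball:
    """One delivery faced by the batsman (all field names are hypothetical)."""
    outcome: str            # one of OUTCOMES
    opponent: str           # e.g. "England"
    bowler: str
    bowler_type: str        # e.g. "pace", "leg spin"
    location_country: str   # country where the match was played
    innings: int            # 1 to 4 within the test match
    batting_position: int   # mostly 3
    balls_faced: int        # balls already faced in this innings
    innings_start: date
    rand: float             # uniform draw in [0, 1), kept for validation in step 5

    def __post_init__(self):
        # Reject anything that is not a known outcome label.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    @property
    def is_home(self) -> bool:
        # Home away flag derived from location (assumes home means India).
        return self.location_country == "India"


# Invented example values, purely to show the shape of a record.
example = Ball("W", "England", "Some Bowler", "pace", "India",
               3, 3, 17, date(2008, 12, 19), 0.42)
print(example.is_home)
```

Storing the outcome as a string label rather than an integer keeps "W" in the same field as the run values, which matches the idea of treating every outcome as a class.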

I don't think any of this data is hard to obtain. It is reported in the ball by ball commentary on Cricinfo, which I'm assuming is professionally archived. This list is not meant to be exhaustive. Most datasets come with a few plausible covariates that can be thrown in and played with.

A couple of fields I would love to add, which may be harder to obtain, are length (full, good length, back of a length, short) and line (outside off, off, middle and leg, outside leg). I believe this is the data the team statisticians sitting in the dressing room code into their laptops.

Examine the dataset to get familiar with patterns, especially with potentially tricky variables like bowler or location. For instance, a bowler Dravid has faced for 12 balls may have dismissed Dravid twice. Worth being aware of weird things in the data before running any regressions.
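A quick tabulation of balls and dismissals per bowler is one way to spot this kind of oddity before modelling. The sketch below is illustrative only: the function names, the thresholds and the sample data are all made up.

```python
from collections import defaultdict


def bowler_summary(balls):
    """Return {bowler: (balls_bowled, dismissals)} from (bowler, outcome) pairs."""
    counts = defaultdict(lambda: [0, 0])
    for bowler, outcome in balls:
        counts[bowler][0] += 1
        if outcome == "W":
            counts[bowler][1] += 1
    return {name: (n, w) for name, (n, w) in counts.items()}


def thin_and_extreme(summary, max_balls=50, min_dismissals=2):
    """Bowlers with little data but several dismissals: worth eyeballing."""
    return sorted(
        name for name, (n, w) in summary.items()
        if n <= max_balls and w >= min_dismissals
    )


# Made-up sample: bowler "A" takes 2 wickets in 12 balls, "B" takes 1 in 400.
sample = [("A", "W")] * 2 + [("A", "0")] * 10 + [("B", "W")] + [("B", "1")] * 399
summary = bowler_summary(sample)
print(summary["A"])               # (12, 2)
print(thin_and_extreme(summary))  # ['A']
```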

Step 2: run the regression model
Model the outcome of each ball (number of runs scored, wicket or dot-ball) as a multinomial logistic outcome. This class of models is used in transportation analysis - every commuter has the choice of multiple modes of transportation - or in brand analysis - every consumer has the choice of multiple brands of breakfast cereal. Similarly, every ball has the choice of different outcomes - from sixer through dot-ball to wicket.

Allow the model to see all the fields listed above. Do not constrain the model. Include all two-term interactions. Just maximize the fit. Essentially, the computer is finding the configuration of explanatory variables which maximizes the likelihood of observing the outcomes in the dataset.

Most modern statistical packages can apply simple transforms to covariates to improve fit, for instance taking log(number of balls already faced), a transform which makes intuitive sense anyway.
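To make the idea of "maximizing the likelihood" concrete, here is a minimal sketch of a multinomial logistic fit in plain Python. It is a toy under loud assumptions: the outcomes are collapsed to three classes (dot, runs, wicket), the only covariate is log(1 + balls already faced), the data is synthetic and generated from an invented process, and the function names are hypothetical. A real analysis would use a statistical package, all eight outcomes, and the full set of fields and interactions.

```python
import math
import random


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Class probabilities for feature vector x; weights[k] is class k's row."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def log_likelihood(weights, X, y):
    return sum(math.log(predict(weights, x)[k]) for x, k in zip(X, y))


def fit(X, y, n_classes, lr=0.05, epochs=200):
    """Maximize the multinomial log-likelihood by batch gradient ascent."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, k in zip(X, y):
            p = predict(weights, x)
            for c in range(n_classes):
                err = (1.0 if c == k else 0.0) - p[c]
                for j, xj in enumerate(x):
                    grad[c][j] += err * xj
        for c in range(n_classes):
            for j in range(n_features):
                weights[c][j] += lr * grad[c][j] / len(X)
    return weights


# Synthetic data. Classes: 0 = dot ball, 1 = runs, 2 = wicket.
rng = random.Random(0)
X, y = [], []
for _ in range(600):
    faced = rng.randrange(0, 200)
    x = [1.0, math.log1p(faced)]  # intercept, log(1 + balls already faced)
    # Invented "true" process: wickets get rarer as the batsman settles in.
    p = softmax([1.0, 0.2 * x[1], -0.5 * x[1]])
    y.append(rng.choices(range(3), weights=p)[0])
    X.append(x)

fitted = fit(X, y, n_classes=3)
baseline = [[0.0, 0.0] for _ in range(3)]  # all outcomes equally likely
print(log_likelihood(fitted, X, y) > log_likelihood(baseline, X, y))
```

Categorical fields such as opponent or bowler type would enter a model like this as one indicator column per category, and a two-term interaction as the product of two such columns.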

Step 3: read the results
First pass, one is expecting to see date of innings being statistically significant. If it is clearly significant, and the coefficient has the right sign (a decline in form), that probably means the effect is real. A completely unconstrained model might spit out some funky functional forms, with performance being a parabolic function of time...improving initially and then declining.

A bunch of other interesting effects will be visible at this stage, and are fun to look for. For instance, does Dravid have a nemesis bowler? Is Dravid genuinely as good abroad as he is at home? Has he done any worse as an opener than at #3? Is Dravid more vulnerable to full length deliveries on the slow pitches at home than abroad (does the interaction term between home away flag and length have a non-zero coefficient)?

Step 4: tweak the model
Refinements to the model are usually needed at this stage.

For instance, if no effect is observed overall, it might be because a real effect over the last six months is hidden by the length of the continuous dataset in use. Converting time into six-monthly blocks may be useful.
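Converting a date into a six-monthly block can be sketched as follows. The function name is hypothetical, the blocks are assumed to be calendar halves (January to June, July to December), and the origin date is an arbitrary choice for illustration.

```python
from datetime import date


def half_year_block(d: date, origin: date) -> int:
    """Index of the calendar half-year containing d, counted from origin's half-year."""
    def half_index(x: date) -> int:
        return x.year * 2 + (0 if x.month <= 6 else 1)
    return half_index(d) - half_index(origin)


origin = date(1996, 1, 1)  # arbitrary origin chosen for illustration
print(half_year_block(date(1996, 1, 1), origin))    # 0
print(half_year_block(date(1996, 7, 1), origin))    # 1
print(half_year_block(date(2008, 12, 19), origin))  # 25
```

The block index would then be treated as a class variable, so that the model can pick out one unusual block without assuming a smooth trend.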

Also, a time effect might be masked because it is correlated with the opposition. It might look like Dravid just happens to be weaker against Sri Lanka and Australia, India's most recent opponents. In this case, one might want to force the model to accept time blocks before it admits opposition.

Bowlers with thin data might show up as having implausibly strong effects. One might want to modify the data to slot all bowlers who have bowled fewer than 250 balls at Dravid into a pie-chuckers categorical variable.
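The pooling of thin-data bowlers might look something like this minimal sketch. The function name and the sample data are made up; the 250-ball cutoff is the one suggested above.

```python
from collections import Counter


def pool_rare_bowlers(bowlers, min_balls=250, pooled_label="pie-chuckers"):
    """Replace any bowler with fewer than min_balls deliveries by a pooled label."""
    counts = Counter(bowlers)
    return [b if counts[b] >= min_balls else pooled_label for b in bowlers]


# Made-up sample: one bowler name per ball faced.
sample = ["A"] * 300 + ["B"] * 12 + ["C"] * 249
pooled = pool_rare_bowlers(sample)
print(Counter(pooled))  # Counter({'A': 300, 'pie-chuckers': 261})
```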

Step 5: validate the model
Keep a random subset of ~5000 balls outside the analysis described so far. Repeat the analysis on this holdout to make sure the results observed are similar. Validating on an additional time period is probably nonsense in this context, since time is a variable of interest.
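Using the random number stored on each record in step 1, the holdout can be carved out reproducibly. The sketch below assumes records are dictionaries with a hypothetical "rand" key holding a uniform draw in [0, 1); a fraction of 0.2 corresponds to roughly 5000 of 25000 balls.

```python
import random


def split_holdout(records, holdout_fraction=0.2):
    """Split on each record's stored random number so the split never changes."""
    train, holdout = [], []
    for rec in records:
        (holdout if rec["rand"] < holdout_fraction else train).append(rec)
    return train, holdout


# Synthetic stand-in for the real dataset: only an id and the random number.
rng = random.Random(1)
records = [{"ball_id": i, "rand": rng.random()} for i in range(25000)]
train, holdout = split_holdout(records)
print(len(train), len(holdout))
```

Because the random number is part of the data rather than drawn at analysis time, every rerun of steps 2 to 4 sees exactly the same training balls.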

A more interesting approach to validation is to validate on non-test match data. If Dravid is in decline, we would expect to see that in all forms of cricket.

Step 6: Document the results and limitations
Gaps in the data and any subjective interpretations or analytic choices (treatment of missing values, definition of class variables etc.) would be logged here.

Some limitations are systematic. This dataset is limited to Dravid's performance only. So a generalized improvement in the performance of all test batsmen of the same time period would not be picked up by the model. It is possible that Dravid is playing as well as ever, and that the world has moved forward faster than Dravid. A more ambitious analysis spanning a broader base of test batsmen is needed to shed more light on this.

Also highlight opportunities to improve on the analysis. For instance, it would be interesting to compare Dravid's decline with that of other top players. Assuming there is a decline, is it worse than what Gavaskar or Border suffered? Data may be thinner in the pre-internet era...but maybe it is out there in official score sheets.

Most critically, this analysis does not tell the captain whether or not Dravid should be replaced with a younger batsman. That remains a judgment call, based largely on how he wants to build his team. What it may tell the captain is that Dravid's run of poor scores is explained by randomness and is likely to end soon. So we avoid the injustice of a great player being judged on poorly constructed evidence.

4 comments:

Subhrendu K. Pattanayak said...

As they say in this part of the world ... palleeez! I know you are raring to show off your statistics "consumerism" (as you call it), but you could have said it in half the space. Also, I suspect the 24/7 media cycle has no patience for careful statistical analysis. How many things that are in the news are based on a careful poring over the data (including politics and/or speculations about the economy or pathways to recovery)? Cricket media has tended to be particularly fickle; it is a small wonder they haven't gone bananas about Dravid being bowled for a duck, and started to record a form slump which begins with the first innings after his last century. Actually, right now they are all busy with Hayden (now that Ponting, having scored a ton each in Windies, India, and Australia, is again out of the retirement talk). How easily we all forget that the highest run getter in the last great Indian victory abroad, the famous WACA Perth test (Australian fortress and all), was Dravid at 93 (7 short of that magical number that somehow makes 93 seem like 50 less than a 101) less than a year ago. So basically it is a problem of too much cricket in a 24/7 world.

Prithvi Chandrasekhar said...

To Subhrendu's comments:

1. We agree. Dravid is probably suffering a run of bad luck more than an irreversible decline. Someone just needs to demonstrate this, though even that may not shut the commentators up

2. Why doesn't 24/7 media coverage crowd out intelligent analysis in baseball? It did for a long time. But Bill James et al emerged from the caliginosity

3. Yes, the post is too long. The step 1 to 6 stuff is very pedantic. I'm sure most readers skipped it

4. The point of being pedantic is to go beyond just moaning, and hopefully prompt someone (pro-stato or amateur) to do this analysis. Cricket needs enlightenment

Anonymous said...

It is futile to expect the mainstream Indian media to get more sophisticated in their understanding and analysis of stats. You have to consider the nature of the audience they primarily address.

Having said that, there is a need for select groups of people to understand the numbers better (eg selectors, quant-strategists working with the team, aficionados like you and me). For example, the batting average as an indicator of likely score in a given innings is worse than useless. In a close-to-normal distribution, all the measures of central tendency converge, whereas a batsman's scores are almost the inverse of a normal (more like a U-shape). Regardless of the perceived 'consistency' of a batsman, the most likely dismissal is < 10 runs...As a proportion, any batsman tends to get out least in a range around his batting average - he either gets out very early or well past the average.

Finally, the most sophisticated stats is in the end an analysis of past data. However, the past is no indicator of the future. All the analysis in the world will not successfully predict what is going to happen in the next innings, let alone the next ball.

Anonymous said...

Two more thoughts -

One, an elaboration of my last point, lest it get generalised to a position that all historical analysis is of no use for predicting the future. (Though in a sense if you go down the line of David Hume and Sir Karl Popper you could end in absolute skepticism). I think in professional sports played at the highest level, the margins are so fine that predictive accuracy is bound to be seriously limited. Eg. The difference between a 'well left' and a 'faint nick through to the keeper' is a few microns. No amount of data churn on variables such as condition of pitch, bowler, spot where the ball landed etc. is going to suffice to predict the outcome of a shot. Even over a larger interval - say probability that there will be a nick over say 100 balls, you may not end up with very meaningful findings - in the end the variables are too many...

The second point is with respect to Dravid as a batsman. For a long time, I wondered which of the Big 4 would lose his touch first. I suspected it would be Ganguly / Laxman, because of their reliance on hand-eye co-ordination as opposed to rock solid technique, which typically tends to drop off sharply with age (Srikkanth for eg). I thought Dravid would probably retain his form the longest. I now have a different hypothesis - Dravid's game is essentially based around error avoidance, and selection of the loose ball to play a productive stroke with min risk. His strike rate is naturally much lower. I think Dravid's judgement of risk has dropped with age...result being, once he's set he makes an error every 100 balls as opposed to 300 balls earlier. The early dismissals are par for the course for every batsman (as pointed out earlier) and can be attributed to the law of large nos catching up...it is the dismissals in the 20s / 30s when set (as opposed to going on to make a really big score) that is more worrying.