I am on vacation in India. One of my nephews and I are vegging out in front of the TV on a Monday afternoon at my in-laws' place, while the rest of the family naps. We are watching India nurdle along at 2 runs per over on the fourth afternoon of the Mohali test against England.
The commentators don't have a whole lot to talk about. We are watching endless replays of Dravid's stumps being shattered by Stuart Broad. Are horrible pictures like this a sign of Rahul Dravid's decline? Or does this just happen sometimes to any batsman, however great? And what to read into his century in the first innings of the Mohali test? The commentators are blathering on and on... for long enough for my inner analyst to want to get beyond the blather...
The commentariat all agree that Dravid is suffering a slump in form. What, unfortunately, has not been properly examined is whether Dravid has really been scoring fewer runs than before, or whether the perceived slump in form is nothing more than randomness playing out. It is entirely possible that Dravid is batting as well as he ever has, and that the dice just haven't rolled his way. The mind is very good at spotting patterns, especially when there aren't any.
This question is inspired by Moneyball (recommended reading for any cricket fan).
Moneyball is about how statistical analysis forms the foundation of a winning baseball team, the Oakland Athletics. It reports on persistent sporting myths that statistics bust. For instance, there is no such thing as a clutch hitter, a batter who does especially well in vital situations. Nor is there any such thing as a hot hand, a streak in basketball when an NBA player is "in the groove" and landing every shot in the basket.
Baseball is now enriched by a Society for American Baseball Research. The American Statistical Association now has a section dedicated to sports statistics. It is a pity that this quality of statistical analysis has not been applied to cricket, despite the richness of the data available. It is also an opportunity for smart young cricket-loving statisticians. Calling S. Rajesh of Cricinfo?
To give the interested (geeky) reader a flavour of what is possible, here is the outline of a statistical analysis that would shed more light on Dravid's form than anything that has appeared in the media so far. None of the techniques described below is very complicated, or goes beyond material taught routinely at the undergraduate level. I would be delighted to see this analysis available in the public domain along with a well-documented methodology and explanations, and expect no credit or authorship rights. Also, a disclaimer. I am not a professional statistician; my knowledge of statistics is mainly as a customer to statisticians. Any feedback from readers with more statistical knowledge, especially around time-series analytic techniques that could improve this analysis, is appreciated.
Outline of desired analysis
Step 1: compile the dataset
Each record in the dataset is one of the ~25000 balls Rahul Dravid has faced in test cricket. Each record in the dataset has the following fields: outcome (which takes the values 0-6 and W, all represented as class variables), opponent (Australia, England etc.), bowler, bowler type (pace, military medium, leg spin etc.), location, home away flag (derived from location), innings (first to fourth innings of the test match), position played in the batting order (mostly #3), number of balls already faced in the innings, date innings started, a random number (for validation in step 5).
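For the geeky reader, here is a minimal sketch of what one record might look like in Python. The field names, the `HOME_VENUES` set and the example values are all my own placeholders rather than anything taken from a real archive; the point is only to show the outcome as a class variable, the home away flag derived from location, and the random number stored alongside each ball.

```python
import random
from dataclasses import dataclass
from datetime import date

# Illustrative subset of Indian venues (an assumption; a real list would be longer).
HOME_VENUES = {"Mohali", "Chennai"}
VALID_OUTCOMES = {"0", "1", "2", "3", "4", "5", "6", "W"}

@dataclass
class Ball:
    outcome: str           # class variable: "0" to "6", or "W" for a wicket
    opponent: str
    bowler: str
    bowler_type: str
    location: str
    innings: int           # 1 to 4
    batting_position: int  # mostly 3
    balls_faced: int       # balls already faced in this innings
    innings_date: date     # date the innings started
    holdout_draw: float    # random number, used for validation in step 5

    def __post_init__(self):
        # Reject anything that is not one of the eight outcome classes.
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    @property
    def is_home(self) -> bool:
        # The home away flag is derived from location, not stored separately.
        return self.location in HOME_VENUES

# Made-up example values, not a real ball from the archive.
rng = random.Random(1)
ball = Ball("W", "England", "Bowler A", "pace", "Mohali", 3, 3, 12,
            date(2008, 12, 19), rng.random())
print(ball.is_home)
```

Keeping the outcome as a string rather than a number is deliberate: a wicket is not "more" or "less" than a boundary, it is a different class.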
I don't think any of this data is hard to obtain. It is reported in the ball-by-ball commentary on Cricinfo, which I'm assuming is professionally archived. This list is not meant to be exhaustive; most datasets come with a few plausible covariates that can be thrown in and played with.
A couple of fields I would love to add, which may be harder to obtain, are length (full, good length, back of a length, short) and line (outside off, off, middle and leg, outside leg). I believe this is the data the team statisticians sitting in the dressing room code into their laptops.
Examine the dataset to get familiar with patterns, especially with potentially tricky variables like bowler or location. For instance, a bowler Dravid has faced for 12 balls may have dismissed Dravid twice. It is worth being aware of weird things in the data before running any regressions.
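A quick way to spot this kind of oddity is to tabulate balls and dismissals per bowler before modelling anything. The sketch below assumes the data has been reduced to (bowler, outcome) pairs; the bowler names and the 250-ball cut-off are placeholders of mine.

```python
from collections import defaultdict

def bowler_summary(balls):
    """balls: iterable of (bowler, outcome) pairs -> {bowler: (balls, dismissals)}."""
    counts = defaultdict(lambda: [0, 0])
    for bowler, outcome in balls:
        counts[bowler][0] += 1
        if outcome == "W":
            counts[bowler][1] += 1
    return {bowler: tuple(c) for bowler, c in counts.items()}

def thin_data_bowlers(summary, min_balls=250):
    """Bowlers with at least one dismissal but too few balls to trust the rate."""
    return sorted(b for b, (n, w) in summary.items() if n < min_balls and w > 0)

# Made-up example: Bowler A has 12 balls and 2 dismissals, Bowler B has 300 balls and 1.
balls = [("Bowler A", "W")] * 2 + [("Bowler A", "0")] * 10
balls += [("Bowler B", "W")] + [("Bowler B", "1")] * 299
summary = bowler_summary(balls)
print(thin_data_bowlers(summary))
```

Nothing here decides what to do with such bowlers; it only makes sure they are noticed before a regression turns them into a headline.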
Step 2: run the regression model
Model the outcome (number of runs scored, wicket or dot-ball) as a multinomial logistic outcome. This class of models is used in transportation analysis, where every commuter has the choice of multiple modes of transportation, or in brand analysis, where every consumer has the choice of multiple brands of breakfast cereal. Similarly, every ball has the choice of different outcomes, from sixer through dot-ball to wicket.
Allow the model to see all the fields listed above. Do not constrain the model. Include all two-term interactions. Just maximize the fit. Essentially, the computer is finding the configuration of explanatory variables which maximizes the likelihood of observing the outcomes in the dataset.
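To make "maximizing the likelihood" concrete, here is a toy multinomial logistic fit written from scratch. It is a sketch under heavy assumptions: the outcomes are collapsed to three classes (dot ball, runs, wicket), the only covariate is a home flag, the balls are synthetic, and plain gradient ascent stands in for whatever optimizer a real statistical package would use.

```python
import math
import random

def softmax(scores):
    """Turn one score per outcome class into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """weights[k] holds the coefficients for outcome class k."""
    return softmax([sum(w * xi for w, xi in zip(wk, x)) for wk in weights])

def log_likelihood(weights, X, y):
    """Log-probability the model assigns to the outcomes actually observed."""
    return sum(math.log(predict(weights, x)[k]) for x, k in zip(X, y))

def fit(X, y, n_classes, steps=200, lr=0.5):
    """Maximize the log-likelihood by plain gradient ascent."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(steps):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, k in zip(X, y):
            p = predict(weights, x)
            for c in range(n_classes):
                err = (1.0 if c == k else 0.0) - p[c]  # observed minus predicted
                for j, xj in enumerate(x):
                    grad[c][j] += err * xj
        for c in range(n_classes):
            for j in range(n_features):
                weights[c][j] += lr * grad[c][j] / len(X)
    return weights

# Synthetic balls (NOT real data). Classes: 0 = dot ball, 1 = runs, 2 = wicket.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    home = rng.choice([0.0, 1.0])
    probs = [0.70, 0.28, 0.02] if home else [0.75, 0.22, 0.03]
    X.append([1.0, home])  # intercept plus home flag
    y.append(rng.choices([0, 1, 2], weights=probs)[0])

weights = fit(X, y, n_classes=3)
baseline = [[0.0, 0.0] for _ in range(3)]  # every outcome equally likely
print(log_likelihood(baseline, X, y), log_likelihood(weights, X, y))
```

Standard packages fix one outcome as the reference class; this sketch does not, which changes the individual coefficients but not the fitted probabilities. The real analysis would of course use all eight outcome classes and the full list of fields.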
Most modern statistical packages will apply simple transforms to covariates to improve fit, for instance taking log(number of balls already faced), a transform which makes intuitive sense anyway.
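As a small illustration of why the log makes sense, the sketch below transforms the balls-faced count. Using log(1 + n) instead of log(n) is my assumption, made so that the first ball of an innings (n = 0) has a defined value.

```python
import math

def log_balls_faced(balls_faced: int) -> float:
    # log(1 + n) rather than log(n): an assumption so that n = 0 is defined.
    return math.log1p(balls_faced)

# The step from 0 to 10 balls counts for far more than the step from 100 to 110.
early_gain = log_balls_faced(10) - log_balls_faced(0)
late_gain = log_balls_faced(110) - log_balls_faced(100)
print(early_gain, late_gain)
```

That matches cricketing intuition: a batsman learns much more about the pitch in his first ten balls than in ten balls after he is set.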
Step 3: read the results
On a first pass, one expects to see date of innings being statistically significant. If it is clearly significant, and the coefficient has the right sign (a decline in form), that probably means the effect is real. A completely unconstrained model might spit out some funky functional forms, with performance being a parabolic function of time... improving initially and then declining.
A bunch of other interesting effects will be visible at this stage, and are fun to look for. For instance, does Dravid have a nemesis bowler? Is Dravid genuinely as good abroad as he is at home? Has he done any worse as an opener than at #3? Is Dravid more vulnerable to full length deliveries on the slow pitches at home than abroad (does the interaction term between home away flag and length have a non-zero coefficient)?
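For readers unfamiliar with interaction terms, the sketch below shows one way such a term could be fed to a model: one 0/1 column for every combination of the two class variables. The field names `venue` and `length` and the example rows are hypothetical.

```python
from itertools import product

def interaction_dummies(rows, field_a, field_b):
    """One 0/1 column per combination of the two class variables."""
    levels_a = sorted({r[field_a] for r in rows})
    levels_b = sorted({r[field_b] for r in rows})
    columns = [f"{a}:{b}" for a, b in product(levels_a, levels_b)]
    encoded = [
        [1 if f"{r[field_a]}:{r[field_b]}" == col else 0 for col in columns]
        for r in rows
    ]
    return columns, encoded

# Made-up rows, not real deliveries.
rows = [
    {"venue": "home", "length": "full"},
    {"venue": "away", "length": "short"},
]
columns, encoded = interaction_dummies(rows, "venue", "length")
print(columns)
print(encoded)
```

A real fit would drop one column as the reference level; a non-zero coefficient on the home and full column is then what "more vulnerable to full length deliveries at home" would look like in the output.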
Step 4: tweak the model
Refinements to the model are usually needed at this stage.
For instance, if no effect is observed overall, it might be because a real effect over the last six months is hidden by the length of the continuous dataset in use. Converting time into six-monthly blocks may be useful.
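Turning a date into a six-monthly block is a one-liner. The sketch below uses calendar halves (January to June, July to December), which is my choice of boundary; blocks aligned to cricket seasons would be an equally reasonable alternative.

```python
from collections import Counter
from datetime import date

def half_year_block(d: date) -> str:
    """Label a date with its calendar half-year, e.g. '2008-H2'."""
    half = 1 if d.month <= 6 else 2
    return f"{d.year}-H{half}"

# Made-up innings start dates, one entry per ball faced.
innings_dates = [date(2008, 3, 26), date(2008, 3, 26), date(2008, 12, 19)]
balls_per_block = Counter(half_year_block(d) for d in innings_dates)
print(balls_per_block)
```

The block label then enters the model as a class variable in place of the continuous date.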
Also, a time effect might be masked because it is correlated with the opposition. It might look like Dravid just happens to be weaker against Sri Lanka and Australia, India's most recent opponents. In this case, one might want to force the model to accept time blocks before it admits opposition.
Bowlers with thin data might show up as having implausibly strong effects. One might want to modify the data to slot all bowlers who have bowled fewer than 250 balls at Dravid into a pie-chuckers categorical variable.
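The pooling itself is simple. A minimal sketch, assuming the bowler field has been pulled out as a list with one entry per ball; the bowler names are placeholders and the tiny threshold in the example is only there to keep the data short.

```python
from collections import Counter

def pool_rare_bowlers(bowlers, min_balls=250, pooled_label="pie-chuckers"):
    """Replace any bowler with fewer than min_balls deliveries by one pooled label."""
    counts = Counter(bowlers)
    return [b if counts[b] >= min_balls else pooled_label for b in bowlers]

# Made-up example with a threshold of 3 balls instead of 250.
bowlers = ["Bowler A", "Bowler A", "Bowler A", "Bowler B", "Bowler C"]
pooled = pool_rare_bowlers(bowlers, min_balls=3)
print(pooled)
```

The pooled category then gets a single coefficient, which is estimated from far more balls than any of its members would have had alone.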
Step 5: validate the model
Keep a random subset of ~5000 balls outside the analysis described so far. Repeat the analysis on this holdout to make sure the results observed are similar. Validating on an additional time period is probably nonsense in this context, since time is a variable of interest.
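This is where the random number stored with each ball in step 1 earns its keep. A minimal sketch, assuming each record is a dict with a `holdout_draw` field (a name I made up) holding a uniform random number between 0 and 1; a cut-off of 0.2 sets aside roughly 5000 of 25000 balls.

```python
import random

def split_holdout(records, holdout_fraction=0.2):
    """Balls whose stored random number falls below the cut-off form the holdout."""
    train = [r for r in records if r["holdout_draw"] >= holdout_fraction]
    holdout = [r for r in records if r["holdout_draw"] < holdout_fraction]
    return train, holdout

# Synthetic records (NOT real data): only the random number matters here.
rng = random.Random(42)
records = [{"ball_id": i, "holdout_draw": rng.random()} for i in range(25000)]
train, holdout = split_holdout(records)
print(len(train), len(holdout))
```

Because the split depends only on a number fixed when the dataset was compiled, the same balls stay in the holdout however many times the model is tweaked in step 4.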
A more interesting approach to validation is to validate on non-test match data. If Dravid is in decline, we would expect to see that in all forms of cricket.
Step 6: Document the results and limitations
Gaps in the data and any subjective interpretations or analytic choices (treatment of missing values, definitions of class variables etc.) would be logged here.
Some limitations are systematic. This dataset is limited to Dravid's performance only. So a generalized improvement in the performance of all test batsmen of the same time period would not be picked up by the model. It is possible that Dravid is playing as well as ever, and that the world has moved forward faster than Dravid. A more ambitious analysis spanning a broader base of test batsmen is needed to shed more light on this.
Also highlight opportunities to improve on the analysis. For instance, it would be interesting to compare Dravid's decline with that of other top players. Assuming there is a decline, is it worse than what Gavaskar or Border suffered? Data may be thinner in the pre-internet era...but maybe it is out there in official score sheets.
Most critically, this analysis does not tell the captain whether or not Dravid should be replaced with a younger batsman. That remains a judgment call, based largely on how he wants to build his team. What it may tell the captain is that Dravid's run of poor scores is explained by randomness and is likely to end soon. So we avoid the injustice of a great player being judged on poorly constructed evidence.