Moonballs from Planet Earth: statistics

Sunday, 7 March 2021

Do Goans really want to become Portuguese citizens?

Mario Miranda's Goa

Suppose people from a poor third world country were given the option of being citizens of a rich first world country. Would they take it? Or leave it?

As a rule, people from poor third world countries don’t have the option of acquiring first world citizenship, so this remains a mostly theoretical question. Goa is an exception to this rule.

Goans who were Portuguese subjects before 1961, when India liberated Goa from the Salazar dictatorship, can choose to take a Portuguese passport. Their children and grandchildren can make the same choice. In effect, Goans who can trace their roots to colonial times can choose to be EU citizens.

How many of them have taken the option? My best estimate is about 15%.

I ran a small poll among friends and family who I thought had enough context to hazard a guess at this number. The guesstimates ranged from <1% to >80%. People like us don’t have an intuitive sense for a natural emigration rate.

My 15% estimate is a “soft” number because (surprisingly) there doesn’t seem to be any authoritative public data on this process. Google doesn’t throw up any crisp, credible results. Here is how I pieced together the estimate:

The most frequently quoted number in the press is that there are about 70,000 Goan-origin Portuguese citizens resident in Portugal, another 30,000 in the UK (as EU citizens before Brexit), and about 50,000 living in India (presumably on OCI visas). Adding these numbers up, it seems that about 150,000 Goans have opted for Portuguese citizenship.

How many Goans had the option?

Goa is tiny. Its population today is only about 1.5 million. This number includes migrants from the rest of India who settled in Goa after 1961, who are not eligible for a Portuguese passport.

The population of Goa in 1961 was just under 600,000. If that eligible population grew at a rate of ~1% per annum, about 1 million would be eligible today.

Taken together, it seems about 15% of eligible Goans chose EU citizenship. Or, 85% of those who could have moved from the third world to the first world chose to stay! My own guess was that at least 30% of eligible Goans would have taken the EU passport because of the size-of-the-prize.
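For readers who want to check the arithmetic, here is a minimal sketch of the estimate. The three citizen counts and the 1961 population are the press figures quoted above; the 1% growth rate and the choice to compound it over 60 years (1961 to 2021) are my assumptions, so treat the output as a rough cross-check rather than a precise number.

```python
# Back-of-envelope check of the estimate, using the figures quoted in the post.
citizens_portugal = 70_000
citizens_uk = 30_000
citizens_india = 50_000
took_option = citizens_portugal + citizens_uk + citizens_india

population_1961 = 600_000  # "just under 600,000", rounded up here
growth_rate = 0.01         # assumed ~1% per annum
years = 2021 - 1961        # assumption: compound up to the date of the post

eligible = population_1961 * (1 + growth_rate) ** years
share = took_option / eligible

print(f"took the option: {took_option:,}")
print(f"eligible today:  {eligible:,.0f}")
print(f"share:           {share:.1%}")
```

The compounded figure comes out a little above the rounded 1 million, so the share lands slightly under 15%. Given how soft the inputs are, that difference is well within the noise.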

What is the size-of-the-prize these Goans are choosing not to take?

The chart below shows the trajectory of India and Portugal’s per capita GDP in today’s USD from 1961 onwards (sourced from the World Bank…btw, I love this online data visualization tool!).

At the time the Salazar dictatorship was thrown out of India, emigrating to Portugal was not that attractive. Portugal's per capita GDP was at USD 360 in today’s money. Portugal was basically just another third world dictatorship that happened to be in Europe.

However, after democracy was established in 1974, and after Portugal joined the EU in 1986, incomes skyrocketed. Today Portuguese incomes are about USD 23,000 compared to Indian incomes of about USD 2,000.

Looking at the average Indian's income may be misleading. Goa is much more prosperous than the rest of India. Goa's per capita income is about USD 6000. After adjusting for purchasing power parity, that might mean emigration doubles real income rather than increasing it 10X.

In general, demand for emigration vs. income follows an S shaped curve (first S shaped curve in the diagram). The promise of doubling income, off a relatively high base, clearly isn't enough to prompt a mass migration of the comfortably-off.

Migration is all about S shaped curves

Culture must also be a factor. Goans are stereotypically laid-back. They might care less about the extra income than other Indians. Most Goans don’t speak Portuguese. I can imagine that the psychic cost of learning a new language to gain a foothold in a new home must be daunting.

Perhaps the biggest factor limiting emigration is that many Goans have chosen to stay home.

This is certainly the most emotionally resonant factor for me, as a Tam Bram from Mylapore, Madras. I’d attended a school reunion in Chennai (nee Madras) a few years ago when only 2 out of the 75 science students from my class showed up. The rest were abroad. The large concentrations were in California, Texas and New Jersey. Another Tam Bram friend I was discussing this phenomenon with told me about his classmates from IIT-Roorkee. He is the only one, out of about fifty, who is still in India.

These people are not emigrating because of economic necessity. The quality of life they would have experienced in India was always going to be okay. A big part of the reason they emigrate is because others like them are also emigrating (this is the second S shaped curve on the diagram).

I guess that if enough people-like-us leave home, home doesn’t quite feel like home anymore. I guess it's good that enough Goans are staying at home for Goa to still feel like Goa.

Paul Fernandes' Goa

Saturday, 20 July 2013

We drive on the left, so why do we walk on the right?

In England, we drive on the left. So it would be natural to walk on the left, right? Wrong!

This sign, instructing pedestrians to walk on the right, was photographed in the Green Park tube station, in Central London.

In the Green Park tube station

Why? Because of the high concentration of American tourists in Central London? Maybe...but it might just be random.

I'm conditioned to think that things are the way they are for a reason. It is much harder to accept that most things are the way they are for no especially good reason. It just is what it is. Get with the programme, baby, go with the flow.

Pedestrian tunnel, Green Park tube station

Tuesday, 29 May 2012

Why Rafael Nadal is like a Black Swan

"Black Swan" is business-speak for a single observation that demolishes a previously plausible theory.

The phrase comes from Nassim Taleb's excellent book - The Black Swan: The Impact of the Highly Improbable. Suppose one had a theory that "all swans are white". This would have been a really solid theory for a while, it would have been consistent with available evidence, robust to skeptical inquiry. The theory would have held until Australia was discovered, and black swans were observed, at which point the theory is toast. Personally, I find the metaphor a little awkward. But now that it has become a part of the language, it is quite helpful in talking about the limitations of statistics, and the problems that come from looking in the rear view mirror to get a view of the road ahead.

Taleb's book is about finance, but his concept applies to any aspect of life, including tennis.

Peter Bodo's preview of the 2012 French Open in Tennis magazine talks about why Rafael Nadal is a black swan (though he doesn't use that phrase). Until Rafa burst on the scene, the prevailing theory was that Bjorn Borg would be the last dominant French Open champion. After Bjorn Borg, who played with a wooden Donnay racquet, French Open champions had been a succession of one-slam-wonders.

"There were solid, well thought out, inter-related reasons for this. The men's field was getting deeper and deeper. At the same time, advances in racquet and string technology gave everyone a boost of power and a more lethal return game. Combine these comparably superior and fit athletes with more powerful weapons, and put them to work on a relatively slow court, and it was a bit like tennis roulette.

It seemed that Roland Garros had been transformed from the tournament that only the best and most consistent players could win into the one that anybody could win. And that was only heightened by the fact that so many of its more successful players were developed on clay in emerging tennis nations like Spain, Sweden, France, and Argentina. When you looked back upon the Borg years, you were apt to think, "We'll not see the likes of him again. . ."

And when Bruguera, who had even more radical technique than Borg, was unable to add to his Roland Garros haul of two, it seemed that the days when style-of-play and particularly vicious topspin might yield a huge advantage were definitely over.

Well, Nadal has exposed all that as just so much fancy-pants theorizing..."

Good luck in Paris to the King of Clay.

Saturday, 31 December 2011

Is it principled to be principled?

"Nobody ever did anything very foolish except from some strong principle". I chanced upon this quote a couple of months ago, and it has stayed in my mind ever since. It is an old quote, by the 2nd Viscount Melbourne, the young Queen Victoria's political mentor, but it has stayed in mind because it feels contemporary, and is less cynical than it sounds.

Good principles - like, for instance, that all human beings are created equal - tend to be very abstract. It is never obvious how these abstract principles translate into programs of specific action, into doing. However, it is always tempting to invoke these principles to build support for a program of action.

The problem with linking an action plan closely with its animating principle is that it makes it harder to abandon the action plan. That is a pity, because the only certainty with any action plan is that it will be made to look silly by "black swans", by real-world conditions that the plan did not, and could not have, known about. The bigger the agenda, the more quickly the black swans will strike.

A program of action which is tightly linked to a cherished principle usually means a program of action that isn't adaptive enough. Von Moltke the elder was pointing in the same direction when he said "No battle plan survives contact with the enemy".

Sunday, 6 February 2011

Red Plenty

I generally review books after I have read them, but I'm posting about Red Plenty when it's still in my Amazon shopping basket. I heard about this book's premise on the radio, and the premise may turn out to be its most interesting part.

Here is what the front flap says:

Once upon a time in the Soviet Union...

Strange as it may seem, the grey, oppressive USSR was founded on a fairy tale. It was built on the twentieth century magic called "the planned economy", which was going to gush forth an abundance of good things that the lands of capitalism could never match. And just for a little while, in the heady years of the late 1950s, the magic seemed to be working.

Red Plenty is about that moment in history, and how it came, and how it went away; about the brief era when, under the rash leadership of Nikita Khrushchev, the Soviet Union looked forward to a future of rich communists and envious capitalists...

This was the time between the launch of Sputnik in 1957, and the Cuban missile crisis in 1962, when the Soviet Union looked and felt rich and successful. It felt like the Soviets had invented a wonderful new world, both morally and materially superior to the West. So...this was the illusion, the chimera, that lured Nehru's India into decades of socialism and stagnation.

Red Plenty's hero is Leonid Kantorovich, the only Soviet to win the Nobel Prize for Economics. He invented linear programming (among other things), and so helped create the impression that Soviet science could allocate resources more efficiently than capitalist markets. The book is a melding of fact and fiction about how that vision was, and was not, true.

The other book in my Amazon shopping basket is Michael Lewis' The Big Short. I've actually started reading this book, but I didn't finish my father-in-law's copy on our last trip to Madras. It feels like a nice counter-point to Red Plenty. It too takes us back to a far-away past, the time between the fall of the Berlin Wall and the fall of Lehman Brothers, and reminds us that capitalism can also fall into catastrophic science-induced hubris.

Friday, 16 July 2010

Big-point players in tennis: NOT a myth

There really are big-point players in tennis. Just found a couple of statistical references to support this claim.

Watching this year’s Wimbledon, Rafael Nadal always looked in charge of his semi final against Andy Murray. Yet, there was a time late in the third set, with Murray down 0-2 on his way to a 0-3 whipping, when Murray had actually won more points than Nadal. Rafa was winning the points that mattered.

Similar claims in other sports have turned out to be false. For instance, baseball long believed in “clutch hitters”, batters who perform especially well in important situations. However, Bill James, the spiritual father of sports statistics, showed that this was simply not supported by the data. Similarly, fans long believed that basketball players have “hot hands”, when they are “in the zone” and sink every attempt. Statistical analysis showed that “hot hands” were fully explained by chance. Is tennis really different?

One reason for believing tennis is different comes from this (superb) New Yorker article on the state of the doubles game. The relevant sections say:

The doubles tour might no longer exist, if not for Etienne de Villiers, the chairman of the men’s tour at the time. De Villiers had previously worked at Walt Disney International, so he understood the need for better marketing. The doubles tour could survive, he said, but only if the players agreed to some compromises. The game would be streamlined. Most matches would be kept to two sets, with a “match tie break” in place of the third set. If a game went to 40-40 the next point would decide it; there would be no more endless ads and deuces. (Grand slams would stick with the traditional scoring).

The new format has few fans among the players. Martina Navratilova says it is a “bullshit excuse”. Leander Paes calls it “Russian roulette”, and Luke Jensen dismisses it as “tennis in a microwave”. Jensen believes that the shorter format favours weaker teams, “Anyone can win one set”.

Oddly enough, though, the statistics don’t bear this out. Not long after the changes were made, Wayne asked Carl Morris, a mathematician at Harvard, to calculate their effect on a team’s chances. In shorter matches, Morris concluded, the likelihood of an upset could increase by as much as five percentage points. And yet, when the ATP later reviewed the tour’s statistics, it found that the best players had improved their records. The new format offered “no second chances”, as Bob Bryan put it, but that wasn’t necessarily a bad thing. “The one thing we didn’t figure in is that the better teams are clutch” Wayne says. “On those big points, they come through”.

That said, this is a roundabout way of making a simple point. My friend Sriram Subramaniam suggested a comparison of % break points won with % other points won. One would expect Rafa to win more break points than Murray. Unfortunately, Google didn’t turn up this specific analysis. The closest thing to this analysis that a few minutes of Googling turned up is this paper by a Franc Klaassen of the University of Amsterdam.

He shows that there really are big points, and that seeded players play better on big points than unseeded players. He observes that seeded players facing a break point on their serve have the same win % as on other points, and that unseeded players have a lower win %, suggesting that it is more about weaker players choking than better players raising their game. He also shows that serving first in a set, or serving with new balls, has no impact. He doesn’t make any conclusions about champions like Rafa or Federer as opposed to the general pool of seeded players; his dataset is small, coming only from Wimbledon 92-95.
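The comparison Sriram suggested is simple enough to sketch. What follows is not the method used in the paper above; it is a plain two-proportion z-test, and the function name `win_rate_gap` and the counts in the example are made up purely for illustration.

```python
import math


def win_rate_gap(big_won, big_played, other_won, other_played):
    """Compare the win rate on big points with the win rate on other points.

    Returns (rate on big points, rate on other points, z statistic).
    A z well above ~2 would suggest the gap is not just chance.
    """
    p_big = big_won / big_played
    p_other = other_won / other_played
    # Pooled rate under the null hypothesis that big points are nothing special.
    pooled = (big_won + other_won) / (big_played + other_played)
    se = math.sqrt(pooled * (1 - pooled) * (1 / big_played + 1 / other_played))
    return p_big, p_other, (p_big - p_other) / se


# Hypothetical counts, NOT real match data.
p_big, p_other, z = win_rate_gap(big_won=30, big_played=50,
                                 other_won=500, other_played=1000)
print(f"big points: {p_big:.0%}, other points: {p_other:.0%}, z = {z:.2f}")
```

Break points are rare, so a single match will almost never give a convincing z. That is exactly why this needs a season's worth of tour data rather than one semi final.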

Calling for tennis’ Bill James to mine the vast amount of data generated by the ATP tour...

Saturday, 13 February 2010

Were we ever #1?

Were we ever #1? This feels like a question worth asking after the whipping at South Africa’s hands in Nagpur. Maybe the ICC ratings don’t actually mean anything.

For several years I have trusted the Rediff ratings more than the ICC ratings. The Rediff ratings suggest that India never were #1. The latest Rediff ratings Google could find, published in December 2009, show India at #2 behind Australia.

The nice thing about the Rediff ratings is that they set more value on wins against better teams, and wins away from home. They were developed back in 2001 by two geeky cricket fans, one of whom was the Director of the Economics Department at Bombay University. The good professor might have felt the need to develop an intelligent ratings scale because the official ICC ratings developed earlier in 2001 were so bad. These ratings were designed by a panel of distinguished cricketers, like Sunil Gavaskar and Ian Chappell, and treated all test wins as equally valuable. This is not a bad attitude for a player, who should play equally hard against any opposition. But from a fan's viewpoint this original ICC scale is asinine. I thought this post was going to be a rant about the stupidity of the ICC ratings.

However, it turns out that over time the ICC have improved their ratings methodology. They have now incorporated the best idea from the Rediff methodology, that wins against stronger teams matter more. With that improvement, the ICC ratings are not meaningless. India topped a meaningful table in 2009.

There still are interesting differences between the Rediff and ICC scales. The ICC scale gives extra weight to test series outcomes, which is nice. It does not weight-up away wins, which is odd. But the biggest difference is that the ICC ratings give double the weight to wins in the last two years, while the Rediff scale treats an entire cycle of home-away tests as one equally important block.

For instance, the Rediff scale gives Australia’s 5-0 whitewash of England in the 2006-07 Ashes as much weight as the 1-2 loss in England in 2009. Rediff’s logic is that these are the two most recent home-away series. In the ICC ratings, the 5-0 hammering in 2006 gets only half the weight of the 1-2 loss in 2009, because the 5-0 hammering happened more than two years ago. Clearly, weighting-up recent matches makes it harder to apply a home-away factor, because very few pairs of teams will have both home and away matches in the most recent two years.

Neither approach is right or wrong; different scales serve different purposes. The ICC ratings will respond more quickly to changes in performance. They will therefore have more predictive power, and will generate more rapid rating changes and therefore more news. The Rediff ratings are probably a more fair and comprehensive summing up of a complete block of historical performance. The swapping of ranks indicates that there probably is no real (statistically significant) difference in the performance of the best test teams since Shane Warne and Glenn McGrath retired.

Rediff ratings don’t seem to have been updated and published on schedule. The most current Rediff ratings don’t reflect South Africa’s drawn series against England, or Australia’s annihilation of Pakistan. Unfortunately, this might be for a good reason. As a profit maximizing brand, Rediff might not want to tell the Indian public things they don’t want to hear. Judging by the mean-spirited and jingoistic reader comments that were posted under the last Rediff update, this is a real concern.

Maybe the chest-thumping nationalism of a big chunk of Indian fans is much more worthy of a rant than the ICC’s rating methodology.

Tuesday, 5 May 2009

Fielding Flu

The swine flu, that terrible, dangerous contagion, resulted in my trip to a “summit” in the US being called off. Hence, I could veg out in front of the TV yesterday evening and watch the spread of an even more terrible, dangerous contagion: the fielding flu.

Chennai Super Kings managed to drop four easy catches, and fluff a run-out in a manner that would have embarrassed swine-herds, and yet beat the Deccan Chargers. Kolkata Knight Riders had a similar epidemic today (though with a less happy match-result).

The interesting thing about these drops is that they are not random. If the last few chances that went to hand were dropped, the likelihood that the next chance will be dropped is significantly higher*. Fielding flu spreads through exactly the same mechanism described in my previous post: fielders carry a mental image of a colleague grassing the ball, and the subconscious brings that image into reality.

Paradoxically, a strong team ethos may actually make teams more vulnerable* to this contagion. Players who sincerely identify with each other may carry a more vivid mental image of a friend dropping a catch.
__________

*this is a testable statistical proposition and a wonderful opportunity for ambitious young cricket statisticians looking to emulate the great Bill James
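To make the proposition concrete, here is a rough sketch of the simplest version of the test: compare the drop rate on chances that follow a drop with the drop rate on chances that follow a catch. I have simplified "the last few chances" to just the previous chance, and the sequence in the example is invented, not taken from any scorecard.

```python
def drop_rates(chances):
    """chances: booleans in match order, True means the chance was dropped.

    Returns (drop rate after a drop, drop rate after a catch).
    A rate is None when there are no chances of that kind to look at.
    """
    after_drop = []
    after_catch = []
    for previous, current in zip(chances, chances[1:]):
        (after_drop if previous else after_catch).append(current)

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(after_drop), rate(after_catch)


# Invented sequence for illustration only.
sample = [False, True, True, False, False, True, False, False]
print(drop_rates(sample))
```

If fielding flu is real, the first number should be reliably higher than the second across many matches. A handful of chances from one evening proves nothing either way.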

Saturday, 2 May 2009

Dangerous Safety Signs

Bikers on twisty mountain roads should carry mental images of stability and control. They should not carry mental images of spectacular crashes. These images make the rider more likely to crash the bike, yet these are exactly the images that the road sign above is trying to evoke.

This is a simple truth that sports coaches know. A good cricket coach does not tell a batter to not fish outside the off stump. He tells the batter to hit through the line. The subconscious does not work with logical operators like not. It simply brings the mental images it holds into reality.

But the people who design signage for roads don't seem to know this. With tragic consequences...

Seriously, this is a completely testable proposition.

Show amateur pilots video footage of gruesome crashes of planes similar to what they fly. Put them in a flight simulator. Ask them to do complex manoeuvres. Measure their crash rate. Compare with a control group which was shown footage of smooth, successful flights.

And presto...we now have scientific evidence with which to prosecute the road sign chaps for manslaughter. Or at least save a few lives.

Wednesday, 22 April 2009

Cherokee Medicine

"The Cherokee lands furnished herbs to treat every known illness – until the Europeans came". This claim is from a tourist brochure I came across in North Carolina, still home to the Cherokee Nation.

Herbs to treat every known illness? A strong claim by any standards. Yet I read that claim humbly, respectfully, sympathetically. It is an assertion of Cherokee pride, an assertion worth making after the horrors of native American history. Is there a crime even worse than genocide? The annihilation of an entire civilization?

That respectful, sympathetic moment stuck in memory when I realized that I would never extend the same courtesy to the other sort of Indians, Asian-Indians like myself. This, despite the many terrible things that have been done to us through history.

When a fellow Indian seriously claims that our ancient culture had herbs to treat every known illness (this happens astonishingly often), my irritated instinct is to refer him to Ben Goldacre's excellent book/ blog on Bad Science, and ask to see the data from randomized, double blind, placebo controlled clinical trials.

Why the difference?

I guess I just can't think about India as a Wounded Civilization any more.

Saturday, 14 March 2009

Bad Science

Read this book. Ben Goldacre is a doctor + blogger. This is his good-natured rant about the manipulative tricks of money grubbing charlatans who adopt the trappings of science. His targets include homeopaths (homeopathic drugs are no better than placebos), pharma companies (trials which show expensive drugs to be ineffective are not published), and the media (who publicise a fake health scare a week). Great fun.

I hereby proclaim that Moonballs from Planet Earth and Bad Science are kindred souls.

The trouble with bad science actually starts where the book leaves off, when one moves beyond pharmacology. There are many fields worthy of scientific enquiry, where placebo-controlled, double-blind, randomized trials are not possible.

For instance, Earth Sciences. It is worth knowing if we are making our planet uninhabitable. However, we can't find out by doing an experiment. We can't hold out a control sample of several dozen similar planets where the fossil fuels were never burnt, and compare the richness of life-forms observed a few thousand years later in the test and control. So scientists have to use models, which are intrinsically fallible.

Calling out the shortcomings of the models used is central to being an honest scientist. However, lists of model caveats don't make for good TV (or for good top-management presentations). So the media coverage of global warming is about as alarmist as the fake-health-scare-a-week stories that Ben Goldacre rants on about.

The guy who first called this non-science was Bjorn Lomborg, in The Skeptical Environmentalist. It is not light reading, but it is also worth looking up, just to get a sense for how hard it really is to construct good science, with limited data, in the thick of an emotionally charged, politicized debate.

Monday, 22 December 2008

Dravid's slump in form

I am on vacation in India. One of my nephews and I are vegging out in front of the TV on a Monday afternoon at my in-laws' place, while the rest of the family naps. We are watching India nurdle along at 2 runs per over on the fourth afternoon of the Mohali test against England.

The commentators don't have a whole lot to talk about. We are watching endless replays of Dravid's stumps being shattered by Stuart Broad. Are horrible pictures like this a sign of Rahul Dravid's decline? Or does this just happen sometimes to any batsman, however great? And what to read into his century in the first innings of the Mohali test? The commentators are blathering on and on...for long enough for my inner-analyst to want to get beyond the blather...

The commentariat all agree that Dravid is suffering a slump in form. What, unfortunately, has not been properly examined is whether Dravid has really been scoring fewer runs than before, or whether the perceived slump in form is nothing more than randomness playing out. It is entirely possible that Dravid is batting as well as he ever has, and that the dice just haven't rolled his way. The mind is very good at spotting patterns, especially when there aren't any.

This question is inspired by Moneyball (recommended reading for any cricket fan). Moneyball is about how statistical analysis forms the foundation of a winning baseball team, the Oakland Athletics. It reports on persistent sporting myths that statistics busts. For instance, there is no such thing as a clutch hitter, a batter who does especially well in vital situations. Nor is there any such thing as a hot hand, a streak in basketball when an NBA player is "in-the-groove" and landing every shot in the basket.

Baseball is now enriched by a Society for American Baseball Research. The American Statistical Association now has a section dedicated to sports statistics. It is a pity that this quality of statistical analysis has not been applied to cricket, despite the richness of the data available. It is also an opportunity for smart young cricket-loving statisticians. Calling S. Rajesh of Cricinfo?

To give the interested (geeky) reader a flavour of what is possible, here is the outline of a statistical analysis that would shed more light on Dravid's form than anything that has appeared in the media so far. None of the techniques described below is very complicated, or goes beyond material taught routinely at the undergraduate level. I would be delighted to see this analysis available in the public domain along with a well documented methodology and explanations, and expect no credit or authorship rights. Also, a disclaimer. I am not a professional statistician; my knowledge of statistics is mainly as a customer to statisticians. Any feedback from readers with more statistical knowledge, especially around time-series analytic techniques that could improve this analysis, is appreciated.

Outline of desired analysis
Step 1: compile the dataset
Each record in the dataset is one of the ~25000 balls Rahul Dravid has faced in test cricket. Each record in the dataset has the following fields: outcome (which takes the values 0-6 and W, all represented as class variables), opponent (Australia, England etc.), bowler, bowler type (pace, military medium, leg spin etc.), location, home away flag (derived from location), innings (which-ith innings of a test match), position played in the batting order (mostly #3), number of balls already faced in the innings, date innings started, a random number (for validation in step 5).

I don't think any of this data is hard to obtain. It is reported in the ball by ball commentary on Cricinfo, which I'm assuming is professionally archived. This list is not meant to be exhaustive, most datasets come with a few plausible covariates that can be thrown in and played with.

A couple of fields I would love to add, which may be harder to obtain, are length (full, length, back-of, short) and line (outside off, off, middle and leg, outside leg). I believe this is the data the team statisticians sitting in the dressing room code in their laptops.

Examine the dataset to get familiar with patterns, especially with potentially tricky variables like bowler or location. For instance, a bowler Dravid has faced for 12 balls may have dismissed Dravid twice. Worth being aware of weird things in the data before running any regressions.
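As a sketch of what Step 1 might look like in practice, here is one possible record layout plus the kind of quick per-bowler summary that surfaces the thin-data oddities mentioned above. The class name `Ball`, the field names and the helper `bowler_summary` are my own inventions for illustration; nothing here reflects how Cricinfo actually stores its data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ball:
    outcome: str               # "0".."6" or "W", treated as a class variable
    opponent: str
    bowler: str
    bowler_type: str           # pace, military medium, leg spin etc.
    location: str
    home: bool                 # home-away flag, derived from location
    innings: int               # which innings of the test match
    batting_position: int      # mostly 3
    balls_faced_before: int    # balls already faced in this innings
    innings_start: date
    random_draw: float         # random number, kept for validation in step 5
    length: Optional[str] = None   # harder to obtain, so optional
    line: Optional[str] = None     # harder to obtain, so optional


def bowler_summary(balls):
    """Return {bowler: (balls bowled at the batsman, dismissals)}."""
    bowled = Counter(b.bowler for b in balls)
    dismissals = Counter(b.bowler for b in balls if b.outcome == "W")
    return {name: (bowled[name], dismissals[name]) for name in bowled}
```

Sorting the summary by balls bowled makes a bowler with 12 balls and two dismissals jump out immediately, well before any regression gets a chance to over-interpret him.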

Step 2: run the regression model
Model the outcome (number of runs scored, wicket or dot-ball) as a multinomial logistic outcome. This class of models is used in transportation analysis - every commuter has the choice of multiple modes of transportation - or in brand analysis - every consumer has the choice of multiple brands of breakfast cereal. Similarly, every ball has the choice of different outcomes - from sixer through dot-ball to wicket.

Allow the model to see all the fields listed above. Do not constrain the model. Include all two-term interactions. Just maximize the fit. Essentially, the computer is finding the configuration of explanatory variables which maximizes the likelihood of observing the outcomes in the dataset.

Most modern statistical packages will apply simple transforms to covariates to improve fit, like for instance taking log(number of balls already faced), a transform which makes intuitive sense anyway.
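For readers who have not met a multinomial logit before, this sketch shows only the prediction side: how a fitted model would turn the covariates for one ball into a probability for each outcome. Fitting the coefficients is the statistical package's job and is not shown. The function name, the feature names and the coefficient values are all made up for illustration.

```python
import math

OUTCOMES = ["0", "1", "2", "3", "4", "5", "6", "W"]


def outcome_probabilities(features, coefficients):
    """Multinomial logit: one linear score per outcome, pushed through a softmax.

    features:     {feature name: value} for a single ball
    coefficients: {outcome: {feature name: weight}}; missing weights count as 0
    """
    scores = {
        outcome: sum(coefficients.get(outcome, {}).get(name, 0.0) * value
                     for name, value in features.items())
        for outcome in OUTCOMES
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {outcome: math.exp(score - top) for outcome, score in scores.items()}
    total = sum(exps.values())
    return {outcome: value / total for outcome, value in exps.items()}


# Made-up coefficients, NOT fitted to any data.
toy_coefficients = {
    "0": {"intercept": 2.0},
    "1": {"intercept": 0.5, "log_balls_faced": 0.1},
    "4": {"intercept": -0.5, "log_balls_faced": 0.2},
    "W": {"intercept": -1.5, "log_balls_faced": -0.3},
}
ball = {"intercept": 1.0, "log_balls_faced": math.log(1 + 40)}
probabilities = outcome_probabilities(ball, toy_coefficients)
print({outcome: round(p, 3) for outcome, p in probabilities.items()})
```

The log transform on balls faced is the same one mentioned above. I have used log(1 + n) so that the first ball of an innings does not blow up, which is my choice rather than anything a package will necessarily do.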

Step 3: read the results
On a first pass, one expects to see date of innings being statistically significant. If it is clearly significant, and the coefficient has the right sign (a decline in form), that probably means the effect is real. A completely unconstrained model might spit out some funky functional forms, with performance being a parabolic function of time...improving initially and then declining.

A bunch of other interesting effects will be visible at this stage, and are fun to look for. For instance, does Dravid have a nemesis bowler? Is Dravid genuinely as good abroad as he is at home? Has he done any worse as an opener than at #3? Is Dravid more vulnerable to full length deliveries on the slow pitches at home than abroad (does the interaction term between home away flag and length have a non-zero coefficient)?

Step 4: tweak the model
Refinements to the model are usually needed at this stage.

For instance, if no effect is observed overall, it might be because a real effect over the last six months may be hidden by the length of the continuous dataset in use. Converting time into six monthly blocks may be useful.

Also, a time effect might be masked because it is correlated with the opposition. It might look like Dravid just happens to be weaker against Sri Lanka and Australia, India's most recent opponents. In this case, one might want to force the model to accept time blocks before it admits opposition.

Bowlers with thin data might show up as having implausibly strong effects. One might want to modify the data to slot all bowlers who have bowled fewer than 250 balls at Dravid into a single pie-chuckers category.
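The bowler bucketing is a one-liner in most tools. A minimal sketch, assuming the bowler field is available as a plain list of names, one per ball; the function name and the placeholder bowler names are hypothetical.

```python
from collections import Counter


def bucket_rare_bowlers(bowlers, min_balls=250, label="pie-chucker"):
    """Replace every bowler with fewer than min_balls deliveries by one shared label."""
    counts = Counter(bowlers)
    return [name if counts[name] >= min_balls else label for name in bowlers]


# Placeholder names: "A" has bowled 300 balls, "B" only 12.
bucketed = bucket_rare_bowlers(["A"] * 300 + ["B"] * 12)
print(Counter(bucketed))
```

The 250-ball threshold is a judgment call. It is worth re-running the model with a couple of different cut-offs to check that the time effect does not depend on it.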

Step 5: validate the model

Keep a random subset of ~5000 balls outside the analysis described so far. Repeat the analysis on this holdout to make sure the results observed are similar. Validating on an additional time period is probably nonsense in this context, since time is a variable of interest.
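A minimal sketch of setting the holdout aside, assuming the dataset is simply a list with one record per ball. The split has to happen before any model fitting, and the fixed seed is only there to make the split reproducible.

```python
import random

def split_holdout(balls, holdout_size=5000, seed=0):
    """Randomly set aside holdout_size records; return (training, holdout)."""
    rng = random.Random(seed)
    indices = list(range(len(balls)))
    rng.shuffle(indices)
    holdout_idx = set(indices[:holdout_size])
    training = [b for i, b in enumerate(balls) if i not in holdout_idx]
    holdout = [b for i, b in enumerate(balls) if i in holdout_idx]
    return training, holdout

# Stand-in data: integers in place of real ball-by-ball records.
all_balls = list(range(30000))
training, holdout = split_holdout(all_balls)
print(len(training), len(holdout))  # 25000 5000
```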

A more interesting approach to validation is to validate on non-test match data. If Dravid is in decline, we would expect to see that in all forms of cricket.

Step 6: Document the results and limitations
Gaps in data and any subjective interpretations or analytic choices (treatment of missing values, definition of class variables, etc.) would be logged here.

Some limitations are systematic. This dataset is limited to Dravid's performance only. So a generalized improvement in the performance of all test batsmen of the same time period would not be picked up by the model. It is possible that Dravid is playing as well as ever, and that the world has moved forward faster than Dravid. A more ambitious analysis spanning a broader base of test batsmen is needed to shed more light on this.

Also highlight opportunities to improve on the analysis. For instance, it would be interesting to compare Dravid's decline with that of other top players. Assuming there is a decline, is it worse than what Gavaskar or Border suffered? Data may be thinner in the pre-internet era...but maybe it is out there in official score sheets.

Most critically, this analysis does not tell the captain whether or not Dravid should be replaced with a younger batsman. That remains a judgment call, based largely on how he wants to build his team. What it may tell the captain is that Dravid's run of poor scores is explained by randomness and is likely to end soon. So we avoid the injustice of a great player being judged on poorly constructed evidence.

Thursday, 3 July 2008

Leos suffer from weak digestion. They do, don't they?

Great old story from the Economist about a very common statistical error. Cherry picking.

Hospital admission data from Canada shows that Leos are more likely to have gastric trouble and Sagittarians are more likely to break their arms. Both results are statistically significant...if your statistical technique ignores the fact that with 24 comparisons one or two are likely to be significant at the 95% level due to pure randomness.

I unconsciously resisted absorbing this idea during stats training...probably because I'm usually very keen for the results of my tests to be significant. Yet when one is doing dozens of tests (as I often am) results that appear significant are often just noise.
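A quick back-of-envelope check of the multiple-comparisons arithmetic. It assumes the 24 comparisons are independent and each is tested at the 5% level; both are simplifying assumptions of mine, not details taken from the study.

```python
n_tests = 24   # number of comparisons
alpha = 0.05   # false-positive rate per test at the 95% level

# Expected number of "significant" results when no real effect exists.
expected_false_positives = n_tests * alpha

# Chance that at least one test comes up significant by luck alone.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(round(expected_false_positives, 2))  # 1.2
print(round(p_at_least_one, 2))            # 0.71
```

So even if star signs have no effect whatsoever, there is roughly a 7-in-10 chance of finding at least one "significant" ailment to write a headline about.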

This example hammered the point home...probably because I am very receptive to the thought that astrology is a vicious scam. Cultural context: astrology in India isn't just harmless fun. The truth is that Leos are no more likely than anyone else to have gastric trouble. And my mom's painful feet are because of poorly designed footwear, not her Virgo birth sign.

Friday, 5 October 2007

Blogger Econometrics

Greg Mankiw thinks he has had 3,000,000 visitors to his blog.

http://gregmankiw.blogspot.com/2007/10/3000000.html

This was my comment on his blog...with a helpful link to direct some traffic to my blog:

Sitemeter also measures the time each visitor spends on your site. Fifteen of the last twenty visitors had spent zero seconds. And the other five had spent less than a minute.

Sure, this blog is great. And fellow bloggers like me spend a lot of time here. But 3 million visitors is a huge overestimate of your effective reach.

Monday, 10 September 2007

Performance under pressure

After the surrender at Lord's on Saturday, the hacks and cynics will be out in force. They will rant and rave about how India lack killer instinct, about how Sachin doesn't fire when it matters most. The pseudo-intellectuals might read deeper meaning into this. Our lack of a killer instinct is cultural. We are, after all, the land of ahimsa.

Humbug.

My take is that India, England, Pakistan and South Africa are evenly matched teams. Who wins on a particular day is down to chance. India have had a run of bad luck in clutch games. It has happened to South Africa. It happened to Ricky Ponting against Harbhajan Singh. Our boys aren't playing worse than usual in clutch games.

In case old friends are wondering if I've suddenly become a sentimental apologist for mediocrity...no.

I believe the bad luck theory mainly because a very similar hypothesis has been rigorously tested in baseball. For generations baseball had a rich mythology about clutch hitters, great batters who score when it matters most. However, when trained statisticians examined the data, there was no evidence at all that the fabled clutch hitters did any better in clutch situations than in other situations. A purely random event had spawned a rich and convincing mythology. Read Moneyball by Michael Lewis for more fundas (strongly recommended if you like this post). Or just look up clutch hitting or Bill James on Wikipedia (link below).

Clutch hitter - Wikipedia, the free encyclopedia

Similar academic-quality analysis showed no evidence of the equivalent in basketball: a Hot Hand, when a player who is in the Zone shoots baskets at will. The memorable sequences of great shots in basketball are fully explained by chance. The mind sees patterns where there are none.
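A small simulation makes the point about streaks appearing by chance. This is a toy sketch, not a reproduction of the published analyses: it assumes every shot is an independent coin flip with a fixed 50% success rate, and the number of shots and players are arbitrary choices of mine.

```python
import random

def longest_streak(outcomes):
    """Length of the longest run of consecutive successes."""
    best = current = 0
    for made in outcomes:
        current = current + 1 if made else 0
        best = max(best, current)
    return best

def simulate_longest_streaks(n_shots=100, p_make=0.5, n_players=1000, seed=1):
    """Longest streak for each of n_players who shoot purely at random."""
    rng = random.Random(seed)
    return [
        longest_streak(rng.random() < p_make for _ in range(n_shots))
        for _ in range(n_players)
    ]

streaks = simulate_longest_streaks()
print("average longest streak:", sum(streaks) / len(streaks))
print("share of players with a streak of 5 or more:",
      sum(s >= 5 for s in streaks) / len(streaks))
```

None of these simulated players has any form or any Zone, yet most of them will produce a run of makes long enough to look like one.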

There is a real opportunity here for an ambitious statistician. There is little or no rigorous statistical analysis of cricket data in circulation.
_________________________________

I have seen professional sportsmen lose self-belief at crucial moments in their careers. I saw it most recently in the Twenty20 finals when Carl Greenidge of Gloucestershire, Gordon Greenidge's son and Andy Roberts's cousin, went to pieces bowling the last over. I saw Jana Novotna slip into that same state of mind during a Wimbledon final against Steffi Graf. Zaheer might have been in that state of mind in the 2003 World Cup final, though he seemed too wound up rather than going to pieces. We might have gone to pieces against England in Mumbai in 2006, when we collapsed to Shaun Udal.

That sort of mental disintegration happens to people of all races. It is rare. It didn't happen on Saturday. The dice just rolled for England.