Archive for the ‘methodological quibbles’ Category

Data: big, small, and meta

When I read this New York Times piece back in August, I was in the midst of preparation and training for data collection at rural health facilities in Zambia. The Times piece profiles a group called Global Pulse that is doing good work on the ‘big data’ side of global health:

The efforts by Global Pulse and a growing collection of scientists at universities, companies and nonprofit groups have been given the label “Big Data for development.” It is a field of great opportunity and challenge. The goal, the scientists involved agree, is to bring real-time monitoring and prediction to development and aid programs. Projects and policies, they say, can move faster, adapt to changing circumstances and be more effective, helping to lift more communities out of poverty and even save lives.

Since I was gearing up for ‘field work’ (more on that here; I’ll get to it soon), I was struck at the time by the very different challenges one faces at the other end of the spectrum. Call it small data? And I connected the Global Pulse profile with this, by Wayan Vota, from just a few days before:

The Sneakernet Reality of Big Data in Africa

When I hear people talking about “big data” in the developing world, I always picture the school administrator I met in Tanzania and the reality of sneakernet data transmissions processes.

The school level administrator has more data than he knows what to do with. Years and years of student grades recorded in notebooks – the hand-written on paper kind of notebooks. Each teacher records her student attendance and grades in one notebook, which the principal then records in his notebook. At the local district level, each principal’s notebook is recorded into a master dataset for that area, which is then aggregated at the regional, state, and national level in even more hand-written journals… Finally, it reaches the Minister of Education as a printed-out computer-generated report, complied by ministerial staff from those journals that finally make it to the ministry, and are not destroyed by water, rot, insects, or just plain misplacement or loss. Note that no where along the way is this data digitized and even at the ministerial level, the data isn’t necessarily deeply analyzed or shared widely….

And to be realistic, until countries invest in this basic, unsexy, and often ignored level of infrastructure, we’ll never have “big data” nor Open Data in Tanzania or anywhere else. (Read the rest here.)

Right on. And sure enough two weeks later I found myself elbow-deep in data that looked like this — “Sneakernet” in action:

In many countries quite a lot of data (of varying quality) exists, but it’s often formatted like the above. Optimistically, it may get used for local decisions, and eventually for high-level policy decisions when it’s months or years out of date. There’s a lot of hard, good work being done to improve these systems (more often by residents of low-income countries, sometimes by foreigners), but still far too little. This data is certainly primary, in the sense that it was collected on individuals, or by facilities, or about communities, but there are huge problems with quality, and with the sneakernet by which it gets back to policymakers, researchers, and (sometimes) citizens.

For the sake of quick reference, I keep a folder on my computer that has — for each of the countries I work in — most of the major recent ultimate sources of nationally-representative health data. All too often the only high-quality ultimate source is the most recent Demographic and Health Survey, surely one of the greatest public goods provided by the US government’s aid agency. (I think I’m paraphrasing Angus Deaton here, but can’t recall the source.) When I spent a summer doing epidemiology research with the New York City Department of Health and Mental Hygiene, I was struck by just how many rich data sources there were to draw on, at least compared to low-income countries. Very often there just isn’t much primary data on which to build.

On the other end of the spectrum is what you might call the metadata of global health. When I think about the work that the folks I know in global health (classmates, professors, acquaintances, and occasionally though not often me) do day to day, much of it is generating metadata. This is research or analysis derived from the primary data, and thus relying on its quality. It’s usually smart, almost always well-intentioned, and often well-packaged, but this towering edifice of effort is erected over a foundation of primary data; the metadata sometimes gives the appearance of being primary, but when you dig down, the sources often point back to the same one to three ultimate data sources.

That’s not to say that generating this metadata is bad: for instance, modeling impacts of policy decisions given the best available data is still the best way to sift through competing health policy priorities if you want to have the greatest impact. Or a more cynical take: the technocratic nature of global health decision-making requires that we either have this data or, in its absence, impute it. But regardless of the value of certain targeted bits of the metadata, there’s the question of the overall balance of investment in primary vs. secondary-to-meta data, and my view — somewhat ironically derived entirely from anecdotes — is that we should be investing a lot more in the former.

One way to frame this trade-off is to ask, when considering a research project or academic institute or whatnot, whether the money spent on that project might deliver more value if it were spent instead on training data collectors and statistics offices, or on supporting primary data collection (e.g., funding household surveys) in low-income countries. I think in many cases the answer will be clear, perhaps to everyone except those directly generating the metadata.

That does not mean that none of this metadata is worthwhile. On the contrary, some of it is absolutely essential. But a lot isn’t, and there are opportunity costs to any investment, a choice between investing in data collection and statistics systems in low-income countries, vs. research projects where most of the money will ultimately stay in high-income countries, and the causal pathway to impact is much less direct.  

Looping back to the original link, one way to think of the ‘big data’ efforts like Global Pulse is that they’re not metadata at all, but an attempt to find new sources of primary data. Because there are so few good sources of data that get funded, or that filter through the sneakernet, the hope is that mobile phone usage and search terms and whatnot can be mined to give us entirely new primary data, on which to build new pyramids of metadata, and with which to make policy decisions, skipping the sneakernet altogether. That would be pretty cool if it works out.

Slow down there

Max Fisher has a piece in the Washington Post presenting “The amazing, surprising, Africa-driven demographic future of the Earth, in 9 charts”. While he notes that the numbers are “just projections and could change significantly under unforeseen circumstances” the graphs don’t give any sense of the huge uncertainty involved in projecting trends out 90 years in the future.

Here’s the first graph:

 

The population growth in Africa here is a result of much higher fertility rates, and a projected slower decline in those rates.

But those projected rates have huge margins of error. Here’s the total fertility rate, or “the average number of children that would be born to a woman over her lifetime”  for Nigeria, with confidence intervals that give you a sense of just how little we know about the future:

That’s a lot of uncertainty! (Image from here, which I found thanks to a commenter on the WaPo piece.)

It’s also worth noting that if you had made similar projections 87 years ago, in 1926, it would have been hard to anticipate World War II, hormonal birth control, and AIDS, amongst other things.

18 07 2013

(Not) knowing it all along

David McKenzie is one of the guys behind the World Bank’s excellent and incredibly wonky Development Impact blog. He came to Princeton to present on a new paper with Gustavo Henrique de Andrade and Miriam Bruhn, “A Helping Hand or the Long Arm of the Law? Experimental evidence on what governments can do to formalize firms” (PDF). The subject matter — trying to get small, informal companies to register with the government — is outside my area of expertise. But I thought there were a couple methodologically interesting bits:

First, there’s an interesting ethical dimension, as one of their several interventions tested was increasing the likelihood that a firm would be visited by a government inspector (i.e., that the law would be enforced). From page 10:

In particular, if a firm owner were interviewed about their formality status, it may not be considered ethical to then use this information to potentially assign an inspector to visit them. Even if it were considered ethical (since the government has a right to ask firm owners about their formality status, and also a right to conduct inspections), we were still concerned that individuals who were interviewed in a baseline survey and then received an inspection may be unwilling to respond to a follow-up. Therefore a listing stage was done which did not involve talking to the firm owner.

In other words, all their baseline data was collected without actually talking to the firms they were studying — check out the paper for more on how they did that.

Second, they did something that could (and maybe should) be incorporated into many evaluations with relative ease. Because findings often seem obvious after we hear them, McKenzie et al. asked the government staff whose program they were evaluating to estimate what the impact would be before the results were in. Here’s that section (emphasis added):

A standard question with impact evaluations is whether they deliver new knowledge or merely formally confirm the beliefs that policymakers already have (Groh et al, 2012). In order to measure whether the results differ from what was anticipated, in January 2012 (before any results were known) we elicited the expectations of the Descomplicar [government policy] team as to what they thought the impacts of the different treatments would be. Their team expected that 4 percent of the control group would register for SIMPLES [the formalization program] between the baseline and follow-up surveys. We see from Table 7 that this is an overestimate…

They then expected the communication only group to double this rate, so that 8 percent would register, that the free cost treatment would lead to 15 percent registering, and that the inspector treatment would lead to 25 percent registering…. The zero or negative impacts of the communication and free cost treatments therefore are a surprise. The overall impact of the inspector treatment is much lower than expected, but is in line with the IV estimates, suggesting the Descomplicar team have a reasonable sense of what to expect when an inspection actually occurs, but may have overestimated the amount of new inspections that would take place. Their expectation of a lack of impact for the indirect inspector treatment was also accurate.

This establishes exactly what in the results was a surprise and what wasn’t. It might also make sense for researchers to ask both the policymakers they’re working with and some group of researchers who study the same subject to give such responses; it would certainly help make a case for the value of (some) studies.

This beautiful graphic is not really that useful

This beautiful infographic from the excellent blog Information is Beautiful has been making the rounds. You can see a bigger version here, and it’s worth poking around for a bit. The creators take all deaths from the 20th century (drawing from several sources) and represent their relative contribution with circles:

I appreciate their footnote that says the graphic has “some inevitable double-counting, broad estimation and ball-park figures.” That’s certainly true, but the inevitably approximate nature of these numbers isn’t my beef.

The problem is that I don’t think raw numbers of deaths tell us very much, and they can actually be quite misleading. Someone who saw only this infographic might well end up less well-informed than if they hadn’t seen it. Looking at the red circles you get the impression that non-communicable and infectious diseases were roughly equivalent in importance in the 20th century, followed by “humanity” (war, murder, etc.) and cancer.

The root problem is that mortality is inevitable for everyone, everywhere. This graphic lumps together pneumonia deaths at age 1 with car accidents at age 20, and cancer deaths at 50 with heart disease deaths at 80. We typically don’t (and I would argue shouldn’t) assign the same weight to a death in childhood or the prime of life as to one that comes at the end of a long, satisfying life. The end result is that this graphic greatly overemphasizes the importance of non-communicable diseases in the 20th century, and that’s the impression most laypeople will walk away with.

A more useful graphic might use the same circles to show the years of life lost (or something like DALYs or QALYs) because those get a bit closer to what we care about. No single number is actually all that great, so we can get a better understanding if we look at several different outcomes (which is one problem with any visualization). But I think raw mortality numbers are particularly misleading.
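To make the difference concrete, here is a minimal sketch with entirely made-up numbers. The causes, death counts, ages at death, and the reference life expectancy of 80 are all illustrative assumptions, not figures from the infographic. It compares each cause's share of raw deaths with its share of years of life lost.

```python
# Illustrative only: every number below is invented for this example.
REFERENCE_AGE = 80  # assumed reference life expectancy, in years

# (cause, deaths, typical age at death); hypothetical values
causes = [
    ("childhood pneumonia", 100, 1),
    ("road injuries", 100, 20),
    ("cancer", 100, 50),
    ("heart disease", 100, 78),
]

def years_of_life_lost(deaths, age_at_death, reference_age=REFERENCE_AGE):
    """Deaths times the remaining years up to the reference age (never negative)."""
    return deaths * max(reference_age - age_at_death, 0)

def shares(values):
    """Express each value as a fraction of the total."""
    total = sum(values)
    return [v / total for v in values]

death_shares = shares([d for _, d, _ in causes])
yll_shares = shares([years_of_life_lost(d, a) for _, d, a in causes])

for (name, _, _), ds, ys in zip(causes, death_shares, yll_shares):
    print(f"{name:20s} deaths {ds:5.1%}   years of life lost {ys:5.1%}")
```

With equal death counts, every cause gets the same circle in a raw-mortality graphic, while the years-of-life-lost shares differ widely. The single reference age is a simplification; DALYs and QALYs add further weighting choices that this sketch ignores.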

To be fair, this graphic was commissioned by Wellcome as “artwork” for a London exhibition, so maybe it should be judged by a different standard…

26 03 2013

First responses to DEVTA roll in

In my last post I highlighted the findings from the DEVTA trial of deworming and Vitamin A in India, noting that the Vitamin A results would be more controversial. I said I expected commentaries over the coming months, but we didn’t have to wait that long after all.

First is a BBC Health Check program that features a discussion of DEVTA with Richard Peto, one of the study’s authors. It’s for a general audience so it doesn’t get very technical, and because of that it really grated when they described this as a “clinical trial,” as that has certain connotations of rigor that aren’t reflected in the design of the study. If DEVTA is a clinical trial, then so was

Peto also says there were two reasons for the massive delay in publishing the trial: 1) time to check things and “get it straight,” and 2) that they were “afraid of putting up a trial with a false negative.” [An aside for those interested in publication bias issues: can you imagine an author with strong positive findings ever saying the same thing about avoiding false positives?!]

Peto ends by sounding fairly neutral re: Vitamin A (portraying himself in a middle position between advocates in favor and skeptics opposed) but acknowledges that with their meta-analysis results Vitamin A is still “cost-effective by many criteria.”

Second is a commentary in The Lancet by Al Sommer, Keith West, and Reynaldo Martorell. A little history: Sommer ran the first big Vitamin A trials in Sumatra (published in 1986) and is the former dean of the Johns Hopkins School of Public Health. (Sommer’s long-term friendship with Michael Bloomberg, who went to Hopkins as an undergrad, is also one reason the latter is so big on public health.) For more background, here’s a recent JHU story on Sommer’s receiving a $1 million research prize in part for his work on Vitamin A.

Part of their commentary is excerpted below, with my highlights in bold:

But this was neither a rigorously conducted nor acceptably executed efficacy trial: children were not enumerated, consented, formally enrolled, or carefully followed up for vital events, which is the reason there is no CONSORT diagram. Coverage was ascertained from logbooks of overworked government community workers (anganwadi workers), and verified by a small number of supervisors who periodically visited randomly selected anganwadi workers to question and examine children who these workers gathered for them. Both anganwadi worker self-reports, and the validation procedures, are fraught with potential bias that would inflate the actual coverage.

To achieve 96% coverage in Uttar Pradesh in children found in the anganwadi workers’ registries would have been an astonishing feat; covering 72% of children not found in the anganwadi workers’ registries seems even more improbable. In 2005—06, shortly after DEVTA ended, only 6·1% of children aged 6—59 months in Uttar Pradesh were reported to have received a vitamin A supplement in the previous 6 months according to results from the National Family Health Survey, a national household survey representative at national and state level…. Thus, it is hard to understand how DEVTA ramped up coverage to extremely high levels (and if it did, why so little of this effort was sustained). DEVTA provided the anganwadi workers with less than half a day’s training and minimal if any incentive.

They also note that the study’s funding was minimal compared to that of more rigorous studies, which may be an indication of quality. And, as a sign that there will almost certainly be alternative meta-analyses that weight the studies differently, they write:

We are also concerned that Awasthi and colleagues included the results from this study, which is really a programme evaluation, in a meta-analysis in which all of the positive studies were rigorously designed and conducted efficacy trials and thus represented a much higher level of evidence. Compounding the problem, Awasthi and colleagues used a fixed-effects analytical model, which dramatically overweights the results of their negative findings from a single population setting. The size of a study says nothing about the quality of its data or the generalisability of its findings.
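The weighting point in that excerpt can be illustrated with a small sketch. The inputs below are hypothetical (three small trials with large effects and one very large trial with a near-null effect); they are not the actual trial data. The random-effects weights use the DerSimonian-Laird method-of-moments estimate, which is one common choice and not necessarily what any of the authors would use.

```python
import math

# Hypothetical inputs: (log mortality ratio, standard error). Not real trial data.
studies = [(-0.30, 0.12), (-0.25, 0.15), (-0.35, 0.14), (-0.04, 0.04)]

def fixed_effect_weights(studies):
    """Inverse-variance weights, normalised to sum to 1."""
    w = [1 / se**2 for _, se in studies]
    total = sum(w)
    return [x / total for x in w]

def dersimonian_laird_tau2(studies):
    """Method-of-moments estimate of the between-study variance (tau^2)."""
    w = [1 / se**2 for _, se in studies]
    y = [est for est, _ in studies]
    pooled = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - (len(studies) - 1)) / c)

def random_effects_weights(studies):
    """Inverse of (within-study variance + tau^2), normalised to sum to 1."""
    tau2 = dersimonian_laird_tau2(studies)
    w = [1 / (se**2 + tau2) for _, se in studies]
    total = sum(w)
    return [x / total for x in w]

def pooled_ratio(studies, weights):
    """Weighted average on the log scale, converted back to a ratio."""
    return math.exp(sum(wt * est for wt, (est, _) in zip(weights, studies)))

fixed_w = fixed_effect_weights(studies)
random_w = random_effects_weights(studies)
print("weight of the large study, fixed-effect:  ", round(fixed_w[-1], 2))
print("weight of the large study, random-effects:", round(random_w[-1], 2))
print("pooled ratio, fixed-effect:  ", round(pooled_ratio(studies, fixed_w), 2))
print("pooled ratio, random-effects:", round(pooled_ratio(studies, random_w), 2))
```

In this toy setup the large study dominates the fixed-effect pooled estimate and pulls it toward no effect, while the random-effects model spreads weight more evenly once between-study variation is allowed for. Which model is appropriate is exactly the kind of judgment the commentaries disagree about.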

I’m sure there will be more commentaries to follow. In my previous post I noted that I’m still trying to wrap my head around the findings, and I think that’s still right. If I had time I’d dig into this a bit more, especially the relationship with the Indian National Family Health Survey. But for now I think it’s safe to say that two parsimonious explanations for how to reconcile DEVTA with the prior research are emerging:

1. DEVTA wasn’t all that rigorous and thus never achieved the high population coverage levels necessary to have a strong mortality impact; the mortality impact was attenuated by poor coverage, which is why the trial failed to show a statistically significant effect on the scale of prior results. Thus it shouldn’t move our priors all that much. (Sommer et al. seem to be arguing for this.) Or,

2. There’s some underlying change in the populations between the older studies and these newer studies that causes the effect of Vitamin A to decline — this could be nutrition, vaccination status, shifting causes of mortality, etc. If you believe this, then you might discount studies because they’re older.

(h/t to @karengrepin for the Lancet commentary.)

25 03 2013

A massive trial, a huge publication delay, and enormous questions

It’s been called the “largest clinical* trial ever”: DEVTA (Deworming and Enhanced ViTamin A supplementation), a study of Vitamin A supplementation and deworming in over 2 million children in India, just published its results. “DEVTA” may mean “deity” or “divine being” in Hindi but some global health experts and advocates will probably think these results come straight from the devil. Why? Because they call into question — or at least attenuate — our estimates of the effectiveness of some of the easiest, best “bang for the buck” interventions out there.

Data collection was completed in 2006, but the results were just published in The Lancet. Why the massive delay? According to the accompanying discussion paper, it sounds like the delay was rooted in very strong resistance to the results after preliminary outcomes were presented at a conference in 2007. If it weren’t for the repeated and very public shaming by the authors of recent Cochrane Collaboration reviews, we might not have the results even today. (Bravo again, Cochrane.)

So, about DEVTA. In short, this was a randomized 2×2 factorial trial, like so:

The results were published as two separate papers, one on Vitamin A and one on deworming, with an additional commentary piece:

The controversy is going to be more about what this trial didn’t find than about what it did: the confidence interval on the Vitamin A study’s mortality estimate (mortality ratio 0.96, 95% confidence interval of 0.89 to 1.03) is consistent with a mortality reduction as large as 11%, or an increase of as much as 3%. The consensus from previous Vitamin A studies was mortality reductions of 20-30%, so this is a big surprise. Here’s the abstract to that paper:

Background

In north India, vitamin A deficiency (retinol <0·70 μmol/L) is common in pre-school children and 2–3% die at ages 1·0–6·0 years. We aimed to assess whether periodic vitamin A supplementation could reduce this mortality.

Methods

Participants in this cluster-randomised trial were pre-school children in the defined catchment areas of 8338 state-staffed village child-care centres (under-5 population 1 million) in 72 administrative blocks. Groups of four neighbouring blocks (clusters) were cluster-randomly allocated in Oxford, UK, between 6-monthly vitamin A (retinol capsule of 200 000 IU retinyl acetate in oil, to be cut and dripped into the child’s mouth every 6 months), albendazole (400 mg tablet every 6 months), both, or neither (open control). Analyses of retinol effects are by block (36 vs 36 clusters).

The study spanned 5 calendar years, with 11 6-monthly mass-treatment days for all children then aged 6–72 months.  Annually, one centre per block was randomly selected and visited by a study team 1–5 months after any trial vitamin A to sample blood (for retinol assay, technically reliable only after mid-study), examine eyes, and interview caregivers. Separately, all 8338 centres were visited every 6 months to monitor pre-school deaths (100 000 visits, 25 000 deaths at ages 1·0–6·0 years [the primary outcome]). This trial is registered at ClinicalTrials.gov, NCT00222547.

Findings

Estimated compliance with 6-monthly retinol supplements was 86%. Among 2581 versus 2584 children surveyed during the second half of the study, mean plasma retinol was one-sixth higher (0·72 [SE 0·01] vs 0·62 [0·01] μmol/L, increase 0·10 [SE 0·01] μmol/L) and the prevalence of severe deficiency was halved (retinol <0·35 μmol/L 6% vs 13%, decrease 7% [SE 1%]), as was that of Bitot’s spots (1·4% vs 3·5%, decrease 2·1% [SE 0·7%]).

Comparing the 36 retinol-allocated versus 36 control blocks in analyses of the primary outcome, deaths per child-care centre at ages 1·0–6·0 years during the 5-year study were 3·01 retinol versus 3·15 control (absolute reduction 0·14 [SE 0·11], mortality ratio 0·96, 95% CI 0·89–1·03, p=0·22), suggesting absolute risks of death between ages 1·0 and 6·0 years of approximately 2·5% retinol versus 2·6% control. No specific cause of death was significantly affected.

Interpretation

DEVTA contradicts the expectation from other trials that vitamin A supplementation would reduce child mortality by 20–30%, but cannot rule out some more modest effect. Meta-analysis of DEVTA plus eight previous randomised trials of supplementation (in various different populations) yielded a weighted average mortality reduction of 11% (95% CI 5–16, p=0·00015), reliably contradicting the hypothesis of no effect.

Note that instead of just publishing these no-effect results and leaving the meta-analysis to a separate publication, the authors go ahead and do their own meta-analysis of DEVTA plus previous studies and report that effect (much attenuated, but still positive) in their conclusion. I think that’s a fair approach, but it also reveals that the study’s authors very much believe there are large Vitamin A mortality effects despite the outcome of their own study!
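For reference, the interval arithmetic behind reading DEVTA’s own estimate as "anywhere from an 11% reduction to a 3% increase" can be sketched in a few lines. The ratio and interval come from the abstract; the standard-error step assumes the interval is roughly symmetric on the log scale, which is a common convention but an assumption here.

```python
import math

# Mortality ratio and 95% CI as reported in the abstract.
ratio, lower, upper = 0.96, 0.89, 1.03

def percent_change(r):
    """Convert a mortality ratio into a percent change in mortality."""
    return (r - 1) * 100

# Assumption: the interval is roughly symmetric on the log scale,
# so its full width is about 2 * 1.96 standard errors.
se_log = (math.log(upper) - math.log(lower)) / (2 * 1.96)

print(f"point estimate: {percent_change(ratio):+.0f}%")
print(f"interval: {percent_change(lower):+.0f}% to {percent_change(upper):+.0f}%")
print(f"approximate SE of the log ratio: {se_log:.3f}")
```

The same conversion explains why a 20-30% reduction, which corresponds to ratios of 0.70 to 0.80, sits well outside DEVTA’s interval.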

[The only media coverage I’ve seen of these results so far comes from the Times of India, which includes quotes from the authors and Abhijit Banerjee.]

To be honest, I don’t know what to make of the inconsistency between these findings and previous studies, and am writing this post in part to see what discussion it generates. I imagine there will be more commentaries on these findings over the coming months, with some decrying the results and methodologies and others seeing vindication in them. In my view the best possible outcome is an ongoing concern for issues of external validity in biomedical trials.

What do I mean? Epidemiologists tend to think that external validity is less of an issue in randomized trials of biomedical interventions — as opposed to behavioral, social, or organizational trials — but this isn’t necessarily the case. Trials of vaccine efficacy have shown quite different efficacy for the same vaccine (see BCG and rotavirus) in different locations, possibly due to differing underlying nutritional status or disease burdens. Our ability to interpret discrepant findings can only be as sophisticated as the available data allows, or as sophisticated as allowed by our understanding of the biological and epidemiologic mechanisms that matter on the pathway from intervention to outcome. We can’t go back in time and collect additional information (think nutrition, immune response, baseline mortality, and so forth) on studies far in the past, but we can keep such issues in mind when designing trials moving forward.

All that to say, these results are confusing, and I look forward to seeing the global health community sort through them. Also, while the outcomes here (health outcomes) are different from those in the Kremer deworming study (education outcomes), I’ve argued before that lack of effect or small effects on the health side should certainly influence our judgment of the potential education outcomes of deworming.

*I think given the design it’s not that helpful to call this a ‘clinical’ trial at all – but that’s another story.

20 03 2013

Alwyn Young just broke your regression

Alwyn Young, the same guy whose paper carefully accounting for growth in East Asia was popularized by Krugman and sparked an enormous debate, has been circulating a paper on African growth rates. Here’s the 2009 version (PDF) and October 2012 version. The abstract of the latter paper:

Measures of real consumption based upon the ownership of durable goods, the quality of housing, the health and mortality of children, the education of youth and the allocation of female time in the household indicate that sub-Saharan living standards have, for the past two decades, been growing about 3.4 to 3.7 percent per annum, i.e. three and a half to four times the rate indicated in international data sets. (emphasis added)

The Demographic and Health Surveys are large-scale nationally-representative surveys of health, family planning, and related modules that tend to ask the same questions across different countries and over large periods of time. They have major limitations, but in the absence of high-quality data from governments they’re often the best source for national health data. The DHS doesn’t collect much economic data, but they do ask about ownership of certain durable goods (like TVs, toilets, etc), and the answers to these questions are used to construct a wealth index that is very useful for studies of health equity — something I’m taking advantage of in my current work. (As an aside, this excellent report from Measure DHS (PDF) describes the history of the wealth index.)

What Young has done is to take this durable asset data from many DHS surveys and try to estimate a measure of GDP growth from actually-measured data, rather than the (arguably) sketchier methods typically used to get national GDP numbers in many African countries. Not all countries are represented at any given point in time in the body of DHS data, which is why he ends up with a very-unbalanced panel data set for “Africa,” rather than being able to measure growth rates in individual countries. All the data and code for the paper are available here.

Young’s methods themselves are certain to spark ongoing debate (see commentary and links from Tyler Cowen and Chris Blattman), so this is far from settled — and may well never be. The takeaway is probably not that Young’s numbers are right so much as that there’s a lot of data out there that we shouldn’t trust very much, and that transparency about the sources and methodology behind data, official or not, is very helpful. I just wanted to raise one question: if Young’s data is right, just how many published papers are wrong?

There is a huge literature on cross-country growth empirics. A Google Scholar search for “cross-country growth Africa” turns up 62,400 results. While not all of these papers are using African countries’ GDPs as an outcome, a lot of them are. This literature has many failings which have been duly pointed out by Bill Easterly and many others, to the extent that an up-and-coming economist is likely to steer away from this sort of work for fear of being mocked. Relatedly, in Acemoglu and Robinson’s recent and entertaining take-down of Jeff Sachs, one of their insults (or rather, criticisms) is that Sachs only knows something because he’s been running “kitchen sink growth regressions.”

Young’s paper just adds more fuel to that fire. If African GDP growth has been 3 1/2 to 4 times greater than the official data says, then every single paper that uses the old GDP numbers is now even more suspect.
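To get a rough sense of how much that gap matters once it compounds, here is a back-of-the-envelope sketch. The 3.4 and 3.7 percent rates and the two-decade window come from Young’s abstract; the 1 percent "official" rate is my own assumption, chosen because it is roughly what "three and a half to four times" lower would imply.

```python
def cumulative_growth(annual_rate, years):
    """Total proportional increase after compounding annual_rate for `years` years."""
    return (1 + annual_rate) ** years - 1

YEARS = 20  # "the past two decades" in the abstract

young_low, young_high = 0.034, 0.037  # from Young's abstract
official = 0.01  # assumption: roughly what "3.5 to 4 times" lower implies

for label, rate in [("official (assumed)", official),
                    ("Young, low", young_low),
                    ("Young, high", young_high)]:
    total = cumulative_growth(rate, YEARS)
    print(f"{label:20s} {rate:.1%} per year -> {total:.0%} over {YEARS} years")
```

Under these assumptions the difference is between living standards rising by about a fifth and roughly doubling, which is why regressions using the official series as an outcome would be so sensitive to which numbers are right.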

Why we should lie about the weather (and maybe more)

Nate Silver (who else?) has written a great piece on weather prediction — “The Weatherman is Not a Moron” (NYT) — that covers both the proliferation of data in weather forecasting, and why the quantity of data alone isn’t enough. What intrigued me though was a section at the end about how to communicate the inevitable uncertainty in forecasts:

…Unfortunately, this cautious message can be undercut by private-sector forecasters. Catering to the demands of viewers can mean intentionally running the risk of making forecasts less accurate. For many years, the Weather Channel avoided forecasting an exact 50 percent chance of rain, which might seem wishy-washy to consumers. Instead, it rounded up to 60 or down to 40. In what may be the worst-kept secret in the business, numerous commercial weather forecasts are also biased toward forecasting more precipitation than will actually occur. (In the business, this is known as the wet bias.) For years, when the Weather Channel said there was a 20 percent chance of rain, it actually rained only about 5 percent of the time.

People don’t mind when a forecaster predicts rain and it turns out to be a nice day. But if it rains when it isn’t supposed to, they curse the weatherman for ruining their picnic. “If the forecast was objective, if it has zero bias in precipitation,” Bruce Rose, a former vice president for the Weather Channel, said, “we’d probably be in trouble.”

My thought when reading this was that there are actually two different reasons why you might want to systematically adjust reported percentages (i.e., fib a bit) when trying to communicate the likelihood of bad weather.

But first, an aside on what public health folks typically talk about when they talk about communicating uncertainty: I’ve heard a lot (in classes, in blogs, and in Bad Science, for example) about reporting absolute risks rather than relative risks, and about avoiding other ways of communicating risks that generally mislead. What people don’t usually discuss is whether the point estimates themselves should ever be adjusted; rather, we concentrate on how to best communicate whatever the actual values are.

Now, back to weather. The first reason you might want to adjust the reported probability of rain is that people are rain averse: they care more strongly about getting rained on when it wasn’t predicted than vice versa. It may be perfectly reasonable for people to feel this way, and so why not cater to their desires? This is the reason described in the excerpt from Silver’s article above.

Another way to describe this bias is that most people would prefer to minimize Type II error (false negatives: rain that wasn’t forecast) at the expense of having more Type I error (false positives: forecast rain that never arrives), at least when it comes to rain. Obviously you could take this too far: reporting rain every single day would completely eliminate Type II error, but it would also make forecasts worthless. Likewise, with big events like hurricanes the costs of Type I errors (wholesale evacuations, cancelled conventions, etc.) become much greater, so this adjustment becomes more problematic as the cost of false positives increases. But generally speaking, the so-called “wet bias” of adjusting all rain prediction probabilities upwards might be a good way to increase the satisfaction of a rain-averse public.
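To make the trade-off concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the seven days of forecasts and outcomes, the idea that people treat a reported 50% or more as “it will rain,” and the function name `count_errors`. It only shows the direction of the effect, not anything about real forecasts.

```python
def count_errors(days, wet_bias=0, threshold=50):
    """Count (false_positives, false_negatives) for a set of days.

    days: list of (model_pct, rained) pairs, model_pct in percentage points.
    A day counts as a "rain" call when model_pct + wet_bias >= threshold.
    The threshold is an assumption about how viewers read the forecast.
    """
    false_pos = 0
    false_neg = 0
    for model_pct, rained in days:
        predicted_rain = (model_pct + wet_bias) >= threshold
        if predicted_rain and not rained:
            false_pos += 1   # Type I: rain forecast, nice day
        elif rained and not predicted_rain:
            false_neg += 1   # Type II: ruined picnic
    return false_pos, false_neg


# Hypothetical week of (model probability, did it rain?) pairs.
week = [(10, False), (30, False), (40, True), (45, False),
        (55, True), (70, True), (20, False)]

print(count_errors(week))                # (0, 1): one unforecast rainy day
print(count_errors(week, wet_bias=10))   # (1, 0): trades it for a false alarm
print(count_errors(week, wet_bias=100))  # (4, 0): "rain every day"
```

With a modest upward shift the one unforecast rainy day disappears at the price of one false alarm; pushing the shift to the extreme removes all Type II errors but flags every dry day as rain.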

The second reason one might want to adjust the reported probability of rain, or of some other event, is that people are generally bad at understanding probabilities. Luckily though, people tend to be bad at estimating probabilities in surprisingly systematic ways! Kahneman’s excellent (if too long) book Thinking, Fast and Slow covers this at length. The best summary of these biases that I could find through a quick Google search was from Lee Merkhofer Consulting:

 Studies show that people make systematic errors when estimating how likely uncertain events are. As shown in [the graph below], likely outcomes (above 40%) are typically estimated to be less probable than they really are. And, outcomes that are quite unlikely are typically estimated to be more probable than they are. Furthermore, people often behave as if extremely unlikely, but still possible outcomes have no chance whatsoever of occurring.

The graph from that link is a helpful if somewhat stylized visualization of the same biases:

In other words, people think that likely events (in the 30-99% range) are less likely to occur than they are in reality, that unlikely events (in the 1-30% range) are more likely to occur than they are in reality, and that extremely unlikely events (very close to 0%) won’t happen at all.

My recollection is that these biases can be a bit different depending on whether the predicted event is bad (getting hit by lightning) or good (winning the lottery), and that the familiarity of the event also plays a role. Regardless, with something like weather, where most events are within the realm of lived experience and most of the probabilities lie within a reasonable range, the average bias could probably be measured pretty reliably.

So what do we do with this knowledge? Think about it this way: we want to increase the accuracy of communication, but there are two different points in the communications process where you can measure accuracy. You can care about how accurately the information is communicated from the source, or how well the information is received. If you care about the latter, and you know that people have systematic and thus predictable biases in perceiving the probability that something will happen, why not adjust the numbers you communicate so that the message — as received by the audience — is accurate?

Now, some made up numbers: Let’s say the real chance of rain is 60%, as predicted by the best computer models. You might adjust that up to 70% if that’s the reported risk that makes people perceive a 60% objective probability (again, see the graph above). You might then adjust that percentage up to 80% to account for rain aversion/wet bias.

Here I think it’s important to distinguish between technical and popular communication channels: if you’re sharing raw data about the weather or talking to a group of meteorologists or epidemiologists then you might take one approach, whereas another approach makes sense for communicating with a lay public. For folks who just tune in to the evening news to get tomorrow’s weather forecast, you want the message they receive to be as close to reality as possible. If you insist on reporting the ‘real’ numbers, you actually draw your audience further from understanding reality than if you fudged them a bit.

The major and obvious downside to this approach is that if people know this is happening, it won’t work, or they’ll be mad that you lied, even though you were only lying to better communicate the truth! One possible way of getting around this is to describe the numbers as something other than percentages: use some made-up index that sounds enough like a percentage to convince the layperson, while also being open to detailed examination by those who are interested.

For instance, we all know the heat index and wind chill aren’t the same as temperature, but rather represent just how hot or cold the weather actually feels. Likewise, we could report something like “Rain Risk” or a “Rain Risk Index” that accounts for known biases in risk perception and rain aversion. The weatherman would report a Rain Risk of 80%, while the actual probability of rain is just 60%. This would give the recipients more useful information, while also maintaining technical honesty and some level of transparency.
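The two-step adjustment from the made-up numbers above (60% to 70% for misperception, then 70% to 80% for rain aversion) can be written down as a tiny Python sketch. The function name `rain_risk_index` and both 10-point shifts are hypothetical; a real correction for misperception would presumably vary with the probability (and point downward for unlikely events) rather than being a flat shift.

```python
def rain_risk_index(model_pct, perception_shift=10, aversion_shift=10):
    """Turn a model probability (0-100) into a reported "Rain Risk" (0-100).

    Both shifts are made-up percentage-point adjustments taken from the
    60 -> 70 -> 80 example; neither is an estimated value.
    """
    if not 0 <= model_pct <= 100:
        raise ValueError("model_pct must be between 0 and 100")
    # Step 1: report a higher number so the *perceived* chance matches the model.
    corrected_for_perception = model_pct + perception_shift
    # Step 2: add the wet bias to cater to rain aversion.
    reported = corrected_for_perception + aversion_shift
    # Keep the index on a 0-100 scale.
    return max(0, min(100, reported))


print(rain_risk_index(60))                    # 80
print(rain_risk_index(60, aversion_shift=0))  # 70
```

Keeping the two shifts as separate, named parameters is what would let the curious check exactly how the index differs from the raw probability.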

I care a lot more about health than about the weather, but I think predicting rain is a useful device for talking about the same issues of probability perception in health for several reasons. First off, the probabilities in rain forecasting are much more within the realm of human experience than the rare probabilities that come up so often in epidemiology. Secondly, the ethical stakes feel a bit lower when writing about lying about the weather rather than, say, suggesting physicians should systematically mislead their patients, even if the crucial and ultimate aim of the adjustment is to better inform them.

I’m not saying we should walk back all the progress we’ve made in terms of letting patients and physicians make decisions together, rather than the latter withholding information and paternalistically making decisions for patients based on the physician’s preferences rather than the patient’s. (That would be silly in part because physicians share their patients’ biases.) The idea here is to come up with better measures of uncertainty — call it adjusted risk or risk indexes or weighted probabilities or whatever — that help us bypass humans’ systematic flaws in understanding uncertainty.

In short: maybe we should lie to better tell the truth. But be honest about it.

When randomization is strategic

Here’s a quote from Tom Yates on his blog Sick Populations about a speech he heard by Rachel Glennerster of J-PAL:

Glennerster pointed out that the evaluation of PROGRESA, a conditional cash transfer programme in Mexico and perhaps the most famous example of randomised evaluation in social policy, was instigated by a Government who knew they were going to lose the next election. It was a way to safeguard their programme. They knew the next Government would find it hard to stop the trial once it was started and were confident the evaluation would show benefit, again making it hard for the next Government to drop the programme. Randomisation can be politically advantageous.

I think I read this about Progresa / Oportunidades before but had forgotten it, and thus it’s worth re-sharing. The way in which Progresa was randomized (different areas were stepped into the program, so there was a cohort of folks who got it later than others, but all the high need areas got it within a few years) made this more politically feasible as well. I think this situation, in which a government institutes a study of a program to keep it alive through subsequent changes of government, will probably be a less common tactic than its opposite, in which a government designs an evaluation of a popular program that a) it thinks doesn’t work, b) it wants to cut, and c) the public otherwise likes, just to prove that it should be cut — but only time will tell.

16 08 2012

A misuse of life expectancy

Jared Diamond is going back and forth with Acemoglu and Robinson over his review of their new book, Why Nations Fail. The exchange is interesting in and of itself, but I wanted to highlight one passage from Diamond’s response:

The first point of their four-point letter is that tropical medicine and agricultural science aren’t major factors shaping national differences in prosperity. But the reasons why those are indeed major factors are obvious and well known. Tropical diseases cause a skilled worker, who completes professional training by age thirty, to look forward to, on the average, just ten years of economic productivity in Zambia before dying at an average life span of around forty, but to be economically productive for thirty-five years until retiring at age sixty-five in the US, Europe, and Japan (average life span around eighty). Even while they are still alive, workers in the tropics are often sick and unable to work. Women in the tropics face big obstacles in entering the workforce, because of having to care for their sick babies, or being pregnant with or nursing babies to replace previous babies likely to die or already dead. That’s why economists other than Acemoglu and Robinson do find a significant effect of geographic factors on prosperity today, after properly controlling for the effect of institutions.

I’ve added the bolding to highlight an interpretation of what life expectancy means that is wrong, but all too common.

It’s analogous to something you may have heard about ancient Rome: since life expectancy was somewhere in the 30s, the Romans who lived to be 40 or 50 or 60 were incredibly rare and extraordinary. The problem is that life expectancy (by which we typically mean life expectancy at birth) is heavily skewed by infant mortality, or deaths under one year of age. Once you get to age five you’re generally out of the woods, at least compared to the super-high mortality rates common for infants (less than one year old) and children (less than five years old). While it’s true that there were fewer old folks in ancient Roman society or, to use Diamond’s example, modern Zambian society, the difference isn’t nearly as pronounced as you might think given the differences in life expectancy.

Does this matter? And if so, why? One area where it’s clearly important is Diamond’s usage in the passage above: examining the impact of changes in life expectancy on economic productivity. Despite the life expectancy at birth of 38 years, a Zambian male who reaches the age of thirty does not just have eight years of life expectancy left — it’s actually 23 years!

Here it’s helpful to look at life tables, which show mortality and life expectancy at different intervals throughout the lifespan. This WHO paper by Alan Lopez et al. (PDF) examining mortality between 1990-9 in 191 countries provides some nice data: page 253 is a life table for Zambia in 1999. We see that males have a life expectancy at birth of just 38.01 years, versus 38.96 for females (this was one of the lowest in the world at that time). If you look at that single number you might conclude, like Diamond, that a 30-year-old worker only has ~10 years of life left. But the life expectancy for those males remaining alive at age 30 (64.2% of the original birth cohort remains alive at this age) is actually 22.65 years. Similarly, the 18% of Zambians who reach age 65, retirement age in the US, can expect to live an additional 11.8 years, despite already having lived 27 years past the life expectancy at birth.
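A quick back-of-the-envelope check in Python shows why the two readings diverge. It uses only the three male figures quoted above (38.01, 22.65, and 64.2%); the function names are my own, and the split of life expectancy at birth into “survivors to 30” and “everyone else” is just the law of total expectation, so the result is approximate given rounding in the life table.

```python
def naive_remaining_years(e0, age):
    """The mistaken reading: treat life expectancy at birth as a typical age at death."""
    return e0 - age


def mean_age_at_death_before(e0, age, survival_share, remaining_at_age):
    """Average age at death among those who die before `age`.

    Uses e0 = s * (age + e_age) + (1 - s) * mean_early, solved for mean_early,
    where s is the share of the birth cohort still alive at `age`.
    """
    survivors_mean = age + remaining_at_age
    return (e0 - survival_share * survivors_mean) / (1 - survival_share)


E0_MALE = 38.01    # life expectancy at birth, Zambian males, 1999
E30_MALE = 22.65   # remaining life expectancy for males alive at age 30
S30_MALE = 0.642   # share of the male birth cohort still alive at age 30

naive = naive_remaining_years(E0_MALE, 30)
early = mean_age_at_death_before(E0_MALE, 30, S30_MALE, E30_MALE)

print(f"Naive reading: {naive:.1f} years left at age 30")          # 8.0
print(f"Life table:    {E30_MALE:.2f} years left at age 30")       # 22.65
print(f"Mean age at death for those dying before 30: {early:.1f}") # 11.8
```

The roughly one-third of the cohort who die before 30 do so at an average age of about 12, and that is what drags the at-birth figure down to 38; it says little about how long a 30-year-old can expect to keep living.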

These numbers are still, of course, dreadful — there’s room for decreasing mortality at all stages of the lifespan. Diamond’s correct in the sense that low life expectancy results in a much smaller economically active population. But he’s incorrect when he estimates much more drastic reductions in the economically productive years that workers can expect once they reach their economically productive 20s, 30s, and 40s.

---

[Some notes: 1. The figures might be different if you limit it to “skilled workers” who aren’t fully trained until age 30, as Diamond does; 2. I’ve also assumed that Diamond is working from general life expectancy, which was around 40 years, rather than from a particular study that showed 10 years of life expectancy at age 30 for some subset of skilled workers, possibly due to high HIV prevalence (that seems possible but unlikely); 3. In these Zambia estimates, about 10% of males die before reaching one year of age, and over 17% die before reaching five years of age. By contrast, between the ages of 15-20 only 0.6% of surviving males die, and you don’t see mortality rates higher than the under-5 ones until above age 85!; and 4. Zambia is an unusual case because much of the poor life expectancy there is due to very high HIV/AIDS prevalence and mortality, which actually does affect adult mortality rates and not just infant and child mortality rates. Despite this caveat, it’s still true that Diamond’s interpretation is off.]