Learning from Evaluation when Context Matters

Lant Pritchett, Harvard Kennedy School and Center for Global Development

Presented at "Evidence on a Silver Platter: Evaluation Results for Policy Making in Development Cooperation," November 5, 2015
Four big problems with the RCT revolution (the "randomistas"):

- It is not good science: experiments without theory cannot have external validity.
- It is not a good theory of organizational learning about implementation-intensive programs (which is nearly everything).
- It is not able to focus on development, and hence is mainly useful for charity work.
- It is not based on a realistic positive model, or theory of change, of policy adoption.
Every bubble has a transition from "we" to "they": from "We have to get in on this" to "What were they thinking?"
Getting over the hype cycle about RCTs to the "slope of enlightenment"
Doing science via RCTs embedded as independent impact evaluation of projects

- Peak of inflated expectations: build RCTs into donor/NGO-financed projects, which create treatments, to look at the impact of interventions. After many RCTs are done, do a systematic review of "what works" to guide policy.
- Trough of disillusionment: if doing experiments were science, there would be a Nobel Prize for alchemy. This is not science; it is a parody of science. This cannot work (and we knew that); this does not work (we can now show that).
- Slope of enlightenment: all methods have to be more embedded in a theory that provides a notion of context, and hence invariance laws able to encompass all empirical findings.
Cannot work and does not work

- Cannot work as a method: if the "bad" studies show heterogeneity (e.g., impacts differ across contexts), then "good" studies cannot logically be expected to reduce that heterogeneity.
- Does not work, on the evidence: Eva Vivalt's survey of 600 RCT studies; Evans on education; Bold et al. on replication; Pritchett and Sandefur (2015).
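The "cannot work" logic above can be illustrated with a toy simulation. All numbers here are invented for illustration (they are not from Vivalt or any cited study): if the true impacts themselves vary across contexts, then replacing biased observational studies with internally valid RCTs leaves the cross-study spread of estimates nearly unchanged, because that spread comes from the true effects, not from the biases.

```python
import random
import statistics

# Toy simulation: cross-context heterogeneity in *true* effects survives
# even when every individual study is internally valid.
rng = random.Random(42)
N_CONTEXTS = 600  # roughly the scale of Vivalt's sample of evaluations

# Hypothetical true program impacts differ across contexts (mean 0.20, sd 0.20)
true_effects = [rng.gauss(0.20, 0.20) for _ in range(N_CONTEXTS)]

# "Bad" observational studies: true effect + context-specific bias + noise
ols = [t + rng.gauss(0.05, 0.10) + rng.gauss(0, 0.05) for t in true_effects]

# "Good" RCTs: unbiased in every context, but facing the same true effects
rct = [t + rng.gauss(0, 0.05) for t in true_effects]

print("sd of OLS estimates:", round(statistics.stdev(ols), 3))
print("sd of RCT estimates:", round(statistics.stdev(rct), 3))
# Both spreads stay close to the sd of the true effects (0.20): fixing each
# study's internal validity does not shrink the cross-context heterogeneity.
```

The design choice is the point: the only thing the "good" studies remove is the per-study bias term, so the variance across studies can fall at most by the bias variance, never below the variance of the true effects themselves.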
Cannot work: no claim to external validity is coherent, because the gap between observational and RCT results is itself the result of behavior.

[Figure: distribution of the impact on test scores of reduced class size from non-experimental (OLS) studies, spanning a typical 2σ range around zero (e.g., β_OLS(j) = 0, β_OLS(i) = 0.2, β_OLS(k) = 0.4), against a single "gold standard" RCT estimate, β_RCT(i) = 0.3, from one specific context (country, region, grade, range of class sizes).]
Does not work: "rigorous evidence" isn't

- Vivalt (2015) shows that evaluations of the impacts of similar programs have very high variance.
- Bold et al. show that "context" includes the implementing organization.
- Evans shows that systematic reviews are not so systematic.
- Pritchett and Sandefur (2015) show that OLS from the same context beats an RCT from another context in predicting impact.
- Muralidharan (2015) shows that even lots of estimated zero impacts do not reveal why the programs failed.
Not a good model of how organizations actually learn to do things better

Forcing organizations to acknowledge failure:

- Peak of inflated expectations: introduce rigorous independent evaluation into programs, and organizations (donor and other) will learn "what works."
- Trough of disillusionment: organizations resist evaluations that come anywhere near core beliefs and practices; without organizational cooperation, "what works" gets subverted.
- Slope of enlightenment: learning during implementation can be enhanced by working with implementing organizations, but only at the expense of the independence of the evaluation.
Example: cameras in classrooms

- Duflo, Hanna, et al. show that cameras in the classrooms of a small NGO in Rajasthan work to increase teacher attendance and raise the learning performance of students.
- Duflo et al. (2008) show that introducing better biometrics (and pay incentives) in Rajasthan reduced the presence of ANMs in health sub-centers.
- Dhaliwal and Hanna (2013) show that biometrics introduced in Karnataka, India to increase attendance did not change the attendance of doctors, and reduced PHC use.
A study of implementing biometrics to track the attendance of medical personnel at PHCs in Karnataka (Dhaliwal and Hanna 2013)

- Biometrics were used to track attendance in treatment PHCs, with some threat of docking pay for days missed.
- Tracked attendance of doctors was around 30 percent.
- The program made patient perceptions worse: treatment impacts on perceived staff availability, staff quality, and knowledge of entitlements were all negative (roughly 0 to -0.2).
- The program had positive effects on health status because patients used the PHC less: deliveries shifted away from the PHC toward large hospitals (public or private).

[Figures: treatment impact on patient perceptions; percentage change in place of delivery. Source: Dhaliwal and Hanna 2013.]
Learning when the fitness function is rugged and contextual over a hyperdimensional design space
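The slide's metaphor can be pictured with a toy model. Everything below is invented for illustration (an NK-style landscape, not a model from the talk): a program "design" is a vector of binary choices, its payoff depends on context-specific interactions among those choices, and an organization learns by crawling the design space through local experimentation. The design crawled to in one context need not score well in another, which is exactly why a single impact estimate from one context travels so badly.

```python
import random

# Toy rugged, contextual fitness function over a high-dimensional design
# space. Entirely hypothetical: each context gets its own random table of
# pairwise interactions, making the landscape rugged and context-specific.

DIMENSIONS = 10  # each program design is 10 binary choices

def make_fitness(context_seed):
    """Return a context-specific fitness function (NK-style landscape)."""
    rng = random.Random(context_seed)
    # Random payoff for each (choice index, own value, neighbor value) triple
    table = {(i, a, b): rng.random()
             for i in range(DIMENSIONS) for a in (0, 1) for b in (0, 1)}
    def fitness(design):
        return sum(table[(i, design[i], design[(i + 1) % DIMENSIONS])]
                   for i in range(DIMENSIONS))
    return fitness

def hill_climb(fitness, steps=200, seed=0):
    """Crawl the design space by local experimentation (single-bit flips)."""
    rng = random.Random(seed)
    design = [rng.randint(0, 1) for _ in range(DIMENSIONS)]
    for _ in range(steps):
        candidate = design.copy()
        candidate[rng.randrange(DIMENSIONS)] ^= 1  # tweak one design choice
        if fitness(candidate) > fitness(design):   # keep only improvements
            design = candidate
    return design

f_a, f_b = make_fitness("context A"), make_fitness("context B")
best_a = hill_climb(f_a)
# The design tuned to context A need not do well in context B:
print("Design tuned to A, scored in A:", round(f_a(best_a), 2))
print("Same design, scored in B:     ", round(f_b(best_a), 2))
```

Note what the sketch implies for method: finding a good design here takes many cheap within-context experiments, not one expensive independent evaluation of a single fixed design.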
Not a good model of how to do better development (as opposed to charity, which mitigates the consequences of the lack of development)

Evaluating what it is possible to evaluate:

- Peak of inflated expectations: RCTs would improve evidence and lead to better development outcomes.
- Trough of disillusionment: most program evaluations want large N to generate statistical power, and hence focus on individualized treatments (e.g., CCTs, deworming, livelihood programs). But most deep causes of development are ontologically at the social/political/economic level, not the individual level, so national development ultimately matters more for all outcomes than program design.
- Slope of enlightenment: RCTs are only a small part of the development agenda.
Trillion-dollar questions versus million-dollar questions

- The gain from India's growth performance after an incipient crisis in 1991 (for which the policy response appeared to matter, and the policy response was influenced by donor action) cumulatively added trillions of dollars to Indian output.
- The latest Science magazine paper shows a replication of an approach to livelihood programs in six countries, with gains in per capita consumption of PPP$54 per year, and that is gross, not net. Even applied to all billion extreme poor on the planet, that is a gross (not net) gain of $54 billion per year.
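The contrast in magnitudes can be made explicit with a back-of-envelope calculation. The figures below are the slide's own round numbers (PPP$54 per person per year, about a billion extreme poor, and "trillions" from India's post-1991 growth, taken here as a $1 trillion floor), not new estimates.

```python
# Back-of-envelope contrast between the two kinds of questions.
# All figures are the slide's own round numbers, not new estimates.

per_capita_gain = 54              # PPP$ per person per year, gross
extreme_poor = 1_000_000_000      # roughly a billion extreme poor

# Most optimistic program gain: every extreme poor person gets the gross gain
program_gain = per_capita_gain * extreme_poor   # $54 billion per year

# "Trillions" from India's growth: take $1 trillion as a conservative floor
india_growth_gain = 1_000_000_000_000

print(f"Program upper bound: ${program_gain / 1e9:.0f} billion per year")
print(f"Growth gain / program gain: {india_growth_gain / program_gain:.1f}x")
```

Even with the program gain counted gross and applied universally, and the growth gain floored at a single trillion, the growth question is more than an order of magnitude larger.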
Not based on a realistic positive model of policy adoption

Political economy of policy making and policy adoption:

- Peak of inflated expectations: by providing unambiguously rigorous and easy-to-understand information to policy makers, they will adopt new programs.
- Trough of disillusionment: many government mistakes are due to deeply embedded ideas and interests; the idea that policy makers are simply waiting for new evidence is a fanciful view of the political process.
- Slope of enlightenment: RCTs can be built into the experimentation and scaling of policies and programs that policy makers otherwise want to adopt, but this is more experiential learning than impact evaluation.
Don't get me wrong

There should be enormously more, not less, use of randomization. But achieving this requires not RCTs used for independent evaluation of impact (from inputs to outcomes), but making randomization inside organizations possible: exploring a hyper-dimensional and rugged design space, using within-treatment variation to achieve organizational goals (with some small use of impact evaluation).