Evaluating the Evaluators

December 20, 2009

Balancing the personal and the statistical in teacher assessments

It is no secret that finding and rewarding high-quality teachers is key to addressing America’s educational shortcomings. But how governments should assess and improve teacher quality eludes decision-makers. There are two primary modes of teacher evaluation: the personal-evaluation model and the statistical, test-based model. Each model has serious problems when used on its own, but each also has a role to play in overall educational improvement. School districts would be best served by using statistical models to diagnose teacher underperformance, and then using personal-evaluation methods to treat problem cases.
Problems with Personal Evaluations
Personal observation of teachers by their principals has the virtue of providing immediate feedback. But an ineffective or unmotivated principal may offer useless or even harmful evaluations, and the method is time-consuming, adding inefficiency to a public education system already fraught with shortcomings. A recent innovation is to use peer-review panels instead, in which veteran teachers observe and evaluate their new or underperforming colleagues. But the benefits and drawbacks of that approach mirror those of principal-led evaluations: personal evaluations are effective only if they are unbiased, and in practice they rarely single anyone out. A recent study by The New Teacher Project, a teacher-training organization, found that “less than 1 percent of teachers receive unsatisfactory ratings, even in schools where students fail to meet basic academic standards, year after year.”
There is a fundamental problem with having the same cohort of administrators and veterans both evaluate and treat underperforming teachers, because the burden of treatment can distort the evaluation. For instance, lower standards might be set in order to minimize the number of teachers deemed underperforming, thereby reducing the amount of time and money needed for retraining. Moreover, Frederick M. Hess, director of education policy studies at the American Enterprise Institute, told the HPR that personal evaluation “requires institutional arrangements, protocols, and established routines that buffer personal relationships.” Underperforming schools are less likely to meet these requirements, and so personal-evaluation methods are not likely to help them break the cycle of underachievement.
Problems with Statistical Evaluations
At the same time, the idea of assessing teachers based on raw standardized test scores has largely been discredited. There are too many confounding variables within a student population to make test scores alone an effective standard. For example, a teacher who facilitated a 10-point improvement from an already high-achieving student and one who brought the same bump from a low-achieving student would be rated equally under this system. But the low-achieving student probably began with weaker study skills, and her teacher deserves recognition for building foundational habits.
For this reason, an increasingly popular evaluative tool is the value-added model (VAM). VAMs account for factors such as students’ backgrounds and previous performance, and compare students’ formula-predicted scores to the scores they actually achieve. VAMs are now tied to merit-pay bonuses in the Houston Independent School District and will be incorporated into the New York City Department of Education’s Teacher Data Reports. Nevertheless, VAMs can produce seemingly random results and therefore risk alienating teachers. Ellen Viruleg, a student at the Harvard Graduate School of Education, told the HPR, “teachers in the Houston school district are now seeing [merit pay] as a lottery because the results of the VAM are so unpredictable.”
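To make the predicted-versus-actual comparison concrete, here is a minimal sketch in Python. It is an illustration only, not the model used in Houston or New York: it assumes a single predictor (each student's prior-year score), fits one straight line across all students, and treats a teacher's value added as the average gap between her students' actual and predicted scores. The function names, the teacher labels, and every number in it are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def value_added(records):
    """records: iterable of (teacher, prior_score, actual_score).

    Returns, for each teacher, the mean of (actual - predicted) scores,
    where predictions come from one line fitted to all students.
    """
    records = list(records)
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    gaps = defaultdict(list)
    for teacher, prior, actual in records:
        predicted = a + b * prior
        gaps[teacher].append(actual - predicted)
    return {teacher: mean(g) for teacher, g in gaps.items()}


# Hypothetical data: two teachers whose students start from the same scores.
records = [
    ("A", 50, 62), ("A", 60, 71), ("A", 70, 80),
    ("B", 50, 55), ("B", 60, 64), ("B", 70, 74),
]
scores = value_added(records)
for teacher, score in sorted(scores.items()):
    print(teacher, round(score, 2))
```

In this toy data, teacher A's students beat their predicted scores and teacher B's fall short, so A's estimate is positive and B's is negative. With only a few students per teacher, a single unusual result can swing the average, which is one way to understand why estimates of this kind can look unpredictable from year to year.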
A Composite Approach
Neither personal nor statistical evaluations, then, can suffice on their own. But used in a targeted way, each can effectively supplement the other. First, statistical models should be used to diagnose teacher underperformance. We should have more confidence in the ability of statisticians to develop impartial evaluative measures than in the ability of principals and teachers to distance themselves from school politics. Andrew Rotherham, co-founder of the think tank Education Sector, told the HPR that “while [VAMs] are not as bulletproof as you might believe from the rhetoric, they’re closer to usable than you hear from the critics.”
Still, there is a place for personal methods in teacher-improvement efforts. Once a statistical model has identified an underperforming teacher, it should be up to a peer-review panel or principal to mentor that teacher and raise his or her performance. This division of labor should minimize the potential for bias, while employing personal methods where they can be most effective: in aiding, rather than simply identifying, underperforming teachers. Getting better teachers, then, is partly a matter of figuring out the right way to decide what “better” really means.