What can course evaluations say about teaching?
April 27, 2021 | Claudia Stanny
A new chair sits down to review faculty for annual evaluations. The average score in the department for the “overall evaluation of teaching” question is 4.4 (on a 1-5 scale, with 5 being the highest rating). Two instructors expressed interest in being nominated for a new college-level teaching award established this year by the Dean, which is limited to one nomination per department. Dr. Research Methods has a score of 4.1 on the “overall” question; Dr. Current Issues has a score of 4.5. Who should the chair nominate?
Many chairs and many faculty members might argue that the chair should nominate Dr. Current Issues, who received an above-average score on the “overall” question and is therefore a stronger candidate than Dr. Research Methods, who received a score lower than the department mean on this question. However, this decision might be a classic example of a well-documented judgment error in which decision-makers falsely believe that all numeric differences represent meaningful differences (e.g., Tversky & Kahneman, 1971). In addition, the chair may have failed to consider other factors that frequently generate differences in course ratings, such as the difficulty of the course and possible biases associated with rating instructors of different genders or ethnic backgrounds.
Why are course evaluations difficult to interpret?
A large and growing literature raises a variety of questions about measurement bias in course ratings and the appropriate interpretation of ratings on end-of-course evaluations from students. Recently, Kreitzer and Sweet-Cushman (2021) published a meta-analysis of over 100 peer-reviewed articles that examined the role of various sources of bias that can influence course ratings, independently of course quality or teacher skill. Similarly, Linse (2017) reviewed the research literature on course evaluations and recommended caution in interpreting these ratings as evidence of teaching skill or student learning.
Boysen (2015) reports evidence that faculty over-interpret small numeric differences in course ratings, treating them as meaningful rather than as the product of random variation. When Boysen asked faculty to interpret course evaluation data, his reviewers frequently underestimated the underlying variability of course ratings. They persisted in this decision error even after Boysen warned them that small differences were most likely explained by random variation.
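To see why a gap such as 4.1 versus 4.5 may not be meaningful, it can help to put a rough margin of error around the difference. The sketch below is only an illustration: the two means come from the opening scenario, but the standard deviation (0.9) and the number of respondents per course (20) are assumed values, and the function name is hypothetical. It uses a simple normal approximation rather than any method from the cited studies.

```python
import math
from statistics import NormalDist


def mean_difference_interval(mean_a, mean_b, sd_a, sd_b, n_a, n_b, confidence=0.95):
    """Normal-approximation confidence interval for (mean_b - mean_a)."""
    diff = mean_b - mean_a
    # Standard error of the difference between two independent means.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    # Two-sided critical value (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


# Means from the scenario; SD = 0.9 and n = 20 per course are ASSUMED values.
low, high = mean_difference_interval(4.1, 4.5, sd_a=0.9, sd_b=0.9, n_a=20, n_b=20)
print(f"95% interval for the difference: {low:.2f} to {high:.2f}")
```

Under these assumed values the interval runs from slightly below zero to nearly a full scale point, so the data would be compatible with no real difference between the two instructors. Larger classes or less variable ratings would narrow the interval.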
In addition to the problem of interpreting small numeric differences as meaningful indicators of teaching quality, factors irrelevant to teaching quality can contribute to variation in SAI (student assessment of instruction) ratings. People underestimate how much variability is associated with irrelevant and random factors. Curby et al. (2019) estimated the amount of variation produced by instructor characteristics, course characteristics, and “occasion” characteristics (a combination of semester, cohort of students, time of day, and similar attributes of a course section offered in a particular term to a collection of students). They found that these random factors explained a larger percentage of differences in course ratings than any other predictive component. Instructor characteristics (which would include teaching skill as well as irrelevant personal factors such as age, gender, ethnicity, and attractiveness) accounted for only 22.9% of the variance in course ratings. Moreover, they found that the same instructor was rated more highly when teaching one course than when teaching a different course (e.g., a research methods course versus a popular elective in the major) or when teaching different cohorts of students in the same course. The take-home message from this analysis is that ratings on course evaluations must be interpreted with caution. The same instructor will be rated differently when teaching different courses, when teaching different cohorts of students, when teaching at different times of day, when teaching in different modalities (online versus face-to-face), and when teaching in different terms.
Suggestions for the effective evaluation of teaching
Individuals who are charged with the evaluation of teaching quality must look to additional sources of evidence to interpret and verify the evidence suggested by course evaluation ratings. Researchers offer several suggestions to chairs and others asked to evaluate the quality of teaching (Kreitzer & Sweet-Cushman, 2021; Linse, 2017).
- Use course evaluations to give students a voice about their learning experiences. Student Assessments of Instructions (SAIs), despite their name, do not evaluate teaching directly. Instead, SAIs provide student perceptions about how they experience teaching and learning in a course. As such, SAIs bring the student voice into the evaluation process and provide potentially useful formative feedback.
- Pay attention to response rates. You can trust information from a large group of students more than information from a few students. Low response rates increase the probability that the sample does not accurately represent the population of students enrolled in the class.
- Do not rely on global rating questions (e.g., the overall rating of the course). Global rating questions are susceptible to the “halo effect,” in which biases about personal characteristics color judgments about other characteristics (e.g., raters will rate an attractive individual as more honest or more intelligent, even when the only information is a photo). Global ratings are less useful than aggregated scores based on specific questions about concrete teaching behaviors.
- Treat comments with caution. A small collection of comments is frequently contradictory. Avoid being persuaded by one or two memorable comments. Interpret comments in the context of overall findings. Look for patterns of repeated themes, including themes that emerge across multiple courses or multiple terms.
- Seek multiple sources of evidence. The student voice has a place in the evaluation of teaching, but students are not experts on teaching. Look for additional sources of evidence that illustrate how an instructor designs and manages a course. Examples include course syllabi, handouts and instructions for assignments and projects, grading rubrics, sample exams, and in-class observations of teaching.
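The advice above about response rates can be made concrete with a rough calculation. The sketch below estimates the margin of error around a class's mean rating for different numbers of respondents, using a finite population correction because the class itself is small. The class size (40), the standard deviation (0.9), and the function name are assumptions chosen for illustration. The calculation also assumes respondents are a random sample of the class, which is often not true in practice, so it understates the uncertainty when the students who respond differ systematically from those who do not.

```python
import math
from statistics import NormalDist


def rating_margin_of_error(class_size, respondents, sd=0.9, confidence=0.95):
    """Approximate margin of error for a mean rating from a partial sample of a class.

    ASSUMES respondents are a simple random sample of enrolled students.
    """
    if not 1 <= respondents <= class_size:
        raise ValueError("respondents must be between 1 and class_size")
    if class_size == 1:
        return 0.0  # the single student responded; no sampling error
    # Finite population correction: shrinks to 0 when everyone responds.
    fpc = math.sqrt((class_size - respondents) / (class_size - 1))
    se = sd / math.sqrt(respondents) * fpc
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * se


# Hypothetical class of 40 students at three response rates.
for n in (8, 20, 32):
    margin = rating_margin_of_error(class_size=40, respondents=n)
    print(f"{n / 40:.0%} response rate: mean rating +/- {margin:.2f}")
```

With these assumed values, a 20% response rate leaves the mean rating uncertain by more than half a scale point, while an 80% response rate narrows that to well under a quarter point.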
References
Boysen, G. A. (2015). Significant interpretation of small mean differences in student evaluation of teaching: An evaluation of warning effectiveness. Scholarship of Teaching and Learning in Psychology, 1, 150-162. http://doi.org/10.1037/stl0000042
Curby, T., McKnight, P., Alexander, L., & Erchov, S. (2019). Sources of variance in end-of-course student evaluations. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2019.1607249
Kreitzer, R. J., & Sweet-Cushman, J. (2021). Evaluating student evaluations of teaching: A review of measurement and equity bias in SETs and recommendations for ethical reform. Journal of Academic Ethics. https://doi.org/10.1007/s10805-021-09400-w
Linse, A. R. (2017). Interpreting and using student ratings data: Guidance for faculty serving as administrators and on evaluation committees. Studies in Educational Evaluation, 54, 94-106. https://doi.org/10.1016/j.stueduc.2016.12.004
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110. http://doi.org/10.1037/h0031322