Inter-rater reliability is a way of assessing the level of agreement between two or more judges (aka raters).
Observation research often involves two or more trained observers making judgments about specific observed behaviors, and researchers would like to know if they agree with each other or not.
The greater the level of agreement, the higher the inter-rater reliability of the ratings, and the more confidence we can place in the study's observations.
Types of Inter-Rater Reliability
There are two common methods of assessing inter-rater reliability: percent agreement and Cohen’s Kappa.
- Percent agreement involves simply tallying the percentage of times two raters agreed. This number will range from 0 to 100. The closer to 100, the greater the agreement.
- Cohen’s Kappa is similar to percent agreement, but the formula takes into account that raters will sometimes agree with each other purely by chance. Therefore, it is considered a more rigorous assessment of inter-rater reliability. The formula usually yields a number between 0 and 1, where 0 means agreement no better than chance and 1 means perfect agreement; the closer to 1, the greater the level of agreement. Negative values (down to -1) are possible when raters agree less often than chance would predict.
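The two statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the ten ratings are made up for the example, and kappa is computed with the standard two-rater formula, (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Percentage of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b) / 100.0
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same category
    # if each labels independently at their own observed base rates.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical ratings of ten couples (illustrative only, not data from any study).
observer_1 = ["affectionate", "affectionate", "neutral", "distant", "neutral",
              "affectionate", "distant", "neutral", "affectionate", "neutral"]
observer_2 = ["affectionate", "neutral", "neutral", "distant", "neutral",
              "affectionate", "distant", "affectionate", "affectionate", "neutral"]

print(percent_agreement(observer_1, observer_2))        # share of matching labels
print(round(cohens_kappa(observer_1, observer_2), 2))   # chance-corrected agreement
```

With these made-up ratings the observers match on 8 of 10 couples (80 percent agreement), while kappa comes out at about 0.69, because some of those matches would be expected by chance alone.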
Inter-Rater Reliability Examples
- Grade Moderation at University – Experienced teachers grading the essays of students applying to an academic program
- Observational Research Moderation – Observing the interactions of couples in a shopping mall while two observers rate their behaviors in terms of affectionate, neutral, or distant
- Judges Comparing Notes in a Sporting Event to Moderate Results – Assessing the degree of agreement among the judges of a DanceSport competition (Premelč et al., 2019)
- Getting Outsider Expert Reviews of New Exams – Asking experienced math teachers to rate the level of difficulty of questions on a new exam
- Cross-Referencing with Experts – Asking subject matter experts (SMEs) to look at a new measure of perceived motor skills (PMS) of youth with visual impairments (YVI) and rate its level of face validity (Stribing et al., 2021)
- Comparing Scores on Two Similar Tests – Scoring the performance of new bus drivers on a VR test course that simulates driving conditions
- Experienced and Inexperienced Professionals Comparing Notes – Asking experienced nursing professionals to score the performance of new nurses taking part in a series of simulated medical emergencies (Hinton et al., 2017)
- Experienced Professionals Rating Inexperienced Colleagues – Experienced paramedics rating the ability of trainees to perform CPR in a first-aid course
- Multiple Administrators Evaluating their Staff – School administrators observing and evaluating the teaching demo of a new teacher
- Multiple Teachers Comparing Notes – Teachers’ ratings regarding the quality of essays written by EFL learners (Daller & Phelan, 2007)
1. The Ainsworth Strange Situation Test
Dr. Mary Ainsworth developed a laboratory method of assessing the attachment style of very young children. The Strange Situation Test consists of 8 scenarios lasting a few minutes each that present the child with mildly stressful predicaments.
The behavior of the child is watched and rated by trained observers seated behind a two-way mirror. They rate the child’s actions in each scenario according to a pre-defined set of criteria, on which they have received extensive training.
For example, in one scenario, the mother returns to the room where she left the child. The observers rate the child’s actions upon the mother’s return in terms of affective sharing or avoidance of proximity.
As reported by Simonelli and Parolin (2016), who refer to the test as the Strange Situation Procedure (SSP), “Inter-rater agreement for SSP is high, especially among within-laboratory researchers and in a lesser extent but still reassuring when inter-laboratory rates are examined” (p. 4).
2. Coding the Linguistic Patterns of Parent/Child Interactions
Understanding the factors involved in linguistic development can give researchers and educators valuable insights into one of the most important skills a person can acquire. Verbal skills play a key role in academic and career success over the entire lifespan.
This is why a large volume of research has been devoted to this area of study. Among the various methodologies employed, observing the interactions between parent and child during the early years yields the most data, although it is incredibly time-consuming and challenging.
Researchers will either have trained observers record behavior in-home, or ask the primary caregiver and child to come to the lab on campus. As the parent and child engage in various semi-structured activities, their behaviors will be monitored and scored by trained observers.
To make sure those scores are reliable, the researchers will assess the inter-rater reliability of those ratings. Depending on the number of raters and other facets of the study, either a percent agreement or Kappa statistic will be calculated.
3. Bandura Bobo Doll Study
One of the most influential sets of studies in psychology was conducted in the 1960s by Dr. Albert Bandura and his colleagues. The basic version of these studies involved having children watch an adult, either in person or on film, behave aggressively or non-aggressively towards a Bobo doll. Then the children were observed as they played freely in a separate room with a Bobo doll.
In the Bandura et al. (1961) study, children were observed in their nursery school before participating in the study. Two trained judges rated the children’s behavior on four dimensions using a 5-point scale: physical aggression, verbal aggression, aggression toward inanimate objects, and aggressive inhibition.
To assess inter-rater reliability, a correlation was calculated on the combined aggression scores. “The reliability of the composite aggression score, estimated by means of the Pearson product-moment correlation, was .89.” A correlation this high indicates that the two judges rated the children very consistently, which gives us more confidence in the results and internal validity of the study.
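As a rough sketch of how such a correlation is obtained, the Python snippet below computes a Pearson product-moment correlation between two judges' composite scores. The scores and variable names are hypothetical and are not Bandura et al.'s data; only the formula is standard.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    # Sum of cross-products of deviations, scaled by the spread of each list.
    covariation = sum(a * b for a, b in zip(dev_x, dev_y))
    denom = sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    if denom == 0:
        raise ValueError("correlation is undefined when a rater's scores never vary")
    return covariation / denom

# Hypothetical composite aggression scores for eight children (illustrative only).
judge_1 = [4, 7, 9, 12, 6, 15, 10, 8]
judge_2 = [5, 6, 10, 11, 7, 14, 12, 8]
print(round(pearson_r(judge_1, judge_2), 2))

# A judge who scores every child exactly 2 points higher than judge_1.
judge_3 = [score + 2 for score in judge_1]
print(round(pearson_r(judge_1, judge_3), 2))
```

The second comparison is included to show one property of this statistic: a judge who is systematically stricter or more lenient by a fixed amount still correlates perfectly with the other judge, so a high correlation shows that raters order the children similarly rather than that they assign identical scores.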
4. Judging the Reliability of Judges at a Tasting Competition
The tasting industry is fiercely competitive. Winning a prestigious award can have substantial economic implications for a food or beverage company. Given that there is so much on the line at these sophisticated events, it is surprising to learn that there is reason to doubt the credibility of the judging.
In the words of Hodgson (2008), “Why is it that a particular [beverage] wins a Gold medal at one competition and fails to win any award at another?” (p. 105).
To assess the reliability of judges’ ratings, panels of four judges each tasted a flight of 30 beverages that included replicate samples, using entries in a California competition from 2005 to 2008. Between 65 and 70 judges participated in the study each year, and rated the beverages on the same scale used in the competitions.
The results indicated that in fewer than half of the judge panels, beverage quality was the determining factor in the ratings. Moreover, only about 10 percent of the judges were able to replicate the ratings they had actually given during the competition in which they judged that beverage.
These results suggest that the inter-rater reliability of judges’ evaluations during beverage tasting competitions is quite low.
5. Judging Synchronized Swimming
In synchronized swimming competitions, performance is rated by a panel of judges. More than 20 judges can be involved in evaluating the quality of the routines. That’s a lot of judges. Given the importance of their evaluations, it is vital to have confidence in their assessments.
This was the purpose of a study conducted by Ponciano et al. (2017). First, they video recorded the routines of three well-trained synchronized swimmers. Then, their performance was evaluated by ten qualified judges with at least ten years of experience at the national and international level.
Inter-rater reliability was assessed by calculating Cronbach’s alpha on ratings made on two separate occasions, seven days apart. The results revealed a high level of agreement among the raters at T1 (0.85) and T2 (0.83). The researchers conclude that “The content of the video was interpreted almost the same way by the 10 evaluators and allowed evaluation consistency after 7 days” (p. 185).
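For readers curious how an alpha coefficient can be computed across judges, here is a minimal Python sketch that treats each judge as an "item" in the usual Cronbach's alpha formula, k / (k - 1) × (1 - sum of judge variances / variance of total scores). The three judges and five scores are hypothetical, and the sketch is an assumption about the general approach, not a reconstruction of Ponciano et al.'s analysis.

```python
from statistics import variance

def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha, treating each rater as an 'item'.

    ratings_by_rater: one inner list per rater, each holding that rater's
    scores for the same performances in the same order.
    """
    k = len(ratings_by_rater)
    n = len(ratings_by_rater[0]) if k else 0
    if k < 2 or n < 2 or any(len(r) != n for r in ratings_by_rater):
        raise ValueError("need two or more raters scoring the same two or more performances")
    # Sum of each rater's sample variance across performances.
    sum_rater_variances = sum(variance(r) for r in ratings_by_rater)
    # Variance of the per-performance totals (scores summed over raters).
    totals = [sum(scores) for scores in zip(*ratings_by_rater)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores never vary")
    return (k / (k - 1)) * (1 - sum_rater_variances / total_variance)

# Hypothetical scores from three judges for five routines (illustrative only).
judge_a = [7, 8, 6, 9, 7]
judge_b = [6, 8, 6, 9, 7]
judge_c = [7, 9, 5, 9, 8]

print(round(cronbach_alpha([judge_a, judge_b, judge_c]), 2))
```

Alpha rises toward 1 as the judges' scores move up and down together across routines, which is why values such as 0.85 and 0.83 are read as high consistency.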
This study demonstrates the reliability of judges’ ratings of synchronized swimming and the utility of using video as a training tool.
Conclusion
Psychological research often relies on the assessments of trained observers. However, people naturally have varied opinions on what they see, which can lead to questions regarding the internal validity of research.
Therefore, before data collection commences, raters are trained extensively in what to look for and how to categorize those observations. After the data has been collected, those ratings are then subjected to statistical analyses to determine the degree of agreement. If the raters are consistent with each other in their judgments, then inter-rater reliability will be high.
References
Ainsworth, M. D. S., Blehar, M., Waters, E., & Wall, S. (1978). Patterns of attachment: A psychological study of the Strange Situation. Hillsdale: Erlbaum.
Bandura, A., Ross, D., & Ross, S. A. (1961). Transmission of aggression through imitation of aggressive models. Journal of Abnormal and Social Psychology, 63, 575-582.
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement (6th ed.). New York: McGraw-Hill.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391-418.
Hinton, J., Mays, M., Hagler, D., Randolph, P., Brooks, R., DeFalco, N., Kastenbaum, B., & Miller, K. (2017). Testing nursing competence: Validity and reliability of the nursing performance profile. Journal of Nursing Measurement, 25(3), 431. https://doi.org/10.1891/1061-3749.25.3.431
Hodgson, R. (2008). An examination of judge reliability at a major U.S. wine competition. Journal of Wine Economics, 3, 105-113. https://doi.org/10.1017/S1931436100001152
Ponciano, K., Fugita, M., Figueira Junior, A., da Silva, C., Meira Jr, C., & Bocalini, D. (2017). Reliability of judge’s evaluation of the synchronized swimming technical elements by video. Revista Brasileira de Medicina do Esporte, 24. https://doi.org/10.1590/1517-869220182403170572
Premelč, J., Vučković, G., James, N., & Leskošek, B. (2019). Reliability of Judging in DanceSport. Frontiers in Psychology, 10. https://doi.org/10.3389/fpsyg.2019.01001
Simonelli, A., & Parolin, M. (2016). Strange Situation Test. In V. Zeigler-Hill & T. K. Shackelford (Eds.), Encyclopedia of Personality and Individual Differences (pp. 1-4). https://doi.org/10.1007/978-3-319-28099-8_2043-1
Solomon, J., & George, C. (2016). The measurement of attachment security and related constructs in infancy and early childhood. In J. Cassidy, & P. R. Shaver (Eds.), Handbook of attachment: Theory, research, and clinical applications (3rd ed., pp. 366-396). New York: Guilford Press.
Stribing, A., Stodden, D., Monsma, E., Lieberman, L., & Brian, A. (2021). Content/face validity of motor skill perception questionnaires for youth with visual impairments: A Delphi method. British Journal of Visual Impairment, 1-9. https://doi.org/10.1177/0264619621990687
Daller, M., & Phelan, D. (2007). What is in a teacher’s mind? The relation between teacher ratings of EFL essays and different aspects of lexical richness. Cambridge University Press. https://doi.org/10.1017/CBO9780511667268.016