Criterion validity is a type of validity that examines whether scores on one test are predictive of performance on another.
For example, if employees take an IQ test, the boss would like to know if this test predicts actual job performance.
- If an IQ test does predict job performance, then it has criterion validity.
- If an IQ test does not predict job performance, then it does not have criterion validity.
To make that determination, a correlation is calculated between the IQ scores and a measure of job performance.
The higher the value of the correlation, the stronger the relation between the two and the higher the criterion validity.
When the criterion is measured at a later date, this is also called predictive validity.
Of course, there are other factors related to job performance, so the correlation will never be perfect (i.e., 1). In most situations, there will be many factors associated with a particular performance outcome, and in some cases, hundreds.
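As a minimal sketch of the calculation described above, the following Python snippet computes a Pearson correlation between IQ scores and job performance ratings. The data, the variable names, and the `pearson_r` helper are all invented for illustration; a real validation study would use far more employees and an established measure of job performance.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data for eight employees (not from any real study)
iq_scores = [95, 100, 105, 110, 115, 120, 125, 130]
performance_ratings = [3.1, 2.8, 3.6, 3.4, 3.9, 3.5, 4.2, 4.0]

validity_coefficient = pearson_r(iq_scores, performance_ratings)
print(f"Criterion validity coefficient: r = {validity_coefficient:.2f}")
```

The closer the printed value is to 1, the better the test scores track the criterion. A value near 0 would suggest the test tells the boss little about job performance.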
Examples of Criterion Validity
1. Leadership Inventories and Leadership Skills
A leadership inventory can predict whether someone will be good in a leadership role.
Predictor Variable: High score on a leadership inventory
Criterion Variable: Aptitude for a leadership role
It takes a long time to know which employees have leadership potential. They have to be seen in various situations over a period of years to develop a solid understanding of their personality and ability to handle pressure.
That is very inefficient, especially for a new company that is expanding rapidly.
This is where personality inventories come into play. By administering a test that assesses leadership traits, a company can obtain a lot of data about a large number of employees very rapidly.
The key is to make sure that the test being administered has criterion validity. As long as it does, it will be able to predict, with some degree of accuracy, which employees are suitable for a leadership role.
2. The SAT and College GPA
Studies have found that SAT scores have a weak to moderate ability to predict college GPA.
Predictor Variable: SAT Score
Criterion Variable: College GPA
There have been many studies on the criterion validity of SAT scores in predicting college GPAs (Kobrin et al., 2008).
The basic premise is that the SAT has criterion validity regarding college performance. The typical study involves obtaining the SAT scores of hundreds, or even thousands, of students and then correlating those scores with first-year or final-year GPAs.
Although it is difficult to make a sweeping statement that adequately covers so many studies, the results range from finding weak to moderately strong associations between SAT and GPA.
A moderately strong association is more impressive than it sounds. It’s just one score on one test, but it predicts future performance on a criterion fairly well. If researchers were to include other factors, such as motivation and time-management skills, the ability to predict a student’s college GPA would become more accurate.
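To illustrate why adding a second predictor helps, here is a rough sketch that uses the standard formula for the multiple correlation of a criterion with two predictors, computed from the three pairwise correlations. The correlation values below are hypothetical and are not taken from Kobrin et al. (2008) or any other study; the function name is also just an illustrative choice.

```python
from math import sqrt

def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation R between a criterion and two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion
    r_12: correlation between the two predictors
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)

# Hypothetical correlations, chosen only for illustration
r_sat_gpa = 0.35          # SAT score with college GPA
r_motivation_gpa = 0.30   # motivation score with college GPA
r_sat_motivation = 0.10   # SAT score with motivation score

combined = multiple_r_two_predictors(r_sat_gpa, r_motivation_gpa, r_sat_motivation)
print(f"SAT alone: r = {r_sat_gpa:.2f}")
print(f"SAT plus motivation: R = {combined:.2f}")
```

Under these assumed values, the combined R is larger than the correlation for SAT alone. The gain shrinks as the two predictors become more strongly correlated with each other, because they then carry overlapping information.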
3. The Housing Market
Predictor variables, including the number of new homes purchased, building permits awarded, mortgage interest rates, and the employment rate, have high criterion validity in predicting house prices.
Predictor Variables: Building permits issued, interest rates on mortgages, employment rate
Criterion Variable: Housing prices
The housing market is a classic indicator of economic performance. The volume of sales each quarter is affected by numerous factors, including the employment rate, interest rates, building supply, and consumer confidence, just to name a few.
Each one of those factors can be measured and correlated with the housing market. Some factors have strong criterion validity, while others may have moderate or low criterion validity. However, when economists put them all together, the ability to predict the housing market improves significantly.
Of course, it’s still not an exact science, so there will always be some margin of error in those forecasts.
4. Psychological Correlates of Academic Performance
Self-efficacy and effort management have been found to have high criterion validity for academic performance because they are among the strongest predictors of GPA.
Predictor Variable: Self-efficacy and effort management
Criterion Variable: A high GPA
Richardson et al. (2012) examined a large number of studies between 1997 and 2010 that involved identifying psychological variables associated with academic performance. The researchers included over 7,000 studies and identified over 80 distinct variables that correlated with GPA.
Each of those 80 variables has a degree of criterion validity. That is, a student’s score on each one of those variables is predictive of grades to some extent. The real question is: which ones are the best predictors?
After conducting some very thorough analyses, the results indicated that psychological factors such as self-efficacy and effort management were the strongest correlates of GPA. In other words, student self-efficacy and effort management have criterion validity regarding GPA.
5. The In-Basket Activity
The in-basket job simulation test examines a manager’s ability to prioritize tasks. A job applicant sorts through the items in an in-basket and decides the order in which to handle them.
Predictor Variable: Performance in the in-basket exercise
Criterion Variable: Applicant’s aptitude as a manager
The In-Basket activity is a job simulation task that is designed to assess an applicant’s ability to prioritize.
First, the applicant is seated at an official-looking desk and instructed to sort through the in-basket documents. The basket contains memos, email printouts, messages, and descriptions of various tasks that the company needs completed.
The applicant is given a short period of time to read the assorted documents and arrange them in order of priority.
This is an example of the type of assessment tool that an HR department will implement because they believe it has criterion validity. Performance in this activity is predictive of the ability to prioritize competing demands on the job.
6. Criterion Validity and Life-Expectancy
A predictor variable such as frequent exercise has criterion validity for life expectancy if it reliably predicts how long people live.
Predictor Variable: Regular exercise
Criterion Variable: A long life.
It seems like every month another study on life-expectancy is published.
Many of the studies have similar methodologies. At stage 1, thousands of people are assessed on a multitude of factors, including dietary habits, frequency of exercise, and psychological factors such as social support and personality characteristics.
At stage 2, approximately 20-50 years later, the researchers gather data on physical health such as cardiovascular disease and cancer.
By examining the correlations between the factors assessed at stage 1 and the health status of participants at stage 2, the researchers can determine which factors have criterion validity. That is, they can see which factors at stage 1 are related to health at stage 2.
7. The NFL Combine
The NFL Combine is an annual test of college football players’ aptitude to play in the NFL. Most of these tests don’t have criterion validity, but the sprint test for running backs does predict future performance in the NFL.
Predictor Variable: NFL combine sprint test
Criterion Variable: Running back performance in the NFL
Every year, top college football players are invited to participate in the NFL’s combine. The event lasts several days and involves each athlete going through a wide range of physical challenges, such as running the 40-yard dash, jumping as high as they can, and taking an interesting IQ test called the Wonderlic.
Head coaches, scouts, and owners put a lot of faith in the results of these tests, but no one is really sure why. As research by Kuzmits & Adams (2008) has revealed, there is “…no consistent statistical relationship between combine tests and professional football performance, with the notable exception of sprint tests for running backs” (p. 1721).
The NFL combine may be one of the most enduring sets of tests that, with that one exception, lack criterion validity.
8. Bus Driver Course Performance and Bus Accidents
To test the criterion validity of a standardized driving course, researchers would have to follow up on a large group of hired drivers to see whether those who scored well on the course were involved in fewer accidents.
Predictor Variable: Score on a standardized bus driving course
Criterion Variable: Having fewer bus accidents on the job
Hiring skilled and cautious bus drivers is a paramount concern for many municipalities. A single accident can result in numerous injuries. Add in the duration of driving times and the number of buses operating at any given time, and the situation is ripe for frequent accidents.
Therefore, bus companies need to select their drivers carefully. One component of the hiring process involves applicants driving through a standardized course. The course has been designed to mimic several characteristics found in real driving conditions and each applicant’s performance can be objectively measured and scored.
When that score is then correlated with actual driving records of hired drivers over the next few years, its criterion validity can be assessed.
Hopefully, the bus company will discover that the driving course has criterion validity. In other words, performance on the course can predict actual job performance. If so, applicants who do poorly on the course should not be hired.
9. Job Simulation and Nursing Competence
Evaluations of competence sometimes have low criterion validity. For example, in one study, external experts’ ratings of nursing competence did not correlate with the evaluations made by the nurses’ day-to-day supervisors, suggesting that either the experts or the supervisors were conducting assessments with low criterion validity.
Predictor Variable: Expert ratings of performance in simulated scenarios
Criterion Variable: On-the-job performance, as rated by supervisors
Nursing is an incredibly high-pressure, high-stakes occupation. Poor job performance can result in serious injury or worse. Therefore, the ability to develop accurate measures of performance that have criterion validity is of substantial importance.
Unfortunately, relying on a paper and pencil measurement of skills fails to replicate the high-stress situations that nurses often find themselves facing.
However, “Evaluation of clinical performance in authentic settings is possible using realistic simulations that do not place patients at risk” (Hinton et al., 2017, p. 432).
In the Hinton et al. study, nurses engaged in specific medical-surgical test scenarios with manikins in a high-fidelity laboratory while being observed by experienced professionals. Those ratings were then compared to their supervisor’s ratings on the job.
In this example, the researchers were attempting to establish the criterion validity of the simulation scenarios to predict on-the-job performance. Despite all the effort that went into this study, scores on the simulated scenarios “…were not well correlated with self-assessment and supervisor assessment surveys” (p. 455).
10. Wearable Trackers and Steps Walked
Step counters on your smartphone apparently have high criterion validity. To test this, Adamakis (2021) had people walk and jog on a treadmill, counted their steps, and then compared those counts with the results from the step-counting apps. The apps did pretty well!
Predictor Variable: Steps recorded on a step counter
Criterion Variable: Actual steps walked
Ever wonder if those activity trackers on your phone are accurate? Well, research by Adamakis (2021) may shed some light on this question.
In this study, thirty adults wore two smartphones (one Android and one iOS), while running four apps: Runtastic Pedometer, Accupedo, Pacer, and Argus. They walked and jogged on a treadmill at three different speeds for 5 minutes. Two research assistants counted every step they took with a digital counter.
Criterion validity of the apps was then assessed by comparing the data from the apps with the research assistants’ manual counts, which served as the criterion measure. The results revealed that “The primary finding regarding step count was that all freeware accelerometer-based apps were valid…when comparing iOS and Android apps, Android apps performed slightly more accurately than iOS ones” (p. 9).
So, it seems that these apps have acceptable criterion validity, at least when it comes to counting steps.
This study is also a good example of concurrent validity, because the validity of one test was established by administering it concurrently (i.e., at the same time) with another measure known to be valid, to see if the two produce the same results.
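One common way to quantify how closely a device agrees with a criterion count is the mean absolute percentage error (MAPE). The sketch below is only an illustration of that idea, with invented step counts and a hypothetical helper function; it is not presented as the exact analysis used by Adamakis (2021).

```python
def mean_absolute_percentage_error(measured, criterion):
    """Average absolute error of the measured values, as a percentage of the criterion."""
    if len(measured) != len(criterion) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(m - c) / c for m, c in zip(measured, criterion))
    return 100 * total / len(measured)

# Hypothetical step counts for five treadmill trials (not data from the study)
app_steps = [520, 495, 610, 702, 688]       # steps reported by the app
counted_steps = [512, 500, 604, 715, 690]   # steps counted by an observer

mape = mean_absolute_percentage_error(app_steps, counted_steps)
print(f"Mean absolute percentage error: {mape:.1f}%")
```

A lower percentage means the app's counts sit closer to the criterion counts. What threshold counts as "acceptable" is a judgment made by the researchers and depends on the purpose of the measurement.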
Conclusion
With the prevalence of tests used to determine who gets into college or who gets hired as a bus driver, it would be nice to know if those tests are accurate. That is, is a person’s score on a given test at all related to actual performance, either at school or on the job?
As it turns out, there is a way to make this determination, and it’s called criterion validity. The usual methodology involves administering the test to a group of people and then assessing their performance in a given domain at a later date. That later date could be a matter of months or several years.
Fortunately, researchers have conducted a great many studies examining the criterion validity of thousands of tests. Tests that lack support are usually dropped or modified, while tests that are supported by research can be used in many practical situations.
References
Adamakis, M. (2021). Criterion validity of iOS and Android applications to measure steps and distance in adults. Technologies, 9, 55. https://doi.org/10.3390/technologies9030055
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement (6th ed.). New York: McGraw-Hill.
Hinton, J., Mays, M., Hagler, D., Randolph, P., Brooks, R., DeFalco, N., Kastenbaum, B., & Miller, K. (2017). Testing nursing competence: Validity and reliability of the nursing performance profile. Journal of Nursing Measurement, 25(3), 431. https://doi.org/10.1891/1061-3749.25.3.431
Kobrin, J. L., Patterson, B. F., Shaw, E. J., Mattern, K. D., & Barbuti, S. M. (2008). Validity of the SAT for predicting first-year college grade point average (College Board Research Report No. 2008-5). New York, NY: College Board.
Richardson, M., Abraham, C., & Bond, R. (2012). Psychological correlates of university students’ academic performance: A systematic review and meta-analysis. Psychological Bulletin, 138(2), 353.