Validity refers to whether or not a test or an experiment is actually doing what it is intended to do.
Validity sits upon a spectrum. For example:
- Low Validity: Most people now know that the standard IQ test does not actually measure intelligence or predict success in life.
- High Validity: By contrast, a standard pregnancy test is about 99% accurate, meaning it has very high validity and is therefore a very reliable test.
There are many ways to determine validity. Most of them are defined below.
Types of Validity
1. Face Validity
Face validity refers to whether a scale “appears” to measure what it is supposed to measure. That is, do the questions seem to be logically related to the construct under study.
For example, a personality scale that measures emotional intelligence should have questions about self-awareness and empathy. It should not have questions about math or chemistry.
One common way to assess face validity is to ask a panel of experts to examine the scale and rate it’s appropriateness as a tool for measuring the construct. If the experts agree that the scale measures what it has been designed to measure, then the scale is said to have face validity.
If a scale, or a test, doesn’t have face validity, then people taking it won’t be serious.
Conbach explains it in the following way:
“When a patient loses faith in the medicine his doctor prescribes, it loses much of its power to improve his health. He may skip doses, and in the end may decide doctors cannot help him and let treatment lapse all together. For similar reasons, when selecting a test one must consider how worthwhile it will appear to the participant who takes it and other laymen who will see the results” (Cronbach, 1970, p. 182).
2. Content Validity
Content validity refers to whether a test or scale is measuring all of the components of a given construct. For example, if there are five dimensions of emotional intelligence (EQ), then a scale that measures EQ should contain questions regarding each dimension.
Similar to face validity, content validity can be assessed by asking subject matter experts (SMEs) to examine the test. If experts agree that the test includes items that assess every domain of the construct, then the test has content validity.
For example, the math portion of the SAT contains questions that require skills in many types of math: arithmetic, algebra, geometry, calculus, and many others. Since there are questions that assess each type of math, then the test has content validity.
The developer of the test could ask SMEs to rate the test’s construct validity. If the SMEs all give the test high ratings, then it has construct validity.
3. Construct Validity
Construct validity is the extent to which a measurement tool is truly assessing what it has been designed to assess.
There are two main methods of assessing construct validity: convergent and discriminant validity.
Convergent validity involves taking two tests that are supposed to measure the same construct and administering them to a sample of participants. The higher the correlation between the two tests, the stronger the construct validity.
With divergent validity, two tests that measure completely different constructs are administered to the same sample of participants. Since the tests are measuring different constructs, there should be a very low correlation between the two.
4. Internal Validity
Internal validity refers to whether or not the results of an experiment are due to the manipulation of the independent, or treatment, variables. For example, a researcher wants to examine how temperature affects willingness to help, so they have research participants wait in a room.
There are different rooms, one has the temperature set at normal, one at moderately warm, and the other at very warm.
During the next phase of the study, participants are asked to donate to a local charity before taking part in the rest of the study. The results showed that as the temperature of the room increased, donations decreased.
On the surface, it seems as though the study has internal validity: room temperature affected donations. However, even though the experiment involved three different rooms set at different temperatures, each room was a different size. The smallest room was the warmest and the normal temperature room was the largest.
Now, we don’t know if the donations were affected by room temperature or room size. So, the study has questionable internal validity.
5. External Validity
External validity refers to whether the results of a study generalize to the real world or other situations. A lot of psychological studies take place in a university lab. Therefore, the setting is not very realistic.
This creates a big problem regarding external validity. Can we say that what happens in a lab would be the same thing that would happen in the real world?
For example, a study on mindfulness involves the researcher randomly assigning different research participants to use one of three mindfulness apps on their phones at home every night for 3 weeks. At the end of three weeks, their level of stress is measured with some high-tech EEG equipment.
This study has external validity because the participants used real apps and they were at home when using those apps. The apps and the home setting are realistic, so the study has external validity.
6. Concurrent Validity
Concurrent validity is a method of assessing validity that involves comparing a new test with an already existing test, or an already established criterion.
For example, a newly developed math test for the SAT will need to be validated before giving it to thousands of students. So, the new version of the test is administered to a sample of college math majors along with the old version of the test.
Scores on the two tests are compared by calculating a correlation between the two. The higher the correlation, the stronger the concurrent validity of the new test.
7. Predictive Validity
Predictive validity refers to whether scores on one test are associated with performance on a given criterion. That is, can a person’s score on the test predict their performance on the criterion?
For example, an IT company needs to hire dozens of programmers for an upcoming project. But conducting interviews with hundreds of applicants is time-consuming and not very accurate at identifying skilled coders.
So, the company develops a test that contains programming problems similar to the demands of the new project. The company assesses predictive validity of the test by having their current programmers take the test and then compare their scores with their yearly performance evaluations.
The results indicate that programmers with high marks in their evaluations also did very well on the test. Therefore, the test has predictive validity.
Now, when new applicants’ take the test, the company can predict how well they will do at the job in the future. People that do well on the predictor variable test will most likely do well at the job.
8. Statistical Conclusion Validity
Statistical conclusion validity refers to whether the conclusions drawn by the authors of a study are supported by the statistical procedures.
For example, did the study apply the correct statistical analyses, were adequate sampling procedures implemented, did the study use measurement tools that are valid and reliable?
If the answers to those questions are all “yes,” then the study has statistical conclusion validity. However, if the some or all of the answers are “no,” then the conclusions of the study are called into question.
Using the wrong statistical analyses or basing the conclusions on very small sample sizes, make the results questionable. If the results are based on faulty procedures, then the conclusions cannot be accepted as valid.
9. Criterion Validity
Criterion validity is sometimes called predictive validity. It refers to how well scores on one measurement device are associated with scores on a given performance domain (the criterion).
For example, how well do SAT scores predict college GPA? Or, to what extent are measures of consumer confidence related to the economy?
An example of low criterion validity is how poorly athletic performance at the NFL’s combine actually predicts performance on the field on gameday. There are dozens of tests that the athletes go through, but about 99% of them have no association with how well they do in games.
However, nutrition and exercise are highly related to longevity (the criterion). Those constructs have criterion validity because hundreds of studies have identified that nutrition and exercise are directly linked to living a longer and healthier life.
There are so many types of validity because the measurement precision of abstract concepts is hard to discern. There can also be confusion and disagreement among experts on the definition of constructs and how they should be measured.
For these reasons, social scientists have spent considerable time developing a variety of methods to assess the validity of their measurement tools. Sometimes this reveals ways to improve techniques, and sometimes it reveals the fallacy of trying to predict the future based on faulty assessment procedures.
Cook, T.D. and Campbell, D.T. (1979) Quasi-Experimentation: Design and Analysis Issues for Field Settings. Houghton Mifflin, Boston.
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement (6th ed.). New York: McGraw-Hill.
Cronbach, L. J. (1970). Essentials of Psychological Testing. New York: Harper & Row.
Cronbach, L. J., and Meehl, P. E. (1955) Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Simms, L. (2007). Classical and Modern Methods of Psychological Scale Construction. Social and Personality Psychology Compass, 2(1), 414 – 433. https://doi.org/10.1111/j.1751-9004.2007.00044.x