Face validity refers to whether a measurement appears to assess the thing it is supposed to assess.
The key term here is “appears.” The question it poses is:
“Does the test look like it measures what it has been designed to measure?”
This type of validity evaluation is subjective and is usually conducted by the people who will use the scale or by experts in the domain of study.
Examples of Face Validity
The three main ways to achieve face validity are:
- Consult a panel of research experts on your study design
- Consult a panel of workforce professionals on your study design
- Consult research participants on your study design during a pilot test
Below are the details on eleven examples, including real-life studies.
1. Panel of Research Experts
Probably the most common way to assess face validity is to use a panel of experts. The researcher contacts a small group (2-5) of noted experts in the domain the scale is intended to measure.
The scale is sent to each member of the panel and they rate each item on the scale in terms of face validity. That professional judgement is quantified by asking them to indicate their level of agreement with the following question: This item is appropriate for measuring ____________.
Each member of the panel would indicate how much they agree or disagree with that statement on a Likert scale, as seen below.
| 1 | 2 | 3 |
|---|---|---|
| disagree | neither agree nor disagree | agree |
If the scale has face validity, then the experts will have a high degree of agreement on each item.
How much agreement is necessary to conclude that a scale has face validity is open to debate and can vary depending on how the panel’s ratings are compared.
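One simple way to compare the panel’s ratings is to compute, for each item, the proportion of experts who chose “agree” (3) on the scale above. The sketch below is purely illustrative; the item names and ratings are hypothetical:

```python
def percent_agree(scores, agree_value=3):
    """Proportion of panel members who gave the top 'agree' rating."""
    return sum(s == agree_value for s in scores) / len(scores)

# Hypothetical ratings from a 5-person panel on the 1-3 scale above,
# one list of ratings per questionnaire item.
panel_ratings = {
    "item_1": [3, 3, 3, 2, 3],
    "item_2": [1, 2, 3, 1, 2],
}

for item, scores in panel_ratings.items():
    print(f"{item}: {percent_agree(scores):.0%} of experts chose 'agree'")
```

Here item_1 would look face valid (80% agreement) while item_2 would be a candidate for revision (20%), though where exactly to draw the cutoff is the researcher’s call.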
2. Panel of Professionals
This example of how to assess face validity is basically the same as using a panel of research experts. However, there are situations in which asking working professionals may be more appropriate than asking researchers.
For example, suppose you are developing a questionnaire that will be given to a specific population, such as people seeking treatment for a phobia, or participants in marital counseling.
Then, it would be better to ask for the opinion of mental health professionals who work with that population on a daily basis.
Researchers may have some expertise in very specific aspects of the domain, but they may have very little direct contact with the population under study. In this case, mental health professionals would be better at assessing the questionnaire’s face validity.
3. Panel of Research Participants
Another way of testing face validity is to consult the research participants. Often, this involves running a pilot test, then following up with the research participants and asking them how valid the questions on the test really were.
Talking to the research participants during a pilot can significantly improve the quality of the test in future iterations. Participants provide feedback grounded in real experience rather than the theoretical world of academia.
For example, participants may highlight that you did ask valid questions, but you also missed other extremely important questions that would give a more holistic view of the study. This can lead to changes to the test that will improve both the face validity and overall quality of the test.
4. Cohen’s Kappa Statistic
Cohen’s kappa is a statistic that measures agreement between two raters while correcting for the agreement expected by chance, and it can be applied in a wide range of situations that involve two raters.
First, each rater is given a copy of the scale or questionnaire. They are asked to simply indicate “yes” or “no” for each item; “yes” means that item measures the construct and “no” means it does not.
The next step involves using a statistical formula to perform the necessary calculations. You can choose to do this by hand if you like, or input the data into a computer program such as SPSS and let the computer do all of the work.
The output will provide a number that ranges from -1 to 1. If the output is 0, it means the raters agreed no more often than would be expected by chance; if the output is 1, it means they agreed 100% of the time (negative values indicate agreement worse than chance). Therefore, the closer to 1, the better the face validity.
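For two raters giving yes/no judgments as described above, kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each rater’s label frequencies. A minimal sketch (the ratings below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance."""
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Two raters judging ten questionnaire items ("yes" = measures the construct)
rater_1 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "no"]
rater_2 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # → 0.524
```

Statistical packages such as SPSS (or, in Python, scikit-learn’s `cohen_kappa_score`) will produce the same value from the same data.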
5. Motor Skills Perception Questionnaire
Youth with visual impairments (YVL) tend to be less physically active because they have lower motor competence. This leads to increased sedentary behavior and, over time, a risk for obesity.
In order to conduct research on how parental perceptions of motor skills affect YVL behavioral choices, it is important to have a sound questionnaire that assesses perception of motor competence (PMC).
Therefore, a group of researchers (Stribing et al., 2021) with experience working with this population generated 50 questions for parents regarding their child’s motor skills.
Next, the team distributed the questions to 22 researchers in this domain. Each expert rated each question on a 5-point scale on the following criteria:
To what extent is this question relevant for parents of YVLs?
- 0 = N/A
- 1 = very poorly
- 2 = poorly
- 3 = somewhat
- 4 = acceptable
- 5 = very acceptable
The ratings were analyzed and an item was included in the final questionnaire if it was rated as acceptable at least 80% of the time. Items with ratings below 80% were eliminated.
The end result is a questionnaire that has face validity as determined by experts.
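The 80% inclusion rule described above can be sketched in a few lines. The item names and ratings here are hypothetical, not from the actual study:

```python
# Hypothetical expert ratings per item on the 0-5 scale above (0 = N/A).
ratings = {
    "Q1": [4, 5, 4, 3, 5, 4, 4, 5, 4, 4],
    "Q2": [2, 3, 4, 2, 3, 2, 4, 3, 2, 3],
}

def keep_item(scores, threshold=0.80, acceptable=4):
    """Keep an item if at least 80% of experts rated it 4 (acceptable) or 5."""
    rated = [s for s in scores if s != 0]  # exclude N/A responses
    return sum(s >= acceptable for s in rated) / len(rated) >= threshold

final_items = [item for item, scores in ratings.items() if keep_item(scores)]
print(final_items)  # → ['Q1']
```

In this toy example, Q1 survives (9 of 10 experts rated it acceptable or better) while Q2 is eliminated.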
6. New Mathematics Teacher
New teachers who create quizzes for their students often survey more experienced teachers to get their input on whether the quiz has face validity.
Being a new teacher can be both exciting and stressful. Teaching advanced math in an exclusive secondary school for gifted students can also be daunting.
So, a recently licensed teacher wants to make sure her first test is fair and challenging.
To help assess if the test is measuring what it is supposed to measure and is at the right level of difficulty, she asks several of her more experienced colleagues at the school to take a look. They each have experience with the course and the type of students.
She provides a rating table that includes a row for each question. The teachers rate each question in terms of:
- Level of difficulty, and
- Appropriateness for the students.
There is also space for comments on each question.
After the ratings are returned, the teacher examines the level of agreement among her colleagues and their comments to make a decision about each test item individually. Eventually, she has a test that she is confident is appropriate for her students.
7. Feelings of Burnout
An employer can use a questionnaire to test whether their employees feel burned out. To test face validity, he asks a small cohort of employees to provide feedback on whether the questionnaire will effectively measure burnout.
The director of human resources is interested in having a better understanding of why so many of the company’s employees are calling in sick. He has a hunch that it is due to workplace stress because the company recently downsized and asked remaining employees to work longer hours.
However, he has never experienced burnout, so he is not sure if the questions he has generated for a survey are appropriate. He decides to send the questionnaire to 20 employees who have experienced burnout (based on their personnel files).
He asks each one to rate each question on how well the item measures feeling burned out (1=yes, measures feelings of burnout, 2=no, does not).
When the ratings are returned, he creates a table that shows the scores for each item. He decides to throw a question out if more than 80% of the employees gave it a rating of 2.
In the end, he has a questionnaire about burnout that has face validity for people experiencing the phenomenon.
8. Bayley Scales of Infant and Toddler Development
The Bayley Scales of Infant and Toddler Development (BSID) test the cognitive, language, social-emotional, motor, and adaptive functioning of infants and toddlers. The test has been refined through surveys of panels of experts over the years to improve its face validity.
The BSID can be used by hospitals to assess an infant’s development and reveal whether the baby is progressing along expected norms or experiencing a developmental delay.
The development of the test started in the 1960s and has gone through several revisions since.
To assess face validity, it was extremely important to choose a panel of experts wisely, such as pediatricians with considerable experience.
Since the BSID serves a very important purpose, face validity was just the first form of validity assessed. In fact, over the last several decades, the scale has undergone considerable testing and refinement.
9. The WHOQOL
The WHOQOL quality of life assessment achieved face validity by consulting a panel of experts and regular individuals from a range of cultural backgrounds.
What the heck is the WHOQOL? This acronym stands for the World Health Organization Quality of Life assessment. The scale was designed to “… assess respondents’ perception and subjective evaluation of various aspects of the quality of life” (Saxena & van Ommeren, 2005, p. 975).
As you can imagine, trying to develop a survey of “quality of life” that can be used all around the world was a daunting task. The most obvious reason is that there are so many cultures, with so many variations in what counts as “quality” of life.
In the case of the WHOQOL, face validity was initially assessed by creating cross-cultural focus groups in several field centers (Bangkok, Bath, Madras, Melbourne, Panama, St. Petersburg, Seattle, Tilburg, and Zagreb). Participants in the focus groups consisted of experts and individuals from the general population.
Those groups examined and discussed the items on the scale and then the developers selected items for inclusion accordingly.
10. Customer Satisfaction Survey
If a customer satisfaction survey is too broad and doesn’t ask questions directly related to the customer’s experience, it will have low face validity, and likely lead to failure of the survey.
A national bank has revamped their customer service. The improved version includes training reps to cite company policy in various complaint domains, and automated answering technology that routes customer calls.
The board of directors spent considerable financial resources on the new initiative, and customers are surveyed after their calls are handled by reps, with sample questions including:
How pleased are you with the bank’s mortgage rates?
Would you recommend opening a savings account to others?
Do you find the bank’s operating hours convenient?
After 6 months, a report on the new program is delivered to executives. The results are puzzling. Nearly 90% of customers failed to complete the survey.
A follow-up study was initiated, revealing that customers quit the survey because the questions had very little to do with the matter they called about. Furthermore, customers wanted to complain about their frustrations with the automated answering program and reps simply citing company policy rather than solving their problems. Those issues were not in the survey at all.
This is an example of what happens when a survey has no face validity with the people who respond to it: they don’t take it seriously and quit.
11. Virtual Electrosurgical Skill Trainer (VEST)
You can test face validity after a study has been conducted by surveying participants. This was done with the Virtual Electrosurgical Skill Trainer (VEST) apparatus.
The Virtual Electrosurgical Skill Trainer (VEST) is a way to train surgeons, anesthesiologists, and nurses on how to handle a fire in the operating room (OR). Wearing a head-mounted apparatus, trainees must respond to various fire scenarios in a simulated emergency and act to contain or extinguish the fire.
Forty-nine experienced professionals completed the training and were then asked to respond to a 16-item questionnaire and rate the usefulness of the VEST simulator on a 5-point scale.
Most questions (12 of 13, or 92%) received an average rating of 3 or higher, while five questions (38%) received an average rating of 4 or higher.
Therefore, the results indicated that the VEST simulator has acceptable face validity as determined by a panel of experienced professionals (Dorozhkin et al., 2017).
Conclusion
As Lee J. Cronbach (1916–2001), a pioneer of psychological testing, explained, the importance of face validity should not be underestimated:
“When a patient loses faith in the medicine his doctor prescribes, it loses much of its power to improve his health. He may skip doses, and in the end may decide doctors cannot help him and let treatment lapse all together. For similar reasons, when selecting a test, one must consider how worthwhile it will appear to the participant who takes it and other laymen who will see the results” (Cronbach, 1970, p. 182).
For this reason, psychologists and other researchers go to considerable lengths to assess the face validity of their measurement and training tools.
References
Cronbach, L. J. (1970). Essentials of Psychological Testing. New York: Harper & Row.
Dorozhkin, D., Olasky, J., Jones, D. B., Schwaitzberg, S. D., Jones, S. B., Cao, C. G., … & De, S. (2017). OR fire virtual training simulator: design and face validity. Surgical Endoscopy, 31(9), 3527-3533.
Hardesty, D. M., & Bearden, W. O. (2004). The use of expert judges in scale development: Implications for improving face validity of measures of unobservable constructs. Journal of Business Research, 57, 98-107.
Saxena, S., & van Ommeren, M. (2005). World Health Organization Instruments for Quality of Life Measurement in Health Settings, Editor(s): Kimberly Kempf-Leonard, Encyclopedia of Social Measurement, 975-980. https://doi.org/10.1016/B0-12-369398-5/00508-9
Stribing, A., Stodden, D., Monsma, E., Lieberman, L., & Brian, A. (2021). Content/face validity of motor skill perception questionnaires for youth with visual impairments: A Delphi method. British Journal of Visual Impairment, 1-9. https://doi.org/10.1177/0264619621990687
Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach’s alpha. International Journal of Medical Education, 2, 53-55. https://doi.org/10.5116/ijme.4dfb.8dfd