How Valid Is This Test?
No business wants to spend time and money on a measurement method that does not work. This is why most businesses know to ask this basic question: “How valid is this method or test?” The challenge only begins here, though, because you then need to be able to understand and evaluate the answer. To help you, try following these seven tactics.
(Excerpted with permission from the publisher, Wiley, from Talent Intelligence: What You Need to Know to Identify and Measure Talent by Nik Kinley and Shlomo Ben-Hur. Copyright © 2013.)
Ask for Evidence. We were recently looking at the validity of a popular U.S. interviewing system that described itself as being accurate and valid. On a Web page entitled “Validity,” the vendor described a wide variety of research showing that interviews can be valid predictors of success. Yet there was not a single mention of any research that the vendor had conducted into the validity of its own system. So rule No. 1 is that you need to get specific and ask vendors for the evidence that their particular method or tool is valid. And beware of statements such as “The test is predictive” that do not come with any specific validity figures or evidence.
Ask What Is Meant by Validity. Validity figures are not always what they appear to be. For starters, there is no one way for vendors to measure or report validity. When you are told that a measurement method has 80 percent validity, it could mean many different things. Classically, validity refers to whether the ratings and scores that people achieve on particular measures can predict their performance in a business. And by and large, this is what you should expect to hear. Yet we have seen some vendors define validity as being whether individuals agree with the results, so when a vendor tells you that a particular measure is valid, you need to ask, “In what way?”
In response to this question, you may sometimes hear phrases such as “content validity,” “criterion validity,” and “construct validity.” For many people, this kind of technical jargon can be confusing and can put them off delving more deeply into the subject. But it need not do so. All you need to remember is that you are essentially trying to find out two things: “How do you know that the method or tool measures what it is supposed to?” and, “What business outcomes do results with this method predict, and to what degree?”
It is worth noting here that “performance” can mean different things. It can mean actual results (such as sales figures), managers’ appraisal ratings of individuals, and even self-ratings of performance. Beyond task performance, it can mean contribution to team performance or organizational citizenship behavior.
Furthermore, just because a measure can predict performance in skilled and semiskilled workers does not mean that it can also predict performance in managers. There are additional questions that you need to ask when told that a measure can predict performance: “What types of performance?” and, “In what types of people?” Moreover, with measures of potential, extra questions to ask are, “How far ahead can it predict performance?” and, “After how long?”
Beware of Very High Validity Figures. When looking at the degree to which methods or tools can predict outcomes, bear in mind that even the single best predictor of performance, intelligence, achieves maximum validities of only 0.5 to 0.6. If a vendor quotes anything higher than that, start asking questions.
Check How Many People the Tool Has Been Validated With. One essential question to ask is, “How many people?” For instance, if you are told that a measure can predict, say, absenteeism in semiskilled workers, you need to ask how many people were tested. If the answer comes back with anything fewer than 100, then the results may not be reliable. For psychometric tests, ideally you should be looking for two thousand or more people to have been tested.
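To see why the number of people matters, it can help to look at how much uncertainty surrounds a validity figure at different sample sizes. The sketch below assumes the validity figure is a correlation coefficient and uses the standard Fisher z approximation for a 95 percent confidence interval. The validity of 0.30 and the function name are illustrative assumptions, not figures from any particular test.

```python
import math

def validity_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r observed in n people.

    Uses the Fisher z-transform; requires n > 3 and -1 < r < 1.
    """
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# Hypothetical reported validity of 0.30 at three sample sizes
for n in (50, 100, 2000):
    lower, upper = validity_interval(0.30, n)
    print(f"n={n}: plausible range {lower:.2f} to {upper:.2f}")
```

Under these assumptions, a validity of 0.30 found with 100 people is compatible with a true value anywhere from roughly 0.11 to 0.47, whereas with 2,000 people the range narrows to roughly 0.26 to 0.34.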
If the Method or Tool Uses Norm Groups, Check the Quality and Relevance of Them. Not all methods and tools use norm groups, but some rely on them. Norm groups are comparison groups, a kind of benchmark. They enable you to compare the score of a particular individual on a certain test or measurement method with the scores of other people who have also done the test. This is particularly useful with ability tests, such as measures of intelligence and physical fitness, as it can help you understand what scores mean. For example, an individual may get a score of 25 out of 30 on an intelligence test, which sounds good. But if you then find out that the average score is 27, that score of 25 does not look so good after all. We need to know how well others usually perform to understand precisely how good a score is.
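The comparison described here can be sketched in a few lines. The norm-group scores below are invented so that their average is 27, matching the example; the function name and the choice to count only people scoring strictly below the individual are assumptions made for illustration.

```python
from statistics import mean

def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring strictly below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group on a 30-point test (average is 27)
norm_scores = [24, 26, 26, 27, 27, 27, 28, 28, 28, 29]

score = 25
print("Norm group average:", mean(norm_scores))
print("Percentile rank of", score, "is", percentile_rank(score, norm_scores))
```

With this made-up norm group, a score of 25 out of 30 beats only one person in ten, which is the point of the example: the raw score alone says little until it is set against how others perform.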
As useful as norm groups may sound, the science of developing them and where they should and should not be used are much-debated issues. If you are going to use norm groups, then they should be good ones: if they are not, they may be misleading.
So what counts as a “good” norm group? You need to look for two qualities. The first is size — the number of people in the group. Simply put, the bigger, the better. With competency ratings from individual psychological assessments, the norm group may be very small — under 100. For psychometrics, however, it will ideally be in the thousands.
The second quality you should look for is relevance. Having a norm group of two thousand white males from Scandinavia is impressive, but if you are trying to interpret the scores of Singaporean women, it is of no use. To be effective, then, a norm group needs to be representative of the people you are assessing. This can be in terms of gender, age, ethnicity, and education level. It can also be in terms of industry, function, and type of role. The more relevant, the better. For job applicants being tested with an intelligence test, for example, the best norm group is not the scores of people already employed, but other applicants for the same type of roles.
One quick way to evaluate the quality of a norm group you are already using is to look at how many of the people you are assessing score above the average for the norm group. If the norm group is perfect, then 50 percent of your people will score above the norm average and 50 percent will score below it. If almost everyone is scoring above or below the norm average, then you know that the norm group may not be relevant enough.
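This quick check is easy to automate. The following is a minimal sketch: the function names, the sample scores, and the tolerance of 20 percentage points around the 50 percent mark are all assumptions chosen for illustration, not thresholds from the authors.

```python
def share_above_norm(scores, norm_mean):
    """Fraction of assessed people scoring above the norm group's average."""
    above = sum(1 for s in scores if s > norm_mean)
    return above / len(scores)

def norm_looks_relevant(scores, norm_mean, tolerance=0.20):
    """True if the share above the norm average is within tolerance of 50%.

    The tolerance is an arbitrary illustrative cutoff.
    """
    return abs(share_above_norm(scores, norm_mean) - 0.5) <= tolerance

# Hypothetical scores from your own candidates, against a norm average of 27
lopsided = [28, 29, 30, 29, 28, 27, 30, 29]
balanced = [25, 26, 28, 29]

print(share_above_norm(lopsided, 27), norm_looks_relevant(lopsided, 27))
print(share_above_norm(balanced, 27), norm_looks_relevant(balanced, 27))
```

In the first group, seven of eight people score above the norm average, which suggests the norm group may not be a good match for the people being assessed.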
Moreover, for larger organizations it may be worthwhile trying to create your own norm groups specific to your business. The absolute minimum you need for competency and individual psychological assessment ratings is around 50 people. This is low, though, and you would need to be a little cautious about comparisons. For psychometrics, the minimum is around 150 people, although once again this is low. A number you could be completely confident in would be around 2,000, so our suggestions are absolute minimums. Some vendors will try to charge you for creating a specific norm group for your business. Others do not charge. Obviously, we recommend the latter.
Remember Reliability. For relatively objective methods such as psychometric tests and SJTs, you do not need to ask about reliability. A test cannot be valid without also being reliable, so asking about validity is enough. However, for more subjective methods such as assessment centers and individual psychological assessment, it is important to ask about inter-rater reliability. This is the degree to which two assessors agree (or disagree) in their ratings and judgments about people. The less reliability and agreement there is between assessors, the less likely results are to be accurate.
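One way to put a number on agreement between two assessors is sketched below. It computes simple percentage agreement and Cohen's kappa, a widely used statistic that adjusts agreement for what would be expected by chance. Kappa is offered here as one common option rather than the measure any given vendor uses, and the ratings and function names are invented for illustration.

```python
from collections import Counter

def percent_agreement(ratings_a, ratings_b):
    """Fraction of candidates on whom both assessors gave the same rating."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def cohens_kappa(ratings_a, ratings_b):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level.

    Undefined (division by zero) if both assessors use a single identical
    rating for everyone.
    """
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both pick the same category independently
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of eight candidates by two assessors
assessor_1 = ["high", "high", "mid", "low", "mid", "high", "low", "mid"]
assessor_2 = ["high", "mid", "mid", "low", "mid", "high", "mid", "mid"]

print(f"Agreement: {percent_agreement(assessor_1, assessor_2):.2f}")
print(f"Kappa: {cohens_kappa(assessor_1, assessor_2):.2f}")
```

Here the two assessors agree on six of eight candidates, but once chance agreement is taken into account the kappa is noticeably lower than the raw 75 percent, which is why it is worth asking vendors which measure of agreement they are quoting.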
Look for Independent Reviews. This final step is an important one: always look for independent evidence of whether measures work. An easy place to start here is to ask the vendor if any such research exists. You can also do a Web search for the name of the tool. Moreover, with psychometric tests, probably the best thing you can do is to check one of the independent, nonprofit bodies that publish test reviews. The national psychology associations or societies of many countries provide this kind of service. By far our favorite is provided by the University of Nebraska’s Buros Institute. Its reviews can contain some deeply technical information, but they also contain some clear and no-nonsense recommendations on whether to use tests.
These, of course, are just questions about validity. However, businesses need to think more broadly about the issue of whether measures work. We have discussed, for example, the need to ask about incremental validity. Yet businesses also need to think about what measures need to do over and above merely predicting performance. This could include things like helping managers engage potential new employees, identifying areas new employees may need support with, and helping plan for individuals’ development. Validity, then, is not the be-all and end-all, and the most valid test is sometimes not the one that will work best for your business. Nevertheless, it is a good place to start: a test that is not valid will not be able to do much for your business.