Psychometrics
Introduction
Assessment techniques rely on statistics for both their development and validation. Not only do these statistics inform the quality of the measures, but they also provide insight into their use, application, and potential benefits for users.
Quality metrics become even more crucial when they have long-term and significant impacts on decisions, people, and organizations. When the goal of using the technique is to enhance the performance of individuals and organizations—rather than solely focusing on clinical diagnosis and recruitment by clinician psychologists, as was traditionally the case—new criteria emerge, and older ones become less relevant.
The refutability of the hypotheses being tested by statistics and the independence of those who conduct the statistics are essential considerations. Errors can easily creep in when testing hypotheses, starting with the quality of the sample being used. Results presented in textbooks appear more objective when produced by an independent party, but ensuring the absence of political or commercial bias is challenging. Although numerous reference points for statistical practice exist, slow progress in the use of assessment techniques has not enabled psychometric standards to evolve to reflect new applications, as evidenced at GRI.
Specific measurements are either extremely complicated or impossible to obtain, such as those assessing reliability over time by assuming that the same technique will be administered to the same people several years apart, with no other conditions changed. Even in very structured administrative organizations, conditions are constantly changing. Comparing the performance of candidates who have been retained with that of others who have not, to demonstrate that the former performed better than the latter, is simply impossible. Additionally, performance values are not always readily obtainable.
Despite the difficulties in calculating performance metrics, it is generally accepted that tests must possess a minimum set of qualities to be used in applications such as recruitment. The following groups of statistics are generally considered to measure an assessment technique’s performance:
- Basic Metrics include information on reference values, scales, and the sensitivity of the techniques used to evaluate individual differences.
- Reliability assesses the accuracy of the technique over time.
- Validity assesses the quality of what is being measured, as well as its relevance.
- Work-Relatedness for measures to be used in “normal” work environments, as opposed to being used in clinical applications for diagnosis and curative treatments.
- Non-discrimination vis-à-vis protected social groups, which also includes information about a measure’s universal scope.
These five groups are now commented on.
Basic Metrics
Norms
Assessments all have a reference value to which the measures can be compared. This value is called an indicator, or sometimes the norm. For instance, in temperature measurement, absolute zero serves as the reference. For measures of distance, weight, speed, and force, we have reference values. The same goes for psychometrics. When measuring traits, null values are often used as the reference. In typologies, the reference value is more abstract and implicit; it concerns having the value, not having it, or its opposite. For factors used by the adaptive profiles, the reference value is the average, the value between the extreme high and the extreme low.
Scales
Scales inform about the intensity of the dimension being measured. The scales depend on the type of measurement being used: decile and value-based scales for traits, dichotomous or ranking scales for simple or multiple typologies, and standard-deviation scales for factors.
Depending on the scale and measurement system used, the assessment technique will be more or less capable of accounting for the adaptation and development of the behaviors being described.
Standard error of measurement (SEM)
The Standard Error of Measurement (SEM) quantifies the expected margin of error for an individual score, given the assessment's imperfect reliability. The SEM defines the range of scores within which a person’s “true” score can be expected to lie with a given degree of confidence. The smaller the SEM, the more accurate the measurement.
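A minimal sketch of how the SEM is commonly computed in classical test theory (the formula is assumed here, since the article does not give one): SEM = SD × √(1 − reliability), with a confidence band drawn around an observed score. The numbers are purely illustrative.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(score: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate 95% band within which the 'true' score is expected to lie."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - z * sem, score + z * sem

# Illustrative values only: scale standard deviation of 10, reliability of 0.85.
low, high = confidence_band(score=62, sd=10, reliability=0.85)
print(f"SEM band: {low:.1f} to {high:.1f}")
```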
Sensitivity
An assessment technique is said to be sensitive if it allows discrimination, differentiation between people. Conversely, a technique that states substantially the same thing about many people without nuance is insensitive. To measure sensitivity, we study the distribution of dimension scores in a sample large enough and representative of the entire population. In theory, one should find at least one person on each score. Several indices should be considered: the mean, standard deviation, median, index of symmetry (skewness), and flattening of the distribution (kurtosis).
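As an illustration of these indices, the sketch below computes them with scipy on synthetic scores standing in for a representative sample; the data and scale are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=5.5, scale=2.0, size=500)  # synthetic scores on one dimension

indices = {
    "mean": float(np.mean(scores)),
    "std_dev": float(np.std(scores, ddof=1)),
    "median": float(np.median(scores)),
    "skewness": float(stats.skew(scores)),      # index of symmetry
    "kurtosis": float(stats.kurtosis(scores)),  # flattening of the distribution
}
print(indices)
```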
Reliability
An assessment technique is said to be reliable when, applied several times to the same people and under the same conditions, similar results are obtained. Some speak of stability or fidelity rather than reliability.
The reliability of the technique must be differentiated from the reliability of the constructs being measured. An assessment technique that does not give similar results a few days apart would not be considered reliable. On the other hand, the construct being measured may have proven stability over several years while its measurement remains inaccurate in the short term. On the sole basis of this criterion, certain assessments should be avoided for recruitment or consulting but can be used, for example, in team dynamics.
The reliability of an assessment technique is an essential characteristic to be considered before studying validity based on criterion.
Reliability is measured by calculating correlation coefficients between two series of measurements. The technique is said to be reliable if the correlation coefficient is strong (close to 1). In reality, the measurement of reliability refers to the estimate of the proportion of the variance of the results that is attributable to a deviation (or “error” in a broad sense). The discrepancy may come from multiple sources: insufficiently standardized test conditions, a changing general environment, or characteristics of the person that have changed. Four ways of measuring the reliability of an assessment technique are generally used, each of which provides information of a significantly different nature.
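A minimal sketch of this correlation-based estimate, using two made-up series of scores obtained from the same people:

```python
import numpy as np
from scipy import stats

# Two series of scores from the same eight people (illustrative data).
first_series = np.array([12, 15, 9, 20, 17, 11, 14, 18])
second_series = np.array([13, 14, 10, 19, 18, 10, 15, 17])

r, p_value = stats.pearsonr(first_series, second_series)
print(f"reliability coefficient r = {r:.2f}")  # close to 1 suggests a reliable technique
```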
Test-retest method
The same people are asked to answer the same technique twice in a row, under the same conditions. The time between the two administrations can vary widely, ranging from a few days or weeks to several years. The concept of reliability over time can be restricted to the assessment technique, the measuring tool, and not extended to people; the aim is to evaluate not the stable characteristics of the person but the capacity of the measurement tool to provide consistent information from one moment to the next. With administrations a few days or weeks apart, it is useful to assess whether the information given during the first administration may have influenced the second, or whether the technique may be too sensitive to changing environmental conditions.
Split-Half Method
With the split-half technique, the technique's items are divided into two halves in order to obtain two distinct scores and to calculate the correlation between the two. There are several splitting methods: items answered first vs. items answered second, or even-numbered items vs. odd-numbered items. The correlation between the two scores constitutes an index of reliability of the overall score in the evaluation method. By splitting the assessment into two parts, one can see whether the two halves measure the same thing. On the other hand, the reliability measured concerns only half of the assessment (the shorter the technique, the lower its reliability tends to be).
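A sketch of the split-half procedure on synthetic item responses; the Spearman-Brown correction applied at the end, which estimates the reliability of the full-length technique from the half-length correlation, is a standard complement assumed here rather than described in the text.

```python
import numpy as np
from scipy import stats

# Synthetic responses: 200 respondents x 20 items driven by one common trait.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
items = latent + rng.normal(size=(200, 20))

odd_score = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
even_score = items[:, 1::2].sum(axis=1)  # score on even-numbered items

r_half, _ = stats.pearsonr(odd_score, even_score)
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected full-length reliability = {r_full:.2f}")
```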
Covariance Method
With the covariance method, one seeks to evaluate the consistency of the responses to the technique’s items. Inter-item consistency is influenced by two sources: the sampling of items and the heterogeneity of the groups of behaviors being measured. The more homogeneous the groups of behaviors, the more consistent the items are with one another. This involves examining the performance of each item. One can thus see whether each item measures the same dimension as all the other items of the technique. For this, homogeneity coefficients are calculated. When the scores are dichotomous (passed/failed), the Kuder-Richardson index is calculated. If the scores are continuous (scoring on several points), Cronbach's alpha is calculated.
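A small sketch of the homogeneity coefficient for continuous scores, Cronbach's alpha, computed from its usual definition on synthetic data; for dichotomous (passed/failed) items the same logic yields the Kuder-Richardson index.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix; alpha = k/(k-1) * (1 - sum(item var) / total var)."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Synthetic responses: 300 respondents x 10 items driven by one common trait.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))
items = latent + rng.normal(size=(300, 10))
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```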
Method of parallel forms
With the method of parallel forms, two equivalent versions of the same evaluation technique are built. The content, form, and difficulty must be comparable. The two versions are then answered by the same people and a correlation coefficient is calculated between the two measurements. If the correlation between the two measures is high, the person's score can be generalized to other versions of the technique.
The parallel forms of the same technique make it possible to observe or neutralize certain biases, such as practice effects or habituation. In practice, the parallel forms method is rarely used, because it is difficult to actually construct two equivalent forms of the same test.
General Reliability
The results of the various reliability calculations can be integrated into a single result called the generalized reliability coefficient. This coefficient makes it possible to extract the portion of variance solely connected to the assessment technique itself and independently of other sources such as time, administration conditions or residual inconsistency between the items. Generalized reliability can be used in criterion-related validity calculations and in meta-analyses.
Validity
An assessment technique is valid if it measures what it is supposed to measure. With validity, we are interested in the very content of the assessment, its quality and relevance. A validity analysis is generally done after considering the basic metrics, including the sensitivity and reliability calculations, which are necessary but not sufficient. There are four broad categories of validity:
- Empirical validity which includes content validity and face validity
- Theoretical validity which includes, among others, the method of contrasting groups, convergent/divergent validity and internal consistency
- Criterion validity which includes predictive-criterion validity and concurrent-criterion validity
- Synthetic validity which brings the other forms of validity together
Empirical Validity
Empirical validity is historically the form of validity that appeared first. The objective of empirical validity is to assess whether the assessment technique covers all the aspects it is supposed to cover when participants answer it. This procedure is typically applied to the assessment of skills, abilities, and competencies.
Empirical validity is inappropriate for techniques whose content is not uniform. This is the case for clinical personality assessments that seek to uncover functions behind behaviors that vary between individuals. The same technique would need to measure different functions for different people. Under such conditions, the technique’s content cannot be validated by considering the technique’s items and the varying functions and behaviors being assessed. Apart from such cases, two methods are considered for empirical validity.
Content Validity
With content validity, we are interested in all aspects related to the content of the technique, and in particular whether the items of the technique are a representative sample of all the items that could have been presented to participants. Content validity is not measured but is assessed by experts who rate each item according to its representativeness in relation to the dimension being measured. The experts' judgments are then aggregated to attribute a value to each dimension. The items that are not representative of the concept being measured are disregarded.
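One common way to turn such expert ratings into a per-item value is Lawshe's content validity ratio (CVR); this particular index is an assumption, since the article does not name the aggregation method.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1, low values lead to discarding the item."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

# Hypothetical panel: 10 experts, 8 of whom judge the item representative of the dimension.
print(content_validity_ratio(n_essential=8, n_experts=10))  # 0.6
```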
Face Validity
With face validity, we are interested in what the technique seems to measure rather than what it really measures. Participants who answer the assessment are asked if the technique’s content seems related to the measurement’s objective. Face validity will influence the person's attitude towards the technique. If the technique has good face validity, one can expect better cooperation from participants to answer it.
Theoretical Validity
Theoretical validity examines and validates whether a measured dimension corresponds to the author's theory. The focus is on the dimensions or constructs being measured by the assessment technique. Theoretical or construct validity, whose first debates date back to the 1950s, highlights the importance for psychological theories of formulating hypotheses that can be verified by a rational and structured process. Several techniques are commented on below.
Contrasting Groups
The contrasting group method examines the extent to which the overall technique’s score differentiates groups. This procedure is suitable for the validation of personality assessments. For example, we could make the assumption that a company’s sales force has overall behavior adaptive profiles that are different from those of the accounting department. If the first group obtains significantly different results from the second, we can infer that the technique makes it possible to differentiate them. The contrasting group method also makes it possible to derive, from such groups, criteria that reflect the qualities of their members.
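A minimal sketch of the contrasting group comparison, here with an ordinary two-sample t-test on made-up overall scores for the two departments used in the example:

```python
import numpy as np
from scipy import stats

# Hypothetical overall scores for the two contrasting groups.
sales_force = np.array([68, 72, 75, 70, 66, 74, 71, 69, 73, 77])
accounting = np.array([58, 61, 55, 63, 60, 57, 62, 59, 56, 64])

t_stat, p_value = stats.ttest_ind(sales_force, accounting)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests the technique differentiates the groups
```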
Convergent/Divergent Validity
This method relies on the idea that if two supposedly independent techniques present a connection (shown by calculating a correlation coefficient), then these two techniques more or less measure the same dimension: this is convergent validity. Conversely, if two independent techniques are unrelated, they measure two different concepts: this is divergent validity.
Developmental Changes
This procedure is typically used with children and adolescents to validate dimensions against their chronological development, including that of their intelligence. This form of validation also applies to the analysis of adults’ behavioral adaptation in the workplace, but it is not used with personality traits in clinical assessment, where longitudinal studies are not the norm.
Correlation With Other Techniques
With this method, several techniques are taken by the same people, and correlations between the measures are calculated. High correlations would indicate that the techniques measure the same constructs. Low correlations would indicate the opposite, that the constructs are unrelated. Between those two extremes, the techniques’ constructs may partially share content.
Internal Consistency
With internal consistency, we analyze validity at the level of the items that make up the technique. This procedure verifies whether each item indeed measures the dimension to which it is related, and whether it makes it possible to differentiate people on this dimension. The correlation between the score for each item and the total score is calculated. One of the particularities of this procedure, commonly used for personality assessments, is that the criterion is none other than the total score of the technique. The measurement procedure is similar to that of homogeneity. In the absence of data external to the technique, this procedure contributes little to the understanding of the dimensions being measured.
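A sketch of the item-total correlation on synthetic responses; the "corrected" variant used here, which excludes the item from the total it is correlated with, is a common refinement assumed rather than stated in the article.

```python
import numpy as np
from scipy import stats

# Synthetic responses: 250 respondents x 12 items driven by one common trait.
rng = np.random.default_rng(3)
latent = rng.normal(size=(250, 1))
items = latent + rng.normal(size=(250, 12))
total = items.sum(axis=1)

for i in range(items.shape[1]):
    rest = total - items[:, i]                 # total score excluding the item itself
    r, _ = stats.pearsonr(items[:, i], rest)
    print(f"item {i + 1:2d}: corrected item-total r = {r:.2f}")
```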
Factor Analysis
Used to identify personality factors (Exploratory Factor Analysis, or EFA), in particular by following the lexical approach, this technique also makes it possible, in a validation situation (Confirmatory Factor Analysis, or CFA), to calculate the relationships between data. Factor analysis can reveal a strong relationship between certain clusters and suggest simplifications.
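A minimal sketch of an exploratory factor analysis on synthetic data with scikit-learn; it only illustrates the exploratory side, since confirmatory analysis is usually carried out with dedicated structural equation modeling tools.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: two latent factors generating eight observed variables.
rng = np.random.default_rng(4)
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 8))
observed = factors @ loadings + rng.normal(scale=0.5, size=(500, 8))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(observed)
print(np.round(fa.components_, 2))  # estimated loadings: clusters of variables on each factor
```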
Structural Equation Modeling
The use of this procedure has made it possible to identify redundancy between the constructs measured by the same assessment. Unlike other procedures, structural equation modeling makes it possible to better analyze the relationships between variables, as well as their reliability.
Criterion Validity
Criterion validity or criterion-related validity assesses how effectively the technique predicts a person's performance in a given activity by considering the value of a criterion. To measure criterion validity, a correlation coefficient is calculated between the technique’s score and the criterion score. Two forms of criterion validity should be distinguished.
Predictive Validity
The first form of criterion validity is the predictive-criterion validity or predictive validity. The measurement by the assessment technique precedes the measurement of the criterion. These are situations where we seek to forecast a person’s performance in a subsequent activity, based on measures from the past. Predictive validity is measured by the correlation coefficient between the assessment technique results and the values of the criterion in the job, once the person has been recruited. This validity is useful in situations of recruitment, career counselling and personal development.
Concurrent Validity
The second form of criterion validity is concurrent-criterion validity, or concurrent validity. The two measures are obtained simultaneously. In general, employees who are already in the job answer the assessment technique while the value of the criterion is obtained. This form of validity is useful when trying to analyze an existing situation rather than predicting an outcome. It makes it possible to construct the criterion with which a predictive validity procedure can then be conducted.
Criteria can take different forms. For intelligence assessment techniques, the criterion of passing exams is often taken into consideration. For competency, skills, and aptitude tests, the criterion is often the results obtained following specific training. In organizations, performance or non-performance criteria are often used, such as sales and production results or absenteeism. Precautions must be taken with regard to the homogeneity of the population being studied and the linearity of the relationship between the criterion and the dimensions being assessed. Validity studies are particularly interested in the formation of criteria, taking into account the multiple factors emerging from work situations.
Although criterion validity studies have long been regarded as important to prove a technique’s usefulness, their results can be misleading. They do not account for the many other variables that impact performance. They assume that the job is performed under similar conditions, including the reward system and the recruitment and management practices, which is never the case. Additionally, the sample must include employees who have been in the job long enough, yet those with poor performance or low satisfaction may have quit before the study’s end. At best, including with techniques measuring personality traits used in selection, criterion validity studies evidence a correlation of 0.2 to 0.4, which means that the error made in forecasting the criterion is reduced by only about 2 to 8 percent.
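The 2 to 8 percent figure matches the classical index of forecasting efficiency, E = 1 − √(1 − r²); assuming that is the statistic intended, the short check below reproduces it for r = 0.2 and r = 0.4.

```python
import math

def forecasting_efficiency(r: float) -> float:
    """Index of forecasting efficiency: proportional reduction in prediction error."""
    return 1 - math.sqrt(1 - r ** 2)

for r in (0.2, 0.4):
    print(f"r = {r}: prediction error reduced by {forecasting_efficiency(r):.1%}")  # ~2% and ~8%
```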
Nevertheless, criterion-related validity studies are performed on demand, especially when large numbers of candidates and employees are involved, to provide meaningful indicators. The results can then be combined with longitudinal analyses, both qualitative and quantitative, to better analyze the other variables at play and devise more practical, effective, and actionable plans.
Synthetic Validity
The generalization of aptitude and general mental ability (GMA) tests to similar positions shows results that are too variable or insufficient, compromising the generalization of the use of tests, in particular aptitude tests. From the mid-1970s, meta-analyses began to be used in psychology. New methodologies were recommended to work on larger population samples, since the number of people included in individual statistical studies had been shown to be insufficient to allow generalizations. The results obtained by combining several studies through meta-analysis and by working on larger samples allowed new generalizations, but generalizations of synthetic validity through meta-analyses need to be better understood. In particular, the environmental variables must be specified, including the criteria set by the organization for assessing the results. The sources of specificity also need to be better understood.
Work Relatedness
Work relatedness refers to the applicability of an assessment technique in the context of work and employment, as opposed to making use of the technique for applications in other fields such as forensics, clinical psychology, and psychiatry.
Work relatedness is a major aspect that qualifies a technique for use in the workplace, where behaviors are “normal,” even if sometimes extreme or off the charts, but do not require clinical attention. Work relatedness is generally covered in legislation, as for instance in US and French law.
In US law, under Title VII of the Civil Rights Act, the Equal Employment Opportunity Commission (EEOC) requires that information requested through the pre-employment process be limited to what is essential for determining whether a person is qualified for the job. If a screening practice disproportionately screens out protected groups, the employer must prove the practice is "job-related for the position in question and consistent with business necessity." Additionally, the Americans with Disabilities Act (ADA) contains the strictest "job-relatedness" requirement specifically regarding medical or disability-related inquiries. Under the provision 42 U.S.C. § 12112(d), an employer cannot ask disability-related questions or require medical exams before a job offer. After the job offer, such inquiries are allowed only if they are "job-related and consistent with business necessity."
In French law, two articles are specific to work relatedness. Article L 121-6 states: “The information requested in any form whatsoever from a job applicant or an employee can only be used to assess his ability to hold the job offered or his professional skills,” and, further, “this information must have a direct and necessary link with the job offered or with the assessment of professional aptitudes.” Article L 121-7 states: "Methods and techniques for assisting in the recruitment or assessment of employees and job candidates must be relevant to the purpose pursued.”
Non Discrimination
The aspect of non-discrimination is taken into account to protect certain minority, disadvantaged, or discriminated social groups. These aspects of non-discrimination are generally governed by social legislation. In the USA, non-discrimination is enforced by federal laws: in particular Title VII, dating from 1964 and amended in 1972, the Age Discrimination in Employment Act (ADEA) of 1967, and the Americans with Disabilities Act (ADA), which protects disabled people. The Equal Employment Opportunity Commission (EEOC), dating from 1964, is responsible for investigating and prosecuting companies in the event of non-compliance with the law.
In France, Article L 121-45 says: "No person may be excluded from a recruitment procedure, no employee may be sanctioned or dismissed because of their origin, their sex, their mores, their family situation, their belonging to an ethnic group, nation or race, political opinions, trade union or mutualist activities, religious beliefs or, except incapacity noted by the occupational physician within the framework of Title IV of Book II of this code, because of his medical condition or disability..."
From a practical standpoint, non-discrimination statistics are rarely available from publishers, but the legislation sets appreciable benchmarks for avoiding measurements inappropriate to a work environment.
The discussion on non-discrimination in the workplace opens up another discussion about the potential universality of the dimension being measured. This focus on universality emerged because factors from the lexical approach are, by construction, not related to gender, age, or other characteristics that identify protected groups. One only needs to be a real human being, not an avatar! As with measures of weight, temperature, or length, the values and averages may vary from one category of people to another, but those differences are irrelevant with regard to the characteristics expected in the job that can potentially be assessed by a technique, as long as one is a human being.