Psychometrics
Introduction
Assessment techniques rely on statistics for both their development and validation. Not only do these statistics inform the quality of the measures, but they also provide insight into their use, application, and potential benefits for users.
Quality metrics become even more crucial when they have long-term and significant impacts on decisions, people, and organizations. When the goal of using the technique is to enhance the performance of individuals and organizations—rather than solely focusing on clinical diagnosis and recruitment by clinician psychologists, as was traditionally the case—new criteria emerge, and older ones become less relevant.
The refutability of the hypotheses being tested by statistics, and the independence of those who conduct the statistics, are essential considerations. Errors can easily creep in when testing hypotheses, starting with the quality of the sample being used. Results presented in textbooks appear more objective when produced by an independent party, but ensuring the absence of political or commercial bias is challenging. Although numerous reference points for statistical practice exist, slow progress in the use of assessment techniques has not allowed psychometric standards to evolve to reflect new applications, as evidenced at GRI.
Some measurements are extremely complicated or even impossible to obtain, such as those assessing reliability over time, which assume that the same technique will be administered to the same people several years apart with no other conditions changed. Even in very structured administrative organizations, conditions are constantly changing. Comparing the performance of candidates who were retained with that of candidates who were not, to demonstrate that the former performed better, is simply impossible. Additionally, performance values are not always readily obtainable.
Despite the difficulties in calculating performance metrics, it is generally accepted that assessment techniques must possess a minimum set of qualities before they are used, including in specific applications such as recruitment. The following groups of statistics are generally considered to measure an assessment technique’s performance:
- Basic metrics include information on reference values, scales, and the sensitivity of the technique in evaluating individual differences.
- Reliability assesses the accuracy of the technique over time.
- Validity assesses the quality of what is being measured, as well as its relevance.
- Work-Relatedness for measures to be used in “normal” work environments, as opposed to being used in clinical applications for diagnosis and curative treatments.
- Non-discrimination vis-à-vis protected social groups, which also includes information about a measure’s universal scope.
Each of these five groups is discussed in turn below.
Basic Metrics
Norms
Assessments all have a reference value to which the measures can be compared. This value is called an indicator, or sometimes the norm. For instance, in temperature measurement, absolute zero serves as the reference. For measures of distance, weight, speed, and force, we have reference values. The same goes for psychometrics. When measuring traits, null values are often used as the reference. In typologies, the reference value is more abstract and implicit; it concerns having the value, not having it, or its opposite. For factors used by the adaptive profiles, the reference value is the average, the value between the extreme high and the extreme low.
Scales
Scales inform about the intensity of the dimension being measured. The scales depend on the type of measurement being used: decile and value-based scales for traits, dichotomous or ranking scales for simple or multiple typologies, and standard-deviation scales for factors.
Depending on the scale and measurement system used, the assessment technique will be more or less capable of accounting for the adaptation and development of the behaviors being described[1].
Standard error of measurement (SEM)
The standard error of measurement (SEM) quantifies the expected margin of error for an individual score, given the assessment's imperfect reliability. The SEM reflects the confidence with which a person’s “true” score can be said to lie within a particular range of observed scores. The smaller the SEM, the more accurate the measurement.
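As an illustration, here is a minimal sketch of the classical computation, assuming the standard formula SEM = SD × √(1 − r), where r is a reliability coefficient; the numbers are hypothetical.

```python
import math

# Hypothetical values for one scale of an assessment
sd = 10.0           # standard deviation of observed scores
reliability = 0.85  # e.g., a test-retest or internal-consistency coefficient

# Classical test theory: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)

# Approximate 95% band around an observed score of 60
observed = 60
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band for {observed}: [{low:.1f}, {high:.1f}]")
```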
Sensitivity
An assessment technique is said to be sensitive if it enables discrimination and differentiation among individuals. Conversely, a technique that states substantially the same thing about many people, without nuance, is insensitive. To measure sensitivity, we study the distribution of dimension scores in a sufficiently large, representative sample of the population. In theory, one should find at least one person at each score. Several indices should be considered: the mean, the standard deviation, the median, the index of symmetry (skewness), and the flattening of the distribution (kurtosis).
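A minimal sketch of how these indices might be computed for a sample of scores; the simulated data and the use of SciPy are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy import stats

# Hypothetical dimension scores from a large, representative sample
rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=1000)

print("mean    :", round(np.mean(scores), 2))
print("std     :", round(np.std(scores, ddof=1), 2))
print("median  :", round(np.median(scores), 2))
print("skewness:", round(stats.skew(scores), 2))      # index of symmetry
print("kurtosis:", round(stats.kurtosis(scores), 2))  # flattening of the distribution
```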
Reliability
An assessment technique is said to be reliable when, applied repeatedly to the same individuals under the same conditions, it yields similar results. Some speak of stability or fidelity rather than reliability.
The reliability of the technique must be differentiated from the reliability of the constructs being measured. An assessment technique that does not yield similar results on different days would not be considered reliable. Conversely, the construct measured may be stable over several years, yet its measurement may be inaccurate in the short term. On the basis of this criterion alone, some assessments should be avoided for recruitment or consulting but can be used, for example, in team dynamics.
The reliability of an assessment technique is an essential consideration when assessing criterion-based validity.
Reliability is measured by calculating correlation coefficients between two series of measurements. The technique is considered reliable when the correlation coefficient is high (close to 1). In practice, measuring reliability amounts to estimating the proportion of the variance in results that is attributable to deviation (or “error” in a broad sense). The discrepancy may arise from multiple sources: insufficiently standardized test conditions, environmental changes, or changes in the person's characteristics. Four methods for measuring the reliability of an assessment technique are commonly used, each providing information of a significantly different nature.
Test-retest method
The same people are asked to answer the same technique twice in a row, under the same conditions. The interval between the two administrations can vary widely, ranging from a few days or weeks to several years. The concept of reliability over time can be restricted to the assessment technique and the measuring tool, and not extended to people; the aim is to evaluate not the stable characteristics of the person but the capacity of the measurement tool to provide consistent information from one moment to the next. When the two administrations are a few days or weeks apart, it is helpful to assess whether the information provided during the first administration may have influenced the second, or whether the technique is too sensitive to environmental variation.
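For illustration, a sketch of the test-retest calculation as a simple correlation between two administrations of the same technique; the scores are invented.

```python
import numpy as np

# Hypothetical scores for the same ten people, a few weeks apart
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])
time2 = np.array([13, 14, 10, 19, 18, 12, 15, 17, 11, 15])

# Test-retest reliability: correlation between the two administrations
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")
```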
Split-Half Method
With the split-half technique, the assessment is divided into two halves to obtain two distinct scores, and the correlation between them is calculated. There are several splitting methods: items answered first vs. items answered second, or odd items vs. even items. The correlation between the two scores constitutes an index of the reliability of the overall score. By splitting the assessment into two parts, one can determine whether the two halves measure the same quantity. On the other hand, the reliability measure covers only half of the assessment (the shorter the technique, the greater the likelihood that reliability is reduced).
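A sketch of an odd/even split, with the Spearman-Brown correction commonly used to compensate for the half-length problem noted above; the correction and the simulated item data are assumptions, not prescribed by the text.

```python
import numpy as np

# Hypothetical item scores: a shared "true" component plus item-specific noise
rng = np.random.default_rng(1)
true = rng.normal(size=(200, 1))
items = true + rng.normal(scale=1.0, size=(200, 20))  # 200 respondents, 20 items

odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown: estimate full-length reliability from the half-test correlation
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```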
Covariance Method
With the covariance method, one seeks to evaluate the consistency of the responses to the technique’s items. Inter-item consistency is influenced by two sources: item sampling and the heterogeneity of the groups of behaviors being measured. The more homogeneous the groups of behaviors, the more consistent the items are across them. This involves examining the performance of each item: one can thus determine whether each item measures the same dimension as the other items in the technique. To this end, homogeneity coefficients are calculated. When the scores are dichotomous (passed/failed), the Kuder-Richardson index is calculated; when they are continuous (scored on several points), Cronbach's alpha is calculated.
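A minimal sketch of Cronbach's alpha computed from an item-response matrix; the data are simulated, and the formula is the standard one.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = respondents, columns = items (continuous scores)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: a shared "true" component plus item-specific noise
rng = np.random.default_rng(2)
true = rng.normal(size=(300, 1))
items = true + rng.normal(scale=1.0, size=(300, 10))
print(f"alpha = {cronbach_alpha(items):.2f}")
```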
Method of parallel forms
Using the parallel forms method, two equivalent versions of the same evaluation technique are constructed. The content, form, and difficulty must be comparable. The same individuals then complete both versions, and the correlation coefficient between the two measurements is calculated. If the correlation between the two measures is high, the person's score can be generalized to other versions of the technique.
Parallel forms of the same technique enable the observation or neutralization of certain biases, such as practice effects with the assessment technique or habituation. In practice, the parallel-forms method is rarely used because it is challenging to construct two equivalent forms of the same assessment technique.
General Reliability
The results of various reliability calculations can be combined into a single measure, the generalized reliability coefficient. This coefficient enables the extraction of the portion of variance attributable solely to the assessment technique, independent of other sources such as time, administration conditions, or residual item-level inconsistency. Generalized reliability can be used in criterion-related validity calculations and in meta-analyses.
Validity
An assessment technique is valid if it measures what it is supposed to measure. We are interested in the content, quality, and relevance of the assessment. A validity analysis is typically conducted after considering basic metrics, including sensitivity and reliability calculations, which are necessary but not sufficient. There are four broad categories of validity:
- Empirical validity, which includes content validity and face validity
- Theoretical validity, which includes, among others, the method of contrasting groups, convergent/divergent validity, and internal consistency
- Criterion validity, which includes predictive-criterion validity and concurrent-criterion validity
- Synthetic validity, which brings the other forms of validity together
Empirical Validity
Empirical validity is historically the form of validity that appeared first. The objective of empirical validity is to assess whether the assessment technique encompasses all aspects intended to be addressed when participants respond to it. This procedure is typically used to assess skills, abilities, and competencies.
Empirical validity is inappropriate for techniques whose content is not uniform. This is the case for clinical personality assessment, which seeks to uncover the functions underlying behaviors that vary across individuals. The same technique would need to measure different functions across individuals. Under such conditions, the technique’s content cannot be validated by examining its items alone or by considering the varying functions and behaviors being assessed. Outside such cases, two methods are used to establish empirical validity.
Content Validity
With respect to content validity, we are interested in all aspects of the technique's content, and in particular whether the items of the technique constitute a representative sample of all items that could have been presented to participants. Content validity is not measured directly; instead, experts rate each item on how representative it is of the dimension under study. The experts' judgments are then aggregated to assign a value to each dimension. Items that are not representative of the concept being measured are excluded.
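The text does not prescribe a particular index for aggregating the experts' judgments; as one common illustration (an assumption here), Lawshe's content validity ratio summarizes how many panel members judge an item essential.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

# Hypothetical panel: 8 of 10 experts rate an item as essential to the dimension
print(content_validity_ratio(8, 10))  # 0.6 -> the item would typically be kept
```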
Face Validity
From a face-validity perspective, we are interested in what the technique appears to measure rather than what it actually measures. Participants who complete the assessment are asked whether the technique’s content seems to be related to the measurement’s objective. Face validity will influence an individual's attitude toward the technique. If the technique has good face validity, one can expect greater cooperation from participants in responding to it.
Theoretical Validity
Theoretical validity assesses whether a measured dimension aligns with the author's theory. The focus is on the dimensions or constructs being measured by the assessment technique. Theoretical or construct validity, whose first debates date back to the 1950s, underscores the importance for psychological theories of formulating hypotheses that can be verified through a rational and structured process. Several techniques are commented on below.
Contrasting Groups
The contrasting group method examines the extent to which the overall technique’s score differentiates groups. This procedure is suitable for validating personality assessments. For example, we could assume that a company’s sales force has overall behavioral profiles that differ from those of the accounting department. If the first group yields significantly different results from the second, we can infer that the technique enables differentiation between the two groups. The contrasting group method thus provides validity evidence from groups whose composition reflects the qualities of their members.
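A sketch of the comparison described above, using an independent-samples t-test between two hypothetical departments; the data and the choice of test are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical scores on one behavioral dimension for two contrasting groups
rng = np.random.default_rng(3)
sales = rng.normal(loc=65, scale=10, size=40)
accounting = rng.normal(loc=55, scale=10, size=35)

# Does the dimension differentiate the two groups?
t, p = stats.ttest_ind(sales, accounting, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```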
Convergent/Divergent Validity
This method highlights that if two ostensibly independent techniques show a correlation (as measured by a correlation coefficient), they essentially measure the same dimension: this is convergent validity. On the other hand, if two independent techniques measure different concepts, they exhibit divergent validity[2].
Developmental Changes
This procedure is typically used with children and adolescents to validate their chronological development, including their intelligence. This form of validation also applies to the analysis of adults’ behavioral adaptation in the workplace. Still, it is not used in clinical assessment of personality traits, where longitudinal studies are not the norm.
Correlation With Other Techniques
With this method, the same individuals answer several techniques, and correlations among the measures are calculated. High correlations indicate that the techniques measure the same constructs. Low correlations would suggest the opposite, that the constructs are unrelated. Between those two extremes, the techniques’ construct may partially share content.
Internal Consistency
Using internal consistency, we analyse the validity at the level of the items that comprise the technique. This procedure verifies whether each item indeed measures the dimension to which it is related and whether it enables differentiation among individuals in this dimension. The correlation between each item's score and the total score is calculated. One particularity of this procedure, commonly used for personality assessments, is that the criterion is the total score of the technique. The measurement procedure is similar to that of homogeneity. In the absence of external data, this procedure contributes little to understanding the dimensions being measured.
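A sketch of the item-level check described above, using a corrected item-total correlation (each item is removed from the total before correlating, a common refinement assumed here); the data are simulated.

```python
import numpy as np

# Hypothetical item scores: a shared component plus item-specific noise
rng = np.random.default_rng(4)
true = rng.normal(size=(300, 1))
items = true + rng.normal(scale=1.2, size=(300, 8))

total = items.sum(axis=1)
for j in range(items.shape[1]):
    rest = total - items[:, j]  # total score without item j
    r = np.corrcoef(items[:, j], rest)[0, 1]
    print(f"item {j + 1}: corrected item-total r = {r:.2f}")
```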
Factor Analysis
Used to identify personality factors (Exploratory Factor Analysis, or EFA), particularly by following the lexical approach, this technique, in a validation context (Confirmatory Factor Analysis, or CFA), enables the calculation of relationships among variables. Factor analysis can reveal strong relationships among specific clusters and suggest simplifications.
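A minimal exploratory sketch using scikit-learn's FactorAnalysis; the two-factor structure and the simulated responses are assumptions for illustration only, not a description of any particular instrument.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate responses driven by two latent factors, each loading on five items
rng = np.random.default_rng(5)
factors = rng.normal(size=(500, 2))
loadings = np.zeros((2, 10))
loadings[0, :5] = 0.8
loadings[1, 5:] = 0.8
X = factors @ loadings + rng.normal(scale=0.5, size=(500, 10))

# Exploratory factor analysis with two factors
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
print(np.round(fa.components_, 2))  # estimated loadings, one row per factor
```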
Structural Equation Modeling
The use of this procedure has enabled the identification of redundancy among constructs measured by the same assessment. Unlike other methods, structural equation modeling enables more rigorous analysis of relationships among variables and their reliability.
Criterion Validity
Criterion validity, or criterion-related validity, assesses how well a technique predicts a person's performance in a given activity, as measured by the value of a criterion. To measure criterion validity, a correlation coefficient is calculated between the technique’s score and the criterion score. Two forms of criterion validity should be distinguished.
Predictive Validity
The first form of criterion validity is predictive criterion validity, or predictive validity. The assessment technique precedes the measurement of the criterion. These are situations in which we seek to forecast a person’s performance in a subsequent activity based on measures from the past. Predictive validity is measured by the correlation coefficient between the assessment technique results and the criterion values in the job, after the person has been recruited. This validity applies to recruitment, career counselling, and personal development.
Concurrent Validity
The second form of criterion validity is concurrent-criterion validity, or concurrent validity. The two measures are obtained simultaneously. In general, employees already in the job respond to the assessment technique while obtaining the criterion value. This form of validity is useful when analyzing an existing situation rather than predicting an outcome. It enables the construction of a criterion against which a predictive validity procedure can then be conducted.
Criteria can take different forms. In intelligence assessment, the criterion of passing exams is often considered. For competency, skills, and aptitude tests, the results are usually based on specific training. With organizations, performance or non-performance criteria are frequently used, such as sales and production results or absenteeism. Precautions must be taken regarding the homogeneity of the study population or the linearity of the relationship between the criterion and the dimensions being assessed. Validity studies are particularly interested in the formation of criteria, taking into account the multiple factors emerging from work situations[3].
Although criterion validity studies have long been regarded as essential to prove a technique’s usefulness, their results can be misleading. They do not account for the many other variables that affect performance. They assume that the job is performed under comparable conditions, including the reward system and recruitment and management practices, which is never the case. Additionally, the sample must consist of employees who have stayed in the job long enough, yet poorly performing or dissatisfied employees may have quit before the study’s end. In the best cases, including studies using personality-trait measures in selection, criterion validity studies report correlations of 0.2 to 0.4, meaning that the error in forecasting the criterion is reduced by only about 2 to 8 percent.
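One way to read the 2-to-8-percent figure (an interpretive assumption, not spelled out in the text) is through the classical index of forecasting efficiency, E = 1 − √(1 − r²), the proportional reduction in the error of prediction:

```python
import math

# Index of forecasting efficiency for the correlations cited above
for r in (0.2, 0.4):
    efficiency = 1 - math.sqrt(1 - r ** 2)
    print(f"r = {r}: prediction error reduced by {efficiency:.1%}")
# r = 0.2 -> about 2%; r = 0.4 -> about 8%
```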
Nevertheless, criterion-related validity studies are conducted on demand, particularly when large numbers of candidates and employees are involved, to provide meaningful indicators. The results can then be combined with longitudinal analyses, both qualitative and quantitative, to better examine the other variables at play and to devise more practical, effective, and actionable plans.
Synthetic Validity
The generalization of aptitude and general mental ability (GMA) test results to similar positions yields results that are too variable or insufficient, compromising the generalizability of test use, particularly for aptitude tests. Since the mid-1970s, meta-analyses have been used in psychology[4]. New methodologies were recommended for working with larger samples, because the number of people included in individual statistical studies is insufficient to support generalizations[5]. The results obtained by combining several studies through meta-analysis and by working with larger samples enabled new generalizations[6]; however, generalizations of synthetic validity from meta-analyses require further clarification. In particular, the environmental variables must be specified, including the criteria set by the organization for assessing the results. The sources of specificity also need to be better understood.[7]
Work-Relatedness
Work-relatedness refers to the applicability of an assessment technique in the context of work and employment, as opposed to its use for other applications in fields such as forensic science, clinical psychology, and psychiatry.
Work-relatedness is a significant criterion for classifying a technique for use in the workplace; behaviors are considered “normal” even when occasionally outside the norm or extreme, provided they do not require clinical attention. Work-relatedness is generally covered in legislative codes, as in U.S. and French law.
In the US, under Title VII of the Civil Rights Act, the Equal Employment Opportunity Commission (EEOC) advises that information requested in the pre-employment process should be limited to what is essential to determining whether a person is qualified for the job. If a screening practice disproportionately screens out protected groups, the employer must prove the practice is "job-related for the position in question and consistent with business necessity." Additionally, the Americans with Disabilities Act (ADA) contains the strictest "job-relatedness" requirement with respect to medical or disability-related inquiries. Under 42 U.S.C. § 12112(d), an employer may not ask disability-related questions or require medical examinations before making a job offer. After the job offer, such inquiries are allowed only if they are "job-related and consistent with business necessity."
In French law, two articles are specific to work-relatedness. Article L 121-6: “The information requested in any form whatsoever from a job applicant or an employee can only be used to assess his ability to hold the job offered or his professional skills. And on the other hand, this information must have a direct and necessary link with the job offered or with the assessment of professional aptitudes.” And Article L 121-7: "Methods and techniques for assisting in the recruitment or assessment of employees and job candidates must be relevant to the purpose pursued.”
Non-Discrimination
The principle of non-discrimination is invoked to protect minority, disadvantaged, or discriminated-against social groups. These aspects of non-discrimination are generally governed by social legislation.
In the USA, non-discrimination is enforced by federal laws, including Title VII, dating from 1964 and amended in 1972; the Age Discrimination in Employment Act (ADEA), dating from 1967; and the Americans with Disabilities Act (ADA). The Equal Employment Opportunity Commission (EEOC), established in 1964, is responsible for investigating and prosecuting employers for noncompliance with the law.
In France, Article L 121-45 says: "No person may be excluded from a recruitment procedure, no employee may be sanctioned or dismissed because of their origin, their sex, their mores, their family situation, their belonging to an ethnic group, nation or race, political opinions, trade union or mutualist activities, religious beliefs or, except incapacity noted by the occupational physician within the framework of Title IV of Book II of this code, because of his medical condition or disability..."
From a practical standpoint, non-discrimination statistics are rarely available from publishers, but legislation sets clear benchmarks to prevent measurements inappropriate to the workplace.
The discussion of non-discrimination in the workplace opens another debate about the potential universality of the dimension being measured. This focus on universality emerged because factors from the lexical approach are, by construction, not related to gender, age, or other characteristics that identify protected groups. One only needs to be a real human being, not an avatar! As with measures of weight, temperature, or length, values and averages may vary across groups, but such differences are irrelevant to the characteristics expected for the job, which can be assessed with the technique provided the individual being assessed is a human being.
Notes
1. See more here on GRI's wiki about scales and intensity.
2. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
3. McCormick, E. J. (1979). Job analysis: Methods and applications. New York: AMACOM.
   McCormick, E. J. (1983). Job and task analysis. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 651-696). New York: Wiley.
   Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1, pp. 687-732). Palo Alto, CA: Consulting Psychologists Press.
   Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
4. Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
   Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.
   Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.
5. Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
   Schmidt, F. L., Gast-Rosenberg, L., & Hunter, J. E. (1980). Validity generalization results for computer programmers. Journal of Applied Psychology, 65, 643-661.
6. Salgado, J. F., Ones, D. S., & Viswesvaran, C. (2001). Predictors used for personnel selection: An overview of constructs, methods and techniques. In N. Anderson, D. S. Ones, H. K.
7. Algera, J. A., Jansen, P. S., Roe, R. A., & Vijn, P. (1984). Validity generalization: Some critical remarks on the Schmidt-Hunter procedure. Journal of Occupational Psychology, 57, 197-210.