Standard-setting study

A standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for a test. To be legally defensible in the US, particularly for high-stakes assessments, and to meet the Standards for Educational and Psychological Testing, a cutscore cannot be determined arbitrarily; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates the classifications of examinees, such as competent vs. incompetent. Such studies require substantial resources and involve a number of professionals, in particular those with a psychometric background. Standard-setting studies are for that reason impractical for regular classroom situations, yet standard setting is performed at every level of education, and multiple methods exist.

Standard-setting studies are typically performed using focus groups of 5 to 15 subject-matter experts (SMEs) who represent key stakeholders for the test. For example, in setting cut scores for educational testing, the experts might be instructors familiar with the capabilities of the student population taking the test.

Types of standard-setting studies

Standard-setting studies fall into two categories: item-centered and person-centered. Examples of item-centered methods include the Angoff, Ebel, Nedelsky,^[1] Bookmark, and ID Matching methods, while examples of person-centered methods include the Borderline Survey and Contrasting Groups approaches. The categories reflect the focus of the analysis: in item-centered studies, the organization evaluates items with respect to a given population of persons, and in person-centered studies it evaluates persons with respect to a given set of items.

Item-centered studies are related to criterion-referenced tests and to norm-referenced tests.

Item-centered studies

Angoff Method^[2] (item-centered): This method requires the assembly of a group of subject matter experts (SMEs), who are asked to evaluate each item and estimate the proportion of minimally competent examinees who would answer the item correctly. The ratings are averaged across raters for each item and then summed to obtain a panel-recommended raw cutscore. This cutscore represents the score that the panel estimates a minimally competent candidate would achieve. The estimate is subject to decision biases such as overconfidence bias, so calibration against other, more objective sources of data is preferable. Several variants of the method exist.
Modified Angoff Method (item-centered): Subject matter experts are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to estimate, for each question, the proportion of borderline or "minimally acceptable" participants whom they would expect to answer the question correctly. The estimates are generally expressed as proportions, like item p-values (e.g., 0.6 for item 1 means that 60% of borderline passing participants would answer this question correctly). Several rounds are generally conducted, with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance on each question, other SMEs' estimates, etc.). The final determination of the cut score is then made (e.g., by averaging the estimates or taking the median), and is often documented in a report along with secondary results such as the inter-rater reliability or the Beuk compromise. Software programs are typically used to calculate these.^[3] This method is generally used with multiple-choice questions.
Dichotomous Modified Angoff Method (item-centered): In the dichotomous modified Angoff approach, instead of providing difficulty-type statistics (typically p-values), SMEs are asked simply to provide a 0 or 1 for each question ("0" if a borderline acceptable participant would get the question wrong and "1" if a borderline acceptable participant would get the question right).
Nedelsky Method (item-centered): SMEs decide, question by question, which of the question's distractors they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
Bookmark Method (item-centered): Items in a test (or a representative subset of items) are ordered by difficulty (e.g., IRT response probability value) from easiest to hardest. SMEs place a "bookmark" in the "ordered item booklet" such that a student at the threshold of a performance level would be expected to respond successfully to the items prior to the bookmark with a likelihood equal to or greater than the specified response probability value (and with a likelihood less than that value for items after the bookmark). For example, for a response probability of 0.67 (RP67), SMEs would place a bookmark such that an examinee at the threshold of the performance level would have at least a 2/3 likelihood of success on items prior to the bookmark and less than a 2/3 likelihood of success on the items after the bookmark. This method is considered efficient for setting multiple cut scores on a single test and can be used with tests composed of multiple item types (e.g., multiple-choice, constructed-response, etc.).^[4]^[5]^[6]
Item-Descriptor (ID) Matching^[7] (item-centered): ID Matching combines (a) the advantages of the Bookmark method, that is, the ordered item book and the information about empirical item difficulty conveyed in that ordering, with (b) a hypothesized lower cognitive complexity and cognitive load than other methods. The lower load is attributed to three features: no error-prone probability judgments are required;^[8] the task is to match the features of items to features of achievement level descriptions, which is well suited to people in general^[9] and particularly to the knowledge and expertise of educators; and there is no need to hold a borderline examinee in mind while making the cut score judgment.
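
The arithmetic behind several of the item-centered methods above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample ratings are hypothetical, the Nedelsky function assumes the usual convention that a borderline examinee guesses at random among the options that remain, and the Bookmark function assumes a Rasch (one-parameter logistic) model, which the text above does not specify. Operational studies differ in how they convert a bookmark placement into a cut score.

```python
import math
from statistics import mean, median


def angoff_cutscore(ratings, aggregate=mean):
    """Raw cutscore from Angoff-style ratings.

    ratings[r][i] is rater r's estimate (0..1) of the proportion of
    minimally competent examinees who answer item i correctly. For the
    dichotomous variant the entries are simply 0 or 1.
    Ratings are aggregated across raters per item, then summed.
    """
    n_items = len(ratings[0])
    per_item = [aggregate([rater[i] for rater in ratings]) for i in range(n_items)]
    return sum(per_item)


def nedelsky_cutscore(remaining_options):
    """Raw cutscore from Nedelsky judgments.

    remaining_options[i] is the number of answer options (including the
    correct one) left for item i after removing the distractors that a
    borderline examinee could eliminate. Assumption: the examinee then
    guesses at random, so the expected item score is 1 / remaining.
    """
    return sum(1 / k for k in remaining_options)


def bookmark_theta(difficulty, rp=0.67):
    """Ability at which an item of the given difficulty is answered
    correctly with probability rp. Assumption: Rasch model."""
    return difficulty + math.log(rp / (1 - rp))


# Hypothetical panel: 3 raters, 4 items.
ratings = [
    [0.60, 0.80, 0.50, 0.90],
    [0.70, 0.70, 0.40, 0.90],
    [0.50, 0.90, 0.60, 0.90],
]
print(angoff_cutscore(ratings))                    # mean across raters
print(angoff_cutscore(ratings, aggregate=median))  # median across raters
print(nedelsky_cutscore([2, 4, 1, 2]))
print(bookmark_theta(0.5, rp=0.67))
```

Passing the aggregation function as a parameter mirrors the choice, mentioned above, between averaging the estimates and taking the median; the same function covers the dichotomous variant because 0/1 judgments are just a special case of the ratings.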

Person-centered studies

Rather than evaluating the items that distinguish competent candidates, person-centered studies evaluate the examinees themselves. While this might seem more appropriate, it is often more difficult because examinees, unlike a list of items, are not a captive population. For example, if a new test is released covering new content (as often happens in information technology testing), the test could be given to an initial sample, called a beta sample, along with a survey of professional characteristics. The testing organization could then analyze the relationship between the test scores and important characteristics such as skills, education, and experience. The cutscore could be set as the score that best differentiates between those examinees characterized as "passing" and those characterized as "failing."

Borderline groups method (person-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
Contrasting groups method (person-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants, and the test score distributions of the categorized groups are compared. The cut score is located where the distributions of the contrasting groups intersect. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
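
Both person-centered computations can be sketched as follows. The function names and score lists are hypothetical. For contrasting groups, the sketch does not fit and intersect two distributions; it swaps in a simpler stand-in that is often used for the same purpose, choosing the observed score that misclassifies the fewest examinees when everyone scoring at or above the cut passes.

```python
from statistics import median


def borderline_cutscore(borderline_scores):
    """Cut score as the median test score of the borderline group."""
    return median(borderline_scores)


def contrasting_groups_cutscore(competent_scores, not_competent_scores):
    """Observed score that misclassifies the fewest examinees.

    An examinee passes when score >= cut. Ties are resolved in favor
    of the lowest candidate cut (an arbitrary choice for this sketch).
    """
    candidates = sorted(set(competent_scores) | set(not_competent_scores))

    def errors(cut):
        # Competent examinees who would fail at this cut.
        false_fail = sum(s < cut for s in competent_scores)
        # Not-competent examinees who would pass at this cut.
        false_pass = sum(s >= cut for s in not_competent_scores)
        return false_fail + false_pass

    return min(candidates, key=errors)


# Hypothetical scores for illustration only.
print(borderline_cutscore([58, 61, 63, 66, 70]))
print(contrasting_groups_cutscore(
    competent_scores=[60, 70, 72, 75, 80],
    not_competent_scores=[45, 52, 58, 62, 65],
))
```

Weighting the two error types equally is itself a policy decision; a sponsor that considers a false pass more costly than a false fail could weight `false_pass` more heavily, which would move the cut upward.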

References

[1] Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3–19.
[2] Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting Performance Standards (pp. 19–52). Mahwah, NJ: Lawrence Erlbaum Associates.
[3] Assessment Systems Corporation: Angoff Analysis Tool (free software). https://assess.com/angoff-analysis-tool/
[4] Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ.
[5] Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. In G. J. Cizek (Ed.), Setting Performance Standards: Concepts, Methods, and Perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
[6] Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The Bookmark Standard Setting Procedure. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
[7] Ferrara, S., & Lewis, D. (2012). The Item-Descriptor (ID) Matching method. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations (2nd ed., pp. 255–282).
[8] Nickerson, R. S. (2005). Cognition and Chance: The Psychology of Probabilistic Reasoning. Mahwah, NJ: Lawrence Erlbaum Associates.
[9] Murphy, G. L. (2002). The Big Book of Concepts. Cambridge, MA: The MIT Press.
