Skip to main content

Luz Bay

Followers

33

Following

5

Public Views

Interests

Uploads

Papers by Luz Bay

The Impact of Generating Model on Preknowledge Detection in CAT

Springer proceedings in mathematics & statistics, 2024

Setting Performance Standards Using the Body of Work Method

Routledge, May 13, 2013

IDENTIFIERS Low Stakes Tests; New Hampshire; *New Hampshire Educational Improve and Assess Prog; *Repeating Response Tendency

In state assessment programs in which performance has no real immediate consequence for the indiv... more In state assessment programs in which performance has no real immediate consequence for the individual examinee, the issue of examinee motivation arises. Some examinees may respond to questions in ways that do not reflect their real knowledge of the test domain. In this-study, a new approach was developed to identify students who have responded to questions in a way that does not reflect their knowledge adequately. The new method was designed to detect examinees that responded according to some repeating pattern. The new method was then compared to two indices previously developed by Drasgow, Levine, and Williams (1985) using data from New Hampshire's statewide mathematics assessment for 16,630, 16,387, and 14,445 students from grades 3, 6, and 10 respectively. These indices were found to be distributed in a manner that was more or less expected, something that suggests that the indices may be useful in detecting examinees that are responding in a way not in accord with the und...

Classifying Student Performance as a Method for Setting Achievement Levels for NAEP Writing

One way standard setting methods can be distinguished is by whether they are test-centered or exa... more One way standard setting methods can be distinguished is by whether they are test-centered or examinee-centered (Jaeger, 1989; Kane, 1998). In test-centered methods judges decide on a level of performance that is considered just adequate for the standard to be met on each item or task in the test. The item-level judgments are combined to obtain a set of performance level cutpoints for the test as a whole. In examinee-centered methods judges categorize examinees directly into performance levels using deflnitions of adequate performance for each level and information about the level of achievement of each examinee. Test-centered methods have been used for setting NAEP standards. This paper examines an examinee-centered standard setting method (Booklet Classiflcation) as a possible standard setting method for the 1998 NAEP Writing assessment. The study described in this paper is the second NAEP 1998 Writing fleld trial for achievement levels-setting. There are three NAEP achievement le...

Findings Regarding Issues in NAEP Standard Setting...........................................................................vi

Background Information for Field Trials Research .

This study was funded by the National Assessment Governing Board under Contract ED-NAG-10-C-0003.... more

Judgmental Standard Setting for Academic Preparedness 5...................................................................... The Bookmark Standard-‐Setting Methodology

Measured Progress iiNational Assessment Governing Board BOARD MEMBERSHIP

Psychometric properties of scale scores and performance levels for performance assessments using ... more

Detection of Cheating on Multiple-Choice Examinations

An index is proposed to detect cheating on multiple-choice examinations, and its use is evaluated... more An index is proposed to detect cheating on multiple-choice examinations, and its use is evaluated through simulations. The proposed index is based on the compound binomial distribution. In total, 360 simulated data sets reflecting 12 different cheating (copying) situations were obtained and used for the study of the sensitivity of the index in detecting cheating and the conditions that affect its effectiveness. A computer program in C language was written to analyze each data set. The simulated data sets were also used to compare an index developed by R. Frary and others (1977) and error-similarity analysis (F. Belleza and S. Belleza, 1989). In general, the new index was effective in detecting cheaters as long as enough items were copied. It was sensitive enough to detect cheating when between 25 and 50% of the items were copied in a 50-item test, but was less sensitive when the test was shorter. It was also less sensitive when there were fewer cheaters in a class. Although effectiveness is influenced by test length, it is not influenced by class size. Similarities and differences among the three indexes are discussed. (Contains 2 tables, 11 figures, and 14 references.) (SLD)

Comparing Student Performance on Different Item Formats Relative to Achievement Levels Cutpoints

ACT's NAEP Redesign Project: Assessment Design Is the Key to Useful and Stable Assessment Results. Working Paper Series

This report presents an investigation by the American College Testing Program (ACT) of an alterna... more This report presents an investigation by the American College Testing Program (ACT) of an alternative design for the National Assessment of Educational Progress (NAEP). The proposed design greatly simplifies the data collection and analysis procedures needed to produce assessment results and has the potential to produce results that are more timely and easier to interpret. The plan calls for developing individual NAEP forms, where each individual form represents, as closely as possible, the assessment questions from the domain of knowledge being measured by an NAEP construct. Sets of these forms could be administered, in random order, to students in the schools. This would replace the balanced incomplete block design (BIB) currently used. Assessments constructed under the BIB design do not closely represent, at least for the 1996 science assessment, the content framework. Enhanced procedures are also suggested for developing precise content and statistical specifications for individ...

Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests by Cizek, G. J., & Bunch, M. B

Journal of Educational Measurement, 2010

is a must read for practitioners who use item response theory to calibrate test data. It also wou... more is a must read for practitioners who use item response theory to calibrate test data. It also would serve as a tremendous resource for measurement researchers who daily navigate the circuitous paths of various IRT estimation software programs to analyze and understand their assessment data. The book is part of the Methodology in the Social Sciences series of books designed specifically for applied researchers and students. Books in this series emphasize the illustration of methodology through the interpretation of computer output rather than emphasizing statistical theory. Following this format, each of the 12 chapters is packed with annotated examples of how to use IRT estimation software and the subsequent output. Each of the 12 chapters begins with a brief introduction followed by a thorough discussion of a particular IRT model or IRT application and concludes with a summary and additional set of footnotes that provide further insight and sources into the content covered in the chapter. The author does an excellent job of supplementing explanations of various models with calibration examples and output of multiple data sets using several different IRT calibration software programs including BILOG, MULTILOG, BIGSTEPS, and NOHARM. In all cases the input code and abridged output are provided. Several example data sets are used throughout the text. In some cases different models are used on the same data set and the goodness-of-fit of the various models is examined. This book is more practitioner-oriented and applied than previous classic books that provide foundational understanding of IRT models and applications (such as

Field Trials To Determine Which Rating Method(s) To Use in the 1998 NAEP Achievement Levels-Setting Process for Civics and Writing

Field trials were conducted to test rating methods and the impact of feedback about consequences ... more Field trials were conducted to test rating methods and the impact of feedback about consequences for the 1998 achievement levels-setting (ALS) process for the National Assessment of Educational Progress (NAEP). The field trials provided the opportunity to try out different methods similar to those used successfully by others, as well as to try out some new methods. The American College Testing program (ACT) had proposed a new method to be tested in the field trials. Although successful implementation of the method had been reported, the method was found to be biased, and the ACT stopped tests with the method after the first field trials. Reservations about item maps were not overcome in the field trial process, and item maps were eliminated as a choice. Concerns about computational procedures and the logistic demands of the Booklet Classification Method eliminated this approach. The Technical Advisory Committee on Standard Setting recommended a new combination method based on the method developed by M. Reckase in conjunction with the strong research base and extensive experience by ACT associated with the Mean Estimation method. Procedures based on this approach were used to set achievement levels for the 1998 NAEP in civics and writing. Appendixes contain examples of charts used in the rating method study. (Contains 18 tables, 33 figures, and 25 references.) (Author/SLD)

Early Efforts

Developing Achievement Levels for the 1998 NAEP in Writing Interim Report: Field Trials

OVERVIEW Two field trials were conducted for the achievement levels-setting (ALS) process for the... more OVERVIEW Two field trials were conducted for the achievement levels-setting (ALS) process for the1998 Civics National Assessment of Educational Progress (NAEP). ACT proposed to conduct field trials as a means of collecting research information regarding new methods and procedures designed for the 1998 ALS process. ACT wanted to conduct the research involving panelists before the pilot study so that the pilot studies could be used to test the procedures selected for the ALS. Experiences with the 1994 pilot studies for geography and U.S. history led ACT to recommend this additional set of studies for research purposes so that pilot studies could be used for practice and fine-tuning. ACT had proposed to conduct only two small-scale field trials. Once the design of the studies started to take shape, however, plans changed so that two field trials were planned for each subject—civics and writing—included in the 1998 NAEP ALS procedures. Further, the scale of the field trials expanded to ...

The Use of Person Fit to Model Panelist Fit

The use of person-fit methods to determine the extent to which a panelist's ratings fit the i... more The use of person-fit methods to determine the extent to which a panelist's ratings fit the item response theory (IRT) models used in the National Assessment of Educational Progress (NAEP) is demonstrated. Person-fit methods are statistical methods that allow the identification of nonfitting response vectors. To determine whether panelists' ratings fit the IRT models used in the NAEP, the 1(z) statistic (F. Drasgow, M. Levine, and E. Williams, 1985) was used. Rating data from the 1994 NAEP achievement level setting process were obtained for grade 12 geography, for which 29 panelists (primarily teachers) set levels. A response vector was created for each panelist for each achievement level using each of three p-value criteria and simulated item score string estimation (ISSE) values were created. The 1(z) statistic was calculated for each of the 27 response vectors associated with each of the 29 panelists. Means and standard deviations of the 1(z) distributions were computed f...

Multiple-Choice Examinations

AUTHOR Bay, Luz TITLE Detection of Cheating on Multiple-Choice Examinations. PUB DATE 1995-04-21 ... more AUTHOR Bay, Luz TITLE Detection of Cheating on Multiple-Choice Examinations. PUB DATE 1995-04-21 NOTE 47p.;Paper presented at the Annual Meeting of the American Educational Research Association (San Francisco, CA, April 18-22, 1995). PUB TYPE Reports Research (143) Speeches/Meeting Papers (150) EDRS PRICE MF01/PCO2 Plus Postage. DESCRIPTORS *Cheating; Class Size; *Identification; *Multiple Choice Tests; Simulation; Test Length; *Testing Problems IDENTIFIERS *Binomial Distribution

A Demonstration of Using Person-Fit Statistics in Standard Setting

The use of person-fit methods to determine the extent to which a panelist's ratings fit the i... more The use of person-fit methods to determine the extent to which a panelist's ratings fit the item response theory (IRT) models used in the National Assessment of Educational Progress (NAEP) is demonstrated. Person-fit methods are statistical methods that allow the identification of nonfitting response vectors. To determine whether panelists' ratings fit the IRT models used in the NAEP, the 1(z) statistic (F. Drasgow, M. Levine, and E. Williams, 1985) was used. Rating data from the 1994 NAEP achievement level setting process were obtained for grade 12 geography, for which 29 panelists (primarily teachers) set levels. A response vector was created for each panelist for each achievement level using each of three p-value criteria and simulated item score string estimation (ISSE) values were created. The 1(z) statistic was calculated for each of the 27 response vectors associated with each of the 29 panelists. Means and standard deviations of the 1(z) distributions were computed f...

Validity of Student Scores in the New Hampshire Educational Assessment and Improvement Program

In state assessment programs in which performance has no real immediate consequence for the indiv... more In state assessment programs in which performance has no real immediate consequence for the individual examinee, the issue of examinee motivation arises. Some examinees may respond to questions in ways that do not reflect their real knowledge of the test domain. In this-study, a new approach was developed to identify students who have responded to questions in a way that does not reflect their knowledge adequately. The new method was designed to detect examinees that responded according to some repeating pattern. The new method was then compared to two indices previously developed by Drasgow, Levine, and Williams (1985) using data from New Hampshire's statewide mathematics assessment for 16,630, 16,387, and 14,445 students from grades 3, 6, and 10 respectively. These indices were found to be distributed in a manner that was more or less expected, something that suggests that the indices may be useful in detecting examinees that are responding in a way not in accord with the underlying test model. However, results suggest that these indices will only occasionally detect responding according to a repeating pattern. Use of the new method, called the PM method, to detect responding in a repeated pattern is supported by the study. (Contains 8 tables and 23 references.) (SLD) Reproductions supplied by EDRS are the best that can be made from the original document.

Booklet Classification Study

Research studies using booklet classification were implemented by the American College Testing Pr... more Research studies using booklet classification were implemented by the American College Testing Program to investigate the linkage between the National Assessment of Educational Progress (NAEP) Achievement Levels Descriptions and the cutpoints set to represent student performance with respect to the achievement levels. This paper describes the process and reports the results of the booklet classification study (BCS) implemented for the science achievement levels. It explores the possibility of using booklet classification as a way to set achievement levels by investigating methodologies for computing achievement level cutpoints using booklet classification data. These methodologies were applied to BCS data for science in this study and had been applied to geography and U.S. history. The BCS for science achievement levels involved grades 4 and 8, with 13 panelists for each grade level. Eighteen booklets were selected from NAEP forms, and 22 from other sources. The BCS for science, geography, and U.S. history have all resulted in panelists' classifying student performance at a lower level than plausible values scores indicate. These results indicate that cutpoints computed from booklet classification data would be higher than cutpoints based on the item-by-item rating methods that were used operationally. Procedures using the proportional odds model and nonparametric discriminant analysis were developed as a way to compute Achievement Level cutpoints using booklet classification data. Further refinements to these procedures, especially the nonparametric discriminant analysis, are needed before they could be used operationally to set cutpoints. (Contains 15 tables, 3 figures, and 13 references.) (SLD)