Why Reliability and Validity Are Important to Learning Assessment


A basic knowledge of test score reliability and validity is important for making instructional and evaluation decisions about students. The purpose of testing is to obtain a score that accurately reflects an examinee's level of attainment of the skill or knowledge the test measures. Since instructors assign grades based on assessment information gathered about their students, that information must have a high degree of validity to be of value. Assessment data are also influenced by the type and number of students being tested: variation in student groups from semester to semester affects how difficult or easy test items appear to be. This variation in scores from group to group makes reliability and validity important considerations when developing and administering assessments and evaluating student learning.

Reliability and Validity

It is common for instructors to refer to a type of assessment, whether a selected-response test (e.g., multiple-choice, true/false) or a constructed-response test that requires rubric scoring (e.g., essays, performances), as being reliable and valid. Technically, it is not the test itself but the resulting test score or rubric score that must have a high degree of reliability and validity. Reliability refers to the degree to which scores from a particular test are consistent from one use of the test to the next. Validity refers to the degree to which a test score can be interpreted and used for its intended purpose. Reliability is an important piece of validity evidence: a test score could have high reliability and be valid for one purpose but not for another.

An example often used to illustrate reliability and validity is weighing oneself on a scale. The results of each weighing may be consistent, but the scale itself may be off by a few pounds. Thus, we could say the instrument is producing reliable weight values, but the values are not valid for their intended use because the scale is off by a few pounds.
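The scale analogy can be made concrete with a small numeric sketch. The readings and the "true" weight below are invented for illustration only: a small spread across repeated weighings indicates reliability, while a large offset from the true value shows the readings are not valid for their intended use.

```python
# Hypothetical bathroom-scale readings (all numbers invented for illustration).
readings = [148.2, 148.1, 148.3, 148.2]  # repeated weighings agree closely
true_weight = 145.0                      # but the scale is miscalibrated

# Small spread -> the readings are consistent (reliable).
spread = max(readings) - min(readings)

# Large bias -> the readings miss the true weight (not valid for the purpose).
bias = sum(readings) / len(readings) - true_weight

print(f"spread = {spread:.1f} lb (reliable: readings agree with each other)")
print(f"bias   = {bias:.1f} lb (not valid: readings miss the true weight)")
```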

Other pieces of validity evidence, in addition to reliability, are used to determine the validity of a test score. Of particular importance is that the test items or rubrics match the learning outcomes being measured, and that the instruction given matches both those outcomes and what is assessed. Ultimately, validity is of paramount importance because it refers to the degree to which a resulting score can be used to make meaningful and useful inferences about the test taker.

Item and Rubric Quality

An important piece of validity evidence is item validity. Item validity refers to how well the test items and rubrics function in measuring what was intended to be measured; in other words, the quality of the items and rubrics. Selected-response item quality is determined by analyzing students' responses to the individual test items. Rubric quality is based on:

  1. the match of the rubric content to the outcomes being measured, and
  2. the degree to which the wording in each cell of a rubric row is parallel across cells and homogeneous in the content being measured.

To improve the quality of selected-response tests that will be used again, poorly functioning items, including ambiguous or misleading ones, need to be identified so they can be fixed, eliminated, or replaced. Item analysis requires calculating item statistics such as how many students chose each answer choice for a particular item, and how many higher-scoring students chose the correct answer compared to lower-scoring students. Obtaining these statistics usually requires an item analysis program or a learning management system that provides the information.
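The two statistics described above are commonly called item difficulty (the proportion of students answering correctly) and the discrimination index (the difference in that proportion between higher- and lower-scoring students). The following is a minimal sketch using an invented answer matrix; the student responses, answer key, and the simple top-half/bottom-half split are all illustrative assumptions, not part of the original text.

```python
# Hypothetical data: each row is one student's answers; key is the answer key.
responses = [
    ["A", "C", "B"],  # student 1
    ["A", "C", "D"],  # student 2
    ["B", "C", "B"],  # student 3
    ["A", "D", "D"],  # student 4
]
key = ["A", "C", "B"]

def item_stats(responses, key):
    """Return (difficulty, discrimination) for each item."""
    # Score each student 1/0 per item, then sort by total score.
    scored = [[1 if ans == k else 0 for ans, k in zip(row, key)]
              for row in responses]
    scored.sort(key=sum, reverse=True)
    half = len(scored) // 2
    upper, lower = scored[:half], scored[half:]
    stats = []
    for i in range(len(key)):
        # Difficulty: proportion of all students answering item i correctly.
        p = sum(s[i] for s in scored) / len(scored)
        # Discrimination: upper-group correct rate minus lower-group rate.
        d = (sum(s[i] for s in upper) / len(upper)
             - sum(s[i] for s in lower) / len(lower))
        stats.append((p, d))
    return stats

for i, (p, d) in enumerate(item_stats(responses, key), start=1):
    print(f"item {i}: difficulty={p:.2f}, discrimination={d:+.2f}")
```

In this sketch, an item whose discrimination is near zero or negative (higher scorers doing no better than lower scorers) would be flagged as a candidate for revision or replacement.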

Ideally, most of the work to ensure rubric quality should be done before the rubrics are used for awarding points. This pre-administration work requires a well-constructed rubric and sample student responses to evaluate: qualified raters score the samples, their agreement is checked, and the results are used to fix the rubric. Since such an analysis can rarely be done by an individual instructor due to time and resource constraints, the practical alternative is to collect student responses, look for patterns that might reveal ambiguous or misleading wording in the rubric, and make fixes as needed.
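One simple way to check the rater agreement mentioned above is to compare two raters' scores on the same set of responses. The sketch below uses invented scores on a hypothetical 1-4 rubric and reports exact agreement (identical scores) and adjacent agreement (scores within one rubric point), two common descriptive checks; items where raters disagree would point back to ambiguous rubric wording.

```python
# Hypothetical scores from two raters on the same five responses (1-4 rubric).
rater_a = [3, 2, 4, 1, 3]
rater_b = [3, 2, 3, 1, 3]

# Exact agreement: proportion of responses given identical scores.
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Adjacent agreement: proportion of scores within one rubric point.
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"exact agreement:    {exact:.0%}")
print(f"adjacent agreement: {adjacent:.0%}")
```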