A standard error of measurement, often denoted SEm, estimates the variation around a “true” score for an individual when repeated measures are taken.
It is calculated as:

\[SE_m = s\sqrt{1-R}\]

where \(s\) is the standard deviation of the measurements and \(R\) is the reliability coefficient of the test.
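To make the formula concrete, here is a minimal Python sketch; the function name is purely illustrative, not from the text:

```python
import math

def standard_error_of_measurement(s, r):
    """SEm = s * sqrt(1 - R), where s is the standard deviation of the
    measurements and r is the reliability coefficient of the test."""
    return s * math.sqrt(1 - r)

print(standard_error_of_measurement(3.17, 0.88))  # ~1.098, as in the example below
```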
Note that a reliability coefficient ranges from 0 to 1 and is calculated by administering a test to many individuals twice and computing the correlation between the two sets of scores. The higher the reliability coefficient, the more consistently a test produces the same scores.

Example: Calculating a Standard Error of Measurement

Suppose an individual takes a certain test 10 times over the course of a week that aims to measure overall intelligence on a scale of 0 to 100. They receive the following scores:

Scores: 88, 90, 91, 94, 86, 88, 84, 90, 90, 94

The sample mean is 89.5 and the sample standard deviation is 3.17. If the test is known to have a reliability coefficient of 0.88, then the standard error of measurement is:

\[SE_m = s\sqrt{1-R} = 3.17\sqrt{1-0.88} = 1.098\]
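These values can be double-checked with a short Python sketch; note that `statistics.stdev` uses the sample (n − 1) standard deviation, which is where the 3.17 comes from:

```python
import math
import statistics

scores = [88, 90, 91, 94, 86, 88, 84, 90, 90, 94]
reliability = 0.88

mean = statistics.mean(scores)        # 89.5
s = statistics.stdev(scores)          # sample standard deviation, ~3.17
sem = s * math.sqrt(1 - reliability)  # ~1.098

print(round(mean, 1), round(s, 2), round(sem, 3))
```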
How to Use SEm to Create Confidence Intervals

Using the standard error of measurement, we can create a confidence interval that is likely to contain the “true” score of an individual on a certain test with a certain degree of confidence. If an individual receives a score of x on a test, we can use the following approximate formulas to calculate confidence intervals for this score:

68% confidence interval: \(x \pm SE_m\)
95% confidence interval: \(x \pm 2\,SE_m\)
99% confidence interval: \(x \pm 3\,SE_m\)

For example, suppose an individual scores a 92 on a certain test that is known to have a SEm of 2.5. We could calculate a 95% confidence interval as:

\[95\%\ \text{CI} = 92 \pm 2(2.5) = (87,\ 97)\]
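The same interval in a couple of lines of Python, using the 2 × SEm rule of thumb from the formulas above:

```python
score = 92
sem = 2.5

# 95% CI ~ score +/- 2 * SEm
lower, upper = score - 2 * sem, score + 2 * sem
print(f"95% CI: ({lower}, {upper})")  # (87.0, 97.0)
```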
This means we are 95% confident that the individual’s “true” score on this test is between 87 and 97.

Reliability & Standard Error of Measurement

There is a simple relationship between the reliability coefficient of a test and the standard error of measurement: the higher the reliability coefficient, the lower the standard error of measurement.
To illustrate this, consider an individual who takes a test 10 times and whose scores have a standard deviation of 2. If the test has a reliability coefficient of 0.9, the standard error of measurement is:

\[SE_m = 2\sqrt{1-0.9} = 0.632\]
However, if the test has a reliability coefficient of 0.5, the standard error of measurement is:

\[SE_m = 2\sqrt{1-0.5} = 1.414\]
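The two cases can be compared side by side with a short Python loop; the values match the calculations above:

```python
import math

s = 2  # standard deviation of the individual's scores

for reliability in (0.9, 0.5):
    sem = s * math.sqrt(1 - reliability)
    print(f"R = {reliability}: SEm = {sem:.3f}")

# R = 0.9: SEm = 0.632
# R = 0.5: SEm = 1.414
```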
This should make sense intuitively: if the scores of a test are less reliable, then the error in the measurement of the “true” score will be higher.

The collection of data involves measurement. Measurement of some characteristics, such as height and weight, is relatively straightforward; the measurement of psychological attributes such as self-esteem can be complex. A good measurement scale should be both reliable and valid. These concepts are discussed in turn below.

Reliability

The notion of reliability revolves around whether you would get at least approximately the same result if you measured something twice with the same measurement instrument. A common way to define reliability is as the correlation between parallel forms of a test. Letting "test" represent a parallel form of the test, the symbol \(r_{test,test}\) is used to denote the reliability of the test.

True Scores and Error

Assume you wish to measure a person's mean response time to the onset of a stimulus. For simplicity, assume there is no learning over tests (which, of course, is not really true). The person is given \(1,000\) trials on the task and you obtain the response time on each trial. The mean response time over the \(1,000\) trials can be thought of as the person's "true" score, or at least a very good approximation of it. Theoretically, the true score is the mean that would be approached as the number of trials increases indefinitely. An individual response time can be thought of as being composed of two parts: the true score and the error of measurement. Thus, if the person's true score were \(345\) and their response on one of the trials were \(358\), the error of measurement would be \(13\). Similarly, if the response time were \(340\), the error of measurement would be \(-5\).

Now consider the more realistic example of a class of students taking a \(100\)-point true/false exam. Assume that each student knows the answer to some of the questions and has no idea about the others. For the sake of simplicity, assume there is no partial knowledge of any of the answers: for a given question, a student either knows the answer or guesses. Finally, assume the test is scored such that a student receives one point for a correct answer and loses a point for an incorrect answer. In this example, a student's true score is the number of questions they know the answer to, and their error score is their score on the questions they guessed on. For example, assume a student knew \(90\) of the answers and guessed correctly on \(7\) of the remaining \(10\) (and therefore incorrectly on \(3\)). Their true score would be \(90\), since that is the number of answers they knew. Their error score would be \(7 - 3 = 4\), and therefore their actual test score would be \(90 + 4 = 94\).

Every test score can be thought of as the sum of two independent components, the true score and the error score. This can be written as:

\[y_{test}=y_{true}+y_{error}\]

The following expression follows directly from the Variance Sum Law:

\[\sigma _{Test}^{2}=\sigma _{True}^{2}+\sigma _{Error}^{2}\]

Reliability in Terms of True Scores and Error

It can be shown that the reliability of a test, \(r_{test,test}\), is the ratio of true-score variance to test-score variance.
This can be written as:

\[r_{test,test}=\frac{\sigma _{True}^{2}}{\sigma _{Test}^{2}}=\frac{\sigma _{True}^{2}}{\sigma _{True}^{2}+\sigma _{Error}^{2}}\]

It is important to understand the implications of the role the variance of true scores plays in the definition of reliability: if a test were given in two populations for which the variance of the true scores differed, the reliability of the test would be higher in the population with the higher true-score variance. Therefore, reliability is not a property of a test per se, but of a test in a given population.

Assessing Error of Measurement

The reliability of a test does not show directly how close the test scores are to the true scores. That is, it does not reveal how much a person's test score would vary across parallel forms of the test. By definition, the mean over a large number of parallel tests would be the true score, and the standard deviation of a person's test scores would indicate how much those scores vary from the true score. This standard deviation is called the standard error of measurement. In practice, it is not feasible to give a test over and over to the same person and assume there are no practice effects. Instead, the following formula is used to estimate the standard error of measurement:

\[S_{measurement}=S_{test}\sqrt{1-r_{test,test}}\]

where \(S_{measurement}\) is the standard error of measurement, \(S_{test}\) is the standard deviation of the test scores, and \(r_{test,test}\) is the reliability of the test. Taking the extremes: if the reliability is \(0\), the standard error of measurement equals the standard deviation of the test scores; if the reliability is perfect (\(1.0\)), the standard error of measurement is \(0\).
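A small simulation can make these relationships concrete. The sketch below is illustrative only (the population values of 10 for the true-score standard deviation and 5 for the error standard deviation are my own choices, not from the text); it checks that reliability is approximately the ratio of true-score variance to test-score variance, and that \(S_{test}\sqrt{1-r_{test,test}}\) approximately recovers the standard deviation of the errors:

```python
import math
import random
import statistics

random.seed(0)
n = 100_000

# Illustrative population: true scores (sd = 10) plus independent error (sd = 5)
true_scores = [random.gauss(100, 10) for _ in range(n)]
errors = [random.gauss(0, 5) for _ in range(n)]
test_scores = [t + e for t, e in zip(true_scores, errors)]

reliability = statistics.pvariance(true_scores) / statistics.pvariance(test_scores)
sem = statistics.pstdev(test_scores) * math.sqrt(1 - reliability)

print(round(reliability, 3))  # close to 100 / (100 + 25) = 0.8
print(round(sem, 3))          # close to 5, the error standard deviation
```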
Increasing Reliability

It is important to make measures as reliable as is practically possible. Suppose an investigator is studying the relationship between spatial ability and a set of other variables. The higher the reliability of the test of spatial ability, the higher the correlations will be. Similarly, if an experimenter seeks to determine whether a particular exercise regimen decreases blood pressure, the more reliable the measure of blood pressure, the more sensitive the experiment. More precisely, the higher the reliability, the higher the power of the experiment. Finally, if a test is being used to select students for college admission or employees for jobs, the higher the reliability of the test, the stronger its relationship to the criterion will be.

Two basic ways of increasing reliability are (1) to improve the quality of the items and (2) to increase the number of items. Items that are either so easy that almost everyone gets them correct or so difficult that almost no one gets them correct are not good items: they provide very little information. In most contexts, items which about half the people get correct are the best (other things being equal). Items that do not correlate with other items can usually be improved; sometimes the item is simply confusing or ambiguous.

Increasing the number of items increases reliability in the manner shown by the following formula:

\[r_{new,new}=\frac{kr_{test,test}}{1+(k-1)r_{test,test}}\]

where \(k\) is the factor by which the test length is increased, \(r_{new,new}\) is the reliability of the new, longer test, and \(r_{test,test}\) is the current reliability. For example, if a test with \(50\) items has a reliability of \(0.70\), then the reliability of a test that is \(1.5\) times longer (\(75\) items) would be calculated as follows:

\[r_{new,new}=\frac{(1.5)(0.70)}{1+(1.5-1)(0.70)}\]

which equals \(0.78\). Thus, increasing the number of items from \(50\) to \(75\) would increase the reliability from \(0.70\) to \(0.78\). It is important to note that this formula assumes the new items have the same characteristics as the old items; adding poor items would not increase the reliability as expected and might even decrease it.
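This lengthening rule (commonly known as the Spearman–Brown prophecy formula) can be sketched in a few lines of Python; the function name is illustrative:

```python
def lengthened_reliability(r, k):
    """Predicted reliability of a test lengthened by a factor of k,
    given a current reliability of r."""
    return (k * r) / (1 + (k - 1) * r)

# A 50-item test with reliability 0.70, lengthened to 75 items (k = 1.5)
print(round(lengthened_reliability(0.70, 1.5), 2))  # 0.78
```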
Validity

The validity of a test refers to whether the test measures what it is supposed to measure. The three most common types of validity are face validity, empirical validity, and construct validity. We consider these types of validity below.

To take an example, suppose one wished to establish the construct validity of a new test of spatial ability. Convergent and divergent validity could be established by showing that the test correlates relatively highly with other measures of spatial ability but less highly with tests of verbal ability or social intelligence.

Reliability and Predictive Validity

The reliability of a test limits the size of the correlation between the test and other measures. In general, the correlation of a test with another measure will be lower than the test's reliability. After all, how could a test correlate with something else more highly than it correlates with a parallel form of itself? Theoretically, it is possible for a test to correlate with another measure as highly as the square root of its reliability. For example, a test with a reliability of \(0.81\) could correlate as high as \(0.90\) with another measure. This could happen if the other measure were a perfectly reliable test of the same construct as the test in question; in practice, this is very unlikely.

A correlation above the upper limit set by the reliabilities should act as a red flag. For example, Vul, Harris, Winkielman, and Pashler (2009) found that in many studies the correlations between various fMRI activation patterns and personality measures were higher than their reliabilities would allow. A careful examination of these studies revealed serious flaws in the way the data were analyzed.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition. Perspectives on Psychological Science, 4, 274-290.

When measurement error increases, what happens to reliability?

As more error is introduced into the observed score, the lower the reliability will be; as measurement error is decreased, reliability is increased. With that said, administering two forms of an exam to one candidate in order to calculate reliability is not practical.
How are reliability and error related?

Reliability is the degree to which a measure is free from random error. But, because of the ever-present chance of random error, we can never achieve a completely error-free, 100% reliable measure; the risk of unreliability is always present to a limited extent.
What does it mean if a test has high reliability?

A measure is said to have high reliability if it produces similar results under consistent conditions: it is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores.
Why does reducing error increase reliability?
Random error reduces the reliability of the measurement (whether we can reproduce the measurement and obtain the same results). Reliability refers to the consistency of the results obtained. As random errors increase, the measurement instrument is said to be less reliable.