Why is your decision to conduct a one-tailed rather than a two-tailed test potentially important?

Adriana Pérez, Barbara C. Tilley, in Stroke (Sixth Edition), 2016

One-Tailed or Two-Tailed Tests

A one-tailed test requires a smaller sample size to detect the same effect with the same power. In phase 2, a one-sided test may therefore be used to reduce sample size. In phase 3, investigators generally design the trial to learn whether the treatment group has an outcome better or worse than that of the control group, which requires a two-sided test. An exception is the EC/IC Bypass Study,76 in which the investigators designed the trial using a one-tailed test to compare surgery with best medical care. The investigators assumed that if the trial indicated surgery was no better than best medical care, surgery would not be recommended in the future because of its cost.
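To make the sample-size advantage concrete, the sketch below compares the subjects per group needed by a one-tailed and a two-tailed design under simple, illustrative assumptions (a two-group z-test on means with a known common standard deviation, an effect of 0.5 SD, 80% power, and α = 0.05). None of these numbers come from the EC/IC Bypass Study; the sketch assumes SciPy is available.

```python
# A minimal sketch: approximate sample size per group for a two-group z-test,
# contrasting a one-tailed and a two-tailed design. All numbers are illustrative.
from math import ceil
from scipy.stats import norm

alpha, power, effect_sd = 0.05, 0.80, 0.5
z_beta = norm.isf(1 - power)            # z value corresponding to the desired power

def n_per_group(z_alpha):
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_sd ** 2)

print("one-tailed:", n_per_group(norm.isf(alpha)))      # about 50 per group
print("two-tailed:", n_per_group(norm.isf(alpha / 2)))  # about 63 per group
```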


URL: https://www.sciencedirect.com/science/article/pii/B9780323295444000633

Tests on Means of Continuous Data

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

The Alternate Hypothesis Is Used to Choose a One- or Two-Tailed Test

The alternate hypothesis dictates whether the test is two-tailed or one-tailed. We should decide this before seeing the data so that our choice will not be influenced by the outcome. We often expect the result to lie toward one tail, but expectation is not enough. If we are sure the other tail is impossible, for physical or physiological reasons, we unquestionably use a one-tailed test. Surgery to sever adhesions and return motion to a joint frozen by long casting can allow only a positive increase in angle of motion; a negative change in angle is not physically possible, so a one-tailed test is appropriate. There are also cases in which an outcome in either tail is possible, yet a one-tailed test is appropriate. When the test informs a decision about a medical treatment, that is, when whether or not we will alter treatment depends on the outcome of the test, the possibility requirement applies to the alteration in treatment, not to the physical outcome. If we will alter treatment only for significance in the positive tail and will in no way alter it for significance in the negative tail, a one-tailed test is appropriate.


URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000123

Biostatistical Basis of Inference in Heart Failure Study

Longjian Liu MD, PHD, MSC (LSHTM), FAHA, in Heart Failure: Epidemiology and Research Methods, 2018

Two-tailed or one-tailed test

Another important concept in significance testing is whether we use a one-tailed or two-tailed test of significance.

The answer is that it depends on our hypothesis. When our research hypothesis states the direction of the difference or relationship, we use a one-tailed probability. For example, a one-tailed test may be used to test this null hypothesis: serum 25-hydroxyvitamin D concentration in heart failure patients with chronic kidney disease is not lower than in heart failure patients without chronic kidney disease. In this case, the null hypothesis predicts the direction of the difference.

On the other hand, a two-tailed test would be used to test this null hypothesis: there is no significant difference in serum 25-hydroxyvitamin D concentration between heart failure patients with and without chronic kidney disease. In this case, the direction of the comparison (lower or higher) is not specified. Fig. 4.10 shows the standard z distribution: when z = 1.96, the P-value for a two-tailed test is .05 (.025 in each tail, .025 + .025 = .05).
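As a quick check of the z = 1.96 example, the following sketch (assuming SciPy is available; it is not taken from the chapter) converts a z statistic into one-tailed and two-tailed P-values.

```python
# A minimal sketch: one-tailed and two-tailed P-values from a z statistic.
from scipy.stats import norm

z = 1.96
one_tailed_p = norm.sf(z)            # area in the upper tail only, about .025
two_tailed_p = 2 * norm.sf(abs(z))   # both tails combined, about .05

print(f"one-tailed P = {one_tailed_p:.3f}")   # 0.025
print(f"two-tailed P = {two_tailed_p:.3f}")   # 0.050
```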


URL: https://www.sciencedirect.com/science/article/pii/B9780323485586000049

Tests on Categorical Data

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

Example Completed, Large λ: Does a New Drug Cause Birth Defects?

A large number of opportunities to occur (638) combined with a small chance of occurring on any one opportunity (0.0175) indicates a Poisson distribution. Because the adjuvant drug would be used clinically if the defect rate is reduced, but not if it stays the same or increases, a one-tailed test is appropriate. λ = nπ = 638 × 0.0175 = 11.165, which is larger than the value in Table VII, so the normal approximation is used. ps = 7/638 = 0.0110. Substitution in Eq. (9.10) yields z = −1.2411. By the symmetry of the normal curve, we can look up 1.2411 in Table 9.13. By interpolation, the p-value, that is, the α that would have given this result, is about 0.108. A reduction in the defect rate has not been shown.

Table 9.13. A Fragment of Normal Distribution Table Ia

z (No. Std. Deviations to Right of Mean)   One-tailed α (Area in Right Tail)   Two-tailed α (Area in Both Tails)
1.20 .115 .230
1.30 .097 .194
1.40 .081 .162
1.50 .067 .134

aFor selected distances (z) to the right of the mean, given are one-tailed α, the area under the curve in the positive tail, and two-tailed α, the areas combined for both tails.
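The arithmetic of this example can be reproduced with the short sketch below. It is a sketch only: it assumes Eq. (9.10) is the Poisson-based normal approximation z = (ps − π)/√(π/n) and uses SciPy's normal distribution in place of Tables VII and 9.13.

```python
# A minimal sketch of the birth-defect example, assuming Eq. (9.10) is the
# Poisson-based normal approximation z = (ps - pi) / sqrt(pi / n).
from math import sqrt
from scipy.stats import norm

n, pi, defects = 638, 0.0175, 7
lam = n * pi                   # lambda = 11.165, large enough for the normal approximation
ps = defects / n               # 0.0110
z = (ps - pi) / sqrt(pi / n)   # about -1.25 (the text's -1.2411 rounds ps to 0.0110 first)
p_one_tailed = norm.cdf(z)     # lower-tail area, about 0.11, matching the table lookup

print(f"lambda = {lam:.3f}, ps = {ps:.4f}, z = {z:.3f}, one-tailed p = {p_one_tailed:.3f}")
```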


URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000093

Research and Methods

Michael Borenstein, in Comprehensive Clinical Psychology, 1998

3.14.3.3 Role of Tails in Power Analysis

The significance test is always defined as either one- or two-tailed. A two-tailed test is a test that will be interpreted if the effect meets the criterion for significance and falls in either direction. As such, it is appropriate for the vast majority of research studies. A one-tailed test is a test that will be interpreted only if the effect meets the criterion for significance and falls in the expected direction (i.e., the treatment improves the cure rate).

A one-tailed test is appropriate only if an effect in the unexpected direction would be functionally equivalent to no effect. For example, assume that the treatment we are using for depression is relatively inexpensive and carries a minimal risk of side effects. We will be testing a new treatment which is more expensive but carries the potential for a greater effect. The possible conclusions are that (i) the old treatment is better, (ii) there is no difference, or (iii) the new treatment is better. For our purposes, however, conclusions (i) and (ii) are functionally equivalent since either would lead us to retain the standard treatment. In this case, a one-tailed test, whose only goal is to test whether or not conclusion (iii) is true, might be an appropriate choice.

Note that a one-tailed test should be used only in a study in which, as in this example, an effect in the reverse direction is, for all intents and purposes, identical to “no effect.” It is not appropriate to use a one-tailed test merely because one is able to specify the expected direction of the effect prior to running the study. In psychological research, for example, we typically expect that the new procedure will increase, rather than decrease, the cure rate. Nevertheless, a finding that it decreases the cure rate would be important, since it would demonstrate a possible flaw in the underlying theory. Even in the example cited, one would want to be certain that a profound effect in the reverse direction could safely be ignored— under a one-tailed test, it cannot be interpreted. In behavioral research, the use of a one-tailed test can be justified only rarely.

For a given effect size, sample size, and alpha, a one-tailed test is more powerful than a two-tailed test (a one-tailed test with alpha set at 0.05 has the same power as a two-tailed test with alpha set at 0.10). However, the number of tails should be set based on the substantive issue (“Will an effect in the reverse direction be meaningful?”). In general, it would not be appropriate to run a test as one-tailed rather than two-tailed as a means of increasing power. (Note also that power is higher for the one-tailed test only under the assumption that the observed effect falls in the expected direction. When the test is one-tailed, the power for an effect in the reverse direction is zero by definition).
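The power equivalence noted above can be illustrated with a simple sketch. It assumes a one-sample z-test with known variance, an effect of 0.5 standard deviations, and n = 25; these numbers are illustrative only and do not come from the chapter.

```python
# A minimal sketch showing why a one-tailed test at alpha = 0.05 matches a
# two-tailed test at alpha = 0.10 in power, provided the true effect lies in
# the expected direction.
from math import sqrt
from scipy.stats import norm

effect_sd, n = 0.5, 25
noncentrality = effect_sd * sqrt(n)   # the z value expected under the alternative

def power_one_tailed(alpha):
    return norm.sf(norm.isf(alpha) - noncentrality)

def power_two_tailed(alpha):
    # upper-tail rejections only; the lower-tail contribution is negligible here
    return norm.sf(norm.isf(alpha / 2) - noncentrality)

print(f"one-tailed, alpha = 0.05: {power_one_tailed(0.05):.3f}")   # about 0.80
print(f"two-tailed, alpha = 0.10: {power_two_tailed(0.10):.3f}")   # the same, about 0.80
print(f"two-tailed, alpha = 0.05: {power_two_tailed(0.05):.3f}")   # lower, about 0.71
```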


URL: https://www.sciencedirect.com/science/article/pii/B0080427073002091

Biostatistics—Part I

Tom Brody Ph.D., in Clinical Trials (Second Edition), 2016

VII One-Tailed Test Versus Two-Tailed Test

The terms one-tailed test and two-tailed test are encountered, for example, when conducting analytical studies on manufactured tablets and when conducting clinical trials. When doing calculations, these terms are encountered when plugging a Z-value into a table of areas under the standard normal curve, and acquiring a P value. A one-tailed test is also called a one-sided test, and a two-tailed test is also called a two-sided test.

This standard table has been called, “Standard Normal Distribution Areas” (38), “Areas in Tail of the Standard Normal Distribution” (39), and “Areas Under the Standard Normal Curve” (40).

The heading of the table of areas under the standard normal curve typically directs the reader to one column of numbers, which is to be used for one-tailed tests, and to another column of numbers, which is to be used for two-tailed tests (41).

A one-tailed test is used to determine whether the mean of group 1 is greater than the mean of group 2, while a two-tailed test is used to determine whether the mean of group 1 is different from the mean of group 2 (42). By "different," what is meant here is whether there is a statistically significant difference. More accurately, by "different," what is meant is whether the difference is plausible within an acceptable degree of error (43). As explained by Dawson and Trapp (44), the one-tailed test is a directional test, while the two-tailed test is a nondirectional test.

A one-tailed test should be used where the goal is to determine whether the value of a mean of a sample is significantly greater than the value of the mean for the corresponding population. The one-tailed test is also used where the goal is to determine whether the value of a mean of a sample is significantly greater than the value of the mean of another sample.

Thus, a one-tailed test is used where the goal is to determine whether a new, improved pill dissolves faster in water than an older formulation of the pill. Also, a one-tailed test is used where the goal is to determine whether a drug having expected curative properties results in a better cure than an inactive placebo.

To provide another example, a one-tailed test is used where the goal is to determine whether vials containing a vaccine are contaminated with 10 or more bacteria (45). In this case, the analyst is only interested in whether the vials contain 10 or more bacteria, in view of industry-wide specifications requiring that vials must contain less than 10 bacteria. Generally, the one-tailed test is used to determine whether sample A is significantly greater than sample B, in the situation where it would not be reasonable to expect sample A to be significantly less than sample B.

A two-tailed test, by contrast, should be used where the goal is to determine the percentage of tablet weights that are greater or less than the required specification (the percentage of tablets that are greater plus the percentage that are less), for example when comparing tablets made by manufacturer 1 with tablets made by manufacturer 2. Two-tailed tests are used more widely in clinical trials than one-tailed tests, because the two-tailed test is more stringent and more conservative (46,47).
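A minimal sketch of the two kinds of question, using SciPy's two-sample t-test (the `alternative` argument requires a reasonably recent SciPy release). The dissolution times are invented purely for illustration and do not come from the chapter.

```python
# A minimal sketch contrasting directional (one-tailed) and nondirectional
# (two-tailed) comparisons of two sample means; the data are hypothetical.
from scipy.stats import ttest_ind

new_pill = [41, 38, 44, 40, 39, 42, 37, 43]   # hypothetical dissolution times (seconds)
old_pill = [46, 44, 49, 45, 47, 43, 48, 46]

# One-tailed (directional): does the new pill dissolve faster, i.e., is its mean time lower?
t1, p_one = ttest_ind(new_pill, old_pill, alternative="less")

# Two-tailed (nondirectional): do the two formulations differ in either direction?
t2, p_two = ttest_ind(new_pill, old_pill, alternative="two-sided")

print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```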


URL: https://www.sciencedirect.com/science/article/pii/B9780128042175000096

Tests on Ranked Data

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

Additional Example: Hardware to Repair Ankle Functionality

An orthopedist installs the hardware in the broken ankles of nine patients55. He scores the percent functionality of the joint. He asks: “Is the average percent functionality less than 90% that of normal functionality?” His data in percent are 75, 65, 100, 90, 35, 63, 78, 70, 80. A quick frequency plot of the data shows they are far from a normal distribution, so he uses a rank-based test. He subtracts 90% from each (to provide a base of 0) and ranks them, ignoring signs. Then he attaches the signs to the ranks to obtain the signed ranks. These results are shown in Table 11.5.

Table 11.5. Data for Ankle Repair

Deviation from 90% −15 −25 10 0 −55 −27 −12 −20 −10
Unsigned ranks 5 7 2.5 1 9 8 4 6 2.5
Signed ranks −5 −7 2.5 1 −9 −8 −4 −6 −2.5

The sum of the positive ranks is obviously the smaller sum, namely T = 3.5. From Table VIII, the p-value for n = 9 with T = 3.5 for a two-tailed test lies between 0.020 and 0.027, about 0.024. Because he chose a one-tailed test, the tabulated value may be halved, giving p = 0.012, approximately, which is clearly significant. He concludes that the patients' average functionality is significantly below 90%.
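The signed-rank arithmetic above can be reproduced with a short sketch (assuming NumPy and SciPy). Note that the chapter keeps the zero deviation in the ranking with rank 1, whereas many canned Wilcoxon routines drop zeros, so scipy.stats.wilcoxon may report a slightly different statistic.

```python
# A minimal sketch reproducing the signed-rank calculation of Table 11.5.
import numpy as np
from scipy.stats import rankdata

functionality = np.array([75, 65, 100, 90, 35, 63, 78, 70, 80])
deviations = functionality - 90                  # deviation from the 90% reference

ranks = rankdata(np.abs(deviations))             # rank the absolute deviations, ties averaged
signed_ranks = np.sign(deviations) * ranks
signed_ranks[deviations == 0] = ranks[deviations == 0]   # keep the zero's rank as positive, as in Table 11.5

T = signed_ranks[signed_ranks > 0].sum()         # the smaller rank sum
print(f"signed ranks: {signed_ranks}")
print(f"T = {T}")                                # 3.5, as in the text
```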

Exercise 11.1. Normality of thermometer readings on a healthy patient. We are investigating the reliability of a certain brand of tympanic thermometer (temperature measured by a sensor inserted into the patient’s ear)110. Eight readings (°F) were taken in the right ear of a healthy patient at 2-minute intervals. Data were 98.1, 95.8, 97.5, 97.2, 97.7, 99.3, 99.2, 98.1. Is the median different from the population average of 98.6°F?


URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000111

Description and Analysis of Data, and Critical Appraisal of the Literature

Brian W. McCrindle, in Paediatric Cardiology (Third Edition), 2010

Relationship between Probability and Inference

A detailed explanation of the exact methods used to determine whether the probability distribution of the sample differs from or resembles that of the overall or target population is beyond the scope of this chapter. Suffice it to say that each type of data, and each specific question, requires specific methodologies and tests, some of which are briefly introduced later. The common point of all statistical tests is that they produce a P-value, which represents the probability of observing a difference at least as large as the one seen if the two distributions were in fact the same. Statistical testing takes into account the number of subjects being tested, the observed variation in the data, the magnitude of any differences, and the underlying nature of the probability distribution, and the results are therefore influenced by these features.

Every statistical test comparing two groups starts from the hypothesis that both groups are equivalent. A two-tailed test assesses the probability that group A is different from group B, either higher or lower, while a one-tailed test assesses the probability that group A is specifically higher (or specifically lower) than group B, but not both. As a general rule, only two-tailed tests should be used in most situations, as they assess the probability of the two groups being different without any presumption about the direction of the difference. There is no assumption that A is higher or lower than B, just that the two are different. One-tailed tests assume that any difference lies in a prespecified direction, for example higher or lower, but not both. These tests should be used only in specific situations, an example being a non-inferiority trial.

By convention, statistical significance is reached when the P-value obtained from the test is under 0.05, meaning that the probability of seeing such a difference if the groups were in fact equivalent is lower than 5%. The P-value is an expression of the confidence we might have that the findings are real and not the result of random error. From our previous example of preferred treatment for heart failure, the P-value was less than 0.001 for 44% being different from 72%. This means that, in a population where 72% of cardiologists prefer drug B, the probability of drawing a random sample in which only 44% favor drug B is less than 1 in 1000. We can therefore conclude with confidence that the second sample is truly different from the original sample, and that opinion in the population of paediatric cardiologists had changed.
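The kind of calculation behind the 44% versus 72% example can be sketched with an exact binomial test. The sample size of 100 cardiologists is a placeholder assumption, since the actual survey size is not given in this excerpt; scipy.stats.binomtest requires a recent SciPy release.

```python
# A minimal sketch, with a hypothetical sample size, of a two-tailed binomial
# test of an observed 44% against the original proportion of 72%.
from scipy.stats import binomtest

n = 100          # hypothetical number of cardiologists in the second survey
observed = 44    # 44% of the sample favoring drug B

result = binomtest(observed, n, p=0.72, alternative="two-sided")
print(f"P = {result.pvalue:.1e}")   # far below 0.001, consistent with the text
```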


URL: https://www.sciencedirect.com/science/article/pii/B9780702030642000242

Statistical Methods

Thomas D. Gauthier, Mark E. Hawley, in Introduction to Environmental Forensics (Third Edition), 2015

5.2.3.2 Hypothesis Testing

In hypothesis testing, the convention is to assume that there is no difference between two values (this is referred to as the "null hypothesis," denoted Ho). This assumption holds unless we can determine that the probability of seeing such a large difference is so small that it is more likely that our initial assumption is wrong than that we have encountered this rare occurrence. Often, the convention is to conclude that our initial assumption is wrong when the probability associated with the null hypothesis is 5% or less.

When we establish our null hypothesis, it is also important to establish an alternate hypothesis (denoted Ha). For a null hypothesis of no difference between two sample means, i.e., Ho: μa = μb, there are three different alternate hypotheses: Ha: μa ≠ μb; Ha: μa > μb; or Ha: μa < μb. The a priori decision of which alternate hypothesis is selected determines whether we are performing a one-tailed test or a two-tailed test. If our decision level is set at 5% and our alternate hypothesis is Ha: μa ≠ μb, then we do not care whether μa > μb or μa < μb. In this case, our 5% decision level is split between the tails: a 2.5% chance of wrongly concluding that μa > μb and a 2.5% chance of wrongly concluding that μa < μb when the null hypothesis is in fact true. This is referred to as a two-tailed test. If, however, we really want to know whether the average concentration on site is greater than the average background level, i.e., μa > μb, and we are willing to accept a 5% chance of wrongly rejecting our null hypothesis, then we are performing a one-tailed test.
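The way the 5% decision level is allocated under each alternate hypothesis can be expressed as critical z values. The sketch below covers the standard normal case and assumes SciPy is available.

```python
# A minimal sketch: critical z values at a 5% decision level for the three
# alternate hypotheses described above.
from scipy.stats import norm

alpha = 0.05
z_two_tailed = norm.isf(alpha / 2)   # Ha: mu_a != mu_b -> reject if |z| > 1.96 (2.5% in each tail)
z_upper = norm.isf(alpha)            # Ha: mu_a >  mu_b -> reject if z > 1.645 (all 5% in the upper tail)
z_lower = -norm.isf(alpha)           # Ha: mu_a <  mu_b -> reject if z < -1.645

print(f"two-tailed:           |z| > {z_two_tailed:.3f}")
print(f"one-tailed (greater):   z > {z_upper:.3f}")
print(f"one-tailed (less):      z < {z_lower:.3f}")
```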

In making the decision to accept or reject the null hypothesis, two types of errors are possible: Type I errors and Type II errors. A Type I error occurs when the null hypothesis is rejected (we conclude that the values are not drawn from the same population) when in fact they are. A Type II error occurs when we accept the null hypothesis of no difference between means when in fact there is a difference.

The probability of committing a Type I error is equal to α (alpha), the significance level. Alpha is a value that is chosen by the investigator and usually set equal to 0.05. With alpha set equal to 0.05, we are willing to accept a 5% chance of rejecting our null hypothesis of no difference between means when in fact the null hypothesis is true. The probability of committing a Type II error is equal to β (beta); the power of the test, the probability of detecting a true difference, is 1 − β. The power of a test is more difficult to determine. It is a function of alpha, the standard error of the difference between the two means, and the size of the effect that we are trying to detect (for example, see Zar, 1984).

The power of the test is often neglected, which can lead to false confidence in decision making. For example, if α = 0.05 and β = 0.5, there is a 5% chance of rejecting the null hypothesis when in fact it is true, but there is a 50% chance of accepting the null hypothesis when it is false. For this reason, some statisticians believe that the only meaningful conclusion is reached when the null hypothesis is rejected (for example, see Oakes, 1986; Reckhow et al., 1990). If the null hypothesis cannot be rejected, the reason may be that the test had insufficient power to detect a difference (e.g., the sample size was too small or the data were highly variable). Thus, a poorly designed test based on too few measurements can be biased toward accepting the null hypothesis.
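The point about neglected power can be made concrete with a small sketch: for a one-tailed, one-sample z-test with a 0.5 SD effect (an illustrative assumption, not a value from the chapter), β stays near 0.5 when the sample is small and shrinks only as n grows.

```python
# A minimal sketch: how sample size drives the Type II error rate beta for a
# one-tailed z-test with an assumed 0.5 SD effect.
from math import sqrt
from scipy.stats import norm

alpha, effect_sd = 0.05, 0.5
z_crit = norm.isf(alpha)    # one-tailed critical value

for n in (10, 25, 50):
    power = norm.sf(z_crit - effect_sd * sqrt(n))
    beta = 1 - power        # probability of a Type II error
    print(f"n = {n:2d}: power = {power:.2f}, beta = {beta:.2f}")
```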


URL: https://www.sciencedirect.com/science/article/pii/B9780124046962000059

Statistical Analysis for Experimental-Type Designs

Elizabeth DePoy PhD, MSW, OTR, Laura N. Gitlin PhD, in Introduction to Research (Fifth Edition), 2016

Action 2: Select a Significance Level

A level of significance defines how rare or unlikely the sample data must be before the researcher can fail to accept the null hypothesis. The level of significance is a cutoff point that indicates whether the samples being tested are from the same population or from a different population. This numeric value indicates how confident the researcher is that the findings regarding the sample are not attributable to chance. For example, if you select a significance level of 0.05, you are 95% confident that your statistical findings did not occur by chance. If you repeatedly draw different samples from the same population, theoretically you will find similar scores 95 out of 100 times. Similarly, a significance level of 0.1 indicates that the findings may be caused by chance 1 of every 10 times.

As you can see, the smaller the number, the more confidence the researcher has in the findings and the more credible the results. Because of the nature of probability theory, the researcher can never be certain that the findings are 100% accurate. Significance levels are selected by the researcher on the basis of sample size, level of measurement, and conventional norms in the literature. As a general rule, the larger the sample size, the smaller the numerical value in the level of significance. If you have a small sample size, you risk obtaining a study group that is not highly representative of the population, and thus your confidence level drops. A large sample size includes more elements from the population, and thus the chances of representation and confidence of findings increase. You therefore can use a stringent level of significance (0.01 or smaller).

One-Tailed and Two-Tailed Levels of Significance

Consider the normal curve in distribution of scores (see Fig. 20-4). Extreme scores can occur to either the left or the right of the bell shape. As we noted earlier, the extremes of the curve are called "tails." If a hypothesis is nondirectional, it indicates that the investigator assumes that extreme scores can occur at either end of the curve or in either tail. In this situation, the investigator uses a test to determine whether the 5% of statistical values that are considered statistically significant are distributed between the two tails of the curve. If, on the other hand, the hypothesis is directional, the researcher will use a one-tailed test of significance. The portion of the curve in which statistical values are considered significant is in one side of the curve, either the right or the left tail. It is easier to obtain statistical significance with a one-tailed statistical test, but the researcher will run the risk of a Type I error. A two-tailed test is a more stringent statistical approach.

Type I and II Errors

Because researchers deal with probabilities in statistical inference, two types of statistical inaccuracy or error (Type I and Type II) can contribute to the inability to claim full confidence in findings.

Type I Errors

In a Type I error, also called an “alpha error,”3 the researcher errs by failing to accept the null hypothesis when it is true. In other words, the researcher claims a difference between groups when, if the entire population were measured, there would be no difference. This error can occur when the most extreme members of a population are selected by chance in a sample. Assume, for example, that you set the level of significance at 0.05, indicating that 5 times out of 100 the null hypothesis can be rejected when it is accurate. Because the probability of making a Type I error is equal to the level of significance chosen by the investigator, reducing the level of significance will reduce the chances of making this type of error. Unfortunately, as the probability of making a Type I error is reduced, the potential to make another type of error (Type II) increases.

Type II Errors

A Type II error, also called a "beta error,"3 occurs if the null hypothesis is mistakenly accepted when it should not be. In other words, the researcher fails to ascertain group differences when they have occurred. If you make a Type II error, you will conclude, for example, that the intervention did not have a positive outcome on the dependent variable when it actually did. The probability of making a Type II error is not as apparent as that of making a Type I error.3 The likelihood of making a Type II error is based in large part on the power of the statistic to detect group differences.4

Determination and Consequences of Errors

Type I and II errors are mutually exclusive. However, as you decrease the risk of a Type I error, you increase the chances of a Type II error. Furthermore, it is difficult to determine whether either error has been made because actual population parameters are not known by the researcher. It is often considered more serious to make a Type I error because the researcher is claiming a significant relationship or outcome when there is none. Because other researchers or practitioners may act on that finding, the researcher wants to insure against Type I errors. However, failure to recognize a positive effect from an intervention, a Type II error, can also have serious consequences for professional practice. For example, on the basis of an inaccurate finding, a valuable and productive intervention may be discarded.


URL: https://www.sciencedirect.com/science/article/pii/B9780323261715000203

Why use a one-tailed rather than a two-tailed test?

A two-tailed test uses both the positive and negative tails of the distribution. In other words, it tests for the possibility of a difference in either direction. A one-tailed test is appropriate if you want to determine only whether there is a difference between groups in a specific direction.

Why is the choice of a one-tailed rather than a two-tailed test potentially important?

A one-tailed test requires a smaller sample size to detect the same effect with the same power. Generally, to avoid the appearance that a one-tailed test was chosen only because statistical significance was not achieved with a two-tailed test, investigators avoid one-tailed tests.

What is the advantage of a one-tailed test?

The advantage of adopting the one-tailed test is an improvement in power to reject the null hypothesis if the null hypothesis is truly false.

When should you use a one-tailed test?

So when is a one-tailed test appropriate? If you consider the consequences of missing an effect in the untested direction and conclude that they are negligible and in no way irresponsible or unethical, then you can proceed with a one-tailed test.