
The Microbiome in Health and Disease

Yinglin Xia, in Progress in Molecular Biology and Translational Science, 2020

8.1.1.3 Wilcoxon rank-sum test and Wilcoxon signed-rank test

The Wilcoxon rank-sum test and the Wilcoxon signed-rank test were proposed by Frank Wilcoxon in a single paper.599 The Wilcoxon rank-sum test is used to compare two independent samples, while the Wilcoxon signed-rank test is used to compare two related samples or matched samples, or to conduct a paired difference test of repeated measurements on a single sample, to assess whether the population mean ranks differ. They are nonparametric alternatives to the unpaired and paired Student's t-tests (the latter also known as the “t-test for matched pairs” or “t-test for dependent samples”), respectively. Neither nonparametric test assumes that the samples are normally distributed. The Wilcoxon unpaired two-sample test statistic is equivalent to the statistic proposed by the German statistician Gustav Deuchler in 1914; however, Deuchler incorrectly calculated the variance.670 Wilcoxon formulated a test of significance with a point null hypothesis against its complementary alternative in his 1945 paper, but he gave the null distribution only for the equal-sample-size case and tabulated only a few points (larger tables appeared in a later paper). A thorough analysis of the statistic was provided by Henry Mann and Donald Ransom Whitney in their 1947 paper.598 This is why the Wilcoxon rank-sum test is also called the Wilcoxon-Mann-Whitney test, and why the Mann-Whitney U test is equivalent to the Wilcoxon rank-sum test.
In the microbiome study by Falony et al.,671 the Wilcoxon rank-sum test and the Wilcoxon signed-rank test were used to compare the median differences in alpha-diversity measures, proportion of core genera, and abundance of specific genera for categorical variables and for matched samples, respectively. Other examples of the Mann-Whitney U test or Wilcoxon rank-sum test in microbiome studies are provided in several reports.272,533,606,672–675 For within-group comparisons of alpha diversity, the Wilcoxon signed-rank test can generally be applied to each pairwise within-group comparison of gut microbiota diversity (gene richness)676 and relative abundance of microbial phyla.672 Other examples of Wilcoxon signed-rank test use in microbiome studies can be found in these papers.415,606,677,678

Mann-Whitney U test and Wilcoxon rank-sum test are also often used to identify association between taxa or OTUs and covariates. However, these approaches conduct the association analysis based on the ranks of observed relative abundances, resulting in information loss and high false-negative rates.
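As a concrete illustration of the unpaired form, the rank-sum statistic and its large-sample normal approximation can be sketched in a few lines of Python. The data and function name are hypothetical, and ties are not handled, so this is a sketch rather than a production implementation.

```python
from statistics import NormalDist

def rank_sum_z(x, y):
    """Wilcoxon rank-sum test of two independent samples (normal approximation).

    Ties are not handled, for brevity; real analyses should use average ranks.
    """
    # Pool the samples and rank every observation
    pooled = sorted((v, grp) for grp, sample in enumerate((x, y)) for v in sample)
    # W = sum of the ranks held by the first sample
    w = sum(rank for rank, (_, grp) in enumerate(pooled, start=1) if grp == 0)
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2                   # expected rank sum under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return w, z, p

# Hypothetical relative abundances of one genus in two independent groups
w, z, p = rank_sum_z([0.10, 0.12, 0.15], [0.20, 0.25, 0.30])
```

Because the statistic depends only on ranks, extreme abundances do not distort it; that robustness is exactly what costs information relative to tests on the observed values.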


URL: https://www.sciencedirect.com/science/article/pii/S1877117320300478

Statistical Analysis in Preclinical Biomedical Research

Michael J. Marino, in Research in the Biomedical Sciences, 2018

3.6.5 Multiple Comparisons with Repeated Measures

The case described earlier for the paired t-test or the Wilcoxon signed-rank test is the simplest form of a repeated measures design. The use of multiple comparisons with repeated measures is very common, especially in studies evaluating the time course of an effect. For example, if the earlier described repeated measures study in which a baseline is measured, and then a measurement is taken after compound treatment in the same animal were extended to include measurements taken every 15 min after dosing for 2 h, each animal will be sampled once for baseline, and 8 times for treatment resulting in 8 comparisons. However, these 9 groups are not independent as they are all obtained from the same set of animals after the same treatment and would be expected to share some variance. This type of study should be analyzed using a one way repeated measures ANOVA (Fig. 3.7).

The repeated measures ANOVA is an extension of the ANOVA that accounts for the shared variance in the groups. The assumptions of the one way repeated measures ANOVA are the same as the ANOVA with the addition of an assumption of sphericity. This is an extension of the equal variance assumption that states that the variance of the differences between all combinations of related groups is equal. Most software packages that provide repeated measured ANOVA will perform tests of sphericity, and these tests should be used as the repeated measures ANOVA is very sensitive to violations of this assumption. Choice of post hoc testing is the same as that discussed for multiple comparisons between independent groups by ANOVA (Fig. 3.7).

For situations where the normality assumption is violated in a repeated measures design involving three or more groups, the Friedman test (Friedman, 1937), a rank nonparametric version of the analysis of variance can be used (Fig. 3.8). The Friedman test is an extension of the Wilcoxon signed-rank test and carries all of the assumptions of that test described earlier with the additional assumption of sphericity. The null hypothesis for the Friedman test states that all groups have the same median value and the p-value is interpreted as the probability that differences in the median can be attributed to chance alone. As with the Wilcoxon signed-rank test, Dunn’s test can be used as a post hoc analysis to determine which groups are significantly different (Fig. 3.8).
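The Friedman statistic described above can be sketched from first principles. The data below are hypothetical (subjects measured under several conditions), and ties within a subject are not handled, so this is an illustrative sketch only.

```python
def friedman_chi2(data):
    """Friedman chi-square for repeated measures.

    `data` is a list of subjects, each a list of k measurements (one per
    condition/time point). Ties within a subject are not handled, for brevity.
    """
    n, k = len(data), len(data[0])
    col_rank_sums = [0.0] * k
    for row in data:
        # Rank this subject's k measurements from smallest (1) to largest (k)
        for rank, j in enumerate(sorted(range(k), key=lambda j: row[j]), start=1):
            col_rank_sums[j] += rank
    # Classic Friedman statistic, approximately chi-square with k - 1 df
    return 12 / (n * k * (k + 1)) * sum(r * r for r in col_rank_sums) - 3 * n * (k + 1)

# Hypothetical time course: 4 animals, each measured under 3 conditions
chi2 = friedman_chi2([[5, 9, 8], [4, 8, 9], [6, 9, 7], [5, 8, 9]])
```

Ranking within each subject is what removes the between-subject variance, mirroring how the repeated measures ANOVA accounts for shared variance.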


URL: https://www.sciencedirect.com/science/article/pii/B9780128047255000033

CCTV Ergonomics: Case Studies and Practical Guidance

John Wood, in Meeting Diversity in Ergonomics, 2007

Results

The data collected were subjected to a battery of statistical tests including repeated-measures ANOVA, Friedman's test, the Bonferroni post hoc test, the Wilcoxon signed-rank test, and the t-test. The main conclusion was that there was no significant difference between the different displays, and thus the replacement of the CRT with flat screens would not result in any significant degradation in signaller performance. The key results are presented in the bar charts (Figs. 13 and 14).


FIGURE 13. Targets missed vs. display type and image.


FIGURE 14. Reaction time vs. display type and image.

The targets missed results were tested statistically. Although the bar chart shows a difference between the error scores, this was not statistically significant. The number of targets missed, in general, was not affected by display type.

The performance with the mono TFT is marginally worse than the existing condition, mono CRT, and the full colour condition is marginally better. Statistical tests showed the differences to be significant. However, in real terms a reaction time difference of a few tens of milliseconds would not be a significant differentiator: the nature of the operational task means that operators are required to take the time necessary to satisfy themselves that the crossing is clear – thus the operator is not under time pressure.

In general, participants felt that the colour clips presented on the TFT monitor provided better definition for the images.


URL: https://www.sciencedirect.com/science/article/pii/B9780080453736500179

Motor Unit Number Estimation (MUNE) and Quantitative EMG

Keisuke Arasaki, ... Ryosuke Ushijima, in Supplements to Clinical Neurophysiology, 2009

2.5 Statistical analysis

We performed statistical analyses using SPSS 11.5J for Windows (SPSS Japan Inc., Tokyo). Since we could not be sure as to the exact nature of the population distribution under study, we adopted non-parametric statistical tests. As a result, we used Wilcoxon's signed ranks test to compare the uMAP area, CMAP area, and the MUNEs on the affected and unaffected sides of each of the patients with cerebral infarction. In order to compare the MUNE ratios of patients with cerebral infarction and hand weakness to the ratios of the patients having normal hand strength we used the Mann–Whitney U test. Spearman's correlation test was used to find a statistically significant correlation between the MUNE ratio and the JSS-UM.


URL: https://www.sciencedirect.com/science/article/pii/S1567424X08000196

Quality Evaluation for Compressed Medical Images: Diagnostic Accuracy

Pamela Cosman, ... Richard Olshen, in Handbook of Medical Imaging, 2000

Results Using the Personal Gold Standard

As was discussed previously, the personal gold standard was set by taking a radiologist's recorded vessel size on the uncompressed image to be the correct measurement for judging her or his performance on the compressed images. Using a personal gold standard in general accounts for a measurement bias attributed to an individual radiologist, thereby providing a more consistent result among the measurements of each judge at the different compression levels. The personal gold standard thus eliminates the interobserver variability present with the independent gold standard. However, it does not allow us to compare performance at compressed bit rates to performance at the original bit rates, since the standard is determined from the original bit rates, thereby giving the original images zero error. As before, we first consider visual trends and then quantify differences between levels by statistical tests.

Figure 7 shows average pme vs mean bit rate for the five compressed levels for each judge separately and for the judges pooled, whereas Fig. 8 is a display of the actual pme vs. actual achieved bit rate for all the data points. The data for the judges pooled are the measurements from all judges, images, levels, and vessels, with each judge's measurements compared to her or his personal gold standard. In each case, quadratic splines with a single knot at 1.0 bpp were fit to the data. Figs 9 and 10 are the corresponding figures for the apme. As expected, with the personal gold standard the pme and the apme are less than those obtained with the independent gold standard. The graphs indicate that whereas both judges 2 and 3 overmeasured at all bit rates with respect to the independent gold standard, only judge 3 overmeasured at the compressed bit rates with respect to the personal gold standard.


FIGURE 7. Mean pme vs mean bit rate using the personal gold standard. The dotted, dashed, and dash-dot curves are quadratic splines fit to the data points for judges 1, 2, and 3, respectively. The solid curve is a quadratic spline fit to the data points for all judges pooled.


FIGURE 8. Pme vs actual bit rate using the personal gold standard. The x's indicate data points for all images, pooled across judges and compression levels. The solid curve is a quadratic spline fit to the data.


FIGURE 9. Mean apme vs mean bit rate using the personal gold standard. The dotted, dashed, and dash-dot curves are quadratic splines fit to the data points for judges 1, 2, and 3, respectively. The solid curve is a quadratic spline fit to the data points for all judges pooled.


FIGURE 10. Apme vs actual bit rate using the personal gold standard. The x's indicate data points for all images, pooled across judges and compression levels. The solid curve is a quadratic spline fit to the data.

The t-test results indicate that levels 1 (0.36 bpp) and 4 (1.14 bpp) have significantly different pme associated with them than does the personal gold standard. The results of the Wilcoxon signed rank test on percent measurement error using the personal gold standard are similar to those obtained with the independent gold standard. In particular, only level 1 at 0.36 bpp differed significantly from the originals. Furthermore, levels 1, 3, and 4 were significantly different from level 5.

Since the t-test indicates that some results are marginally significant when the Wilcoxon signed rank test indicates the results are not significant, a Bonferroni simultaneous test (union bound) was constructed. This technique uses the significance level of two different tests to obtain a significance level that is simultaneously applicable for both. For example, in order to obtain a simultaneous significance level of α% with two tests, we could have the significance of each test be at (α/2)%. With the simultaneous test, the pme at level 4 (1.14 bpp) is not significantly different from the uncompressed level. As such, the simultaneous test indicates that only level 1 (0.36 bpp) has significantly different pme from the uncompressed level. This agrees with the corresponding result using the independent gold standard. Thus, pme at compression levels down to 0.55 bpp does not seem to differ significantly from the pme at the 9.0 bpp original.
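The union-bound construction can be sketched as follows; the function name and the example p-values are hypothetical, and only the α-splitting rule from the text is implemented.

```python
def bonferroni_simultaneous(p_values, alpha=0.05):
    """Union-bound (Bonferroni) simultaneous test across several analyses.

    Each test is judged at alpha / (number of tests), so the chance of any
    false rejection across all the tests together is at most alpha.
    """
    per_test_alpha = alpha / len(p_values)
    return [p < per_test_alpha for p in p_values]

# Hypothetical p-values from a t-test and a Wilcoxon signed rank test:
# 0.03 is significant on its own (0.03 < 0.05) but not simultaneously,
# because the per-test threshold drops to 0.05 / 2 = 0.025.
flags = bonferroni_simultaneous([0.03, 0.40])
```

This is exactly why a level that is "marginally significant" under one test can fail the simultaneous criterion, as happened with pme at level 4.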

In summary, with both the independent and personal gold standards, the t-test and the Wilcoxon signed rank test indicate that pme at compression levels down to 0.55 bpp did not differ significantly from the pme at the 9.0 bpp original. This was shown to be true for the independent gold standard by a direct application of the tests. For the personal gold standard, this was resolved by using the Bonferroni test for simultaneous validity of multiple analyses. The status of measurement accuracy at 0.36 bpp remains unclear, with the t-test concluding no difference and the Wilcoxon indicating significant difference in pme from the original with the independent gold standard, and both tests indicating significant difference in pme from the original with the personal gold standard. Since the model for the t-test is fitted only fairly to moderately well by the data, we lean towards the more conservative conclusion that lossy compression by our vector quantization compression method is not a cause of significant measurement error at bit rates ranging from 9.0 bpp down to 0.55 bpp, but it does introduce error at 0.36 bpp.

A radiologist's subjective perception of quality changes more rapidly and drastically with decreasing bit rate than does the actual measurement error. Radiologists evidently believe that the usefulness of images for measurement tasks degrades rapidly with decreasing bit rate. However, their actual measurement performance on the images was shown by both the t-test and Wilcoxon signed rank test (or the Bonferroni simultaneous test to resolve differences between the two) to remain consistently high down to 0.55 bpp. Thus, the radiologist's opinion of an image's diagnostic utility seems not to coincide with its utility for the clinical purpose for which the image was taken. The radiologist's subjective opinion of an image's usefulness for diagnosis should not be used as the sole predictor of actual usefulness.


URL: https://www.sciencedirect.com/science/article/pii/B9780120777907500588

Tests on Ranked Data

R.H. Riffenburgh, in Statistics in Medicine (Third Edition), 2012

Example Posed: Bias in Early Sampling of Prostate Biopsy Patients

We ask if the PSA levels of the first 20 patients differ significantly from 8.96 (the average of the remaining 281). When the sample is large, it exceeds the tabulated probabilities, but the test statistic may be approximated by the normal distribution, with μ and σ calculated by simple formulas.

Method: The Normal Approximation to the Signed-Rank Test

The normal approximation to the Wilcoxon signed-rank test tests the hypothesis that the distribution of differences has a median of 0. (The median and mean are the same in the normal distribution.) It may test (1) a set of observations deviating from a hypothesized common value or (2) pairs of observations on the same individuals, such as before-and-after data. The p-value will not be identical with the exact method, but only rarely will this difference change the outcome decision. The steps in performing the test by hand are as follows:

1. Calculate the differences between pairs or from a hypothesized central value.

2. Rank the magnitudes (i.e., the differences without signs).

3. Reattach the signs to the ranks.

4. Add up the positive and negative ranks.

5. Denote by T the unsigned value of the smaller; n is the sample size (number of ranks).

6. Calculate μ = n(n + 1)/4, σ² = (2n + 1)μ/6, and then z = (T − μ)/σ.

7. Obtain the p-value (as if it were α) from Table I.
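The hand-calculation steps above can be sketched in Python. This is a minimal illustration with a hypothetical function name, not a replacement for statistical software; tied magnitudes receive average ranks and zero differences are dropped.

```python
import math
from statistics import NormalDist

def signed_rank_normal_approx(values, hypothesized_median):
    """Normal approximation to the Wilcoxon signed-rank test.

    Returns (T, z, two-sided p), following the seven steps in the text.
    """
    # Step 1: differences from the hypothesized central value
    diffs = [v - hypothesized_median for v in values if v != hypothesized_median]
    n = len(diffs)
    # Step 2: rank the magnitudes, averaging ranks across tied magnitudes
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average rank of the tied block
        i = j + 1
    # Steps 3-5: reattach signs, sum each side, T = smaller unsigned sum
    t_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    T = min(t_pos, t_neg)
    # Step 6: mu = n(n + 1)/4, sigma^2 = (2n + 1)mu/6, z = (T - mu)/sigma
    mu = n * (n + 1) / 4
    sigma = math.sqrt((2 * n + 1) * mu / 6)
    z = (T - mu) / sigma
    # Step 7: p-value from the normal distribution (T <= mu, so z <= 0)
    p = 2 * NormalDist().cdf(z)
    return T, z, p
```

Passing a list of PSA levels and the value 8.96 would reproduce the example posed above; the same function handles before-and-after pairs if the paired differences are supplied with a hypothesized median of 0.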


URL: https://www.sciencedirect.com/science/article/pii/B9780123848642000111

Description and Analysis of Data, and Critical Appraisal of the Literature

Brian W. McCrindle, in Paediatric Cardiology (Third Edition), 2010

Matched Pairs and Measures of Agreement

Usually measurements for a study are independent of one another. For example, we may wish to compare measurements made in two groups of subjects where we know that the groups are composed of separate individuals that bear no relationship to one another. In order to reduce variation or to control for a given factor, we may create pairs of individuals matched for a common characteristic. Alternatively, the two groups may not be independent but have an individual-level relationship, such as a group of subjects and a group of their siblings. The subjects and their siblings represent matched pairs. When this is the case, we must use statistical testing that takes into account the fact that the two groups are not independent. If the outcome variable is categorical, we would use a McNemar chi-square test. If the outcome variable is ordinal, we would use an appropriate nonparametric test, such as the Wilcoxon signed rank test. If the outcome variable is continuous, we would use a paired t test. Each of these tests relates to a different and specific type of probability distribution.

Sometimes, we will make repeat measurements in the same subject but using different methodology. By their nature, it is not surprising that the measurements will correlate, since they are measuring the same thing. What we are actually interested in is the degree to which the measurements agree, or agree with a criterion standard. Agreement between two binary variables can be expressed in one of two ways, either through the raw agreement or through the chance corrected agreement. The raw agreement is merely the number of times two measures agree divided by the total number of measures. By chance, two binary variables will agree approximately half of the time. Based on this, raw agreement is of limited interest. Cohen’s kappa or chance-corrected agreement is the degree of agreement between variables beyond that expected by chance alone. When continuous variables are of interest, agreement is assessed and depicted using Bland-Altman plots. A Bland-Altman plot plots the difference between two measurements in a pair on the y-axis versus the mean of those two measurements on the x-axis. If the agreement were perfect, all of the points would be at a difference of zero regardless of the value of the measurement. The plot can show the degree and limits of agreement, but also any patterns. Systematic bias can be noted, as well as changes in the magnitude of agreement as the average values get larger or smaller. A paired t test can be used to determine if any systematic differences exist between pairs of measures.
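The quantities behind a Bland-Altman plot can be computed in a few lines. This is a sketch with hypothetical data and a hypothetical helper name; the 1.96 multiplier gives the conventional 95% limits of agreement.

```python
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Pairwise differences vs. pairwise means, with 95% limits of agreement.

    Returns the points one would plot (mean on the x-axis, difference on the
    y-axis), the systematic bias, and the limits bias +/- 1.96 SD.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)                  # systematic difference between methods
    sd = stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Two hypothetical methods measuring the same quantity in 5 subjects
means, diffs, bias, (lo, hi) = bland_altman([10.2, 11.5, 9.8, 12.1, 10.9],
                                            [10.0, 11.9, 9.5, 12.4, 10.6])
```

Plotting `diffs` against `means` with horizontal lines at `bias`, `lo`, and `hi` reveals any systematic bias and whether agreement changes as the average values get larger or smaller.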


URL: https://www.sciencedirect.com/science/article/pii/B9780702030642000242

Biostatistical Basis of Inference in Heart Failure Study

Longjian Liu MD, PHD, MSC (LSHTM), FAHA, in Heart Failure: Epidemiology and Research Methods, 2018

Nonparametric Tests

Nonparametric tests are used in detecting population differences when certain assumptions are not satisfied.

Wilcoxon signed rank test

The Wilcoxon signed rank test, a rank test used in nonparametric statistics, can be considered an alternative to the t-test when the grouping variable is binary but the dependent variable is not normally distributed. It is used to compare the locations of two populations, to determine whether one population is shifted with respect to another. The method employed is a comparison of rank sums. (Strictly, the two-sample form applied in the example below is the Wilcoxon rank-sum test; the signed-rank test is its paired counterpart.)

Example

To compare the difference in mean serum glucose between patients with heart failure and those without heart failure, a hypothetical study and the related measures are given below.

In the heart failure group, n = 25 and mean (SD) glucose = 129.2 (91.59) mg/dL.

In the comparison group (i.e., those without heart failure), n = 25 and mean (SD) glucose = 107.76 (44.41) mg/dL. Fig. 4.12 depicts the distribution of serum glucose levels in the two groups. It is evident from the figure that serum glucose is not normally distributed by heart failure status in the study sample, so using the t-test to examine the mean difference is not appropriate. In this case, we can either transform the data (serum glucose level) toward a normal (or approximately normal) distribution and then use the t-test, or we can apply the Wilcoxon test to examine the difference.

SAS computing

SAS Proc step. The following statements request a Wilcoxon test of the null hypothesis that there is no difference in mean serum glucose (SGP) levels between the two groups. HF is the CLASS variable (representing those with or without heart failure), and SGP is the analysis variable. The WILCOXON option requests an analysis of Wilcoxon scores because the sample size is small and the large-sample normal approximation is not adequate. The statements (reconstructed here in outline, as the original listing is omitted; the dataset name is hypothetical) are of the form:

proc npar1way data=hfstudy wilcoxon;
   class HF;
   var SGP;
run;

These statements produce the results shown below.

Output

The NPAR1WAY Procedure

Wilcoxon Scores (Rank Sums) for Variable SGP Classified by Variable HF
HF    N    Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
YES   25   651.0           637.50              51.497969          26.040
NO    25   624.0           637.50              51.497969          24.960
Average scores were used for ties.
Wilcoxon Two-Sample Test
Statistic 651.0000
Normal approximation
z 0.2524
One-sided P > z 0.4004
Two-sided P > |z| 0.8007
t approximation
One-sided P > z 0.4009
Two-sided P > |z| 0.8018
z includes a continuity correction of 0.5.
Kruskal-Wallis Test
Chi-square 0.0687
DF 1
P > Chi-square 0.7932

The table above displays the results of the Wilcoxon two-sample test. The Wilcoxon statistic equals 651.0. Because this value is greater than 637.50, the expected value under the null hypothesis, PROC NPAR1WAY displays the right-sided P-values. The t approximation for the Wilcoxon scores (rank sums) two-sample test yields a two-sided P-value of 0.8018 (>.05). We therefore fail to reject the null hypothesis: the two mean Wilcoxon scores do not differ significantly.

In practice, let us also conduct a log-transformed analysis for the purpose of using t-test. Fig. 4.13 depicts the distributions of log-transformed values of serum glucose concentration. It shows an approximately normal distribution for those with or without heart failure. Therefore, we may use t-test for the transformed data.
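The transform-then-test workflow can be sketched in Python with hypothetical data; the pooled statistic below mirrors the "Pooled" (equal-variances) row of the t-test output shown next.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample pooled (equal-variance) t statistic, df = nx + ny - 2."""
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Log-transform right-skewed glucose-like values, then compare on the log scale
group_a = [math.log(v) for v in [98, 110, 125, 160, 420]]
group_b = [math.log(v) for v in [90, 105, 118, 130, 150]]
t_stat = pooled_t(group_a, group_b)
```

The log transform pulls in the long right tail, so the normality assumption of the t-test is more plausible on the transformed scale.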

The t-test procedure

Variable: Logsgp

HF            N    Mean      Std Dev   Std Err   Minimum   Maximum
NO            25   4.6286    0.2932    0.0586    4.2905    5.6525
YES           25   4.7219    0.4797    0.0959    3.9318    6.1717
Diff (1–2)         −0.0932   0.3976    0.1125

HF            Method          Mean      95% CL Mean          Std Dev   95% CL Std Dev
NO                            4.6286    4.5076   4.7497      0.2932    0.2290   0.4079
YES                           4.7219    4.5238   4.9199      0.4797    0.3746   0.6674
Diff (1–2)    Pooled          −0.0932   −0.3193  0.1329      0.3976    0.3315   0.4967
Diff (1–2)    Satterthwaite   −0.0932   −0.3205  0.1341

Method          Variances   DF       t Value   P > |t|
Pooled          Equal       48       −0.83     .4112
Satterthwaite   Unequal     39.735   −0.83     .4121

Equality of Variances
Method     Num DF   Den DF   F Value   P > F
Folded F   24       24       2.68      0.0192

The tables above give a calculated t statistic of −0.83 and a P-value of .4121 (taking variances as unequal, because the equality-of-variances test gives P = .0192). Because .4121 > .05, we fail to reject the null hypothesis that the two mean log serum glucose values are equal in those with and without heart failure. The conclusion is the same as that of the Wilcoxon scores (rank sums) two-sample test.

The Kruskal-Wallis test: mean differences among three or more groups

The Kruskal-Wallis test, a median test, can be considered an alternative to ANOVA when the independent variable is categorical (three or more groups) but the dependent variable is not normally distributed.

Example

This example tests the difference in mean serum glucose levels among nonsmokers, former smokers, and current smokers. The table below shows that among nonsmokers (n = 28), mean (SD) serum glucose = 111.32 (53.44) mg/dL; in former smokers (n = 7), mean (SD) glucose = 96.29 (20.84) mg/dL; and in current smokers (n = 15), mean (SD) glucose = 142.20 (107.36) mg/dL. The null hypothesis is that these three sample means come from the same population, μ1 = μ2 = μ3.

The MEANS Procedure

Analysis Variable: SGP Serum Glucose (mg/dL)
SMOKING   N Obs   N    Mean     Std Dev   Minimum   Maximum
NO        28      28   111.32   53.44     73.00     334.00
FORMER    7       7    96.29    20.84     73.00     139.00
CURRENT   15      15   142.20   107.36    51.00     479.00

SAS Proc step (SMK = smoking status)

The NPAR1WAY Procedure

Wilcoxon Scores (Rank Sums) for Variable SGP Classified by Variable smk
smk       N    Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
NO        28   673.50          714.00              51.125839          24.053571
CURRENT   15   454.00          382.50              47.198668          30.266667
FORMER    7    147.50          178.50              35.738255          21.071429
Average scores were used for ties.
Kruskal-Wallis Test
Chi-square 2.5296
DF 2
P > Chi-square .2823

In a nonparametric test for three or more means, we read the result of the Kruskal-Wallis test. In the example, the P-value is .2823 (>.05), so we fail to reject the null hypothesis: there is insufficient evidence to reject the hypothesis that the populations of serum glucose levels in the three exposure groups (nonsmokers, former smokers, and current smokers) have equal medians.
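The Kruskal-Wallis H statistic itself can be sketched from first principles. The data below are hypothetical, and ties are not handled, so this is an illustrative sketch; H is referred to a chi-square distribution with (number of groups − 1) degrees of freedom.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H over independent groups (ties not handled)."""
    # Pool all observations, remembering which group each came from, and rank
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N + 1)
    return (12 / (n * (n + 1))
            * sum(r * r / len(grp) for r, grp in zip(rank_sums, groups))
            - 3 * (n + 1))

# Hypothetical serum glucose values for never, former, and current smokers
h = kruskal_wallis_h([95, 101, 110, 120], [90, 97, 104], [115, 130, 142, 160])
```

With two groups, H reduces to the square of the rank-sum z statistic, which is why PROC NPAR1WAY reports a Kruskal-Wallis chi-square alongside the two-sample Wilcoxon output above.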


URL: https://www.sciencedirect.com/science/article/pii/B9780323485586000049

Statistical Techniques in Pharmaceutical Product Development

Ankur Barua, ... Rakesh K. Tekade, in Dosage Form Design Parameters, 2018

10.3 Statistical Procedures

The experimental study designs involve an intervention or manipulation by the investigator which is expected to influence the outcome, and its impact is measured in terms of before–after effects. The experimental study designs are categorized into two types: randomized and nonrandomized trials (Bradshaw, 2016). Randomization is the random allocation of participants to the intervention group and the control group (after they are selected and provide informed written consent to participate in the study) by an independent biostatistician not related to the designated research project. This is done to minimize selection bias as well as allocation bias. The randomization process is conducted by using a lottery method, a random number table, or electronically generated random numbers (Sagarin et al., 2014). The randomized controlled trial (RCT) always contains a control group which is adequately matched with the intervention group through at least one common parameter, which can be a sociodemographic correlate, an exposure, a disease, or any other outcome. The level of evidence generated by an RCT is of the highest quality. Hence, evidence generated from best-quality RCTs is included in systematic reviews and meta-analyses for evidence-based practice. On the other hand, a nonrandomized trial (NRT) does not contain any control group. Although the nonrandomized controlled trial (NRCT) always contains a control group, it is not appropriately matched with the intervention group. The levels of evidence generated by NRTs or NRCTs are of poor quality. Hence, evidence generated from them is never included in systematic reviews and meta-analyses for evidence-based practice (Borgnakke et al., 2014). The major experimental study designs are presented in Table 10.1.

Table 10.1. Experimental Study Designs

Experimental study designs (randomized or nonrandomized trials), with salient features and study units:

i. Laboratory experiments and animal experiments (Phase I trial)
Salient features: Comparison/control group is chosen after matching or randomization, following informed consent, wherever feasible. Done on animals or experiments under lab conditions. Blinding and cross-over techniques are more desired to be applied. Studies could be multicentric when >2 centers are involved.
Study unit: 1st group—intervention; 2nd group—control.

ii. Clinical trials (Phase II trial): (a) preventive trials (immunoprophylaxis/chemoprophylaxis); (b) therapeutic trials (drug)
Salient features: Application of statistical tests of significance is mandatory. These follow-up studies are conducted for demonstration of cause and effect relationships. Done on diseased individuals (patients).
Study unit: 1st group—intervention; 2nd group—control.

iii. Field trials (Phase III trial): (a) preventive trials (immunoprophylaxis/chemoprophylaxis); (b) therapeutic trials (drug)
Salient features: Done on apparently healthy individuals.
Study unit: 1st group—intervention; 2nd group—control.

iv. Community/post-marketing trials (Phase IV trial): (a) preventive trials (immunoprophylaxis/chemoprophylaxis); (b) therapeutic trials (drug)
Salient features: Done on community as a unit.
Study unit: 1st group—intervention; 2nd group—control.

v. Survival analysis (cohort/experimental)
Salient features: Done on either diseased or apparently healthy individuals.
Study unit: 1st group—cohort/intervention; 2nd group—control.

The experimental studies are conducted in four phases based on the size of the study population or sample size and quality of evidence that it is expected to be generated (Busk and Marascuilo, 2015). The Phase I trial is the beginning of the pharmaceutical product development process. It includes laboratory experiments and animal experiments on a very small amount of sample to study the pharmacokinetics, pharmacodynamics and toxicity levels.

The Phase II trial is also known as a clinical trial, where a relatively smaller number of individuals (biostatistically approved) having a common exposure or disease are recruited to estimate the dose–response, efficacy, side-effects, and adverse reactions. This could be either a preventive trial (vaccine for post-exposure prophylaxis, nutritional supplement, or drug for chemoprophylaxis) or a therapeutic trial (drug for treatment). The vaccines for post-exposure prophylaxis, nutritional supplements, drugs for chemoprophylaxis, and drugs for treatment of any disease are never released in the market before they pass this phase (Wright, 2017).

The Phase III trial is also known as a field trial where a significant number of individuals having a common exposure are recruited from the field practice area to study the efficacy, side-effects, and adverse reactions. These are usually preventive trials involving a new vaccine for pre-exposure prophylaxis or nutritional supplement or drug for chemoprophylaxis. The new vaccines for pre-exposure prophylaxis, nutritional supplements, and drugs for chemoprophylaxis are never released in the market before they pass this phase. Some drugs for treatment of any disease might go through a Phase III trial before being released to the market. However, this is not mandatory (DeAngelis, 2017; Wright, 2017).

The Phase IV trial is also known as a community trial where a large community having a common exposure is involved to study the efficacy, side-effects, and adverse reactions. These trials are conducted on various brands of the same vaccine or nutritional supplement or drug which are marketed for a significant period and the respective manufacturing companies want to know which product is the best in terms of highest efficacy, least side-effects, and least adverse reactions for addressing the common, specific health issue (Wright, 2017). This is the reason why this phase is also known as post-marketing trial. As the Phase IV trials involve a large number of individuals who need to be followed up for a considerable period and monitored by many investigators, paramedical staff, and clinicians, this is the costliest study design among all experimental studies. Hence, the conduction of Phase IV trial is not mandatory under any international guidelines or regulations. It is left to the decree of renowned companies to conduct this trial depending on their financial capacity (Mahan, 2014).

Biostatistics serves a dominant role in quantification and inferential assessment in experimental study designs. Biostatistical procedures are most heavily incorporated in areas where international regulators have issued guidelines, such as preclinical testing of cardiac liability, stability testing, and carcinogenicity. Compliance with Good Manufacturing Practice (GMP) has become an important regulatory requirement in recent years. Hence, statistical inputs are regularly sought for high-risk screening, chemical and formulation development, as well as drug delivery and assay (Arias et al., 2017).

When repeated samples are drawn from the general population, the data follow a symmetric bell-shaped curve which is considered the reference curve in biostatistics and is used for all comparison purposes. This is also known as the normal distribution or Gaussian curve (Barua et al., 2015a) (Fig. 10.1).

Figure 10.1. Properties of the normal distribution or Gaussian curve.

Here, the data are symmetrically distributed on both sides with a total area of 1; all measures of central tendency (mean, median, and mode) coincide at the center; the mean is zero and the standard deviation (SD) is 1. Approximately 68% of the data fall within ±1 SD, 95% within ±2 SD, and 99.7% within ±3 SD. This is the ideal distribution of data from the general population and serves as a reference base. Since ±1 SD covers only 68% of the data (where normal findings would appear abnormal) and ±3 SD covers nearly the whole of the data (99.7%, where abnormal findings would appear normal), the middle path of ±2 SD, covering 95% of the data, is taken as the reference range for the normal distribution. Hence, the remaining 5% of the distribution (attributed to chance) is used to set the level of significance, or probability, or p-value, at <0.05. If the probability that the difference between two groups arose merely by chance is less than 0.05 (5%), chance carries very little weight and a real difference is suggested. This is the interpretation of the p-value in biostatistics. The confidence interval (CI) is a measure of the reliability of the study findings. It is hypothesized that if the same study were repeated 100 times using the same methodology under similar environmental conditions, then 95 times the findings would fall within the estimated range. The CI is calculated for all statistical parameters, such as incidence rates, prevalence rates, and strengths of association such as the odds ratio (OR), relative risk (RR), and hazard ratio (HR) (Higgins, 2017).
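The 68–95–99.7 rule and the 95% CI described above can be verified numerically with the standard library alone; the sketch below uses the error function for the normal coverage probabilities and the conventional 1.96 critical value for the CI (the data passed to the CI helper are illustrative).

```python
import math

def normal_coverage(k):
    """P(-k*SD < X < +k*SD) for a standard normal variable."""
    return math.erf(k / math.sqrt(2))

# The 68-95-99.7 rule described above:
for k in (1, 2, 3):
    print(f"±{k} SD covers {normal_coverage(k):.1%}")

def mean_ci_95(data):
    """Approximate 95% CI for a mean, using the normal critical
    value 1.96 (strictly, ±2 SD covers 95.45%; 1.96 SD covers 95%)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                             # standard error
    return mean - 1.96 * se, mean + 1.96 * se
```

Note that ±2 SD actually covers about 95.45% of the distribution; the exact 95% boundary sits at ±1.96 SD, which is why 1.96 appears in CI formulas.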

In biostatistical procedures, parametric tests are applied when the data closely follow a normal distribution curve, while nonparametric tests are applied when the data are skewed and deviate significantly from the normal distribution curve. The biostatistical tools used to study the nature of the distribution of a dataset are the histogram, the Q–Q plot, and the Shapiro–Wilk test. The basic algorithms of statistical tests of significance applicable to analytical and experimental studies are given in Table 10.2 for parametric tests and in Table 10.3 for nonparametric tests, respectively (Barua et al., 2015b).

Table 10.2. The Basic Algorithms of Parametric Tests of Significance Which are Applicable for Analytical and Experimental Studies

Parametric tests (data following the normal distribution curve)

Type of data for analysis | Categorical variables (comparison of proportions) | Continuous variables (comparison of mean and SEM)
Sample size ≤30 | Fisher's exact test | Independent t-test (Student's t-test)
Sample size >30 | Chi-square test | Z-test or independent t-test
Paired or before–after effect, sample size ≤30 | McNemar's test | Paired t-test
Paired or before–after effect, sample size >30 | McNemar's test | Paired t-test
Test for more than 2 subgroups of independent samples | Chi-square test | One-way ANOVA
Test for repeated measures in more than 2 subgroups | Chi-square test | One-way repeated measures ANOVA
Linear trend | Chi-square test for linear trend | —
Correlation | — | Pearson's correlation coefficient
Regression | Multiple logistic regression | Multiple linear regression
Survival analysis | Kaplan–Meier analysis | Cox regression

Table 10.3. The Basic Algorithms of Nonparametric Tests of Significance Which are Applicable for Analytical and Experimental Studies

Nonparametric tests (data not following the normal distribution curve)

Type of data for analysis | Categorical variables (comparison of proportions) | Continuous variables (comparison of median and IQR)
Sample size ≤30 | Nonparametric chi-square test | Mann–Whitney U-test
Sample size >30 | Nonparametric chi-square test | Kolmogorov–Smirnov Z-test
Paired or before–after effect | Nonparametric McNemar's test | Wilcoxon test
Test for more than 2 subgroups of independent samples | Nonparametric chi-square test | Kruskal–Wallis one-way ANOVA
Test for repeated measures in more than 2 subgroups | Kendall's W test | Friedman's ANOVA
Correlation | — | Spearman's correlation coefficient
Regression | Cox regression | Cox regression
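The decision rule behind Tables 10.2 and 10.3 for a two-group comparison of a continuous variable can be sketched as follows; this assumes the SciPy library is available, and the threshold and example data are illustrative, not prescriptive.

```python
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick the two-sample test per Tables 10.2-10.3: if both groups
    pass the Shapiro-Wilk normality check, use the independent t-test;
    otherwise fall back to the Mann-Whitney U-test."""
    normal = (stats.shapiro(a).pvalue > alpha and
              stats.shapiro(b).pvalue > alpha)
    if normal:
        result = stats.ttest_ind(a, b)
        return "t-test", result.pvalue
    result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", result.pvalue
```

In practice, the normality check should be complemented with a histogram or Q–Q plot, since the Shapiro–Wilk test has low power at small sample sizes.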

The biostatistical model of univariate analysis assesses one independent variable at a time against the outcome or dependent variable. It provides a list of probable risk factors associated with the outcome (morbidity or mortality). Here, the strength of association is expressed in terms of an unadjusted odds ratio or unadjusted relative risk. Multivariate analysis uses a complex biostatistical model which allows multiple independent variables to interact with each other at a time to produce the outcome (Hieke et al., 2016). It is used to study the independent effect of each variable on the outcome. It can eliminate the confounders identified in univariate analysis and identify the predictors of the outcome or dependent variable. Multiple logistic regression and the Cox proportional hazards model are commonly used biostatistical instruments for multivariate analysis in analytical and experimental studies. Here, the strength of association is expressed in terms of an adjusted odds ratio, adjusted relative risk, or hazard ratio (Binder and Blettner, 2015).

Survival analysis is conducted in prospective studies that have a significant follow-up period for a health-related event to occur. The primary endpoint in many prospective cohort or experimental studies is the time until an event occurs (e.g., death, remission). The survival function describes the proportion of individuals surviving up to, or beyond, a given point in time. Data are subject to censoring when a study ends before the event occurs (Lacny et al., 2017). The Kaplan–Meier estimate of the survival function is compared between groups using the log-rank test to observe whether the difference in survival rates between the intervention and control groups is statistically significant (Collett, 2015) (Fig. 10.2).
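The Kaplan–Meier estimate itself is simple enough to compute by hand: at each observed event time, the running survival probability is multiplied by (1 − deaths/at-risk). A minimal sketch with invented follow-up data:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.
    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns a list of (t, S(t)) pairs at each event time."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for tt in times if tt >= t)
        deaths = sum(1 for tt, e in zip(times, events)
                     if tt == t and e == 1)
        surv *= 1 - deaths / at_risk   # multiply in this step's survival
        curve.append((t, surv))
    return curve

# Five hypothetical subjects: events at t=1, 3, 4; censored at t=2 and 5
km = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
# S(1) = 0.8, S(3) = 0.533..., S(4) = 0.266...
```

Censored subjects contribute to the at-risk count up to their censoring time but never to the death count, which is exactly how censoring is handled in the plotted step curve of Fig. 10.2.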

Figure 10.2. The Kaplan–Meier survival analysis.

The hazard function of survival time provides the conditional failure rate, that is, the risk of failure per unit time during the aging process. Hence, the hazard function is also known as the instantaneous failure rate, force of mortality, or age-specific failure rate. A proportional hazards model takes into consideration the fact that different individuals have hazard functions that are proportional to one another (Ediebah et al., 2014). The Cox proportional hazards model (CPHM) is a multivariate analysis technique for investigating the relationship between survival time and independent variables (Fig. 10.3). It compares two or more intervention groups while adjusting for risk factors affecting survival times, much like multiple regression. It follows a log-linear model to assess the hazard ratio by modeling the relative risk of an event as a function of time and covariates (Ediebah et al., 2014).
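The log-linear form of the Cox model, h(t | x) = h0(t)·exp(β·x), implies that the baseline hazard h0(t) cancels when two covariate patterns are compared, so the hazard ratio is constant over time. A small numeric illustration (the coefficients below are invented, not fitted values):

```python
import math

def hazard_ratio(beta, x_treat, x_control):
    """Under the Cox model h(t|x) = h0(t) * exp(beta . x), the baseline
    hazard h0(t) cancels in the ratio, leaving a time-independent HR."""
    lp = sum(b * (xt - xc)
             for b, xt, xc in zip(beta, x_treat, x_control))
    return math.exp(lp)

# Hypothetical coefficients: treatment indicator (0/1) and age in years
beta = [-0.69, 0.03]
hr = hazard_ratio(beta, x_treat=[1, 60], x_control=[0, 60])
# exp(-0.69) ~ 0.50: treated subjects have about half the hazard
```

Because age is identical in both profiles, only the treatment coefficient contributes; this cancellation of shared covariates is what "adjusting for risk factors" means in the CPHM.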

Figure 10.3. The Cox’s proportional hazards model.

Noninferiority trials are sometimes conducted to determine whether alternative treatments are good enough to be used side by side with the main regimen. The comparisons of superiority, equivalence, and noninferiority hypotheses are based on a prespecified margin of difference in event rates (e.g., 2%). Some newly developed pharmaceutical products may prove to be potentially as effective as existing standard treatments; they eventually become preferred over the standard ones if they cost less, have fewer side-effects or adverse reactions, or are easier to administer (Scott, 2009).
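Operationally, noninferiority is usually judged by whether the lower confidence limit for the difference in success rates (new minus standard) stays above the negative margin. A minimal sketch using the normal approximation for a difference of proportions, with the 2% margin mentioned above (trial figures are invented):

```python
import math

def is_noninferior(p_new, n_new, p_std, n_std, margin=0.02, z=1.96):
    """Declare noninferiority if the lower bound of the 95% CI for
    (new success rate - standard success rate) lies above -margin.
    Uses the normal-approximation CI for a difference of proportions."""
    diff = p_new - p_std
    se = math.sqrt(p_new * (1 - p_new) / n_new
                   + p_std * (1 - p_std) / n_std)
    lower = diff - z * se
    return lower > -margin, lower
```

With small samples the interval is wide and noninferiority is hard to establish even for a genuinely similar treatment, which is why noninferiority trials tend to be large.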

Every type of study design adopted in health research or pharmaceutical product development generates some evidence which is considered optimum according to feasibility for that particular time frame, for the designated study population, in a specified place. The complexity of study designs and biostatistical applications increases from observational studies to analytical ones, followed by experimental studies, and culminates in systematic review and meta-analysis. The hierarchy of evidence in health research is depicted in Fig. 10.4.

Figure 10.4. Hierarchy of evidence in health research.

The evidence in healthcare and pharmaceutical product development originates from observational studies. Health research on a new disease starts from a single case report. When several similar case reports are available, a case series analysis is conducted (Guo et al., 2016). Cross-sectional (prevalence) studies and ecological studies are conducted next to identify the sociodemographic and environmental correlates associated with a specific health outcome. All the correlates found to be associated with morbidity or mortality in observational studies are only probable ones, not definitive. These factors are further evaluated in analytical studies such as case-control, cohort, and nested case-control studies. The highest level of evidence among these comes from prospective cohort and nested case-control studies. Experimental studies are the strongest study designs, but they are not feasible in every circumstance. Among these, the most powerful study design for producing the best quality of evidence is the RCT. However, for evidence-based practice, the highest level of evidence is obtained from meta-analyses, followed by systematic reviews of RCTs (Walsh et al., 2014).

Data warehousing is the consolidated storage of data from various study designs conducted across the world in a separate database. The systematic arrangement of this historical data from the warehouse, followed by extraction, synthesis, analysis, and interpretation to generate new information and evidence, is called data mining. This is the most cost-effective way of generating good-quality evidence in pharmaceutical product development and health research (Truong et al., 2017). However, the process of data mining involves very complex and laborious quality assessment and biostatistical methods. Here, the pooled data from various analytical and experimental studies, conducted in different parts of the world, are synthesized and analyzed to evaluate the evidence of whether a specific pharmaceutical product is beneficial or harmful for the consumers. Systematic review and meta-analysis are part of this multifaceted data mining process and are frequently used in evidence-based practice (Athanassoulis et al., 2015). The types of reviews on pooled data in evidence-based practice are presented in Fig. 10.5.

Figure 10.5. The types of reviews on pooled data in evidence-based practice.

The meta-analysis of individual patient data through pooled analysis provides the best quality of evidence in the evidence-based practice of health research. The next highest level is meta-analysis of experimental studies, followed by meta-analysis of analytical studies (Schmidt and Hunter, 2014). Systematic reviews provide the next level of evidence. Every RCT included in a meta-analysis or systematic review must satisfy the CONSORT guidelines, with a minimum amount of attrition documented in a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) chart. Only meta-analyses and systematic reviews provide optimum-quality evidence for pharmaceutical product development and clinical practice. Ordinary reviews are considered overviews and contain a considerable amount of biased information; hence, they are never used for decision-making in health research (Moher et al., 2015).

Systematic review is a process of pooling data from various multicentric trials using predesignated criteria, followed by synthesis, analysis, and interpretation of results for rational decision-making. Owing to the quality of their study design, RCTs are preferred for inclusion in a systematic review exploring the effectiveness of an intervention (Jones et al., 2016). The PRISMA chart depicts the number of drop-outs at every stage of the study, namely enrollment, allocation, follow-up, and analysis (Fig. 10.6).

Figure 10.6. The PRISMA chart.

The Consolidated Standards of Reporting Trials (CONSORT) guidelines are used to evaluate the amount of bias or confounding in each RCT (Schulz et al., 2010). Biases in experimental studies are inappropriate techniques adopted during pharmaceutical product development that lead to erroneous results, thereby reducing the quality of evidence. The major biases identified during this criterion-based procedure are problems with random sequence generation (selection bias), allocation concealment (selection bias), blinding (performance bias and detection bias), incomplete outcome data (attrition bias), and selective reporting (reporting bias) (Schmidt and Hunter, 2014). The major biases in experimental studies and the strategies to minimize them are depicted in Fig. 10.7.

Figure 10.7. The major biases in experimental studies.

After applying the CONSORT criteria, all the available RCTs related to a specific intervention are arranged in descending order of their quality of evidence and the study with minimum bias tops the list. Only the studies with moderate to high-quality evidence are included in the systematic review (Calvert et al., 2013).

A meta-analysis is often conducted on the RCTs short-listed during the systematic review to generate evidence for decision-making in clinical practice. Here, complex biostatistical procedures are applied to the pooled data available from the selected RCTs (Ford et al., 2014). The strengths of association in the form of pooled odds ratio, pooled relative risk, and pooled hazard ratio are calculated and depicted on forest plots to evaluate the indicators for decision-making in evidence-based clinical practice. The position of the diamond, which represents the pooled strength of association in a forest plot, provides overall evidence of whether the pharmaceutical product is beneficial or harmful (Haber et al., 2015). This ultimately indicates whether a specific intervention really works for a designated outcome, as shown in Fig. 10.8.
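The pooled odds ratio behind the forest-plot diamond is commonly obtained by inverse-variance weighting of the per-study log odds ratios. A minimal fixed-effect sketch (the 2×2 tables below are invented; real meta-analyses would also examine heterogeneity and possibly use a random-effects model):

```python
import math

def pooled_odds_ratio(studies):
    """Fixed-effect (inverse-variance) meta-analysis of 2x2 tables.
    Each study is (a, b, c, d): exposed cases, exposed controls,
    unexposed cases, unexposed controls."""
    num = den = 0.0
    for a, b, c, d in studies:
        log_or = math.log((a * d) / (b * c))
        var = 1/a + 1/b + 1/c + 1/d      # Woolf variance of log OR
        w = 1 / var                      # weight = inverse variance
        num += w * log_or
        den += w
    pooled_log = num / den
    se = math.sqrt(1 / den)
    ci = (math.exp(pooled_log - 1.96 * se),
          math.exp(pooled_log + 1.96 * se))
    return math.exp(pooled_log), ci
```

The diamond in Fig. 10.8 is simply this pooled estimate with its CI: if the whole diamond lies on one side of OR = 1, the pooled evidence points to benefit or harm.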

Figure 10.8. The forest plot for pooled strength of association.

The Cochrane Collaboration, named after the British researcher Archie Cochrane, is famous for its huge database of systematic reviews and meta-analyses. The five main criteria that the Cochrane Collaboration uses for grading the quality of evidence of each experimental study are limitations of study design, inconsistency between results, indirectness of evidence, imprecision of effect measurement, and funnel plot summary for publication bias (Anderson et al., 2016). The funnel plot is a graphical representation of the reporting of positive and negative findings of a specific type of RCT conducted in different settings. It is used to assess publication bias, which is suspected when the funnel plot shows mostly positive findings, such as an intervention proving beneficial on most occasions (Finnerup et al., 2015) (Fig. 10.9).

Figure 10.9. The funnel plot for assessing publication bias.

The Cochrane Collaboration also conducts sensitivity analyses, random-effects modeling, heterogeneity tests, and subgroup analyses to assess whether the results are consistent and reliable. Thus, every Cochrane systematic review and meta-analysis maintains the highest level of research and provides the best quality of evidence for decision-making in public health policy management for the World Health Organization (Moher et al., 2015).

The following section presents an outline of the inferential statistics used in experimental studies:

a. For assessing before–after effect for categorical variables with one post-test:
   i. Parametric (normal distribution): McNemar's test and unadjusted relative risk (RR).
   ii. Nonparametric (skewed distribution): nonparametric McNemar's test and unadjusted relative risk (RR).
b. For assessing before–after effect for categorical variables with more than one post-test:
   i. Parametric (normal distribution): chi-square test, McNemar's test, and unadjusted relative risk (RR).
   ii. Nonparametric (skewed distribution): Kendall's W test, nonparametric McNemar's test, and unadjusted relative risk (RR).
c. For assessing before–after effect for continuous variables with one post-test:
   i. Parametric (normal distribution): paired t-test and unadjusted Pearson's correlation.
   ii. Nonparametric (skewed distribution): Wilcoxon signed-rank test and unadjusted Spearman's correlation.
d. For assessing before–after effect for continuous variables with more than one post-test:
   i. Parametric (normal distribution): repeated measures ANOVA, paired t-test, and unadjusted Pearson's correlation.
   ii. Nonparametric (skewed distribution): Friedman's ANOVA, Wilcoxon signed-rank test, and unadjusted Spearman's correlation.
e. For baseline risk and treatment effect or absolute risk difference: absolute risk reduction (ARR).
f. For reduction in the rate of the outcome in the treatment group relative to that in the control group: relative risk reduction (RRR).
g. For the number of patients to be treated with the experimental therapy to prevent one bad outcome: number needed to treat (NNT) analysis.
h. For treatment success: Kaplan–Meier survival analysis.
i. For treatment failure: Cox proportional hazards model (regression analysis) with estimation of the hazard ratio (HR) and adjusted relative risk.
j. For missing data and lost-to-follow-up cases: intention-to-treat (ITT) analysis.
k. For a trial of equal potential/efficacy for the intervention and the gold standard: noninferiority and equivalence analysis.
l. For evaluation of clinical significance: reliable change index (RCI) by Jacobson–Truax.
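Points e through g above reduce to simple arithmetic on event rates; a short sketch with an invented trial:

```python
def effect_measures(control_events, n_control, treat_events, n_treat):
    """ARR, RRR, and NNT as defined in points e-g above."""
    cer = control_events / n_control   # control event rate
    eer = treat_events / n_treat       # experimental event rate
    arr = cer - eer                    # absolute risk reduction
    rrr = arr / cer                    # relative risk reduction
    nnt = 1 / arr                      # number needed to treat
    return arr, rrr, nnt

# Hypothetical trial: 20/100 events in control, 10/100 under treatment
arr, rrr, nnt = effect_measures(20, 100, 10, 100)
# arr = 0.10, rrr = 0.50, nnt = 10
```

Note that NNT is conventionally rounded up to the next whole patient, and that it is undefined when ARR is zero.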

10.3.1 Nonclinical Trials

Nonclinical biostatistics involves the exploration of new opportunities for intersectoral collaboration and the sharing of new ideas for the development of advanced statistical instruments. As project work initiates in the research laboratories, nonclinical biostatisticians become the first persons to measure, quantify, and validate all the instruments. They often venture into unknown areas of new technology that have immense clinical implications (Yang, 2016).

Unfortunately, only a small number of nonclinical biostatisticians are found in the pharmaceutical industry. Hence, it is beneficial for nonclinical statisticians to form intersectoral as well as interinstitutional networks to promote mutual learning. The Chemistry, Manufacturing, and Control (CMC) Statistical Expert Team was formed in the United States in 1960 to focus on control issues related to the manufacturing of pharmaceutical products. Another team, formed in 2003, is known as the Pharmacogenomics Statistical Expert Team. Both teams comprise statisticians from major pharmaceutical institutions and are sanctioned by the Pharmaceutical Research and Manufacturers of America (PhRMA).

10.3.2 Clinical Trials

Clinical trials are conducted in various stages to explore the effect of new pharmaceutical products on the human biological system. The investigation begins with pharmacokinetic and pharmacodynamic research, followed by proof-of-concept and dose-ranging explorations (Ginn et al., 2013). At this early stage, studies of the drug's effect on QT/QTc prolongation and of drug interactions are also conducted. The objectives of these trials are to identify adverse drug reactions and to assess efficacy. If the early trials are satisfactory, the efficacy and safety range of the pharmaceutical product are meticulously investigated in a heterogeneous population throughout the confirmatory phase (Pocock, 2013).

The early study designs are kept more flexible for proof-of-concept and dose-ranging research. Their main objective is to learn about the various study populations, and about adaptations in terms of dose allocation and early termination, as efficiently as possible. Meticulous observation on a real-time basis through extensive biostatistical modeling can lead an investigator to multiple decision points at a fast pace. The main purpose of this development in the clinical phase is to generate information to facilitate decision-making. The developers can adopt innovative approaches if they can defend the reasoning for further development in an efficient manner (Simon and Simon, 2013).

In contrast to the early phases of clinical trials, the statistical approaches for the confirmatory phase are prudently planned and specified before the initiation of the study. They are also meticulously followed to provide credibility to the results. In consultation with a biostatistician, a pharmaceutical sponsor needs to agree in advance on the methodology for the study design, study population, and success criteria, and to decide on primary endpoints, criteria for handling missing data, and multiple comparisons (Rammsayer and Troche, 2014). The criteria for adaptation also need to be clearly specified in advance. However, the sponsor's access to interim results should be tightly controlled during the interim analysis to avoid manipulation. The risk-benefit and cost-effectiveness of new pharmaceutical products are assessed during the confirmatory phase. The sponsor gets a better opportunity to study adverse reactions with the increased number of individuals studied at this stage. This phase also overlaps with the life cycle management phase, where drug differentiation and new indications are explored (Cuffe et al., 2014). The knowledge about a new molecular or biologic entity is explored to support a specific objective at the confirmatory phase, which becomes the foundation for recommendations to medical practitioners. If a post-marketing trial is planned for the future, then additional studies will be conducted to fulfill the conditions for its approval (Maca et al., 2014).

URL: https://www.sciencedirect.com/science/article/pii/B9780128144213000105

Biostatistics Used for Clinical Investigation of Coronary Artery Disease

Chul Ahn, in Translational Research in Coronary Artery Disease, 2016

Statistical Analyses

In this section, statistical methods for data analysis are briefly described. Statistical methods described here have been widely used for the analysis of observational and experimental studies.

Analysis of Continuous Response Variables

Continuous response variables are analyzed using t-tests, analysis of variance (ANOVA), analysis of covariance (ANCOVA), or mixed models, to test the null hypothesis of equal means in different groups with and without adjusting for covariates. For all models, the data are tested to ensure that the underlying assumptions (i.e., normality and homoscedasticity) are met. If not, standard transformations (e.g., log, inverse, square root, and Box-Cox) are applied to the data in order to meet these assumptions. If data transformation is inadequate to meet the analysis assumptions, then a rank transformation of the data is performed and a one-way ANOVA on the rank-transformed response variables is analyzed and reported. Nonparametric alternatives such as the Wilcoxon signed-rank test, the Wilcoxon rank-sum test, the Kruskal–Wallis test, or permutation tests are used as appropriate. When covariates could affect a response variable in an ANOVA context, analysis of covariance (ANCOVA) is used to adjust for treatment effects. The underlying assumptions of the ANCOVA model (e.g., homogeneity of slopes across treatment groups) are tested. Standard regression criteria are used to assess the appropriateness of including particular covariates. When more than one covariate is included in the model, the possibility of multicollinearity is reduced through careful initial assessment of the correlations among all study covariates. Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others.
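The transformation step described above can be illustrated briefly; the sketch assumes SciPy is available, and the right-skewed values are invented to mimic a biomarker with a long upper tail.

```python
import math
from scipy import stats

# Right-skewed data (e.g., biomarker concentrations; values invented)
raw = [1.2, 1.5, 1.9, 2.1, 2.4, 3.0, 3.8, 5.2, 9.7, 21.4]

p_raw = stats.shapiro(raw).pvalue          # normality check on raw data
logged = [math.log(x) for x in raw]
p_log = stats.shapiro(logged).pvalue       # normality check after log
# The log transformation pulls in the long right tail, moving the data
# closer to normality before applying a t-test or ANOVA.
```

If neither a transformation nor a rank-based analysis resolves the violation, the nonparametric alternatives listed above are the fallback.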

Analysis of Categorical Response Variables

Where response variables are categorical, Pearson's chi-square test or Fisher's exact test is used to test for differences among treatment groups. The Cochran–Mantel–Haenszel test is used when we must stratify on additional variables. Logistic regression is used to model the relationship between a binary outcome variable and covariates. Logistic regression diagnostics are employed to ensure that the logistic model is appropriate. Polychotomous logistic regression can be applied for ordinal categorical variables under the proportional odds assumption. When the outcome is truly multinomial, generalized logit models can be applied. Poisson regression is used if the outcome is a count of events. Composite measures can be constructed, if necessary, to combine information among highly correlated covariates [12,13].
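The choice between the chi-square test and Fisher's exact test for a 2×2 table is usually driven by the expected cell counts; a short sketch assuming SciPy (the table is invented):

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: rows = treatment/control, cols = event/no event
table = [[12, 88], [24, 76]]

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

# Fisher's exact test is preferred when any expected cell count is small
# (a common rule of thumb: below 5); otherwise the chi-square test is used.
small_expected = (expected < 5).any()
```

Both tests address the same null hypothesis of no association; for large tables their p-values converge, while for sparse tables only the exact test remains valid.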

Analysis of Survival Data

The method of Kaplan and Meier is used to estimate the distributions of time-to-event outcomes, and these distributions among treatment groups are tested using the log-rank test. Multivariable proportional hazards models are used to test for treatment or prognostic effects in the presence of covariates. The proportional hazards assumption can be evaluated graphically and analytically, and regression diagnostics (e.g., martingale and Schoenfeld residuals) are examined to ensure that the models are appropriate [14]. Violations of the proportional hazards assumption can be addressed in one of the following ways: (i) Stratify by the levels of a categorical variable for which the proportionality assumption fails. (ii) Fit separate Cox models to different time intervals. (iii) Use the extended Cox model instead of the ordinary Cox model. The extended Cox model permits time-dependent covariates [15].
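The log-rank comparison mentioned above can be sketched without external libraries: at each event time the observed events in group 1 are compared with the number expected under the null hypothesis of equal hazards. A minimal two-group version (follow-up data in the usage below are invented):

```python
import math

def log_rank_test(times1, events1, times2, events2):
    """Two-group log-rank test (1 = event, 0 = censored).
    Returns the chi-square statistic (1 df) and its p-value."""
    pooled = ([(t, e, 0) for t, e in zip(times1, events1)] +
              [(t, e, 1) for t, e in zip(times2, events2)])
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    o_minus_e = var = 0.0
    for t in event_times:
        n1 = sum(1 for tt, _, g in pooled if tt >= t and g == 0)
        n2 = sum(1 for tt, _, g in pooled if tt >= t and g == 1)
        n = n1 + n2
        d = sum(1 for tt, e, _ in pooled if tt == t and e == 1)
        d1 = sum(1 for tt, e, g in pooled
                 if tt == t and e == 1 and g == 0)
        o_minus_e += d1 - d * n1 / n          # observed minus expected
        if n > 1:                             # hypergeometric variance
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    chi2 = o_minus_e ** 2 / var
    p = math.erfc(math.sqrt(chi2 / 2))        # survival fn of chi2(1 df)
    return chi2, p
```

Like the Kaplan–Meier estimator, the statistic uses censored subjects only through the at-risk counts, so no parametric assumption about the survival distribution is needed.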

Analysis of Longitudinal Data

Some observations will be measured repeatedly over time, and thus the ordinary independence assumption of observations no longer holds. In situations where one has prior knowledge about the measurement correlation structure, one can use linear mixed models for Gaussian outcomes and generalized linear mixed models (or nonlinear mixed models) for categorical outcomes [16]. In situations where measurement correlation structure is not plausible to predict, one can apply the generalized estimating equations (GEE) for either continuous or categorical outcomes [17,18]. This population average model allows potential misspecification of the measurement correlation structure, yet maintains the consistency of a treatment effect estimate. Missing data arise in almost all serious longitudinal data analyses. Missing data can be handled using the generalized-EM algorithm [19,20] and multiple imputation techniques [21].

Multiple Comparisons

Multiple comparison problems arise when investigators assess statistical significance in more than one test in a study. When more than one comparison is made, the chance of falsely detecting a nonexistent effect increases; therefore, a statistical adjustment needs to be made for multiple comparisons. One of the most basic and popular fixes to the multiple comparison problem is the Bonferroni correction, which adjusts the significance threshold for the total number of comparisons performed: the threshold for each test is obtained by dividing the overall significance level by the number of tests. For example, if 5 tests are performed, each is evaluated against 0.05/5 = 0.01; equivalently, each p-value can be multiplied by the number of tests and compared against 0.05. Although the Bonferroni correction reduces the number of false rejections, it also increases the number of cases in which the null hypothesis is not rejected when it should have been; that is, it severely reduces the power to detect an important effect. To overcome these shortcomings, investigators have proposed more sophisticated procedures that control the familywise error rate (the probability of having at least one false positive) without sacrificing as much power. A variety of such corrections exist that rely upon bootstrapping methods or permutation tests [22,23].
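The Bonferroni rule is a one-liner; the p-values below are invented to show that only results well below the divided threshold survive the correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which hypotheses are rejected after Bonferroni correction:
    each raw p-value is compared against alpha / m (m = number of tests).
    Equivalently, min(p * m, 1) may be compared against alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

rejected = bonferroni([0.003, 0.012, 0.04, 0.2, 0.5])
# threshold is 0.05 / 5 = 0.01, so only the first result survives
```

Note that 0.012 and 0.04 would be "significant" without correction; the correction trades those rejections for familywise error control.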

Sample Size and Power Calculations

Commercially available software such as nQuery Advisor and PASS can be used to compute the sample size and power for standard statistical problems. For the Cox proportional hazards model, simulations in SAS, or R can be used to compute the power given specified effect parameters and sample size. Sample size for repeated measurement data [24] can be estimated using the methods of GEE [25–27] and linear mixed models [28,29].
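For the simplest case, the textbook normal-approximation formula for comparing two means can be computed directly, without commercial software; this is a rough planning sketch, not a replacement for nQuery or PASS, and the example numbers are invented.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means:
        n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma / delta)^2
    where delta is the smallest difference worth detecting."""
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)   # round up to whole subjects

# Detecting a difference of 5 units with SD 10 at 80% power:
# 2 * (1.96 + 0.8416)^2 * (10/5)^2 ~ 62.8 -> 63 per group
```

In practice one would inflate this for anticipated drop-out and, for small samples, use the t-distribution rather than the normal approximation.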

URL: https://www.sciencedirect.com/science/article/pii/B978012802385300019X

Which type of design provides the highest degree of internal validity?

Of the three types of research (experimental, non-experimental, and quasi-experimental), experimental research usually has the highest internal validity.

Which of the following is an example of an experimental design?

A simple example of an experimental design is a clinical trial, where research participants are placed into control and treatment groups in order to determine the degree to which an intervention in the treatment group is effective.

Which type of research designs provide the best evidence for cause and effect relationships?

Experimental research. Using an experiment, the researcher attempts to establish a cause-and-effect relationship in a situation or phenomenon. It is a causal research design in which the researcher observes the impact of an independent variable on a dependent one.

Which of the following elements are required in an experimental research design?

The components of experimental design are control, independent and dependent variables, constant variables, random assignment, and manipulation.