During a clinical trial, we aim to measure a signal. This signal consists of the difference between a new treatment and some control. The latter could be a placebo arm, an arm with standard treatment, or a historical control. At the EORTC we prefer randomized and, if possible, blinded controls. An important part of the statistician’s job is to find a trial design that will detect a desired signal with high probability. This is what we mean when we say a trial is powered to measure an effect of a particular size. It is common sense that one needs more precise instruments to detect weak signals, and in the statistical world, we increase the precision of our instrument by increasing the sample size. In clinical studies, more patients mean more events.
Below we see the patient numbers needed to measure continuous signals of varying strength, as indicated by the effect size (mean difference between two groups, divided by standard deviation).
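The relationship between effect size and sample size can be sketched with the usual normal-approximation formula, n per arm = 2·((z₁₋α/₂ + z₁₋β)/d)². This is a simplified z-test approximation for illustration, not the exact calculation a trial statistician would use:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided, two-sample z-test.
    effect_size = mean difference between groups / standard deviation."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

weak = n_per_arm(0.2)    # weak signal -> many patients per arm
strong = n_per_arm(0.8)  # strong signal -> few patients per arm
```

With 80 percent power at a two-sided 5 percent level, an effect size of 0.8 needs roughly 25 patients per arm, while an effect size of 0.2 needs roughly 393.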
The strength of survival endpoints can also be expressed with the hazard ratio, and here the sample size is driven by the number of events (e.g. the number of deaths). As you can see, the smaller the effect, the more patients are needed to detect the signal. This presents a dilemma, because sometimes large studies are launched to detect small, possibly irrelevant effects. The number of patients in a clinical trial is a major cost factor, so particular attention is paid to this aspect: we don’t want to include too few patients, because we would then run the risk of not detecting the desired signal; the study would be underpowered. On the other hand, we don’t want to include any more patients than are needed, because that would be a waste of resources and unethical as well.
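For survival endpoints, the required number of events can be sketched with Schoenfeld’s well-known approximation. The sketch below assumes 1:1 randomization and a two-sided log-rank test:

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld approximation: number of events needed to detect
    `hazard_ratio` with a log-rank test, 1:1 allocation,
    two-sided type I error `alpha`."""
    z = NormalDist().inv_cdf
    return ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2
                / log(hazard_ratio) ** 2)

moderate = required_events(0.7)  # moderate effect -> many events
strong = required_events(0.5)    # strong effect -> few events
```

At 80 percent power and two-sided alpha of 0.05, a hazard ratio of 0.7 requires roughly 247 events, whereas a hazard ratio of 0.5 requires only about 66.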
“Classic” decision rule
The classical way to run a study is to specify the hypotheses, sample size, test statistic, alpha (type I error), and beta (type II error) upfront, before trial start. Once the trial is concluded, the statistical analysis plan provides a clear decision rule as to whether the treatment under consideration has the sought-after effect. The key point here is that a decision is only made once the whole sample has been completed. In order to control the type I error, we don’t take sneak peeks.
As an example of the “classic” decision rule, think of a simple clinical design to test a new drug against cancer. We assign one half of the patients to the new drug and the other half to placebo. We would then compare the number of patients in both groups who show a response (or reach the endpoint of the study) and base our decision as to whether or not the new drug is active on this count.
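In such a classic two-arm trial, the decision rule can boil down to a single test on the final response counts. A minimal sketch with made-up numbers, using the standard normal approximation for two proportions:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p(responders_a, n_a, responders_b, n_b):
    """Two-sided p-value comparing the response rates of two arms
    (pooled normal approximation; adequate for reasonably large arms)."""
    p_a, p_b = responders_a / n_a, responders_b / n_b
    pooled = (responders_a + responders_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# hypothetical counts: 45/100 responses on the new drug vs 30/100 on placebo
p_value = two_proportion_p(45, 100, 30, 100)
```

With these hypothetical counts the p-value falls below 0.05, so the pre-specified decision rule would declare the drug active; with 30/100 in both arms it obviously would not.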
The advent of sequential methods
This changed in 1943 when a report from Abraham Wald, then at Columbia University, was presented to the National Defense Research Committee. (Wald was also famous for his work on aircraft survivability; in survival analysis we encounter manifold variants of this problem.)
Wald’s paper concerned quality control in an industrial production process. In such a process, the products come out one by one at the end of a production line. For quality control purposes, we are interested in the percentage of defective units. With a classical approach, we would have to wait until a certain pre-specified number of units had been produced before checking them. We would then count the number of defective units in the batch and base our statistics on these results.
But how big should the batch size be? This depends on the true percentage of defective units. If it is small, the batch size has to be large, and vice versa. In an industrial process this can be disadvantageous. Consider an extreme example to illustrate this point. Say that due to some major mistake, almost every unit is defective. The quality control department, however, decided to take a batch of 1000 units and base the decision on this batch size. If 100 units are produced per day, a decision would only be made after ten days, and by then the factory would have produced a large number of defective parts.
In a sequential test, each unit is examined once it comes out of the production line. In the above example, if the first 20 units were defective, we could have made a confident decision that something was very wrong in the production process.
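Wald’s sequential probability ratio test fits in a few lines. The setup below is made up for illustration: we test a tolerable 5 percent defect rate against an alarming 30 percent rate, inspecting units one by one:

```python
from math import log

def sprt(observations, p0=0.05, p1=0.30, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a defect rate.
    H0: defect probability p0; H1: defect probability p1.
    Returns (decision, number of units inspected)."""
    upper = log((1 - beta) / alpha)   # cross -> accept H1 (process bad)
    lower = log(beta / (1 - alpha))   # cross -> accept H0 (process fine)
    llr = 0.0                         # running log-likelihood ratio
    for i, defective in enumerate(observations, start=1):
        llr += log(p1 / p0) if defective else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "continue", len(observations)

# a run of consecutive defectives: the test decides almost immediately
decision, inspected = sprt([1] * 20)
```

With these (hypothetical) error rates, a streak of defective units triggers a decision after only two inspections, while a clean run needs about ten units before the process is declared fine. No fixed batch of 1000 is ever needed.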
A variation of this idea comes to clinical trials with group sequential designs, where groups of patients, rather than every single patient, are examined. This is the idea behind interim analyses: if a new drug is extremely active or inactive, we may see this already at an interim analysis and do not need to wait until all patients are treated. In this way, adaptations to the trial conduct can be made early, most prominently in cases of early stopping for futility or for activity of the new regimen. This conduct is ethical because it prevents patients from being treated with inactive drugs, and, moreover, it can save resources.
Bayesian methods are appealing to many scientists and this is why: in daily life we base our decisions on prior experience. We see the sun rise on our first morning, we see it rise on the next day, and, finally, we expect that the sun will rise every morning. Compare this to the view of the Bayesian statistician, who before dawn would say, “Sunrise in the morning has a high prior probability”.
The nature of knowledge in the inductive sciences, including medicine, is of that kind: we make observations and induce models from them. The goal of this process is not something like absolute truth. We only look for a model that best explains what we see. To find a counterexample does not necessarily render the model invalid. Only a new model that has a higher potential to explain observations and experimental results will replace an old model.
Mathematics, however, does not work that way: it makes deductions from axioms, and a single counterexample disproves an assertion. (N.B.: interestingly, before a court of justice, a single example can establish a precedent and can thus prove an assertion.)
Definition and instances of adaptive designs
An adaptive design is one that allows modifications to the trial and/or statistical procedures of the trial after its initiation without undermining its validity and integrity. The purpose is to make clinical trials more flexible, efficient, and fast. What follows are several possible adaptive clinical trial designs.
Seamless Phase II/III Designs
Let’s say our task is to find out which of three newly available treatments is the best. The classical approach would be the following: during the phase II part, we would try to find out which one of the three new treatments is the best and how it performs in comparison to a control, a comparison that has low power during phase II. Once the phase II trial is complete, and some time after the results are available, a phase III trial would be planned to compare the best treatment from phase II to the control. This time the comparison is adequately powered.
In a seamless phase II/III design, we would not interrupt accrual between the phases. Instead, in this instance, there would be two interim analyses, and at each of them a trial arm with low efficacy is dropped. The patients of the phase II part would also contribute to the results of the phase III part. In theory, the whole process takes much less time and requires fewer patients.
The reality, however, is sometimes a bit sobering, because there is the issue of overrun. This refers to the fact that patients continue to be accrued while the interim analysis is being performed. In a rapidly accruing trial, all patients might already have been accrued before the interim results are available, and by that time it is too late for an adaptation.
Much is learned from the results of a Phase II trial, which has great impact on how a Phase III trial would be conducted. Clearly, the seamless design has a disadvantage here: It has to be planned as one entity and knowledge gained from the Phase II part can only have limited use for the Phase III part. In that sense the gap between Phase II and III in the classic approach provides the necessary time and flexibility to fully account for the Phase II results when planning Phase III.
Sample size re-estimation
Based on interim results, the sample size can be increased so as to increase the power to detect a desired treatment effect. This design can keep the initial budget of a trial low. If it is decided to increase the sample size, the budget has to be increased accordingly: adaptive designs need adaptive budgets. There is a possible source of conflict here: the decision to increase the sample size usually comes from an independent data monitoring committee (IDMC), whereas the additional funding is provided by other sources (e.g. a company).
As a rule we would consider increasing the sample size if the conditional power at the interim analysis was between 30 and 90 percent. If it is lower, there is little hope that the trial would turn out positive even with a bigger sample size. If it is higher, the chances are good that the trial would turn out positive even without increasing the sample size.
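One common way to compute conditional power at an interim analysis is the B-value (Brownian-motion) formulation, assuming the currently observed trend continues to the end of the trial. A sketch, assuming a one-sided final test at level 0.025:

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_interim, info_frac, alpha=0.025):
    """Conditional power under the current trend: the probability that
    the final Z-statistic crosses the one-sided critical value, given
    the interim statistic and the information fraction (0 < t < 1)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)
    b = z_interim * sqrt(info_frac)     # B-value B(t) = Z(t) * sqrt(t)
    drift = b / info_frac               # trend estimate: drift = B(t)/t
    remaining_mean = drift * (1 - info_frac)
    return 1 - nd.cdf((z_crit - b - remaining_mean) / sqrt(1 - info_frac))

# halfway through the trial (info_frac = 0.5), interim Z = 1.5
cp = conditional_power(1.5, 0.5)
```

Halfway through, an interim Z of 1.5 gives a conditional power of roughly 59 percent, inside the 30 to 90 percent zone where a sample size increase would be considered; an interim Z of 0.3 falls below 30 percent (little hope), and a Z of 2.5 exceeds 90 percent (no increase needed).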
One important aspect of trial design is the decision on the magnitude of a clinically relevant treatment effect. In this sense, sample size re-estimation is a data driven reconsideration of clinical relevance which is conceptually problematic.
Adaptive randomization
The randomization schedule is altered to increase the probability of success, i.e., patients are randomized to the more effective arms with a higher probability.
Early stopping, group sequential design
In a group sequential design, the trial can be stopped at an interim analysis if the treatment under consideration turns out to be ineffective or more effective than anticipated.
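Stopping early is only valid with adjusted critical values: naively reusing the fixed-sample cutoff of 1.96 at every look inflates the type I error well above 5 percent. A small Monte Carlo sketch illustrates this, using Pocock’s constant (about 2.289 for three equally spaced looks at two-sided alpha = 0.05):

```python
import random
from math import sqrt

def any_look_rejection_rate(critical, looks=3, sims=20000, seed=1):
    """Monte Carlo under the null hypothesis: probability that the
    standardized statistic |Z_k| exceeds `critical` at ANY of `looks`
    equally spaced interim analyses."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        s = 0.0
        for k in range(1, looks + 1):
            s += rng.gauss(0, 1)            # one group's worth of data
            if abs(s / sqrt(k)) > critical: # peek at the running statistic
                rejections += 1
                break
    return rejections / sims

naive = any_look_rejection_rate(1.96)    # fixed-sample cutoff at each look
pocock = any_look_rejection_rate(2.289)  # Pocock boundary for 3 looks
```

The naive rule rejects a true null roughly 11 percent of the time, while the Pocock boundary keeps the overall type I error close to the nominal 5 percent.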
Drop the Loser design
Also known as the Pick the Winner design, this is typically applicable to phase II designs with several arms and two stages. Following the first stage, ineffective arms are dropped and effective arms are kept.
Biomarker adaptive design
Modifications are made based on the response of various biomarkers associated with the disease under consideration. This design can be helpful in selecting the right patient population, and, by extension, help in developing “personalized medicine”. It should be kept in mind that there is a gap between identifying biomarkers associated with clinical outcomes and establishing a predictive model between relevant biomarkers and clinical outcomes in clinical development.
Up-ramping of patient accrual
Because of accrual overrun, adaptation works better if accrual is slow. This is also why, in a slowly accruing clinical trial, accrual does not need to be suspended while an adaptation is decided. Once an adaptation is made, accrual can then be ramped up.
Adaptive dose finding
This design is used in phase I clinical trials to identify a maximum tolerable and a minimal effective dose. Dose levels can be adapted during the trial. If the doses are fixed in advance, there is a risk that the dose levels are too close to each other to show any difference in toxicity. An example of an adaptive approach is the continual reassessment method, where the dose level may be adjusted after every single patient.
Adaptive methods, including designs with interim analyses and early stopping rules, can be advantageous. These potential advantages were illustrated by the example of industrial quality control. Adaptive designs need good, sometimes extensive, logistics.
Some adaptive designs (e.g. group sequential designs) proved to be very useful and are already a part of the standard repertoire in clinical cancer research.
A signal does not get more precise if we use adaptive designs, and if we want to achieve more precise information we still need to make an investment and increase the sample size.
Adaptive designs have a problem with bias that is sometimes uncontrollable, particularly if interim results are leaked. Adaptation is sometimes merely fishing for significant results.
The most important techniques to avoid bias are randomization and blinding. Adaptive designs have to compromise on both, and consequently they compromise on bias control.
We run clinical trials and ask for personalized medicine; this is a conflict, isn’t it? In a clinical trial we purport to treat homogeneous populations; this is what we stipulate so that we can say treatment XY achieves a 70 percent response rate. But personalized medicine says a response rate is always, and only, the response rate of an individual patient. The response of another patient will be different. Adaptive designs aim to provide a link between these opposing concepts.
In subsequent EORTC Newsletter articles, the pros and cons of adaptive designs will be discussed in more detail.
Jan C. Schuller
EORTC Statistics Department