Knowledge of the limitations in comparing radiographic data across clinical trials is necessary for accurate interpretation of data in the absence of direct, head-to-head trials. These limitations include differences in study design, patient characteristics, severity of disease, duration of follow-up, scoring method used, reader reliability, order in which radiographs are read, handling of missing data, and, finally, data presentation. Each of these limitations is discussed below.
Some trials use a parallel design with patients who have not previously received treatment for RA. An example is the etanercept trial, in which patients with early RA who were naive to MTX treatment were randomized to an MTX treatment arm or an etanercept treatment arm. Other trials included patients who had previously been treated and had experienced a partial response to a disease-modifying antirheumatic drug. An example of this study design is the Anti-TNF [tumor necrosis factor] Trial in Rheumatoid Arthritis with Concomitant Therapy (ATTRACT), in which patients who had previously been treated with MTX were randomized to groups treated with infliximab plus MTX or with MTX alone. Although both studies included treatment arms in which patients received MTX alone, the MTX arm in the etanercept trial is not comparable with the MTX arm in the ATTRACT trial, because of the different baseline characteristics and treatment histories of the patient populations. The radiographic response to MTX can be expected to be more pronounced (on a population level) in patients naive to this drug than in patients who have previously shown a partial response to MTX and then continued treatment with it. Therefore, it is important to be aware of the study designs before comparing data between trials.
Several prognostic factors are known to predict an unfavorable outcome with respect to structural joint damage. The most important of these are the presence of rheumatoid factor, evidence of joint erosion early in the course of the disease, and rapid disease progression. However, these predictors account for only a limited percentage of the variation between patients. Further, although these predictors are valid for groups of patients, they have little value when applied to individual patients. In addition, it is likely that other, currently unknown, factors will be associated with structural joint damage. By sampling from one patient population and randomizing the enrolled patients over two (or more) comparative trial arms, investigators can reasonably assume that both known and unknown factors are well balanced between the treatment arms. However, if trial arms from various studies are compared, this randomization has not taken place and many hidden differences between the patient populations may exist.
Although the prognostic factors listed above relate to the progression of structural joint damage, they do not necessarily predict response to therapy. Anderson et al. investigated the prognostic factors for response to treatment in 14 randomized clinical trials. The investigators concluded that patients whose RA is of longer duration do not respond as well to treatment as do patients with early disease. Moreover, female gender, previous treatment with a disease-modifying antirheumatic drug, poorer functional class, and higher disease activity affect the likelihood of patient response to treatment. According to Anderson et al., these factors should also be considered when interpreting data from clinical trials.
Baseline radiographic damage
Patients enrolled in clinical trials have different levels of structural joint damage. Depending on the eligibility criteria for a study, patients in one study may have substantially more baseline radiographic damage than patients in another study. Therefore, baseline radiographic damage represents another important obstacle to comparisons across clinical trials. Expressing this baseline damage relative to disease duration yields an estimated radiographic progression rate before study entry. Recently, in two different studies, this pre-study progression rate was shown to be an important predictor of treatment outcome [9, 10]. The ATTRACT trial was conducted in patients with established disease who had achieved a partial response to MTX. Patients were randomized to four active treatment arms with infliximab and a control arm with placebo, with all patients continuing on MTX. The Combinatietherapie Bij Reumatoide Artritis [Combination Therapy in Rheumatoid Arthritis] (COBRA) trial was performed in patients with early disease who were randomized either to a combination of high-dose prednisone (quickly tapered), MTX, and sulfasalazine, or to sulfasalazine alone. Although these trials were based on patients from different populations (patients with established disease versus those with early disease), both trials showed retardation of radiographic progression (infliximab compared with placebo on an MTX background and combination therapy compared with sulfasalazine alone, respectively). Further, it was evident that within each trial, patients with the highest radiographic progression rate at the onset of the trial benefited most from treatment [9, 10].
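As a rough illustration of how baseline damage can be expressed relative to disease duration, the sketch below divides a baseline score by the years since disease onset. It assumes that damage was zero at onset and accrued evenly, which is a simplification; the function name and the patient values are hypothetical and are not taken from any of the trials discussed.

```python
def baseline_progression_rate(baseline_score: float, disease_duration_years: float) -> float:
    """Estimated yearly progression before enrollment.

    Assumption (illustrative only): damage was zero at disease onset and
    accrued evenly, so rate = baseline score / disease duration.
    """
    if disease_duration_years <= 0:
        raise ValueError("disease duration must be positive")
    return baseline_score / disease_duration_years


# Two hypothetical patients with the same baseline score but different
# disease durations have very different estimated pre-study rates.
early_disease = baseline_progression_rate(20.0, 2.0)         # units per year
established_disease = baseline_progression_rate(20.0, 10.0)  # units per year

print(early_disease, established_disease)
```

The same baseline score therefore implies a five-fold difference in estimated prior progression, which is one reason baseline damage alone does not make two trial populations comparable.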
Duration of follow-up
At the group level, radiographic progression in RA is a linear process [11, 12]. However, progression rates and patterns differ markedly from patient to patient [13, 14]. Because radiographs show cumulative damage, differences in duration of follow-up are expected to have a large impact on the results. Moreover, because of differences in the patterns of progression, patient-to-patient variability cannot be easily corrected for by dividing progression scores by follow-up duration - for example, to calculate a monthly rate of progression. Therefore, it is important that the duration of follow-up be similar when radiographic progression is compared across trials.
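The following sketch illustrates why dividing by follow-up duration does not remove the effect of differing progression patterns. It uses two hypothetical patients who reach the same score at 24 months, one progressing evenly and one progressing only during the first 6 months; the trajectories and the helper name are assumptions made for illustration.

```python
def monthly_rate(trajectory, months):
    """Change in cumulative score from month 0 to `months`, per month."""
    return (trajectory[months] - trajectory[0]) / months


# Cumulative damage scores at months 0..24 for two hypothetical patients.
steady = [0.5 * m for m in range(25)]                 # even progression
early_burst = [min(2.0 * m, 12.0) for m in range(25)]  # all damage in first 6 months

for follow_up in (12, 24):
    print(follow_up, monthly_rate(steady, follow_up), monthly_rate(early_burst, follow_up))
```

With 24 months of follow-up both patients show the same monthly rate, but with 12 months of follow-up the second patient appears to progress twice as fast, so the computed rate depends on when the follow-up film was taken.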
Another important consideration when comparing radiographic scores across clinical studies is the scoring system used to assess structural joint damage. Several scoring methods exist to assess joint radiographs. These scoring methods evaluate different bony features, assess different joints, and have different scoring ranges. The most widely used methods are the Larsen and Sharp methods, along with their modifications [15–18]. The Larsen method uses a global grading system that mainly assesses erosive damage. The scoring range is from 0 to 200. The Sharp method assigns separate scores for erosions and joint space narrowing, which are combined to obtain a total score. The scoring range for the Sharp method is from 0 to 314 or 448, depending on which modified version of the method is used. Because of the differences in scoring ranges and in the abnormalities included in the assessment, a score of, for example, 5 in the Larsen method cannot be directly compared with a score obtained using the Sharp method. In some trials, scores obtained from hand radiographs are included, whereas in other studies, radiographs of both the hands and feet are used. Therefore, it is important to compare scores obtained on the same films: joints of either the hands or the feet, or a combination of both.
Clinical trials are typically designed to have one or two observers read and score each radiograph. The use of two observers reduces the variability in scoring and decreases the error of measurement. Interobserver reliability is high for the progression of scores. However, the absolute scores may differ significantly from reader to reader. In other words, each observer has his or her own reading level (and is consistent within his or her own readings), and this level may differ clearly from that of another observer, yet the progression seen is fairly consistent between observers. Within a trial, the analysis is based on progression scores assigned by the same observer or pair of observers. However, when absolute scores are compared across trials, another variable besides treatment, design, and patient characteristics is introduced: a different observer. This further complicates the comparison of scores across trials that used different readers.
Radiograph scoring sequence
Radiographs are read either in a known sequence or in random order. The order in which radiographs are scored has a significant effect on the measurement error of scores and on the ability of scores to capture disease progression accurately [19, 20]. Consequently, the order in which radiographs are read is an important factor when comparing results across trials. However, earlier published trials often did not present this information. Therefore, comparing results from new trials with those already published is often problematic.
Handling of missing data
Because radiographs show cumulative structural joint damage, missing radiographs become an important issue in the analysis of a progressive disease such as RA. Missing data cannot be replaced by a simple 'last-observation-carried-forward' procedure, as is often applied to other data, especially in the case of long-term trials. Sensitivity analyses to investigate the effect of the missing radiographs are warranted. The aim of each trial should be to have films of randomized patients at baseline and at follow-up, regardless of patient status. However, this often is not feasible, because patients, especially those who have withdrawn, may refuse to submit to follow-up films. Therefore, missing radiographs will continue to pose an obstacle, and data need to be analyzed in various ways to rule out an effect of selectively missing films.
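A minimal sketch of why last-observation-carried-forward is problematic for cumulative scores: carrying the last available score forward implicitly assumes that damage stopped accumulating when the patient withdrew. The patient data and the `locf` helper below are hypothetical and serve only to make that implicit assumption visible.

```python
from typing import List, Optional


def locf(scores: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing score (None) with the most recent observed score."""
    filled = []
    last = None
    for score in scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled


# Hypothetical patient: films at baseline and month 6, then withdrawal,
# so the month 12 and month 24 films are missing.
observed = [10.0, 14.0, None, None]
filled = locf(observed)

# Progression implied by LOCF over the whole trial: everything after
# month 6 is treated as zero progression.
implied_progression = filled[-1] - filled[0]
print(filled, implied_progression)
```

Here the 4 units of progression seen in the first 6 months are reported as the progression over the full trial, which would understate damage in any patient whose disease continued to progress after withdrawal.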
Clinical trials present radiographic data in a variety of ways, which makes the comparison of data across trials difficult. To minimize this obstacle, a roundtable conference was held to establish a minimum set of radiographic results that should be presented in each trial. Most data are presented on a group level, either by mean ± standard deviation (SD) or by median and interquartile range (IQR). Because of the skewed nature of radiographic data, the two ways of presenting the data provide important, but completely different, information. If a large proportion of patients in a group shows no or minimal progression and a few patients show a much higher rate of disease progression, those few patients weigh heavily on the mean ± SD of the overall group. The presentation of these data as a median with IQR provides information on the proportion of patients showing a specific progression. Both the mean ± SD and the median with IQR give important and complementary information on a group level.
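The difference between the two summaries can be seen in a small sketch using Python's standard `statistics` module. The progression scores below are hypothetical and deliberately skewed: most patients show no or minimal change and two show marked progression.

```python
import statistics

# Hypothetical change scores for ten patients (skewed distribution).
progression = [0, 0, 0, 0, 0, 0, 1, 1, 20, 30]

mean = statistics.mean(progression)
sd = statistics.stdev(progression)            # sample standard deviation
median = statistics.median(progression)
q1, _, q3 = statistics.quantiles(progression, n=4)  # quartile cut points

print(f"mean ± SD: {mean:.1f} ± {sd:.1f}")
print(f"median (IQR): {median} ({q1} to {q3})")
```

In this example the mean is driven by the two patients with marked progression, whereas the median shows that at least half of the group did not progress at all; neither summary alone conveys both facts.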
Analysis at the level of individual patients provides other important information. Because dichotomizing the data reduces statistical power, such analysis is advised as a secondary analysis. It is useful to know the percentage of patients who show progression above a certain clinically important level. However, the decision about what level to use as a cutoff is often arbitrary and can result in incomparability between trials. Although some trials simply define no progression as a score of zero, this definition does not take into account measurement error, which is always present. Others use arbitrary cutoff values or base the cutoff level on the smallest detectable difference (SDD; the smallest change that can be distinguished from measurement error), which is a trial-specific number [22, 23]. In the leflunomide trials published by Sharp et al., the cutoff value regarded as indicating progression in erosion was a score of 3, which resulted in progression of erosions being reported in 3% to 11% of treated patients (receiving leflunomide, sulfasalazine, or MTX), versus 12% to 17% of patients receiving placebo. In the ATTRACT trial, an SDD of the total score (in this case, 8.6) was selected as the cutoff value in reporting progression. In this trial, 6% of treated patients were reported to have shown progression of erosion, versus 31% of patients receiving placebo. In the etanercept trial, 0 was selected as the cutoff value for the erosion score. Applying this cutoff value, 28% of etanercept-treated patients were judged to have erosion that progressed, versus 40% of MTX-treated patients. Within each trial, these figures are meaningful and show that the active treatment was effective. However, there is little value in comparing progression between these trials, because all have assigned different cutoff values.
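To make the effect of the cutoff concrete, the sketch below applies three cutoff values to one hypothetical set of change scores. The cutoffs 0, 3, and 8.6 echo the values mentioned above, but applying them to a single score is a simplification, since the trials used them on different scores (erosion versus total). Treating progression as a change strictly greater than the cutoff is also an assumption, and the data are invented for illustration.

```python
def percent_progressors(changes, cutoff):
    """Percentage of patients whose change score exceeds the cutoff.

    Assumption: 'progression' means change strictly greater than the cutoff.
    """
    n_progressors = sum(1 for change in changes if change > cutoff)
    return 100.0 * n_progressors / len(changes)


# Hypothetical change scores for ten patients in a single treatment arm.
changes = [0, 0, 0, 1, 1, 2, 2, 4, 6, 10]

results = {cutoff: percent_progressors(changes, cutoff) for cutoff in (0, 3, 8.6)}
for cutoff, percent in results.items():
    print(f"cutoff {cutoff}: {percent:.0f}% of patients progressed")
```

The same ten patients yield 70%, 30%, or 10% progressors depending only on the cutoff chosen, which is why percentages of progressors reported with different cutoffs cannot be set side by side.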