Reliability of computerized image analysis for the evaluation of serial synovial biopsies in randomized controlled trials in rheumatoid arthritis

Analysis of biomarkers in synovial tissue is increasingly used in the evaluation of new targeted therapies for patients with rheumatoid arthritis (RA). This study determined the intrarater and inter-rater reliability of digital image analysis (DIA) of synovial biopsies from RA patients participating in clinical trials. Arthroscopic synovial biopsies were obtained before and after treatment from 19 RA patients participating in a randomized controlled trial with prednisolone. Immunohistochemistry was used to detect CD3+ T cells, CD38+ plasma cells and CD68+ macrophages. The mean change in positive cells per square millimetre for each marker was determined by different operators and at different times using DIA. Nonparametric tests were used to determine differences between observers and assessments, and to determine changes after treatment. The intraclass correlations (ICCs) were calculated to determine the intrarater and inter-rater reliability. Intrarater ICCs showed good reliability for measuring changes in T lymphocytes (R = 0.87), plasma cells (R = 0.62) and macrophages (R = 0.73). Analysis by Bland–Altman plots showed no systematic differences between measurements. The smallest detectable changes were calculated and their discriminatory power revealed good response in the prednisolone group compared with the placebo group. Similarly, inter-rater ICCs also revealed good reliability for measuring T lymphocytes (R = 0.68), plasma cells (R = 0.69) and macrophages (R = 0.72). All measurements identified the same cell types as changing significantly in the treated patients compared with the placebo group. The measurement of change in total positive cell numbers in synovial tissue can be determined reproducibly for various cell types by DIA in RA clinical trials.


Introduction
Rheumatoid arthritis (RA) is characterized by chronic and symmetric inflammation of synovial joints [1,2]. Although the aetiology of RA is still unknown, it is thought of as an autoimmune disease with the synovial tissue (ST) being its primary target. The microscopic appearance of RA ST includes marked intimal lining layer hyperplasia due to increased numbers of fibroblast-like synoviocytes and intimal macrophages, and accumulation of macrophages, T cells, B cells, plasma cells, dendritic cells, mast cells, natural killer cells and neutrophils in the synovial sublining layer [3]. Developments in synovial biopsy techniques, especially arthroscopy, have resulted in easier access to human ST. It is now possible to select ST from many sites within large and small joints, even in the earliest phases of disease, enhancing studies of aetiology, prognosis and response to treatment [4].
Analysis of biomarkers in ST is increasingly being used in the evaluation of new targeted therapies in RA patients [5]. Numerous studies have suggested consistent associations between rapidity and magnitude of both clinical and immunohistological responses. It was shown that, especially within the ST, the number of infiltrating sublining macrophages can be used as a biomarker of clinical efficacy in relatively small studies of short duration [6,7]. Therefore, change in synovial sublining macrophages may be used as a biomarker for the evaluation of novel antirheumatic therapies. In addition to screening for possible efficacy, this approach provides insight into the mechanism of action of treatment.
Within this setting, reliable and validated methods for studying the ST are pivotal. The use of computerized or digital image analysis (DIA) has greatly facilitated the evaluation of ST. The major advantage of DIA is standardization of image acquisition and processing, minimizing variance, and the ability to quantify the actual stained area together with staining intensity in a time-efficient manner [8,9]. This allows analysis of large numbers of stained sections. Strong correlations were observed between DIA, semiquantitative scoring and manual counting for analysis of ST cellular markers, cytokines and adhesion molecules [10,11]. Although the reproducibility of measuring cytokine and cell adhesion molecule staining by DIA was reported to be within 10% [8], no formal studies investigating intrarater and inter-rater variability have yet been reported. Therefore, we designed a study to determine the intrarater and inter-rater reliability of this approach for the analysis of synovial biopsies from RA patients participating in clinical trials.

Patients and samples
Arthroscopic synovial biopsies were obtained before and 2 weeks after treatment in 19 patients who participated in a double-blind, placebo-controlled, single-centre study with prednisolone, as reported earlier [6]. All patients included had RA according to the 1987 criteria proposed by the American College of Rheumatology [12] and were on a stable regimen of disease-modifying antirheumatic drugs (methotrexate, sulphasalazine, hydroxychloroquine or leflunomide, or a combination of these) for at least 28 days before inclusion in the study. Ten out of the 19 patients received prednisolone and nine received placebo treatment. Needle arthroscopy of an actively inflamed joint (knee, ankle, or wrist) was performed under local anaesthesia in all patients before treatment and in the same joint after treatment. The procedures for needle arthroscopy were performed as described previously in detail [13,14]. During each procedure, biopsies were taken from six or more sites throughout the joint to minimize sampling error [15,16]. These specimens were directly collected en bloc in a mold embedded in Tissue Tek OCT (Miles Diagnostics, Elkhart, IN, USA) and subsequently snap frozen by immersion in methylbutane (-80°C). The frozen blocks were stored in liquid nitrogen until they were processed. The study was approved by the Medical Ethics Committee of the Academic Medical Center, Amsterdam, The Netherlands, and all patients provided informed consent before the start of the study.

Immunohistochemical analysis
From each tissue sample, consisting of six different biopsy samples, serial sections were cut with a cryostat (5 µm) and stained with the following antibodies to analyze the major cell populations in the synovium: anti-CD68 (EMB11; Dako, Glostrup, Denmark), anti-CD38 (HB-7; Becton Dickinson) and anti-CD3 (SK7; Becton Dickinson, Erembodegem, Belgium). Sections with nonassessable tissue, defined as the absence of an intimal lining layer, were not analyzed. For control sections, the primary antibodies were omitted or irrelevant antibodies were applied. Staining for cellular markers was performed using a three-step immunoperoxidase method, as was previously described [17].

Digital image analysis
After immunohistochemical staining, all coded sections were randomly analyzed by computer-assisted image analysis (Fig. 1). For all markers, 18 high-power fields were analyzed. The images of the high-power fields were analyzed using the Qwin analysis system (Leica, Cambridge, UK), as described previously in detail [10,11].
For determination of intrarater reliability, one observer performed the acquisition and analysis twice with an interval of 4 weeks in between (OB1 t0 and OB1 t1, respectively). To determine the inter-rater reliability, acquisition of images and analysis were performed independently by two other experienced observers (OB2 and OB3). All observers were blinded regarding clinical data. For each measurement all observers independently set their own threshold levels regarding the detection of stained antigen, nuclear staining and background staining. After the analysis, all observers independently calculated the mean change in the total number of positive cells per square millimetre of ST for each marker.

Statistical analysis
The nonparametric Friedman test and the Wilcoxon signed rank test were used to identify differences in the detection of the change in positive cell numbers per marker in the whole patient group, between observers and between assessments. The intrarater and inter-rater reliability was quantified by means of the intraclass correlation coefficient (ICC) of agreement [18]. In addition, scatter plots, in accordance with methods reported by Bland and Altman [19], were constructed to show differences in the change in positive cells between two measurements from one observer. The smallest detectable changes (SDCs), representing the smallest change in scores that can be deemed to be a 'real' change [20], were calculated from the intra-observer variances and used to evaluate their discriminatory power. The nonparametric Mann-Whitney U-test was used to determine whether each analysis detected differences in the change of positive cell numbers when the placebo group was compared with the prednisolone-treated group.
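The ICC of agreement can be illustrated with a short sketch. The text does not spell out the formula, so the example assumes the common two-way random-effects definition, in which the between-patient variance is divided by the sum of the between-patient variance and the within-patient variance (rater plus residual). The function name `icc_agreement` and the values in `example_changes` are hypothetical and are not the study data.

```python
from statistics import mean


def icc_agreement(scores):
    """Single-rater and average-of-raters ICC of agreement.

    scores: one row per patient, one column per rater or reading.
    Assumption: two-way random-effects model estimated from ANOVA mean squares.
    """
    n = len(scores)      # number of patients
    k = len(scores[0])   # number of raters or readings
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Mean squares for patients (rows), raters (columns) and residual error.
    ms_rows = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    ms_cols = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Variance components: between patients, and within patients (rater + error).
    var_between = (ms_rows - ms_err) / k
    var_within = (ms_cols - ms_err) / n + ms_err

    return {
        "var_between": var_between,
        "var_within": var_within,
        "icc_single": var_between / (var_between + var_within),
        "icc_average": var_between / (var_between + var_within / k),
    }


# Hypothetical change scores (cells per mm^2) for five patients, two readings each.
example_changes = [
    [-410.0, -380.0],
    [-120.0, -200.0],
    [35.0, -10.0],
    [-640.0, -590.0],
    [-90.0, -150.0],
]
print(icc_agreement(example_changes))
```

The same function covers both settings described here: two columns for the repeated readings of one observer, or three columns for the three observers.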

Intrarater reliability
The mean numbers of CD3+ T lymphocytes, CD38+ plasma cells and CD68+ sublining macrophages before and after intervention for two analyses by the same observer at different time points (OB1 t0 and OB1 t1) are shown in Table 1. There were no significant differences in the mean change in T cells, plasma cells and macrophages in the total population between the two measurements.
The overall correlations between the first and second analysis by the same observer were good. For the measurement of the change in CD3+ T lymphocytes, CD38+ plasma cells and CD68+ macrophages, the single rater and average of rater ICCs were calculated and are shown in Table 2. The relations between the two measurements by the single observer are plotted in Fig. 2. There were no systematic differences between the two measurements for each marker, but the variation was rather large. An analysis of the between-patient variances and within-patient variances is provided in Table 2.
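A Bland-Altman comparison of two readings can be sketched as follows. The example assumes the conventional construction: the difference between the two readings is plotted against their mean, the mean difference (bias) indicates any systematic offset, and the limits of agreement are taken as bias ± 1.96 standard deviations of the differences. The function name and the input values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return plot coordinates, bias and 95% limits of agreement for paired readings."""
    diffs = [a - b for a, b in zip(first, second)]
    means = [(a + b) / 2 for a, b in zip(first, second)]
    bias = mean(diffs)   # a value near zero suggests no systematic difference
    sd = stdev(diffs)    # spread of the differences between readings
    return {
        "points": list(zip(means, diffs)),  # (x, y) pairs for the scatter plot
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }


# Hypothetical change scores (cells per mm^2) from one observer at two time points.
reading_t0 = [-410.0, -120.0, 35.0, -640.0, -90.0]
reading_t1 = [-380.0, -200.0, -10.0, -590.0, -150.0]
result = bland_altman(reading_t0, reading_t1)
print(result["bias"], result["lower"], result["upper"])
```

A small bias combined with wide limits corresponds to the pattern described above: no systematic difference, but considerable variation between readings.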
The SDC, averaged for the number of readings, for CD3+ lymphocytes was 182, for CD38+ plasma cells it was 128, and for CD68+ macrophages it was 306. When these estimates were used to identify those patients who responded to the treatment (i.e. had a reduction in positive cell numbers exceeding the SDC), for CD3+ lymphocytes four of the 10 patients in the prednisolone group responded versus none of the nine patients in the placebo group; for CD38+ plasma cells four of the 10 patients in the prednisolone group responded versus one of the nine patients in the placebo group; and for CD68+ macrophages seven out of the 10 patients in the prednisolone group responded versus none of the nine patients in the placebo group.
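The responder classification can be sketched in a few lines. The text does not give the SDC formula, so the example assumes a widely used definition, SDC = 1.96 × √2 × SEM, where the standard error of measurement (SEM) is the square root of the within-patient variance, divided by √k when the score is averaged over k readings. Whether the study used exactly this form is an assumption. The function names and the change values are hypothetical; only the threshold of 306 for CD68+ macrophages comes from the text.

```python
import math


def smallest_detectable_change(var_within, n_readings=1):
    """Assumed definition: 1.96 * sqrt(2) * SEM, with SEM = sqrt(within-patient variance).

    Dividing by sqrt(n_readings) gives the SDC for a score averaged over that many readings.
    """
    sem = math.sqrt(var_within)
    return 1.96 * math.sqrt(2.0) * sem / math.sqrt(n_readings)


def count_responders(changes, sdc):
    """Count patients whose reduction in positive cells exceeds the SDC.

    changes: after-minus-before values in cells per mm^2 (negative means a reduction).
    """
    return sum(1 for change in changes if -change > sdc)


# SDC reported in the text for CD68+ macrophages; the change values are hypothetical.
SDC_CD68 = 306
hypothetical_changes = [-520.0, -310.0, -120.0, 45.0, -305.0]
print(count_responders(hypothetical_changes, SDC_CD68))
```

Applying `count_responders` separately to the prednisolone and placebo groups would give responder counts of the kind reported above.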
To determine whether the same observer identified the same differences in the synovial infiltrate after treatment at different time points, we determined whether there were significant differences in the change in T cells, plasma cells and macrophages between the placebo group and the prednisolone-treated group for each measurement. At both time points there was, on average, a significant reduction in the number of CD3+ lymphocytes and CD68+ macrophages in the prednisolone-treated patients as compared with placebo (Table 1), whereas on average there were no significant changes in the number of CD38+ plasma cells.
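The comparison between the placebo and prednisolone-treated groups can be illustrated with a Mann-Whitney U-test on the change scores. The text does not state whether an exact or an asymptotic p-value was used, so the sketch assumes an exact two-sided p-value obtained by enumerating every way of assigning the pooled ranks to the two groups, which is feasible for groups of this size. The function names and data are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_exact(x, y):
    """Return the U statistic for x and an exact two-sided permutation p-value."""
    ranks = midranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    observed = sum(ranks[:n1])              # rank sum of the first group
    expected = n1 * (n1 + n2 + 1) / 2       # its mean under the null hypothesis
    u = observed - n1 * (n1 + 1) / 2
    extreme = total = 0
    # Enumerate all possible group assignments of the pooled ranks.
    for idx in combinations(range(n1 + n2), n1):
        total += 1
        deviation = abs(sum(ranks[i] for i in idx) - expected)
        if deviation >= abs(observed - expected) - 1e-9:
            extreme += 1
    return u, extreme / total


# Hypothetical change scores (cells per mm^2), not the study data.
prednisolone = [-520.0, -310.0, -450.0, -80.0, -600.0]
placebo = [40.0, -60.0, 15.0, -20.0]
print(mann_whitney_exact(prednisolone, placebo))
```

Running the same function once per observer and per reading mirrors the check described above, namely whether each analysis detects the same treatment effect.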

Inter-rater reliability
The mean numbers of T cells, plasma cells and macrophages before and after intervention measured by the other two observers (OB2 and OB3) are also shown in Table 1. There were no statistically significant differences in the mean change in positive cells between the analyses by the three observers (OB1 t0, OB2 and OB3).
When the overall correlations between the analyses of the three observers were calculated, the ICCs (single and average of raters) appeared to be good for CD3+ lymphocytes, CD38+ plasma cells and CD68+ macrophages (Table 2). An analysis of the between-patient variances and the within-patient variances is also provided in Table 2.
To determine whether all three observers identified the same differences in the synovial infiltrate after treatment, we determined whether there were significant differences in the change in T cells, plasma cells and macrophages between the placebo group and the prednisolone-treated group for each measurement. The measurements by all three observers showed, on average, a significant reduction in the number of CD3+ lymphocytes and CD68+ macrophages in the prednisolone-treated patients versus placebo (Table 1), whereas, on average, there were no significant changes in the number of CD38+ plasma cells.

Discussion
This study investigated the intra- and inter-observer reliability of assessment of the change in ST T cells, plasma cells, and macrophages quantified by DIA. Tissue samples were obtained from RA patients participating in a single-centre, placebo-controlled clinical trial with prednisolone. There were no significant differences in measurement of the mean change in T cells, plasma cells and macrophages between the three observers, or for different measurements by one observer. ICCs revealed good agreement between measurements. All observers and all measurements identified, on average, significant reductions in T cells and macrophages but not in plasma cells in the prednisolone group compared with placebo.
Figure 1. Acquisition, analysis and output for a digital image analysis system. Acquisition and analysis of immunohistochemical staining of CD3+ T lymphocytes in synovial tissue using a digital image analysis system [10]. Three different areas from each biopsy sample, which are representative of the whole tissue section, are selected. During analysis, staining thresholds are set for primary staining (i.e. CD3+ T lymphocytes), nuclear staining and background staining. The output is generated in a spreadsheet as the total number of positive cells per square millimetre of synovial tissue.
It can be anticipated that there will be an upsurge in randomized controlled trials investigating novel biological agents and small molecules in terms of their safety and efficacy. Thus, sensitive, validated and reliable measurements to screen for potential efficacy in an early phase of drug development are clearly needed. Clinical outcome measures have historically been used as primary end-points, but their reliability may be limited in small proof-of-principle studies. For clinical measurements such as the tender and swollen joint count, ICCs have been reported to vary between 0.15 and 0.85 for inter-rater variability and between 0.67 and 0.95 for intrarater variability [21]. Radiographic measurements, with the use of conventional X-ray films, show good reliability in most studies but they are not useful in short-term clinical trials [21]. The use of magnetic resonance images is promising, with acceptable inter-rater ICC for global synovitis scores and bone erosions, although optimal scoring systems are yet to be developed [22].
In light of the need to screen various compounds for potential efficacy in small numbers of patients and because of recent technical developments, we believe that our thinking about clinical trials is about to change dramatically. Clinical studies conducted during early phases of drug development will increasingly consist of small trials with a high density of biological data [23]. Consistent with this notion, serial ST analysis with evaluation of biomarkers was recently included in several randomized clinical trials of both disease-modifying anti-rheumatic drugs and biological agents [6,13,24-27]. These and other studies showed consistent relationships between the magnitude of synovial changes and clinical response. In particular, the change in infiltrating sublining macrophages was identified to be a potent and sensitive synovial biomarker [6,7].
Table 1. Numbers of positive cells before and after intervention. Shown are the mean numbers (± standard deviation) of CD3+ T lymphocytes, CD38+ plasma cells and CD68+ sublining macrophages per square millimetre of synovial tissue before and after intervention, measured by one observer at two different time points (OB1 t0 and OB1 t1) and by two other observers (OB2 and OB3) for placebo-treated patients and prednisolone-treated patients. Footnotes: (a) nonparametric, unpaired, Mann-Whitney U-test for the comparison between placebo and prednisolone treatment; (b) nonparametric, paired, Wilcoxon signed rank test for the comparison between OB1 t0 and OB1 t1 (intra-observer comparison); (c) nonparametric, paired, Friedman test for the comparison between the three observers (OB1 t0, OB2 and OB3).
Table 2. Estimates of the variance components (between and within patients) and of the intraclass correlations (single rater and average of raters). Columns: cell type, intra-observer, inter-observer.
ST can easily and safely be obtained as a result of the introduction of small-bore arthroscopes and the development of local and regional anaesthesia protocols. Despite heterogeneity in the ST within a single joint, it has been shown that representative measures of synovial inflammation can be obtained by examining a limited area of tissue [15,28,29]. Previous work [10,11] has also shown that DIA is a sensitive, time-efficient method for quantifying both the number of stained cells and the staining intensity, with good correlations with both manual counting and semiquantitative scoring.
Although DIA is described as reliable and objective, little is known about the variability and reliability of this tool. Variation in measurements may result from a limited number of factors with this approach. In our system the observer selects three different areas of each six high-power fields from one slide, which is composed of six biopsy samples from six different sites in the joint. This is done in such a way that a representative area is selected, and this requires extensive training and experience with the histopathological morphology of ST. After scanning the representative high-power fields, the images are analyzed by setting threshold values for the stained antigen, nuclear staining and background staining [10]. These thresholds are kept constant for all measurements with the same marker within a study, but could theoretically give rise to variation when set by different observers or by one observer at different times. In the present study it was shown that these variables did not result in different outcomes. There were good ICCs when the findings of three experienced observers or the findings of the same observer at different times were compared. Analysis by Bland-Altman plots showed no systematic differences with regard to the intra-observer measurements, and the SDCs showed good discriminatory power when applied to the treatment groups. In addition, all observers and all measurements identified the same cell types (T cells and macrophages) as decreasing significantly in the active treatment group compared with placebo. All measurements also identified a consistent trend toward reduced plasma cell numbers after corticosteroid treatment, which did not reach statistical significance, possibly because of the relatively small number of patients included.
Although this method does exhibit good agreement in detecting changes in histological markers, this does not necessarily mean that these results can be extrapolated to the expression of a given marker at a given time point, as used in cross-sectional studies of ST. In addition, it remains to be seen whether the same reliability holds true for determination of changes in secreted proteins, such as cytokines and chemokines.

Conclusion
In conclusion, the findings of the present study show the reliability of ST analysis using a DIA system for the evaluation of serial synovial biopsy samples before and after treatment. This approach may be used for efficient quantification of synovial biomarkers in small proof-of-principle clinical trials.

Competing Interests
The author(s) declare that they have no competing interests.

Authors' contributions
JJH contributed to experiments, was responsible for data analysis and interpretation, and wrote the manuscript. MV and TJMS were responsible for both the set-up and performance of the experiments. DMG was responsible for including patients and collecting materials and data. AHZ coordinated and assisted in the statistical analysis of the data. PPT was responsible for planning the work and contributed to data analysis, interpretation and write up.