Diagnosis of early stage knee osteoarthritis based on early clinical course: data from the CHECK cohort

Background: Early diagnosis of knee osteoarthritis (OA) is important in managing this disease, but such an early diagnostic tool is still lacking in clinical practice. The purpose of this study was to develop diagnostic models for early stage knee OA based on the first 2-year clinical course after the patient’s initial presentation in primary care and to identify whether these course factors had additive discriminative value over baseline factors.
Methods: We extracted eligible patients’ clinical and radiographic data from the CHECK cohort and formed the first 2-year course factors according to the factors’ changes over the 2 years. Clinical expert consensus-based diagnosis, which was made by evaluating patients’ 5- to 10-year follow-up data, was used as the outcome factor. Four models were developed: model 1 included clinical course factors only; model 2 included clinical and radiographic course factors; model 3 included clinical baseline factors plus clinical course factors; and model 4 included clinical and radiographic baseline factors plus clinical and radiographic course factors. All the models were built by a generalized estimating equation with a backward selection method. Area under the receiver operating characteristic curve (AUC) and its 95% confidence interval (CI) were calculated for assessing model discrimination. AUCs were compared with DeLong’s method.
Results: Seven hundred sixty-one patients with 1185 symptomatic knees were included in this study. Thirty-seven percent of knees were diagnosed as OA at follow-up. Model 1 contained 6 clinical course factors; model 2, 6 clinical and 3 radiographic course factors; model 3, 6 baseline clinical factors combined with 5 clinical course factors; and model 4, 4 clinical and 1 radiographic baseline factors combined with 5 clinical and 3 radiographic course factors. Model discriminations were as follows: model 1, AUC 0.70 (95% CI 0.67–0.74); model 2, 0.74 (95% CI 0.71–0.77); model 3, 0.77 (95% CI 0.74–0.80); and model 4, 0.80 (95% CI 0.77–0.82). AUCs of model 3 and model 4 were slightly but significantly higher than those of the corresponding baseline-factor models (model 3 0.77 vs 0.75, p = 0.031; model 4 0.80 vs 0.76, p = 0.003).
Conclusions: Four diagnostic models were developed with “fair” to “good” discriminations. First 2-year course factors had additive discriminative value over baseline factors.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13075-021-02598-5.


Background
Early diagnosis of knee osteoarthritis (OA) is important in managing this disease, as it helps open a 'treatment window' for early interventions which could positively modify the disease course [1][2][3][4]. Nowadays, such an early diagnostic tool is still lacking in clinical practice.
Individuals who will develop established OA could be considered as being at an early stage of OA in the years prior to the diagnosis of established OA. By applying multivariable prediction models, early diagnostic algorithms can be built by connecting multiple present predictors with the future occurrence of established OA [5,6]. A few knee OA models, including clinical manifestations together with imaging features [7,8] or laboratory biomarkers [9][10][11][12][13], have been proposed for building (early) predictive configurations. As no gold standard has been established for diagnosing knee OA in clinical practice (as opposed to classification criteria intended for studies), these models were built using heterogeneous outcomes: clinical OA based on American College of Rheumatology criteria [12,13], persistent knee pain [14], or (incident) radiographic OA [7][8][9][10][11]. A better way to minimize the "gap" between "research classification criteria" and "unknown gold criteria" is to obtain a clinical expert consensus-based diagnosis, as we have done in a previous study [15].
All of the above models were based on baseline factors only; none evaluated the diagnostic value of the early clinical course. Knee OA progression has been reported to follow a pattern of inertia [16], which means knees with recent progress will continue to progress in the future and are more likely to develop into established OA. In turn, these knees should be considered as being at early stage OA at this moment. Besides, a "wait-and-see" policy is frequently applied by clinicians while treating knee complaints with a recent onset and with mild symptoms, suspected but not confirmed for knee OA [17,18]. Repeated consultations are quite common for such a chronic disease. Hence, early clinical course data of knee OA is often clinically accessible.
In our previous study, we built early diagnostic models for clinical expert consensus-based diagnosis by including baseline factors [15]. In this study, we aimed to use the first 2-year course factors, as well as the combinations of baseline and course factors, to build diagnostic models for the same expert diagnosis. Additionally, we aimed to see whether course factors had additive discriminative value over baseline factors.

Data source and patients
We obtained patient data from the CHECK cohort (a longitudinal cohort study of patients with knee or hip complaints suspected to be early stage OA, followed for 10 years) [19,20]. The inclusion criteria of the CHECK cohort were (1) non-traumatic knee or hip pain or stiffness, (2) age 45-65 years, and (3) no previous consultation, or a first consultation with a general practitioner within 6 months before inclusion. The CHECK cohort excluded patients whose complaints could be explained by diseases other than OA. Patients in the CHECK cohort completed questionnaires and underwent physical and radiographic examinations at baseline and at 2, 5, 8, and 10 years. Further details are given elsewhere [19,20].
This study included all knees with reported symptoms at baseline and with data available throughout the 10 years. If a patient reported bilateral knee symptoms at baseline, both knees were included.

First 2-year course factors and definitions
We collected identical factors at baseline and 2-year follow-up, including body mass index (BMI, kg/m²); bilateral knee pain (yes/no); physical examinations (presence of joint line tenderness, bony swelling at the joint margins, warmth, effusion, crepitus, patellofemoral joint grinding, restricted/painful flexion/extension, Heberden nodes); Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) questionnaires [21] (we selected knee OA related items, comprising 5 individual items for pain, 6 for function, and 1 for knee stiffness, all graded from 0 to 4); and radiographic items (medial/lateral tibiofemoral osteophytes, medial/lateral tibiofemoral joint space narrowing (JSN), patellofemoral osteophytes, patellofemoral JSN, and tibiofemoral joint angle). We defined restricted/painful flexion as maximal knee flexion ≤ 115° or pain at knee flexion, and restricted/painful extension as an extension deficit ≥ 1° or pain at knee extension. We measured tibiofemoral joint angle on standardized weight-bearing posterior-anterior radiographs using Knee Images Digital Analysis (KIDA) software [22]. Trained readers scored the radiographic items according to Kellgren & Lawrence criteria [23] via a centralized reading of standardized posterior-anterior and lateral radiographs. Readers were informed of the sequence of images but were blinded to the clinical information [19].
We defined the first 2-year course factors according to the factors' change over this period. A BMI change greater than 5% was considered clinically relevant [24,25], so we coded the course factor for BMI into decrease (BMI decreased ≥ 5%), increase (BMI increased ≥ 5%), and stable. For bilateral knee pain and physical examination items, we coded each course factor into negative at both time points (baseline and 2-year follow-up), positive at either time point, and positive at both time points. For WOMAC individual items, osteophytes, and JSN, we coded course factors into three categories by the change in severity: decrease (severity decreased one grade or more), increase (severity increased one grade or more), and stable. We chose "one grade" as the threshold mainly based on prior knowledge that one grade is considered a minimal detectable difference in the WOMAC questionnaire [21,26] and the Kellgren & Lawrence grading system [23]. According to the previous literature, a tibiofemoral joint angle change of less than 2° should be considered measurement error [22]. Hence, we coded the course factor for joint angle into decrease (angle decreased ≥ 2°), increase (angle increased ≥ 2°), and stable.
Because few patients (1%) presented bony swelling, joint warmth, or joint effusion at both time points, we merged these patients into the category of positive at either time point. Similarly, because few patients (1-3%) presented decreased severity in radiographic items (except tibiofemoral joint angle), we merged these into the stable category.
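The coding rules above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the code used in the study; the function names, the category labels as strings, and the two `merge_*` flags (which mimic the merging of rare categories) are assumptions introduced here.

```python
def code_bmi_course(bmi_baseline: float, bmi_year2: float) -> str:
    """Code the BMI course; a relative change of at least 5% counts as relevant."""
    relative_change = (bmi_year2 - bmi_baseline) / bmi_baseline
    if relative_change >= 0.05:
        return "increase"
    if relative_change <= -0.05:
        return "decrease"
    return "stable"


def code_presence_course(at_baseline: bool, at_year2: bool,
                         merge_both_into_either: bool = False) -> str:
    """Code a yes/no item (e.g. crepitus) observed at baseline and at 2 years."""
    count = int(at_baseline) + int(at_year2)
    if count == 0:
        return "negative at both"
    if count == 2 and not merge_both_into_either:
        return "positive at both"
    # One positive time point, or a rare 'both' category merged into 'either'.
    return "positive at either"


def code_grade_course(grade_baseline: int, grade_year2: int,
                      merge_decrease_into_stable: bool = False) -> str:
    """Code a graded item (WOMAC item, osteophyte, JSN); threshold is one grade."""
    change = grade_year2 - grade_baseline
    if change >= 1:
        return "increase"
    if change <= -1 and not merge_decrease_into_stable:
        return "decrease"
    return "stable"


def code_angle_course(angle_baseline: float, angle_year2: float) -> str:
    """Code the tibiofemoral joint angle; changes below 2 degrees are treated as noise."""
    change = angle_year2 - angle_baseline
    if change >= 2.0:
        return "increase"
    if change <= -2.0:
        return "decrease"
    return "stable"
```

For example, `code_grade_course(2, 1, merge_decrease_into_stable=True)` returns `"stable"`, which mirrors how rare radiographic decreases were folded into the stable category.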

Outcome factor
We used the clinical expert consensus-based diagnosis as the outcome factor. Our previous studies described the process in detail [15,27]. Briefly, we recruited both general practitioners and secondary care physicians to evaluate each knee's longitudinal (5- to 10-year follow-up) clinical and radiographic data. Based on consensus, the clinical experts made the final diagnosis for each knee of whether clinically relevant knee OA had developed during follow-up. No formal definition of clinically relevant knee OA was provided to the clinicians; they were instructed to use their own clinical expertise to judge this. The final diagnosis was made upon agreement by clinicians (intraclass correlation coefficient 0.908; 95% confidence interval (CI), 0.821 to 0.965) [15], and for each knee it could be one of the following options: OA, no OA, or uncertain.

Statistics
We checked the baseline and 2-year follow-up factors for missing data and handled missing values by multiple imputation (50 imputed datasets; 49% of cases had incomplete data, but only 2 variables had more than 10% missing values). Next, we created course factors for each knee.
During the model building process, we first excluded knees that were diagnosed as "uncertain." We did not perform a formal sample size calculation but made sure to meet the rule of thumb of at least 10 OA knees per predictor. As the baseline-factor models for this study, we adopted the models (containing baseline factors only) that we developed in the previous study [15]. For building course-factor models, we used the same stepped approach as for our baseline-factor models. First, we built a model by including clinical course factors only (model 1) and then a model including both clinical and radiographic course factors (model 2). Since the two knees of a patient with bilateral complaints share the same personal data (i.e., age, sex, and BMI) and might have correlated measurement results, we treated the data of the two knees as repeated measures within one person. To adjust for repeated measures, we applied a generalized estimating equation (GEE) with a backward selection method (removal at P > 0.1) to build the models. In this way, the final models can be used for calculating the probabilities of individual knees. Treating the category of stable or negative at both time points as the reference, we merged either of the other two categories into the reference category if it was not significant (P > 0.1).
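The backward elimination rule (remove the least significant factor while its P value exceeds 0.1) can be sketched as a short loop. This is a minimal illustration and not the SPSS procedure used in the study: the GEE fit is hidden behind a hypothetical callable `fit_pvalues`, which is assumed to refit the model on the given factors and return one P value per factor. Handling of multi-category factors and of the imputed datasets is left out.

```python
from typing import Callable, Dict, List


def backward_select(
    factors: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    removal_threshold: float = 0.1,
) -> List[str]:
    """Drop the least significant factor until every remaining P value is <= threshold."""
    kept = list(factors)
    while kept:
        # Hypothetical model fit: refit on the current factor set.
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= removal_threshold:
            break  # all remaining factors pass the retention criterion
        kept.remove(worst)
    return kept
```

Passing the fit as a callable keeps the selection logic independent of the modelling software, so the same loop could wrap any routine that returns per-factor P values.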
Finally, we added the factors of the final model 1 into our baseline clinical-factor model (developed in our previous study [15]). We built this combined clinical model (model 3), and likewise the combined clinical and radiographic model (model 4), using the same GEE procedure with backward selection. We presented all model factors as pooled odds ratios (OR) with 95% CI and tested model discrimination via the receiver operating characteristic curve. The pooled area under the curve (AUC) and its 95% CI were calculated. To identify whether course factors have additive discriminative value over baseline factors, we compared the AUC values of model 3 and model 4 with those of the two corresponding baseline-factor models using the method of DeLong et al. [28]. To evaluate each factor's contribution in the 4 models, we continued backward selection and removed the factor with the highest p value step by step until only one remained. The AUC was calculated at each step.
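As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen OA knee receives a higher predicted probability than a randomly chosen non-OA knee, with ties counted as one half. The function name and inputs are assumptions for illustration; pooling across imputed datasets, the confidence interval, and the DeLong comparison are not reproduced here.

```python
from typing import Sequence


def auc_from_scores(scores_oa: Sequence[float], scores_no_oa: Sequence[float]) -> float:
    """AUC as the share of (OA, no-OA) pairs ranked correctly; ties count 0.5."""
    if not scores_oa or not scores_no_oa:
        raise ValueError("both groups must contain at least one knee")
    concordant = 0.0
    for oa_score in scores_oa:
        for other_score in scores_no_oa:
            if oa_score > other_score:
                concordant += 1.0
            elif oa_score == other_score:
                concordant += 0.5
    return concordant / (len(scores_oa) * len(scores_no_oa))
```

The pairwise loop is quadratic in the number of knees, which is acceptable for a dataset of roughly a thousand knees; a rank-based formula would give the same value faster.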
We internally validated all the models by estimating model calibration and over-fitting [29]. We tested model calibration via calibration plots and the Hosmer and Lemeshow statistic; a Hosmer and Lemeshow test result of P > 0.05 indicates good calibration. We detected model over-fitting by bootstrapping 1000 samples from the derivation dataset (with replacement) [29]. The amount of optimism was evaluated according to the change in AUC.
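For readers unfamiliar with the Hosmer and Lemeshow test, a minimal sketch follows. It assumes the common setup of 10 groups of sorted predicted probabilities and a chi-square reference distribution with groups minus 2 degrees of freedom; the study's exact settings are not stated, so these are assumptions. The function names are hypothetical, every group is assumed to be non-empty with an expected count strictly between 0 and the group size, and the clustering of knees within patients is ignored.

```python
import math
from typing import Sequence, Tuple


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper tail of the chi-square distribution, exact closed form for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def hosmer_lemeshow(probabilities: Sequence[float], outcomes: Sequence[int],
                    groups: int = 10) -> Tuple[float, float]:
    """Return (statistic, p value) comparing observed and expected events per group."""
    pairs = sorted(zip(probabilities, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of OA knees
        observed = sum(y for _, y in chunk)   # observed number of OA knees
        statistic += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return statistic, chi2_sf_even_df(statistic, groups - 2)
```

A large p value means the observed and expected counts agree within each risk group, which is how P > 0.05 is read as good calibration.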
We performed sensitivity analysis for the 4 models by including 'uncertain' knees into the dataset, and assessed model discriminations when treating 'uncertain' knees as OA knees and as no OA, respectively.
Model building, discrimination, and sensitivity analysis were performed with SPSS version 25.0 (IBM, Chicago, USA). AUC comparison and internal validation were performed with R version 3.6.1. Development and reporting of these models followed the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidance (see Additional file 1 TRIPOD checklist) [29].

Patients and factors
Seven hundred sixty-one patients with 1185 symptomatic knees were included in this study. Nine hundred forty-eight (79%) were female; the mean (SD) age was 56 (5) years. Characteristics of baseline and course factors, pooled after multiple imputation, are presented in Table 1. Four hundred thirty-eight (37%) knees were diagnosed as OA, 532 (45%) as no OA, and 215 (18%) as "uncertain" in the final diagnosis.

Models
Six clinical course factors were retained in model 1, and 9 (6 clinical and 3 radiographic) course factors were retained in model 2. Pooled ORs are presented in Table 2. In both model 1 and model 2, worsening of clinical or radiographic signs over the first 2 years, except the course of restricted/painful extension, indicated a higher probability of early stage knee OA.
Six baseline clinical factors combined with 5 clinical course factors were retained in model 3, and 5 (4 clinical and 1 radiographic) baseline factors combined with 8 (5 clinical and 3 radiographic) course factors were retained in model 4. Pooled ORs are presented in Table 2. In both model 3 and model 4, more severe baseline status combined with worsening of clinical or radiographic signs over the first 2 years, except the course of restricted/painful extension, indicated a higher probability of early stage knee OA.
Four final model equations for calculating individual probability are presented in Fig. 1.
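Fig. 1 is not reproduced in this text. Assuming the equations take the usual logistic form implied by reporting odds ratios (an intercept plus the sum of coefficients for the categories that apply, passed through the inverse logit), the calculation can be sketched as below. The function name, the factor labels, and all coefficient values in the usage example are hypothetical placeholders, not the study's estimates.

```python
import math
from typing import Dict, Set


def predicted_probability(intercept: float,
                          log_odds_ratios: Dict[str, float],
                          present_categories: Set[str]) -> float:
    """Inverse logit of the intercept plus the coefficients of the categories present."""
    linear_predictor = intercept + sum(
        coefficient
        for category, coefficient in log_odds_ratios.items()
        if category in present_categories
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))
```

With a hypothetical intercept of 0 and a single factor whose odds ratio is 2 (coefficient `math.log(2)`), a knee with that factor present gets odds of 2, that is, a probability of 2/3.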

Internal validation and sensitivity analysis
Both model 1 and model 2 presented good internal calibration (model 1, p = 0.24; model 2, p = 0.40) (Fig. 2) and were detected with no over-fitting when rounded to two decimals (Additional file 1: table 1). Sensitivity analysis showed minimal reduction (3% to 5%) of AUCs while incorporating "uncertain" knees into the dataset for both model 1 and model 2 (Additional file 1: table 2).

Discussion
This study showed that information on the early clinical course can help to diagnose early stage knee OA. Both models with only course factors included presented "fair" discriminations and good internal validations. Adding these identified course factors into baseline models, we found the two combined models had significantly better discriminative abilities than the baseline models. However, the improvements in AUC should be considered as small and further studies are needed for evaluating the clinical relevance. Baker et al. reported that interpretation of a small increase in AUC should be made based on balancing benefits and costs of obtaining new factors; even an increase of 0.02 in AUC by additional factors was demonstrated worthwhile in their study [30].
The principal motivation for this study was to build implementable diagnostic tools for early stage knee OA. For this purpose, our study was designed to have the following strengths. First, our models were built in a large population of patients who were suspected of having early stage knee OA and had begun to seek medical care in primary care [19]. Compared with other well-known early stage knee OA cohorts, such as the Osteoarthritis Initiative (incident OA subgroup) [31], MOST [32], and CASK [33] cohorts, the population in CHECK presents an even earlier stage with milder structural damage. Therefore, the first 2-year course identified in this study should be considered an early clinical course. Second, our models used diagnoses of real clinical experts (experienced general practitioners and secondary care physicians) as the reference standard, obtained via a pre-designed protocol. Most diagnoses in prior knee OA models were based on radiographic assessments, that is, (incident) radiographic knee OA [7][8][9][10][11]. However, not all knees with radiographic OA would be diagnosed as OA in real practice; in the CHECK cohort, the overlap between radiographic knee OA and the expert diagnosis was only 59% [15]. Third, the factors in our models are clinically implementable assessments. As early diagnosis is mainly done in primary care, where the use of radiography to diagnose OA is discouraged, our model 1 and model 3 can be used there. Model 2 and model 4 can be implemented in clinical settings where radiographs are available. In contrast, models that apply novel factors such as (quantified) MRI features and laboratory (genetic) biomarkers [7][8][9][10][11][12][13] are costly and not applicable in daily clinical practice. In addition, the AUCs of such models have ranged from 0.72 to 0.83 [7][8][9][10][11][12][13], fully comparable with our results obtained with routine and low-cost procedures.
Theoretically, the individual probability of early stage knee OA can be calculated by entering personal attributes into our model equations. The additive discriminative value of course factors indicates that an early diagnosis based on the 2-year follow-up should be more accurate than one using baseline data only. On the other hand, based on the equations, it makes more sense to apply these models to patients with worsening conditions within the first 2 years, since these patients are more likely to have higher predicted probabilities and to need treatment. Furthermore, these findings suggest a new strategy for selecting predictors. Future studies on building prediction/diagnostic models for knee OA could also take a period of disease course into account. This study presented the first step of establishing early diagnostic criteria based on patients' first 2-year clinical course, but further studies are certainly warranted before implementing these models in real practice. There are probably too many factors in our final models, especially model 3 and model 4, which increases the difficulty of implementing them in real practice. Making the models more concise seems necessary. According to each factor's contribution in our models, it should be feasible to decrease the number of factors in model 3 and model 4 to a maximum of 6 or 7 with AUC values around 0.75 (an AUC of 0.75 or above indicates clearly useful discrimination [34]). Further evaluations of each factor's clinical implications as well as the costs and benefits are required before deciding which factors to remove. Meanwhile, external validation of the models is an essential step, as it evaluates the reliability of applying the models in other populations.
Moreover, to establish clinically practical diagnostic criteria, a probability threshold for ruling in and ruling out early stage knee OA is needed, after which diagnostic measures including predictive values, sensitivity, and specificity can be assessed.
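Once a probability threshold is chosen, the diagnostic measures mentioned above follow from a simple two-by-two count. The sketch below assumes a single cut-off at which a knee is classified as early stage OA when its predicted probability is at or above the threshold; the function name and the example data in the usage note are illustrative only, and the input is assumed to contain at least one knee in each cell's denominator.

```python
from typing import Dict, Sequence


def diagnostic_measures(probabilities: Sequence[float],
                        has_oa: Sequence[bool],
                        threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, and predictive values at one probability cut-off."""
    tp = fp = tn = fn = 0
    for probability, truth in zip(probabilities, has_oa):
        predicted_oa = probability >= threshold
        if predicted_oa and truth:
            tp += 1
        elif predicted_oa and not truth:
            fp += 1
        elif not predicted_oa and truth:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }
```

Evaluating this function over a range of thresholds would show the trade-off between ruling in (high specificity) and ruling out (high sensitivity), which is the choice the text says still has to be made.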
In general, model factors indicated that knees with early deterioration are more likely to have early stage knee OA, which is mostly consistent with the results of other studies [16,35,36]. It is notable that one course factor, restricted/painful extension, was inversely related to clinically relevant knee OA in our models. A similar phenomenon was found in another study [37], whose model presented dyslipidemia and a family history of premature coronary artery disease as protective factors for all-cause mortality. This was explained by an unmeasured confounder, the use of lipid-lowering medication in these patients. In this study, the patients with restricted/painful joint extension (present at either time point) probably received some unmeasured but effective interventions as well, such as physical therapy [38,39]. We incorporated this factor in our models mainly for its additive discriminative ability, but this inverse relationship needs to be externally validated.
There are several limitations in this study. First, misclassification bias cannot be ruled out when dealing with course factors, especially radiographic factors. Some knees were found to have milder structural features after the 2 years, most of which should be considered misclassifications (measurement errors). Given the low rates, and because these variables were created independently of the outcome, this should be considered non-differential misclassification with a very limited impact on model estimates. The second concern is that our models are based on the first 2-year follow-up, which means the model makes an early diagnosis 2 years after baseline. The cost of this time delay is unknown. Since our baseline assessment was at the patient's first consultation for knee complaints and the CHECK cohort was proven to include patients at an early stage, we assume that 'wait and see' or an 'inconclusive diagnosis' together with some symptomatic treatment in the 2 years is justifiable. Third, since there was no evidence for defining the time interval for detecting the early OA disease course, we chose the 2 years based on the availability of follow-up data in the CHECK cohort. Further studies exploring other time intervals or verifying this choice are needed. Fourth, a minimal amount of overfitting was detected in model 4, which might cause inaccurate probability estimations. According to a previous study, optimism is acceptable if less than 5% [40]. Therefore, we did not adjust the model intercept and coefficients.

Fig. 2 Calibration plots of the four models. Blue points represent data points of mean predicted against mean observed within certain range of predicted probability. Orange line represents a regression smoother through data points. Gray line represents perfect calibration.

Conclusions
Four diagnostic models for early stage knee OA were developed based on the early clinical course and were well internally validated. Clinical course factors had statistically significant additive discriminative value over baseline factors, but the clinical relevance is yet to be determined. For real practice, the findings of this study suggest reevaluating patients whose conditions worsen after the baseline assessment.