Identification of biomarkers by machine learning classifiers to assist diagnose rheumatoid arthritis-associated interstitial lung disease



This study aimed to search for blood biomarkers among the profiles of patients with RA-ILD by using machine learning classifiers and probe correlations between the markers and the characteristics of RA-ILD.


A total of 153 RA patients were enrolled, including 75 RA-ILD and 78 RA-non-ILD. Routine laboratory data, the levels of tumor markers and autoantibodies, and clinical manifestations were recorded. Univariate analysis, least absolute shrinkage and selection operator (LASSO), random forest (RF), and partial least square (PLS) were performed, and the receiver operating characteristic (ROC) curves were plotted.


Univariate analysis showed that, compared to RA-non-ILD, patients with RA-ILD were older (p < 0.001), had higher white blood cell (p = 0.003) and neutrophil counts (p = 0.017), had higher erythrocyte sedimentation rate (p = 0.003) and C-reactive protein (p = 0.003), had higher levels of KL-6 (p < 0.001), D-dimer (p < 0.001), fibrinogen (p < 0.001), fibrinogen degradation products (p < 0.001), lactate dehydrogenase (p < 0.001), hydroxybutyrate dehydrogenase (p < 0.001), carbohydrate antigen (CA) 19-9 (p < 0.001), carcinoembryonic antigen (p = 0.001), and CA242 (p < 0.001), but a significantly lower albumin level (p = 0.003). The areas under the curves (AUCs) of the LASSO, RF, and PLS models reached a maximum of 0.95 in differentiating patients with RA-ILD from those without. When data from the univariate analysis and the top 10 indicators of the three machine learning models were combined, the most discriminatory markers were age, KL-6, D-dimer, and CA19-9, with AUCs of 0.814 [95% confidence interval (CI) 0.731–0.880], 0.749 (95% CI 0.660–0.824), 0.749 (95% CI 0.660–0.824), and 0.727 (95% CI 0.637–0.805), respectively. When all four markers were combined, the AUC reached 0.928 (95% CI 0.865–0.968). Notably, neither the KL-6 nor the CA19-9 level correlated with disease activity in the RA-ILD group.


The levels of KL-6, D-dimer, and tumor markers greatly aided RA-ILD identification. Machine learning algorithms combined with traditional biostatistical analysis can diagnose patients with RA-ILD and identify biomarkers potentially associated with the disease.


Rheumatoid arthritis (RA) is a common systemic inflammatory disease caused by the interactions between genetic and environmental factors; the prevalence in the general population ranges from 0.5 to 2%. RA is characterized by synovitis and erosive destruction of the cartilage and bone [1, 2]. Notably, various extra-articular manifestations are common [3]. Pulmonary involvement is particularly common, potentially affecting all compartments of the respiratory system, including the serosal, airway, and/or parenchymal tissues [4]. Interstitial lung disease (ILD) caused by lung parenchymal damage is often the most devastating lung issue; the prevalence ranges from 6 to 30%. ILD is one of the leading causes of morbidity and premature mortality in RA patients [3, 5]. RA-ILD was first reported by Ellman and Ball in 1948 [6]. In a recent study, the 1- and 5-year mortality rates were 13.9 and 39.0%, respectively, compared to 3.8 and 18.2% in RA patients without ILD [7]. Hence, early recognition and monitoring of RA-ILD is paramount to potentially alter the disease course.

RA-ILD diagnosis requires multidisciplinary discussion and evaluation of the patient’s medical history, clinical characteristics, laboratory indicators, high-resolution computed tomography (HRCT), pulmonary function test (PFT), and even lung biopsy [8]. Although ILD is well-recognized as a common comorbidity of RA, the present assessment tools (chest X-ray, HRCT, and PFT) may not be optimal for all patients. Radiation exposure and high cost may limit the use of HRCT in clinical practice, especially in younger patients and those for whom disease progression must be monitored over time [9]. Therefore, biomarkers that assist RA-ILD diagnosis and aid prognosis, assessment, and follow-up are urgently required.

Krebs von den Lungen-6 (KL-6) is a mucin-like, high-molecular-weight glycoprotein expressed on the surface membranes of alveolar and bronchiolar epithelial cells, particularly on type II pneumocytes that are damaged or regenerating; KL-6 is then secreted into the bloodstream through the damaged alveolar basement membrane [10]. A recent study demonstrated that KL-6 plays important roles in the diagnosis, prognostic assessment, and risk stratification of connective tissue disease-related interstitial lung disease (CTD-ILD) [11]. Additionally, tumor markers may be elevated in ILD, and their diagnostic utility has been investigated. In RA-ILD patients, the levels of carbohydrate antigen (CA) 19-9, CA125, CEA, and CA15-3 were increased compared to a control group of RA-non-ILD patients [12, 13]. D-dimer is the end-product of the degradation of cross-linked fibrin and is involved in the acute phase of inflammation; it may thus contribute to the pathophysiology of RA-ILD [14]. Tian et al. [15] assessed the levels of various serum markers in a cohort of CTD-ILD patients and found that the D-dimer levels were elevated. Based on this, we hypothesized that integration of these indicators might aid the screening of RA patients with ILD. However, few integrated models that effectively differentiate RA patients with and without ILD have been reported. Thus, an integrated model that combines multiple biomarkers to diagnose RA-ILD is urgently needed.

Over the past decade, great strides have been made in machine learning (a branch of artificial intelligence). Computers simulate human learning, build analytical models as they learn by example, train and evaluate models, and self-improve over multiple cycles in terms of their predictive powers. Machine learning allows researchers to use complex data and develop self-trained strategies to predict the characteristics of new samples. The algorithms have found applications in clinical fields, including disease prediction, diagnosis, and prognosis, and in drug discovery [16,17,18]. A method that combines multiple biomarkers to diagnose RA-ILD would be optimal. Here, we used machine learning to integrate data on the levels of KL-6, tumor biomarkers, and routine laboratory parameters and clinical features in order to identify the biomarkers that best diagnose RA-ILD.

Materials and methods


This was a retrospective analysis of 153 patients (57 new-onset RA patients and 96 treated RA patients hospitalized due to disease relapse, 103 females and 50 males, mean age 53.82 ± 14.29 years) who met the definitive 1987 RA classification criteria of the American College of Rheumatology (ACR) at the Second Hospital of Shanxi Medical University between February 2020 and November 2021 [19]. All patients were divided into two groups: the RA-ILD group and the RA-non-ILD group. ILD was diagnosed by a rheumatologist and a radiologist based on HRCT-revealed reticular abnormalities and honeycombing and clinical features. The disease activity was evaluated using the disease activity score 28-ESR [DAS28(ESR)], which is the most frequently used clinical tool to determine RA disease severity [20]. Patients who were younger than 18 years of age or pregnant, or who suffered from a malignant disease (a cancer/tumor), sarcoidosis, amyloidosis, an infection (bacterial, viral, or fungal), or other autoimmune diseases, were excluded. All patients had stopped drug treatment for more than 3 months at the time of sampling. The study was approved by the ethics committee of the Second Hospital of Shanxi Medical University (2016KY007). Informed consent was obtained from all individuals.

Clinical and laboratory indices

The clinical parameters of all patients were retrospectively collected; these included age, gender, disease duration, and clinical manifestations (the tender joint count [TJC], swollen joint count [SJC], and DAS28). The routine laboratory data included the white blood cell (WBC) count, red blood cell (RBC) count, hemoglobin (Hb), platelet (PLT) count, lymphocyte (LYMPH) count, and neutrophil (NEUT) count; erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), and immunoglobulin (Ig) G, IgM, and IgA; alanine transaminase (ALT), aspartate aminotransferase (AST), serum total protein (TP), albumin (ALB), globulin (GLO), lactate dehydrogenase (LDH), and hydroxybutyrate dehydrogenase (HBDH); and RA-related autoantibodies (rheumatoid factor [RF], anti-nuclear antibodies [ANA], anti-perinuclear factor [APF], anti-keratin antibodies [AKA], anti-cyclic citrullinated peptide antibody [CCP], and anti-mutated citrulline vimentin [MCV]). We also recorded the levels of D-dimer, fibrinogen degradation products (FDP), fibrinogen (FIB), and tumor markers (CA19-9, CA125, CA153, CA242, neuron-specific enolase [NSE], carcinoembryonic antigen [CEA], squamous cell carcinoma antigen [SCC], and alpha-fetoprotein [AFP]).

KL-6 assay

Peripheral venous blood samples from RA patients were collected immediately after admission and before drug administration (within 24 h of hospitalization) and stored at –80 °C. The levels of KL-6 were measured using the Kaeser 6600 chemiluminescent immunoassay following the manufacturer’s instructions.

Statistical analysis

All data were analyzed using SPSS 22.0, R (version 4.0.2), and MedCalc software. In univariate analysis, continuous variables were described as mean ± SD or as median (Q25, Q75) and were compared using the independent samples t-test or the Mann–Whitney U test, respectively. The effect of age on various parameters was corrected using analysis of covariance. The chi-square test was employed to compare categorical variables, which were expressed as numbers with percentages. Next, a total of 34 continuous variables described in the univariate analysis were entered into the least absolute shrinkage and selection operator (LASSO), random forest (RF), and partial least squares (PLS) models to classify patients as RA-ILD or RA-non-ILD. The models were trained on a 70% subset with tenfold cross-validation; the 30% holdout subset was used to validate the final model. We set 10 random seeds, each corresponding to one round of tenfold cross-validation, so that repeating the procedure yielded 10 “optimal models” based on different data partitions. We obtained the variable importance ranking of each “optimal model” with the varImp function (caret package, version 6.0). For each of LASSO, RF, and PLS, the 10 most heavily weighted features of the “optimal model” with the highest AUC among the 10 were designated important features. Overall important biomarkers were those that were important in all three machine learning algorithms and also differed significantly in univariate analysis. The performance of the biomarkers was evaluated using receiver operating characteristic (ROC) curves. The area under the curve (AUC), cut-off, sensitivity, specificity, positive likelihood ratio (+LR), negative likelihood ratio (-LR), and Youden index were calculated, and the biomarkers were compared, using MedCalc software.
Spearman rank correlation analysis was used to analyze correlations between biomarkers and disease activity. Figure 1 shows the study design and the analysis workflow. A p value < 0.05 was considered statistically significant.
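
The resampling scheme described in this section (a 70%/30% split, tenfold cross-validation on the training part, repeated for 10 random seeds) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R/caret code used in the study. The stratified split, the helper names `split_70_30` and `tenfold`, and the seed values 1 to 10 are assumptions made for the example; the study does not report them.

```python
import random

def split_70_30(labels, seed, train_frac=0.7):
    """Return (train_idx, holdout_idx); stratifying by class is an assumption."""
    rng = random.Random(seed)
    train, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train += idx[:cut]
        holdout += idx[cut:]
    return sorted(train), sorted(holdout)

def tenfold(train_idx, seed, k=10):
    """Shuffle the training indices and deal them into k near-equal folds."""
    rng = random.Random(seed)
    idx = list(train_idx)
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

# Cohort sizes from the study: 75 RA-ILD (label 1) and 78 RA-non-ILD (label 0).
labels = [1] * 75 + [0] * 78

runs = []
for seed in range(1, 11):  # 10 seeds; hypothetical values
    train_idx, holdout_idx = split_70_30(labels, seed)
    folds = tenfold(train_idx, seed)
    runs.append((train_idx, holdout_idx, folds))
    # A classifier (LASSO, RF, or PLS) would be tuned on `folds` here and
    # then scored once on `holdout_idx` to give the AUC of this run.

print(len(runs), "runs;", len(runs[0][0]), "training and", len(runs[0][1]), "holdout samples")
```

Each seed gives a different partition, which is why the AUC of each model is later reported as a range over 10 runs rather than as a single value.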

Fig. 1
The design and analysis plan flow diagram in this study. RA, rheumatoid arthritis; ILD, interstitial lung disease; LASSO, least absolute shrinkage and selection operator; RF, random forest; PLS, partial least square


Demographic and clinical characteristics of RA patients

The 153 RA patients were divided into an RA-ILD group (n = 75) and an RA-non-ILD group (n = 78). Before employing the machine learning algorithms, we used a conventional biostatistics approach to analyze the differences between RA-ILD (45 females, 30 males) and RA-non-ILD (58 females, 20 males) patients. The demographic, clinical, and laboratory features of the two groups are summarized in Table 1. The proportion of men was higher in the RA-ILD group than in the RA-non-ILD group, but the difference was not significant (p = 0.058). There was no significant difference in smoking history (p = 0.101) between the RA-ILD and RA-non-ILD groups. However, the RA-ILD patients were significantly older than the RA-non-ILD patients (62.84 ± 8.71 vs. 45.15 ± 13.31 years, p < 0.001). Clinical manifestations such as TJC and SJC were similar in the two groups (both p > 0.05). Compared to RA-non-ILD patients, the patients with RA-ILD exhibited a higher WBC count (p = 0.003), NEUT count (p = 0.017), ESR (p = 0.003), and CRP (p = 0.003), but a significantly lower ALB level (p = 0.003).
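
As a worked check of the sex comparison, the chi-square test can be recomputed from the counts given in the text (45 females and 30 males with RA-ILD; 58 females and 20 males without). The sketch below uses the Pearson statistic without continuity correction, which is an assumption: the study does not state whether a correction was applied. The function name `pearson_chi2_2x2` is illustrative.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail p value equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: RA-ILD, RA-non-ILD; columns: female, male (counts from the text).
chi2, p = pearson_chi2_2x2(45, 30, 58, 20)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Under that assumption the statistic is about 3.58 and the p value is close to the reported 0.058.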

Table 1 Comparisons of the demographic, clinical, and laboratory features between the RA-ILD and the RA-non-ILD group

KL-6 and tumor markers were increased in patients with RA-ILD 

The KL-6 level was significantly higher in the RA-ILD than the RA-non-ILD group [470.46 (288.92, 804.88) U/mL vs. 260.77 (188.07, 368.79) U/mL, p < 0.001]. The levels of CEA [2.30 (1.21, 3.81) ng/mL vs. 1.39 (0.95, 2.03) ng/mL, p = 0.001], CA19-9 [9.14 (5.59, 22.44) KU/L vs. 5.04 (3.12, 8.01) KU/L, p < 0.001] and CA242 [6.89 (4.01, 13.14) KU/L vs. 3.85 (2.86, 6.01) KU/L, p < 0.001] were higher in patients with RA-ILD than RA-non-ILD, but no significant between-group difference was noted for NSE, SCC, AFP, CA125, and CA153 (all p > 0.05). Meanwhile, the levels of D-dimer [961.50 (294.50, 3360.25) ng/mL vs. 263.00 (138.00, 604.00) ng/mL, p < 0.001], FIB [4.30 (3.59, 4.95) g/L vs. 3.37 (2.83, 4.18) g/L, p < 0.001], FDP [5.40 (2.31, 10.61) μg/mL vs. 2.39 (1.07, 4.43) μg/mL, p < 0.001], LDH [197.00 (171.75, 226.50) U/L vs. 170.00 (148.00, 191.75) U/L, p < 0.001] and HBDH [142.50 (128.00, 159.25) U/L vs. 123.50 (109.00, 136.75) U/L, p < 0.001] in patients with RA-ILD were significantly higher than in those with RA-non-ILD (Fig. 2). These results suggest that these parameters are potentially promising biomarkers of RA-ILD.

Fig. 2
Elevated biomarker levels in RA-ILD patients. The levels of KL-6 (a), D-dimer (b), FIB (c), FDP (d), LDH (e), HBDH (f), CEA (g), CA19-9, and CA153 (h) were significantly higher in RA-ILD patients. ILD, rheumatoid arthritis-related interstitial lung disease; Non-ILD, rheumatoid arthritis without interstitial lung disease; KL-6, Krebs von den Lungen-6; FIB, fibrinogen; FDP, fibrinogen degradation products; LDH, lactate dehydrogenase; HBDH, hydroxybutyrate dehydrogenase; NSE, neuron-specific enolase; CEA, carcinoembryonic antigen; SCC, squamous cell carcinoma antigen; AFP, alpha-fetoprotein; CA, carbohydrate antigen

Multiple machine learning models distinguishing RA-ILD from RA

We used the LASSO, RF, and PLS to further distinguish RA-ILD and RA-non-ILD patients and to screen for valuable variables. The classification accuracy of the models remained stable across 10 runs; the AUCs of LASSO, RF, and PLS were 0.84 to 0.95, 0.85 to 0.95, and 0.81 to 0.95, respectively (Supplemental Table 1). ROC analysis revealed a maximum AUC of 0.95 (accuracy 95%), indicating outstanding efficiency in discriminating RA-ILD from RA-non-ILD patients (Fig. 3). The top 10 contributing features were age, KL-6, FIB, D-dimer, CA19-9, WBC, NEUT, NSE, AFP, and SJC for LASSO; age, KL-6, FIB, D-dimer, CA19-9, CA242, LDH, CEA, HBDH, and WBC count for RF; and age, KL-6, D-dimer, CA19-9, CA242, LDH, CRP, ESR, CA153, and PLT for PLS (Fig. 4).
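
For readers less familiar with the AUC, it equals the probability that a randomly chosen RA-ILD patient scores higher than a randomly chosen RA-non-ILD patient (ties counting one half), which is the Mann–Whitney U statistic divided by the number of patient pairs. The sketch below shows that definition directly. It is not the MedCalc or caret implementation, the function name `auc_rank` is illustrative, and the marker values are invented for the example rather than taken from the study.

```python
def auc_rank(pos, neg):
    """AUC as P(pos > neg) + 0.5 * P(pos == neg) over all positive/negative pairs."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical marker values, for illustration only.
ild = [480, 300, 810, 390, 520]
non_ild = [260, 190, 370, 310, 240]

auc = auc_rank(ild, non_ild)
print(auc)  # 23 of the 25 pairs are ordered correctly -> 0.92
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported maximum of 0.95 means that about 95% of such patient pairs were ranked correctly in the best run.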

Fig. 3
Machine learning approaches are effective at separating RA-ILD and RA-non-ILD subjects. The maximum area under the ROC curve of LASSO (a), RF (b), and PLS (c)

Fig. 4
Venn diagram showing the four characteristic markers identified by the univariate analysis, LASSO, RF, and PLS model
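
The selection logic behind the Venn diagram is a set intersection: a marker is kept only if it appears in the top 10 of all three models and is also significant in univariate analysis. The sketch below reproduces this with the feature lists reported in the text; "CA199" and "CA19-9" are treated as the same marker, and the variable names are illustrative.

```python
# Top-10 features per model, as listed in the Results.
top10 = {
    "LASSO": {"age", "KL-6", "FIB", "D-dimer", "CA19-9", "WBC", "NEUT", "NSE", "AFP", "SJC"},
    "RF": {"age", "KL-6", "FIB", "D-dimer", "CA19-9", "CA242", "LDH", "CEA", "HBDH", "WBC"},
    "PLS": {"age", "KL-6", "D-dimer", "CA19-9", "CA242", "LDH", "CRP", "ESR", "CA153", "PLT"},
}

# Variables that differed significantly between groups in univariate analysis.
univariate_significant = {
    "age", "WBC", "NEUT", "ESR", "CRP", "ALB", "KL-6", "D-dimer", "FIB",
    "FDP", "LDH", "HBDH", "CA19-9", "CEA", "CA242",
}

shared = set.intersection(*top10.values()) & univariate_significant
print(sorted(shared))  # ['CA19-9', 'D-dimer', 'KL-6', 'age']
```

The intersection contains exactly the four markers carried forward to the ROC analysis.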

Clinical values of biomarkers in diagnosing ILD in RA patients

Based on the LASSO, RF, and PLS, and univariate analysis, four simultaneously important indicators were identified: age, KL-6, D-dimer, and CA19-9. The ROC curves of these four indicators are plotted in Fig. 5. ROC curve analysis revealed that the AUC of age was 0.814 (95% CI 0.731–0.880, p < 0.001), with a sensitivity of 93.33% and a specificity of 67.95%. The cut-off value for KL-6 was set at 373.65 U/mL, with a sensitivity of 61.33% and a specificity of 78.21% [AUC 0.749 (95% CI 0.660–0.824), p < 0.001]. The AUCs for D-dimer and CA19-9 were 0.749 (95% CI 0.660–0.824, p < 0.001) and 0.727 (95% CI 0.637–0.805, p < 0.001), respectively. Furthermore, the ROC curve for the combination of age, KL-6, D-dimer, and CA19-9 exhibited an AUC of 0.928 (95% CI 0.865–0.968, p < 0.001) with a sensitivity of 83.82% and a specificity of 81.63%. The AUC provided by the biomarker combination was significantly higher than that of age, KL-6, D-dimer, or CA19-9 alone (Z = 3.248, p = 0.001; Z = 4.256, p < 0.001; Z = 4.196, p < 0.001; and Z = 4.523, p < 0.001, respectively). The diagnostic efficiencies of the four biomarkers are summarized in Table 2. Taken together, these observations showed that the multi-marker model outperformed single biomarkers in diagnosing RA-ILD.
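
The likelihood ratios and the Youden index follow directly from sensitivity and specificity: +LR = sensitivity / (1 − specificity), −LR = (1 − sensitivity) / specificity, and Youden index = sensitivity + specificity − 1. The sketch below applies these standard definitions to the sensitivity and specificity pairs quoted above. The function name `diagnostic_summary` is illustrative, and the printed values are recomputed here rather than copied from Table 2.

```python
def diagnostic_summary(sensitivity, specificity):
    """Return (+LR, -LR, Youden index) for proportions in the range 0..1."""
    plr = sensitivity / (1 - specificity)
    nlr = (1 - sensitivity) / specificity
    youden = sensitivity + specificity - 1
    return plr, nlr, youden

# Sensitivity and specificity pairs quoted in the text.
reported = {
    "age": (0.9333, 0.6795),
    "KL-6": (0.6133, 0.7821),
    "combination": (0.8382, 0.8163),
}

for name, (se, sp) in reported.items():
    plr, nlr, j = diagnostic_summary(se, sp)
    print(f"{name}: +LR = {plr:.2f}, -LR = {nlr:.2f}, Youden = {j:.3f}")
```

Age alone has high sensitivity but modest specificity, whereas the four-marker combination balances the two, which is what raises its Youden index.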

Fig. 5
Important biomarkers were selected from multiple analyses and ROC curves were plotted. The ROCs of age, KL-6, D-dimer, and CA19-9, and their combination were plotted to differentiate RA-ILD from RA-non-ILD. The ROC curve for the combination of age, KL-6, D-dimer, and CA19-9 exhibited an AUC of 0.928

Table 2 The predictive power of multiple biomarkers in the diagnosis of patients with RA-ILD vs. RA-non-ILD

Associations between biomarkers and disease activity indicators

The correlation analysis between biomarkers and disease activity was conducted in RA and RA-ILD patients (Fig. 6). Significant positive correlations were found between the D-dimer level and disease activity indicators in all RA patients, including ESR (r = 0.586, p < 0.001), CRP (r = 0.574, p < 0.001), DAS28 (r = 0.414, p < 0.001), IgG (r = 0.326, p < 0.001), IgA (r = 0.318, p < 0.001), and IgM (r = 0.261, p < 0.001). The CA19-9 level was weakly correlated with the ESR (r = 0.199, p = 0.008), but we found no correlations between KL-6 and disease activity indicators (p > 0.05), suggesting that KL-6 and CA19-9 might be involved in the pathogenesis of ILD rather than RA. Further analysis showed no obvious correlation between KL-6 or CA19-9 and any disease activity indicator in patients with RA-ILD (all p > 0.05).
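
Spearman's coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. A minimal sketch of that computation follows; the helper names are illustrative and the paired values are invented for the example, not taken from the study data.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical paired measurements, for illustration only.
d_dimer = [140, 260, 300, 600, 960, 3300]
esr = [12, 20, 18, 45, 60, 95]
rho = spearman(d_dimer, esr)
print(round(rho, 3))  # 0.943
```

Because only ranks enter the calculation, the strongly right-skewed D-dimer distribution does not distort the coefficient, which is one reason a rank-based method suits these markers.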

Fig. 6
Heatmap of correlations between the biomarkers and disease characteristics. D-dimer was positively associated with disease activity indicators in patients with RA (a) and RA-ILD (b), whereas KL-6 showed no correlation with disease activity. * = p < 0.05, ** = p < 0.01, and *** = p < 0.001 by Spearman correlation test


ILD, the most common and serious complication of RA, can occur at any stage of RA. Paradoxically, despite the lung involvement, patients with RA-ILD may remain asymptomatic long-term [3]. Respiratory symptoms (cough, wheezing, or dyspnea) are not obvious in most RA-ILD patients, which complicates diagnosis, early detection, and management [21]. As the disease progresses, respiratory failure may develop, leading to a poor prognosis and death [22]. The pathogenesis of RA-ILD remains incompletely understood, although genetic, humoral, and environmental factors seem to be involved. Older age, autoantibody production (anti-CCP and RF), and cigarette smoking may increase the incidence of ILD [23, 24].

We found a higher proportion of men in the RA-ILD group than in the RA-non-ILD group, although the difference was not significant. This may be because smoking is strongly associated with ILD in males. There was no significant difference in smoking between the RA-ILD and RA-non-ILD groups (21.33% vs 11.54%) in this study, but the odds ratio was 2.079 (Supplementary table 2). Kelly et al. [25] showed that the male:female ratio was 1:1.09 in 230 patients with RA-ILD and that smoking was associated with ILD in males. In addition, most of the patients with RA-ILD were RF seropositive and were older than RA-non-ILD patients. Consistent with our finding, Lee et al. [26] and Kass et al. [27] showed that the mean age was significantly higher in the ILD group. The RA-ILD patients had higher levels of disease activity indicators (ESR, CRP, WBC count, and NEUT count), suggesting that ILD might aggravate primary RA. Therefore, it is essential to systematically screen for RA-ILD biomarkers, which would permit management of ILD at an early stage. Over the past decade, several diagnostic biomarkers of RA-ILD have emerged [28, 29]. However, most studies focused on single markers. To the best of our knowledge, this is the first study using machine learning algorithms to identify multiple biomarkers for RA-ILD, although our sample size is small. Parameters selected in common by multiple biostatistical methods are more likely to reflect strong and genuine signals in the data.
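
The odds ratio for smoking can be recovered from the reported percentages. The sketch below assumes 16 smokers among the 75 RA-ILD patients (21.33%) and 9 among the 78 RA-non-ILD patients (11.54%); these counts are inferred from the percentages and group sizes, since they are not stated in the text, and the function name `odds_ratio` is illustrative.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio for a 2x2 exposure-by-outcome table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Counts inferred from the reported percentages (assumption, see lead-in).
smokers_ild, n_ild = 16, 75   # 16 / 75 = 21.33%
smokers_non, n_non = 9, 78    #  9 / 78 = 11.54%

or_smoking = odds_ratio(
    smokers_ild, n_ild - smokers_ild,
    smokers_non, n_non - smokers_non,
)
print(round(or_smoking, 3))  # 2.079
```

The result matches the reported value: the odds of smoking were roughly doubled in the RA-ILD group, even though with only 25 smokers in total the difference did not reach statistical significance.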

We found that the levels of KL-6 and tumor markers (CA19-9, CA242, and CEA) were elevated in RA-ILD patients. Previous studies suggested that RA-ILD patients had significantly higher serum KL-6 and tumor marker levels than those without ILD, and that these markers were strongly associated with the severity of ILD [13, 28]. KL-6 is chemotactic for lung fibroblasts and exerts pro-fibrotic and anti-apoptotic effects on these cells [28]. It remains unclear why the levels of tumor markers were elevated, but the results (especially for CA19-9 and CEA) are consistent with observations from patients with CTD-ILD [29, 30]. Wang et al. assessed the levels of various serum tumor markers in a cohort of RA-ILD patients without cancer and found that the CA19-9 level was increased compared to that of RA patients without ILD [12]. CEA has been reported to reflect the proliferation and secretion of epithelial cells [31]. CA19-9 is secreted apically from the bronchial gland and may induce NEUT maturation; the CA19-9 level correlated positively with NEUT count. Persistent epithelial cell damage and NEUT accumulation in the respiratory tract may explain the high levels of CA19-9 [32].

Furthermore, our results showed that the D-dimer level in the RA-ILD group was higher than that in the RA-non-ILD group. This may reflect the fact that D-dimer (a final product of fibrin degradation) is involved in the acute phase of inflammation [14]. In the acute phase of RA, an elevated D-dimer level may reflect upstream tissue damage caused by inflammation [33]. We further found that the FIB and FDP levels in the RA-ILD group were significantly higher than in the RA-non-ILD group. In addition, the LDH and HBDH levels were significantly elevated in patients with RA-ILD, providing a new perspective for diagnosing RA-ILD. This may be due to up-regulation of LDH expression through activation of mammalian target of rapamycin (mTOR) and its downstream targets, which in turn increases serum HBDH levels [34]. mTOR is a key regulator of cell growth, activation, proliferation, and survival, and is involved in the occurrence and development of both RA and ILD [35, 36].

Subsequently, we used three machine learning algorithms to classify patients with RA-ILD and RA-non-ILD and to assess the importance of various parameters in terms of patient classification. Machine learning models that afford good predictive accuracy can be used to generate reliable biomarkers [17]. We improved model strength and stability by training with tenfold cross-validation and constructing 10 models based on different data partitions. Such tenfold cross-validation simulates a more standardized diagnostic test and affords better classification [37]. Interestingly, all three approaches delivered highly consistent results. The best AUCs of the LASSO, RF, and PLS were all 0.95, suggesting that the identified markers robustly enhance current disease classification. Using the LASSO, RF, and PLS, RA patients are likely to be correctly classified as ILD or non-ILD. Our methods are the first to identify serum features associated with RA-ILD. However, machine learning does not replace traditional statistical analyses; rather, it further assists clinical diagnosis by enhancing existing methods.

Importantly, four indicators, age, KL-6, D-dimer, and CA19-9, were identified as the most valuable biomarkers by the three machine learning algorithms and univariate analysis, and these four biomarkers might be involved in the occurrence and development of ILD. Notably, the ROC curve for the combination of age, KL-6, D-dimer, and CA19-9 exhibited an AUC of 0.928, a sensitivity of 83.82%, and a specificity of 81.63%. We further explored the correlations between biomarkers and ILD. Remarkably, we found no correlation between the KL-6 or CA19-9 level and disease activity, indicating that KL-6 and CA19-9 may be predictors independent of disease activity and might be involved in the pathogenesis of ILD rather than RA. Compared to the other biomarkers, KL-6 has superior diagnostic value.

Last but not least, the diagnosis of ILD usually depends on HRCT, PFT, and lung ultrasound (LUS). HRCT can identify even subtle ILD changes and monitor existing disease. However, radiation exposure and high cost restrict its use for screening and monitoring purposes [9]. PFT, especially forced vital capacity and diffusing capacity for carbon monoxide, could help guide management strategies. However, its role in screening for early asymptomatic ILD is controversial due to low sensitivity and poor repeatability [38]. Over the past two decades, LUS has developed into a promising tool for assessing lung parenchymal disease by detecting and quantifying the number of B lines. However, adequate theoretical and practical training are prerequisites for LUS use. In addition, accurate results require more scanning sites and more time [39]. At first glance, the combination described in this study is based on the measurement of several different blood parameters, which may raise feasibility issues. However, the quantitative measurements of KL-6, D-dimer, and tumor markers in the blood can be performed easily and rapidly in most laboratories. In addition, the inherent characteristics of blood biomarkers (non-ionizing, non-invasive, low-cost, repeatable, and easily accessible) make the combination a possible initial screening tool for RA-ILD that could help clinicians determine whether ILD is present in RA patients [40]. Although the model is logical and easy to use, it still has some shortcomings. In the selection of biomarkers and the development of models, a hold-out test set or an external validation cohort should be employed to validate our findings, which would greatly improve the rigor and accuracy of the study; however, the small sample size limited this in the present study. Therefore, prospective studies in larger cohorts need to be performed to verify the predictive value of the models.


In conclusion, we used novel tools to identify biomarkers associated with ILD in an RA cohort. Integration of traditional biostatistical methods with emerging machine learning algorithms yielded a simple model predicting RA-ILD, which may provide a new direction for future studies on the diagnosis of ILD and could also be generalized to predict the involvement of other organs.

Availability of data and materials

All data generated or analyzed during this study are available from the corresponding author on reasonable request.



Abbreviations

RA: Rheumatoid arthritis
ILD: Interstitial lung disease
HRCT: High-resolution computed tomography
KL-6: Krebs von den Lungen-6
CTD-ILD: Connective tissue disease-related interstitial lung disease
CA: Carbohydrate antigen
ACR: American College of Rheumatology
TJC: Tender joint count
SJC: Swollen joint count
DAS28: Disease activity score 28
WBC: White blood cell
RBC: Red blood cell
Hb: Hemoglobin
PLT: Platelet
LYMPH: Lymphocyte
NEUT: Neutrophil
ESR: Erythrocyte sedimentation rate
CRP: C-reactive protein
ALT: Alanine transaminase
AST: Aspartate aminotransferase
TP: Total protein
ALB: Albumin
GLO: Globulin
LDH: Lactate dehydrogenase
HBDH: Hydroxybutyrate dehydrogenase
Ig: Immunoglobulin
RF: Rheumatoid factor
ANA: Anti-nuclear antibodies
APF: Anti-perinuclear factor
AKA: Anti-keratin antibodies
MCV: Anti-mutated citrulline vimentin
CCP: Anti-cyclic citrullinated peptide antibody
FDP: Fibrinogen degradation products
FIB: Fibrinogen
NSE: Neuron-specific enolase
CEA: Carcinoembryonic antigen
SCC: Squamous cell carcinoma antigen
AFP: Alpha-fetoprotein
LASSO: Least absolute shrinkage and selection operator
RF: Random forest
PLS: Partial least square
ROC: Receiver operating characteristic
AUC: Area under curve
+LR: Positive likelihood ratio
-LR: Negative likelihood ratio
mTOR: Mammalian target of rapamycin


  1. Sparks JA. Rheumatoid Arthritis. Ann Intern Med. 2019;170:ITC1–16.

  2. Minichiello E, Semerano L, Boissier MC. Time trends in the incidence, prevalence, and severity of rheumatoid arthritis: a systematic literature review. Joint Bone Spine. 2016;83:625–30.

  3. Conforti A, Di Cola I, Pavlych V, Ruscitti P, Berardicurti O, Ursini F, et al. Beyond the joints, the extra-articular manifestations in rheumatoid arthritis. Autoimmun Rev. 2021;20:102735.

  4. Suda T. Up-to-date information on rheumatoid arthritis-associated interstitial lung disease. Clin Med Insights Circ Respir Pulm Med. 2016;9:155–62.

  5. Wang Y, Chen S, Zheng S, Lin J, Hu S, Zhuang J, et al. The role of lung ultrasound B-lines and serum KL-6 in the screening and follow-up of rheumatoid arthritis patients for an identification of interstitial lung disease: review of the literature, proposal for a preliminary algorithm, and clinical application to cases. Arthritis Res Ther. 2021;23:212.

  6. Ellman P, Ball RE. Rheumatoid disease with joint and pulmonary manifestations. Br Med J. 1948;2:816–20.

  7. Hyldgaard C, Hilberg O, Pedersen AB, Ulrichsen SP, Løkke A, Bendstrup E, et al. A population-based cohort study of rheumatoid arthritis-associated interstitial lung disease: comorbidity and mortality. Ann Rheum Dis. 2017;76:1700–6.

  8. England BR, Hershberger D. Management issues in rheumatoid arthritis-associated interstitial lung disease. Curr Opin Rheumatol. 2020;32:255–63.

  9. Picano E, Semelka R, Ravenel J, Matucci-Cerinic M. Rheumatological diseases and cancer: the hidden variable of radiation exposure. Ann Rheum Dis. 2014;73:2065–8.

  10. Ishikawa N, Hattori N, Yokoyama A, Kohno N. Utility of KL-6/MUC1 in the clinical management of interstitial lung diseases. Respir Investig. 2012;50:3–13.

  11. Hu Y, Wang LS, Jin YP, Du SS, Du YK, He X, et al. Serum Krebs von den Lungen-6 level as a diagnostic biomarker for interstitial lung disease in Chinese patients. Clin Respir J. 2017;11:337–45.

  12. Wang T, Zheng XJ, Ji YL, Liang ZA, Liang BM. Tumour markers in rheumatoid arthritis-associated interstitial lung disease. Clin Exp Rheumatol. 2016;34:587–91.

  13. Zheng M, Lou A, Zhang H, Zhu S, Yang M, Lai W. Serum KL-6, CA19-9, CA125 and CEA are diagnostic biomarkers for rheumatoid arthritis-associated interstitial lung disease in the Chinese population. Rheumatol Ther. 2021;8:517–27.

  14. Wannamethee SG, Whincup PH, Lennon L, Papacosta O, Lowe GD. Associations between fibrin D-dimer, markers of inflammation, incident self-reported mobility limitation, and all-cause mortality in older men. J Am Geriatr Soc. 2014;62:2357–62.

  15. Tian M, Huang W, Ren F, Luo L, Zhou J, Huang D, et al. Comparative analysis of connective tissue disease-associated interstitial lung disease and interstitial pneumonia with autoimmune features. Clin Rheumatol. 2020;39:575–83.

  16. Deo RC. Machine learning in medicine. Circulation. 2015;132:1920–30.

  17. Gawehn E, Hiss JA, Schneider G. Deep learning in drug discovery. Mol Inform. 2016;35:3–14.

  18. Robinson GA, Peng J, Dönnes P, Coelewij L, Naja M, Radziszewska A, et al. Disease-associated and patient-specific immune cell signatures in juvenile-onset systemic lupus erythematosus: patient stratification using a machine-learning approach. Lancet Rheumatol. 2020;2:e485–96.

  19. Arnett FC, Edworthy SM, Bloch DA, McShane DJ, Fries JF, Cooper NS, et al. The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis. Arthritis Rheum. 1988;31:315–24.

  20. Wells G, Becker JC, Teng J, Dougados M, Schiff M, Smolen J, et al. Validation of the 28-joint Disease Activity Score (DAS28) and European League Against Rheumatism response criteria based on C-reactive protein against disease progression in patients with rheumatoid arthritis, and comparison with the DAS28 based on erythrocyte sedimentation rate. Ann Rheum Dis. 2009;68:954–60.

  21. Zamora-Legoff JA, Krause ML, Crowson CS, Ryu JH, Matteson EL. Patterns of interstitial lung disease and mortality in rheumatoid arthritis. Rheumatology (Oxford). 2017;56:344–50.

  22. Nannini C, Medina-Velasquez YF, Achenbach SJ, Crowson CS, Ryu JH, Vassallo R, et al. Incidence and mortality of obstructive lung disease in rheumatoid arthritis: a population-based study. Arthritis Care Res (Hoboken). 2013;65:1243–50.

  23. Mori S, Koga Y, Sugimoto M. Different risk factors between interstitial lung disease and airway disease in rheumatoid arthritis. Respir Med. 2012;106:1591–9.

  24. Restrepo JF, del Rincón I, Battafarano DF, Haas RW, Doria M, Escalante A. Clinical and laboratory factors associated with interstitial lung disease in rheumatoid arthritis. Clin Rheumatol. 2015;34:1529–36.

  25. Kelly CA, Saravanan V, Nisar M, Arthanari S, Woodhead FA, Price-Forbes AN, et al. Rheumatoid arthritis-related interstitial lung disease: associations, prognostic factors and physiological and radiological characteristics–a large multicentre UK study. Rheumatology (Oxford). 2014;53:1676–82.

  26. Lee JS, Lee EY, Ha YJ, Kang EH, Lee YJ, Song YW. Serum KL-6 levels reflect the severity of interstitial lung disease associated with connective tissue disease. Arthritis Res Ther. 2019;21:58.

  27. Kass DJ, Nouraie M, Glassberg MK, Ramreddy N, Fernandez K, Harlow L, et al. Comparative profiling of serum protein biomarkers in rheumatoid arthritis-associated interstitial lung disease and idiopathic pulmonary fibrosis. Arthritis Rheumatol. 2020;72:409–19.

  28. Fotoh DS, Helal A, Rizk MS, Esaily HA. Serum Krebs von den Lungen-6 and lung ultrasound B lines as potential diagnostic and prognostic factors for rheumatoid arthritis-associated interstitial lung disease. Clin Rheumatol. 2021;40:2689–97.

  29. Bao Y, Zhang W, Shi D, Bai W, He D, Wang D. Correlation between serum tumor marker levels and connective tissue disease-related interstitial lung disease. Int J Gen Med. 2021;14:2553–60.

  30. Shi L, Han XL, Guo HX, Wang J, Tang YP, Gao C, et al. Increases in tumor markers are associated with primary Sjögren’s syndrome-associated interstitial lung disease. Ther Adv Chronic Dis. 2020;11:2040622320944802.

  31. Strieter RM, Mehrad B. New mechanisms of pulmonary fibrosis. Chest. 2009;136:1364–70.

  32. Obayashi Y, Fujita J, Nishiyama T, Yoshinouchi T, Kamei T, Yamadori I, et al. Role of carbohydrate antigens sialyl Lewis (a) (CA19-9) in bronchoalveolar lavage in patients with pulmonary fibrosis. Respiration. 2000;67:146–52.

  33. Ishikawa G, Acquah SO, Salvatore M, Padilla ML. Elevated serum D-dimer level is associated with an increased risk of acute exacerbation in interstitial lung disease. Respir Med. 2017;128:78–84.

  34. Zha X, Wang F, Wang Y, He S, Jing Y, Wu X, et al. Lactate dehydrogenase B is critical for hyperactive mTOR-mediated tumorigenesis. Cancer Res. 2011;71:13–8.

  35. Qin Y, Gao C, Luo J. Metabolism characteristics of Th17 and regulatory T cells in autoimmune diseases. Front Immunol. 2022;13:828191.

  36. Gokey JJ, Sridharan A, Xu Y, Green J, Carraro G, Stripp BR, et al. Active epithelial hippo signaling in idiopathic pulmonary fibrosis. JCI Insight. 2018;3:e98738.

  37. Kegerreis B, Catalina MD, Bachali P, Geraci NS, Labonte AC, Zeng C, et al. Machine learning approaches to predict lupus disease activity from gene expression data. Sci Rep. 2019;9:9617.

  38. Suliman YA, Dobrota R, Huscher D, Nguyen-Kim TD, Maurer B, Jordan S, et al. Brief report: pulmonary function tests: high rate of false-negative results in the early detection and screening of scleroderma-related interstitial lung disease. Arthritis Rheumatol. 2015;67:3256–61.

  39. Volpicelli G, Elbarbary M, Blaivas M, Lichtenstein DA, Mathis G, Kirkpatrick AW, et al. International evidence-based recommendations for point-of-care lung ultrasound. Intensive Care Med. 2012;38:577–91.

  40. Strimbu K, Tavel JA. What are biomarkers? Curr Opin HIV AIDS. 2010;5:463–6.


Funding

This research was supported by the Nature Fund Projects of Shanxi Science and Technology Department (201901D111377), the Scientific Research Project of Health Commission of Shanxi Province (2019044), the Research Project Supported by Shanxi Scholarship Council of China (2020–191), and the Science and Technology Innovation Project of Shanxi Province (2020SYS08).

Author information

Authors and Affiliations



Contributions

YQ developed and wrote the review. YLW performed data extraction and quality assessment. FXM contributed to the analysis and interpretation of data. Min Feng and Xiangcong Zhao participated in the statistical analysis. CG provided significant revisions to the manuscript. JL generated themes, guided, and edited the manuscript. All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published.

Corresponding author

Correspondence to Jing Luo.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the ethics committee of the Second Hospital of Shanxi Medical University (2016KY007). Informed consent was obtained from all individuals.

Consent for publication

All participants gave their informed consent to publication.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Supplemental Table 1. Performance of multiple machine learning classifiers to predict RA-ILD by 10-fold cross-validation. Supplemental Table 2. Comparisons between smoking and RA-ILD.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit the Creative Commons website. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Qin, Y., Wang, Y., Meng, F. et al. Identification of biomarkers by machine learning classifiers to assist diagnose rheumatoid arthritis-associated interstitial lung disease. Arthritis Res Ther 24, 115 (2022).
