Prediction model for knee osteoarthritis based on genetic and clinical information

Introduction: Osteoarthritis (OA) is the most common bone and joint disease influenced by genetic and environmental factors. Recent association studies have uncovered the genetic factors behind OA, its susceptibility genes, which would enable us to predict disease occurrence based on genotype information. However, most previous studies have evaluated the effects of only a single susceptibility gene, and hence prediction based on such information is not as reliable. Here, we constructed OA-prediction models based on genotype information from a case-control association study and tested their predictability.
Methods: We genotyped risk alleles of the three susceptibility genes, asporin (ASPN), growth differentiation factor 5 (GDF5), and double von Willebrand factor A domains (DVWA) for a total of 2,158 Japanese subjects (933 OA and 1,225 controls) and statistically analyzed their effects. After that, we constructed prediction models by using the logistic regression analysis.
Results: When the effects of each allele were assumed to be the same and multiplicative, each additional risk allele increased the odds ratio (OR) by a factor of 1.23 (95% confidence interval (CI), 1.12 to 1.34). Individuals with five or six risk alleles showed significantly higher susceptibility when compared with those with zero or one, with an OR of 2.67 (95% CI, 1.46 to 4.87; P = 0.0020). Statistical evaluation of the prediction power of models showed that a model using only genotyping data had poor predictability. We obtained a model with good predictability by incorporating clinical data, which was further improved by rigorous age adjustment.
Conclusions: Our results showed that consideration of adjusted clinical information, as well as increases in the number of risk alleles to be integrated, is critical for OA prediction by using data from case-control studies. To the authors' knowledge, this is the first report of the OA-prediction model combining both genetic and clinical information.


Introduction
Osteoarthritis (OA) is the most common bone and joint disease and is characterized by progressive cartilage degeneration. OA is a polygenic disease caused by genetic and environmental factors [1]. Epidemiologic studies have suggested that genetic factors strongly affect the onset and development of OA [2]. Genetic association studies are now uncovering the genetic factors responsible for OA, that is, its susceptibility genes. Candidate-gene approaches have identified several genes associated with OA, and genome-wide association studies have recently found several promising OA-susceptibility genes [1,3].
Identification of OA-susceptibility genes would enable us to predict disease phenotypes based on genotype information. Such predictions would be a basis for personalized medicine and disease prevention. However, most previous OA genetic studies have evaluated only single susceptibility genes, whose individual effects are mostly small, and hence predictions based on their results are of limited use. An analysis of the combinatorial effects of the calmodulin 1 gene (CALM1) and other genes on hip OA susceptibility reported a combinatorial association between the CALM1 and ASPN susceptibility alleles [4]; however, the association was not conclusive because of the small sample size of the study. Because OA is a polygenic disease, evaluation of the combined effects of susceptibility genes, as well as of the interplay between susceptibility genes and environmental/clinical factors, is important.
To obtain a more comprehensive view of knee OA susceptibility and to build a good prediction model for OA by using genotype information, we investigated the combined effects of known susceptibility genes for knee OA in Japanese persons. We examined gene-gene interactions of the previously reported OA genes and estimated the population attributable risk (PAR) statistic of the genes by using data from a case-control association study [5]. We constructed prediction models for OA by using genetic data only and by using a combination of genetic and clinical data, and evaluated their predictability statistically. The model using only genotype information had less than fair predictability, but integration of clinical data with adjustment markedly improved the prediction power.

Subjects
A case-control cohort (692 cases and 748 controls) was recruited from patients of several medical institutes in Japan. A population-based cohort (241 cases and 477 controls) was recruited from inhabitants of Odai and Minami-ise town in the Mie prefecture in Japan. We obtained written informed consent from each subject, and the study was approved by the ethics committee of the Center for Genomic Medicine in RIKEN. In total, 933 knee OA (81.1% female; mean age ± standard deviation (SD) = 71.8 ± 7.7 years) and 1,225 control subjects (75.4% female; mean age ± SD = 69.3 ± 9.2 years) were included. OA was diagnosed based on clinical and radiographic findings by using previously described criteria [6]. All OA patients were older than 40 years.

Statistical analysis
We used the software R [9] for all statistical analyses. To obtain PAR for multiple risk factors from case-control study data, we used an OR assessed by the logistic regression model adjusted for gender and age [10].

Prediction model
We used logistic regression to construct the prediction models. This method assumes a multiplicative allelic effect (each allele independently increases the odds of the disease). We used the Cochran-Armitage trend test to examine whether the odds increased with the number of risk alleles. For the logistic regression model with clinical data, variables were retained by stepwise selection based on the likelihood ratio and the Wald statistics [11,12]. We used the receiver operating characteristic (ROC) curve to evaluate the predictive ability of the models. The area under the curve (AUC), which measures the power to distinguish case from control individuals, was calculated by using the ROCR package [13]. A perfect model would have an AUC of 1, whereas a model with no discriminative power would have an AUC of 0.5 [14]. The simulation study for the adjusted control group was performed 1,000 times.
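As a rough illustration of the AUC described above (not the ROCR implementation used in this study), the AUC can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The following Python sketch assumes this pairwise definition; the function name and the example scores are hypothetical.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly.

    A pair counts 1 if the case has the higher score, 0.5 if tied,
    and 0 otherwise (equivalent to the Mann-Whitney U statistic
    divided by the number of pairs).
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities, for illustration only.
cases = [0.9, 0.7, 0.6, 0.4]
controls = [0.8, 0.5, 0.3, 0.2]
print(auc(cases, controls))
```

The pairwise loop is quadratic in the number of subjects, which is acceptable for a sketch; a rank-based computation would be preferable for large samples.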

Estimation of contribution of the genetic factor
We considered a statistical model for prediction as follows:
P = G + E,
where P, G, and E denote the phenotype (liability), the genetic factor, and the environmental/clinical factor, respectively. By taking variances on the liability scale for both sides of the equation, this model becomes:
Var(P) = Var(G) + Var(E) + 2Cov(G,E).
As we found no evidence of gene-environmental interactions, we could assume that Cov(G,E) = 0. We also estimated the contribution of genetic factors in our models by calculating the average proportion of variation, Var(G)/Var(P), for 1,000 simulation studies.
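To make the variance decomposition concrete, the sketch below splits the variance of a liability P = G + E into a genetic share, an environmental share, and a covariance share. It assumes that G and E are available per subject, for example as the genetic and clinical parts of a fitted linear predictor; the text does not state how these components were obtained, so that reading, the function name, and the toy values are all assumptions.

```python
from statistics import mean, pvariance


def variance_shares(g, e):
    """Split Var(P), with P = G + E, into fractional contributions.

    Returns the shares Var(G)/Var(P), Var(E)/Var(P), and
    2Cov(G,E)/Var(P); the three values sum to 1.
    """
    p = [gi + ei for gi, ei in zip(g, e)]
    var_p = pvariance(p)
    mg, me = mean(g), mean(e)
    cov = sum((gi - mg) * (ei - me) for gi, ei in zip(g, e)) / len(g)
    return {
        "genetic": pvariance(g) / var_p,
        "environmental": pvariance(e) / var_p,
        "covariance": 2.0 * cov / var_p,
    }


# Toy per-subject components (hypothetical, chosen so that Cov(G,E) = 0).
g = [0.0, 1.0, 0.0, 1.0]
e = [0.0, 0.0, 2.0, 2.0]
print(variance_shares(g, e))
```

When the covariance share is close to zero, the genetic and environmental shares alone account for the total variance, which matches the assumption Cov(G,E) = 0.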

Selection of susceptibility genes
We selected three susceptibility genes, ASPN [7], GDF5 [8], and DVWA [6], for study because they had clear associations in Japanese subjects and because the associations were replicated in other ethnic populations and supported by functional evidence for OA causality [1,3]. We obtained clinical data (Table 1) and genotyping data (Table 2) for 933 knee OA cases and 1,225 controls and examined whether the previously reported risk alleles were associated with knee OA in this population after genotyping. We observed significant associations in GDF5 and DVWA and marginal associations in ASPN (Table 3).

Association between number of risk alleles and knee OA
We tested the independence between risk alleles for all pair-wise combinations by applying a χ² association test to 3 × 3 tables in both case-only and case-control mixed designs. We did not find evidence of dependence. We next examined gene-gene interactions by using the logistic regression model: we compared, by analysis of variance, two logistic regression models with and without the gene-gene interaction term. After Bonferroni correction (significance threshold P = 0.05/3 for three pair-wise combinations among three genes), we did not find any evidence of interaction (Table 4). We then studied the association between knee OA and the number of risk alleles possessed by a subject (Table 5). The distributions of the subjects over the number of risk alleles differed between the case and control groups. The odds ratio adjusted for age, gender, and body mass index (BMI) increased with the number of risk alleles (Figure 1) (P = 4.15e-6). As only 0.7% of subjects in the control group carried zero risk alleles, we selected subjects with either zero or one risk allele as the reference (lowest-risk) group. The proportion of this group was 10.6%. Similarly, we combined subjects with five or six risk alleles to construct the highest-risk group. This group had an OR of 2.67 (95% CI, 1.46 to 4.87; P = 0.0020), as compared with the reference group.
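The trend in risk across the number of risk alleles can be illustrated with the Cochran-Armitage trend test named in the methods. The sketch below is an unadjusted version with linear scores (0, 1, 2, ...), so it does not reproduce the age-, gender-, and BMI-adjusted analysis reported here; the function name and the counts are hypothetical.

```python
import math


def cochran_armitage(cases, totals, scores=None):
    """Two-sided Cochran-Armitage trend test for ordered groups.

    cases[i]  : number of cases in group i
    totals[i] : number of subjects (cases + controls) in group i
    scores[i] : ordinal score of group i (defaults to 0, 1, 2, ...)
    Returns (z, p) using the normal approximation.
    """
    if scores is None:
        scores = list(range(len(totals)))
    n = sum(totals)
    p_bar = sum(cases) / n  # overall proportion of cases
    # Score-weighted deviation of observed from expected case counts.
    t_stat = sum(s * (c - m * p_bar) for s, c, m in zip(scores, cases, totals))
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    var = p_bar * (1.0 - p_bar) * (s2 - s1 * s1 / n)
    z = t_stat / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p


# Hypothetical counts for three ordered risk groups.
z, p = cochran_armitage(cases=[10, 20, 30], totals=[50, 50, 50])
print(z, p)
```

A positive z indicates that the proportion of cases rises with the group score.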

Estimation of PAR
To investigate the contribution of genetic factors to the onset of OA, we estimated the PAR statistic [5]. PAR is the proportion of cases in the population that would be avoided if certain risk alleles were removed from the population. Hence, PAR provides a measure of how much a certain factor contributes to the disease. We compared the reference group with the other groups and obtained an estimated PAR of 31.4% for the risk alleles of the three susceptibility genes. We also considered PAR for BMI because the Framingham study indicated that a higher BMI may increase the OA risk [2]. We classified the BMIs of all subjects into three categories (normal/underweight, <25; overweight, 25 to 30; and obese, ≥30) [15]. Setting normal/underweight as the reference group, we estimated the PAR of BMI as 28.4%.
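One common way to compute PAR from case-control data with several exposure levels uses the distribution of cases across levels together with the adjusted OR of each level relative to the reference. The sketch below assumes that formula, PAR = 1 − Σ (proportion of cases in level j / OR of level j), with the OR standing in for the relative risk; whether this is exactly the estimator used in this study is an assumption, and the function name and input values are hypothetical.

```python
def population_attributable_risk(case_proportions, odds_ratios):
    """PAR for a multi-level exposure.

    case_proportions[j] : fraction of cases falling in exposure level j
                          (must sum to 1; level 0 is the reference)
    odds_ratios[j]      : OR of level j versus the reference (1.0 for level 0)
    """
    if abs(sum(case_proportions) - 1.0) > 1e-9:
        raise ValueError("case proportions must sum to 1")
    return 1.0 - sum(p / o for p, o in zip(case_proportions, odds_ratios))


# Hypothetical three-level exposure: reference, intermediate, high.
par = population_attributable_risk([0.2, 0.5, 0.3], [1.0, 1.5, 2.5])
print(par)
```

For a single binary exposure, this reduces to the familiar expression p(OR − 1)/OR, where p is the proportion of exposed subjects among the cases.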

Application for prediction of OA susceptibility
We generated OA-prediction models by using genetic data, clinical data, or both. We first examined the interactions between genetic and clinical factors and found no evidence of such interactions or of gene-gene interactions. We next considered two logistic regression models: the first model (MODEL I) was constructed with only the number of risk alleles for the three susceptibility genes; the second model (MODEL II) also incorporated the clinical information of individuals, including gender, age, and BMI. In MODEL I and MODEL II, each additional risk allele increased the OR by a factor of 1.23 (95% CI, 1.12 to 1.34) and 1.22 (95% CI, 1.11 to 1.34), respectively. For MODEL II, among the contributions of gender, age, and BMI, that of gender was 1.35 for female; in the model equation, the value of gen is 1 or 2 if the subject is male or female, respectively. We drew ROC curves and evaluated the predictive power of the models by AUC [16]. AUCs of MODEL I and MODEL II were 0.554 and 0.685, respectively (Figure 2).
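Under the multiplicative assumption, the OR for a carrier of k risk alleles relative to a carrier of none is the per-allele OR raised to the power k. The short sketch below works this out for the per-allele OR of 1.23 reported for MODEL I; the function name is hypothetical, and the values are relative to zero risk alleles, so they are not directly comparable with the grouped comparison (five or six versus zero or one) reported earlier.

```python
def odds_ratio_for_alleles(per_allele_or, n_alleles):
    """OR versus zero risk alleles when each allele multiplies the odds equally."""
    return per_allele_or ** n_alleles


# Per-allele OR of 1.23 (MODEL I); a subject carries 0 to 6 risk alleles.
for k in range(7):
    print(k, round(odds_ratio_for_alleles(1.23, k), 2))
```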
To examine the contribution of clinical factors, we constructed a model by using only clinical data (MODEL III). We obtained an AUC of 0.678, which was very similar to that of MODEL II; however, comparison between MODEL II and MODEL III by a likelihood ratio test showed a significant difference (P = 1.67e-5), indicating that MODEL II was superior to MODEL III.
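A likelihood ratio test between two nested logistic regression models compares twice the difference in their maximized log-likelihoods with a χ² distribution. The sketch below assumes that the two models differ by a single parameter (for example, the number of risk alleles), giving one degree of freedom; that assumption, the function name, and the log-likelihood values are hypothetical and are not taken from the fitted models.

```python
import math


def lrt_pvalue_one_df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p), where statistic = 2 * (logL_full - logL_reduced)
    and p is the upper tail of a chi-square distribution with 1 df,
    computed as erfc(sqrt(statistic / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    if stat < 0:
        raise ValueError("the full model cannot have a lower log-likelihood")
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods of a reduced and a full model.
stat, p = lrt_pvalue_one_df(loglik_reduced=-100.0, loglik_full=-98.0)
print(stat, p)
```

A small P value indicates that the additional term improves the fit, even when the AUCs of the two models are very similar.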
To examine the validity of the prediction model, we performed a validation study as follows. We divided our subjects into two sets: the population-based cohort (718 subjects) and the remaining case-control subjects (1,440 subjects). By using data from the latter set, we constructed MODEL II and applied the model to the data of the former set. Consequently, we obtained an AUC of 0.717 for the validation study, whereas the AUC for the original study was 0.649.

Adjustment of clinical data for the control group
A prediction model is generally interpreted to be excellent, good, or fair when its AUC is 0.9 to 1.0, 0.8 to 0.9, or 0.7 to 0.8, respectively [17]. By this standard, our models were poor. This was partly because the proportion of male subjects in our control group was only 0.246, which does not reflect that of the general Japanese population. Hence, for constructing MODEL II, we adjusted the age by selecting subjects from the case and control groups who were 60 years or older. Mean ages ± SD (years) of the case and control groups were 73.2 ± 6.2 and 72.0 ± 6.8, respectively. We then constructed a control group by randomly sampling 171 male and 215 female subjects, which gave the same male-to-female ratio as that in the Japanese population in 2005. We also selected 400 case subjects at random and applied the logistic regression model to these subjects. Through the simulation studies, we obtained a mean AUC of 0.867. The curve for MODEL II was drawn through 1,000 simulation studies with rigorous age adjustment for control groups that reflected the Japanese elderly population. MODEL II had a significantly improved AUC. Curves were drawn by using the ROCR package.
The distribution of subject ages in our control group was slightly different from that of the general Japanese population. Thus, we selected 128 male and 163 female subjects so that the subjects in the control group had an age distribution equivalent to that of the Japanese population in 2005. We then randomly selected 291 cases and applied the logistic regression model. The simulation studies showed a mean AUC of 0.742 for MODEL II (Figure 2). For this model, we estimated the contribution of genetic factors by calculating the proportion of variation and found it very low (0.060), which indicated a nonsignificant contribution of the genetic factor in the model. Indeed, we obtained a mean AUC of 0.735 for MODEL III. To confirm the prediction models, we also performed cross-validation studies and obtained AUCs of 0.677 and 0.671, respectively, which did not differ from the original ones.
Our case subjects mainly consisted of women (81.1%), and epidemiologic studies have reported that women have a higher risk of OA than do men [2]. Hence, we also applied our models separately to male and female subjects. With age adjustment for control subjects, we used 128 cases and 128 controls for men, and 462 cases and 462 controls for women. For men and women, we obtained AUCs of 0.737 and 0.696, respectively, and estimated the contribution of the genetic factor to be 0.110 and 0.091, respectively. The increased proportions of genetic factors in these models were attributed to the fact that the models did not contain gender information.
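The repeated sampling of control groups with a fixed number of men and women can be sketched as follows. This is only an outline of the resampling loop: the record layout, the function names, the toy data, and the placeholder statistic are hypothetical, and in the actual analysis the statistic would be the AUC of the logistic regression model fitted to each sampled data set.

```python
import random


def sample_controls(controls, n_male, n_female, rng):
    """Draw controls without replacement to match a target sex composition."""
    males = [c for c in controls if c["sex"] == "M"]
    females = [c for c in controls if c["sex"] == "F"]
    return rng.sample(males, n_male) + rng.sample(females, n_female)


def mean_statistic(controls, cases, n_male, n_female, n_cases,
                   statistic, n_sim=1000, seed=0):
    """Average a statistic over repeated random samples of cases and controls."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        ctrl = sample_controls(controls, n_male, n_female, rng)
        case = rng.sample(cases, n_cases)
        total += statistic(case, ctrl)
    return total / n_sim


# Toy records (hypothetical); the statistic here is a simple age difference.
controls = ([{"sex": "M", "age": 60 + i} for i in range(10)]
            + [{"sex": "F", "age": 60 + i} for i in range(20)])
cases = [{"sex": "F", "age": 65 + i} for i in range(15)]


def age_gap(case, ctrl):
    """Mean age of sampled cases minus mean age of sampled controls."""
    return (sum(c["age"] for c in case) / len(case)
            - sum(c["age"] for c in ctrl) / len(ctrl))


print(mean_statistic(controls, cases, 3, 4, 5, age_gap, n_sim=100))
```

Fixing the seed makes the simulated mean reproducible; varying it gives a sense of the Monte Carlo error around the reported average.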

Discussion
Classic twin studies on OA heritability suggested that more than half of the variation in OA susceptibility can be explained by genetic factors [18]. Therefore, we first tested a model that considered genotype information only. The prediction model had low predictability; its AUC was only 0.554. The insufficient power of the model was partly due to the small number of integrated risk alleles. It has been reported that 20 to 25 risk alleles with frequencies greater than 0.1 and OR values of 1.5 are required to obtain an AUC of about 0.8 [19]. A previous study presented a model similar to our MODEL I and reported the additive effect of candidate genes associated with OA [20].
Based on our PAR analysis, the contribution of genetic factors to OA from the three susceptibility genes was similar to that of BMI (31.4% versus 28.4%). The proportion of variation, that is, the estimated contribution of genetic factors in our prediction model, was very low when compared with that in the previous twin studies [18]. Finding additional risk variants and integrating them into the model is necessary to increase the power of the model.
To improve the predictability of the model, we integrated clinical information, which increased the AUC to 0.685. We further modified the model by adjusting the age because our control group was composed of younger subjects, as compared with the case group. After this adjustment, the AUC increased to 0.742. The effect of the age difference between the case and control groups was confirmed by simulating the control group with a random sampling that considered only the male-to-female ratio and by selecting case subjects under the same conditions. The logistic regression model applied to these data produced an AUC of 0.867. These results showed that even slight differences in age between the case and control groups could inflate the mean AUC. Thus, the distribution of age in the control group was influential in constructing a prediction model, underscoring the importance of age adjustment for the control group.
Among the three susceptibility genes, we found no evidence of interaction. We also found no evidence of interaction between the four clinical factors. We are uncertain whether the lack of interaction is genuine for the factors that we examined or whether it is due to the limited power of our study resulting from the sample size. It was previously shown that GDF5 may contribute to the variation in height with an estimated additive effect [21]; however, we did not find such an association in our study. Both gene-gene and gene-environment interactions are expected to play roles in common diseases, and we may therefore be able to increase the predictive power by finding the yet-unidentified variants that interact with clinical-environmental factors.
Our prediction model for knee OA was constructed based on data from a case-control association study. In type 2 diabetes, some prediction models based on case-control study data have incorporated the number of risk alleles and clinical data [17,22,23]. These approaches have achieved successful outcomes; however, unlike in a cohort study, data from control subjects in a case-control study may not reflect the true age distribution of the target population. When susceptibility to disease varies with age, it is necessary to consider the age distribution of the control group. To our knowledge, our approach, with adjustment of clinical information, is the first to construct prediction models from case-control studies while considering this problem. Our approach can compare prediction models derived from genetic data, clinical data, and both, and can estimate the contribution of genetic factors for the last model. It would be useful in future preventive measures against not only OA, but also other common polygenic diseases.

Conclusions
To our knowledge, this study is the first report of an OA-prediction model combining genetic and clinical information from a case-control association study. Our prediction model using genotype information from three susceptibility genes had poor predictability; however, predictability improved significantly when clinical data were incorporated and adjusted. In the current model, the contribution of genetic factors is small. More OA-risk polymorphisms need to be identified and integrated into the model to achieve better prediction.