Instruments
WOMAC
The WOMAC [18] is a 24-item questionnaire designed for use in lower extremity OA research. We used version LK3.0, with a 48-hour time frame and Likert scale. In this study, we did not use the two stiffness items from the WOMAC scale.
OA-FUNCTION-CAT item bank development
Literature review
We performed a comprehensive literature review to yield functional activity instruments relevant to hip and knee OA, hand-searching the references provided in each paper to identify additional sources. We contacted the instrument developers to obtain the instruments and compiled the items as a resource for developing the preliminary item bank.
Patient focus groups
Experienced moderators conducted six semi-structured focus groups, each consisting of five or six patients with hip or knee OA, exploring patients' views on important outcomes for OA research. The sessions were audio-taped and transcripts were content-analyzed.
Clinician focus groups
We held three multi-disciplinary focus groups that included five or six clinicians with extensive expertise in the treatment of patients with OA.
Cognitive testing
The entire item bank was subjected to cognitive testing to discover problems with any items that would reduce instrument performance. Two groups of five or six adult patients with hip or knee OA were asked to read the instructions for clarity and assess a sample of the items for clarity and relevance. Cognitive testers asked standardized probe questions to identify difficulty in reading or comprehending instructions or items.
When focus group participants identified functional activities not covered by the item bank, new items were written. Further revisions were made based on cognitive testing results. The final item bank consisted of 125 functional activities commonly affected by hip or knee OA (Additional data file 1). The final rating scale asked the subject to report the amount of difficulty s/he had in doing each function as (a) none, (b) a little, or (c) a lot. Subjects also reported their pain severity in doing each activity as (a) none, (b) mild or moderate, or (c) severe. For those activities that a subject did not do, s/he reported whether (d) s/he did not do the activity because of the arthritis in her/his legs or (e) s/he did not do an activity for reasons other than the arthritis in her/his legs. The time frame was 'on an average day over the past month'. In previous work, we found that IRT models fit better when response categories are more distinct and item characteristic curves do not overlap or become disordered due to small frequencies of individual rating categories [19].
Study sample
We recruited a convenience sample of 328 adults from the greater Boston area with confirmed OA of the knee and/or hip from a pool of patients who had previously participated in OA research and from the practice of a local orthopedic surgeon. In all cases, confirmation of disease included evidence of OA on x-rays and frequent pain in the joint. For the knee joint, the x-ray protocol included either posteroanterior fluoroscopically positioned or metatarsophalangeal view (all semi-flexed) and they were read for the presence of a definite osteophyte. In most of the knees, lateral and/or skyline views were obtained to evaluate the patellofemoral joint. For the hip joint, the patient received a diagnosis of hip OA if any of the following were present: joint space narrowing, subchondral sclerosis, osteophytes, subchondral cyst, or symptomatic acetabular dysplasia.
Data collection
Eligibility was determined by telephone interview and included age of at least 18 years, English-speaking, pain or stiffness in the knee or hip within the prior month, radiographic evidence of a definite osteophyte for the knee or hip or joint space narrowing for the hip, or confirmation from the subject of a physician's diagnosis of knee or hip OA. Subjects were excluded if they had been diagnosed with rheumatoid arthritis, systemic lupus erythematosus, gout, or psoriatic arthritis or used a wheelchair to move about in their home. To ensure that we included a wide range of functional ability in the sample, we used the physical function scale of the short form-36 health survey (SF-36) to estimate and stratify subjects by functional level.
The OA-FUNCTION-CAT item bank and WOMAC were administered to the subjects in their homes by trained interviewers. We addressed potential order effects by counterbalancing the order of instrument administration. Demographic information (age, gender, ethnicity, race, education, and living and housing status) was collected for each subject, and gender-specific items were administered to subjects of the appropriate gender. For focus groups, cognitive testing, and the calibration study, informed consent was obtained before participation, and all procedures were approved by the Institutional Review Board at Boston University.
OA-FUNCTION-CAT structure/domains
The underlying structure of functional pain and functional difficulty items was assessed using a series of confirmatory factor analyses [20]. We evaluated item loadings and residual correlations between items using Mplus software, version 3.12 [21]. We chose unweighted least squares (ULS) estimation based on polychoric correlation matrices and variance-adjusted estimation methods to improve the precision of our estimates given these skewed categorical data [20, 22]. For each domain, we assessed eigenvalues associated with each factor extracted. Model fit was assessed using several approaches, including the chi-square test, comparative fit index (CFI), Tucker-Lewis index (TLI), and root mean square error of approximation (RMSEA). For CFI and TLI, values range from 0 to 1, with higher values indicating better test model fit compared with a baseline model and with 0.90 or greater representing acceptable fit [23–25]. RMSEA represents misfit per degree of freedom (df), with lower values signifying better fit. Values of less than 0.05 suggest a 'very good fit', and values of around 0.08 are interpreted as 'marginal' fit. Values of greater than 0.1 are generally viewed as indicative of a 'poor fit' [26, 27]. We examined the magnitude of the factor loadings on the primary factor and considered residual correlations; those of less than or equal to 0.20 (a) suggest that the primary factor explains the correlation between items and (b) indicate acceptable fit [28]. Higher correlations indicate violation of the local independence assumption.
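To illustrate how the three descriptive indices relate to the chi-square statistics, the sketch below uses the standard textbook definitions of CFI, TLI, and RMSEA. It is a minimal sketch under stated assumptions and not the computation performed by the software used in this study: with ULS and variance-adjusted estimation, the reported chi-square values are adjusted, and the function name `fit_indices` and the example numbers are hypothetical.

```python
import math

def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Return (CFI, TLI, RMSEA) from test-model and baseline chi-square values.

    Textbook definitions; n is the sample size. Assumption: the RMSEA
    denominator uses n - 1 (some programs use n instead).
    """
    # Non-centrality estimates, floored at zero
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    cfi = 1.0 if denom == 0 else 1.0 - d_model / denom
    # TLI compares chi-square/df ratios of the baseline and test models
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    # RMSEA: misfit per degree of freedom, scaled by sample size
    rmsea = math.sqrt(d_model / (df_model * (n - 1)))
    return cfi, tli, rmsea

# Hypothetical values, chosen only to exercise the formulas
cfi, tli, rmsea = fit_indices(100.0, 50, 1066.0, 66, 328)
print(round(cfi, 3), round(tli, 3), round(rmsea, 3))
```

With these made-up inputs, CFI and TLI exceed the 0.90 criterion and RMSEA falls between the 'very good' and 'marginal' reference values described above.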
Item calibrations
The generalized partial credit model (GPCM) was used to estimate the item calibrations for each domain [29–32]. We used weighted maximum likelihood (WML) estimation to estimate IRT-based scores for the functional pain and functional difficulty domains [22, 33]. Item fit was evaluated using the likelihood ratio chi-square statistic (G²) for each item based on the comparison of expected and observed values across the distribution of the two domains. The likelihood ratio chi-square statistic for the whole test was examined to verify model fit of each domain, and Bonferroni-corrected P values were used in the significance tests. We standardized the scores estimated from the IRT model with a mean of 50 and standard deviation of 10. All of the IRT analyses were performed using the software package PARSCALE [34].
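For readers unfamiliar with the GPCM, the following sketch shows how category probabilities for a three-category item (two step parameters) can be computed, and how a score can be rescaled to a mean of 50 and standard deviation of 10. It is illustrative only: the parameterization (slope `a` and step parameters without a scaling constant), the function names, and the parameter values are assumptions and do not reproduce the PARSCALE implementation or our calibrations.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities P(X = k | theta), k = 0..len(steps), under the GPCM.

    theta : person score on the latent scale
    a     : item discrimination (slope)
    steps : step parameters b_1..b_m (m + 1 response categories)
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def to_t_metric(theta, mean, sd):
    """Rescale a score to a metric with mean 50 and standard deviation 10."""
    return 50.0 + 10.0 * (theta - mean) / sd

# Hypothetical item with categories none / a little / a lot
probs = gpcm_probs(theta=0.0, a=1.0, steps=[-1.0, 1.0])
print([round(p, 3) for p in probs], to_t_metric(1.0, 0.0, 1.0))
```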
Differential item functioning
A fundamental assumption of IRT models is that a subject's score on an item should depend entirely on the subject's ability level in the relevant domain (for example, physical function) and the statistical characteristics of the item. Differential item functioning (DIF) means that, in spite of having the same underlying functional level, groups of subjects demonstrate different response probabilities, indicating that background variables (such as gender or site of OA) influenced the response [35]. More severe pain on kneeling for subjects with knee OA than for those with hip OA at the same underlying level would be an example of DIF. We tested for the presence of DIF using logistic regression with background variables assigned as the independent variable and the OA-FUNCTION-CAT item score as the dependent variable. The analytic strategy successively added functional ability levels, background variables, and interaction terms to the model, and model comparison was based on the likelihood ratio test. The effect size of the DIF was classified based on the R² change between models [36]. Uniform DIF was identified when the background effect was significant but the interaction effect with the person's functional ability level was not, whereas non-uniform DIF was identified if the interaction effect was significant.
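The decision rule for uniform versus non-uniform DIF can be sketched as follows. The sketch takes the log-likelihoods of the three nested models as inputs rather than fitting them, and it assumes that each step adds a single parameter (for example, a binary background variable), so that each likelihood ratio test has 1 degree of freedom. The function names, the significance level of 0.05, and the example log-likelihoods are hypothetical.

```python
import math

def lr_test_p(loglik_reduced, loglik_full):
    """P value of a likelihood ratio test with 1 degree of freedom."""
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Chi-square survival function for df = 1: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))

def classify_dif(ll_ability, ll_background, ll_interaction, alpha=0.05):
    """Classify DIF from the log-likelihoods of three nested models.

    ll_ability     : model with functional ability level only
    ll_background  : adds the background variable
    ll_interaction : adds the ability-by-background interaction
    """
    p_background = lr_test_p(ll_ability, ll_background)
    p_interaction = lr_test_p(ll_background, ll_interaction)
    if p_interaction < alpha:
        return "non-uniform"
    if p_background < alpha:
        return "uniform"
    return "none"

# Hypothetical log-likelihoods for one item
print(classify_dif(-200.0, -195.0, -194.9))
```

In practice, the significance test would be combined with the effect-size criterion based on the R² change, which this sketch omits.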
Development of the simulated CAT program
Having finalized the item pool and generated item calibrations for each domain, we used HDRI™ software developed at Boston University to construct the OA-FUNCTION-CAT algorithms. The CATs were programmed to use WML score estimation and to select initial items in the middle of the ability ranges for pain and function. The program fed the response to the first item into the CAT algorithm and calculated a probable score and person-specific standard error (measure of precision). Subsequent items were selected and administered by the program until the preselected maximum number of items had been administered (in our analyses, 5-, 10-, or 15-item CATs were computer-selected). IRT assumes local independence of items, meaning that a subject's responses to any pair of items are statistically independent of each other [13]. One approach to local dependence is to remove items from the item bank. Rather than eliminating the items from the item bank, we used special programming within the CAT algorithm which allowed the selection of only one item within a set of locally dependent items.
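The general flow of such a fixed-length CAT can be sketched as below. This is not the HDRI™ software and should be read as a simplified illustration under several assumptions: items are selected by maximum GPCM information at the current score estimate (the selection rule is not specified above), scores are estimated by a grid-based expected a posteriori (EAP) method with a standard normal prior in place of WML, and locally dependent items share a hypothetical `set` label so that only one item per set can be administered. All names and the data layout are illustrative.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # score grid from -4 to 4

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities for one item."""
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, item):
    """GPCM item information: a^2 times the variance of the item score."""
    p = gpcm_probs(theta, item["a"], item["steps"])
    mean = sum(k * pk for k, pk in enumerate(p))
    var = sum(k * k * pk for k, pk in enumerate(p)) - mean ** 2
    return item["a"] ** 2 * var

def eap_score(bank, responses):
    """Posterior mean and SD of the score (EAP, swapped in for WML)."""
    post = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # standard normal prior (unnormalized)
        for i, x in responses.items():
            w *= gpcm_probs(t, bank[i]["a"], bank[i]["steps"])[x]
        post.append(w)
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, max_items):
    """Administer up to max_items items; answer(i) returns the response to item i."""
    def location(i):
        steps = bank[i]["steps"]
        return sum(steps) / len(steps)
    # Initial item: the one located nearest the middle of the scale
    current = min(range(len(bank)), key=lambda i: abs(location(i)))
    responses, used_sets = {}, set()
    while True:
        responses[current] = answer(current)
        if bank[current].get("set") is not None:
            used_sets.add(bank[current]["set"])
        theta, se = eap_score(bank, responses)
        if len(responses) >= max_items:
            break
        # Skip administered items and items from an already-used dependent set
        candidates = [i for i in range(len(bank))
                      if i not in responses
                      and bank[i].get("set") not in used_sets]
        if not candidates:
            break
        current = max(candidates, key=lambda i: item_information(theta, bank[i]))
    return theta, se, list(responses)
```

A call such as `run_cat(bank, answer, 5)` returns the score estimate, its standard error, and the indices of the items administered, in order.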
Psychometric evaluation of the OA-FUNCTION-CAT
To assess psychometric performance of the OA-FUNCTION-CAT, we conducted simulations to estimate scores for three fixed-length CATs (that is, 5, 10, and 15 items) and to compare their properties with those of the full item bank and the WOMAC. To make suitable comparisons between the WOMAC and the OA-FUNCTION-CATs and item banks, we first estimated calibrations for one instrument and then converted the other to the same scale, essentially calibrating the WOMAC items using the OA-FUNCTION-CAT item calibrations in the functional pain and functional difficulty domains as anchors. We compared mean scores generated by the CAT simulations with scores from the full item bank for the entire sample and by site of OA.
We considered the following characteristics in our analysis: accuracy, breadth of coverage, reliability, and precision. We assessed the accuracy of CATs relative to the full item bank by calculating Pearson correlation coefficients between each of the CAT-generated scores and the full item bank scores. To evaluate breadth of coverage, we calculated item distributions and percentage at the ceiling and floor for each scale of the full item bank compared with the WOMAC. We calculated expected values for each response category for each item and defined the range of the scale as the corresponding person's score estimates between the expected value of the lowest and highest response categories in each scale. Reliability represents the degree to which the differences across subject scores are due to real differences in pain or functional ability (true variance) as opposed to measurement error. At various positions on the scale, we examined the ratio of the true variance to the total variance for each instrument, using the following estimation: 1/[1 + (standard error)²] [37]. Reliability was considered to be adequate for portions of the reliability function of greater than 0.70. Finally, precision was evaluated by calculating and comparing standard errors associated with each subject's score for the 5-, 10-, and 15-item CATs, the full item bank, and the WOMAC.
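The reliability and accuracy calculations can be illustrated with a short sketch. It assumes that the standard error is expressed on a metric in which the true-score variance is 1 (that is, before rescaling to a mean of 50 and standard deviation of 10), so that reliability equals 1 divided by 1 plus the squared standard error. The function names and example values are hypothetical.

```python
import math

def reliability_from_se(se):
    """Ratio of true to total variance, assuming a true-score variance of 1."""
    return 1.0 / (1.0 + se ** 2)

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical standard errors at three positions on the scale
for se in (0.3, 0.5, 1.0):
    rel = reliability_from_se(se)
    print(se, round(rel, 2), "adequate" if rel > 0.70 else "below 0.70")
```

Under this assumption, the 0.70 criterion corresponds to a standard error of less than about 0.65 on the unit-variance metric.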