Accuracy, patient-perceived usability, and acceptance of two symptom checkers (Ada and Rheport) in rheumatology: interim results from a randomized controlled crossover trial

Background Timely diagnosis and treatment are essential in the effective management of inflammatory rheumatic diseases (IRDs). Symptom checkers (SCs) promise to accelerate diagnosis, reduce misdiagnoses, and guide patients more effectively through the health care system. Although SCs are increasingly used, there is little supporting evidence. Objective To assess the diagnostic accuracy, patient-perceived usability, and acceptance of two SCs: (1) Ada and (2) Rheport. Methods Patients newly presenting to a German secondary rheumatology outpatient clinic were randomly assigned in a 1:1 ratio to complete Ada or Rheport first and then the respective other SC in a prospective non-blinded controlled randomized crossover trial. The primary outcome was the accuracy of the SCs regarding the diagnosis of an IRD compared to the physicians’ diagnosis as the gold standard. The secondary outcomes were patient-perceived usability, acceptance, and time to complete the SC. Results In this interim analysis, the first 164 patients who completed the study were analyzed. 32.9% (54/164) of the study subjects were diagnosed with an IRD. Rheport showed a sensitivity of 53.7% and a specificity of 51.8% for IRDs. Ada’s top 1 (D1) and top 5 disease suggestions (D5) showed a sensitivity of 42.6% and 53.7% and a specificity of 63.6% and 54.5% concerning IRDs, respectively. The correct diagnosis of the IRD patients was within the Ada D1 and D5 suggestions in 16.7% (9/54) and 25.9% (14/54), respectively. The median System Usability Scale (SUS) score of Ada and Rheport was 75.0/100 and 77.5/100, respectively. The median completion time for Ada and Rheport was 7.0 and 8.5 min, respectively. Sixty-four percent and 67.1% would recommend using Ada and Rheport to friends and other patients, respectively. Conclusions While SCs are well accepted among patients, their diagnostic accuracy is limited to date. Trial registration DRKS.de, DRKS00017642. Registered on 23 July 2019


Introduction
The European League Against Rheumatism (EULAR) recommendations state that patients with arthritis should be seen as early as possible, ideally within 6 weeks after symptom onset [1], since an early start of treatment significantly improves patient outcomes [2]. Various strategies have been identified [3,4] to implement these recommendations; however, the diagnostic delay seems to increase despite such strategies [5,6].
Symptom checkers (SCs) could improve this situation. SCs are patient-centered diagnostic decision support systems (DDSS) that are designed to offer a scalable, objective, cost-effective, personalized triage strategy. Based on such a triage strategy, SCs should help the right patient to receive an appropriate appointment at the right time, thus empowering patients. It is known that patients with rheumatic and musculoskeletal diseases (RMD) are highly motivated to use SCs and other medical apps [7]. SCs like the artificial intelligence-driven Ada have already been used to complete more than 15 million health assessments in 130 countries [8].
To ensure the safety and efficacy of such apps, EULAR recently published guidelines [9] that state "self-management apps should be up to date, scientifically justifiable, user-acceptable, and evidence-based where applicable," and validation should include people with RMDs.
Therefore, the aim of this study was to create real-world evidence by evaluating the diagnostic accuracy, usability, acceptance, and completion time of two free, publicly available SCs, Ada (www.ada.com) and Rheport (www.rheport.de).

Study design
We present interim results of a randomized controlled crossover multicenter study, conducted at three centers in Germany. The study was approved by the ethics committee of the Medical Faculty of the University of Erlangen-Nürnberg, Germany (106_19 Bc), reported to the German Clinical Trials Register (DRKS) (DRKS00017642), and conducted in compliance with the Declaration of Helsinki. All patients provided written informed consent before participating. Patients were randomized 1:1 to group 1 (completing Ada first, continuing with Rheport) or group 2 (completing Rheport first, continuing with Ada) by computer-generated block randomization, with each block containing n = 100 patients. SCs were completed before the regular appointment. Assisting personnel were present to help with SC completion if necessary.
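
The allocation scheme described above can be illustrated with a minimal sketch of permuted-block randomization. The source states only the 1:1 ratio and the block size of 100; the function name `block_randomize`, the arm labels, and the seed are hypothetical, and the software actually used to generate the study's allocation list is not described in the text.

```python
import random


def block_randomize(n_patients, block_size=100, seed=None):
    """Assign patients 1:1 to two arms using permuted blocks.

    Hypothetical helper for illustration; not the study's actual tool.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for a 1:1 allocation")
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_patients:
        # Each block holds equal numbers of both sequences, in random order.
        half = block_size // 2
        block = ["Ada first"] * half + ["Rheport first"] * half
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_patients]


schedule = block_randomize(200, block_size=100, seed=42)
```

Within every completed block, both sequences occur equally often. An interim cut that stops partway through a block, as in the analysis of 164 patients, is therefore not guaranteed to be perfectly balanced between the two arms.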

Study patients
Adult patients newly presenting to the first (University Hospital Erlangen, Germany) of three recruiting rheumatology outpatient clinics with musculoskeletal symptoms and unknown diagnosis were included in this cross-sectional study. Patients with a known diagnosis and patients unwilling or unable to comply with the protocol were excluded from the study. Besides the app-related data outlined below, demographic variables, symptom duration, swollen and tender joint count, DAS28 score, ESR, CRP, anti-CCP antibody and rheumatoid factor status, and clinical diagnosis using established classification criteria were recorded. This interim analysis is based on patient data from rheumatology outpatient clinics recorded starting in September 2019 up to February 2020.

Description of the symptom checkers
Ada is a Conformité Européenne (CE)-certified medical app that is freely available in multiple languages and was used to complete more than 15 million health assessments in 130 countries [8]. The artificial intelligence-driven chatbot app first asks for basic health information (e.g., sex, smoking status) and then asks for the current leading symptoms. The questions (Fig. 1) are dynamically chosen, and the total number varies depending on the previous answers given. Ada then provides a top disease suggestion (D1) and up to 5 concrete disease suggestions (D5), together with their probability and urgency advice. The app is based on constantly updated research findings and is not limited to RMDs.
Rheport is a rheumatology-specific online platform that uses a fixed patient questionnaire (Fig. 1) including basic health information and rheumatology-specific questions, developed by rheumatologists. A background algorithm calculates the probability of an IRD based on a weighted sum score of the questionnaire answers. A sum score ≥ 1.0 was determined to be the threshold for an IRD. The system is already used in clinical routine to triage appointments of new patients according to IRD probability. About 3000 appointments have been organized to date [4]. For this study, an app-based version of the software was used. Both SCs were tested using three iOS-based tablets.
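
The threshold rule described for Rheport can be sketched as follows. Only the weighted sum score and the cut-off of ≥ 1.0 come from the text; the item names and weights below are invented placeholders, since the actual Rheport questionnaire items and weights are not given here.

```python
IRD_THRESHOLD = 1.0  # from the text: a sum score >= 1.0 indicates a suspected IRD

# Hypothetical questionnaire items and weights, for illustration only.
EXAMPLE_WEIGHTS = {
    "item_a": 0.5,
    "item_b": 0.5,
    "item_c": 0.25,
}


def weighted_sum_score(answers, weights):
    """Sum the weights of all items the patient answered positively."""
    return sum(weights[item] for item, positive in answers.items() if positive)


def suspects_ird(answers, weights=EXAMPLE_WEIGHTS, threshold=IRD_THRESHOLD):
    """Apply the threshold rule to one patient's answers."""
    return weighted_sum_score(answers, weights) >= threshold


patient = {"item_a": True, "item_b": False, "item_c": True}
flagged = suspects_ird(patient)  # score 0.75, below the threshold
```

A fixed rule of this kind is transparent and easy to audit, which fits the use of Rheport for appointment triage. Its sensitivity and specificity depend entirely on the chosen weights and cut-off.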

Primary outcome
The primary outcome was the diagnostic accuracy, in terms of sensitivity and specificity, of Ada and Rheport concerning the diagnosis of IRD. The results of the SCs were recorded and compared to the gold standard, i.e., the final physicians' diagnosis as reported in the discharge summary report and adjudicated by the head of the local rheumatology department.

Secondary outcomes
SC completion time and patient-perceived usability were secondary outcomes of this study. SC completion time was measured by supervising local study personnel. Patients completed a survey evaluating the SC usability using the System Usability Scale (SUS) [10]. It consists of 10 statements with 5-point Likert scales ranging from strongly agree to strongly disagree, resulting in a maximum score of 100. Finally, patients were asked if they would recommend the two SCs to friends and other patients.
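
For readers unfamiliar with the SUS, the conversion of the 10 item responses to a 0-100 score can be sketched as below. The sketch assumes the standard SUS scoring convention (responses coded 1 to 5, odd-numbered items positively worded, even-numbered items negatively worded); the text itself states only the number of items, the 5-point scale, and the maximum of 100. The function name `sus_score` is hypothetical.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten responses coded 1..5.

    Assumes the standard convention: 1 = strongly disagree, 5 = strongly agree.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    total = 0
    for number, response in enumerate(responses, start=1):
        if number % 2 == 1:
            total += response - 1   # odd items: agreement is favorable
        else:
            total += 5 - response   # even items: disagreement is favorable
    return total * 2.5              # rescale the 0-40 sum to 0-100


best_case = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
neutral = sus_score([3] * 10)
```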

Statistical analysis
We performed an interim analysis of the first 164 patients who completed the study. The analysis consisted of (i) a descriptive sample characterization stratified by randomization arm, (ii) an assessment of Ada's and Rheport's diagnostic accuracy, and (iii) a descriptive evaluation of the secondary outcome measures specified above for the total sample. Descriptive characteristics for each randomization arm are presented as median (Mdn) and interquartile range (IQR) for interval data and as absolute (n) and relative frequency (percent) for nominal data. Comparability of demographic and IRD-related characteristics between the randomization groups was assessed by Wilcoxon rank-sum tests and χ² tests. Diagnostic accuracy was evaluated in terms of sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and overall accuracy. The comparability of the secondary outcomes was evaluated by Wilcoxon signed-rank tests, and descriptive information is presented as Mdn (IQR). The significance level for inferential tests was set at p ≤ 0.05. The software used for the statistical analysis was R (version 3.6.3) and RStudio (version 1.2.5033).

Fig. 1 Screenshots of the Ada and Rheport symptom checkers. Notes: (1) The German version of Ada was used in the study. (2) The Rheport menu was translated into English for this figure.
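
The accuracy measures named above follow directly from a 2 × 2 table of SC result versus physician diagnosis. The sketch below is illustrative and is not the study's R code. The cell counts are not reported in the text: they are back-calculated from the reported Rheport sensitivity (53.7%) and specificity (51.8%) with 54 IRD and 110 non-IRD patients, and should be read as an assumption.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Return standard accuracy measures from the cells of a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),   # IRD patients flagged by the SC
        "specificity": tn / (tn + fp),   # non-IRD patients not flagged
        "ppv": tp / (tp + fp),           # flagged patients who have an IRD
        "npv": tn / (tn + fn),           # unflagged patients without an IRD
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Assumed counts, back-calculated from the reported percentages (not from the source).
rheport = diagnostic_accuracy(tp=29, fp=53, fn=25, tn=57)
for name, value in rheport.items():
    print(f"{name}: {value:.1%}")
```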

Sample size determination
A minimum sample size of n = 122 patients was calculated, based on the following assumptions: (1) prevalence, defined as the proportion of subjects who, after presenting to the rheumatologist, are diagnosed with an inflammatory rheumatic disease of 40% [11]; (2) average diagnostic accuracy of previous applications for diagnosis using the 3 most likely diagnoses of 50% [12]; (3) desired accuracy of diagnosis using Ada or Rheport in terms of sensitivity and specificity of 70%; (4) type 1 error: discrete value according to Bujang

Participants
A total of 211 consecutive patients were approached, 167 agreed to participate, and 164 patients were included in the interim analysis presented (Fig. 2). 32.9% (54/164) of the presenting patients were diagnosed with an IRD based on the physicians' judgment. The classified diagnosis and demographic characteristics are summarized in Tables 1 and 2, respectively.

Secondary outcomes
The median completion time for Ada and Rheport was 7.0 min (IQR 5.8-9.0) and 8.5 min (IQR 8.0-10.0), respectively. On a scale of 0 (worst) to 100 (best), the median SUS of Ada and Rheport was 75.0 (IQR 62.5-85.0) and 77.5 (IQR 62.5-87.5), respectively. Completion time and usability (SUS scores) did not differ between the two groups. Sixty-four percent and 67.1% would recommend using Ada and Rheport to friends and other patients, respectively.

Discussion
This prospective real-world study highlights the currently limited diagnostic accuracy of SCs, such as Ada and Rheport, with respect to IRDs. Their overall sensitivity and specificity for IRDs are moderate. SCs offer patients on-demand medical support independent of time and place. An automated SC-based triage, as offered by Rheport, may allow objective, scalable, and transparent decisions. By automating triage decisions, SCs could additionally save money [12,14] and accelerate the time to correct diagnosis [15], but they may also lead to over-diagnosis and over-treatment [16].
Despite increasing patient usage [8], evidence supporting SC effectiveness is limited to date [12,17]. The results of this study are in line with previous SC analyses [12,17,18]. Research supported by Ada Health GmbH shows that Ada had the highest top 3 suggestion diagnostic accuracy (70.5%) compared to other SCs [19], and the correct condition was among the first three results in 83% of cases in an Australian assessment study [20]. Similar to our results, the majority of patients would recommend Ada (85.3%) to friends or relatives [21].
The first rheumatology-specific SC study with 34 patients [18] showed that only 4 out of 21 patients with inflammatory arthritis were given the first diagnosis of RA or PsA. Proft et al. recently showed that a physician-based referral strategy was more effective than an online self-referral tool for early recognition of axial spondyloarthritis [22]. Nevertheless, these authors recommend using online self-referral tools in addition to traditional referral strategies, as the proportion of axial spondyloarthritis among self-referred patients (19.4%) was clearly higher than the assumed 5% prevalence in patients with chronic back pain. Given that only 32.9% of the newly presenting patients in the current study were diagnosed with an IRD, complementary SC integration might indeed be part of modern rheumatology.
The diagnostic accuracy of rheumatologists is high based on the comprehensive use of information from patients' history, symptoms, and also data from laboratory tests and imaging [23]. Therefore, the current comparison of the physicians' final diagnosis and SC-suggested diagnosis should be interpreted carefully, as the SC diagnosis is based on substantially less data. Furthermore, patients could discuss SC results with their rheumatologists, possibly influencing the rheumatologist's diagnosis. The sequential usage of both SCs represents a possible bias, as patients might be influenced by the usage of the first SC. However, we could not observe any significant differences related to SC order. The slightly better performance of Ada should be interpreted carefully. In contrast to Rheport, Ada is supported by artificial intelligence and does not use a fixed questionnaire. Ada covers a great variety of different conditions [19] and is not limited to IRDs, whereas Rheport is exclusively meant for the triage of new suspected IRD patients. The study setting was deliberately chosen to be risk-averse, so the use of the SCs did not have any clinical implications. Symptom checkers are, however, designed to be used in community settings, where the probability that a patient will have an IRD is much lower than in a rheumatology clinic and no help for SC completion is available. Furthermore, the exact SC diagnosis might be less important than the SC advice on when to see a doctor, especially in emergency situations. Our study setting caused a much higher a priori chance of having an IRD, as patients were already "screened" by referring physicians. The high proportion of PsA and AxSpA patients is likely attributable to a strong local cooperation with the orthopedic and dermatology departments. Additional data from the other two centers will hopefully contribute to balancing results.
We did not measure how often help from assisting personnel was necessary for SC completion.
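
The dependence of SC performance on the a priori chance of an IRD, raised above for community settings, can be made concrete with Bayes' rule. The sketch uses the reported Rheport sensitivity (53.7%) and specificity (51.8%) and the observed clinic prevalence (32.9%). The 1% community prevalence is a purely hypothetical value chosen for illustration, and the sketch assumes that sensitivity and specificity would carry over unchanged to a community population, which this study did not test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive SC result (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


SENSITIVITY = 0.537           # reported for Rheport
SPECIFICITY = 0.518           # reported for Rheport
CLINIC_PREVALENCE = 0.329     # observed in this interim analysis
COMMUNITY_PREVALENCE = 0.01   # hypothetical value, not from the source

ppv_clinic = positive_predictive_value(SENSITIVITY, SPECIFICITY, CLINIC_PREVALENCE)
ppv_community = positive_predictive_value(SENSITIVITY, SPECIFICITY, COMMUNITY_PREVALENCE)
print(f"PPV at clinic prevalence: {ppv_clinic:.1%}")
print(f"PPV at assumed community prevalence: {ppv_community:.1%}")
```

Under these assumptions, a positive result is far less informative at low prevalence than in a pre-screened clinic population, which is one reason why results from this setting should not be extrapolated directly to community use.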
To the best of our knowledge, this is the first prospective, real-world, multicenter study evaluating two currently used SCs in rheumatology. Our results may help to guide and inform patients, treating health care professionals (HCPs), and other stakeholders in health care. In conclusion, while SCs are well accepted by patients, their diagnostic accuracy is limited. Constant improvement of algorithms might foster the future potential of SCs to improve patient care.