AI and healthcare: Gathering and interpreting data in the future

As artificial intelligence enters diagnostics and treatment in the health sector, Huw Llewelyn MD discusses the challenges facing the medical profession, particularly in assessing these new interventions when randomised controlled trials (RCTs) are impractical.

Many hope that new diagnostic tests and treatments based on artificial intelligence, ‘big data’, physiological smartphone apps or genetics will improve future health. However, we face some big challenges, particularly in assessing these new interventions when randomised controlled trials (RCTs) are impractical[1] or when data collected during routine care are too unstructured and inadequate for meaningful analysis[2]. I will describe examples of the type of innovation that might help. The first is a way of conducting controlled trials without randomisation. The second is a method of collecting structured data during day-to-day care by working backwards from a decision.

If we wished to understand the benefit of thyroxine over placebo at different levels of TSH in the treatment of hypothyroidism, a standard RCT would not be practical, given the large sample size required and the ethics of giving a placebo to those with severe symptoms. Take a TSH level of 25mu/l as an example. If sufficient patients could be recruited, we might observe that 55% had ‘persistent symptoms’ on placebo after 2 months and 10% had ‘persistent symptoms’ on thyroxine. Similar experiments at all predefined TSH levels would enable us to draw a graph such as that in figure 1, with curves showing the estimated proportion of patients with ‘persistent symptoms’ following placebo and thyroxine treatment at each TSH value. The very large number of patients needed for such a trial, and their possible objections to taking placebo for severe symptoms, would make it very difficult to conduct.

Figure 1.

There are, however, other study designs that might allow such questions to be answered. Patients would be asked in advance to consent to be given placebo only if their TSH was less than a cut-off value (e.g. a TSH of 20mu/l) and to be given a fixed dose of thyroxine if the TSH result was above that point. The results are then used to estimate the overall proportions with persistent symptoms on placebo and treatment, by assuming that the proportions of patients with a TSH above and below 20mu/l are the same in those on placebo and on treatment (see footnote). This allows both curves to be estimated in their entirety as shown in figure 1, so that there is a constant ‘odds ratio’ (in this case 0.09) between the curves at all points[3]. This will have been achieved without randomisation and without large numbers of patients. We can also assess the effect on the curves of using other findings such as the patient’s symptoms, genome testing, etc. Better tests for use as diagnostic and treatment indication gold standard criteria would create steeper curves, representing greater ‘precision medicine’. Better treatments (or perhaps higher doses) would create greater separation between the curves[3].

The second example of a possible innovation is a change to prescribing software that could identify patients for studies such as the above. For example, after selecting a medication, the prescriber is offered a pick-list of diagnostic explanations, thus working backwards. The prescriber clicks on one option and is then offered another pick-list for that diagnosis of presenting complaints, confirmatory findings, markers of future progress and other interventions. If prescribers have agreed to help in a study on patients with that diagnosis, they are directed to the relevant advice. As well as supporting formal research and audit of routine care when assessing diagnostic tests and treatments, the same system of ‘working backwards’ can be used to provide high quality data for artificial intelligence (AI) research.
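As a rough illustration of how such pick-lists might hold structured data, consider the sketch below. All medication, diagnosis and finding names are invented for the example; the point is only that the prescriber's justification is captured in a structured, analysable form rather than free text.

```python
# Hypothetical 'working backwards' pick-lists: a medication points to its
# possible diagnostic explanations, and each diagnosis to structured findings.
PICK_LISTS = {
    "levothyroxine": ["hypothyroidism", "post-thyroidectomy replacement"],
}
DIAGNOSIS_DETAILS = {
    "hypothyroidism": {
        "presenting complaints": ["tiredness", "weight gain"],
        "confirmatory findings": ["raised TSH", "low free T4"],
        "markers of future progress": ["TSH at follow-up", "symptom resolution"],
        "other interventions": ["repeat thyroid function tests"],
    },
}

def record_decision(medication: str, diagnosis: str, findings: list) -> dict:
    """Store the prescriber's chosen justification as a structured record."""
    if diagnosis not in PICK_LISTS[medication]:
        raise ValueError("diagnosis not on the pick-list for this medication")
    return {"medication": medication, "diagnosis": diagnosis, "evidence": findings}

record = record_decision("levothyroxine", "hypothyroidism",
                         ["tiredness", "raised TSH"])
print(record["evidence"])  # structured data available for audit or AI research
```

Each record links a prescription to its diagnostic explanation and supporting findings, which is exactly the kind of labelled data that audit and AI research need.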



Information generated in this way of ‘working backwards’ could also be given to patients, nurses and other colleagues in the form of a written evidence-based summary of care. The summary would look like a traditional past medical history, but updated sequentially to provide a narrative of multidisciplinary care. Each diagnosis would form a heading, with subheadings of ‘outline evidence’ (in the form of symptoms, etc) and ‘outline management’[4]. Having to explain decisions before they are implemented would make the prescriber stop and think before going ahead, which could increase the accuracy of predictions and reduce errors.

In one study, ‘intuitive’ or ‘System 1’ predictions made by a surgeon were correct in 78% of cases. Predictions made using transparent or ‘System 2’ reasoning, which could also be done by ‘working backwards’ as above, were correct in 77% of cases. However, when both agreed, the accuracy was about 90%[5]. It is a consequence of the axioms of probability that if two independent methods (e.g. a doctor and an AI system) each make the same prediction correctly more often than chance, the accuracy of that concurred prediction will be higher than that of either prediction in isolation. In other words, ‘two heads are better than one’.
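Under the simplifying assumptions that the outcome is binary and that the two methods err independently, the figures quoted above can be checked with a short calculation:

```python
# P(correct | both methods agree) for a binary prediction, assuming the two
# methods are independent. With a binary outcome there is only one wrong
# answer, so the methods agree either when both are right or both are wrong.
def accuracy_when_agreed(p1: float, p2: float) -> float:
    both_correct = p1 * p2
    both_wrong = (1 - p1) * (1 - p2)
    return both_correct / (both_correct + both_wrong)

# The study's 78% and 77% standalone accuracies give roughly the 90% reported:
print(round(accuracy_when_agreed(0.78, 0.77), 2))  # 0.92
```

Note that a method performing at chance (50%) contributes nothing: agreement with it leaves the other method's accuracy unchanged.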

Innovative approaches of this kind could produce a broader range of evidence for health technology assessment, lower the costs of product testing and speed up the implementation of new treatments and diagnostic tests (in keeping with the government’s policy for ‘accelerated access’). Sharing the information with patients and others as decisions are being made could increase mutual trust, provide better background information during consultations, decrease errors and avoid unnecessary duplication and other waste of resources.

Huw Llewelyn MD FRCP is an honorary fellow in mathematics in Aberystwyth University and has been a practising consultant general physician and endocrinologist for over 40 years


  1. Randomised Control Trials – are they still the gold standard? Royal College of Physicians Commentary June 2018, p 10.
  2. Wyatt JC. AI in the NHS: how should physicians respond? Our Future Health Blog. Royal College of Physicians 2018.
  3. Llewelyn H. The scope and conventions of evidence based medicine need to be widened to deal with ‘too much medicine’. J Eval Clin Pract, 2018, 1-7.
  4. Llewelyn H, Ang AH, Lewis K, Abdullah A. The Oxford Handbook of Clinical Diagnosis, 3rd edition. Oxford University Press, Oxford 2014.
  5. Llewelyn DEH. Assessing the validity of diagnostic tests and clinical decisions. MD thesis, University of London, 1987.

Statistical footnote

Consider that the proportion of patients with persistent symptoms in a group on placebo with a TSH up to 20mu/l is 0.075. Conversely, the proportion of patients on placebo (or treatment) with a TSH up to 20mu/l among those with persistent symptoms is 0.333. Consider also that the proportion of patients with persistent symptoms in a group on treatment with a TSH over 20mu/l is 0.041.

Conversely, the proportion of patients on treatment (or placebo) with a TSH over 20mu/l among those with persistent symptoms is 0.25. From this information we can estimate the odds of persistent symptoms in the entire placebo group to be (0.075/(1-0.075))*((1-0.25)/0.333) = 0.18, the estimated proportion being 0.18/(1+0.18) = 0.153. In the same way, we can estimate the odds of persistent symptoms in the entire treatment group as (0.041/(1-0.041))*((1-0.75)/(1-0.333)) = 0.016, the estimated proportion being 0.016/(1+0.016) = 0.016. This means that the odds ratio between the curves at every point will be 0.016/0.18 = 0.09.
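The footnote's arithmetic can be reproduced directly. This is a sketch using only the illustrative proportions given above:

```python
# Scale the odds of persistent symptoms observed in each subgroup up to the
# whole group, using the conditional proportions given in the footnote.
odds_low = 0.075 / (1 - 0.075)                         # placebo patients, TSH <= 20
odds_placebo = odds_low * ((1 - 0.25) / 0.333)         # whole placebo group: ~0.18
p_placebo = odds_placebo / (1 + odds_placebo)          # ~0.15

odds_high = 0.041 / (1 - 0.041)                        # treated patients, TSH > 20
odds_treated = odds_high * ((1 - 0.75) / (1 - 0.333))  # whole treatment group: ~0.016
p_treated = odds_treated / (1 + odds_treated)          # ~0.016

odds_ratio = odds_treated / odds_placebo               # ~0.09 at every TSH value
print(round(p_placebo, 2), round(p_treated, 3), round(odds_ratio, 2))
```

The small differences from the footnote's quoted 0.153 arise only from intermediate rounding.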

We next find the means and standard deviations of TSH in the patients with and without persistent symptoms and plot the distributions. With the overall proportions estimated above and Bayes' rule, we calculate the curves shown in figure 1. We can estimate the stochastic variation by recalculating the curves using the 95% confidence limits of the proportions and of the distance between the means of the two distributions.
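That last step can be sketched as follows. The distribution parameters below are hypothetical (the data underlying figure 1 are not given here); the point is only the shape of the calculation: fit a TSH distribution to each outcome group, then apply Bayes' rule at each TSH value.

```python
import math

def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def p_symptoms_given_tsh(tsh, p_s, mean_s, sd_s, mean_ns, sd_ns):
    """Bayes' rule: combine the overall proportion with symptoms (p_s) with the
    TSH distributions fitted to the symptomatic and asymptomatic groups."""
    num = p_s * normal_pdf(tsh, mean_s, sd_s)
    return num / (num + (1 - p_s) * normal_pdf(tsh, mean_ns, sd_ns))

# Hypothetical placebo curve: overall proportion 0.153 (from the footnote),
# with symptomatic patients centred on higher TSH values than asymptomatic ones.
for tsh in (10, 20, 30, 40):
    print(tsh, round(p_symptoms_given_tsh(tsh, 0.153, 30, 8, 18, 8), 2))
```

With these assumed parameters the curve rises with TSH, as in figure 1; repeating the calculation at the confidence limits of the inputs would give the stochastic variation described above.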