How to measure the diagnostic accuracy of artificial intelligence

Diagnostics is a field where AI is growing fast. Doctors, researchers and computer scientists are deeply involved in developing and testing new solutions that could surpass current diagnostic capabilities. But how can the diagnostic accuracy of AI be measured?

There are several metrics for measuring accuracy, which makes solutions hard to compare. Some metrics are also influenced by the composition of the data sample, which can distort the results obtained.

To evaluate the performance of a diagnostic system, it is therefore important to know the main metrics and understand what they really represent. An interesting article in MedCity News by Elad Walach explains these concepts well.

One of the most common metrics used by AI companies is the area under the curve (AUC). This way of measuring accuracy actually originated with radar operators in World War Two, as part of receiver operating characteristic (ROC) analysis, but it has become very common in the machine learning community.

AUC measures how much more likely the AI solution is to correctly classify a positive result (say, to correctly detect a pulmonary embolism in a scan) versus how likely the same AI would be to wrongly detect something when it isn’t there.

The higher the AUC, the better the model distinguishes true from false, and therefore patients with the disease from those without.
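As a sketch, AUC can be estimated directly from this pairwise definition: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The scores below are made-up illustrative values, not from any real system.

```python
def auc(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Counts, over all positive/negative pairs, how often the positive case
    outscores the negative one (ties count as half a win).
    """
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for diseased vs healthy patients
positives = [0.9, 0.8, 0.7, 0.6]
negatives = [0.5, 0.4, 0.3, 0.2, 0.65]
print(auc(positives, negatives))  # 0.95
```

A perfect classifier scores every positive above every negative (AUC = 1.0); a coin flip lands at 0.5.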

There are two further metrics that are more concrete than AUC: sensitivity and specificity.
Sensitivity measures how many positive cases an algorithm detects out of all positive cases. Let's say you have 100 real brain bleed patients in your department in a week. If the AI detects 95 of them, it has 95 percent sensitivity.

Likewise, specificity is the number of negative cases accurately classified as negative, out of all negative cases. This means that if you have 1,000 negative cases (in a real-world scenario, you usually have more negative cases than positive) and the AI wrongly flags 80 of them as positive, the 920 accurate negative detections mean the AI has 92 percent specificity.
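The two definitions translate directly into code. This minimal sketch uses the same illustrative counts as the text above:

```python
# Counts from the examples in the text
true_positives = 95      # of 100 real brain-bleed patients, the AI found 95
total_positives = 100
false_positives = 80     # of 1,000 negative cases, the AI wrongly flagged 80
total_negatives = 1000

# Sensitivity: detected positives out of all real positives
sensitivity = true_positives / total_positives
# Specificity: correctly cleared negatives out of all real negatives
specificity = (total_negatives - false_positives) / total_negatives

print(sensitivity, specificity)  # 0.95 0.92
```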

Together, these measures give you a good idea of how many patients could be missed by AI.

Comparing the three measures: a solution with 89% sensitivity and 84% specificity could achieve an AUC of 0.95, and so could an algorithm with 80% sensitivity and 92% specificity. These are all good performance values, even though an AUC of 0.95 might suggest higher sensitivity and specificity than either pair.

The AUC thus provides a single aggregated measure of performance, which may not be sufficient to assess a system's behaviour in specific areas.

Two other very important metrics in AI are the positive predictive value (PPV) and the negative predictive value (NPV). While sensitivity and specificity are interesting from the point of view of technical evaluation, PPV and NPV better represent the clinical user experience.

PPV is the number of true positive cases out of the total cases flagged as positive (including false positives). A PPV of 80 percent means that 8 out of every 10 alerts a user sees would be correct and 2 would be wrong. In other words, PPV is the "spam" metric, reflecting how many irrelevant alerts a user would see in the day-to-day: the lower the PPV, the more "spam" (irrelevant alerts).
Imagine you're using an AI with 95 percent sensitivity and 90 percent specificity to detect c-spine fractures. Over a sample of 1,000 cervical spine cases, 100 are truly positive for fractures.

The number of true positive (TP) cases, where the AI correctly spots a fracture, would be 95 percent of 100 (95). The number of false positives (FP), where the AI thinks it’s found a fracture in a healthy patient, would be 10 percent of 900 (90). The 95 TPs and the 90 FPs make 185 positive alerts altogether. That makes PPV 95/185, so 51 percent.
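The arithmetic of this worked example can be reproduced in a few lines, using the same mix of 100 positives and 900 negatives:

```python
# The c-spine example: 1,000 cases, 100 truly positive, 900 truly negative
positives, negatives = 100, 900
sensitivity, specificity = 0.95, 0.90

tp = sensitivity * positives          # true positives: 95 fractures caught
fp = (1 - specificity) * negatives    # false positives: 90 healthy patients flagged
ppv = tp / (tp + fp)                  # 95 / 185

print(round(ppv * 100))  # 51 (percent)
```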

Our system features both high sensitivity (95 percent) and high specificity (90 percent). However, PPV is “only” 51 percent. Why?

The culprit is the data mix. Although the false-positive rate is relatively low, there is a very high number of negative cases to begin with (900 negatives vs 100 positives), so every percentage point of specificity makes a huge difference to the user experience.

Conversely, negative predictive value (NPV) reflects the user's "peace of mind": how sure you can be that when the AI says a case is negative, it really is negative. In other words, out of all negative calls, how many are truly negative? Since there are usually many more negative cases (e.g. patients without the disease) than positive cases, NPV is much higher than PPV; values above 97% are very common.

For example, an AI solution with a sensitivity and specificity of only 80% would obtain, in the example above, an NPV of about 97.3%. A good system with sensitivity and specificity at 95%, with the same data mix, would obtain an NPV of about 99.4%.
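NPV for the same 1,000-case mix (100 positives, 900 negatives) can be computed directly. A sketch with the two performance levels discussed above:

```python
def npv(sensitivity, specificity, positives=100, negatives=900):
    """Negative predictive value: true negatives out of all negative calls."""
    tn = specificity * negatives          # negatives correctly cleared
    fn = (1 - sensitivity) * positives    # real cases missed by the AI
    return tn / (tn + fn)

print(round(npv(0.80, 0.80) * 100, 1))  # 97.3  (720 / 740)
print(round(npv(0.95, 0.95) * 100, 1))  # 99.4  (855 / 860)
```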

The evaluation of these metrics therefore depends heavily on the clinical context. In areas where the prevalence of the disease is relatively low, for example avian influenza, a PPV in the range of 50-70% can be very good; for the rarest diseases, a PPV of even 20% could still be excellent performance. The NPV, on the other hand, should be very high: an NPV of 95% or more should be sought for a reliable AI system.
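To see why prevalence dominates PPV, here is an illustrative sketch that holds sensitivity and specificity fixed at 90% (hypothetical values) while prevalence drops:

```python
def ppv(prevalence, sensitivity=0.90, specificity=0.90):
    """PPV as a function of disease prevalence, per unit of population."""
    tp = sensitivity * prevalence              # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return tp / (tp + fp)

for prev in (0.10, 0.01, 0.001):
    print(f"prevalence {prev:>5.1%}: PPV {ppv(prev):.1%}")
# PPV falls from 50.0% to 8.3% to 0.9% as prevalence drops
```

With the same classifier, one alert in two is real at 10% prevalence, but fewer than one in a hundred is real at 0.1%, which is why a "low" PPV can still be excellent for a rare disease.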
