For many of us, technology offers a way to resolve uncertainty. If we can’t recall a fact or figure something out, we simply search and receive an answer. What year did the Paris Peace Conference conclude? Let me Google that… 1920. How many miles is a 10K run? 6.2 miles. Who starred with Oscar-winner Brendan Fraser in his headlining debut, Encino Man? That’s right; it was Sean Astin and Pauly Shore.
Interestingly, the reverse is also increasingly true — computers are leaning on humans to check their work. “Human-in-the-loop” AI systems rely on human intervention for assurance that the AI hasn’t misread the information and made an inaccurate prediction. And often the circumstances are far more critical than movie trivia.
For example, a radiologist will review an AI’s X-ray diagnosis to determine if it missed a fracture or lesion. The human can then correct any mistakes to ensure the patient receives the proper care. It makes for a terrific partnership, but there’s a hiccup in the strategy: Humans are rarely 100% certain of their conclusions.
That same radiologist may see an area of bone tissue that’s a different color on the X-ray and wonder, “Is it a lesion or an irregularity in the X-ray itself? If it is a lesion, what’s the cause, and is it benign or malignant?” Even highly trained specialists — perhaps especially specialists — regularly bake such uncertainties into their observations and decisions. If they think there’s a 10% chance of an alternative diagnosis, they can talk to their patient about that and plan accordingly.
While this seems natural to us, human-in-the-loop systems don’t reason that way. They treat human interventions as binary: the human either knows the answer or they don’t. That, in turn, can limit an AI system’s ability to mitigate the risk of human error in the partnership.
But is it possible for such systems to better understand the nuances of human decision-making to improve their performance — and our own? It’s a question a team of researchers at the University of Cambridge put to the test in a new research paper.
Are you sure about that?
In their first test, the researchers employed concept-based models — machine-learning models that make predictions from human-interpretable concepts and can be corrected through human feedback — with two datasets. The first, called CheXpert, is used to classify chest X-rays. The other, called UMNIST, involves summing handwritten digits. Like most concept-based models, neither had previously been trained on uncertainty, so the researchers wanted to see how they would handle it.
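To make that setup concrete, here is a rough sketch of how a concept-based model accepts a “hard” human intervention. The weights, the three unnamed concepts, and the predict function are placeholders of our own, not the researchers’ code; the point is that the person’s correction arrives as a flat 0 or 1, with no room for doubt.

```python
# A minimal sketch (not the authors' code) of a concept-bottleneck-style model.
# The model predicts intermediate "concepts" (e.g. "fracture present"), and a
# person can overwrite any of those predictions before the final label is
# computed. All weights and concept indices here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Toy "concept predictor": maps a 5-feature input to 3 concept probabilities.
W_concept = rng.normal(size=(5, 3))
# Toy "label predictor": maps the 3 concepts to a single output score.
w_label = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, human_concepts=None):
    """Predict a label from input x, optionally overriding concepts.

    human_concepts: dict {concept_index: value in {0.0, 1.0}} -- the standard
    "hard" intervention, where the person is assumed to be fully certain.
    """
    concepts = sigmoid(x @ W_concept)      # the model's own concept guesses
    if human_concepts:
        for idx, value in human_concepts.items():
            concepts[idx] = value          # overwrite with the human's label
    return sigmoid(concepts @ w_label)     # final prediction built from concepts

x = rng.normal(size=5)
print("no intervention:                ", predict(x))
print("human says concept 0 is present:", predict(x, {0: 1.0}))
```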
“A lot of developers are working to address model uncertainty, but less work has been done on addressing uncertainty from the person’s point of view,” Katherine Collins, the study’s first author and a research student in Cambridge’s Department of Engineering, said. “We wanted to look at what happens when people express uncertainty, which is especially important in safety-critical settings.”
The answer: not great. The researchers found that even with low simulated uncertainty, the models’ performance dropped and continued to fall as that uncertainty grew. This suggested that the models, while accurate when receiving fully certain interventions, were “unable to generalize to settings in which the intervening user is uncertain of the nature of some concepts.”
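What does “simulated uncertainty” look like in practice? One simple way to model it (our illustration, not necessarily the paper’s exact protocol) is to have the simulated intervener supply the wrong binary label with some probability p. On a toy task where the prediction just echoes the concept, accuracy slides as p grows:

```python
# A toy sketch of an uncertain intervener: with probability p_mistake, the
# simulated "human" supplies the wrong binary concept label instead of the
# right one, and downstream accuracy degrades accordingly.
import numpy as np

rng = np.random.default_rng(1)

def simulate_intervention(true_concept: int, p_mistake: float) -> int:
    """Return the label an uncertain human would provide."""
    if rng.random() < p_mistake:
        return 1 - true_concept   # the human's uncertainty leads to an error
    return true_concept

# Toy downstream task: the label simply equals the concept, so any mistaken
# intervention produces a wrong prediction.
for p in (0.0, 0.1, 0.3, 0.5):
    trials = [simulate_intervention(1, p) == 1 for _ in range(10_000)]
    print(f"simulated uncertainty p={p:.1f} -> accuracy {np.mean(trials):.2f}")
```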
For their next test, the researchers used an image classification dataset of birds and brought in real human participants. These participants were asked to identify specific features of birds in images. Was the bird multi-colored, solid, spotted, or striped? Was its tail forked, rounded, fanned, or squared? And so on.
However, the images didn’t always give the best representation of the bird. The pictured bird might be silhouetted against a bright background or have its tail feathers obscured by a branch. As such, the researchers gave the human participants the ability to use “soft labels” — labels that aren’t a simple either-or but instead let participants rate each concept’s plausibility from 0–100 (with 0 representing no idea and 100 representing absolute certainty).
For instance, if the participant thought it was very plausible the bird’s wing shape was broad, they could move a slider up to 80. But if they were less certain whether the wings were rounded or pointed, they could move those sliders less (say, to 20 and 10 respectively).
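Translated into code, those slider readings might look something like this; the concept names and the simple divide-by-100 mapping are our own illustration of the idea, not the study’s exact interface:

```python
# An illustrative mapping from 0-100 slider values to "soft" concept labels.
# Rather than a hard 0 or 1, each concept carries the participant's
# plausibility rating as a probability the model can weigh.
slider_values = {
    "wing_shape_broad": 80,    # quite plausible
    "wing_shape_rounded": 20,  # less certain
    "wing_shape_pointed": 10,  # unlikely, but not ruled out
}

# Map each slider reading onto [0, 1] so the model can treat it as a
# probability instead of a yes/no answer.
soft_labels = {concept: value / 100 for concept, value in slider_values.items()}

print(soft_labels)
# {'wing_shape_broad': 0.8, 'wing_shape_rounded': 0.2, 'wing_shape_pointed': 0.1}
```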
The researchers found that performance degraded when the machine’s concept labels were replaced by the humans’. They also found that training the model on uncertainty could ease some of the participants’ failures, though not perfectly: sometimes human uncertainty helped, and other times it hurt the model’s performance.
“We need better tools to recalibrate these models so that the people working with them are empowered to say when they’re uncertain,” Matthew Barker, the study’s co-author, said. “In some ways, this work raised more questions than it answered, but even though humans may be miscalibrated in their uncertainty, we can improve the trustworthiness and reliability of these human-in-the-loop systems by accounting for human behavior.”
The Cambridge team was joined in their study by researchers at Princeton, the Alan Turing Institute, and Google DeepMind. They presented their paper at the 2023 AAAI/ACM Conference on AI, Ethics, and Society in Montreal. The paper is currently available as a preprint on arXiv.
Heading toward an uncertain future
The researchers hope their paper can help to one day develop human-in-the-loop systems that can take uncertainty into account and therefore mitigate the risks of both human and AI error. However, this research represents only the initial steps toward that goal.
It also revealed several challenges for future research. These include developing AI models and intervention policies that account for well-known human prediction errors (such as overconfidence bias), creating interfaces that help humans gauge their own uncertainty, and training AI models to handle different types of uncertainty, such as the difference between uncertainty about one’s own knowledge and uncertainty about how random effects will play out.
If these problems can be solved, human uncertainty may help improve these models’ performances by better supporting the “human” part of human-in-the-loop.
“As some of our colleagues so brilliantly put it, uncertainty is a form of transparency, and that’s hugely important,” Collins added. “We need to figure out when we can trust a model and when to trust a human and why. In certain applications, we’re looking at probability over possibilities.”