Microsoft AI for Health: MAI-DxO Is 4 Times Better at Diagnosis Than Doctors

Microsoft has released research that demonstrates a practical way for large language model software to help doctors make difficult diagnoses while spending less money on tests. 

The project has two main pieces. The first is a test set called the Sequential Diagnosis Benchmark, or SDBench. It turns 304 detailed New England Journal of Medicine case reports into step-by-step puzzle scripts. In each script, the decision maker — whether a human doctor or an AI model — receives only a short opening note. The decision maker must then ask questions or order tests one at a time, just as doctors do during a real consultation. A separate “gatekeeper” program releases an answer only if the request is specific. If the original article does not contain a requested lab value, the gatekeeper invents a realistic number so that no one can guess the right diagnosis from missing data. At the end, the test checks whether the final answer matches the article and it adds up the dollar cost of every question and every test.
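The gatekeeper's rules can be made concrete with a small sketch. The Python below is an illustration only, not Microsoft's implementation: the `Gatekeeper` class, its two-word specificity check, the fixed placeholder returned for missing values, and every price are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Toy stand-in for the SDBench gatekeeper (hypothetical design)."""
    findings: dict   # request -> result stated in the case report
    prices: dict     # request -> assumed dollar price (made-up values)
    synthetic_default: str = "within normal limits"  # placeholder for invented results
    total_cost: int = 0

    def request(self, item: str) -> str:
        item = item.strip().lower()
        # Crude specificity rule (an assumption): one-word requests are too vague.
        if len(item.split()) < 2:
            return "Please make a more specific request."
        self.total_cost += self.prices.get(item, 0)
        # If the case report lacks the value, return a plausible synthetic one
        # so that missing data does not leak the diagnosis.
        return self.findings.get(item, self.synthetic_default)

    def score(self, diagnosis: str, truth: str):
        """Return (is_correct, dollars_spent) at the end of the case."""
        return diagnosis.strip().lower() == truth.strip().lower(), self.total_cost


gk = Gatekeeper(
    findings={"serum sodium": "128 mmol/L"},
    prices={"serum sodium": 20, "chest ct": 500},
)
print(gk.request("labs"))          # too vague, nothing released
print(gk.request("Serum sodium"))  # value taken from the case report
print(gk.request("chest CT"))      # not in the report, synthetic result
print(gk.score("Hyponatremia", "hyponatremia"))
```

In this sketch a vague request costs nothing and reveals nothing, while a specific request is charged whether or not the original article contained the result.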

The second piece is a control tool called the MAI Diagnostic Orchestrator, or MAI-DxO. It does not hold medical knowledge of its own. Instead, it tells a modern language model — such as GPT-4o, Gemini, Claude, Llama, or Grok — how to behave like a careful medical team. 

The orchestrator splits the job among five virtual “doctors”:

  1. One keeps a ranked list of possible diseases. 
  2. One chooses the next question or test that should remove the most doubt. 
  3. One acts as a skeptic and keeps looking for other explanations. 
  4. One watches the running bill. 
  5. One checks that every step follows basic safety rules. 

The orchestrator repeats this loop until extra tests would not change the top answer. It then delivers a single diagnosis and a summary of what it spent to reach the conclusion. Because the system is model-agnostic, Microsoft can plug in different language models without changing the control logic.
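The loop above can be sketched without any language model at all. The Python below is a toy under stated assumptions, not the MAI-DxO code: the five roles become plain functions, the differential is a dictionary of probabilities updated with Bayes' rule, and a confidence threshold stands in for "extra tests would not change the top answer". All disease probabilities, test characteristics, and prices are invented for illustration.

```python
def rank(differential):
    """Role 1: diseases ordered from most to least likely."""
    return sorted(differential, key=differential.get, reverse=True)


def choose_test(differential, candidates):
    """Roles 2 and 3: pick the test that best separates the leading
    diagnosis from its strongest rival, so the rival stays in view."""
    top, rival = rank(differential)[:2]
    return max(
        candidates,
        key=lambda name: abs(candidates[name]["p_pos"][top]
                             - candidates[name]["p_pos"][rival]),
    )


def update(differential, test, positive):
    """Bayes update of the differential after one test result."""
    posterior = {}
    for disease, prior in differential.items():
        p = test["p_pos"][disease]
        posterior[disease] = prior * (p if positive else 1 - p)
    total = sum(posterior.values())
    return {d: v / total for d, v in posterior.items()}


def diagnose(differential, tests, results, budget, threshold=0.85):
    spent, ordered = 0, []
    while max(differential.values()) < threshold:
        # Role 4: only consider unused tests the remaining budget covers.
        affordable = {
            name: t for name, t in tests.items()
            if name not in ordered and spent + t["cost"] <= budget
        }
        if not affordable:
            break
        name = choose_test(differential, affordable)
        # Role 5: a basic consistency check before acting on the request.
        if name not in results:
            raise KeyError(f"no result available for {name!r}")
        differential = update(differential, tests[name], results[name])
        spent += tests[name]["cost"]
        ordered.append(name)
    return rank(differential)[0], spent, ordered


# Invented toy case: priors, P(positive | disease), and prices are not real data.
differential = {"pneumonia": 0.5, "pulmonary embolism": 0.3, "heart failure": 0.2}
tests = {
    "chest x-ray": {"cost": 100, "p_pos": {
        "pneumonia": 0.9, "pulmonary embolism": 0.1, "heart failure": 0.3}},
    "d-dimer": {"cost": 50, "p_pos": {
        "pneumonia": 0.3, "pulmonary embolism": 0.95, "heart failure": 0.3}},
    "ct angiogram": {"cost": 1200, "p_pos": {
        "pneumonia": 0.05, "pulmonary embolism": 0.95, "heart failure": 0.05}},
}
results = {"chest x-ray": True, "d-dimer": False, "ct angiogram": False}

diagnosis, spent, ordered = diagnose(differential, tests, results, budget=500)
print(diagnosis, spent, ordered)
```

With a $500 budget the expensive scan is never considered, and the loop stops once the leading diagnosis passes the threshold. In a real orchestrator each role would be a prompt to the underlying language model and not a fixed formula, which is what keeps the control logic independent of the model plugged in.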

Microsoft paired the orchestrator with OpenAI’s o3 model for its headline demonstration. Under those conditions, the AI reached the correct diagnosis in roughly 85 percent of SDBench cases according to the press material, and in about 80 percent according to the formal paper.

A comparison group of 21 experienced physicians, drawn from the United States and the United Kingdom and averaging 12 years in practice, solved about 20 percent of the same problems. 

Those doctors were not allowed to use online references or language models, a rule meant to keep the playing field level with the AI, which also had to rely only on information revealed by the gatekeeper.

Cost figures point in the same direction. The doctors spent just under $3,000 per case if one counts $300 for every consultation round and standard 2023 U.S. prices for each test. The orchestrated o3 model spent about $2,400. When the researchers ran o3 on its own, without the orchestrator controlling the process, the model asked for many more tests. Its accuracy stayed high, but the bill climbed to nearly $8,000 per case. 
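The accounting behind these figures is simple enough to write down. The sketch below assumes the $300-per-round fee quoted above and uses the article's rounded per-case costs; the function names and the example test prices are hypothetical.

```python
VISIT_FEE = 300  # dollars per consultation round, as quoted in the article


def case_cost(rounds, test_prices):
    """Total cost of one case: consultation rounds plus every ordered test."""
    return VISIT_FEE * rounds + sum(test_prices)


def relative_spend(cost, baseline):
    """Cost as a fraction of a baseline (1.0 means equal spend)."""
    return cost / baseline


# Rounded per-case figures from the article (approximate, in dollars).
reported = {"physicians": 3000, "o3 with MAI-DxO": 2400, "o3 alone": 8000}

# A made-up case: three consultation rounds and two tests.
print(case_cost(3, [120, 450]))

for name, cost in reported.items():
    share = relative_spend(cost, reported["physicians"])
    print(f"{name}: ${cost} ({share:.0%} of physician spend)")
```

On these rounded numbers the orchestrated model spends about four fifths of what the physicians did, while the unorchestrated model spends well over twice as much.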

Microsoft argues that this spread shows the value of a formal process that forces the model to think in small steps and keep an eye on cost.

The study also makes its boundaries clear. All SDBench cases are difficult teaching cases, not routine coughs, rashes, or hypertension visits. The benchmark ignores regional price swings, insurance discounts, test wait times, and the discomfort a patient feels during a procedure. The doctors in the trial worked as unassisted generalists, though in real life they would call in specialists for rare conditions. For these reasons, Microsoft labels the work an early proof of concept, not a finished clinical product. The company states that the orchestrator has not yet been used on live patient data outside controlled tests.

Alongside MAI-DxO, Microsoft is promoting another tool, DxGPT, that focuses on rare diseases. The company says DxGPT reaches about 60 percent diagnostic accuracy across all diseases and close to 50 percent for rare disorders, numbers that put it in the same range as a trained clinician. 

DxGPT is already running in the Madrid regional health service, where 6,000 doctors may consult it, and the company estimates around 500,000 patients have benefited from its suggestions. DxGPT is available through the Azure Marketplace, which lets hospitals that already use Microsoft’s cloud add the service with limited extra effort.

Both tools sit within a health-specific unit that Microsoft created at the end of 2024. Mustafa Suleyman, who co-founded DeepMind and later led the startup Inflection, joined Microsoft that year and now oversees all consumer AI products as well as the health group. Dominic King, a physician and former Google Health executive, serves as vice president.

Their mandate is to merge clinical insight, product design, and model research into tools that improve diagnostic accuracy and lower cost. They repeat in public statements that doctors will remain responsible for treatment plans, patient communication, and ethical accountability. The AI’s role, they say, is to support judgment.

A major factor enabling the work is Microsoft’s partnership with OpenAI. From an initial $1 billion stake in 2019, Microsoft’s total commitment has risen to about $14 billion. The contract runs until 2030 but allows OpenAI to leave early if its board declares that it has achieved artificial general intelligence. News reports describe friction over OpenAI’s wish for more commercial freedom, but Suleyman says the alliance remains strong and long-term.

Analysts expect global spending on AI healthcare applications to rise from roughly $8 billion in 2023 to about $200 billion by 2030, driven mainly by tools that improve diagnosis and trim unnecessary testing. Microsoft views MAI-DxO, DxGPT, and the broader Azure platform as a way to capture a large share of that growth. Company disclosures say revenue from healthcare-related cloud services has more than tripled since 2020. Equity analysts maintain a strong buy consensus on Microsoft stock and see further upside, even after years of outperformance.

The clinical need is equally large. A 2023 study by the U.S. Agency for Healthcare Research and Quality estimated that American emergency departments misdiagnose about 7.4 million patients each year, with one in 350 cases ending in death or serious disability. At the same time, redundant tests add billions of dollars to national health costs. Microsoft argues that a system able to reach an accurate answer with fewer procedures can reduce both patient harm and wasteful spending.

Turning research into real-world impact will take time. Microsoft is negotiating with hospital systems to run live trials that feed the orchestrator real electronic health record data under regulatory oversight. Early uses are likely to appear as clinician-facing second-opinion tools. Consumer symptom checkers would follow only after regulators and professional bodies are satisfied that the system is reliable. Because Bing and Copilot already handle about 50 million health-related queries per day, Microsoft could integrate diagnostic suggestions quickly once confidence is high enough.

For a C-suite audience, three messages matter. 

First, structured large language model systems can already outperform experienced generalists on complex cases in controlled settings. 

Second, an orchestration layer that forces the model to ask targeted questions and watch its own budget prevents runaway costs. 

Third, commercial uptake will depend on showing the same gains on typical cases, proving fairness across diverse patient groups, and fitting within existing clinical workflows and liability rules. Microsoft’s size, cloud footprint, and access to frontier models give it a head start, but hospitals, insurers, and regulators will decide how quickly the technology becomes routine care.

If MAI-DxO and related tools clear those hurdles, they could change how diagnostic work is distributed between humans and software. Doctors would focus on final judgment, complex communication, and the hands-on parts of care, while orchestrated AI systems would handle the exhaustive review of differential diagnoses and the cost-benefit arithmetic behind each test order. That shift would not remove physicians from the loop, but it could let them spend less time on information gathering and more on treatment planning and the human side of medicine. 

Whether that potential becomes daily practice will depend on rigorous field trials, transparent error tracking, and clear accountability frameworks. Microsoft’s next steps — live deployments, peer-reviewed validation, and regulatory engagement — will show whether the early performance numbers can survive the complexity of real healthcare environments.

Written by
Delivery Manager
"I've been leading projects and managing teams with core expertise in ERP development, CRM development, SaaS development in HealthTech, FinTech and other domains for 15 years."