Microsoft Copilot Researcher Now Uses Both GPT and Claude Together: What Changed and Why It Matters

On March 30, 2026, Microsoft announced a significant update to the Researcher agent inside Microsoft 365 Copilot. For the first time, Copilot combines two AI models from competing companies in a single workflow: OpenAI's GPT handles the initial research, and Anthropic's Claude reviews and critiques the result before it reaches the user.

The result, according to Microsoft's own benchmark testing, outscores the standalone deep research tools from OpenAI, Google, Perplexity, and Anthropic on the DRACO benchmark.

What Researcher Is

Researcher is Microsoft 365 Copilot's dedicated agent for deep research tasks inside the flow of work. Unlike a standard chatbot that answers a single question, Researcher tackles complex, multi-step research queries across ten different domains, drawing on sources, synthesising findings, and producing structured reports.

Before this update, Researcher relied on a single AI model to handle the entire process from planning to retrieval to writing. The March 2026 update changes that architecture fundamentally.

The Two New Features: Critique and Council

Critique: GPT Drafts, Claude Reviews

Critique is the more significant of the two new additions. It separates the research process into two distinct phases handled by two different AI models.

In the first phase, a GPT model plans the research task, iterates through source retrieval, and produces an initial draft. In the second phase, Claude acts as an expert reviewer. It checks the draft for accuracy, completeness, and citation quality, then refines it before the final report reaches the user.

Microsoft describes this as dividing responsibilities between two AI partners: one optimised for deep exploration and structured synthesis, the other focused on validation and refinement. The logic is that reviewing someone else's work is a different cognitive task from producing it, and different AI models have different strengths along that axis.
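The draft-then-review split described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Microsoft's implementation: the `Model` type, the `critique_pipeline` function, the prompts, and the stub models are all hypothetical names invented here, and no real GPT or Claude API is called.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a "model" is any function that maps a prompt to text.
Model = Callable[[str], str]


@dataclass
class Report:
    draft: str   # output of the drafting phase
    review: str  # reviewer's notes on the draft
    final: str   # draft revised in light of the review


def critique_pipeline(question: str, drafter: Model, reviewer: Model) -> Report:
    """Two-phase flow: one model drafts, a different model reviews and refines."""
    draft = drafter(f"Research and draft a report on: {question}")
    review = reviewer(
        "Check this draft for accuracy, completeness, and citation quality:\n" + draft
    )
    final = reviewer(
        f"Revise the draft using these review notes.\nDRAFT:\n{draft}\nNOTES:\n{review}"
    )
    return Report(draft=draft, review=review, final=final)


# Stub models (assumptions) so the sketch runs without any external service.
def stub_drafter(prompt: str) -> str:
    return "DRAFT: " + prompt.splitlines()[0]


def stub_reviewer(prompt: str) -> str:
    if prompt.startswith("Check"):
        return "NOTES: add a citation for claim 1"
    return "FINAL: revised draft with citation added"


report = critique_pipeline("solid-state batteries", stub_drafter, stub_reviewer)
print(report.final)
```

Keeping the drafter and reviewer as separate callables mirrors the division of responsibilities in the article: either role could be swapped for a different model without touching the other.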

Council: Multiple Models Side by Side

Council is the second new feature. It brings multiple model responses side by side within the Researcher interface. A cover letter accompanies the results, explaining where the models agree, where they diverge, and what unique insight each one brings to the topic.

This gives users a comparative view rather than a single synthesised answer, which is particularly useful for research questions where different analytical perspectives produce meaningfully different conclusions.
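As a rough illustration of the comparative view, the sketch below takes the key claims from several model responses and reports which claims all models share and which are unique to one model. It assumes claims have already been extracted and normalised into comparable strings, which is the hard part in practice; `council_summary` and the model names are hypothetical and do not reflect how Council works internally.

```python
def council_summary(responses: dict[str, set[str]]) -> dict:
    """Compare claim sets from several models.

    Returns the claims every model makes ("agree") and, per model,
    the claims that no other model makes ("unique").
    """
    claim_sets = list(responses.values())
    shared = set.intersection(*claim_sets) if claim_sets else set()

    unique = {}
    for name, claims in responses.items():
        # Union of the claims made by every other model.
        others = set().union(*(c for n, c in responses.items() if n != name))
        unique[name] = sorted(claims - others)

    return {"agree": sorted(shared), "unique": unique}


# Illustrative inputs only; the claims and model names are made up.
responses = {
    "model_a": {"demand is rising", "costs are falling"},
    "model_b": {"demand is rising", "supply is constrained"},
}

summary = council_summary(responses)
print(summary["agree"])
print(summary["unique"])
```

Exact string matching is a deliberate simplification here; a real comparison would need some form of semantic matching to decide when two differently worded claims agree.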

The Benchmark Results

Microsoft tested Critique against the DRACO benchmark, which stands for Deep Research Accuracy, Completeness, and Objectivity. DRACO comprises 100 complex research tasks across 10 domains, introduced by researchers from Perplexity and academia in February 2026.

Researcher with Critique scores 7.0 points higher than the previous best-performing system on the benchmark, which was Perplexity Deep Research running Claude Opus 4.6. That represents a 13.88 percent improvement. Microsoft used OpenAI's GPT-5.2 as the evaluation judge, which it describes as the strictest of the three judge models reported in the benchmark paper.
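The two figures quoted above, a 7.0-point gain and a 13.88 percent improvement, together imply approximate absolute scores. The short calculation below works them out under one assumption: that the percentage is measured relative to the previous best score. The resulting values are implied by the arithmetic, not figures reported in the article.

```python
# Figures quoted in the article.
point_gain = 7.0        # points above the previous best system
relative_gain = 0.1388  # 13.88 percent

# Assumption: the percentage is relative to the previous best score,
# so previous_score * relative_gain == point_gain.
implied_previous = point_gain / relative_gain
implied_new = implied_previous + point_gain

print(f"implied previous best: {implied_previous:.1f}")
print(f"implied Researcher with Critique: {implied_new:.1f}")
```

Under that assumption the previous best system would sit at roughly 50.4 and Researcher with Critique at roughly 57.4.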

The improvement puts Copilot's Researcher ahead of standalone deep research tools from OpenAI, Google, Perplexity, and Anthropic on this specific benchmark.

Why Microsoft Is Using Claude Alongside GPT

The decision to use Claude inside a Microsoft product is commercially striking. Microsoft has an existing multibillion-dollar partnership with OpenAI, and Claude is made by Anthropic, an OpenAI competitor. Using the two together in the same product pipeline would have been unthinkable a year ago.

The explanation is competitive pressure. Microsoft reported 15 million paid Copilot seats in January 2026, representing roughly 3.3 percent of its 450 million commercial Microsoft 365 users. Adoption has been slower than Microsoft anticipated. At the same time, Google has been expanding Gemini across Workspace, Anthropic's Claude has seen growing enterprise adoption, and OpenAI has been pushing ChatGPT Enterprise with deep research capabilities of its own.

Multi-model architecture addresses a genuine problem: no single AI model is best at every task. The premise is that using GPT for its strengths in exploration and synthesis, then Claude for its strengths in accuracy review and citation quality, produces better outputs than either model would produce alone. The 13.88 percent improvement on DRACO is Microsoft's measurable evidence that the architecture works.

Microsoft is also not limiting Copilot to OpenAI models elsewhere. Within Copilot Studio, enterprise users can select Claude Opus 4.6 and Claude Sonnet 4.5 for heavy reasoning tasks. GitHub Copilot offers Google's Gemini 2.5 Pro for coding contexts. The March 2026 announcement of Critique and Council fits a broader pattern of Copilot becoming a multi-model platform rather than an OpenAI wrapper.

What This Means for Enterprise Research

For Microsoft 365 users doing serious research inside Copilot, the practical implication is better output quality without any change to the interface or workflow. Critique runs automatically. You do not choose which model handles which phase. You receive a single refined report that has already been checked for accuracy and citation quality by a second model.

Council adds the ability to see competing perspectives on complex questions, which is useful for research where the framing of the question affects the conclusions. Seeing where different models agree and where they diverge is more informative than a single confident answer on genuinely contested topics.

Both Critique and Council are available through Microsoft's Frontier early access program as of the announcement date.

Frequently Asked Questions

Does using Claude inside Copilot mean Microsoft is moving away from OpenAI?

No. Microsoft's partnership with OpenAI remains intact. GPT models continue to handle the primary research and drafting phase inside Researcher. Claude is added as a second model specifically for the review and critique phase. Microsoft is expanding to a multi-model architecture rather than replacing one partner with another.

How is this different from just using Copilot and Claude separately?

When you use Copilot and Claude as separate tools, you manage the handoff yourself. You copy output from one into the other, review the results, and decide what to accept. Critique automates this entire process inside a single workflow. GPT drafts, Claude critiques, and you receive the refined output without any manual intervention between the two models.

What is the DRACO benchmark and is it a reliable measure?

DRACO stands for Deep Research Accuracy, Completeness, and Objectivity. It was introduced by researchers from Perplexity and academia in February 2026 and tests 100 complex research tasks across 10 domains. Microsoft applied the same evaluation protocol published in the original benchmark paper and used GPT-5.2 as the evaluation judge. The benchmark provides a standardised comparison across systems, though as with all benchmarks, real-world performance on specific tasks may differ from aggregate scores.
