Discovery Is No Longer the Bottleneck

27 May, 2026

Where AI vulnerability scanners actually belong in the SSDLC, and why finding the bug was never the hard part

In its first month, Project Glasswing pointed an unreleased model, Claude Mythos Preview, at more than a thousand of the open-source projects that hold up the modern internet. The scan flagged 23,019 potential vulnerabilities, of which Anthropic estimated 6,202 to be of high or critical severity. Among the confirmed findings were a flaw in OpenBSD that had survived 27 years and one in FFmpeg untouched for 16, bugs that had gotten past every prior round of human review and automated testing. These are not the issues your linter nags you about on save. They are the kind that outlive the careers of the people who wrote the code, and the model found them in weeks.

That much is real, and it settles one debate: AI-native discovery genuinely finds a class of bug that humans and pattern matchers have missed for decades. The interesting part is what happened next.

Of the high- and critical-rated findings put through additional review, Anthropic reports 1,587 validated as true positives (a 90.6% true-positive rate) and 1,094 confirmed as high or critical. It has disclosed 530 of those to maintainers so far. At the time of the update, 75 had been patched.

Sit with that last ratio, because it is the entire argument. The model can surface critical vulnerabilities faster than the ecosystem can fix them, so much faster that some open-source maintainers have asked Anthropic to slow down its rate of disclosure, because they need more time to design patches. Anthropic's own framing is blunt: the constraint on software security used to be how quickly we could find vulnerabilities; now it is how quickly we can verify, disclose, and patch the ones AI finds.
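The funnel behind that ratio can be worked out directly from the figures quoted above. The sketch below only restates the reported numbers; the variable names are mine, and the implied size of the reviewed set is a back-calculation from the 90.6% rate, not a published figure.

```python
# Figures as reported in the Glasswing update quoted above.
flagged = 23_019                 # potential vulnerabilities flagged
est_high_critical = 6_202        # estimated high or critical severity
validated = 1_587                # validated true positives after review
true_positive_rate = 0.906       # reported true-positive rate
confirmed_high_critical = 1_094  # confirmed high or critical
disclosed = 530                  # disclosed to maintainers
patched = 75                     # patched at the time of the update

# Back-calculated assumption, not a published number:
# how many findings the additional review must have covered.
implied_reviewed = round(validated / true_positive_rate)

disclosed_share = disclosed / confirmed_high_critical
patched_share = patched / disclosed

print(f"implied findings reviewed: ~{implied_reviewed:,}")
print(f"disclosed / confirmed high or critical: {disclosed_share:.1%}")
print(f"patched / disclosed: {patched_share:.1%}")
```

On these numbers, roughly one in seven disclosed findings had a patch, and about half of the confirmed high or critical findings had not yet been disclosed at all.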

For many years, application security has operated on an implicit premise: that we are discovery-constrained. If we could just see more of the bugs, sooner, we would be safer. Glasswing is the clearest evidence yet that this premise has flipped. Discovery is becoming cheap, fast relative to humans, and effectively unbounded. What that exposes is uncomfortable: finding the bug was rarely the expensive part. Reproducing it, proving it is reachable, assigning it an owner, and shipping a fix were always the constraint. We mostly hid that behind backlogs.

So the question worth an engineer's time is not whether a Mythos-class model is impressive. It is. The question is where a capability like this belongs in a pipeline that has to ship software on Tuesday, and what a security program looks like when the thing it was organized around, finding the next vulnerability, stops being the hard part.

What Glasswing actually is, and what it isn't

Two facts about the technology change the shape of the whole question, and most coverage skips both.

Project Glasswing is a coalition, launched in April 2026, of roughly fifty organizations, among them AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, NVIDIA, and the Linux Foundation. Claude Mythos Preview is the model underneath it: an unreleased frontier model Anthropic describes as purpose-built to find and remediate flaws in critical software, backed by up to $100M in usage credits for participants.

The first fact: it is not a product. Anthropic has said it does not plan to release Mythos Preview. You cannot install it, drop it into a CI job, or buy a seat; access is gated to the coalition. So any discussion of "adopting" it today is a thought experiment, not a procurement decision.

The second fact is the reason the first one matters: the model is gated because it is dangerous in the open. Anthropic states plainly that no one, including Anthropic, has yet built safeguards strong enough to stop a model this capable from being misused offensively. The capability is real enough that the controlling concern is proliferation.

That is precisely why this is worth thinking through even though you can't use the tool. The capability class (deep, agentic, reasoning-based vulnerability discovery) will diffuse. Cheaper, weaker, purchasable versions will arrive from every vendor with a model and a security story to tell. The useful question is not "should I adopt Mythos?" It is: when Mythos-class discovery becomes something I can rent, where in my SSDLC does it belong, what does it cost, and what does it break downstream?

The SSDLC is a latency budget, and that's where this gets decided

The cleanest way to place any scanner is to stop arguing about capability and start reasoning about budget per stage. Every stage of the secure development lifecycle has an implicit ceiling on how long a check can take and how much it can cost per run. Tools live or die by whether they fit that ceiling, not by how clever they are.

Inner loop, IDE and pre-commit. Budget: seconds, at near-zero marginal cost, because it runs on every save and every commit for every developer. This is the home of linters, lightweight pattern-based SAST, and secret scanning. The only credible role for AI here is small, fast models offering inline autofix hints. A deep agentic audit at this tier is a non-starter, the equivalent of running a full penetration test on every keystroke.

CI gate, pull request. Budget: minutes, and deterministic enough to gate a merge without becoming a coin flip. This is where Semgrep, Trivy, and conventional SAST/SCA belong. AI's realistic role at this tier is not discovery but triage: filtering the deterministic scanner's output to suppress false positives before a human sees them.

Pre-release and periodic deep audit. Budget: hours to days, with high cost tolerance because it runs rarely and only on what matters. This is the tier where Mythos-class agentic discovery makes sense: a release-gated or quarterly deep pass over crown-jewel components such as the auth service, the cryptography, and the internet-facing parser.

Post-release and continuous. This is Glasswing itself, run by someone else against the open source you depend on. Their discovery becomes your SCA feed. You consume the results (CVEs and advisories) through existing dependency management, not by running the model.

The takeaway is uncomfortable for the hype cycle: a Mythos-class model is structurally incompatible with the part of the pipeline where developers actually experience security, namely the inner loop. It is too slow and too expensive to live there, regardless of how good it is. It belongs in the deep, infrequent tiers, and the value it produces there is real only if the stages downstream can absorb what it finds.
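One way to make the budget-per-stage reasoning concrete is to write the ceilings down and test each tool against them. This is a minimal sketch, not a real scheduler: the `Stage` and `Check` types, the ceiling values, and the example tool timings are all hypothetical numbers chosen to match the orders of magnitude described above (seconds, minutes, hours to days). The post-release tier is left out because someone else runs it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_seconds: float           # latency ceiling for one run
    max_cost_usd: float          # cost ceiling for one run
    must_be_deterministic: bool  # True if the result gates a merge


@dataclass(frozen=True)
class Check:
    name: str
    typical_seconds: float
    typical_cost_usd: float
    deterministic: bool


# Hypothetical ceilings for illustration; substitute your own.
STAGES = [
    Stage("inner loop", 5, 0.01, False),
    Stage("CI gate", 600, 1.0, True),
    Stage("periodic deep audit", 3 * 24 * 3600, 5000.0, False),
]


def stages_that_fit(check: Check) -> list[str]:
    """Return the names of the stages whose budget the check stays within."""
    fits = []
    for stage in STAGES:
        within_time = check.typical_seconds <= stage.max_seconds
        within_cost = check.typical_cost_usd <= stage.max_cost_usd
        reproducible = check.deterministic or not stage.must_be_deterministic
        if within_time and within_cost and reproducible:
            fits.append(stage.name)
    return fits


# Example tools with made-up timings and costs.
linter = Check("linter", 1, 0.0, True)
autofix_hint = Check("small-model autofix hint", 2, 0.005, False)
agentic_audit = Check("agentic deep audit", 8 * 3600, 1000.0, False)

for check in (linter, autofix_hint, agentic_audit):
    print(check.name, "->", stages_that_fit(check))
```

With these placeholder numbers the agentic audit fits only the periodic tier, and it is excluded from the CI gate twice over: once on budget and once on determinism.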

The economics, done properly

Cost is where most commentary waves its hands, and where a practitioner can be useful. I'll reason carefully because Anthropic published credits, not a price sheet.

Let's start with what is public. By one published benchmark, shallow AI code review, the kind that summarizes a pull request, runs roughly $15–25 per PR on token usage. That looks harmless until it scales: a 400-developer organization producing on the order of 5,000 PRs a month lands at roughly $75,000 to $125,000 a month, before anyone runs a full-repository scan. And that is the cheap, shallow case.
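The arithmetic behind that range is short enough to check by hand. The sketch below uses only the figures quoted in the paragraph above; the per-developer breakdown is my own derived number, added to show what the bill looks like per seat.

```python
prs_per_month = 5_000
developers = 400
cost_per_pr_usd = (15, 25)  # low and high ends of the quoted per-PR range

# Monthly spend at each end of the range.
monthly_low, monthly_high = (prs_per_month * c for c in cost_per_pr_usd)

# Derived figure: the same spend expressed per developer.
per_dev_low = monthly_low / developers
per_dev_high = monthly_high / developers

print(f"monthly: ${monthly_low:,} to ${monthly_high:,}")
print(f"per developer per month: ${per_dev_low:,.2f} to ${per_dev_high:,.2f}")
```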

Now layer on what research says about agentic workloads, which is what Mythos-class discovery is. A Stanford Digital Economy Lab / Microsoft Research study found that agentic tasks consume on the order of 1000x more tokens than ordinary code reasoning or chat; that runs on the same task can vary by up to 30x in total tokens; and, critically, that higher token spend does not reliably translate into higher accuracy. Spend is dominated by input tokens, and the models cannot reliably predict their own consumption.

Three engineering consequences fall out, and none of them are finance's problem to solve. They are yours:

Deep agentic discovery is orders of magnitude more expensive per run than anything in your current pipeline, and its cost is stochastic. You cannot put an unbounded-variance, high-cost-per-run step inside a per-commit gate and keep a predictable bill. This is a second, independent reason it belongs in the periodic-audit tier rather than the inner loop.
You cannot spend your way to better results. Because accuracy saturates, and can even peak at intermediate cost, "give it more budget" is not a strategy. Where you point the expensive model matters more than how much you spend on it.
The metric that matters is cost-per-verified-finding, not cost-per-scan. A free, deterministic Semgrep run that yields a reproducible true positive in sixty seconds can be cheaper per closed bug than a thousand-dollar agentic audit that produces fifty plausible findings nobody can triage in time.
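Cost-per-verified-finding can be sketched as a single function. This is an illustration under stated assumptions, not a standard formula: the function name, the choice to count human triage time as a cost, and every dollar figure in the example are hypothetical. Only the shape of the comparison (a free deterministic run versus a thousand-dollar audit with fifty findings) comes from the text above.

```python
def cost_per_verified_finding(
    run_cost_usd: float,
    triaged: int,
    verified: int,
    triage_cost_usd_each: float,
) -> float:
    """Total spend (scan plus human triage) divided by verified findings.

    Returns infinity when nothing was verified: the money was spent
    and no confirmed finding came out of it.
    """
    total = run_cost_usd + triaged * triage_cost_usd_each
    if verified == 0:
        return float("inf")
    return total / verified


# Hypothetical numbers. The deterministic run is free and yields one finding
# that takes a little engineer time to confirm.
deterministic = cost_per_verified_finding(
    run_cost_usd=0, triaged=1, verified=1, triage_cost_usd_each=50
)

# The agentic audit costs $1,000 and produces fifty findings, but the team
# only has time to triage five of them, and two hold up.
agentic = cost_per_verified_finding(
    run_cost_usd=1000, triaged=5, verified=2, triage_cost_usd_each=200
)

print(f"deterministic: ${deterministic:,.2f} per verified finding")
print(f"agentic:       ${agentic:,.2f} per verified finding")
```

The forty-five findings nobody reached do not appear in the numerator or the denominator here, which understates their cost: they still sit in the backlog.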

That last point is the bridge to the real problem.

Replace SAST, or enhance it? Neither, and that's the wrong axis

This is the question every reader scans for, so here is the direct answer: AI discovery does not replace SAST, SAST does not make it redundant, and the durable architecture is hybrid, split by tier, not by capability.

Deterministic SAST keeps the fast lane. It is reproducible, cheap, and gateable, properties an agentic model does not have and a merge gate cannot do without. You can block a PR on a deterministic finding and defend that decision; you cannot reliably block a merge on a non-deterministic one.

AI reasoning earns two distinct roles, and conflating them is most of the confusion in the market:

AI-assisted: the model sits after a traditional scanner and filters its output, suppressing false positives and explaining real ones. The published research is consistent here: hybrid LLM-plus-static-analysis configurations outperform either alone, raising recall while filtering noise more effectively than the deterministic tool by itself. This is the role that pays off today.
AI-native: the model is the discovery engine, reasoning over control flow and intent the way a human auditor would, finding cross-file and business-logic flaws that pattern matchers structurally cannot see. This is the Mythos tier. The decades-old OpenBSD and FFmpeg findings are the proof it is real. It is also slow and expensive, which is exactly why it cannot be your CI gate.

The nuance the hype skips runs both ways: AI-native discovery genuinely finds a class of bug deterministic tooling cannot, and it belongs in a narrower place than the vendors selling it would like. Both are true. More capable is not the same as more deployable.
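The split by tier can be expressed as a merge-gate policy. The sketch below is one possible reading of the argument, not any scanner's real API: `Source`, `Finding`, `gate`, and `merge_allowed` are hypothetical names, and routing AI-suppressed findings to an advisory list (rather than dropping them) is my own design assumption so that suppression stays auditable.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    DETERMINISTIC = "deterministic"  # rule-based SAST or SCA
    AI_NATIVE = "ai_native"          # agentic, reasoning-based discovery


@dataclass(frozen=True)
class Finding:
    rule_id: str
    source: Source
    # Verdict from an AI-assisted triage layer sitting after the scanner.
    triaged_false_positive: bool = False


def gate(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (blocking, advisory).

    Only deterministic findings that survive triage may block a merge.
    AI-native findings and suppressed findings are surfaced, never enforced.
    """
    blocking, advisory = [], []
    for f in findings:
        if f.source is Source.DETERMINISTIC and not f.triaged_false_positive:
            blocking.append(f)
        else:
            advisory.append(f)
    return blocking, advisory


def merge_allowed(findings: list[Finding]) -> bool:
    blocking, _ = gate(findings)
    return not blocking


findings = [
    Finding("sql-injection", Source.DETERMINISTIC),
    Finding("hardcoded-test-key", Source.DETERMINISTIC, triaged_false_positive=True),
    Finding("auth-bypass-hypothesis", Source.AI_NATIVE),
]
blocking, advisory = gate(findings)
print("blocking:", [f.rule_id for f in blocking])
print("advisory:", [f.rule_id for f in advisory])
print("merge allowed:", merge_allowed(findings))
```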

The real work is verification, routing, and remediation

Here is the part only someone who has run a vulnerability management program will tell you, and it is the most important section.

When discovery becomes cheap and abundant, triage stops being a queue you work down and becomes the whole game. This was already true before AI: industry data puts a large share of application-security time into triage, and false-positive rates on traditional SAST against unfamiliar code have been measured as high as 91%. We were already strained. A Mythos-class firehose does not relieve that pressure; it multiplies it. The Glasswing numbers are the proof: discovery up by an order of magnitude, patches in the dozens.

Which leads to the principle this whole piece circles:

A finding you cannot verify quickly is not an asset. It is a liability with good PR.

A vulnerability report without a reproducible proof of concept, a reachability verdict, and a named owner does not make you safer. It consumes attention, inflates the backlog, and erodes trust in the toolchain, and once developers stop trusting the scanner, every finding it produces, true or false, gets ignored equally. Volume without verification is how you train an organization to disregard security.

So the design rules that matter in a Mythos-class world are not about discovery at all:

Discovery tools never write straight to the tracker. Verification status is the gate. A finding earns a ticket by being confirmed, not by being generated. This is the first discipline that volume breaks, and the most important to hold.
Reachability and exploitability analysis become mandatory pre-filters, not nice-to-haves. The only sustainable way to face thousands of findings is to never look at the ones that aren't reachable in production.
The unit of value shifts from "found" to "fixed." Budget the patch pipeline (owner routing, autofix PRs, SLAs keyed to reachability) before you budget the scanner. The 75-of-530 number is what happens when you don't.
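The first two rules amount to a gate between the discovery tool and the tracker. Here is a minimal sketch of that gate, assuming a finding carries the three attributes named earlier (a reproducible proof of concept, a reachability verdict, and an owner). The `Finding` fields, the `ticket_decision` function, and the outcome labels are all hypothetical; a real program would plug in its own reachability analysis and ownership data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    title: str
    has_reproducible_poc: bool
    reachable_in_production: Optional[bool]  # None means not yet analysed
    owner: Optional[str]


def ticket_decision(f: Finding) -> str:
    """Decide what happens to a finding before it can touch the tracker."""
    # Reachability is a pre-filter: unreachable findings never reach a human.
    if f.reachable_in_production is None:
        return "needs-reachability-analysis"
    if not f.reachable_in_production:
        return "park"
    # Verification status is the gate: generated is not the same as confirmed.
    if not f.has_reproducible_poc:
        return "needs-verification"
    # A confirmed finding without an owner is still not actionable.
    if not f.owner:
        return "needs-owner"
    return "ticket"


examples = [
    Finding("parser overflow", True, True, "team-ingest"),
    Finding("dead-code injection", True, False, "team-web"),
    Finding("plausible auth flaw", False, True, None),
    Finding("unrouted crypto bug", True, True, None),
]
for f in examples:
    print(f"{f.title}: {ticket_decision(f)}")
```

Only the first example earns a ticket. The other three are held in states that say what work is missing, which is the information a backlog entry alone does not carry.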

What a defensible 2026–2027 architecture looks like

None of this is an argument against AI in the pipeline. It is an argument for putting it where the budget and the physics allow, and for building the thing downstream that makes it worth anything:

Inner loop: fast deterministic checks (linting, lightweight SAST, secret scanning) plus small-model autofix hints. Seconds, cheap, local.
CI gate: deterministic SAST and SCA that can block a merge, with an AI-assisted triage layer filtering the output so humans see signal, not volume.
Periodic deep audit: Mythos-class agentic discovery pointed only at crown-jewel components, on a release or quarterly cadence, with its cost variance budgeted and contained.
Remediation pipeline: the part almost everyone underbuilds. It needs reachability-based prioritization, ownership routing, autofix PRs, and SLAs that can actually absorb the volume the audit tier produces.

If you take one line from this, take this one: budget for the patch pipeline before you budget for the scanner.

AI vulnerability discovery is real, powerful, and, in the wrong hands, dangerous. But its defensive value does not come from the finding. It comes from whether your organization can verify, route, and fix faster than the model can generate. Discovery is no longer the bottleneck. The organizations that internalize that will be the ones that are genuinely more secure, rather than just better informed about how exposed they were all along.