Comfy Internals | How we got four rival AI labs to fight over our code reviews

I review a lot of code at Comfy, and most of it is no longer written by humans. An agent drafts the code, I refine it, and the volume I oversee grows while my own typing shrinks. A tired person cannot stay vigilant over that much code. So I stopped trying and built something that can.

The approach: send a PR diff to four models from four different labs, run two passes on each, then have one judge consolidate the results. It runs in CI for a fixed $200 per month. The underlying assumption is counterintuitive: several models from a single lab are one viewpoint in different voices, not several opinions. The remedy for an exhausted reviewer wasn't a better model; it was more lab diversity.

I released it publicly for the team and anyone else who wants it (repository details below). Here is how it works and what it costs.

The problem

Adversarial review is the task I least entrust to my own focus. By the third PR of the afternoon, I'm less critical than on the first, yet bugs persist regardless of timing. Hidden errors, silent type conversions, and off-by-one issues that emerge at scale demand a fresh, skeptical reader; by late afternoon, I'm weary and accommodating.

The process was already mechanical. Paste the diff into one model and ask it to attack the change. Paste it into another and ask for edge cases. Reconcile the lists, then begin my own review. That's a script waiting to be written. I hadn't written it because a single model does the job poorly: it evaluates code with the same biases that generated it and merely echoes my preconceptions.

To clarify "my code" here: this assesses the cloud infrastructure operating ComfyUI, not its rendering engine. Practically, that includes our Go backend (ingest and inference services, OAuth implementation, asset pipeline), MCP server, CI and infrastructure-as-code, and workflow-API-to-graph converter, plus anything I target locally. It hasn't reviewed sampler nodes or CUDA paths. Bugs detected involve concurrency in inference serving, authentication handling, prototype pollution in workflow-graph parsing, and resource exhaustion in upload paths. This scope is intentional and aligns with our actual review volume.

The constraints

Fixed cost limit, not per-PR pricing. Pay-per-use on a busy repository risks unforeseen expenses. The system must fit within one $200/month Cursor Ultra subscription. If it can exceed the budget, someone will disable it eventually.
Operates in CI, not locally. A review that only runs when manually triggered adds unnecessary steps.
Immune to malicious PR manipulation. The diff is attacker-controlled; if the reviewer reads instructions from within the PR, the PR could self-approve.
Runs alongside CodeRabbit, not replacing it. We already use it effectively. I sought a supplementary, distinct perspective, not a substitute.

Why four different labs

The mechanism: models from the same lineage share training biases, so they share blind spots and false alarms. They flag the errors typical of a code shape rather than the specific flaw in front of them. Agreement among four such models is false consensus, which is worse than a single reviewer because it looks like independent validation.

Diverse labs disrupt this. As of mid-2026, the lineup includes top models from OpenAI, Anthropic, Google, and Moonshot (Kimi), each failing uniquely. One fixates on concurrency, another catches API drift, a third spots unclosed resources. Three concurring on a line signals reliability; one outlier finding highlights issues a same-lineage reviewer would miss.
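One way to make "three concurring on a line" concrete is to count distinct labs per flagged line, not distinct reviews, so that two passes from the same lab never count as agreement. The following is a minimal sketch under that assumption; the `Finding` type, the `labs_per_line` and `split_by_agreement` helpers, and the quorum of three are hypothetical illustrations, not code from the published workflow.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    lab: str   # hypothetical lab identifier, e.g. "openai" or "anthropic"
    path: str  # file the finding points at
    line: int  # line number within that file


def labs_per_line(findings):
    """Map (path, line) -> set of distinct labs that flagged that location."""
    seen = defaultdict(set)
    for f in findings:
        seen[(f.path, f.line)].add(f.lab)
    return dict(seen)


def split_by_agreement(findings, quorum=3):
    """Return (consensus, outliers).

    consensus: locations flagged by at least `quorum` distinct labs.
    outliers:  locations flagged by exactly one lab.
    """
    per_line = labs_per_line(findings)
    consensus = {loc for loc, labs in per_line.items() if len(labs) >= quorum}
    outliers = {loc for loc, labs in per_line.items() if len(labs) == 1}
    return consensus, outliers
```

Both sets are worth reading: the consensus set is the high-confidence list, and the outlier set is where a single lineage saw something the others structurally missed.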

For instance, a change implementing image editing for two providers had two reviewers each catch a bug overlooked by the others. Claude alone noted one provider accepts a single image, not multiples allowed by the code, causing failures deep in calls instead of upfront rejection. On the same diff, GPT-5 Codex alone detected dropped content-moderation settings, silently reverting to defaults if safety filtering was increased. Four models from one lab would have approved both.

The obvious counter: isn't this ensemble variance? Wouldn't multiple runs of one strong model at varied temperatures catch identical issues? Some, yes. But temperature reshuffles the same distribution; it doesn't introduce new priors to catch omissions others structurally ignore. Blind spots stem from training, not sampling. I haven't rigorously tested four-temperature-versus-four-labs on labeled data and would welcome such analysis. My hypothesis is lineage diversity provides coverage temperature cannot.

This matters more when an agent drafts code. If Claude authors and reviews, it's the same perspective twice, blind where the author was.

The architecture

It began as a local Cursor CLI command fanning a diff to all four labs. Each model runs two passes: adversarial (assume broken, find flaws) and edge-case (assume happy path works, find problematic inputs). Four models, two passes, eight reviews per PR.

Eight raw reviews are excessive: noisy, redundant, plagued by false consensus. Thus, nothing posts directly to the PR. Output funnels to one judge, the latest Claude Opus, instructed not to trust reviewers. The judge reads actual changed files (reviewers see diffs; judge sees ground truth), categorizing findings as verified, pre-existing, or false-positive, then caps output at ten highest-signal items. Reviewers intentionally over-flag; the judge discards most.
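The judge's final step (keep what survived verification, then cap the output) can be sketched as below. This is an illustration under assumptions, not the judge's actual logic: the source says only that findings are categorized and that the ten "highest-signal" items are kept, so ranking by severity, posting only `verified` findings, and the `JudgedFinding` and `select_for_posting` names are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: "highest-signal" is approximated by severity; lower rank = more severe.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3, "nit": 4}
MAX_ITEMS = 10  # the ten-item cap described above


@dataclass
class JudgedFinding:
    summary: str
    severity: str  # expected to be one of SEVERITY_ORDER's keys
    verdict: str   # "verified", "pre-existing", or "false-positive"


def select_for_posting(findings, cap=MAX_ITEMS):
    """Keep only verified findings, most severe first, truncated to `cap`."""
    verified = [f for f in findings if f.verdict == "verified"]
    # sorted() is stable, so findings of equal severity keep their input order.
    # Unknown severities are ranked as medium.
    ranked = sorted(
        verified,
        key=lambda f: SEVERITY_ORDER.get(f.severity, SEVERITY_ORDER["medium"]),
    )
    return ranked[:cap]
```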

The fan-out uses an 8-cell GitHub Action matrix:

strategy:
  fail-fast: false
  matrix:
    model:
      - gpt-5.3-codex-xhigh
      - claude-opus-4-7-thinking-xhigh
      - gemini-3.1-pro
      - kimi-k2.5
    review_type: [adversarial, edge-case]
# 4 models × 2 review types = 8 independent reviews per PR

I productionized it as a label-triggered GitHub Action. Apply a cursor-review label to fire the fan-out; assignment as reviewer auto-adds it. About 110 PRs have used it so far. It's label-based, not per-PR, for two reasons: aggressive reviews on trivial changes train users to ignore bots, and CodeRabbit already handles every-PR. This is an opt-in deep pass; PRs where both flag the same line get priority.

Three critical details:

Idempotent per HEAD SHA. Re-labeling or retries avoid duplicate reviews or billing for unchanged diffs.
5,000-line diff cap. Larger diffs indicate deeper issues beyond missing reviews.
Prompts reside in a separate unreachable repo. Security-wise, reviewer and judge prompts come from the workflow's own repo, pinned to a ref, never the PR's checkout. Hostile PRs can't alter grading rules.
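The first two details amount to a gate that runs before any model is called. A minimal sketch of such a gate follows, assuming the set of completed runs is available as a set of keys and that the cap counts every line of the diff text. The function names, the key format, and the storage of `completed_keys` are hypothetical; the source does not say how the Action records completed runs or measures diff length.

```python
MAX_DIFF_LINES = 5000  # the diff cap described above


def review_key(repo, pr_number, head_sha):
    """Identity of one review run: the same PR at the same HEAD SHA yields the same key."""
    return f"{repo}#{pr_number}@{head_sha}"


def should_review(repo, pr_number, head_sha, diff_text, completed_keys):
    """Return (run, reason).

    completed_keys is a set of review keys that have already been reviewed.
    """
    if review_key(repo, pr_number, head_sha) in completed_keys:
        return False, "already reviewed at this HEAD SHA"
    # Assumption: the cap counts all lines in the diff, not only changed lines.
    if len(diff_text.splitlines()) > MAX_DIFF_LINES:
        return False, "diff exceeds line cap"
    return True, "ok"
```

Keying on the HEAD SHA rather than the PR number means a new push gets a fresh review, while re-applying the label to an unchanged PR does nothing.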

How I use it, and what it cost

It runs early, not late. When I'm authoring, I run it locally after the agent finishes and before I commit. When I'm reviewing someone else's PR, assignment auto-adds the label, so the pass completes before I look at the diff. I read the bot's verdict first, then the code, and the output stays on the PR for auditability.

For example: on an already-approved change involving paginated lists, four reviewers across three labs flagged the same line. Sorting only worked when the direction was spelled exactly. A blank, a typo, or a raw parameter silently reversed it, risking skipped or repeated items in shared code. When rival models converge on a line that humans have cleared, it merits immediate attention.

Previously, I manually ran this on assigned PRs only. Now: eight adversarial reviews plus a judge on ~110 PRs, fixed $200/month, never exceeding limits. Built in ~24 days and 35 commits, mostly debating "verified" criteria with the judge.

One design choice proved valuable: every finding carries a severity tag (critical/high/medium/low/nit), and a malformed tag defaults to medium. The priority was never losing a critical bug to a formatting error.
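The fallback rule is small enough to sketch. The version below assumes a bracketed tag at the start of each finding, such as "[high] ..."; that tag shape, the regular expression, and the `parse_severity` name are hypothetical, since the source does not show the format reviewers actually emit.

```python
import re

SEVERITIES = ("critical", "high", "medium", "low", "nit")
DEFAULT_SEVERITY = "medium"

# Hypothetical tag shape: a bracketed label at the start of a finding.
_TAG = re.compile(r"^\s*\[([^\]]*)\]")


def parse_severity(finding_text):
    """Return the finding's severity, or medium if the tag is missing or malformed."""
    match = _TAG.match(finding_text)
    if match is None:
        return DEFAULT_SEVERITY
    tag = match.group(1).strip().lower()
    return tag if tag in SEVERITIES else DEFAULT_SEVERITY
```

Defaulting to medium rather than low or nit keeps a mis-tagged finding visible to the judge instead of letting it sink to the bottom of the list.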

It also became communal. As a shared Action, anyone can apply the label and get identical results, with no installation or extra permissions needed. It went from private tool to team infrastructure when another engineer requested it for the frontend repos.

What’s still open

The lineup evolves. "Top model per lab" is fluid; the durable part is lab diversity, not the roster, so the lineup is updated through shared config.
The judge's ten-item cap is heuristic. Occasionally a PR has more than ten genuine issues and some are truncated; ten feels right empirically.
The judge is a Claude model, from the same lab as one of the reviewers. LLM judges exhibit self-preference, so it may over-weight Claude's findings. Access to the real files mitigates this, but the issue is not fully resolved.
No benchmarking exists. There are no labeled bug sets, precision/recall metrics, or controlled comparisons, just ~110 PRs of real catches that humans missed. I am relying on observed results rather than cited studies; a proper benchmark would interest me.

The architecture is the contribution, so prompts and workflows are open:

Cursor Review GitHub Workflow →

Adopt it, apply it to your PRs, and critique the judge cap. We open-source workflows to attract engineers who debate designs. If that's you, join us.