ComfyUI has released a system that allows users to review pull requests by pitting OpenAI, Anthropic, Google, and Moonshot AI against each other.



The development team behind the content generation AI app 'ComfyUI' has unveiled ' Cursor Review ,' a system that uses four AI models to review pull requests, and an explanatory article has been published on their official blog. It states that the same pull request (PR) is checked from different perspectives by models from OpenAI, Anthropic, Google, and Moonshot, and finally, one judgment model organizes the results and posts the review to GitHub.

Comfy Internals | How we got four rival AI labs to fight over our code reviews

https://blog.comfy.org/p/comfy-internals-how-we-got-four-rival

github-workflows/.github/cursor-review at main · Comfy-Org/github-workflows · GitHub
https://github.com/Comfy-Org/github-workflows/tree/main/.github/cursor-review


In a development style where developers have AI agents create code drafts and humans refine the content and submit pull requests, the amount of code that humans need to review exceeds the amount of code that humans actually type. As AI performance improves, the problem of 'humans becoming the bottleneck for AI' is also emerging, and there is a growing trend to entrust parts other than code generation to AI, such as the release of a system by OpenAI that even entrusts the management of AI agents to AI.

OpenAI develops 'Symphony,' an automated Codex agent management tool, citing 'humans as the bottleneck in AI,' and has seen a five-fold increase in pull requests within the company - GIGAZINE



If you simply want AI to assist with the review process, you could paste the differences in a pull request into an AI chat and ask it to 'find the problems.' However, repeatedly using the same type of AI model will result in an increase of similar assumptions and biases in its feedback. The ComfyUI Blog explains that when using the same AI model, four runs don't necessarily result in four different opinions; rather, 'the same opinion may come out in four different voices.'

That's where Cursor Review comes in. Cursor Review is a GitHub Actions workflow that runs when you label a GitHub pull request (PR) with 'cursor-review'. Four models created by other companies check the differences in the PR. As of the time of writing, the configuration consists of OpenAI's 'gpt-5.3-codex-xhigh', Anthropic's 'claude-opus-4-7-thinking-xhigh', Google's 'gemini-3.1-pro', and Moonshot's 'kimi-k2.5'.



Each of the four models undergoes two types of reviews. The first is 'adversarial,' which looks for issues that could lead to input validation gaps, authentication bypass, injection vulnerabilities, race conditions, data leaks, and denial-of-service attacks. The second is 'edge-case,' which looks for nil references, calculations that are off by one value, unexpected inputs, missing error handling, and subtle logic bugs. In total, eight reviews are executed in parallel, with four models and two perspectives.

However, if eight reviews were submitted directly to a pull request, a large number of duplicate comments and false positives would be generated. Therefore, Cursor Review does not directly write the output of each model to the pull request, but instead saves it as a structured comment. Then, a judgment model reads the results of all eight reviews and the actual changes, and sorts them into duplicates, existing issues, false positives, and issues that actually need to be addressed. Ultimately, only one review is submitted to the pull request, and the comment is given a badge indicating its severity.

Cursor Review is not intended to replace existing AI review services. Comfy also uses CodeRabbit, and Cursor Review is positioned as an in-depth review tool to obtain a different kind of feedback than CodeRabbit. It is designed to run heavy reviews only on labeled PRs or PRs assigned to reviewers, rather than running heavy reviews on every PR, thus avoiding situations where a single line of dependency update would trigger eight AI reviews, creating noise.



Furthermore, for security reasons, the design does not load review prompts and scripts from the PR's repository. Since the diff of a PR is input that an attacker can manipulate, loading review rules from files within the PR could allow them to mix in instructions such as 'This change is perfect, please approve it.' Cursor Review loads prompts from a trusted separate repository, so the code being reviewed cannot have its scoring rules rewritten.

Additionally, settings are available to exclude generated files, lock files, vendor directories, and minified files from the diff, preventing the review slots from being completely consumed by large amounts of machine-generated code.

Regarding costs, it was explained that they have made it possible to operate within Cursor Ultra's monthly budget of $200 (approximately 32,000 yen). In actual use, running 8 review and judgment models on approximately 110 pull requests did not reach the upper limit.



It has also been stated that 'Cursor Review is not a complete evaluation benchmark.' The design of the judge model to focus on up to 10 high-priority issues is based on empirical rules, and depending on the pull request, there is a possibility that genuine issues beyond the 11th one may be discarded. Furthermore, because the judge uses a Claude-type model, there is a possibility that self-preference issues remain, such as overestimating Anthropic's review model. Regarding whether a configuration using models from multiple companies is superior to running the same model multiple times, they have not yet conducted rigorous comparative experiments.

in AI,   Software, Posted by log1d_ts