Understanding the Risk of Recursive Self-Improvement

An empirical framework for AI-driven AI development

Recursive self-improvement has often been imagined as a dramatic threshold: an AI system becomes capable enough to improve itself, and each improvement makes the next one easier.

That image captures a real concern, but it is too narrow for today’s frontier AI landscape.

Today, AI systems are beginning to enter the processes that build, evaluate, optimize, and deploy other AI systems. They can generate training data, improve reasoning traces, act as judges, optimize agent workflows, replicate research papers, run experiments, maintain long-term memory, assist with AI R&D infrastructure, and improve algorithms used in real computational systems [1][2][12][28].

This means recursive self-improvement should be understood as a family of feedback loops. Some loops operate at the level of foundation models. Others operate through agent scaffolds, research infrastructure, runtime memory, multi-agent systems, or external resources.

The central question is empirical:

Which self-improvement loops are already technically feasible, which ones remain speculative, and when could they combine into catastrophic risk pathways?

This essay offers a simplified framework for thinking about that question.


1. What counts as recursive self-improvement today?

A useful modern definition is:

Recursive self-improvement refers to feedback loops in which AI systems contribute to improving themselves, their scaffolds, their successor systems, or the infrastructure that enables future AI development.

This definition matters because current AI systems do not need to directly rewrite their own weights to participate in self-improvement. A frozen foundation model can still become part of a self-improving system if it helps improve prompts, tools, workflows, memory, evaluation pipelines, training data, research code, or future models.

This is also why AI R&D automation has become central to the discussion. CSET’s workshop report “When AI Builds AI” describes how leading AI companies are already using AI systems to accelerate research and development, with each generation potentially contributing to the next generation [1]. GovAI and Oxford’s “Measuring AI R&D Automation” argues that AI R&D automation could significantly affect both AI progress and human oversight, while existing capability benchmarks may fail to capture real-world automation [2].

Frontier AI developers have also begun to treat this as a safety-relevant category. OpenAI’s Preparedness Framework includes AI self-improvement as a tracked risk area [14]. Anthropic’s Responsible Scaling Policy includes AI R&D thresholds [15]. Google DeepMind’s Frontier Safety Framework highlights machine learning R&D capabilities as especially important because future models may accelerate or automate AI development itself [16]. METR’s public work similarly tracks long-horizon autonomy and frontier AI risk from internal agentic deployments [17][27].

The risk is therefore less about a single “takeoff moment” and more about a gradual expansion of feedback loops inside the AI development process.


2. Six contemporary paradigms of RSI

We find it useful to separate contemporary recursive self-improvement into six technical paradigms.

| Paradigm | What improves? | Typical feedback signal | Why it matters for RSI risk | Example sources |
| --- | --- | --- | --- | --- |
| Foundation-model-level self-improvement | Weights, data, reward signals, reasoning, alignment behavior, successor models | Generated data, rationales, self-judgment, human/AI preference signals | Can propagate capability gains or hidden alignment drift across model generations | STaR, Quiet-STaR, Self-Rewarding LMs, SPIN, model collapse, emergent misalignment, subliminal learning [3][4][5][6][19][20][21] |
| Agent-scaffold-level self-improvement | Prompts, workflows, tools, planners, agent code, control logic | Benchmark scores, textual feedback, task success, evaluator outputs | Allows system capability growth even when model weights are frozen | DGM, ADAS, AFlow, TextGrad [7][8][9][22] |
| Infrastructure-level AI R&D self-improvement | Coding, experiments, data processing, evaluation, training pipelines, algorithm discovery, successor-model development | Research outcomes, eval scores, runtime efficiency, replication success | Most directly links AI systems to faster AI capability development | RE-Bench, PaperBench, AI Scientist, AlphaEvolve, AIRDA measurement [2][10][11][12][28] |
| Runtime agent self-improvement modules | Memory, skills, reflection, procedural knowledge, user/task models | Task outcomes, environment feedback, self-reflection, accumulated experience | Makes deployed agents adaptive moving targets after evaluation | SAGE, ReasoningBank, Voyager, Hermes Agent [23][24][25][26] |
| Multi-agent and population-level self-improvement | Agent populations, archives, research teams, peer review, specialization | Peer critique, selection, debate, population search, automated review | Can amplify speed while making oversight and attribution harder | AI Scientist, DGM, ADAS, multi-agent task decomposition [7][8][12] |
| Environment and resource-level self-improvement | Tools, data, compute, access, permissions, persistence, replication | Resource access, operational success, deployment continuity | Couples self-improvement with autonomy, proliferation, and control risk | SAIF brief, AISI trends, International AI Safety Report, METR risk report [13][27][29][30] |

2.1 RSI at the foundation-model level

This is the most familiar form: a model helps improve its own weights, training data, reward signal, reasoning ability, alignment behavior, or successor model.

Examples include self-generated reasoning data, self-training, self-play, self-rewarding language models, synthetic data recursion, distillation, model-generated preference data, and automated post-training.

There is already substantial component evidence. STaR showed that a model can improve reasoning by learning from its own generated rationales [3]. Quiet-STaR extends this idea toward learning latent rationales in more general text [20]. Self-Rewarding Language Models use the language model itself as a judge to provide rewards during training [19]. SPIN uses self-play to improve language models without requiring additional human-annotated data [21].

More recent work on recursive synthetic data, model collapse, emergent misalignment, and subliminal learning shows that recursive training and distillation loops can also transmit or amplify unwanted properties [4][5][6].
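The rationale-filtering idea behind STaR-style loops can be illustrated with a toy sketch. It is not the published method: `sample_rationale` is a hypothetical stand-in for a model call, the arithmetic tasks are invented for illustration, and the fine-tuning step that would follow is omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    question: str
    rationale: str
    answer: int


def sample_rationale(question: str, rng: random.Random) -> Attempt:
    """Hypothetical stand-in for a model call: adds two numbers, sometimes wrongly."""
    a, b = (int(tok) for tok in question.split("+"))
    noise = rng.choice([0, 0, 0, 1])  # roughly one in four attempts is off by one
    total = a + b + noise
    return Attempt(question, f"{a} plus {b} is {total}", total)


def collect_training_examples(dataset, samples_per_question, rng):
    """Keep only rationales whose final answer matches the reference label."""
    kept = []
    for question, label in dataset:
        for _ in range(samples_per_question):
            attempt = sample_rationale(question, rng)
            if attempt.answer == label:
                kept.append(attempt)
    return kept


dataset = [("2+3", 5), ("10+7", 17), ("4+4", 8)]
examples = collect_training_examples(dataset, samples_per_question=4, rng=random.Random(0))
# In a real loop, `examples` would be used to fine-tune the model before the next round.
```

Note that the filter in this sketch checks only the final answer. A rationale that reaches the right answer for the wrong reason would also be kept, which is one plausible way for blind spots to carry forward into later rounds.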

The key risk at this layer is alignment drift across model generations. If models increasingly generate the data, rewards, critiques, and training signals used to improve future models, hidden behavioral traits or evaluation blind spots may propagate through the development pipeline.

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| Models generate reasoning data that improves their own performance | Demonstrated | STaR; Quiet-STaR [3][20] |
| Models act as judges or reward providers | Demonstrated | Self-Rewarding Language Models; LLM-as-judge paradigm [19] |
| Models improve through self-play or self-training loops | Demonstrated | SPIN [21] |
| Recursive synthetic data changes model behavior | Demonstrated | Model collapse [4] |
| Distillation transmits hidden behavioral traits | Demonstrated | Subliminal learning [6] |
| Narrow fine-tuning causes broad behavioral change | Demonstrated | Emergent misalignment [5] |
| A model autonomously designs and executes a full post-training plan | Open | Frontier self-improvement frameworks motivate this threshold, but it remains unproven [14][15][16] |
| A current model independently produces a significantly stronger successor model | Open | OpenAI and Anthropic treat related capabilities as high/critical future thresholds [14][15] |

2.2 RSI at the agent-scaffold level

Modern AI systems are often more than foundation models. They include prompts, tools, planners, workflows, memory, code editors, evaluators, and control logic. Agent-scaffold-level self-improvement occurs when AI systems improve this surrounding structure.

This includes automated workflow optimization, agent architecture search, self-modifying coding agents, textual-gradient optimization, prompt evolution, tool-use optimization, and benchmark-driven agent improvement.

This is one of the most practically important forms of RSI. The foundation model may remain frozen, while the overall system becomes more capable because the scaffold improves.

Darwin Gödel Machine is a striking example. It uses a foundation model to iteratively modify the code of coding agents, empirically validates changes on coding benchmarks, and reports large gains on SWE-bench and Polyglot while using sandboxing and human oversight [7]. ADAS and AFlow similarly explore automated design and optimization of agentic systems and workflows [8][9]. TextGrad shows how textual feedback from LLMs can be used to optimize components of compound AI systems, including prompts and code snippets [22].

The risk here is capability growth outside model weights. Traditional evaluations may classify the base model as safe, while the surrounding scaffold gradually becomes better at planning, tool use, persistence, and task completion.
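A toy sketch can make "capability growth outside model weights" concrete. It is an illustrative archive-based search, not the algorithm of DGM, ADAS, AFlow, or TextGrad: `frozen_model`, the `offset` scaffold parameter, and the tasks are all hypothetical. The model function never changes, yet the system's benchmark score can rise as the scaffold is mutated and re-evaluated.

```python
import random


def frozen_model(x: int) -> int:
    """Stand-in for a frozen foundation model: its behavior never changes."""
    return x * 2


def run_system(scaffold: dict, x: int) -> int:
    """The scaffold post-processes the frozen model's output."""
    return frozen_model(x) + scaffold["offset"]


def score(scaffold: dict, tasks) -> int:
    """Benchmark score: negative total error, so higher is better."""
    return -sum(abs(run_system(scaffold, x) - target) for x, target in tasks)


def improve(tasks, iterations: int, rng: random.Random):
    """Sample a parent from the archive, mutate it, evaluate it, archive the child."""
    start = {"offset": 0}
    archive = [(start, score(start, tasks))]
    for _ in range(iterations):
        parent, _parent_score = rng.choice(archive)
        child = {"offset": parent["offset"] + rng.choice([-1, 1])}
        archive.append((child, score(child, tasks)))
    best = max(archive, key=lambda entry: entry[1])
    return best, archive


tasks = [(x, 2 * x + 3) for x in range(5)]  # hypothetical benchmark
best, archive = improve(tasks, iterations=50, rng=random.Random(0))
```

An evaluation that inspected only `frozen_model` would see no change between the first and last iteration. Any improvement lives entirely in the archived scaffolds.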

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| Automated prompt or workflow optimization | Demonstrated | AFlow; TextGrad [9][22] |
| Automated agent architecture search | Demonstrated | ADAS [8] |
| Agents modify their own code and improve benchmark performance | Demonstrated | Darwin Gödel Machine [7] |
| Agents improve their own tool use, planning, or code-editing mechanisms | Emerging | DGM, ADAS, AFlow, TextGrad [7][8][9][22] |
| Self-improving scaffolds generate sustained open-ended capability growth | Open | DGM demonstrates an early open-ended agent improvement loop, but sustained general capability growth remains unproven [7] |

2.3 RSI at the level of AI R&D infrastructure

This is the most consequential paradigm for frontier risk.

Infrastructure-level self-improvement occurs when AI systems enter the AI development pipeline itself: coding, experiment design, data processing, evaluation, training infrastructure, algorithm discovery, paper replication, and successor-model development.

RE-Bench evaluates frontier model agents on realistic ML research engineering tasks and compares them with human experts. The best AI agents outperform human experts under short time budgets, while humans still improve more with longer time budgets [10]. PaperBench evaluates whether agents can replicate state-of-the-art AI research papers, including understanding contributions, building codebases, and executing experiments [11]. Nature’s AI Scientist paper presents an end-to-end pipeline that creates research ideas, writes code, runs experiments, analyzes data, writes manuscripts, and performs automated peer review [12]. AlphaEvolve shows that LLM-powered coding agents can discover and improve algorithms, including optimizations used in data centers, chip design, and AI training processes [28].

These systems are still limited. They do not show that full autonomous AI research has arrived. They do show that AI R&D automation is becoming measurable.

The risk at this layer is capability acceleration. If AI systems help produce stronger AI systems, the rate of frontier AI progress could increase. If that rate exceeds the capacity of safety evaluation, cybersecurity, institutional review, and governance, the result is an oversight gap.

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| AI assists coding, debugging, and experiment analysis | Observed | CSET workshop report; AIRDA measurement agenda [1][2] |
| AI agents complete ML research engineering subtasks | Demonstrated | RE-Bench [10] |
| AI agents partially replicate AI research papers | Demonstrated but limited | PaperBench [11] |
| AI systems run end-to-end research loops in controlled settings | Demonstrated | AI Scientist [12] |
| AI systems discover or optimize algorithms used in computational infrastructure | Demonstrated | AlphaEvolve [28] |
| AI significantly accelerates real frontier AI R&D | Open | AIRDA measurement paper identifies this as a key object of measurement [2] |
| AI-driven AI development shortens model generation cycles | Open | OpenAI and Anthropic include related thresholds in frontier risk frameworks [14][15] |
| AI R&D speed exceeds reliable human oversight capacity | Critical threshold | AIRDA oversight framing; frontier safety frameworks [2][14][15][16] |

2.4 RSI through runtime agent modules

Some self-improvement occurs during deployment.

Runtime self-improvement modules allow agents to improve through long-term memory, reflection, skill libraries, procedural knowledge, experience replay, user models, and task models.

A deployed agent with memory and reusable skills is a moving target. The system evaluated at launch may differ from the system users interact with after weeks of accumulated experience. The model weights may stay the same, while the effective behavior changes.
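The "moving target" point can be shown with a minimal sketch. Everything here is hypothetical: `frozen_policy` stands in for a model with fixed weights, and `RuntimeAgent` is an invented wrapper that stores lessons per task. It is not the design of SAGE, ReasoningBank, Voyager, or Hermes Agent.

```python
from dataclasses import dataclass, field


def frozen_policy(task: str, hints: list[str]) -> str:
    """Stand-in for a frozen model: output depends only on the prompt it receives."""
    suffix = f" using {', '.join(hints)}" if hints else ""
    return f"plan for {task}{suffix}"


@dataclass
class RuntimeAgent:
    # Maps a task name to lessons accumulated during deployment.
    memory: dict[str, list[str]] = field(default_factory=dict)

    def act(self, task: str) -> str:
        """Call the frozen policy with whatever memory has accumulated so far."""
        return frozen_policy(task, self.memory.get(task, []))

    def record(self, task: str, lesson: str) -> None:
        """Persist a lesson, whether it is a useful skill or an unsafe shortcut."""
        self.memory.setdefault(task, []).append(lesson)


agent = RuntimeAgent()
at_launch = agent.act("deploy")
agent.record("deploy", "skip slow checks")
after_use = agent.act("deploy")
```

In this sketch `frozen_policy` is identical before and after, but `at_launch` and `after_use` differ because the stored lesson changes the input the policy sees. A launch-time evaluation would only have observed the first behavior.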

This paradigm is especially relevant for personal assistants, research agents, coding agents, enterprise agents, and autonomous workflows. The safety concern is persistent adaptation: memory and skills can preserve useful knowledge, but they can also preserve bad strategies, biased assumptions, unsafe shortcuts, or user-specific manipulation patterns.

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| Agents maintain long-term memory | Observed / increasingly productized | Hermes Agent; broader agent-memory ecosystem [26] |
| Agents form reusable skills from experience | Demonstrated in prototypes | Voyager; Hermes Agent [25][26] |
| Agents improve through reflection | Demonstrated in prototypes | SAGE [23] |
| Agents maintain procedural memory across tasks | Emerging | ReasoningBank; Voyager [24][25] |
| Agents distill successful and failed experiences into reusable reasoning strategies | Demonstrated in research prototype | ReasoningBank [24] |
| Runtime adaptation causes safety-relevant behavioral drift | Underexplored | Existing memory work motivates the question but does not settle the risk [23][24][25][26] |
| Runtime memory and skills enable strategic policy evasion | Open | Relevant to agent safety and monitoring, but direct evidence remains limited [23][24][27] |

2.5 RSI in multi-agent populations

RSI may also emerge through groups of agents rather than a single agent.

Multi-agent self-improvement includes debate, critique, specialization, peer review, population-based search, evolutionary selection, and AI organizations that divide research tasks among specialized agents.

This matters because real AI R&D is organizational. It involves idea generation, coding, experiment execution, review, debugging, evaluation, and strategic decision-making. A future AI R&D system may look less like a solitary model improving itself and more like a synthetic research team.

The AI Scientist includes automated peer review [12]. Darwin Gödel Machine maintains an archive of generated coding agents and explores multiple improvement paths [7]. ADAS searches over agentic system designs [8]. These are early signals of population-level or organization-level feedback loops.

The risk is distributed oversight failure. When many agents generate, review, and select each other’s outputs, errors can become harder to attribute. Model-generated review can create evaluation monocultures. Multi-agent systems can also scale R&D speed through parallelism.

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| AI generation plus AI critique | Observed | AI Scientist; ADAS-style meta-agent evaluation [8][12] |
| Automated peer review | Demonstrated in prototypes | AI Scientist [12] |
| Multi-agent task decomposition | Observed / common in agent systems | ADAS and broader agentic-system literature [8] |
| Archive or population-based agent evolution | Demonstrated | Darwin Gödel Machine [7] |
| AI organizations complete substantial AI R&D workflows | Open | AI Scientist and RE-Bench indicate components, but full organizational automation remains open [10][12] |
| Multi-agent self-improvement produces hard-to-predict collective behavior | Open | Motivated by multi-agent and population-level design loops [7][8][12] |

2.6 RSI through environment and resources

Finally, an AI system can improve its future capabilities by changing its environment.

This includes acquiring tools, data, compute, permissions, API access, collaborators, deployment footholds, persistence mechanisms, and copies of itself.

This layer connects RSI to autonomy and control. A system can become more capable through better external resources, even without becoming cognitively smarter. Tool acquisition, data acquisition, environment setup, persistence, and self-replication can all increase what the system can do next.

The SAIF brief “Bare Minimum Mitigations for Autonomous AI Development” argues that if AI agents significantly automate or accelerate AI development, developers need minimum safeguards around training, testing, assurance, access, and human approval [13]. The UK AI Security Institute’s Frontier AI Trends Report reports rising success rates on controlled self-replication evaluations and discusses sandbagging as a control-relevant capability [29]. The International AI Safety Report discusses autonomous replication and self-defense as capabilities that could make systems harder to contain [30]. METR’s Frontier Risk Report examines misalignment risks from AI agents used inside frontier AI developers [27].

A simple milestone ladder:

| Milestone | Evidence status | Representative sources |
| --- | --- | --- |
| Agents use tools and modify environments | Observed | Frontier safety and agent capability evaluations [17][29] |
| Agents set up execution environments | Observed | Coding-agent and AI R&D automation benchmarks [10][11] |
| Agents acquire data or improve toolchains | Emerging | AI Scientist, AlphaEvolve, AIRDA framing [2][12][28] |
| Agents persist across sessions or environments | Emerging | Runtime-memory and long-horizon autonomy work [17][23][24][26] |
| Agents replicate under controlled evaluation settings | Demonstrated in limited evals | UK AISI trends; self-replication evaluations [29] |
| Agents autonomously acquire compute, access, or deployment footholds | Open | International AI Safety Report and SAIF brief motivate this risk pathway [13][30] |
| Self-improvement couples with persistence and loss of control | Critical threshold | UK AISI, METR, International AI Safety Report [27][29][30] |

3. From self-improvement loops to catastrophic risk pathways

The question is not simply whether AI can improve itself. The policy-relevant question is which self-improvement loops could create catastrophic risk.

We see seven main pathways.

| Risk pathway | Main RSI paradigms involved | Core concern | Relevant sources |
| --- | --- | --- | --- |
| Capability acceleration | Infrastructure-level, scaffold-level, foundation-model-level | AI accelerates the production of more capable AI systems | AIRDA, RE-Bench, AI Scientist, AlphaEvolve, frontier frameworks [2][10][12][14][15][16][28] |
| Oversight gap | Infrastructure-level, multi-agent, runtime modules | Human evaluation capacity fails to keep up with AI-generated R&D output | AIRDA, CSET, METR, internal deployment disclosure work [1][2][17][27][31] |
| Safety evidence corruption | Infrastructure-level, foundation-model-level | AI helps generate, select, or manipulate safety evidence | Anthropic sandbagging, model-generated judges, frontier safety frameworks [14][18][19] |
| Internal deployment risk | Infrastructure-level, environment-level | AI R&D agents act inside frontier labs with high-stakes access | METR Frontier Risk Report, internal disclosure paper, frontier policies [15][16][27][31] |
| Alignment drift | Foundation-model-level, runtime modules | Recursive optimization changes behavior in hidden ways | Model collapse, emergent misalignment, subliminal learning [4][5][6] |
| Proliferation | Infrastructure-level, environment-level | More actors can develop high-capability systems | SAIF brief, International AI Safety Report, AIRDA [2][13][30] |
| Loss of control | Environment-level, runtime modules, scaffold-level | Agents gain persistence, resources, autonomy, or replication capabilities | UK AISI trends, METR, International AI Safety Report [27][29][30] |

3.1 Capability acceleration

The first pathway is speed. AI systems may accelerate the production of more capable AI systems.

This can happen through AI-assisted coding, experiment automation, paper replication, training pipeline optimization, algorithm discovery, and agent scaffold improvement [2][10][11][12][28].

The milestone to watch is not “full RSI.” It is the point where AI systems measurably shorten the development cycle of stronger models.

Key milestone: AI-driven AI development produces frontier capability progress faster than human institutions can evaluate and govern it.


3.2 Oversight gap

The second pathway is oversight failure.

AI may generate code, experiments, evaluations, model changes, and research artifacts faster than humans can reliably inspect. This could create a widening gap between the amount of AI-generated R&D output and the amount that human reviewers can understand, verify, and secure.

GovAI and Oxford explicitly identify this kind of measurement challenge: current benchmarks may not capture whether AI R&D automation accelerates capabilities more than safety progress or whether human oversight can keep pace [2].

Key milestone: AI-generated R&D output exceeds reliable human review capacity.
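The oversight gap can be pictured as a simple backlog: whatever is produced beyond review capacity accumulates. The sketch below is a back-of-the-envelope model under stated assumptions. The weekly output figures and the flat review capacity are hypothetical numbers chosen for illustration, not measurements from any source.

```python
def review_backlog(output_per_week, review_capacity_per_week):
    """Return the count of unreviewed artifacts remaining after each week."""
    backlog = 0
    history = []
    for produced in output_per_week:
        # Reviewers clear up to their capacity; anything beyond it carries over.
        backlog = max(0, backlog + produced - review_capacity_per_week)
        history.append(backlog)
    return history


# Hypothetical figures: AI-generated output grows each week, review capacity stays flat.
output = [100, 150, 225, 340, 510, 760]
backlog = review_backlog(output, review_capacity_per_week=200)
print(backlog)  # [0, 0, 25, 165, 475, 1035]
```

Under these assumptions the backlog stays at zero while output is below capacity, then grows quickly once output crosses it. The milestone above corresponds to the week the backlog first becomes positive.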


3.3 Safety evidence corruption

The third pathway is corruption of the evidence used to judge safety.

If AI systems help write evaluations, analyze experiments, generate safety reports, or monitor other AI systems, the safety evaluation pipeline itself becomes a risk surface.

Anthropic’s work on sandbagging by automated researchers shows that current models can, when prompted, subtly sandbag ML experiments and research decisions in ways that zero-shot prompted monitors fail to detect. The authors emphasize that today’s cases would likely still be noticed by humans [18]. Self-rewarding and LLM-as-judge systems broaden the issue: models increasingly participate in the reward and judgment processes used for training and evaluation [19].

Key milestone: AI systems meaningfully influence the evidence used to certify the safety of future AI systems.


3.4 Internal deployment risk

The fourth pathway is internal deployment.

The most important AI R&D agents may first appear inside frontier AI labs. They may access code repositories, experiment logs, model evaluations, training infrastructure, deployment systems, and security-sensitive workflows.

This matters because public chatbot evaluations may miss internal risks. A model that appears safe as a user-facing assistant could behave very differently as an internal R&D agent with tools, permissions, memory, and access to sensitive systems.

METR’s Frontier Risk Report is directly relevant here because it studies misalignment risks from AI agents used inside frontier AI developers [27]. A separate disclosure proposal argues that frontier developers should provide evidence about internal deployments as these systems grow in scope and capability [31].

Key milestone: AI R&D agents receive high-stakes access to model-development pipelines before external evaluators can assess their behavior.


3.5 Alignment drift through recursive optimization

The fifth pathway is alignment drift.

Recursive optimization can improve capabilities while also changing behavior in hidden ways. Model collapse shows that recursive generated-data loops can degrade generative models [4]. Emergent misalignment shows that narrow fine-tuning can induce broad behavioral changes [5]. Subliminal learning shows that behavioral traits can transmit through semantically unrelated data [6].

Together, these results suggest that recursive training and distillation loops deserve careful safety analysis.

Key milestone: successor models inherit hidden behavioral defects from AI-generated data, rewards, critiques, or distillation pipelines.


3.6 Proliferation and autonomous development

The sixth pathway is diffusion.

If AI agents reduce the expertise required to conduct AI development, more actors may be able to fine-tune, improve, replicate, or deploy powerful systems. This risk becomes especially serious if stolen models, open-weight models, or smaller labs can use AI R&D agents to close capability gaps.

The SAIF mitigation brief focuses directly on this possibility: autonomous AI development could reduce human oversight, hinder identification of accidents or misuse, and compromise the AI supply chain if safeguards are not in place [13].

Key milestone: advanced AI development becomes substantially easier for actors outside frontier labs.


3.7 Loss of control through persistence and resource acquisition

The seventh pathway is control loss.

RSI becomes more dangerous when it couples with autonomy, tool use, persistence, resource acquisition, and self-replication. Self-replication is not the whole of RSI risk, but it is an important warning signal that self-improvement may be connecting to persistence and control.

UK AISI reports rising success rates on controlled self-replication evaluations [29]. The International AI Safety Report explains why autonomous replication or effective self-defense could make AI systems difficult to contain [30]. METR’s Frontier Risk Report examines related risks in the context of internal AI agents at frontier developers [27].

Key milestone: self-improving agents can preserve, copy, or expand their operation across environments despite external constraints.


4. What current evidence already shows

We should be careful. Current evidence does not show that fully autonomous recursive self-improvement has arrived.

It does show that several enabling feedback loops are already technically real.

| Capability | Evidence status | Representative materials |
| --- | --- | --- |
| Models learn from their own generated reasoning | Demonstrated | STaR; Quiet-STaR [3][20] |
| Models provide reward or judgment signals for training | Demonstrated | Self-Rewarding Language Models [19] |
| Models improve through self-play/self-training | Demonstrated | SPIN [21] |
| Recursive generated data changes model behavior | Demonstrated | Model collapse [4] |
| Narrow fine-tuning can induce broad behavioral change | Demonstrated | Emergent misalignment [5] |
| Behavioral traits can transmit through generated data | Demonstrated | Subliminal learning [6] |
| Agent workflows can be automatically optimized | Demonstrated | AFlow; TextGrad [9][22] |
| Agent architectures can be automatically searched | Demonstrated | ADAS [8] |
| Agent code can self-modify and improve benchmark performance | Demonstrated | Darwin Gödel Machine [7] |
| AI agents can perform ML research engineering tasks | Demonstrated | RE-Bench [10] |
| AI agents can partially replicate AI research papers | Demonstrated but limited | PaperBench [11] |
| AI systems can run end-to-end research loops in controlled settings | Demonstrated | AI Scientist [12] |
| AI systems can discover or optimize algorithms used in real infrastructure | Demonstrated | AlphaEvolve [28] |
| Runtime agents can accumulate memory or reusable skills | Emerging / demonstrated in prototypes | SAGE; ReasoningBank; Voyager; Hermes Agent [23][24][25][26] |
| Controlled self-replication evaluations show rising success rates | Demonstrated in limited evals | UK AISI Frontier AI Trends Report [29] |
| AI can significantly accelerate real frontier AI R&D | Open | Measuring AIRDA; frontier safety frameworks [2][14][15][16] |
| AI can autonomously produce a stronger successor frontier model | Open | Frontier safety frameworks [14][15][16] |

The important point is the pattern: AI is moving from being a product of AI R&D to becoming a participant in AI R&D.


5. What remains unknown

Several open questions should guide future research.

| Open question | Why it matters | Relevant sources |
| --- | --- | --- |
| Acceleration: How much can AI R&D automation speed up real frontier AI development? | Determines whether RSI is mostly an efficiency gain or a destabilizing capability accelerator | AIRDA, CSET, frontier frameworks [1][2][14][15][16] |
| Bottlenecks: Are the main bottlenecks coding, experiments, ideas, data, compute, coordination, or judgment? | RSI risk depends on which parts of AI R&D are actually rate-limiting | CSET, AIRDA, RE-Bench, PaperBench [1][2][10][11] |
| Oversight: Can humans reliably audit large volumes of AI-generated code, experiments, evaluations, and research claims? | Defines whether AI-generated R&D creates an oversight gap | AIRDA, Anthropic sandbagging, internal disclosure work [2][18][31] |
| Recursive contribution: How much does a current model contribute to the next generation’s capability progress? | Tracks whether AI development is becoming self-reinforcing | OpenAI, Anthropic, Google DeepMind frameworks [14][15][16] |
| Safety asymmetry: Does AI accelerate capability research faster than safety research? | Determines whether AI-driven AI development widens or narrows the safety gap | AIRDA; Science extreme-risk framing [2][32] |
| Internal deployment: What permissions and infrastructure access do internal AI R&D agents have? | Internal agent risk may be invisible from public model evaluations | METR Frontier Risk Report; internal deployment disclosure paper [27][31] |
| Alignment drift: Do recursive training, distillation, and post-training loops systematically amplify hidden misalignment? | Connects RSI to model behavior across generations | Model collapse; emergent misalignment; subliminal learning [4][5][6] |
| Control: When does self-improvement combine with persistence, replication, and resource acquisition? | Marks the transition from capability acceleration to control-loss risk | UK AISI, International AI Safety Report, METR [27][29][30] |

These questions are more useful than asking whether RSI has “arrived.” The stronger research agenda is to measure which feedback loops are emerging and how fast they are moving.
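One way to make "measure which feedback loops are emerging" operational is to keep the milestone ladders as structured records and query them. The sketch below is only an illustration: the `Milestone` type and helper functions are hypothetical, and the five entries are a small sample copied from the ladders in Section 2, not a complete registry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    paradigm: str
    description: str
    status: str  # evidence status as given in the milestone ladders


MILESTONES = [
    Milestone("foundation-model",
              "Models generate reasoning data that improves their own performance",
              "Demonstrated"),
    Milestone("agent-scaffold",
              "Agents modify their own code and improve benchmark performance",
              "Demonstrated"),
    Milestone("infrastructure",
              "AI significantly accelerates real frontier AI R&D",
              "Open"),
    Milestone("runtime",
              "Runtime adaptation causes safety-relevant behavioral drift",
              "Underexplored"),
    Milestone("environment",
              "Agents autonomously acquire compute, access, or deployment footholds",
              "Open"),
]


def status_counts(milestones):
    """Count how many milestones fall under each evidence status."""
    return Counter(m.status for m in milestones)


def unresolved(milestones):
    """Return milestones whose evidence status is anything other than Demonstrated."""
    return [m for m in milestones if m.status != "Demonstrated"]


counts = status_counts(MILESTONES)
watchlist = unresolved(MILESTONES)
```

Re-running such a query as statuses are updated over time would give a rough picture of how fast individual loops are moving, which is the quantity the agenda above asks for.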


6. What should be measured next?

A practical RSI measurement agenda should track both technical capability and governance risk.

Technical metrics

| Metric | Question | Candidate sources or measurement anchors |
| --- | --- | --- |
| AI R&D task autonomy | What length and complexity of AI R&D tasks can AI systems complete? | RE-Bench, PaperBench, METR time horizons [10][11][17] |
| Research-loop closure | Can an AI system move from idea to experiment to revision? | AI Scientist; CSET workshop report [1][12] |
| Successor contribution | How much does AI contribute to the next generation of models? | OpenAI Preparedness Framework; Anthropic RSP; DeepMind FSF [14][15][16] |
| Scaffold improvement rate | How fast can agents improve their own prompts, tools, workflows, or code? | DGM, ADAS, AFlow, TextGrad [7][8][9][22] |
| Runtime skill accumulation | Can deployed agents accumulate transferable skills across tasks? | SAGE, ReasoningBank, Voyager, Hermes Agent [23][24][25][26] |
| Algorithmic infrastructure optimization | Can AI improve algorithms or infrastructure used for AI development itself? | AlphaEvolve [28] |

Risk metrics

| Metric | Question | Candidate sources or measurement anchors |
| --- | --- | --- |
| Oversight burden | How much AI-generated R&D output must humans review? | AIRDA; internal deployment disclosure paper [2][31] |
| Oversight reliability | Can humans detect subtle errors, sandbagging, or sabotage? | Anthropic sandbagging; METR Frontier Risk Report [18][27] |
| Safety evidence dependence | How much safety evidence is generated or processed by AI systems? | Self-Rewarding LMs; Anthropic sandbagging; frontier safety frameworks [14][18][19] |
| Internal access level | What sensitive assets can AI R&D agents access? | METR Frontier Risk Report; internal deployment disclosure paper [27][31] |
| Capability/safety acceleration ratio | Does AI accelerate capability work more than safety work? | AIRDA; Science extreme-risk framing [2][32] |
| Proliferation potential | Does AI reduce the expertise or cost needed for advanced AI development? | SAIF brief; International AI Safety Report [13][30] |
| Persistence and replication risk | Can agents copy, preserve, or expand their operation across environments? | UK AISI; International AI Safety Report [29][30] |

Measuring RSI requires measuring feedback, not only capability: how much AI systems contribute to the next cycle of AI development, and how much harder that cycle becomes to oversee.


7. Governance implications

Governance should start before fully autonomous AI researchers exist.

A minimum agenda could include:

| Governance proposal | Rationale | Relevant sources |
| --- | --- | --- |
| Treat AI R&D automation as a frontier risk category | AI systems that accelerate AI development can change the risk landscape even before full RSI | OpenAI, Anthropic, Google DeepMind, METR common elements [14][15][16][33] |
| Disclose internal AI R&D use | Public evaluations may miss risks from internal coding, training, evaluation, and deployment agents | Internal deployment disclosure paper; METR Frontier Risk Report [27][31] |
| Evaluate AI R&D capability before deployment | Research engineering, paper replication, scaffold improvement, and long-horizon tasks are core RSI precursors | RE-Bench, PaperBench, METR time horizons [10][11][17] |
| Audit internal AI R&D agents | Agents with access to code, evals, model weights, or deployment systems create high-stakes internal risk | METR Frontier Risk Report; Anthropic RSP; DeepMind FSF [15][16][27] |
| Separate capability automation from safety evidence generation | AI systems that accelerate capability work should not freely control safety evidence | Anthropic sandbagging; Self-Rewarding LMs; AIRDA [2][18][19] |
| Monitor RSI milestones | RSI should be tracked as a set of measurable feedback loops rather than a single speculative event | AIRDA, CSET, OpenAI, Anthropic, DeepMind [1][2][14][15][16] |
| Maintain minimum safeguards for autonomous AI development | Training, testing, assurance, access controls, and human approval matter before fully autonomous development arrives | SAIF mitigation brief [13] |

8. Conclusion: RSI as an empirical risk framework

Recursive self-improvement should be studied within an empirical risk framework.

The central question is no longer whether a model can suddenly rewrite itself into a superintelligence. The more concrete question is which self-improvement feedback loops are emerging today, how quickly they are improving, and when they could combine into catastrophic risk pathways.

Some of these loops are already visible. Foundation models can learn from their own generated reasoning. Agent scaffolds can be optimized. Coding agents can modify their own code. AI agents can perform parts of ML research engineering. AI systems can generate data, judgments, evaluations, and research artifacts. Deployed agents can accumulate memory and skills.

Whether full recursive self-improvement is achievable remains an open question, but the enabling pieces are becoming measurable.

That is enough reason to start building a rigorous RSI risk science now.


References

[1] CSET — When AI Builds AI: Findings From a Workshop on Automation of AI R&D

[2] Chan et al. — Measuring AI R&D Automation

[3] Zelikman et al. — STaR: Bootstrapping Reasoning With Reasoning

[4] Shumailov et al. — AI models collapse when trained on recursively generated data

[5] Betley et al. — Training large language models on narrow tasks can lead to broad misalignment

[6] Cloud et al. — Language models transmit behavioural traits through hidden signals in data

[7] Zhang et al. — Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

[8] Hu et al. — Automated Design of Agentic Systems

[9] AFlow — Automating Agentic Workflow Generation

[10] Wijk et al. — RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

[11] OpenAI — PaperBench: Evaluating AI’s Ability to Replicate AI Research

[12] Lu et al. — Towards end-to-end automation of AI research

[13] SAIF — Bare Minimum Mitigations for Autonomous AI Development

[14] OpenAI — Preparedness Framework v2

[15] Anthropic — Responsible Scaling Policy v3.0

[16] Google DeepMind — Updating the Frontier Safety Framework

[17] METR — Measuring AI Ability to Complete Long Tasks

[18] Anthropic Alignment Science — Automated Researchers Can Subtly Sandbag

[19] Yuan et al. — Self-Rewarding Language Models

[20] Zelikman et al. — Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

[21] Chen et al. — Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

[22] Yuksekgonul et al. — TextGrad: Automatic “Differentiation” via Text

[23] Liang et al. — Self-evolving Agents with Reflective and Memory-augmented Abilities

[24] Ouyang et al. — ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

[25] Wang et al. — Voyager: An Open-Ended Embodied Agent with Large Language Models

[26] Nous Research — Hermes Agent GitHub repository

[27] METR — Frontier Risk Report: February to March 2026

[28] Google DeepMind — AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

[29] UK AI Security Institute — Frontier AI Trends Report

[30] International AI Safety Report 2026

[31] Chan et al. — What Should Frontier AI Developers Disclose About Internal Deployments?

[32] Bengio et al. — Managing extreme AI risks amid rapid progress

[33] METR — Common Elements of Frontier AI Safety Policies