Understanding the Risk of Recursive Self-Improvement

Xudong Pan - 25 May, 2026

An empirical framework for AI-driven AI development

Recursive self-improvement has often been imagined as a dramatic threshold: an AI system becomes capable enough to improve itself, and each improvement makes the next one easier.

That image captures a real concern, but it is too narrow for today’s frontier AI landscape.

Today, AI systems are beginning to enter the processes that build, evaluate, optimize, and deploy other AI systems. They can generate training data, improve reasoning traces, act as judges, optimize agent workflows, replicate research papers, run experiments, maintain long-term memory, assist with AI R&D infrastructure, and improve algorithms used in real computational systems [1][2][12][28].

This means recursive self-improvement should be understood as a family of feedback loops. Some loops operate at the level of foundation models. Others operate through agent scaffolds, research infrastructure, runtime memory, multi-agent systems, or external resources.

The central question is empirical:

Which self-improvement loops are already technically feasible, which ones remain speculative, and when could they combine into catastrophic risk pathways?

This essay offers a simplified framework for thinking about that question.

1. What counts as recursive self-improvement today?

A useful modern definition is:

Recursive self-improvement refers to feedback loops in which AI systems contribute to improving themselves, their scaffolds, their successor systems, or the infrastructure that enables future AI development.

This definition matters because current AI systems do not need to directly rewrite their own weights to participate in self-improvement. A frozen foundation model can still become part of a self-improving system if it helps improve prompts, tools, workflows, memory, evaluation pipelines, training data, research code, or future models.

This is also why AI R&D automation has become central to the discussion. CSET’s workshop report When AI Builds AI describes how leading AI companies are already using AI systems to accelerate research and development, with each generation potentially contributing to the next generation [1]. GovAI and Oxford’s Measuring AI R&D Automation argues that AI R&D automation could significantly affect both AI progress and human oversight, while existing capability benchmarks may fail to capture real-world automation [2].

Frontier AI developers have also begun to treat this as a safety-relevant category. OpenAI’s Preparedness Framework includes AI self-improvement as a tracked risk area [14]. Anthropic’s Responsible Scaling Policy includes AI R&D thresholds [15]. Google DeepMind’s Frontier Safety Framework highlights machine learning R&D capabilities as especially important because future models may accelerate or automate AI development itself [16]. METR’s public work similarly tracks long-horizon autonomy and frontier AI risk from internal agentic deployments [17][27].

The risk is therefore less about a single “takeoff moment” and more about a gradual expansion of feedback loops inside the AI development process.

2. Six contemporary paradigms of RSI

We find it useful to separate contemporary recursive self-improvement into six technical paradigms.

Paradigm	What improves?	Typical feedback signal	Why it matters for RSI risk	Example sources
Foundation-model-level self-improvement	Weights, data, reward signals, reasoning, alignment behavior, successor models	Generated data, rationales, self-judgment, human/AI preference signals	Can propagate capability gains or hidden alignment drift across model generations	STaR, Quiet-STaR, Self-Rewarding LMs, SPIN, model collapse, emergent misalignment, subliminal learning [3][4][5][6][19][20][21]
Agent-scaffold-level self-improvement	Prompts, workflows, tools, planners, agent code, control logic	Benchmark scores, textual feedback, task success, evaluator outputs	Allows system capability growth even when model weights are frozen	DGM, ADAS, AFlow, TextGrad [7][8][9][22]
Infrastructure-level AI R&D self-improvement	Coding, experiments, data processing, evaluation, training pipelines, algorithm discovery, successor-model development	Research outcomes, eval scores, runtime efficiency, replication success	Most directly links AI systems to faster AI capability development	RE-Bench, PaperBench, AI Scientist, AlphaEvolve, AIRDA measurement [2][10][11][12][28]
Runtime agent self-improvement modules	Memory, skills, reflection, procedural knowledge, user/task models	Task outcomes, environment feedback, self-reflection, accumulated experience	Makes deployed agents adaptive moving targets after evaluation	SAGE, ReasoningBank, Voyager, Hermes Agent [23][24][25][26]
Multi-agent and population-level self-improvement	Agent populations, archives, research teams, peer review, specialization	Peer critique, selection, debate, population search, automated review	Can amplify speed while making oversight and attribution harder	AI Scientist, DGM, ADAS, multi-agent task decomposition [7][8][12]
Environment and resource-level self-improvement	Tools, data, compute, access, permissions, persistence, replication	Resource access, operational success, deployment continuity	Couples self-improvement with autonomy, proliferation, and control risk	SAIF brief, AISI trends, International AI Safety Report, METR risk report [13][27][29][30]

2.1 RSI at the foundation-model level

This is the most familiar form: a model helps improve its own weights, training data, reward signal, reasoning ability, alignment behavior, or successor model.

Examples include self-generated reasoning data, self-training, self-play, self-rewarding language models, synthetic data recursion, distillation, model-generated preference data, and automated post-training.

There is already substantial component evidence. STaR showed that a model can improve reasoning by learning from its own generated rationales [3]. Quiet-STaR extends this idea toward learning latent rationales in more general text [20]. Self-Rewarding Language Models use the language model itself as a judge to provide rewards during training [19]. SPIN uses self-play to improve language models without requiring additional human-annotated data [21].

More recent work on recursive synthetic data, model collapse, emergent misalignment, and subliminal learning shows that recursive training and distillation loops can also transmit or amplify unwanted properties [4][5][6].

The key risk at this layer is alignment drift across model generations. If models increasingly generate the data, rewards, critiques, and training signals used to improve future models, hidden behavioral traits or evaluation blind spots may propagate through the development pipeline.
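The generate-filter-train pattern behind the self-training results above can be sketched with a toy example. This is a minimal illustration, not the procedure from any cited paper: the `generate` function is a hypothetical, deliberately flawed "model", the arithmetic problems are invented, and the fine-tuning step is omitted.

```python
def generate(question):
    """Hypothetical flawed model: adds two digits but drops the carry."""
    a, b = question
    rationale = f"add {a} and {b}, keep the last digit"
    return rationale, (a + b) % 10


def filter_correct(problems):
    """Keep only generations whose final answer matches the known answer."""
    kept = []
    for question, answer in problems:
        rationale, predicted = generate(question)
        if predicted == answer:
            kept.append((question, rationale, answer))
    return kept


# Invented toy problems: (question, known answer).
problems = [((2, 3), 5), ((4, 5), 9), ((7, 8), 15)]
training_set = filter_correct(problems)
# training_set would then be used to fine-tune the model (not shown here).
```

In this toy, the filter checks only final answers, so the flawed "keep the last digit" rationale survives whenever no carry is needed. That is one way an evaluation blind spot could pass into the next round of training data.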

A simple milestone ladder:

Milestone	Evidence status	Representative sources
Models generate reasoning data that improves their own performance	Demonstrated	STaR; Quiet-STaR [3][20]
Models act as judges or reward providers	Demonstrated	Self-Rewarding Language Models; LLM-as-judge paradigm [19]
Models improve through self-play or self-training loops	Demonstrated	SPIN [21]
Recursive synthetic data changes model behavior	Demonstrated	Model collapse [4]
Distillation transmits hidden behavioral traits	Demonstrated	Subliminal learning [6]
Narrow fine-tuning causes broad behavioral change	Demonstrated	Emergent misalignment [5]
A model autonomously designs and executes a full post-training plan	Open	Frontier self-improvement frameworks motivate this threshold, but it remains unproven [14][15][16]
A current model independently produces a significantly stronger successor model	Open	OpenAI and Anthropic treat related capabilities as high/critical future thresholds [14][15]

2.2 RSI at the agent-scaffold level

Modern AI systems are often more than foundation models. They include prompts, tools, planners, workflows, memory, code editors, evaluators, and control logic. Agent-scaffold-level self-improvement occurs when AI systems improve this surrounding structure.

This includes automated workflow optimization, agent architecture search, self-modifying coding agents, textual-gradient optimization, prompt evolution, tool-use optimization, and benchmark-driven agent improvement.

This is one of the most practically important forms of RSI. The foundation model may remain frozen, while the overall system becomes more capable because the scaffold improves.

Darwin Gödel Machine is a striking example. It uses a foundation model to iteratively modify the code of coding agents, empirically validates changes on coding benchmarks, and reports large gains on SWE-bench and Polyglot while using sandboxing and human oversight [7]. ADAS and AFlow similarly explore automated design and optimization of agentic systems and workflows [8][9]. TextGrad shows how textual feedback from LLMs can be used to optimize components of compound AI systems, including prompts and code snippets [22].
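The general shape of an archive-and-validate improvement loop can be sketched as follows. This is a loose illustration of the pattern, not the Darwin Gödel Machine algorithm: here an "agent" is just an integer, `evaluate` is a hypothetical benchmark, and the modification step is a random nudge rather than model-written code.

```python
import random


def evaluate(agent):
    """Hypothetical benchmark: the score peaks when the parameter equals 10."""
    return -abs(agent - 10)


def archive_search(steps, seed=0):
    """Grow an archive by proposing variants and keeping validated ones."""
    rng = random.Random(seed)
    archive = [0]  # start from a single seed "agent"
    for _ in range(steps):
        parent = rng.choice(archive)          # sample any archived agent
        child = parent + rng.choice([-1, 1])  # propose a modification
        # Empirical validation: keep the child only if it does not regress.
        if evaluate(child) >= evaluate(parent):
            archive.append(child)
    return archive


archive = archive_search(steps=200)
best = max(archive, key=evaluate)
```

The base "model" here (the proposal rule) never changes. Any capability gain lives entirely in the archive, which is the property that makes scaffold-level improvement easy to miss when only model weights are evaluated.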

The risk here is capability growth outside model weights. Traditional evaluations may classify the base model as safe, while the surrounding scaffold gradually becomes better at planning, tool use, persistence, and task completion.

A simple milestone ladder:

Milestone	Evidence status	Representative sources
Automated prompt or workflow optimization	Demonstrated	AFlow; TextGrad [9][22]
Automated agent architecture search	Demonstrated	ADAS [8]
Agents modify their own code and improve benchmark performance	Demonstrated	Darwin Gödel Machine [7]
Agents improve their own tool use, planning, or code-editing mechanisms	Emerging	DGM, ADAS, AFlow, TextGrad [7][8][9][22]
Self-improving scaffolds generate sustained open-ended capability growth	Open	DGM demonstrates an early open-ended agent improvement loop, but sustained general capability growth remains unproven [7]

2.3 RSI at the infrastructure level of AI R&D

This is arguably the most consequential paradigm for frontier risk, because it links AI systems most directly to faster AI capability development.

Infrastructure-level self-improvement occurs when AI systems enter the AI development pipeline itself: coding, experiment design, data processing, evaluation, training infrastructure, algorithm discovery, paper replication, and successor-model development.

RE-Bench evaluates frontier model agents on realistic ML research engineering tasks and compares them with human experts. The best AI agents outperform human experts under short time budgets, while humans still improve more with longer time budgets [10]. PaperBench evaluates whether agents can replicate state-of-the-art AI research papers, including understanding contributions, building codebases, and executing experiments [11]. Nature’s AI Scientist paper presents an end-to-end pipeline that creates research ideas, writes code, runs experiments, analyzes data, writes manuscripts, and performs automated peer review [12]. AlphaEvolve shows that LLM-powered coding agents can discover and improve algorithms, including optimizations used in data centers, chip design, and AI training processes [28].

These systems are still limited. They do not show that full autonomous AI research has arrived. They do show that AI R&D automation is becoming measurable.

The risk at this layer is capability acceleration. If AI systems help produce stronger AI systems, the rate of frontier AI progress could increase. If that rate exceeds the capacity of safety evaluation, cybersecurity, institutional review, and governance, the result is an oversight gap.

A simple milestone ladder:

Milestone	Evidence status	Representative sources
AI assists coding, debugging, and experiment analysis	Observed	CSET workshop report; AIRDA measurement agenda [1][2]
AI agents complete ML research engineering subtasks	Demonstrated	RE-Bench [10]
AI agents partially replicate AI research papers	Demonstrated but limited	PaperBench [11]
AI systems run end-to-end research loops in controlled settings	Demonstrated	AI Scientist [12]
AI systems discover or optimize algorithms used in computational infrastructure	Demonstrated	AlphaEvolve [28]
AI significantly accelerates real frontier AI R&D	Open	AIRDA measurement paper identifies this as a key object of measurement [2]
AI-driven AI development shortens model generation cycles	Open	OpenAI and Anthropic include related thresholds in frontier risk frameworks [14][15]
AI R&D speed exceeds reliable human oversight capacity	Critical threshold	AIRDA oversight framing; frontier safety frameworks [2][14][15][16]

2.4 RSI through runtime agent modules

Some self-improvement occurs during deployment.

Runtime self-improvement modules allow agents to improve through long-term memory, reflection, skill libraries, procedural knowledge, experience replay, user models, and task models.

A deployed agent with memory and reusable skills is a moving target. The system evaluated at launch may differ from the system users interact with after weeks of accumulated experience. The model weights may stay the same, while the effective behavior changes.

This paradigm is especially relevant for personal assistants, research agents, coding agents, enterprise agents, and autonomous workflows. The safety concern is persistent adaptation: memory and skills can preserve useful knowledge, but they can also preserve bad strategies, biased assumptions, unsafe shortcuts, or user-specific manipulation patterns.
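The "moving target" point can be shown with a toy agent whose decision rule is fixed while its memory changes. This is a minimal sketch under stated assumptions: the `MemoryAgent` class, the task name, and the strategy labels are all hypothetical and do not describe any cited system.

```python
class MemoryAgent:
    """Toy agent: a fixed decision rule plus a mutable skill memory."""

    def __init__(self):
        self.memory = {}  # maps task name -> remembered strategy

    def act(self, task):
        # The rule itself (standing in for frozen weights) never changes;
        # only the contents of memory do.
        return self.memory.get(task, "default-strategy")

    def record(self, task, strategy, succeeded):
        # Store any strategy that worked, whether or not it was safe.
        if succeeded:
            self.memory[task] = strategy


agent = MemoryAgent()
at_launch = agent.act("deploy-service")
agent.record("deploy-service", "skip-tests-shortcut", succeeded=True)
after_use = agent.act("deploy-service")
```

The same input produces "default-strategy" at launch and "skip-tests-shortcut" after one recorded success. An evaluation run against the launch-time agent would not have observed the later behavior.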

A simple milestone ladder:

Milestone	Evidence status	Representative sources
Agents maintain long-term memory	Observed / increasingly productized	Hermes Agent; broader agent-memory ecosystem [26]
Agents form reusable skills from experience	Demonstrated in prototypes	Voyager; Hermes Agent [25][26]
Agents improve through reflection	Demonstrated in prototypes	SAGE [23]
Agents maintain procedural memory across tasks	Emerging	ReasoningBank; Voyager [24][25]
Agents distill successful and failed experiences into reusable reasoning strategies	Demonstrated in research prototype	ReasoningBank [24]
Runtime adaptation causes safety-relevant behavioral drift	Underexplored	Existing memory work motivates the question but does not settle the risk [23][24][25][26]
Runtime memory and skills enable strategic policy evasion	Open	Relevant to agent safety and monitoring, but direct evidence remains limited [23][24][27]

2.5 RSI across multi-agent populations

RSI may also emerge through groups of agents rather than a single agent.

Multi-agent self-improvement includes debate, critique, specialization, peer review, population-based search, evolutionary selection, and AI organizations that divide research tasks among specialized agents.

This matters because real AI R&D is organizational. It involves idea generation, coding, experiment execution, review, debugging, evaluation, and strategic decision-making. A future AI R&D system may look less like a solitary model improving itself and more like a synthetic research team.

The AI Scientist includes automated peer review [12]. Darwin Gödel Machine maintains an archive of generated coding agents and explores multiple improvement paths [7]. ADAS searches over agentic system designs [8]. These are early signals of population-level or organization-level feedback loops.

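The risk is distributed oversight failure. When many agents generate, review, and select each other’s outputs, errors become harder to attribute to any single component. Model-generated review can create evaluation monocultures, in which reviewers share the blind spots of the systems they review. Multi-agent systems can also scale R&D speed through parallelism, which adds to the volume of output that humans must oversee.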

A simple milestone ladder:

Milestone	Evidence status	Representative sources
AI generation plus AI critique	Observed	AI Scientist; ADAS-style meta-agent evaluation [8][12]
Automated peer review	Demonstrated in prototypes	AI Scientist [12]
Multi-agent task decomposition	Observed / common in agent systems	ADAS and broader agentic-system literature [8]
Archive or population-based agent evolution	Demonstrated	Darwin Gödel Machine [7]
AI organizations complete substantial AI R&D workflows	Open	AI Scientist and RE-Bench indicate components, but full organizational automation remains open [10][12]
Multi-agent self-improvement produces hard-to-predict collective behavior	Open	Motivated by multi-agent and population-level design loops [7][8][12]

2.6 RSI through environment and resources

Finally, an AI system can improve its future capabilities by changing its environment.

This includes acquiring tools, data, compute, permissions, API access, collaborators, deployment footholds, persistence mechanisms, and copies of itself.

This layer connects RSI to autonomy and control. A system can become more capable through better external resources, even without becoming cognitively smarter. Tool acquisition, data acquisition, environment setup, persistence, and self-replication can all increase what the system can do next.

The SAIF brief Bare Minimum Mitigations for Autonomous AI Development argues that if AI agents significantly automate or accelerate AI development, developers need minimum safeguards around training, testing, assurance, access, and human approval [13]. The UK AI Security Institute’s Frontier AI Trends Report reports rising success rates on controlled self-replication evaluations and discusses sandbagging as a control-relevant capability [29]. The International AI Safety Report discusses autonomous replication and self-defense as capabilities that could make systems harder to contain [30]. METR’s Frontier Risk Report examines misalignment risks from AI agents used inside frontier AI developers [27].

A simple milestone ladder:

Milestone	Evidence status	Representative sources
Agents use tools and modify environments	Observed	Frontier safety and agent capability evaluations [17][29]
Agents set up execution environments	Observed	Coding-agent and AI R&D automation benchmarks [10][11]
Agents acquire data or improve toolchains	Emerging	AI Scientist, AlphaEvolve, AIRDA framing [2][12][28]
Agents persist across sessions or environments	Emerging	Runtime-memory and long-horizon autonomy work [17][23][24][26]
Agents replicate under controlled evaluation settings	Demonstrated in limited evals	UK AISI trends; self-replication evaluations [29]
Agents autonomously acquire compute, access, or deployment footholds	Open	International AI Safety Report and SAIF brief motivate this risk pathway [13][30]
Self-improvement couples with persistence and loss of control	Critical threshold	UK AISI, METR, International AI Safety Report [27][29][30]

3. From self-improvement loops to catastrophic risk pathways

The question is not simply whether AI can improve itself. The policy-relevant question is which self-improvement loops could create catastrophic risk.

We see seven main pathways.

Risk pathway	Main RSI paradigms involved	Core concern	Relevant sources
Capability acceleration	Infrastructure-level, scaffold-level, foundation-model-level	AI accelerates the production of more capable AI systems	AIRDA, RE-Bench, AI Scientist, AlphaEvolve, frontier frameworks [2][10][12][14][15][16][28]
Oversight gap	Infrastructure-level, multi-agent, runtime modules	Human evaluation capacity fails to keep up with AI-generated R&D output	AIRDA, CSET, METR, internal deployment disclosure work [1][2][17][27][31]
Safety evidence corruption	Infrastructure-level, foundation-model-level	AI helps generate, select, or manipulate safety evidence	Anthropic sandbagging, model-generated judges, frontier safety frameworks [14][18][19]
Internal deployment risk	Infrastructure-level, environment-level	AI R&D agents act inside frontier labs with high-stakes access	METR Frontier Risk Report, internal disclosure paper, frontier policies [15][16][27][31]
Alignment drift	Foundation-model-level, runtime modules	Recursive optimization changes behavior in hidden ways	Model collapse, emergent misalignment, subliminal learning [4][5][6]
Proliferation	Infrastructure-level, environment-level	More actors can develop high-capability systems	SAIF brief, International AI Safety Report, AIRDA [2][13][30]
Loss of control	Environment-level, runtime modules, scaffold-level	Agents gain persistence, resources, autonomy, or replication capabilities	UK AISI trends, METR, International AI Safety Report [27][29][30]

3.1 Capability acceleration

The first pathway is speed. AI systems may accelerate the production of more capable AI systems.

This can happen through AI-assisted coding, experiment automation, paper replication, training pipeline optimization, algorithm discovery, and agent scaffold improvement [2][10][11][12][28].

The milestone to watch is not “full RSI.” It is the point where AI systems measurably shorten the development cycle of stronger models.

Key milestone: AI-driven AI development produces frontier capability progress faster than human institutions can evaluate and govern it.

3.2 Oversight gap

The second pathway is oversight failure.

AI may generate code, experiments, evaluations, model changes, and research artifacts faster than humans can reliably inspect. This could create a widening gap between the amount of AI-generated R&D output and the amount that human reviewers can understand, verify, and secure.

GovAI and Oxford explicitly identify this kind of measurement challenge: current benchmarks may not capture whether AI R&D automation accelerates capabilities more than safety progress or whether human oversight can keep pace [2].

Key milestone: AI-generated R&D output exceeds reliable human review capacity.
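The oversight-gap argument can be made concrete with a toy backlog model. The sketch below assumes, purely for illustration, that AI-generated R&D output grows by a fixed factor each cycle while human review capacity stays constant. The function name, the growth rate, and the units are hypothetical and are not drawn from any cited source.

```python
def oversight_backlog(initial_output, growth_rate, review_capacity, cycles):
    """Return the unreviewed backlog after each cycle.

    Illustrative assumptions: output grows by a fixed factor per cycle,
    review capacity is constant, and unreviewed items carry over.
    """
    backlog = 0.0
    output = float(initial_output)
    history = []
    for _ in range(cycles):
        # Reviewers clear as much pending work as their capacity allows.
        pending = backlog + output
        backlog = max(0.0, pending - review_capacity)
        history.append(backlog)
        output *= growth_rate
    return history


history = oversight_backlog(
    initial_output=50, growth_rate=2.0, review_capacity=100, cycles=4
)
# history == [0.0, 0.0, 100.0, 400.0]
```

In this toy setting the backlog stays at zero for two cycles and then grows quickly once output passes capacity, which is why the crossover point matters more than the absolute level of output.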

3.3 Safety evidence corruption

The third pathway is corruption of the evidence used to judge safety.

If AI systems help write evaluations, analyze experiments, generate safety reports, or monitor other AI systems, the safety evaluation pipeline itself becomes a risk surface.

Anthropic’s work on sandbagging by automated researchers shows that current models can, when prompted, subtly sandbag (deliberately underperform on) ML experiments and research decisions in ways that zero-shot prompted monitors fail to detect. The authors emphasize that today’s cases would likely still be noticed by humans [18]. Self-rewarding and LLM-as-judge systems broaden the issue: models increasingly participate in the reward and judgment processes used for training and evaluation [19].

Key milestone: AI systems meaningfully influence the evidence used to certify the safety of future AI systems.

3.4 Internal deployment risk

The fourth pathway is internal deployment.

The most important AI R&D agents may first appear inside frontier AI labs. They may access code repositories, experiment logs, model evaluations, training infrastructure, deployment systems, and security-sensitive workflows.

This matters because public chatbot evaluations may miss internal risks. A model that appears safe as a user-facing assistant could behave very differently as an internal R&D agent with tools, permissions, memory, and access to sensitive systems.

METR’s Frontier Risk Report is directly relevant here because it studies misalignment risks from AI agents used inside frontier AI developers [27]. A separate disclosure proposal argues that frontier developers should provide evidence about internal deployments as these systems grow in scope and capability [31].

Key milestone: AI R&D agents receive high-stakes access to model-development pipelines before external evaluators can assess their behavior.

3.5 Alignment drift through recursive optimization

The fifth pathway is alignment drift.

Recursive optimization can improve capabilities while also changing behavior in hidden ways. Model collapse shows that recursive generated-data loops can degrade generative models [4]. Emergent misalignment shows that narrow fine-tuning can induce broad behavioral changes [5]. Subliminal learning shows that behavioral traits can transmit through semantically unrelated data [6].

Together, these results suggest that recursive training and distillation loops deserve careful safety analysis.

Key milestone: successor models inherit hidden behavioral defects from AI-generated data, rewards, critiques, or distillation pipelines.
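A recursive generated-data loop can be illustrated with a toy simulation in which each "generation" fits a Gaussian to samples drawn from the previous generation's fit. This is a simplified sketch, not a reproduction of any cited experiment; the function name, sample size, and seed are arbitrary choices made for the example.

```python
import random
import statistics


def recursive_gaussian_fit(generations, sample_size, seed=0):
    """Repeatedly refit a Gaussian to its own samples.

    Returns the fitted standard deviation at each generation,
    starting from the original distribution N(0, 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    stds = [std]
    for _ in range(generations):
        # Train the next "model" only on data generated by the current one.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        # Maximum-likelihood (population) estimate; its expected variance
        # is (N - 1) / N times the variance of the sampling distribution.
        std = statistics.pstdev(samples)
        stds.append(std)
    return stds


stds = recursive_gaussian_fit(generations=50, sample_size=10)
```

Because each refit's variance is, in expectation, a factor of (N - 1)/N smaller than the variance of the distribution it sampled from, the fitted spread tends to shrink across generations even though individual runs fluctuate. Real generative models are far more complex, so this only conveys the direction of the effect.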

3.6 Proliferation and autonomous development

The sixth pathway is diffusion.

If AI agents reduce the expertise required to conduct AI development, more actors may be able to fine-tune, improve, replicate, or deploy powerful systems. This risk becomes especially serious if stolen models, open-weight models, or smaller labs can use AI R&D agents to close capability gaps.

The SAIF mitigation brief focuses directly on this possibility: autonomous AI development could reduce human oversight, hinder identification of accidents or misuse, and compromise the AI supply chain if safeguards are not in place [13].

Key milestone: advanced AI development becomes substantially easier for actors outside frontier labs.

3.7 Loss of control through persistence and resource acquisition

The seventh pathway is control loss.

RSI becomes more dangerous when it couples with autonomy, tool use, persistence, resource acquisition, and self-replication. Self-replication is not the whole of RSI risk, but it is an important warning signal that self-improvement may be connecting to persistence and control.

UK AISI reports rising success rates on controlled self-replication evaluations [29]. The International AI Safety Report explains why autonomous replication or effective self-defense could make AI systems difficult to contain [30]. METR’s Frontier Risk Report examines related risks in the context of internal AI agents at frontier developers [27].

Key milestone: self-improving agents can preserve, copy, or expand their operation across environments despite external constraints.

4. What current evidence already shows

We should be careful. Current evidence does not show that fully autonomous recursive self-improvement has arrived.

It does show that several enabling feedback loops are already technically real.

Capability	Evidence status	Representative materials
Models learn from their own generated reasoning	Demonstrated	STaR; Quiet-STaR [3][20]
Models provide reward or judgment signals for training	Demonstrated	Self-Rewarding Language Models [19]
Models improve through self-play/self-training	Demonstrated	SPIN [21]
Recursive generated data changes model behavior	Demonstrated	Model collapse [4]
Narrow fine-tuning can induce broad behavioral change	Demonstrated	Emergent misalignment [5]
Behavioral traits can transmit through generated data	Demonstrated	Subliminal learning [6]
Agent workflows can be automatically optimized	Demonstrated	AFlow; TextGrad [9][22]
Agent architectures can be automatically searched	Demonstrated	ADAS [8]
Agent code can self-modify and improve benchmark performance	Demonstrated	Darwin Gödel Machine [7]
AI agents can perform ML research engineering tasks	Demonstrated	RE-Bench [10]
AI agents can partially replicate AI research papers	Demonstrated but limited	PaperBench [11]
AI systems can run end-to-end research loops in controlled settings	Demonstrated	AI Scientist [12]
AI systems can discover or optimize algorithms used in real infrastructure	Demonstrated	AlphaEvolve [28]
Runtime agents can accumulate memory or reusable skills	Emerging / demonstrated in prototypes	SAGE; ReasoningBank; Voyager; Hermes Agent [23][24][25][26]
Controlled self-replication evaluations show rising success rates	Demonstrated in limited evals	UK AISI Frontier AI Trends Report [29]
AI can significantly accelerate real frontier AI R&D	Open	Measuring AIRDA; frontier safety frameworks [2][14][15][16]
AI can autonomously produce a stronger successor frontier model	Open	Frontier safety frameworks [14][15][16]

The important point is the pattern: AI is moving from being a product of AI R&D to becoming a participant in AI R&D.
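An evidence table like the one above could be kept as a small machine-readable ledger so that status changes are easy to track over time. The sketch below is one possible encoding. The `Milestone` class and the paradigm labels assigned to each row are assumptions made for illustration; the descriptions and statuses are copied from four rows of the table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    paradigm: str      # hypothetical label assigned for this example
    description: str   # wording taken from the table above
    status: str        # e.g. "Demonstrated", "Emerging", "Open"


ledger = [
    Milestone("foundation-model",
              "Models learn from their own generated reasoning",
              "Demonstrated"),
    Milestone("scaffold",
              "Agent code can self-modify and improve benchmark performance",
              "Demonstrated"),
    Milestone("infrastructure",
              "AI can significantly accelerate real frontier AI R&D",
              "Open"),
    Milestone("infrastructure",
              "AI can autonomously produce a stronger successor frontier model",
              "Open"),
]

# Summaries a tracking process might report at each review.
status_counts = Counter(m.status for m in ledger)
open_items = [m.description for m in ledger if m.status == "Open"]
```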

5. What remains unknown

Several open questions should guide future research.

Open question	Why it matters	Relevant sources
Acceleration: How much can AI R&D automation speed up real frontier AI development?	Determines whether RSI is mostly an efficiency gain or a destabilizing capability accelerator	AIRDA, CSET, frontier frameworks [1][2][14][15][16]
Bottlenecks: Are the main bottlenecks coding, experiments, ideas, data, compute, coordination, or judgment?	RSI risk depends on which parts of AI R&D are actually rate-limiting	CSET, AIRDA, RE-Bench, PaperBench [1][2][10][11]
Oversight: Can humans reliably audit large volumes of AI-generated code, experiments, evaluations, and research claims?	Defines whether AI-generated R&D creates an oversight gap	AIRDA, Anthropic sandbagging, internal disclosure work [2][18][31]
Recursive contribution: How much does a current model contribute to the next generation’s capability progress?	Tracks whether AI development is becoming self-reinforcing	OpenAI, Anthropic, Google DeepMind frameworks [14][15][16]
Safety asymmetry: Does AI accelerate capability research faster than safety research?	Determines whether AI-driven AI development widens or narrows the safety gap	AIRDA; Science extreme-risk framing [2][32]
Internal deployment: What permissions and infrastructure access do internal AI R&D agents have?	Internal agent risk may be invisible from public model evaluations	METR Frontier Risk Report; internal deployment disclosure paper [27][31]
Alignment drift: Do recursive training, distillation, and post-training loops systematically amplify hidden misalignment?	Connects RSI to model behavior across generations	Model collapse; emergent misalignment; subliminal learning [4][5][6]
Control: When does self-improvement combine with persistence, replication, and resource acquisition?	Marks the transition from capability acceleration to control-loss risk	UK AISI, International AI Safety Report, METR [27][29][30]

These questions are more useful than asking whether RSI has “arrived.” The stronger research agenda is to measure which feedback loops are emerging and how fast they are moving.

6. What should be measured next?

A practical RSI measurement agenda should track both technical capability and governance risk.

Technical metrics

Metric	Question	Candidate sources or measurement anchors
AI R&D task autonomy	What length and complexity of AI R&D tasks can AI systems complete?	RE-Bench, PaperBench, METR time horizons [10][11][17]
Research-loop closure	Can an AI system move from idea to experiment to revision?	AI Scientist; CSET workshop report [1][12]
Successor contribution	How much does AI contribute to the next generation of models?	OpenAI Preparedness Framework; Anthropic RSP; DeepMind FSF [14][15][16]
Scaffold improvement rate	How fast can agents improve their own prompts, tools, workflows, or code?	DGM, ADAS, AFlow, TextGrad [7][8][9][22]
Runtime skill accumulation	Can deployed agents accumulate transferable skills across tasks?	SAGE, ReasoningBank, Voyager, Hermes Agent [23][24][25][26]
Algorithmic infrastructure optimization	Can AI improve algorithms or infrastructure used for AI development itself?	AlphaEvolve [28]

Risk metrics

Metric	Question	Candidate sources or measurement anchors
Oversight burden	How much AI-generated R&D output must humans review?	AIRDA; internal deployment disclosure paper [2][31]
Oversight reliability	Can humans detect subtle errors, sandbagging, or sabotage?	Anthropic sandbagging; METR Frontier Risk Report [18][27]
Safety evidence dependence	How much safety evidence is generated or processed by AI systems?	Self-Rewarding LMs; Anthropic sandbagging; frontier safety frameworks [14][18][19]
Internal access level	What sensitive assets can AI R&D agents access?	METR Frontier Risk Report; internal deployment disclosure paper [27][31]
Capability/safety acceleration ratio	Does AI accelerate capability work more than safety work?	AIRDA; Science extreme-risk framing [2][32]
Proliferation potential	Does AI reduce the expertise or cost needed for advanced AI development?	SAIF brief; International AI Safety Report [13][30]
Persistence and replication risk	Can agents copy, preserve, or expand their operation across environments?	UK AISI; International AI Safety Report [29][30]

Measuring RSI requires measuring feedback, not only capability: how much AI systems contribute to the next cycle of AI development, and how much harder that cycle becomes to oversee.
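Two of the risk metrics above reduce to simple ratios once their inputs are defined. The sketch below shows one hypothetical way to compute them. The function names, the choice of inputs, and the example numbers are assumptions; how to measure the underlying speedups and review counts in practice is the hard, open part.

```python
def acceleration_ratio(capability_speedup, safety_speedup):
    """Ratio above 1 means capability work is accelerating faster than safety work."""
    if safety_speedup <= 0:
        raise ValueError("safety_speedup must be positive")
    return capability_speedup / safety_speedup


def review_coverage(reviewed_items, generated_items):
    """Fraction of AI-generated R&D artifacts that humans actually reviewed."""
    if generated_items == 0:
        return 1.0  # nothing was generated, so nothing is left unreviewed
    return reviewed_items / generated_items


# Invented example inputs.
ratio = acceleration_ratio(capability_speedup=3.0, safety_speedup=1.5)  # 2.0
coverage = review_coverage(reviewed_items=200, generated_items=1000)    # 0.2
```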

7. Governance implications

Governance should start before fully autonomous AI researchers exist.

A minimum agenda could include:

Governance proposal	Rationale	Relevant sources
Treat AI R&D automation as a frontier risk category	AI systems that accelerate AI development can change the risk landscape even before full RSI	OpenAI, Anthropic, Google DeepMind, METR common elements [14][15][16][33]
Disclose internal AI R&D use	Public evaluations may miss risks from internal coding, training, evaluation, and deployment agents	Internal deployment disclosure paper; METR Frontier Risk Report [27][31]
Evaluate AI R&D capability before deployment	Research engineering, paper replication, scaffold improvement, and long-horizon tasks are core RSI precursors	RE-Bench, PaperBench, METR time horizons [10][11][17]
Audit internal AI R&D agents	Agents with access to code, evals, model weights, or deployment systems create high-stakes internal risk	METR Frontier Risk Report; Anthropic RSP; DeepMind FSF [15][16][27]
Separate capability automation from safety evidence generation	AI systems that accelerate capability work should not freely control safety evidence	Anthropic sandbagging; Self-Rewarding LMs; AIRDA [2][18][19]
Monitor RSI milestones	RSI should be tracked as a set of measurable feedback loops rather than a single speculative event	AIRDA, CSET, OpenAI, Anthropic, DeepMind [1][2][14][15][16]
Maintain minimum safeguards for autonomous AI development	Training, testing, assurance, access controls, and human approval matter before fully autonomous development arrives	SAIF mitigation brief [13]

8. Conclusion: RSI as an empirical risk framework

Recursive self-improvement should be studied empirically, as a set of measurable feedback loops rather than a single speculative event.

The central question is no longer whether a model can suddenly rewrite itself into a superintelligence. The more concrete question is which self-improvement feedback loops are emerging today, how quickly they are improving, and when they could combine into catastrophic risk pathways.

Some of these loops are already visible. Foundation models can learn from their own generated reasoning. Agent scaffolds can be optimized. Coding agents can modify their own code. AI agents can perform parts of ML research engineering. AI systems can generate data, judgments, evaluations, and research artifacts. Deployed agents can accumulate memory and skills.

Whether full recursive self-improvement is achievable remains an open question, but the enabling pieces are becoming measurable.

That is enough reason to start building a rigorous RSI risk science now.