Agentic Loops and How to Use Them

Executive Summary

Over the course of a few sessions with Claude Code, I took an existing multi-agent CI code review system and refactored it into a proper Claude Code plugin — complete with a marketplace definition, a /teim-review slash command for local use, and a single orchestration agent that replaced three separate Ansible-driven Claude invocations. The final commit was 26 files changed, 601 insertions, 397 deletions.

What made this interesting wasn’t the feature itself. It was the process. The tool we were building — an AI code reviewer — reviewed its own development iteratively across multiple sessions. Each review cycle produced a structured plan, plans were executed, and then the reviewer was turned back on the result. But the loop wasn’t just AI reviewing AI. It was an agent with access to the full engineering stack: linters, unit tests, molecule tests, and pre-commit hooks. The AI could verify its own work against objective reality, not just its own judgement. That closed quality gate is something I want to document here, because I think it represents a genuinely useful template for how to approach this kind of agentic development work.


What Is teim-review? Setting the Context

Before we get to the how, it helps to understand what we were working with and why.

teim-review is an AI-powered code review system built for OpenStack development. It runs inside a Zuul CI job — specifically on my personal third-party Zuul instance at zuul.teim.app, which is part of my homelab setup — and uses Claude Code to analyse proposed changes, compare them against OpenStack coding standards, and post structured feedback back to Gerrit. The output is both a JSON findings file and a rendered HTML report that surfaces as a CI artifact.

The system lives in the openstack-ai-style-guide repository, which also contains the style guides themselves — quick-rules and a comprehensive guide — that inform the review.

The architecture before the change

At the point we started this work, the system looked like this:

zuul.d/jobs.yaml
    └── teim-code-review job
         └── playbooks/teim-code-review/run.yaml
              ├── role: ai_review_setup     (configure Claude, set model mappings)
              ├── role: ai_code_review      (invoke claude code → context extraction agent)
              ├── role: ai_code_review      (invoke claude code → commit summary agent)
              ├── role: ai_code_review      (invoke claude code → code review agent)
              ├── role: ai_html_generation  (render JSON → HTML)
              └── role: ai_zuul_integration (post comments, register artifacts)

Each ai_code_review role invocation was a separate claude CLI call — meaning Zuul was spinning up Claude three times, each time loading context from scratch. The model names in the agent definitions also referenced GLM model names directly, which made the agents tightly coupled to that specific homelab deployment.

The Ansible playbook was doing orchestration work — sequencing agents, passing outputs between them, handling retries — that really belonged inside a single Claude session.

The architecture after the change

zuul.d/jobs.yaml
    └── teim-code-review job
         └── playbooks/teim-code-review/run.yaml
              ├── role: ai_review_setup     (install plugin, configure Claude)
              ├── role: ai_code_review      (invoke claude code → @teim-review-agent)
              │                              │
              │         ┌────────────────────┴────────────────────────┐
              │         │       Single Claude session                  │
              │         │                                              │
              │         │  @zuul-context-extractor  (haiku)           │
              │         │       ↓                                      │
              │         │  @commit-summary          (haiku)           │
              │         │       ↓                                      │
              │         │  @project-guidelines-extractor (haiku)      │
              │         │       ↓                                      │
              │         │  @code-review-agent       (inherit)         │
              │         │       ↓                                      │
              │         │  render_html_from_json.py                   │
              │         └──────────────────────────────────────────────┘
              ├── role: ai_html_generation  (verify HTML artifact)
              └── role: ai_zuul_integration (post comments, register artifacts)

One Ansible invocation, one Claude session, one coherent context shared across all the subagents. The orchestration logic moved from the playbook into the agent definition, where it belongs. And because the system is now a proper plugin, the same /teim-review slash command that runs in CI is available locally — no manual Ansible required.

What prompted the change

There were four specific things I wanted to fix:

  1. Agent definitions used deployment-specific model names. The agents referenced glm-5-turbo, glm-4.7, and glm-4.7-flash directly, which meant they’d break outside the homelab CI without editing. We needed to use Claude’s canonical model tier names (haiku, sonnet, opus, inherit) so the mapping could happen at configuration time. The CI job configures a ~/.claude/settings.json that maps each tier name to whatever model is available in the environment — GLM models in the homelab, or any other provider elsewhere. That means the hardcoded tier names in the agent definitions are effectively just defaults; the actual model used is determined at runtime by the settings file, and can be overridden without touching the agent definitions at all.

  2. Three Claude invocations instead of one. Every invocation is cold-start overhead. More importantly, agents running in separate sessions couldn’t share context naturally. A single orchestration agent handling the full pipeline in one session is more efficient and more coherent.

  3. No way to run it locally. The review pipeline only existed as a Zuul job. If you wanted to run it on your local branch before pushing, you’d have to manually orchestrate the Ansible playbook. That’s not a workflow anyone was going to adopt.

  4. Not distributable as a plugin. The whole system was wired to a specific repo clone path in CI. There was no way for another project to install and use it without significant manual configuration.
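To make the tier-name indirection from point 1 concrete, here is a sketch of the kind of ~/.claude/settings.json the CI job could write. This is illustrative only — the exact environment keys depend on your Claude Code version and provider, so treat every key and value below as an assumption rather than the repo’s actual configuration:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://glm.example.internal/api",
    "ANTHROPIC_AUTH_TOKEN": "…injected from CI secrets…",
    "ANTHROPIC_MODEL": "glm-5-turbo",
    "ANTHROPIC_SMALL_FAST_MODEL": "glm-4.7-flash"
  }
}
```

The agent definitions only ever say haiku, sonnet, opus, or inherit; a file like this is what binds those tier names to whatever models the environment actually provides.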


A Claude Code Primer: Skills, Subagents, Slash Commands, and Plugins

If you’re new to Claude Code’s extensibility model, here’s a quick orientation before we go further. This is one of the areas where terminology can get confusing quickly, especially if you’re used to VS Code extensions or similar systems.

TL;DR: Think of it as a layered system — slash commands are the entry points, skills are packaged workflows, subagents are specialist collaborators, and plugins are the distribution mechanism that bundles all of the above.

Slash commands

A slash command (/teim-review) is the most explicit form of extension in Claude Code. You type it, and Claude Code executes the associated skill or command definition. It’s an intentional trigger — nothing happens unless you invoke it.

Skills

A skill is a structured instruction file (conventionally named SKILL.md) that teaches Claude Code how to perform a specific workflow. It defines the trigger condition, required parameters, what tools are available, what output to produce, and where to put it. When a skill is installed via a plugin, its slash command becomes available in your session.

The key distinction from a bare slash command: a skill is self-describing. It includes enough context about when and how to use it that Claude can activate it appropriately, and it carries configuration that shapes the tool invocations it makes.
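For a sense of the shape, a minimal SKILL.md is YAML frontmatter describing when to use the skill, followed by the workflow instructions. The sketch below is invented for illustration — it is not the repo’s actual skill file:

```markdown
---
name: teim-review
description: Run an AI code review of the current repo or Zuul change
  against the OpenStack style guides. Use when the user asks for a
  review of pending changes.
---

1. Detect the environment: a Zuul CI job or a local checkout.
2. Gather context via the extraction subagents.
3. Invoke the review agent and write the findings JSON to the output dir.
4. Render the HTML report from the JSON findings.
```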

Subagents

A subagent (or agent) is a markdown file in an agents/ directory that defines a specialist role. When Claude Code (or another agent) invokes it via @agent-name, it gets its own execution context with a specific system prompt, model assignment, and tool permissions. Subagents are how you decompose a complex pipeline into focused, reusable pieces.

The analogy that works well: subagents are like library functions. You define them once, give them a clear contract (input → output), and call them from anywhere. The orchestrating agent doesn’t need to know how they work internally.
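As an illustration of that contract, a lightweight extraction agent like @commit-summary might be defined along these lines. The frontmatter fields are the standard Claude Code subagent ones; the body prose here is invented, not the repo’s actual definition:

```markdown
---
name: commit-summary
description: Summarise the commit message and diff of the change under review.
model: haiku
tools: Read, Grep, Bash
---

You are a commit-summary specialist. Read the commit message and the diff
of the change under review and produce a short structured summary: intent,
scope of the change, and any stated test or upgrade impact. Output JSON
matching the schema provided by the orchestrating agent.
```

The model: haiku line is the two-tier strategy in action — cheap extraction work pinned to the fast tier, while the review agent inherits the session model.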

Plugins and the marketplace: Git-as-a-Registry

This is where the distribution model gets interesting, and where I think the framing of “marketplace” can mislead people. This isn’t a store. It’s git-as-a-registry.

A plugin bundles related skills, agents, hooks, and MCP servers into a distributable unit with a plugin.json manifest (in this repo, under .claude-plugin/). A marketplace is a marketplace.json that acts as a catalogue — you point Claude Code at any git repository and it becomes a source of installable tools. The tools live in the repo; the marketplace manifest is just the index.
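The manifest itself is small. A representative sketch — field values invented, and the field set here is the common core rather than an exhaustive schema:

```json
{
  "name": "teim-review",
  "version": "0.1.0",
  "description": "AI code review for OpenStack projects, runnable locally or in Zuul CI",
  "author": {
    "name": "sean-k-mooney"
  }
}
```

The marketplace.json alongside it is just as thin: a name, an owner, and a list of plugin entries pointing at their sources within the repo.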

claude plugin marketplace add /path/to/openstack-ai-style-guide
claude plugin install teim-review@openstack-ai-style-guide

Or, once the marketplace is registered, you can browse and install directly from within Claude Code using the built-in /plugins command — no CLI required.

After that, /teim-review is available in any session, @teim-review-agent is callable from other agents, and everything is version-tracked alongside the source repo. When the repo updates, the plugin updates.

The DX benefit for humans is real: no manual file copying, no PATH configuration, no “clone this repo and source that script.” But the CI benefit is just as significant. What was previously several manual steps — clone the style guide repo, configure model mappings in ~/.claude/settings.json, symlink the agents directory — became a single versioned command in the Ansible playbook. The ai_review_setup role went from bespoke configuration logic to a clean two-liner: marketplace add, then plugin install. The tool installation is now auditable, reproducible, and as easy to update as bumping a branch ref.
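As a sketch of what that two-liner looks like in Ansible terms (the variable name is a hypothetical stand-in, not the repo’s actual role code):

```yaml
# Illustrative sketch of the slimmed-down ai_review_setup tasks.
# teim_style_guide_dir is a hypothetical variable for the cloned repo path.
# A real role would also add failure handling and idempotency conditions,
# which is exactly what the v2 review later flagged.
- name: Register the style-guide repo as a plugin marketplace
  ansible.builtin.command:
    cmd: "claude plugin marketplace add {{ teim_style_guide_dir }}"

- name: Install the teim-review plugin from that marketplace
  ansible.builtin.command:
    cmd: "claude plugin install teim-review@openstack-ai-style-guide"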

If you’ve built internal tooling before, you’ll recognise the pattern: the goal is to make the right way to use your tools also the easy way. Plugins get you there for Claude Code tooling.


The How: Starting with the Prompt

Here is the complete initial prompt I gave Claude Code to kick off this work. I’ve left it exactly as I typed it — typos, run-on sentences, and all. That’s deliberate, and it matters.

currently i have a number of agent in this repo @agents/ that are orcrestated by
ansible @playbooks/ as part of a zuul ci job @zuul.d/ . in the ci job we configure
claude by redefining the defautl haiku, sonnet and opus names and for rubustness we
also override the model when we invoke claude code explcity on the command line.
i want to make several optimisations and enhancments to how this works.

first i would like the existing sub agent to not use the glm names, we should either
update the defintions to inherit or use the calude equivlent names.
second i would like to create a teim-review subagent or skill that will intelegently
orchestrate teh exisitng subagents.
effectivly i want the skill or subagent to be usabel locally or in ci allowing it to
replace the exisitng ansible orchestration so we do not need to invoke claude multiple
times.
this new skill would instaead detect if we are runing in zuul or if we are invoking
it localy to do review and either genreate the zuul contex or review the current
repo instead.
then it shoudl gather context form the commit message and repo usign the exisng sub
agents and finally invoke teh review subagent

optionally it would be nice if we coudl ask the skill/subagent to generate teh same
html report as the ci job and json finding file in a .teim-reivew subfolder when
locally or in a location specifed in the promt when invokeing the subagent/skill

lets reason about how to do this interactivly before we proceed to the implemation

i would also like to enable the current repo to be used as a plugin and market place

the ci job should continue to functon troughut this porcess as it will use the
updated defietion to review the propsoed change when we submit it as a pr.
so we will need to adtap the zuul job to work with the new plugin/marketpalce layout
and also update the way we enabel the skills/subagent before we invoke claude in the
zuul job.

once we have tested the changes locally and pushed a pr for review i will also take
the review feedback and any error taht are found in the output fo the job and we will
use that to refien the impelatin futuer in a seperate sesssion.

note that before we proceed to that initall push and refinement loop all pre-commit
and local tests shoudl be exectued and any linitn issues adressed as those will also
be enfoced in ci and they provide valable quality gates during this refactor and
feature devlopement.
we also need to update the repos docuementiaon comprehsnivley for this work.

That prompt produced a detailed, targeted implementation plan covering 26 files, with the correct model tier strategy, the right plugin schema, and a working CI integration — on the first planning pass. Here’s why it worked.

Structure of intent, not structure of prose

The model doesn’t care about your spelling. What it needs is a clear picture of what you want, what already exists, and what cannot change. This prompt delivers all three, with no polish whatsoever.

@agents/, @playbooks/, @zuul.d/ — these are the most important characters in the prompt. In Claude Code, the @ prefix is a live context reference. Rather than pasting 900 lines of Ansible YAML into the conversation, I pointed at the source directly. Claude Code loaded and understood the existing architecture before generating a single line of plan. The model’s understanding of the codebase was grounded in the actual current state of those files, not in my description of them. That’s the difference between contextual grounding and paraphrasing.

Required vs. optional, stated explicitly — “first… second…” sets the hard requirements. “Optionally it would be nice…” marks what can be deferred. This distinction propagates through the planning phase. The model won’t trade backward compatibility for an optional HTML report. You’ve told it the hierarchy.

“lets reason about how to do this interactivly before we proceed to the implemation” — This single line is load-bearing. It’s an explicit instruction to enter a planning loop before touching any code. It’s the difference between an agent that plans and an agent that acts. I’ll come back to why this habit is one of the most valuable things you can build.

Backward compatibility as a hard constraint — “the ci job should continue to functon troughut this porcess.” Not aspirational. Not a preference. A constraint. When this is stated explicitly, the planning system evaluates every proposed change against it. Unstated constraints get optimised away.

Staged delivery acknowledged upfront — “once we have tested the changes locally and pushed a pr for review i will also take the review feedback… in a seperate sesssion.” This scopes the session. The model isn’t trying to solve three iterations of refinement in one go. It has a clear definition of done.

The takeaway isn’t that you should write better prompts. It’s that you should write prompts with structure of intent and contextual grounding. The typos are irrelevant. The architecture they’re describing isn’t.


Plan Mode: The Iterative Planning Loop

TL;DR: Start every non-trivial task in plan mode. Separating “understand and plan” from “execute” is one of the highest-leverage habits in agentic development.

Claude Code’s plan mode is a read-only phase where the model explores the codebase, asks clarifying questions, and produces a structured implementation plan — without touching a single file. Only after you approve the plan does execution begin.

Most AI coding tools now have some equivalent of this concept. The specific name and interaction model varies, but the underlying principle is consistent across the ecosystem: separate understanding from execution. Jumping straight to implementation is tempting, especially for tasks that feel straightforward. Resist it. The planning phase is where ambiguities surface cheaply, before they become bugs or rework.

What emerged from the planning phase here was a structured document that identified exactly which files to change, what changes to make in each, the order of operations, and a verification checklist. A representative excerpt from the v1 plan:

Design constraint: Claude Code resolves a subagent’s model from the model: field in its own frontmatter — the orchestrating agent cannot override it at call time. So model assignment must live in the individual agent definitions.

That’s a non-obvious architectural constraint that would have caused confusion mid-implementation without the planning phase to surface it. The plan also identified the two-tier model strategy: lightweight extraction agents (zuul-context-extractor, commit-summary, project-guidelines-extractor) should use model: haiku — mapping to a fast, cost-efficient GLM model in the homelab CI — while code-review-agent should inherit the session model. That’s the power of inherit: in CI it resolves to whatever flagship model the job is configured with (GLM-5 or GLM-5-turbo in this case), while locally it resolves to whatever model is active in your Claude Code session — Opus, Sonnet, or anything else. The agent definition doesn’t change; the capability scales with the context it’s running in. That decision came out of the planning conversation, not the initial prompt.

The full v1 plan is available here if you want to see what a complete planning output looks like for a change of this scope.


Self-Reinforcing Review: The Closed Quality Gate

This is the part I find most interesting to reflect on, and I want to be specific about what “self-reinforcing” actually means here — because it’s broader than AI reviewing AI.

Once the v1 implementation was in place, the development loop looked like this:

implement changes
      ↓
pre-commit hooks (ruff, markdownlint, ansible-lint, license headers)
      ↓
unit tests (stestr) + molecule tests (role testing)
      ↓
/teim-review (AI code review against OpenStack standards)
      ↓
triage findings → plan fixes → implement → repeat
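The hook layer of that loop can be wired up in a .pre-commit-config.yaml along these lines. The hook repos and ids are the standard upstream ones; the rev pins are placeholders, not copied from the actual repo:

```yaml
# Sketch of the pre-commit gate: mechanical checks that run before every commit.
# Pin rev to real released tags in practice.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.0
    hooks:
      - id: ruff
  - repo: https://github.com/igorshubovych/markdownlint-cli
    rev: v0.41.0
    hooks:
      - id: markdownlint
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.7.0
    hooks:
      - id: ansible-lint
```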

The key is that the agent wasn’t just asking itself “does this look right?” It had access to linters that enforced objective rules, unit tests that verified behaviour, and molecule tests that exercised the Ansible roles in a real environment. When a pre-commit hook failed, that was a hard signal — not a suggestion. When a unit test broke, the agent knew exactly what it had changed that caused it.

The AI code review was one layer in a quality gate ecosystem, not the whole gate. And that’s the right framing for this kind of work generally. An AI reviewer is excellent at catching design smells, architectural inconsistencies, missing documentation, and patterns that violate project conventions. It is not a substitute for a test suite that exercises the actual behaviour. They’re complementary, and the combination is significantly more valuable than either alone.

With that context, here’s how the review loop played out.

The v1 report

The first self-review produced this HTML report:

  • 0 Critical
  • 1 High — SKILL.md was missing json_schema and tools_dir parameter documentation, meaning local invocations relied on implicit defaults not visible to users
  • 3 Warnings — plugin.json lacking explicit directory fields, unconditional changed_when: true in an Ansible role (idempotency violation), stale reference to a deleted role in the docs
  • 4 Suggestions — template placeholders, undocumented model strategy, HTML artifact placement, fragile regex in post-tasks

The overall assessment was “ready with minor fixes.” That’s a reasonable first-pass outcome for a feature that introduced 24 changed files. The pre-commit and unit test gates had already caught the mechanical issues; the AI review was surfacing design and documentation gaps.
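A severity breakdown like this can be turned into a mechanical gate in CI. The sketch below assumes a findings file shaped as a list of objects with a severity field — an assumption for illustration, not the repo’s actual JSON schema:

```python
from collections import Counter

# Hypothetical findings shape: each finding is a dict with a "severity" key.
# The real teim-review schema lives in the repo; this is illustrative only.
SEVERITY_ORDER = ["critical", "high", "warning", "suggestion"]

def summarise(findings):
    """Count findings per severity level, including levels with zero hits."""
    counts = Counter(f["severity"] for f in findings)
    return {level: counts.get(level, 0) for level in SEVERITY_ORDER}

def gate(findings):
    """Map severity counts to an overall review verdict."""
    counts = summarise(findings)
    if counts["critical"]:
        return "block"
    if counts["high"]:
        return "approve-with-comments"  # fix High issues before merge
    return "ready"

# Mirrors the v1 report: 0 Critical, 1 High, 3 Warnings, 4 Suggestions.
v1_findings = (
    [{"severity": "high"}]
    + [{"severity": "warning"}] * 3
    + [{"severity": "suggestion"}] * 4
)
print(summarise(v1_findings))
print(gate(v1_findings))  # → approve-with-comments
```

The verdict strings are arbitrary labels; what matters is that the policy lives in one small, testable function rather than being re-derived by hand on every report.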

Triaging findings: signal vs noise

Not all findings are equal, and one of the key skills in working with AI reviewers is developing judgment about what’s a genuine issue versus a false positive. This is where domain knowledge becomes the deciding vote.

When I brought the v1 report findings into the next session, the exchange went roughly like this:

Me: lets plan how to adress the new findign iteritively for the first high item this is a false positive lets not it in the AGENTS.md so it is not hilighted again. auto discovery is supproted adn adding agents and skill to the plugin.json is only required if not using the defualt locations. ask me clarifing question on how to proceed with the other finding

That fragment captures the pattern well. I identified which finding was a false positive and why — auto-discovery is the default; explicit directory fields in plugin.json are only needed for non-standard layouts. I told Claude where to document this to suppress future false positives. And I explicitly asked for clarifying questions rather than letting the model guess at how to handle the remaining findings.

The v2 plan that resulted triaged 8 findings with explicit classifications:

  • plugin.json missing agents_dir/skills_dir → False positive. Auto-discovery is the default. Document in HACKING.rst.
  • changed_when: true unconditional → Genuine. Ansible idempotency violation. Fix the task conditions.
  • Stale ai_context_extraction reference in docs → Genuine. Update to describe the current flow.
  • HTML script copied to home directory → Genuine. As I put it at the time: “it ideally would not be copied at all but instead executed via the fully qualifed path. the git repo will be cloned in the ci job in a well know localtion and we also knwo the location of the plugin so we can use the plugin dir info to constuct the fully qualifed path and invoke the tool direclty skiping the copy and cleanup entirly”

That last one is a good example of domain knowledge flowing back into the implementation. The reviewer flagged a smell; I knew the architectural solution because I knew the deployment constraints. The reviewer and the engineer each contributed what the other couldn’t.

The v2 plan and v3 plan are available if you want to see how the triage and refinement were structured across iterations.

The v2 report

After addressing the genuine v1 findings, a second self-review produced this report:

  • 0 Critical
  • 2 High — plugin install tasks in ai_review_setup had no failure handling (newly identified), and the SKILL.md schema path issue correctly elevated from suggestion to High
  • 3 Warnings
  • 4 Suggestions

The move from 1 High to 2 High might look like regression. It isn’t. One finding was newly identified (the failure handling gap), and one was correctly elevated in severity. The overall quality of the codebase improved between v1 and v2; the reviewer just got a more thorough look at it second time around.

The v2 assessment: “Approve with comments; fix High issues before merge.” That’s exactly the kind of output you want — not a block, but a clear signal of what needs attention before the change lands.

HACKING.rst as an artifact of the process

One output of the refinement loop worth highlighting is HACKING.rst. This file didn’t exist before this feature. It emerged from the need to document patterns the AI reviewer was correctly flagging as unusual but that were intentional:

  • Plugin manifest has no explicit agents_dir/skills_dir — auto-discovery is the default
  • Ansible variable cross-references in defaults/main.yaml — valid pattern, resolved at play time
  • regex_search with default/fallback guards — intentional safety handling for empty stdout

When your automated reviewer produces false positives, don’t suppress them silently. Document why the pattern is intentional. Future contributors and future reviewers — human or AI — benefit from understanding the reasoning. HACKING.rst is the right place for exactly this kind of project-specific exception documentation.
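A hypothetical sketch of the kind of entry this produces — the wording here is invented, not the actual HACKING.rst text:

```rst
Known reviewer false positives
==============================

Plugin manifest directory fields
--------------------------------
``.claude-plugin/plugin.json`` deliberately omits explicit
``agents_dir``/``skills_dir`` entries: Claude Code auto-discovers the
``agents/`` and ``skills/`` directories in their default locations, so
these fields are only needed for non-standard layouts. Reviewers —
human or AI — should not flag their absence.
```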


From First Plan to Merged Commit: The Evolution

The commit history tells the broader story of how this system developed before the plugin refactor:

ee59ccc Add teim-review plugin, skill, and orchestration agent  ← merged commit
63c018d Enhance code-review-agent with security and maintainability guidance
9b3c1a7 Add two-tier confidence routing and fix agent/schema alignment
0d630d8 Improve review signal quality based on reviewer feedback analysis
e125a1b Add semaphore to limit teim-code-review concurrency to 1
54ac296 teim-code-review: use glm-5-turbo as opus/reviewer model
10ec918 update model tier mappings to GLM 5/4.7/4.7-flash
e84e2be Add test harness with unit tests, molecule tests, and linting
8a1107d Add structured output and JSON validation for reliable review reports

The plugin refactor was the culmination of a longer arc. Structured output and JSON schema validation came first — you can’t reliably parse AI output without that. A proper test harness came after. Model tier mappings were tuned. Review signal quality was improved based on real feedback from human reviewers on actual patches.

That order matters. You can’t productively refactor something you don’t understand, and you shouldn’t refactor something you can’t verify. Each earlier commit was either adding capability needed for the refactor, or building confidence in what the system was actually doing before changing how it was wired together.

The final plugin commit touched 26 files, added 601 lines, and removed 397:

  • New: .claude-plugin/plugin.json and marketplace.json — the distribution layer
  • New: agents/teim-review-agent.md — single orchestration agent replacing multi-step Ansible orchestration
  • New: skills/teim-review/SKILL.md — the /teim-review slash command
  • New: HACKING.rst — documented false-positive patterns
  • Updated: ai_review_setup role — now installs the plugin via claude plugin install rather than manually configuring agent paths
  • Updated: ai_code_review role — now invokes a single teim-review-agent call rather than multiple sequential agent calls
  • Removed: ai_context_extraction role — its functionality absorbed into teim-review-agent
  • Updated: All lightweight agents — hardcoded GLM model names replaced with model: haiku, with code-review-agent moved to model: inherit

Key Takeaways / Lessons Learned

Always start in plan mode — or the equivalent.

The line “lets reason about how to do this interactivly before we proceed” did more work than anything else in that initial prompt. It forced a shared understanding of the architecture before a single file was touched. Whatever tool you’re using, find the plan/spec/architect mode and use it by default for anything non-trivial. The discipline of separating understanding from execution is where most of the value in agentic development comes from.

Ground your prompts in the actual code, not your description of it.

@agents/, @playbooks/, @zuul.d/ — live references, not copy-pasted YAML. The model’s understanding of the codebase should be built from the actual current state of the files, not from your summary of them. Summaries drift. Files don’t. Use @ references to give the agent contextual grounding, and it will plan against reality rather than against your memory of reality.

Structure of intent is what matters, not grammatical structure.

Every message I sent throughout this work was written quickly, with typos, in stream-of-consciousness style. The prompt you read above produced a 26-file implementation plan. The effectiveness came from having a clear structure of intent: what was required, what was optional, what constraints applied, what was out of scope for this session. None of that requires polish. It requires clarity about what you actually want.

Explicitly scope what you’re delegating.

“Ask me clarifying questions on how to proceed with the other findings” is a powerful pattern. It keeps you in the loop for decisions that need your domain knowledge, while letting the model handle the mechanical work. The alternative — letting the model infer all decisions — leads to implementations that are internally consistent but disconnected from your actual constraints.

The AI reviewer is one layer in a quality gate, not the whole gate.

The self-reinforcing loop in this work wasn’t just AI reviewing AI. It was an agent with access to linters, unit tests, and molecule tests that could verify its own work against objective signals. Design your quality gate with that in mind: pre-commit hooks for mechanical correctness, a test suite for behavioural verification, and an AI reviewer for design and convention alignment. They’re complementary. Don’t ask the AI reviewer to do the job of the test suite.

Develop judgment about false positives early.

An AI code reviewer will flag things that look unusual. Some flags are correct. Some are false positives from patterns the model hasn’t seen in the right context. Developing the ability to quickly distinguish between “this is a genuine issue” and “this is expected behaviour that needs documentation” is a skill that improves with practice. When you identify a false positive, document it — don’t just move on.

Right model for the task.

Using haiku for context extraction and inherit for the actual code review is not a premature optimisation. In CI, extraction agents run many times per day across multiple projects. The cost difference between flash-tier and flagship models is an order of magnitude. Design your model assignment strategy deliberately — don’t default everything to the most capable model just because it’s available.

The iterative loop is the development model.

Plan → implement → verify (linters, tests, AI review) → triage findings → plan fixes → implement → repeat. This is not overhead on top of development. This is the development model when working with agentic AI. Each cycle is faster than the last because the codebase improves, the quality gates get better calibrated, and you develop better intuition for what to delegate and what to hold.


References and Further Reading

Claude Code documentation:

Agentic workflows and best practices:

Session artefacts from this work:


If you’re experimenting with agentic AI in your own CI pipelines or want to dig into the teim-review implementation, the style guide repo is public. Feel free to reach out on IRC (sean-k-mooney) or through the usual OpenStack channels.