Summoner: A Plan-Execute-Reflect Agent Framework on GCP
An extensible multi-agent scaffold built on Vertex AI, Cloud Storage, Pub/Sub, and Firestore. The artifact is the orchestration pattern, not the demo task.
The phrase "agentic AI" has gotten so loose in 2025 that I no longer trust most of what people post under it. Half the time it means a chat wrapper that calls a single tool. The other half it means a flowchart in a slide deck. I wanted to build something I could read end to end and point at when somebody asked me what an agent system actually looks like.
Summoner is that artifact. It is an extensible Python framework for building tool-using LLM agents on Google Cloud Platform, structured around a deliberate three-step lifecycle and a small set of GCP tool wrappers. It is not a product. It is a scaffold I can reach for the next time someone says "we need an agent that does X" and I want to skip a week of plumbing.
This page is short because the project is honest about being a scaffold. The lessons are in the structural choices, not in any single shipped task.
The lifecycle: plan, execute, reflect
               +-----------+
     goal ---> |   PLAN    |  Gemini via Vertex AI
               +-----+-----+
                     |
                     v
               +-----------+     Cloud Storage
               |  EXECUTE  | --> Pub/Sub
               +-----+-----+     Firestore
                     |           Vertex AI
                     v
               +-----------+
               |  REFLECT  |  did the goal get met?
               +-----+-----+
                     |
         +-----------+-----------+
         |                       |
      success             retry (up to N)
         |                       |
         v                       v
      result               back to PLAN
Every BaseAgent in Summoner runs the same loop:
- Plan. Given a goal, the agent calls Gemini through Vertex AI to produce a structured plan: a sequence of tool invocations in a format the framework can parse.
- Execute. The agent dispatches each step against the GCP tool layer (Cloud Storage, Pub/Sub, Firestore, Vertex AI), capturing inputs, outputs, and exceptions.
- Reflect. The agent calls the LLM again with the execution trace and the original goal, asking whether the goal was met. If not, retry with an adjusted plan, up to a configured retry budget.
The loop is wrapped in tenacity retry decorators with exponential backoff (5 attempts, exponential from 1 to 30 seconds, retrying on a small set of network and API errors and not on logic failures). The agent's working memory is a simple dict, optionally persisted to Firestore so a long-running task survives a restart. The persistence shape is deliberately boring: a single document keyed by run ID, with the plan, the trace, and the verdict as nested fields. I considered modelling steps as their own documents and decided against it; reading and writing one document is cheaper, and the queries that would justify a richer schema do not exist yet.
class BaseAgent:
    def __init__(self, name, tools, memory, llm_client, max_retries=3):
        self.name = name
        self.tools = tools
        self.memory = memory
        self.llm = llm_client
        self.max_retries = max_retries

    def run(self, goal):
        verdict = None
        for attempt in range(self.max_retries):
            plan = self.plan(goal)               # LLM call: structured plan of tool invocations
            trace = self.execute(plan)           # dispatch against the GCP tool layer
            verdict = self.reflect(goal, trace)  # LLM call: did the trace satisfy the goal?
            if verdict.success:
                return verdict.result
        return verdict                           # retry budget exhausted; surface the last verdict
The point of writing this loop down in code, instead of describing it in prose, is that "agentic" stops being a vibe and starts being a sequence of method calls a unit test can pin down.
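The operational details around that loop are equally small. A minimal sketch of the retry policy and the one-document-per-run persistence shape, assuming tenacity and google-cloud-firestore; the settings mirror the ones described above, but names like call_llm, save_run, and agent_runs are illustrative, not the repo's actual identifiers:

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
from google.api_core import exceptions as gcp_exceptions
from google.cloud import firestore

# Retry transient network/API failures only; logic failures propagate immediately.
_TRANSIENT = (gcp_exceptions.ServiceUnavailable, gcp_exceptions.TooManyRequests, ConnectionError)

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=1, max=30),
       retry=retry_if_exception_type(_TRANSIENT))
def call_llm(client, prompt):
    # client.generate is a stand-in for however the Vertex AI wrapper is invoked.
    return client.generate(prompt)

def save_run(db: firestore.Client, run_id: str, plan, trace, verdict):
    # One document per run, keyed by run ID: plan, trace, and verdict as nested fields.
    db.collection("agent_runs").document(run_id).set(
        {"plan": plan, "trace": trace, "verdict": verdict}, merge=True
    )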
Why the loop is the load-bearing decision
Most "agent frameworks" you find in the wild are either (a) a single LLM call wrapped in a class, or (b) a graph of LLM calls with no separation between deciding what to do and verifying that it got done. The plan-execute-reflect split is older than the current LLM hype cycle. It comes out of the autonomous-systems literature (Russell and Norvig's textbook spends chapters on it) and it survives the LLM transition because the failure modes it addresses are the same.
The failure modes:
- The LLM hallucinates a plan that looks plausible and does not work. Without an execute step that captures real tool outputs, the agent never finds out.
- The tool execution succeeds technically but does not satisfy the goal. Without a reflect step that compares output against intent, the agent reports success when it has not done the job.
- A transient failure (rate limit, network blip) kills a multi-step task. Without retry-with-replanning at the lifecycle level, you have to encode resilience inside every tool call.
Splitting the lifecycle into three named phases is what lets you write tests that target each one. The test suite mocks the LLM client, mocks the GCP tool layer, and verifies that the loop transitions correctly under success, partial failure, and total failure. The framework is testable in a way an undifferentiated "agent that does stuff" is not.
Concretely, the tests I wrote for the loop included: a planning step that returns a malformed plan (the agent should fail fast with a parsing error rather than execute partial garbage), an execute step where one tool call out of three raises (the agent should record the failure in the trace and let the reflect step decide whether to retry), and a reflect step where the LLM declares success when the trace clearly shows a tool failure (the test pins the current behaviour: the agent trusts reflect but logs the contradiction so a human can audit later). Each test is short because the lifecycle is named. Without that decomposition I would have ended up with a single integration test asserting "good thing eventually happens", which is the signature of a system you cannot debug.
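A minimal sketch of the partial-failure case, written with unittest.mock against the BaseAgent loop above; the mocked return shapes and verdict objects are illustrative, not the repo's actual fixtures:

from unittest.mock import MagicMock

def test_loop_replans_after_failed_reflection():
    # First cycle: execute records an error, reflect says the goal was not met.
    # Second cycle: the adjusted plan succeeds. The loop should run exactly two
    # cycles and return the successful verdict's result.
    agent = BaseAgent(name="t", tools={}, memory={}, llm_client=MagicMock(), max_retries=3)
    agent.plan = MagicMock(side_effect=[["step-a"], ["step-a-adjusted"]])
    agent.execute = MagicMock(side_effect=[{"error": "boom"}, {"ok": True}])
    agent.reflect = MagicMock(side_effect=[MagicMock(success=False),
                                           MagicMock(success=True, result="done")])

    assert agent.run("move the file and notify downstream") == "done"
    assert agent.plan.call_count == 2
    assert agent.execute.call_count == 2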
The GCP tool layer
The four wrappers that ship with Summoner cover the verbs an agent actually needs on GCP:
- Cloud Storage. Read, write, list, signed URLs. Most agent tasks that involve files involve GCS.
- Pub/Sub. Publish a message, optionally with attributes. Agents that fire off downstream work go through Pub/Sub.
- Firestore. Read and write documents. The persistent-memory backend uses this, and so do agents whose tasks involve structured state.
- Vertex AI. Generate text or embeddings. This is how agents call the LLM beyond the planning step.
Each wrapper has a consistent interface (a call() method that takes typed kwargs and returns a typed response) and a corresponding mock for tests. The point of standardizing the surface is that adding a new tool, say BigQuery or Cloud Run jobs, is a copy-paste plus an interface implementation rather than a rethinking of how tools attach to agents.
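The surface looks roughly like this; a sketch under the conventions just described, with ToolResponse and PubSubTool as illustrative names rather than the repo's exact classes:

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ToolResponse:
    ok: bool
    data: Any = None
    error: str | None = None

class Tool(Protocol):
    name: str
    def call(self, **kwargs: Any) -> ToolResponse: ...

class PubSubTool:
    name = "pubsub"

    def __init__(self, publisher, topic_path: str):
        self._publisher = publisher   # google.cloud.pubsub_v1.PublisherClient
        self._topic = topic_path

    def call(self, *, message: bytes, **attributes: str) -> ToolResponse:
        try:
            future = self._publisher.publish(self._topic, message, **attributes)
            return ToolResponse(ok=True, data=future.result(timeout=30))
        except Exception as exc:      # captured for the execution trace rather than swallowed
            return ToolResponse(ok=False, error=str(exc))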
The deliberate omission, for now, is anything that touches authentication beyond the default GCP credential chain. Agents inherit whatever the host process has access to. Locking down per-agent IAM is the next step and is not in this cut.
Specialised agents on top
Two example agents ship as proofs of concept:
- ResearchAgent uses Vertex AI as its primary tool, generating answers with a research-style system prompt and storing intermediate findings in Firestore.
- DataProcessingAgent orchestrates a small ETL pattern: read CSVs from GCS, transform with pandas, write the result back to GCS, optionally publish a Pub/Sub notification.
Both inherit from BaseAgent, override the tools and memory they need, and otherwise lean entirely on the framework. That is the test of whether the abstraction works: could I write a third agent in 50 lines because the lifecycle and tool layer are doing the work? Judging by the two existing agents, yes.
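The shape of such a third agent, as a hypothetical sketch (ReportAgent and the planning_prompt hook are illustrative; neither is in the repo):

class ReportAgent(BaseAgent):
    """Hypothetical third agent: summarise CSVs in GCS and notify downstream."""

    def __init__(self, llm_client, gcs_tool, pubsub_tool):
        super().__init__(
            name="report",
            tools={"gcs": gcs_tool, "pubsub": pubsub_tool},
            memory={},                 # plain dict; swap in Firestore-backed memory for long runs
            llm_client=llm_client,
        )

    def planning_prompt(self, goal):
        # The only agent-specific code is the prompt framing; planning, execution,
        # reflection, and retries all come from BaseAgent unchanged.
        return f"You produce weekly data reports from GCS files. Plan tool calls to achieve: {goal}"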
What I did not do, and why
A few deliberate omissions worth naming.
No agent-to-agent communication protocol. Multi-agent systems need a coordination layer (auction, contract net, blackboard, message bus). Summoner has Pub/Sub available as a tool, but I have not built the abstraction layer that would make agent-to-agent handoffs first-class. That is the next major addition.
No cost or latency budget. Every plan-execute-reflect cycle is at least three LLM calls. That is expensive at production scale and slow at interactive scale. A real production framework needs token budgeting, plan caching, and fallback to deterministic execution when the LLM call is unnecessary. Summoner does not yet.
No structured plan validation. The plan returned by the planning step is parsed liberally. A serious framework would validate it against a schema before executing, both for safety (rejecting plans that try to do things outside the agent's tool set) and for correctness (catching plans the LLM hallucinated tool names into). This is one of the next items.
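For a sense of scale, the missing validation is not much code. A sketch of what it might look like with pydantic; none of this exists in the repo today:

from pydantic import BaseModel

class PlanStep(BaseModel):
    tool: str
    args: dict

class Plan(BaseModel):
    steps: list[PlanStep]

def validate_plan(raw: dict, allowed_tools: set[str]) -> Plan:
    plan = Plan.model_validate(raw)          # rejects structurally malformed plans
    for step in plan.steps:
        if step.tool not in allowed_tools:   # rejects hallucinated or out-of-scope tool names
            raise ValueError(f"plan references unknown tool: {step.tool}")
    return plan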
These omissions are documented in the README's roadmap section. The framework is honest about what is built and what is sketched.
What it taught me
Two things that travel beyond this project.
Agent frameworks are mostly plumbing. The "intelligence" lives in the LLM and the prompt. The framework's job is to make the LLM's calls observable, retryable, testable, and bounded. Every hour I spent on this was an hour of plumbing, and the plumbing is the part that earns trust. The "what model do you use" question is much less interesting than "what happens when the model returns garbage."
The plan-execute-reflect split survives LLMs. I started this project skeptical that classical autonomous-systems patterns would matter for LLM-driven agents. I am now convinced of the opposite. The patterns are older than this generation of language models because they address failure modes that are intrinsic to autonomous decision-making, not specific to any one decision-making algorithm. LLMs make the planning step cheaper and more flexible. They do not change why the lifecycle has three phases.
That insight ports back to my work at metricHEALTH the moment any healthcare workflow grows more than two steps. A patient-enrollment job that calls an EHR, writes to a state store, and then notifies a downstream system is structurally an agent task, even if the "intelligence" is a conventional service. Naming the plan, the execute, and the reflect phases for those workflows (instead of letting them collapse into one large try/except) is the same discipline applied at a smaller scale.
The repo is at https://github.com/TirtheshJani/GCP-Agents---Summoner. It is a scaffold to read, fork, and build on, not a finished product to deploy. Read it as the bones of the kind of system I would build at scale, with the structural choices made deliberately and the missing pieces named honestly.