Most business software chat features follow the same pattern: collect context, paste it into a prompt, run the model, and hope the answer stays grounded.
That can work for a demo. It becomes risky when the context is a live, multi-million real estate deal.
Every example in this post uses one public demo deal: Moosstrasse 4a, 6a and 8a in Abtwil (SG), three multi-family buildings with 46 apartments, offered as a package for an indicative CHF 19'000'000 at a 4.08% gross yield. The exposé is 48 pages: per-unit rent rolls, land-register servitudes, floor plans, building descriptions, three oil heating systems from 1989, 1993 and 2005, and one detail that will matter later: building 6a has no elevator.
Source: Baubeschrieb, page 9. Highlight: H6a has no passenger lift.
In AssetOS, a deal like Abtwil isn't a single document. It's a complex, living graph of assets, parcels, owners, offers, broker fields, building blueprints, emails, internal notes, and due-diligence findings. Some of this data is neatly structured; some is chaotic text. Some can be edited; some is set in stone.
The first version of our agent treated the deal as one large prompt payload. The current version treats it as a scoped application surface. That shift moved the agent from summarizing deal data to taking controlled actions inside AssetOS.
What's in this post
This post is about one design decision: a trustworthy deal agent cannot be a bigger prompt. It has to be a scoped application surface, with tools for reading, searching, and writing.
We will use three concrete failures to show the difference:
- The wrong yield: a model infers from similar-looking financial numbers instead of using code-computed KPIs.
- The wrong elevator answer: a document inventory tells the agent that a file exists, but not what page 9 says.
- The fake task update: a model says "done" even though no platform command ran.
The architecture that follows, from read tools to response contracts, exists to prevent those three failures from reaching a transaction team.
01 Why the giant snapshot failed
When we started, the temptation to load everything was real:
// The "Easy" Way (That Fails)
const dealContext = await loadEverythingUseful(dealId)
const answer = await model.generateText({
  system: 'Answer only from this deal context.',
  prompt: JSON.stringify(dealContext),
})
This failed in three predictable ways.
It turned every question into an all-data question. If a user asks a quick KPI question about Abtwil, the model doesn't need to read the last eight broker emails or analyze the geometry of parcel no. 1496.
It caused severe information pollution. Real estate deals contain many similar metrics. The Abtwil exposé alone includes an asking price, three insurance values, three per-building net rents, gross rents, and per-square-meter figures. Asking a model to infer "yield" from all of them is unreliable.
The exposé makes this concrete. Take six numbers that all appear in it: three that could pass for a "price" and three that could pass for a "rent". Dividing one by the other to get a gross yield gives nine possible combinations, and only one of them is right. An unguided model sees all six numbers in one prompt and has no reliable reason to prefer the correct pair. calculateDealKpis sees exactly two and returns 4.08% every time.
Business logic belongs in the system, not in the LLM. We'll come back to this in section 07.
Snapshots cannot change things. "Add this to my internal notes" is not a text-generation problem. It is an application command, and a JSON blob has no way to express one.
To fix all three, we split the agent's capabilities into three distinct surfaces:
- Read tools for bounded deal facts.
- Search tools for document and market evidence.
- Write tools for explicit platform mutations.
The rest of this post shows how each surface works and how we guard it.
02 The safety perimeter
Before the AI runtime starts, our API endpoint validates the request, authenticates the user, checks organization access, verifies the chat thread, and loads compacted memory. Only then do we start the agent.
In simplified form, it looks like this:
const thread = await dealChatCommands.ensureActiveThread({
  organizationId,
  dealId,
  userId,
  threadId,
})
const context = await dealChatQueries.getThreadContext({
  organizationId,
  dealId,
  threadId: thread.data.threadId,
  userId,
  historyLimit: MAX_HISTORY_MESSAGES,
})
const runResult = await runAssistant({
  scope: { mode: 'deal', organizationId, dealId },
  userId,
  prompt,
  messages: context.data.messages,
  threadMemory: context.data.memory,
  attachments,
  formattingPreferences,
  timeZone,
})
One architectural detail matters here:
The agent's context comes straight from the database, never from the browser's optimistic chat UI. The frontend does not decide what the model knows.
If the model is about to claim "I created that task", the runtime can check the database and see whether the task actually exists. A context assembled from the browser's local state cannot provide that guarantee. The UI would confirm its own optimistic state.
03 Small read tools
Instead of handing the model a large data dump, we gave it a specific toolset. Each tool acts as a runtime boundary that pulls one slice of data, returns clear citations, and leaves an audit trail.
[Interactive demo "Route the question": for each question about the Abtwil deal, it shows which read tools the planner loads and which slices of the deal stay untouched.]
A KPI question never touches the email tool. An email question never loads parcel geometry. Scope is part of the product behavior.
Each tool returns:
- context sections for the model;
- response sources for citations;
- count metadata for progress UI and traces;
- optional grounding documents for later document search;
- optional shared data that another tool can reuse.
Crucially, these tools can cooperate. If load_deal_kpis needs core asset data that load_deal_overview already fetched during the same run, it reuses that data out of shared state rather than hitting the database a second time.
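To make the return shape and the reuse concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ReadToolResult` field names, the `RunSharedState` class, the `'O'` source tag, and the synchronous `fetchCoreAssets` stand-in for a database query. It is not the AssetOS schema.

```typescript
// Hypothetical shapes: field names are illustrative, not the AssetOS schema.
interface ReadToolResult {
  contextSections: string[] // text the model reads
  sources: { tag: string; label: string }[] // citations for the response
  counts: Record<string, number> // metadata for progress UI and traces
  groundingDocumentIds?: string[] // optional input for later document search
  shared?: Record<string, unknown> // optional data another tool can reuse
}

interface CoreAsset {
  id: string
}

// Run-scoped state: created per agent run, discarded afterwards.
class RunSharedState {
  private cache = new Map<string, unknown>()
  loadCount = 0

  getOrLoad<T>(key: string, load: () => T): T {
    if (!this.cache.has(key)) {
      this.loadCount += 1
      this.cache.set(key, load())
    }
    return this.cache.get(key) as T
  }
}

// Stand-in for a database query (synchronous here to keep the sketch short).
const fetchCoreAssets = (): CoreAsset[] => [{ id: '4a' }, { id: '6a' }, { id: '8a' }]

const loadDealOverview = (state: RunSharedState): ReadToolResult => {
  const assets = state.getOrLoad('coreAssets', fetchCoreAssets)
  return {
    contextSections: [`The deal has ${assets.length} buildings.`],
    sources: [{ tag: 'O', label: 'Deal overview' }],
    counts: { assets: assets.length },
    shared: { coreAssets: assets },
  }
}

const loadDealKpis = (state: RunSharedState): ReadToolResult => {
  // Reuses the assets the overview tool already loaded during this run.
  const assets = state.getOrLoad('coreAssets', fetchCoreAssets)
  return {
    contextSections: [`KPIs computed over ${assets.length} buildings.`],
    sources: [{ tag: 'K', label: 'Computed KPIs' }],
    counts: { assets: assets.length },
  }
}

const state = new RunSharedState()
const overview = loadDealOverview(state)
const kpis = loadDealKpis(state)
console.log(state.loadCount) // 1: the second tool hit the cache
```

Keeping the cache on a per-run object rather than in module scope means nothing leaks between runs or between deals.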
04 Token discipline
Large context windows do not remove the need for selection. They change the failure mode. A model may be able to accept more tokens, but it still has to attend to the right evidence, choose the right tool, and ignore stale or irrelevant facts. [1][2]
Prompt budget is therefore a reliability budget before it is a cost budget. A large prompt can make the answer slower and more expensive. It can also bury the relevant sentence inside noisy context and push the model toward the wrong source. [3][4]
[Interactive demo "The prompt budget": one noisy source can consume the budget and bury the evidence the question needs.]
That is why our context builders cap high-cardinality data. The cap is not a blind cutoff. It is a retrieval contract: load a ranked first slice, mark it as partial, and make the next retrieval path explicit. [5][6]
For example, if a deal has 73 recent emails, load_deal_emails does not paste all 73 into the prompt. It loads a ranked slice, marks it as partial, and exposes follow-up paths for deeper retrieval. If the user asks a broad deal-status question, the top emails may be enough. If the user asks about a specific tenant, broker, topic, or date range, the agent can call a narrower search tool and retrieve the relevant emails from outside the initial slice. [7]
The important rule is simple: capped context must never pretend to be complete.
A partial slice should tell the model what it loaded, what it did not load, and which tool to call if the answer depends on the missing tail.
That is the difference between token discipline and data loss. The goal is not to hide emails 9-72. The goal is to avoid loading them by default, then retrieve them deliberately when the question requires it.
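The retrieval contract described above could be sketched roughly as follows. The `PartialSlice` shape, the cap of 8, the relevance scores, and the follow-up tool name `search_deal_emails` are all assumptions made for this example, not values taken from AssetOS.

```typescript
interface RankedEmail {
  id: string
  score: number // hypothetical relevance score, higher is better
}

interface PartialSlice<T> {
  items: T[]
  totalCount: number
  omittedCount: number
  partial: boolean
  followUpTool: string | null // which tool to call for the missing tail
}

function loadRankedSlice<T extends { score: number }>(
  all: T[],
  cap: number,
  followUpTool: string
): PartialSlice<T> {
  // Rank a copy so the caller's array is left untouched.
  const ranked = [...all].sort((a, b) => b.score - a.score)
  const items = ranked.slice(0, cap)
  const omittedCount = all.length - items.length
  return {
    items,
    totalCount: all.length,
    omittedCount,
    partial: omittedCount > 0,
    followUpTool: omittedCount > 0 ? followUpTool : null,
  }
}

// The note that goes into the prompt: what was loaded, what was not, what to call next.
function describeSlice(slice: PartialSlice<unknown>, noun: string): string {
  if (!slice.partial) return `Loaded all ${slice.totalCount} ${noun}.`
  return (
    `Loaded ${slice.items.length} of ${slice.totalCount} ${noun} (partial). ` +
    `${slice.omittedCount} not loaded; call ${slice.followUpTool} if the answer depends on them.`
  )
}

// 73 emails with made-up scores; email-73 ranks highest.
const emails: RankedEmail[] = Array.from({ length: 73 }, (_, i) => ({
  id: `email-${i + 1}`,
  score: i,
}))

const emailSlice = loadRankedSlice(emails, 8, 'search_deal_emails')
console.log(describeSlice(emailSlice, 'emails'))
```

The point of the explicit `partial` flag and `followUpTool` field is that the model never has to guess whether it saw everything.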
05 The runtime: a flexible loop with a hard fallback
The agent has tools and a budget. The runtime still needs to ensure those tools are used when required. It does this through a two-pass system: an agent-first tool loop and a deterministic fallback path.
The primary path (AI SDK tool loop): The model evaluates the user's prompt, checks its tool catalog, and calls tools. It can take up to 9 steps, but it must invoke a strict deal_response_contract exactly once before delivering the final answer.
The guarded fallback: If the model ignores a required tool or produces a false capability denial, for example "I don't have access to tasks" when it does, the runtime catches it. We trigger an Action Planner and a Source Planner to load the missing data through a deterministic waterfall.
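As a rough illustration of when the fallback might trigger, consider the sketch below. The `RunOutcome` shape and the regular expression are assumptions; the text above does not say how the runtime actually detects a capability denial, so the pattern here is only a crude stand-in.

```typescript
interface RunOutcome {
  requiredTools: string[] // tools the planner considers mandatory for this question
  calledTools: string[] // tools the model actually invoked
  answerText: string
}

type FallbackReason = 'missing_required_tool' | 'false_capability_denial' | null

// Crude stand-in for a real denial classifier (assumption).
const DENIAL_PATTERN = /\bI (?:don't|do not) have access\b/i

function fallbackReason(outcome: RunOutcome): FallbackReason {
  const missing = outcome.requiredTools.filter(
    (tool) => !outcome.calledTools.includes(tool)
  )
  if (missing.length > 0) return 'missing_required_tool'

  // All required tools ran, so a denial contradicts what was loaded.
  if (outcome.requiredTools.length > 0 && DENIAL_PATTERN.test(outcome.answerText)) {
    return 'false_capability_denial'
  }
  return null
}

const skippedTool = fallbackReason({
  requiredTools: ['load_deal_work_items'],
  calledTools: [],
  answerText: 'There are no open tasks.',
})

const falseDenial = fallbackReason({
  requiredTools: ['load_deal_work_items'],
  calledTools: ['load_deal_work_items'],
  answerText: "I don't have access to tasks.",
})

console.log(skippedTool, falseDenial)
```

Either non-null reason would hand the run over to the deterministic path instead of shipping the model's first answer.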
The deal_response_contract gives the runtime structural integrity:
{
  requestType: 'read_only' | 'platform_write' | 'clarification' | 'other',
  platformMutationRequested: boolean,
  platformMutationApplied: boolean,
  confidence: 'low' | 'medium' | 'high'
}
Why does a four-field object matter?
Without this contract, a model can output a convincing sentence like "I've updated that task for you!" even if it did not call the underlying write tool. The contract gives our runtime a structured object to inspect and validate instead of a sentence it has to trust.
Here's what that looks like from the runtime's perspective. Two runs can end with the exact same chat message while the runtime sees different things. This is the verified run:
Write-tool log
  update_work_item({ dueDate: '2026-06-19' }) → ok
Response contract
  {
    requestType: 'platform_write',
    platformMutationRequested: true,
    platformMutationApplied: true,
    confidence: 'high'
  }
Runtime verdict
  Verified: Contract matches the tool log. Response ships to the user.
The user cannot tell these two runs apart. The runtime can, because the contract is a structured object it can inspect, not a sentence it has to trust.
The user sees a sentence. The runtime sees a claim it can verify.
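A minimal sketch of that verification step, assuming a simple write-tool log, might look like this. The `WriteToolLogEntry` shape, the `Verdict` type, and the exact checks are illustrative assumptions; only the four contract fields come from the post.

```typescript
interface DealResponseContract {
  requestType: 'read_only' | 'platform_write' | 'clarification' | 'other'
  platformMutationRequested: boolean
  platformMutationApplied: boolean
  confidence: 'low' | 'medium' | 'high'
}

// Hypothetical log entry recorded by the runtime for every write-tool call.
interface WriteToolLogEntry {
  tool: string
  ok: boolean
}

type Verdict = { verified: true } | { verified: false; reason: string }

function verifyContract(
  contract: DealResponseContract,
  writeLog: WriteToolLogEntry[]
): Verdict {
  const writeSucceeded = writeLog.some((entry) => entry.ok)

  if (contract.platformMutationApplied && !writeSucceeded) {
    return { verified: false, reason: 'Contract claims a mutation, but no write tool succeeded.' }
  }
  if (!contract.platformMutationApplied && writeSucceeded) {
    return { verified: false, reason: 'A write tool succeeded, but the contract does not report it.' }
  }
  if (contract.platformMutationApplied && contract.requestType !== 'platform_write') {
    return { verified: false, reason: 'Mutation applied under a non-write request type.' }
  }
  return { verified: true }
}

const claimedWrite: DealResponseContract = {
  requestType: 'platform_write',
  platformMutationRequested: true,
  platformMutationApplied: true,
  confidence: 'high',
}

// Same contract, two different tool logs.
const verifiedRun = verifyContract(claimedWrite, [{ tool: 'update_work_item', ok: true }])
const fakeRun = verifyContract(claimedWrite, [])

console.log(verifiedRun.verified, fakeRun.verified) // true false
```

The chat message never enters the check. The verdict depends only on the contract and on what the runtime itself logged.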
To catch regional nuances, our source planner also includes a lexical backstop. If a user asks a question in German using terms like Kaufpreisfaktor (multiplier) or Bruttoanfangsrendite (gross initial yield), the system does not wait for the model to reason from scratch. It routes the question to the correct read tools.
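One way to picture such a backstop is a small table of term patterns checked before the model plans anything. The table layout and the mapping of both terms to `load_deal_kpis` are assumptions for this sketch; the post only says the question is routed to the correct read tools.

```typescript
interface LexicalRoute {
  pattern: RegExp
  tools: string[]
}

// Assumed mapping: both German KPI terms route to the KPI read tool.
const LEXICAL_ROUTES: LexicalRoute[] = [
  { pattern: /kaufpreisfaktor/i, tools: ['load_deal_kpis'] },
  { pattern: /bruttoanfangsrendite/i, tools: ['load_deal_kpis'] },
]

function lexicalBackstop(prompt: string): string[] {
  const tools = new Set<string>()
  for (const route of LEXICAL_ROUTES) {
    if (route.pattern.test(prompt)) {
      route.tools.forEach((tool) => tools.add(tool))
    }
  }
  // Deduplicated, in first-match order.
  return Array.from(tools)
}

console.log(lexicalBackstop('Wie hoch ist die Bruttoanfangsrendite?')) // ['load_deal_kpis']
console.log(lexicalBackstop('Who is the broker?')) // []
```

An empty result means the backstop has no opinion and planning continues as usual.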
With the architecture in place, we can look at three areas where it matters.
06 Documents need a second layer
The document-list tool alone cannot answer document-content questions. The Abtwil exposé shows why. Take the elevator question, answered from the document inventory alone:
- The model only sees the document inventory: a file named “Baubeschrieb.pdf” and an overview summary mentioning lifts.
- It generalizes from 4a and 8a to the whole package. The actual page was never read.
Issue: Wrong. 6a has no lift. A casual tenant question just became bad deal advice.
The exposé states it in one line on page 9: "H6a verfügt über keinen Personenlift." A document inventory knows the file exists. Only the second layer reads the line.
load_deal_documents gives the agent the document inventory and builds searchable document metadata. The search text includes the document name and parent directory names, which helps when the filename is generic but the folder is meaningful.
When the user asks a document-content question, search_deal_documents does more work:
- Load the document inventory if it has not already been loaded.
- Build grounding chunks for the current question.
- Try direct document access for documents the prompt names explicitly.
- Run a document-question specialist when direct document reads are available.
- Merge excerpt sources and direct-document sources.
- Remap document citations so the same document keeps a stable [D#] tag.
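The last two steps, merging sources and keeping tags stable, could be sketched like this. The `DocumentTagRegistry` class, the `DocSource` shape, and the document ids are hypothetical names introduced for the example.

```typescript
interface DocSource {
  documentId: string
  excerpt: string
}

interface TaggedSource extends DocSource {
  tag: string
}

// Assigns each document one tag on first sight and returns the same tag afterwards.
class DocumentTagRegistry {
  private tags = new Map<string, string>()

  tagFor(documentId: string): string {
    let tag = this.tags.get(documentId)
    if (tag === undefined) {
      tag = `D${this.tags.size + 1}`
      this.tags.set(documentId, tag)
    }
    return tag
  }
}

function mergeSources(
  registry: DocumentTagRegistry,
  excerptSources: DocSource[],
  directSources: DocSource[]
): TaggedSource[] {
  return [...excerptSources, ...directSources].map((source) => ({
    ...source,
    tag: registry.tagFor(source.documentId),
  }))
}

const registry = new DocumentTagRegistry()
const merged = mergeSources(
  registry,
  [
    { documentId: 'baubeschrieb', excerpt: '(excerpt found by search)' },
    { documentId: 'rent-roll', excerpt: '(excerpt found by search)' },
  ],
  [{ documentId: 'baubeschrieb', excerpt: 'H6a verfügt über keinen Personenlift.' }]
)

console.log(merged.map((source) => `[${source.tag}] ${source.documentId}`))
```

Because the registry outlives a single merge, a later search in the same run would still cite the Baubeschrieb as [D1].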
There is one more guard: if the general answer and the direct document answer disagree, a small selector chooses which answer should win. It prefers the direct document answer when the general answer refuses, omits, or contradicts the document evidence.
A model can read a direct lease excerpt and still answer from a weaker overview summary. The selector lets document evidence override the generic answer.
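A sketch of that selection rule is below. It assumes the general answer has already been classified against the document evidence; how that classification happens is outside the sketch. The type names and the sample answer texts are illustrative.

```typescript
// Assumed classification of the general answer relative to the document evidence.
type GeneralAnswerStatus = 'answers' | 'refuses' | 'omits' | 'contradicts'

interface GeneralAnswer {
  text: string
  status: GeneralAnswerStatus
}

interface DirectDocumentAnswer {
  text: string
  citation: string
}

interface SelectedAnswer {
  winner: 'general' | 'direct'
  text: string
}

function selectAnswer(
  general: GeneralAnswer,
  direct: DirectDocumentAnswer | null
): SelectedAnswer {
  // Without a direct document read there is nothing to override with.
  if (direct === null) return { winner: 'general', text: general.text }

  // Prefer document evidence when the general answer refuses, omits, or contradicts it.
  if (general.status !== 'answers') {
    return { winner: 'direct', text: `${direct.text} ${direct.citation}` }
  }
  return { winner: 'general', text: general.text }
}

const elevator = selectAnswer(
  { text: 'All three buildings have a lift.', status: 'contradicts' },
  { text: 'Building 6a has no passenger lift (Baubeschrieb, page 9).', citation: '[D1]' }
)

console.log(elevator.winner) // direct
```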
07 KPIs: computed in code, not inferred by AI
Real estate KPIs are exactly where LLMs should not improvise. The same nine-combination trap from section 01 can also appear in chat. Here is what happens when the model answers a yield question from one giant prompt:
- The prompt contains the asking price, three insurance values, three net rents and per-m² figures. They are all similar-sounding.
- The model picked the sum of insurance values as “the value”. The arithmetic is correct; the inputs are wrong.
Issue: 6.4% would make Abtwil look like a steal. The exposé says 4.08%.
One wrong denominator and a 4.08% deal looks like a 6.4% deal. That is not a rounding error. It is a different investment decision.
The KPI tool calls calculateDealKpis over the structured deal, assets, parcels, and formatting preferences. It returns a [K] source with the computed KPI payload and the asset rent snapshot used for calculation.
The agent prompt is explicit: for KPI questions, call load_deal_kpis.
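For the gross yield itself, the computation is a single division once the inputs are fixed. The sketch below is not `calculateDealKpis`; `calculateGrossYield`, its input shape, and `formatPercent` are hypothetical names. The two figures are the ones quoted for Abtwil in this post.

```typescript
interface GrossYieldInput {
  netAnnualRentChf: number
  indicativePriceChf: number
}

// Gross yield as a ratio: annual rent divided by price.
function calculateGrossYield(input: GrossYieldInput): number {
  if (!(input.indicativePriceChf > 0)) {
    throw new RangeError('indicativePriceChf must be a positive number')
  }
  return input.netAnnualRentChf / input.indicativePriceChf
}

const formatPercent = (ratio: number): string => `${(ratio * 100).toFixed(2)}%`

const abtwilYield = calculateGrossYield({
  netAnnualRentChf: 774780,
  indicativePriceChf: 19000000,
})

console.log(formatPercent(abtwilYield)) // "4.08%"
```

The function accepts exactly two named inputs, so there is no way for an insurance value or a per-square-meter figure to slip into the denominator.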
The runtime also checks some KPI answers after generation. If the KPI tool was loaded and the answer cites [K], an assessment classifies whether the answer is informative, a source absence, a capability denial, or something else. If the model says "I cannot calculate that" even though the KPI tool was loaded, the runtime can continue instead of accepting the answer.
The same pattern exists for work items. If load_deal_work_items returned due dates or tasks and the model still claims it cannot access them, the runtime can produce a direct fallback answer from the loaded sources.
This is not the visible part of the product. It is runtime logic that prevents bad refusals from reaching the user.
08 Mutations go through strict commands
Read tools can run in parallel. Write tools are more sensitive. The agent supports direct platform actions such as creating tasks, editing due dates, and writing notes, but those actions are tightly restricted. The Abtwil exposé gives us a concrete test case: the oil heating in 8a dates from 1989, which is the kind of item a buyer might ask the broker to clarify. Suppose the user asks the agent to create that task, due next Friday, and an unguarded model simply replies that it is done:
- No write tool was called. The sentence is pure text generation.
- Ten minutes later the user types “any updates?”. The model re-reads the chat history, sees the old request, and creates the task twice.
- “Next Friday” was never resolved. Whichever date lands in the database is a guess.
Issue: A confident sentence with no guarantees and a replay risk.
A failed read wastes a question. A failed write can corrupt the deal. They need different guardrails.
Three rules govern every write:
- No replays: The Action Planner only looks at the current message, ignoring chat history for writes. This ensures an old "create a task" request from ten minutes ago isn't accidentally re-executed when a user types "any updates?".
- Date resolution: Standard phrases like "next Friday" are automatically resolved into hard ISO 8601 timestamps using the user's local timezone before touching the database.
- Strict serialization: Read operations are fast and parallel; writes are queued sequentially to guarantee absolute database order:
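// All writes chain onto this promise, so at most one runs at a time.
let writeToolQueue: Promise<void> = Promise.resolve()
const runSerializedWriteTool = async <T>(operation: () => Promise<T>): Promise<T> => {
  // Start only after the previous write has settled.
  const resultPromise = writeToolQueue.then(operation, operation)
  // The queue itself never rejects, so one failed write does not block the next.
  writeToolQueue = resultPromise.then(
    () => undefined,
    () => undefined
  )
  return resultPromise
}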
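The date-resolution rule can be sketched in two steps: find today's calendar date in the user's time zone, then do the weekday arithmetic on that date. Both function names are hypothetical. Reading "next Friday" as the first Friday strictly after today is an assumption, as is the reference instant, which was chosen so the result matches the 19 June 2026 due date used elsewhere in this post. The sketch returns an ISO 8601 calendar date rather than a full timestamp.

```typescript
// Calendar date (YYYY-MM-DD) of an instant as seen in an IANA time zone.
function localCalendarDate(instant: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
  }).formatToParts(instant)
  const get = (type: string): string =>
    parts.find((part) => part.type === type)?.value ?? ''
  return `${get('year')}-${get('month')}-${get('day')}`
}

// First Friday strictly after the given calendar date (assumed meaning of "next Friday").
function resolveNextFriday(todayIso: string): string {
  const [year, month, day] = todayIso.split('-').map(Number)
  // Pure calendar arithmetic in UTC, so daylight saving shifts cannot move the date.
  const date = new Date(Date.UTC(year, month - 1, day))
  const FRIDAY = 5
  const daysAhead = ((FRIDAY - date.getUTCDay() + 6) % 7) + 1 // always 1..7
  date.setUTCDate(date.getUTCDate() + daysAhead)
  return date.toISOString().slice(0, 10)
}

// 22:30 UTC on 14 June is already 15 June in Zurich (UTC+2 in summer).
const todayInZurich = localCalendarDate(new Date('2026-06-14T22:30:00Z'), 'Europe/Zurich')
const dueDate = resolveNextFriday(todayInZurich)

console.log(todayInZurich, dueDate) // 2026-06-15 2026-06-19
```

Resolving the phrase before the write means the database only ever receives a concrete date, never the words "next Friday".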
After the write, the response contract from section 05 closes the loop: platformMutationApplied must match what actually happened.
09 The anatomy of a trustworthy answer
When the model finishes its work, our finalizer packages the result. It merges citations from all tools, scores grounding confidence, and builds a telemetry object.
The visible answer is intentionally compact:
Gross yield: 4.08%: CHF 774'780 net annual rent ÷ CHF 19'000'000 indicative price. [K]
Behind that one sentence, the runtime keeps the evidence that makes it trustworthy:
{
  requestType: 'read_only',
  platformMutationRequested: false,
  platformMutationApplied: false,
  confidence: 'high',
  sources: ['K'],
  toolsUsed: ['load_deal_kpis']
}
For a task mutation, the same pattern turns a dangerous sentence into a verifiable claim:
Task created: "Ask broker about oil heating (8a, 1989)". Due Fri, 19 Jun 2026. [T1]
That answer is only allowed through if a write command actually ran, the resolved date is stored, and the response contract reports platformMutationApplied: true.
Because our backend tracks sources through a structured schema rather than a flat text label, our frontend can render interactive UI tags like [KPI], [Doc 1], or [Task 2] without losing the underlying data payload.
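A structured source schema of this kind might be modeled as a discriminated union, with the visible tag derived from it at render time. The `Source` type and its payload fields are assumptions for illustration; only the tag formats come from the post.

```typescript
// Hypothetical source schema: the tag is derived, the payload travels with it.
type Source =
  | { kind: 'kpi'; payload: { grossYield: number } }
  | { kind: 'document'; index: number; payload: { documentName: string; page: number } }
  | { kind: 'task'; index: number; payload: { title: string; dueDate: string } }

function renderTag(source: Source): string {
  switch (source.kind) {
    case 'kpi':
      return '[KPI]'
    case 'document':
      return `[Doc ${source.index}]`
    case 'task':
      return `[Task ${source.index}]`
  }
}

const sources: Source[] = [
  { kind: 'kpi', payload: { grossYield: 0.0408 } },
  { kind: 'document', index: 1, payload: { documentName: 'Baubeschrieb.pdf', page: 9 } },
  {
    kind: 'task',
    index: 2,
    payload: { title: 'Ask broker about oil heating (8a, 1989)', dueDate: '2026-06-19' },
  },
]

console.log(sources.map(renderTag).join(' ')) // [KPI] [Doc 1] [Task 2]
```

A flat text label would stop at the string. Here the frontend can still reach `payload` when the user clicks a tag.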
This makes a bad answer easier to debug. The traces show the breakdown:
- Did the source planner choose the wrong tool?
- Did the document search return empty excerpts?
- Did the model ignore an available data source?
- Did the write planner mistake a question for a command?
These are engineering problems we can track, reproduce, and fix.
10 The takeaway
Building a tool-backed agent is complex. We had to build planners, resolvers, contracts, and fallbacks to power one chat surface.
That complexity has a clear purpose. Once an agent moves from talking to doing, casual prompts are not enough. A snapshot-fed model can guess the Abtwil yield, invent an elevator in 6a, or claim to have created a task it never wrote. Each failure creates risk in a real CHF 19 million decision.
By scoping the AI to a single business object, forcing calculations into native code, treating writes as strict application commands, and maintaining a deterministic fallback path, we built an agent AssetOS users can trust in transaction work.