Like most interesting engineering problems, this one started with a simple question: is it cheaper for two people to go to the movies at Red Cinemas or Golden Ticket? While Copilot Money (which I strongly recommend for personal financial management) excels at transaction categorization, its reporting capabilities are surprisingly limited. For example, budget vs. actuals only works for the current month, and custom analysis requires manual spreadsheet work. Over the holiday break, I decided to build something better: a natural-language financial reporting system, built with Claude Code, that could answer complex analytical questions about my finances on demand.
The Technical Architecture
The solution required three distinct layers, each handling a specific concern:
- API Integration Layer: Reverse-engineered the Copilot Money GraphQL API to enable programmatic data access
- Agent Layer: An MCP server embedded in a Chrome extension that translates natural language queries into structured API calls
- Analysis Layer: Domain-specific tools that perform calculations and aggregate data—because LLMs shouldn’t do math

The architecture deliberately separates concerns: the agent orchestrates, the MCP server executes reliable calculations, and the LLM handles only what it’s good at: understanding intent and presenting results. This isn’t just good practice; it’s essential when working with AI that has well-documented mathematical limitations.
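To make the separation concrete, here is a minimal sketch of the dispatch pattern, assuming a simple in-process registry. The names `toolRegistry`, `dispatchToolCall`, and `sumAmounts` are hypothetical and are not taken from the actual project; the point is only that the model chooses a tool and its arguments, while ordinary code does the work.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the real implementation.
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => unknown;

// Deterministic tools live in ordinary code, keyed by name.
const toolRegistry = new Map<string, ToolHandler>();

// Example tool: adds amounts expressed in integer cents.
toolRegistry.set("sumAmounts", (args) => {
  const amounts = args["amounts"];
  if (!Array.isArray(amounts) || !amounts.every((a) => typeof a === "number")) {
    throw new Error("sumAmounts expects { amounts: number[] }");
  }
  return (amounts as number[]).reduce((total, a) => total + a, 0);
});

// The shape of a tool call as the model might emit it (assumed).
interface ToolCall {
  name: string;
  args: ToolArgs;
}

// The orchestrator only routes the call; it never computes anything itself.
function dispatchToolCall(call: ToolCall): unknown {
  const handler = toolRegistry.get(call.name);
  if (handler === undefined) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return handler(call.args);
}

const dispatched = dispatchToolCall({
  name: "sumAmounts",
  args: { amounts: [1250, 800, 1575] },
});
console.log(dispatched); // 3625
```

Rejecting unknown tool names and malformed arguments at this boundary means a confused model produces an error rather than a wrong number.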
Building for Fuzzy Logic and Domain Understanding
The real challenge wasn’t the technical implementation; it was encoding financial domain knowledge into a system that could handle real-world complexity. Consider this query:

A traditional system would search for exact merchant name matches. Instead, I built the agent to take any query and research alternative keywords. In this case, the AI worked out that “movies” encompasses cinema, theater, and ticket purchases, and automatically generated search variants. More importantly, it needed to understand financial semantics: when you go to a movie, you might have three transactions (tickets via Fandango, parking, concessions), but that’s really one “movie visit” for cost analysis purposes.
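The keyword-expansion step can be sketched roughly as follows. This is an illustration under assumptions: the static `searchVariantsByTopic` table stands in for the variants the agent researches at query time, and the transaction shape and sample data are made up.

```typescript
// Hypothetical stand-in for the variants the agent would research per query.
const searchVariantsByTopic: Record<string, string[]> = {
  movies: ["movie", "cinema", "theater", "theatre", "ticket"],
};

// Assumed transaction shape; amounts are integer cents.
interface MerchantTransaction {
  merchant: string;
  date: string;
  amountCents: number;
}

// Expand a topic into the list of lowercase terms to search for.
function expandSearchTerms(topic: string): string[] {
  const key = topic.trim().toLowerCase();
  const variants = searchVariantsByTopic[key] ?? [];
  return Array.from(new Set([key, ...variants]));
}

// Keep transactions whose merchant name contains any of the expanded terms.
function findMatchingTransactions(
  topic: string,
  transactions: MerchantTransaction[],
): MerchantTransaction[] {
  const terms = expandSearchTerms(topic);
  return transactions.filter((t) => {
    const name = t.merchant.toLowerCase();
    return terms.some((term) => name.includes(term));
  });
}

// Made-up sample data for illustration only.
const sampleTransactions: MerchantTransaction[] = [
  { merchant: "Red Cinemas", date: "2024-11-19", amountCents: 2200 },
  { merchant: "Fandango", date: "2024-11-05", amountCents: 2400 },
  { merchant: "Corner Grocery", date: "2024-11-06", amountCents: 5310 },
];

const movieMatches = findMatchingTransactions("movies", sampleTransactions);
console.log(movieMatches.map((t) => t.merchant)); // ["Red Cinemas"]
```

Note that “Fandango” is not matched by any variant in this sketch, which is exactly the kind of gap that a follow-up prompt has to close.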


First, notice the nice formatting of the results. That didn’t happen by accident; it was the result of careful prompt engineering. Next, notice that the agent correctly identified Red Cinema, but Golden Ticket is missing. Its name is unusual enough that standard fuzzy matching didn’t catch it. The system needed to support iterative refinement:


Notice it’s now grouping related transactions—multiple purchases during a single movie visit are automatically associated together. This wasn’t emergent AI behavior; I specifically designed the getMultiMerchantTransactions tool to handle this exact pattern because it applies to any entertainment expense: concerts, sporting events, any activity where tickets, parking, and concessions are separate but conceptually unified. To make it even more complicated, look at the movie visit on Nov 5th. There are two different names on the transaction: “Golden Ticket Cinegreensboro” and “Golden Ticket Cine”. The fuzzy logic used for searches also has to apply to vendor names.
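A rough sketch of the grouping idea, under two simplifying assumptions that are mine rather than the tool’s documented behavior: transactions that were already selected as movie-related are grouped into a visit by calendar date, and vendor names are canonicalized with a plain prefix match as a stand-in for real fuzzy matching. The function names, the `knownVendors` list, the dates, and the amounts are all hypothetical.

```typescript
// Assumed shapes; amounts are integer cents.
interface CardTransaction {
  merchant: string;
  date: string; // ISO date, e.g. "2024-11-05"
  amountCents: number;
}

interface Visit {
  date: string;
  merchants: string[];
  totalCents: number;
}

// Hypothetical list of canonical vendor names.
const knownVendors = ["Golden Ticket Cine", "Red Cinemas"];

// Map raw card descriptors to a canonical name via case-insensitive prefix match.
function canonicalVendor(rawName: string): string {
  const lower = rawName.toLowerCase();
  const match = knownVendors.find((v) => lower.startsWith(v.toLowerCase()));
  return match ?? rawName;
}

// Group already-filtered transactions into one visit per date (an assumption).
function groupIntoVisits(transactions: CardTransaction[]): Visit[] {
  const byDate = new Map<string, Visit>();
  for (const t of transactions) {
    const visit: Visit =
      byDate.get(t.date) ?? { date: t.date, merchants: [], totalCents: 0 };
    const vendor = canonicalVendor(t.merchant);
    if (!visit.merchants.includes(vendor)) {
      visit.merchants.push(vendor);
    }
    visit.totalCents += t.amountCents;
    byDate.set(t.date, visit);
  }
  return Array.from(byDate.values()).sort((a, b) => a.date.localeCompare(b.date));
}

// Made-up data: two descriptors for the same theater on the same day.
const movieVisits = groupIntoVisits([
  { merchant: "Golden Ticket Cinegreensboro", date: "2024-11-05", amountCents: 2400 },
  { merchant: "Golden Ticket Cine", date: "2024-11-05", amountCents: 1350 },
  { merchant: "Red Cinemas", date: "2024-11-19", amountCents: 2200 },
]);
console.log(JSON.stringify(movieVisits, null, 2));
```

With this sample data, the two Golden Ticket descriptors collapse into a single visit totaling 3750 cents under one canonical merchant name. A real implementation would likely need a time window rather than a strict date match, for example for a late show that ends after midnight.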

When I mentioned Fandango purchases, the agent correctly understood that even though Fandango isn’t a theater, those transactions should be attributed to the cost of the movie visit. This exchange also shows that conversation context is carefully managed, so the AI stays aware of previous prompts and responses. Again, this doesn’t happen automatically; it had to be deliberately programmed. The result is a complete financial picture that required sophisticated transaction attribution logic:



Architectural Decisions: Why This Approach Works
The key insight is understanding where AI adds value and where traditional software engineering is superior. I designed 12 specialized MCP tools, each with a single responsibility:
| Tool Name | Purpose | Design Rationale |
|---|---|---|
| getCategoryBudgetVsActual | Compare budgeted amounts to actual spending | Extends Copilot’s built-in feature to support historical analysis |
| getAccountBalances | Retrieve balances with filtering | Foundation for net worth calculations |
| getNetWorth | Calculate and track net worth over time | Aggregates across accounts with temporal analysis |
| getMonthlySpending | Spending patterns by category and month | Enables trend detection and comparative analysis |
| getTransactions | Advanced transaction filtering | Core search primitive for all queries |
| getMultiMerchantTransactions ⭐ | Multi-merchant searches with cost attribution | Handles the “movie theater” problem—groups related transactions from different merchants |
| discoverMerchantsFromTransactions | Find merchants by business type | Enables exploratory analysis: “Where did I buy groceries last year?” |
| getUpcomingBills | Recurring payment tracking | Cash flow planning support |
| getTags | Transaction tag enumeration | Enables tag-based analysis |
| getCategoryGroups | Category hierarchy discovery | Supports budget rollup queries |
| getBudgetUtilizationTrend | Multi-month budget performance | Long-term budget adherence tracking |
| calculate | All mathematical operations | Critical: Guarantees accuracy—AI never touches arithmetic |
Each tool has carefully considered inputs, outputs, and error handling. The calculate tool is particularly important: by mandating that all math happens in deterministic code, I eliminated an entire class of AI hallucination issues. The agent can reason about what to calculate, but not perform the calculation itself.
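As an illustration of that division of labor, a deterministic calculation tool might look something like the sketch below. The operation names and the request shape are assumptions made for this example, not the real tool’s schema. Amounts are kept in integer cents to avoid floating-point drift on currency values.

```typescript
// Hypothetical operation set; the real calculate tool may support different ones.
type CalcOperation = "sum" | "average" | "subtract";

interface CalcRequest {
  operation: CalcOperation;
  values: number[]; // integer cents
}

// All arithmetic happens here, in ordinary code, never in the model.
function calculate(request: CalcRequest): number {
  const { operation, values } = request;
  if (values.length === 0 || !values.every((v) => Number.isFinite(v))) {
    throw new Error("calculate requires a non-empty list of finite numbers");
  }
  const total = values.reduce((acc, v) => acc + v, 0);
  switch (operation) {
    case "sum":
      return total;
    case "average":
      return total / values.length;
    case "subtract": {
      if (values.length !== 2) {
        throw new Error("subtract requires exactly two values");
      }
      const [a, b] = values as [number, number];
      return a - b;
    }
    default:
      // Guards against operation names the model invents at runtime.
      throw new Error("Unsupported operation");
  }
}

// Made-up per-visit totals in cents for two theaters.
const redAverage = calculate({ operation: "average", values: [3750, 4100] });
const goldenAverage = calculate({ operation: "average", values: [2900, 3300] });
const averageGap = calculate({ operation: "subtract", values: [redAverage, goldenAverage] });
console.log(redAverage, goldenAverage, averageGap); // 3925 3100 825
```

The model decides which operation to request and on which values; the numbers in the answer come only from this function.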
Why This Matters
Building production AI systems isn’t about prompt engineering; it’s about architecture. The pattern here (agent orchestration + specialized tools + deterministic processing) applies far beyond personal finance. I’ve used similar approaches for test automation analysis, build system optimization, and developer workflow tools.
The reason Copilot Money hasn’t built this yet? It’s genuinely hard. Production systems need comprehensive error handling, edge case management, security considerations, and user interface polish. This prototype took a few days because I own the problem space and built exactly what I need. A commercial product requires orders of magnitude more effort. The stakes are low here, but what if I were trying to handle something important like taxes?
Even with the challenges, I’d bet on this pattern becoming common in the next year, not because the AI is magic, but because we’re finally learning how to architect systems that use AI where it excels and traditional software where determinism matters.