MICHAEL CZEISZPERGER

Building an AI Financial Analysis System for Copilot Money

Like most interesting engineering problems, this one started with a simple question: is it cheaper for two people to go to the movies at Red Cinemas or Golden Ticket? While Copilot Money (which I strongly recommend for personal financial management) excels at transaction categorization, its reporting capabilities are surprisingly limited. For example, budget vs. actuals only works for the current month, and custom analysis requires manual spreadsheet work. Over the holiday break, I decided to build something better: a natural-language financial reporting system using Claude Code that could answer complex financial questions on demand.

The Technical Architecture

The solution required three distinct layers, each handling a specific concern:

  1. API Integration Layer: Reverse-engineered the Copilot Money GraphQL API to enable programmatic data access
  2. Agent Layer: An MCP server embedded in a Chrome extension that translates natural language queries into structured API calls
  3. Analysis Layer: Domain-specific tools that perform calculations and aggregate data—because LLMs shouldn’t do math

high-level-uml-diagram

The architecture deliberately separates concerns: the agent orchestrates, the MCP server executes reliable calculations, and the LLM handles only what it’s good at: understanding intent and presenting results. This isn’t just good practice; it’s essential when working with AI that has well-documented mathematical limitations.
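To make that separation concrete, here is a minimal sketch of the dispatch idea: the LLM only produces a structured tool call, and plain code looks up and runs the matching handler. Everything here (`ToolCall`, `dispatch`, the `sumAmounts` tool and its argument shape) is a hypothetical illustration, not the project's actual code or the MCP SDK's API.

```typescript
// Hypothetical shapes for illustration only; not the project's real types.
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => unknown;

// Registry of deterministic tools the agent is allowed to call.
const tools = new Map<string, ToolHandler>();

// Example tool (assumed name): adds amounts in integer cents.
tools.set("sumAmounts", (args) => {
  const amounts = args["amounts"];
  if (!Array.isArray(amounts) || !amounts.every((a) => typeof a === "number")) {
    throw new Error("sumAmounts expects { amounts: number[] }");
  }
  // Work in integer cents to avoid floating-point drift.
  const cents = amounts.reduce(
    (total: number, a: number) => total + Math.round(a * 100),
    0,
  );
  return cents / 100;
});

// A structured call, as the LLM might emit it after interpreting a question.
interface ToolCall {
  name: string;
  args: ToolArgs;
}

// The orchestration step: no model involvement, just lookup and execution.
function dispatch(call: ToolCall): unknown {
  const handler = tools.get(call.name);
  if (!handler) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return handler(call.args);
}

const total = dispatch({ name: "sumAmounts", args: { amounts: [12.5, 7.25, 30] } });
// total is 49.75
```

The point of the sketch is that the model's output is data (a tool name plus arguments), so an unknown tool or a malformed argument fails loudly in code instead of being papered over in generated text.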

Building for Fuzzy Logic and Domain Understanding

The real challenge wasn’t the technical implementation; it was encoding financial domain knowledge into a system that could handle real-world complexity. Consider this query:

initial-prompt

A traditional system would search for exact merchant name matches. Instead, I built the agent to take any query and research alternative keywords. In this case, the AI worked out that “movies” encompasses cinema, theater, and ticket purchases, and automatically generated search variants. More importantly, it needed to understand financial semantics: when you go to a movie, you might have three transactions (tickets via Fandango, parking, concessions), but that’s really one “movie visit” for cost analysis purposes.
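As a rough sketch of the keyword-expansion idea, the snippet below matches merchants against a set of search variants rather than one exact name. In the real system the agent researches the variants; here they are hard-coded, and the `Transaction` shape, the `expandQuery` and `findMatches` helpers, and the sample data are all assumptions made for illustration.

```typescript
// Hypothetical transaction shape; the real API's fields may differ.
interface Transaction {
  id: string;
  merchant: string;
  amount: number;
}

// Assumption: hard-coded variants standing in for what the agent researches.
const searchVariants: Record<string, string[]> = {
  movies: ["movie", "cinema", "theater", "theatre"],
};

// Expand a user term into the term itself plus any known variants.
function expandQuery(term: string): string[] {
  const key = term.toLowerCase();
  return [key, ...(searchVariants[key] ?? [])];
}

// Keep transactions whose merchant name contains any variant (case-insensitive).
function findMatches(transactions: Transaction[], term: string): Transaction[] {
  const variants = expandQuery(term);
  return transactions.filter((t) => {
    const name = t.merchant.toLowerCase();
    return variants.some((v) => name.includes(v));
  });
}

// Made-up sample data.
const sample: Transaction[] = [
  { id: "t1", merchant: "Red Cinemas", amount: 24.0 },
  { id: "t2", merchant: "Golden Ticket Cine", amount: 22.0 },
  { id: "t3", merchant: "Grocery Mart", amount: 58.1 },
];

const matches = findMatches(sample, "movies");
// Only t1 matches: "Golden Ticket Cine" contains none of the variants.
```

Even in this toy version, a merchant whose name does not contain any of the variants slips through, which is why a follow-up prompt naming the merchant explicitly is still needed.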

first-response

red-cinema-transactions

First, notice the nice formatting of the results. That doesn’t happen by accident; it was the result of careful prompt engineering. Next, notice that the agent correctly identified Red Cinema, but Golden Ticket is missing. Its name is unique enough that it wasn’t caught by standard fuzzy matching. The system needed to support iterative refinement:

second-prompt

second-result

Notice it’s now grouping related transactions: multiple purchases during a single movie visit are automatically associated with each other. This wasn’t emergent AI behavior; I specifically designed the getMultiMerchantTransactions tool to handle this exact pattern because it applies to any entertainment expense: concerts, sporting events, any activity where tickets, parking, and concessions are separate but conceptually unified. To complicate matters further, look at the movie visit on Nov 5th. Two different names appear on its transactions: “Golden Ticket Cinegreensboro” and “Golden Ticket Cine”. The fuzzy logic used for searches also has to apply to vendor names.

grouped-transactions
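The grouping and vendor-name problem can be sketched as follows. This is not the actual getMultiMerchantTransactions implementation: it assumes a "visit" is simply all matched transactions on the same date, and it uses a normalized prefix check as a stand-in for whatever fuzzy matching the real tool uses. The `Txn` and `Visit` types, helper names, dates, and amounts are invented for the example.

```typescript
// Hypothetical shapes; dates are ISO strings (YYYY-MM-DD).
interface Txn {
  merchant: string;
  amount: number;
  date: string;
}

interface Visit {
  date: string;
  merchants: string[];
  totalCents: number;
}

// Lowercase and collapse punctuation/whitespace so names compare cleanly.
function normalize(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Stand-in for fuzzy vendor matching: one normalized name is a prefix of the other.
function sameVendor(a: string, b: string): boolean {
  const x = normalize(a);
  const y = normalize(b);
  if (x.length === 0 || y.length === 0) return false;
  return x.startsWith(y) || y.startsWith(x);
}

// Assumption: transactions on the same date belong to one visit.
function groupIntoVisits(txns: Txn[]): Visit[] {
  const byDate = new Map<string, Visit>();
  for (const t of txns) {
    const visit = byDate.get(t.date) ?? { date: t.date, merchants: [], totalCents: 0 };
    // Record each vendor once, even if its name varies between transactions.
    if (!visit.merchants.some((m) => sameVendor(m, t.merchant))) {
      visit.merchants.push(t.merchant);
    }
    visit.totalCents += Math.round(t.amount * 100);
    byDate.set(t.date, visit);
  }
  return Array.from(byDate.values()).sort((a, b) => a.date.localeCompare(b.date));
}

// Made-up amounts and year; only the two merchant spellings come from the article.
const visits = groupIntoVisits([
  { merchant: "Golden Ticket Cinegreensboro", amount: 24.0, date: "2024-11-05" },
  { merchant: "Golden Ticket Cine", amount: 13.5, date: "2024-11-05" },
  { merchant: "Golden Ticket Cine", amount: 22.0, date: "2024-11-19" },
]);
// visits[0] has one merchant entry and totalCents of 3750.
```

A prefix check is deliberately crude; it handles the truncated-name case shown here but would need something stronger (edit distance, token overlap) for names that differ in the middle.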

When I mentioned Fandango purchases, the agent correctly understood that even though Fandango isn’t a theater, those transactions should be attributed to the movie visit cost. This also shows how the context is carefully managed so that the AI is aware of previous prompts and responses. Again, this doesn’t happen automatically; it had to be carefully programmed. The result is a complete financial picture that required sophisticated transaction attribution logic:

third-prompt

third-result

movie-cost-bar-chart

Architectural Decisions: Why This Approach Works

The key insight is understanding where AI adds value and where traditional software engineering is superior. I designed 12 specialized MCP tools, each with a single-responsibility focus:

Tool Name | Purpose | Design Rationale
getCategoryBudgetVsActual | Compare budgeted amounts to actual spending | Extends Copilot’s built-in feature to support historical analysis
getAccountBalances | Retrieve balances with filtering | Foundation for net worth calculations
getNetWorth | Calculate and track net worth over time | Aggregates across accounts with temporal analysis
getMonthlySpending | Spending patterns by category and month | Enables trend detection and comparative analysis
getTransactions | Advanced transaction filtering | Core search primitive for all queries
getMultiMerchantTransactions | Multi-merchant searches with cost attribution | Handles the “movie theater” problem: groups related transactions from different merchants
discoverMerchantsFromTransactions | Find merchants by business type | Enables exploratory analysis: “Where did I buy groceries last year?”
getUpcomingBills | Recurring payment tracking | Cash flow planning support
getTags | Transaction tag enumeration | Enables tag-based analysis
getCategoryGroups | Category hierarchy discovery | Supports budget rollup queries
getBudgetUtilizationTrend | Multi-month budget performance | Long-term budget adherence tracking
calculate | All mathematical operations | Critical: guarantees accuracy; the AI never touches arithmetic

Each tool has carefully considered inputs, outputs, and error handling. The calculate tool is particularly important: by mandating that all math happens in deterministic code, I eliminated an entire class of AI hallucination issues. The agent can reason about what to calculate, but not perform the calculation itself.
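A deterministic calculation tool along these lines might look like the sketch below. The operation names, the `CalculateRequest` shape, and the choice to work in integer cents are assumptions for illustration; the article does not describe the real calculate tool's interface.

```typescript
// Hypothetical request shape; the real tool's inputs are not documented here.
type Operation = "sum" | "average" | "difference";

interface CalculateRequest {
  operation: Operation;
  values: number[];
}

// Convert dollars to integer cents so additions are exact.
function toCents(value: number): number {
  return Math.round(value * 100);
}

function calculate(req: CalculateRequest): number {
  const { operation, values } = req;
  if (values.length === 0 || !values.every((v) => Number.isFinite(v))) {
    throw new Error("values must be a non-empty list of finite numbers");
  }
  const cents = values.map(toCents);
  const sumCents = cents.reduce((a, b) => a + b, 0);

  switch (operation) {
    case "sum":
      return sumCents / 100;
    case "average":
      // Round to the nearest cent after dividing.
      return Math.round(sumCents / cents.length) / 100;
    case "difference":
      if (cents.length !== 2) {
        throw new Error("difference expects exactly two values");
      }
      return (cents[0] - cents[1]) / 100;
  }
}

// The model decides *what* to compute; this code decides the answer.
const sum = calculate({ operation: "sum", values: [0.1, 0.2] }); // 0.3
const avg = calculate({ operation: "average", values: [37.5, 22] }); // 29.75
const diff = calculate({ operation: "difference", values: [37.5, 22] }); // 15.5
```

Rejecting empty or non-finite input with an error, rather than returning a best guess, keeps a bad tool call visible to the agent so it can retry with corrected arguments.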

Why This Matters

Building production AI systems isn’t about prompt engineering; it’s about architecture. The pattern here (agent orchestration + specialized tools + deterministic processing) applies far beyond personal finance. I’ve used similar approaches for test automation analysis, build system optimization, and developer workflow tools.

The reason Copilot Money hasn’t built this yet? It’s genuinely hard. Production systems need comprehensive error handling, edge case management, security considerations, and user interface polish. This prototype took a few days because I own the problem space and built exactly what I need. A commercial product requires orders of magnitude more effort. The stakes are low here, but what if I were trying to handle something important like taxes?

Even with the challenges, I’d bet on this pattern becoming common in the next year, not because the AI is magic, but because we’re finally learning how to architect systems that use AI where it excels and traditional software where determinism matters.
