AI in production: what separates a demo from a reliable system

Any dev can get GPT to extract data from an invoice in 5 minutes. The hard part is making it work every time, with any input, and with visibility when it breaks.


Any developer can get a language model to extract the value from a bank slip in five minutes. Paste the PDF into the prompt, see the output, show the client. Demo done.

The problem is that a demo isn't a system. A system is what works when the document is crooked, photographed with a phone in bad lighting, with a stamp covering the barcode. A system is what you trust at three in the morning with nobody monitoring.

What I'm describing here comes from financial document automation in production, extracting data from bank slips and invoices at volume. This isn't theory.

The specific problem with financial documents

Bank slips and invoices have a characteristic that makes the problem more demanding than it looks: errors have direct financial consequences. A wrong extracted value in an automated payment isn't a UX bug someone will report the next day. It's a compliance problem that can cost more than the entire project.

The flow involved classification and extraction. First the model needed to decide whether the submitted document was actually a valid bank slip or invoice. Then, if it was, it extracted the relevant fields. Two different problems, handled separately.
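
A minimal sketch of that separation; the function names and result shape are illustrative, not the production code:

```python
# Two-stage flow: classify first, extract only if the document is valid.
# classify_document and extract_fields stand in for the actual model calls.
from dataclasses import dataclass, field

def classify_document(image: bytes) -> str:
    """Model call: returns 'bank_slip', 'invoice', or 'other'."""
    raise NotImplementedError

def extract_fields(image: bytes, doc_type: str) -> dict:
    """Model call: returns the extracted fields for a valid document."""
    raise NotImplementedError

@dataclass
class Result:
    accepted: bool
    fields: dict = field(default_factory=dict)
    reason: str = ""

def process_document(image: bytes) -> Result:
    doc_type = classify_document(image)
    if doc_type == "other":
        # Reject early: no extraction call, no extra cost, nothing bogus downstream.
        return Result(accepted=False, reason="not a supported document")
    return Result(accepted=True, fields=extract_fields(image, doc_type))
```

Keeping the stages separate means the classifier can be tested, and can fail, independently of extraction.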

The tests most teams skip

The difference between a system that works in a demo and one that works in production starts with testing. Not happy-path tests, which everyone writes. Adversarial tests.

The test suite we used included the following (a test sketch comes after the list):

  • Real documents — bank slips and invoices of the types the system would process, with all the natural variation in layout, font, and quality
  • Fictitious documents — created to cover edge cases without exposing real data in the development environment
  • Hand-drawn sketches — to test the classifier's limits: does the model know how to say "this isn't a bank slip" when the input is clearly invalid?
  • Photos completely out of context — landscapes, selfies, game screenshots. The classifier needs to reject these with confidence
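
A minimal pytest sketch of the adversarial half of that suite; the fixture paths and the pipeline module are assumptions, and classify_document is the hypothetical classifier sketched earlier:

```python
# The assertion that matters: out-of-context inputs must be rejected.
import pytest

from pipeline import classify_document  # hypothetical module from the earlier sketch

REJECT = [
    "fixtures/hand_drawn_sketch.png",     # limit test for the classifier
    "fixtures/landscape_photo.jpg",       # completely out of context
    "fixtures/game_screenshot.png",
]
ACCEPT = [
    "fixtures/fictitious_bank_slip.pdf",  # synthetic edge case, no real data
    "fixtures/fictitious_invoice.pdf",
]

@pytest.mark.parametrize("path", REJECT)
def test_rejects_out_of_context_input(path):
    with open(path, "rb") as f:
        assert classify_document(f.read()) == "other"

@pytest.mark.parametrize("path", ACCEPT)
def test_accepts_valid_documents(path):
    with open(path, "rb") as f:
        assert classify_document(f.read()) in {"bank_slip", "invoice"}
```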

The out-of-context image test is what separates a toy classifier from a production classifier. In a demo, you only show the cases where it works. In production, users will send exactly what you didn't test.

Beyond classification, each extracted field was validated against business rules: is the numeric value plausible for the context? Does the tax ID have the correct format? Is the due date within a plausible range? The model may extract with apparent confidence, but every field still passes through deterministic validation. They're independent layers.
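
A sketch of that deterministic layer, assuming a 14-digit tax ID and plausibility windows that are illustrative rather than the production rules:

```python
import re
from datetime import date, timedelta

def plausible_amount(value: float) -> bool:
    # Plausibility window for this context; tune per document type.
    return 0 < value < 1_000_000

def valid_tax_id(tax_id: str) -> bool:
    # Format-only check (14 digits after stripping punctuation);
    # a real check also verifies the check digits.
    return re.fullmatch(r"\d{14}", re.sub(r"\D", "", tax_id)) is not None

def plausible_due_date(due: date) -> bool:
    # Reject dates in the distant past or implausibly far in the future.
    today = date.today()
    return today - timedelta(days=90) <= due <= today + timedelta(days=365)
```

Each function is boring on purpose: no model involved, so a failure here is unambiguous.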

Observability: knowing the system is healthy without reviewing every response

In production, you won't manually review model output for every execution. That doesn't scale. The question is: how do you know everything is working?

What we had (a logging sketch follows the list):

  • Sentry for failures — 5xx errors arrived as immediate alerts. When the system broke, the team knew before the user reported it
  • Grafana for executions — a log of every flow execution: input received, model response, extracted fields, final result. Complete traceability for diagnosis
  • Operations team monitoring quality — the part that tools can't replace. Someone needs to periodically look at model responses and identify degradation patterns before they become a problem
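
A sketch of the record behind that per-execution log; the field names are assumptions, and the shape is the point: one structured, queryable entry per flow execution:

```python
import json
import logging
import time

logger = logging.getLogger("extraction")

def log_execution(doc_id: str, doc_type: str, raw_response: str,
                  fields: dict, outcome: str) -> None:
    # One structured entry per execution: enough to diagnose, safe to keep.
    logger.info(json.dumps({
        "ts": time.time(),
        "doc_id": doc_id,                       # correlates with the stored record
        "doc_type": doc_type,                   # classifier decision
        "model_response": raw_response[:2000],  # truncated, for diagnosis
        "extracted_fields": sorted(fields),     # field names only, not values
        "outcome": outcome,                     # "ok", "rejected", "validation_failed"
    }))
```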

Model degradation is real. A provider can update the underlying model without notice. Behavior changes subtly. Without someone watching the outputs, you find out when the client complains.

Prompt management: who can change it, and through what process

A prompt is code. Changing a prompt changes system behavior. But a prompt is also natural language, and sometimes the person who best understands the domain isn't the engineer.

In the projects we've built, we use two approaches depending on context:

Prompt in code, as a .txt file — versioned with git, auditable, every change goes through code review and deployment. Slower to adjust, but traceable. Good for prompts that stay stable.
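
A minimal sketch of this approach, with an illustrative directory layout: the file lives in the repo, so every change shows up in git blame and goes through review.

```python
from pathlib import Path

PROMPT_DIR = Path(__file__).parent / "prompts"

def load_prompt(name: str) -> str:
    # e.g. prompts/extract_invoice_fields.txt, versioned like any other code
    return (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")

extraction_prompt = load_prompt("extract_invoice_fields")
```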

Prompt in Drive — the operations team can adjust without an engineer. Useful when the domain changes frequently and the ops team has the knowledge. The trade-off is traceability: without git, you don't know who changed what and when, and you can't easily revert.

The choice isn't technical. It's about who needs control and how often the prompt will change. What matters is that it's a conscious choice, not whichever path was easier to implement.

Real API cost at scale

At volume, API cost stops being negligible. What matters isn't just the price per token: it's the cost of long prompts with redundant context, the cost of retries when the provider fails, the cost of a more expensive model used where a cheaper one would do.

We used LiteLLM as an intermediary between the application and providers. This gives flexibility to switch providers without changing application code. If GPT-4o gets expensive, you point to Claude or Gemini with one line of configuration. If a provider becomes unstable, you have a fallback.
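
A minimal sketch of that flexibility, using a plain try/except fallback rather than LiteLLM's router to keep the example small; the model names follow LiteLLM's conventions:

```python
from litellm import completion

PRIMARY = "gpt-4o"
FALLBACK = "claude-3-5-sonnet-20240620"  # or a Gemini model: "gemini/gemini-1.5-pro"

def ask(messages: list[dict]) -> str:
    try:
        resp = completion(model=PRIMARY, messages=messages, timeout=30)
    except Exception:
        # Primary unstable or rate-limited: one retry against the fallback.
        resp = completion(model=FALLBACK, messages=messages, timeout=30)
    return resp.choices[0].message.content
```

Switching providers really is one line: change the model string, keep the call site.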

What to watch in production (a metrics sketch follows the list):

  • Average prompt size per execution — a prompt that grows over time (accumulated context without cleanup) is invisible growing cost
  • Retry rate — high retry rate indicates provider instability or rate limits being hit
  • Usage distribution by model — if you're using an expensive model where a cheap one would work, that's direct optimization
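
A sketch of deriving those three signals, assuming the execution log stores prompt tokens, retry count, and the model used per call:

```python
from collections import Counter

def cost_signals(records: list[dict]) -> dict:
    if not records:
        return {}
    n = len(records)
    return {
        # A prompt that grows over time shows up here first.
        "avg_prompt_tokens": sum(r["prompt_tokens"] for r in records) / n,
        # High values point at provider instability or rate limits.
        "retry_rate": sum(r["retries"] for r in records) / n,
        # An expensive model doing cheap work is direct optimization.
        "calls_by_model": Counter(r["model"] for r in records),
    }
```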

Security with sensitive data

Financial data has compliance implications that generic data doesn't. Some practices we apply:

  • The development environment uses only fictitious or anonymized data. Real data never leaves production infrastructure for local testing
  • The model processes the document but doesn't store it. What goes to the database is the extracted field, not the original document with all the payer's data
  • Logs have a calibrated level of detail: enough for diagnosis, without exposing full PII as plain text in the records (see the sketch after this list)
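
A sketch of that calibration: mask obvious PII patterns before a value ever reaches a log record. The patterns are illustrative, assuming Brazilian-style tax IDs; a real deployment needs a reviewed list.

```python
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}\b"), "[CNPJ]"),  # company tax ID
    (re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b"), "[CPF]"),          # personal tax ID
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]

def redact(text: str) -> str:
    for pattern, label in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text
```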

At Chiarelli Labs

When we build AI automations, this discipline enters the project from the start, not as a retrofit after the first production bug.

Before any deploy, the test suite includes adversarial cases, not just the happy path. Observability is part of the delivery, not an extra. Prompt management is decided together with the client based on who needs control.

If you're evaluating vendors for a project with AI, the right question isn't "have you done this before?" It's "what happens when the model makes a mistake?" and "how will I know it's still working six months from now?"

If you want that conversation, get in touch.

Have a validated idea, process to automate, or product to build?

Get in touch