AI & Voice

99% Reliability Isn't a Number. It's an Architecture.

Frontier LLMs cap at 60-75% on multi-step tool use. The product question isn't whether models are smart. It's whether the stack around them is reliable.


At 2:37am on a Sunday, a voice AI at a fast-food drive-thru added 260 chicken nuggets to a single order, in an incident that is now widely known. The customer tried to stop it. The AI kept going. The video ran on social media for a week. Three months later, McDonald’s announced it was ending its three-year, 100-location partnership with IBM on drive-thru AI.

Around the same time, a British Columbia civil tribunal held Air Canada legally liable for a bereavement-fare discount its chatbot had promised a grieving customer. The chatbot was wrong. The tribunal ruled, in a decision now widely cited as a precedent for consumer AI liability, that the company owns what its chatbot says. The BBC’s coverage quoted tribunal member Christopher Rivers:

In effect, Air Canada suggests the chatbot is a separate legal entity that is responsible for its own actions. This is a remarkable submission. While a chatbot has an interactive component, it is still just a part of Air Canada’s website.

BC Civil Resolution Tribunal, Moffatt v. Air Canada, Feb 2024

A little over a year after that, Klarna, which had announced in 2024 that an AI assistant had replaced 700 customer-service agents, reversed course and began rehiring humans. CEO Sebastian Siemiatkowski, on the record:

Cost unfortunately became too predominant an evaluation factor, leading to lower quality.

Each of these failures has, in the post-mortem press, been framed as a product-fit problem: McDonald’s should have piloted longer, Air Canada should have used better prompts, Klarna moved too fast. That framing is wrong. The underlying issue, at all three companies, is the same issue every serious LLM product team has been quietly confronting since 2024:

A single LLM call, no matter how capable the model, is not a reliable production substrate. The industry has been in the habit of reporting demo accuracy. The number that matters is production reliability across compound, tool-using, multi-turn workflows. That number is not 99%. It is nowhere close.

The benchmark the industry is not quoting

The benchmark every hospitality-AI vendor should be citing, and almost none are, is τ-bench (Tau-Bench), published by Sierra AI in 2024. It is the cleanest academic measurement of tool-using, multi-step agent reliability in customer-service contexts.

The primary metric τ-bench introduces is pass^k: the fraction of tasks an agent completes successfully across k independent trials of the same task. A pass^1 score of 60% means the agent gets the right answer 60% of the time on a single attempt. A pass^8 score of 25.6% means the agent succeeds across all 8 attempts of the same task, one after another, only 25.6% of the time. The pass^k metric is important because production agents are not graded on whether they can succeed; they are graded on whether they succeed reliably, every time, on every user interaction of the same shape.

Claude 3.5 Sonnet, the strongest single-model baseline we tested, achieved pass^1 = 60% and pass^8 = 25.6% on τ-bench retail tasks.

Yao et al., τ-bench: A Benchmark for Tool-Agent-User Interaction

Pass^8 of 25.6% means: of every four customer-service tasks attempted eight times, three fail at least once. In a production contact center with thousands of interactions per day, that is a compounding failure rate that absolutely will not pass a hospitality operator’s review.
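To make the metric concrete, here is a minimal sketch of how a pass^k score can be estimated from repeated trials. The per-task trial counts below are made up for illustration, and the estimator (the same combinatorial construction used for pass@k in code-generation benchmarks) is one standard unbiased way to compute it, not necessarily the exact implementation the τ-bench authors ship.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k (the probability that k independent
    attempts at the same task ALL succeed) from n trials with c successes."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Hypothetical per-task results: (trials run, trials that succeeded).
tasks = [(8, 8), (8, 8), (8, 6), (8, 3), (8, 0)]

for k in (1, 4, 8):
    score = sum(pass_hat_k(n, c, k) for n, c in tasks) / len(tasks)
    print(f"pass^{k} = {score:.3f}")
# pass^1 = 0.625, pass^4 = 0.443, pass^8 = 0.400
```

Tasks the agent only sometimes solves contribute almost nothing once eight consecutive successes are required, which is why a published pass^1 of 60% collapses to a pass^8 of 25.6%.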

The parallel number from the Berkeley Function-Calling Leaderboard v3:

Frontier models (GPT-4o, Claude 3.5 Sonnet) score ~85–88% on overall function-calling accuracy. On multi-turn, multi-step, and parallel / nested tool calls, accuracy drops to 60–75%.

This is the ceiling, not the floor. The best models in the world, on the cleanest academic benchmark, top out at 60-75% on the exact class of operations a hospitality voice agent has to perform every minute: given this user turn, call the right tool, with the right arguments, and compose the result with the prior turn’s context.

Gartner’s 2024 prediction is, in light of the benchmark data, unsurprising:

At least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025, due to poor data quality, inadequate risk controls, escalating costs or unclear business value.

The McDonald’s, Air Canada, and Klarna failures are not outliers. They are, empirically, the expected distribution when a single LLM call is treated as a reliable production component.

What operators should hear when a vendor says “99% accurate”

This is the place to pause on vendor language. A lot of hospitality AI decks, including, we are sure, some of our own earlier materials, quote accuracy numbers in the high 90s. The honest accounting is that “accuracy” can refer to at least three different numbers:

  1. Transcription accuracy: how often the speech-to-text layer gets the caller’s words right.
  2. Single-turn accuracy: how often one isolated question, with no tools and no conversation history, gets a correct answer.
  3. End-to-end task completion: how often a full multi-turn, tool-using conversation ends with the guest’s task actually done.

When a vendor quotes 99% without specifying the scope, they are usually quoting the first or second number. The third number, the one the operator is actually buying, is not 99%. Reframing this as an engineering honesty problem, not a product marketing problem, is the first step any serious hospitality AI team has to take.

At FlowStay, we think the correct framing is reliability as an architectural target, not as a measured SLO from day one. We engineer toward an internal target of 99%+ end-to-end task completion, and we have been explicit with operators about what has to be true of the system for that target to be credible.

Why the architecture matters more than the model

If a single LLM call tops out somewhere between 25% and 75% on the production metric, then the only path to a 99% production outcome is an architecture that compounds correctness across layers, so that the weakness of any single model call is absorbed rather than exposed.
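A deliberately simplified piece of arithmetic shows why compounding matters. The catch rates below are illustrative assumptions, not measured FlowStay figures, and real failure modes are rarely independent; the point is only the shape of the calculation, where each layer that catches or absorbs errors multiplies down the residual.

```python
# Illustrative assumptions only, not measured figures; real failure modes
# are rarely independent. The shape of the arithmetic is the point.
base_error = 0.40           # 1 - 0.60 single-call success rate
caught_by_voting = 0.70     # share of errors removed by multi-sample voting
caught_by_schema = 0.80     # share of surviving bad tool calls blocked by schema checks
caught_by_handoff = 0.90    # share of what remains routed to a human instead

residual = base_error
for catch_rate in (caught_by_voting, caught_by_schema, caught_by_handoff):
    residual *= (1 - catch_rate)

print(f"delivered-wrong-answer rate ≈ {residual:.2%}")   # ≈ 0.24%
```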

The academic literature on how to compound correctness is large, and well-enough developed now that FlowStay’s architecture is, frankly, an application of it rather than an invention.

Self-consistency, from Wang et al., Self-Consistency Improves Chain of Thought Reasoning (ICLR 2023):

Sampling multiple reasoning paths and selecting the answer that receives the most votes improves GSM8K accuracy from 56% (greedy decoding) to 74% with PaLM-540B.

Ensemble and mixture-of-experts fusion, from Jiang et al., LLM-Blender (ACL 2023) and the follow-on literature:

A pairwise ranker plus generative fuser across an ensemble of LLMs consistently beats any single model in the ensemble on MixInstruct.

Retrieval-augmented grounding, canonically from Shuster et al., Retrieval Augmentation Reduces Hallucination in Conversation (EMNLP 2021) and extended in Anthropic’s 2024 Contextual Retrieval post:

Contextual embeddings plus BM25 plus reranking reduce retrieval failures by up to 67% over naïve RAG.

Hallucination measurement, from the Vectara Hallucination Leaderboard:

Frontier models hallucinate 1.5–3% even with retrieved context. Mid-tier models 5–10%. Zero is not on the chart.

Compose these. A production architecture that:

  1. Retrieves against a contextually indexed, reranked knowledge base before reasoning,
  2. Samples multiple reasoning paths per user turn,
  3. Votes / fuses across the samples with a dedicated judge model,
  4. Validates each tool call against a pre-declared schema and a property-specific guardrail policy,
  5. Falls back to a specialized alternate model or to human handoff when any of the above fails a confidence check,

can move the production end-to-end reliability number from the raw model’s 60% toward the 99%+ target. Not because any single component is perfect. Because the architecture is designed to absorb the failures of any single component.
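As a sketch, the compounding loop for a single turn can be as small as the following. Every name here is a hypothetical stand-in rather than FlowStay’s actual implementation, and the caller supplies the model-sampling and validation callables; what matters is the control flow: each stage can veto, and a veto ends in a handoff rather than a wrong answer delivered to the guest.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    kind: str                      # "tool_call" or "handoff"
    payload: dict
    reason: Optional[str] = None

def compounding_turn(
    sample: Callable[[], dict],        # one reasoning path -> a flat proposed tool call
    is_valid: Callable[[dict], bool],  # schema + guardrail check
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> Outcome:
    """Sample several reasoning paths, vote, validate, and fall back to a
    human handoff on any failed check. Hypothetical sketch, not FlowStay's code."""
    proposals = [sample() for _ in range(n_samples)]
    votes = Counter(tuple(sorted(p.items())) for p in proposals)
    winner, count = votes.most_common(1)[0]
    if count / n_samples < min_agreement:
        return Outcome("handoff", {}, reason="low self-consistency across samples")
    action = dict(winner)
    if not is_valid(action):
        return Outcome("handoff", action, reason="failed schema or guardrail check")
    return Outcome("tool_call", action)
```

In production, the sampling callable wraps an LLM call over the retrieved context, and the validation callable wraps the schema and guardrail checks described in the next section.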

This is what we mean when we say 99% is not a number, it is an architecture. The number is an output of the design. A product that claims 99% without this kind of compounding substrate is either quoting the wrong metric, or is going to fail in production in roughly the shape the McDonald’s / Air Canada / Klarna failures did.

What we built at FlowStay

Here is how FlowStay’s voice and orchestration layer is architected, at the level an operator or a serious technical evaluator should understand. None of this is a trade secret; every component is drawn from the public literature above.

1. Grounded retrieval before generation. Every voice-agent turn begins with a retrieval step against the property’s live knowledge base (rates, availability, policies, prior-stay records, open tickets) using contextual embeddings plus BM25 plus a reranker, per the Anthropic Contextual Retrieval pattern. The model does not generate an answer from pretraining. It generates against the property’s current state.
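A minimal sketch of the fusion step, assuming the dense (embedding) index and the BM25 index have already returned their own ranked candidate lists. Reciprocal rank fusion is one common way to merge the two rankings before reranking; the document IDs are hypothetical, and this is not a claim about Anthropic’s or FlowStay’s exact pipeline.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list contributes 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical hits for "late checkout policy for suite 204"
dense_hits = ["policy:late-checkout", "faq:checkout-times", "rate:suite-204"]
bm25_hits  = ["faq:checkout-times", "policy:late-checkout", "ticket:204-hvac"]

candidates = reciprocal_rank_fusion([dense_hits, bm25_hits])[:20]
# The fused candidates then go to a cross-encoder reranker; only the reranked
# top few passages are placed in the model's context for this turn.
```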

2. Multi-sample reasoning with consistency voting. For any turn that involves a tool call or a policy-sensitive response (pricing, cancellation, overrides), the system samples multiple reasoning paths and selects the answer with the strongest internal consistency, per the Wang et al. self-consistency paper. A single hallucinated reasoning path is outvoted by the reliable majority.
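One detail that matters in practice is canonicalizing the sampled outputs before counting votes, so two samples that propose the same action in slightly different surface forms are counted as agreement. A minimal sketch, with illustrative normalization rules:

```python
from collections import Counter

def canonical(action: dict) -> tuple:
    """Normalize a sampled tool call so semantically identical samples vote
    together: lowercase strings, round monetary amounts, sort the keys.
    (Illustrative normalization rules, not an exhaustive policy.)"""
    norm = {}
    for key, value in action.items():
        if isinstance(value, str):
            value = value.strip().lower()
        elif isinstance(value, float):
            value = round(value, 2)
        norm[key] = value
    return tuple(sorted(norm.items()))

samples = [
    {"tool": "quote_rate", "room_type": "Deluxe", "rate": 189.00},
    {"tool": "quote_rate", "room_type": "deluxe", "rate": 189.0},
    {"tool": "quote_rate", "room_type": "suite",  "rate": 249.0},
]
winner, count = Counter(canonical(s) for s in samples).most_common(1)[0]
print(dict(winner), f"{count}/{len(samples)} agreement")
```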

3. Mixture-of-experts fallback. When the primary model’s confidence is below a calibrated threshold for a given task class, the system routes the turn to a specialized alternate model (a reservation-specialist model, a policy-lookup model, or a billing-specific model), trained or prompted specifically on that domain. Poor single-model performance at a narrow intent is recoverable through specialization.
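A sketch of what the routing decision can look like, assuming a calibrated confidence score per turn and a per-intent routing table. The model names and thresholds below are hypothetical placeholders:

```python
# Hypothetical per-intent confidence thresholds and specialist routes.
ROUTES = {
    "reservation": {"threshold": 0.85, "specialist": "reservation-specialist-v2"},
    "billing":     {"threshold": 0.90, "specialist": "billing-specialist-v1"},
    "policy":      {"threshold": 0.80, "specialist": "policy-lookup-v3"},
}

def route_model(intent: str, primary_confidence: float, default: str = "primary") -> str:
    """Keep the turn on the primary model when its calibrated confidence clears
    the per-intent threshold; otherwise route to that intent's specialist.
    Unknown intents stay on the primary model."""
    route = ROUTES.get(intent)
    if route is None or primary_confidence >= route["threshold"]:
        return default
    return route["specialist"]

print(route_model("billing", 0.72))   # -> "billing-specialist-v1"
print(route_model("policy", 0.93))    # -> "primary"
```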

4. Schema-validated tool calls. Every tool call (create reservation, check availability, open maintenance ticket, quote rate) is validated against a pre-declared schema before it is executed. A malformed or off-policy tool call is caught before it reaches the PMS, the channel manager, or the guest’s folio. The Air Canada precedent is instructive: the chatbot said something the company had to honor. A schema-validated tool layer means the agent cannot make promises the system cannot keep.
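A minimal sketch using a JSON Schema validator. The quote_rate schema below is illustrative, not an actual FlowStay or PMS contract; the operative idea is that a tool call the schema rejects never executes, so the agent cannot promise a discount the policy ceiling forbids.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema for a "quote_rate" tool call; real schemas would be
# generated from the PMS integration's contract, not hand-written here.
QUOTE_RATE_SCHEMA = {
    "type": "object",
    "properties": {
        "room_type":    {"type": "string", "enum": ["standard", "deluxe", "suite"]},
        "check_in":     {"type": "string"},
        "nights":       {"type": "integer", "minimum": 1, "maximum": 30},
        "discount_pct": {"type": "number", "minimum": 0, "maximum": 15},
    },
    "required": ["room_type", "check_in", "nights"],
    "additionalProperties": False,
}

def safe_tool_call(args: dict) -> bool:
    """Return True only if the model-proposed arguments satisfy the schema;
    anything else is rejected before it ever reaches the PMS."""
    try:
        validate(instance=args, schema=QUOTE_RATE_SCHEMA)
        return True
    except ValidationError:
        return False

# A hallucinated 40% discount fails the policy ceiling and never executes:
print(safe_tool_call({"room_type": "suite", "check_in": "2026-03-02",
                      "nights": 2, "discount_pct": 40}))  # False
```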

5. Human-in-the-loop handoff. Any turn that fails the confidence check across steps 1-4 is routed, with full context, to a human on the property’s team. Not a cold transfer. A warm handoff with the full conversation state, the retrieved context, and the specific failure signal. The receptionist picks up at the exact point the system lost certainty.
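What a warm handoff with full context can mean in concrete terms, sketched as a data structure. Field names are illustrative, not FlowStay’s actual wire format:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    """Sketch of what a warm handoff carries to the human agent: the full
    state, not just "transferring you now". Illustrative fields only."""
    conversation: list[dict]        # every turn so far, both sides
    retrieved_context: list[str]    # knowledge-base passages the agent used
    attempted_action: dict          # the tool call (if any) that was proposed
    failure_signal: str             # which check failed and why
    tags: list[str] = field(default_factory=list)

packet = HandoffPacket(
    conversation=[{"role": "guest", "text": "I need to move my booking to Friday"}],
    retrieved_context=["policy:date-change", "reservation:#48210"],
    attempted_action={"tool": "modify_reservation", "new_date": "2026-03-06"},
    failure_signal="low self-consistency across sampled reasoning paths",
)
```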

6. Continuous evaluation. Every production conversation is logged, sampled, and graded against a task-completion rubric, with failure cases fed back into the retrieval index, the guardrail policy, and the prompt scaffolding. The system improves over time, per-property, because the evaluation pipeline runs continuously, not as a quarterly audit.
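A minimal sketch of the sampling-and-grading loop, with a hypothetical three-item rubric. In practice each criterion is judged by a human reviewer or a judge model; the stub here only shows how sampled conversations and their failed criteria become feedback items:

```python
import random

# Hypothetical rubric; each criterion is judged by a reviewer or a judge model.
RUBRIC = ("task_completed", "no_policy_violation", "no_unsupported_claim")

def sample_for_review(conversations: list[dict], rate: float = 0.05,
                      seed: int = 0) -> list[dict]:
    """Pull a reproducible random sample of logged conversations for grading."""
    rng = random.Random(seed)
    return [c for c in conversations if rng.random() < rate]

def grade(conversation_id: str, judgments: dict[str, bool]) -> dict:
    """Attach rubric judgments; every failed criterion becomes a feedback item
    for the retrieval index, the guardrail policy, or the prompt scaffolding."""
    failures = [c for c in RUBRIC if not judgments.get(c, False)]
    return {"id": conversation_id, "passed": not failures, "failures": failures}

print(grade("conv-0412", {"task_completed": True, "no_policy_violation": True,
                          "no_unsupported_claim": False}))
# {'id': 'conv-0412', 'passed': False, 'failures': ['no_unsupported_claim']}
```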

The result is not a claim that any single LLM call is 99% reliable. Neither OpenAI nor Anthropic nor Google makes that claim, and the public benchmarks say they would be wrong to. The result is that the architecture, composed from the models and the scaffolding around them, is engineered to hit a 99%+ task-completion target in production, and to degrade gracefully, into a human handoff, when it cannot.

The honest SLO framing

One more piece of honesty. We are calling 99% a reliability target, not a measured guarantee. In our current pilot deployments, our internal task-completion metric is in the high 90s on end-to-end completion, with the remaining low-single-digit share of failures resolving cleanly to human handoff. That number moves week-to-week as the evaluation corpus grows and new intents appear. We will publish a production SLO, with sample size and methodology, when the cohort is large enough to support one.

The difference between what we are building and what the McDonald’s / Air Canada / Klarna incidents had in common is not a capability gap. It is an architecture gap. All three deployed single-model or lightly-scaffolded systems into high-stakes production surfaces and paid, respectively, with a viral nugget incident, a legal precedent, and a 700-agent rehiring.

A hospitality operator reading this should ask the same question of any AI vendor they are evaluating: not what does your model score, but what does your architecture do when the model fails. If the answer is “it doesn’t,” or “we hand the guest a confused response,” the deployment is one viral moment away from the McDonald’s scenario. If the answer is a specific compounding architecture (retrieval, voting, ensemble fallback, schema validation, human-in-the-loop), the deployment has a chance.

Reliability is a design choice, not an aspiration

There is a tendency, in AI product marketing, to treat reliability as a quality that emerges with model improvement. As in: the next model will be better, so the production system will get better. This is true, in a narrow sense. Frontier models have improved. The Berkeley Function-Calling Leaderboard does move up.

It is also insufficient. The operational reliability of a production hospitality voice system does not come from the model. It comes from the system around the model: the retrieval, the voting, the ensembles, the guardrails, the schemas, the human handoff. Every one of those components has to be designed, built, measured, and maintained.

This is why we believe the most important thing FlowStay builds is not the voice agent, or the concierge, or the callback system. It is the reliability architecture those products sit inside. We spend most of our engineering calories there.

The guest who calls at 8:47pm on a Tuesday does not know the difference between a single-model deployment and a compounded architecture. She will know, within one turn, whether the system understood her, whether it told her the truth, and whether it did what it said it would do. Every one of those three judgments is downstream of the architecture, not the model.

The industry will spend 2026 sorting which vendors built for the benchmark and which built for the architecture. The failures will be loud; the successes, quiet. Hospitality operators evaluating AI this year should ask the architecture question on the first call. If the vendor does not have a clean answer about what happens when the model fails, the product is not ready for 2026 hospitality. It is still in 2023.

Ninety-nine percent is not a number. It is a design discipline. We are building it, in public, one property at a time.

Sources

  1. τ-bench: A Benchmark for Tool-Agent-User Interaction, Yao et al., Sierra (2024)
  2. Berkeley Function-Calling Leaderboard v3, UC Berkeley
  3. Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., ICLR (2023)
  4. Introducing Contextual Retrieval, Anthropic (2024)
  5. Hallucination Leaderboard, Vectara
  6. Air Canada chatbot ruling, BBC (2024)
  7. Klarna pivots back to human customer service, Fortune (2025)
  8. McDonald's ends IBM drive-thru AI partnership, CNN Business (2024)
  9. Gartner predicts 30% of generative AI projects will be abandoned after proof of concept, Gartner (2024)