The Prototype Gap: How to build Endgame yourself

November 7, 2025

Alex Bilmes

Most Endgame demos end with the same question:

“This is cool. But couldn’t we just build this ourselves in GPT or Claude?”

You could. In some cases, you should. Start with a prototype. Upload a few Gong transcripts into ChatGPT, add custom instructions for deal reviews, and get helpful insights. Go a step further with n8n or Zapier to build automated workflows. It’s fast, inexpensive, and works well for a few deals or opportunities.

Plus, it’s easier than ever to build prototypes. Vibe-coding products like Replit and Lovable have exploded over the past two years, fueling our confidence in what we think we can build. But the more time we spend building, the more we uncover challenges that one-shot prototyping can’t solve.

This is an old pattern with a modern name: the Dunning–Kruger effect. With limited experience, we tend to overestimate our ability until deeper complexities emerge. The same dynamic plays out today with AI GTM prototyping.

BetterUp ran into this firsthand. Their GTM teams began experimenting with ChatGPT and Claude, and custom projects and GPTs proliferated across the organization. After the excitement of tinkering wore off, they ran into challenges. As Austin Johnsey, Director of GTM Systems, shared:

“I got on Claude, started connecting a few things, querying about one person or one account. It worked decently well. Then I tried to look at someone’s entire book of business and Claude just crashed.”

The pattern repeated across the organization. Teams would get something working for a single use case, then hit architectural limits when trying to scale. Leadership wanted book-level risk analysis. Account executives managing 25–50 accounts and 20+ renewals each needed full visibility into blind spots. Single-use workflows worked, but scaling them didn’t.

"People were just learning what an MCP server was, and most couldn't even talk about it. Meanwhile, everybody was fully allocated to other projects," notes Austin. Without Salesforce data, they couldn't generate account briefs. Adding Slack and call transcripts was the next logical step, but the architecture wasn’t there.

Prototyping is easy. Production-grade infrastructure is hard. You need emails, Slack threads, Salesforce records, company news, 10-Ks. You want consistent deal insights, governance, and the ability to ask questions across your entire book of business.

And that’s the prototype gap.



This is the harsh gap between tinkering with GPT project files and building production-grade data infrastructure. It’s also why 95% of enterprise AI projects fail: not because the AI isn’t capable, but because teams underestimate what it takes to move from prototype to production.

But what dangers lurk in that canyon? And what would it actually take to build Endgame yourself? A few weeks ago, the CTO of a fast-growing 800-person software company put the big question to his Head of RevOps:

“Can’t we just build this ourselves?”

So we partnered with their RevOps lead to clearly lay out the complexity ahead and what they’d need to build to cross the canyon.

  1. Data infrastructure. You need to connect every data source—Salesforce, Slack, Google Drive, Gong, email—and manage authentication, rate limits, and permissions. Without proper data pipelines, the system collapses under its own API calls.

  2. Retrieval. You must make that data searchable, context-aware, and properly governed. That means setting up a vector database to store the appropriate embeddings (assuming you are already experts in vectors and embeddings) to make the content useful for LLMs, then layering on the metadata you’ll need for access control & data governance.

  3. Reasoning. The AI needs orchestration logic: model routing, fine-tuning, guardrails, and multi-agent coordination that mimic how humans research, synthesize, and respond.

  4. Interface. Reps won’t use something that breaks their workflow. You need UX layers that live inside Slack, maintain context, and respond instantly.

  5. Evaluation. Production systems require observability—tracking accuracy, latency, costs, and user feedback in real time to keep improving.

The CTO’s answer after reviewing requirements: “No f***** way I’m building all that. Fine, go ahead.”

That moment is the catalyst nearly every time. It’s why teams like Monte Carlo, Hex, BetterUp, Scale, Accuris, and Benchling chose to bet on Endgame. They recognized the deep complexity of building production-grade infrastructure for revenue superintelligence.

That said, if you’re still excited about building Endgame on your own, here’s a transparent, in-depth look at how we built it from scratch—shared directly by our CTO Kyle Wild.

How to rebuild revenue superintelligence on your own

Before anything else, you need the data foundation. This is the first layer of the canyon, where most teams realize the real work isn’t in prompting models but in moving, securing, and structuring information from dozens of disconnected systems.

Layer 1: Data infrastructure

You need to connect every data source, including Salesforce, Slack, Google Drive, Gong, and email. But you can’t query these directly: agents hitting Salesforce’s API constantly will burn through rate limits and collapse the system.

What this means:

  • Building data connectors for each source with authentication

  • Handling API retry logic and quota limits (every API works differently)

  • Moving data into your own storage layer

  • Making it work reliably at scale

  • Implementing access controls and permissions at the data layer
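To make the retry-and-quota point concrete, here is a minimal sketch of the backoff wrapper every connector needs. The names (`RateLimitError`, `fetch_with_backoff`) are illustrative, not any particular vendor’s API; real Salesforce or Gong clients surface rate limits differently, but the shape is the same.

```python
import random
import time

class RateLimitError(Exception):
    """Raised by a connector when the upstream API returns HTTP 429."""

def fetch_with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry an API call with exponential backoff and jitter.

    `call` is any zero-argument function that either returns data or
    raises RateLimitError. Every connector (Salesforce, Gong, Slack...)
    gets wrapped this way so one noisy agent can't collapse the system.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus noise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

Moving data into your own storage layer means this wrapper runs in background sync jobs, not in the agent’s request path, which is what keeps live queries off the vendor’s rate limits.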

You also need entity resolution: creating clean representations of accounts, deals, and people from messy source data. One person might have 7 Salesforce contacts: which is the real one? You need to verify identities across different systems (LinkedIn, company databases, internal records) and resolve duplicates.
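A toy version of that resolution step, assuming duplicates share a canonical email (real systems also fuzzy-match names, domains, and external identifiers); all record fields and ids here are made up:

```python
from collections import defaultdict

def normalize_email(email):
    """Canonicalize an email: lowercase, strip '+tag' aliases."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+")[0]
    return f"{local}@{domain}"

def resolve_contacts(contacts):
    """Group raw CRM contact records into canonical people.

    `contacts` is a list of dicts with 'id', 'name', 'email'.
    Returns one merged record per canonical email, keeping all
    source ids so every merge stays traceable back to raw data.
    """
    groups = defaultdict(list)
    for c in contacts:
        groups[normalize_email(c["email"])].append(c)
    resolved = []
    for email, dupes in groups.items():
        resolved.append({
            "email": email,
            # Prefer the longest name string as most complete
            "name": max((c["name"] for c in dupes), key=len),
            "source_ids": [c["id"] for c in dupes],
        })
    return resolved
```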

55% of companies attempting GenAI run into data-related issues, making data quality the biggest inhibitor. While LLMs handle messy inputs better than traditional systems, poor quality still poses serious risks.

Why this is hard: This isn’t “AI work”; it’s data engineering, and that’s the part most companies underestimate. Most of Endgame’s team are senior data engineers with deep analytical backgrounds and experience.

Layer 2: Retrieval and search

This makes your data searchable and relevant: finding the right information from millions of documents in milliseconds while simultaneously respecting permissions.

What this means:

  • Breaking large files into manageable pieces that AI can process

  • Provisioning & populating vector databases for semantic search

  • Query decomposition, breaking complex questions into searchable parts

  • Reranking to prioritize the most relevant information

Think of retrieval as a smart research assistant. It doesn’t just search: it understands what’s being asked, where to look, and how to assemble the right materials.

When a question comes in, it breaks the question into parts, scans thousands of documents, and pulls out only what’s relevant. For simple queries, you get direct answers. For more complex ones, it gathers supporting evidence from multiple sources, compares perspectives, and builds a well-structured response, citing where every fact came from.

Behind the scenes, every document is embedded and given a semantic fingerprint. This means the system can instantly retrieve contextually similar information, without reading everything line by line. That’s what modern retrieval systems do: understand the question, surface the right context, and make sure it’s both fast and grounded in truth.
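The semantic-search core of that pipeline can be sketched in a few lines. This is a stand-in for a real vector database: embeddings are hand-written toy vectors, and the `visible` metadata flag is a hypothetical placeholder for the permission filtering described above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """Rank documents by embedding similarity.

    `index` maps doc id -> (embedding, metadata). Real systems add
    a cross-encoder reranking pass and enforce access control with
    metadata filters before anything reaches the model.
    """
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, (vec, meta) in index.items()
        if meta.get("visible", True)  # permission check via metadata
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```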

Why this is hard: Basic retrieval is easy; just search and return results. Retrieval at scale requires continuous evaluation to ensure accuracy and that AI responses stay faithful to source material. You need frameworks measuring whether retrieved information actually answers the question and whether the AI’s response matches what the sources say.

Layer 3: Reasoning and orchestration

This is where the AI lives: the brain that generates responses grounded in your data, with guardrails to keep it on track.

What this means:

  • Model selection, planning, and routing (which models for retrieval vs. composition)

  • Fine-tuning on sales playbooks, pitch decks, objection-handling scripts

  • Custom tools for sales tasks: account research, deal analysis, competitive intelligence

  • Multi-agent coordination, with different agents for different jobs

  • Memory systems managing context across conversations

  • Guardrails ensuring factuality, compliance, sales-appropriate tone

  • Hallucination detection and citation tracking

  • Optimizing for roughly 1,000 different types of sales questions

The hard part isn’t the AI models; those are commoditized and getting better every week. The difficulty is building tools that work together effectively and coordinating specialized agents.

Think of it like a newsroom: some reporters gather facts from the field, editors check accuracy, and producers turn it all into a coherent story. You want some agents gathering information: pulling data from Salesforce, reading emails, checking news sources, finding relevant documents. Other agents compose that information: taking all those facts and creating well-structured, digestible summaries.
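The newsroom shape can be sketched as a fan-out/compose loop. This is deliberately simplified: each "agent" here is a plain function, where production agents would be LLM calls with their own tools, memory, and guardrails. All names are illustrative.

```python
def orchestrate(question, gatherers, composer):
    """Fan a question out to gathering agents, then compose one answer.

    `gatherers` maps source name -> function(question) -> list of facts.
    `composer` turns the labeled evidence into a final answer, so every
    claim in the output stays attributable to the source that found it.
    """
    evidence = {}
    for source, gather in gatherers.items():
        facts = gather(question)
        if facts:  # keep only sources that found something relevant
            evidence[source] = facts
    return composer(question, evidence)
```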

Each type of question needs its own optimization. Some queries are good and fast. Some are good but slow (requiring optimization of how data is structured). Some produce poor results (requiring fundamental rework). A grid of performance versus speed determines where engineering effort goes.

Why this is hard: This requires deep sales domain expertise in combination with technical complexity. You need to understand sales workflows, methodologies, and what good answers look like.

Layer 4: Interface and user experience

How sales teams actually interact with the system in their daily workflows.

What this means:

  • Making interactions fast (reps won’t use slow tools)

  • Providing the right abstraction level: simple for users, complex underneath

  • Implementing session management and context persistence

  • Slack integrations and email reminders, so reps can access where they work

  • MCP servers so other teams can also interface with the intelligence

  • Natural-language chat and Q&A interfaces

  • Exporting into CSVs, Google Drive, etc.

  • Search-plus-synthesis blending citations with narrative answers
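Session management and context persistence, the least visible item on that list, is what makes a follow-up like “what about the renewal?” resolve correctly in a Slack thread. A minimal sketch, with a hypothetical `answer_fn` standing in for the model call:

```python
class Session:
    """Minimal conversation session with a rolling context window.

    Keeps the last `max_turns` exchanges so follow-up questions
    resolve against earlier context, the way a Slack thread would.
    """
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.history = []

    def ask(self, question, answer_fn):
        # The model sees prior turns, not just the latest question
        context = list(self.history)
        answer = answer_fn(question, context)
        self.history.append((question, answer))
        # Trim to the newest turns to bound context size
        self.history = self.history[-self.max_turns:]
        return answer
```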

Why this is hard: Poor UX kills adoption regardless of how good the AI is. The interface needs to fit naturally into existing workflows rather than forcing reps to change how they work.

Layer 5: Evaluation and observability

Production systems need continuous monitoring to ensure they’re working correctly.

What this means:

  • Building provenance systems that verify claims are attributed to sources

  • Deploying observability tools for tracing, monitoring, cost tracking

  • Running continuous evaluations across accuracy, safety, performance, user experience

  • Collecting user feedback (upvotes/downvotes, corrections)

  • Model retraining and improvement based on real-world signals

  • A/B testing for retrieval strategies and reasoning approaches

Think of this like a quality-control inspector on an assembly line: constantly checking if the output is good, catching problems before they ship, and tracking what’s getting better or worse over time.

Production metrics that matter: request latency, error rates, token costs per query, hallucination frequency, user satisfaction scores. Alert systems for quality degradation.
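A stripped-down sketch of that monitoring loop, assuming per-request records and a simple error-rate alert (real systems export these to a tracing backend and track many more dimensions):

```python
class QualityMonitor:
    """Track per-request metrics and flag quality degradation.

    Records latency, success, and user feedback, and raises an alert
    when the error rate crosses a threshold.
    """
    def __init__(self, error_threshold=0.2):
        self.error_threshold = error_threshold
        self.requests = []

    def record(self, latency_ms, ok, feedback=None):
        self.requests.append(
            {"latency_ms": latency_ms, "ok": ok, "feedback": feedback}
        )

    def error_rate(self):
        if not self.requests:
            return 0.0
        return sum(1 for r in self.requests if not r["ok"]) / len(self.requests)

    def should_alert(self):
        return self.error_rate() > self.error_threshold
```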

Why this is hard: Without proper evaluation infrastructure, systems degrade over time without visibility into why. You’re flying blind. Building closed-loop systems that learn from outcomes requires sophisticated data pipelines connecting AI usage back to business results.

Context engineering: The bridge to revenue superintelligence


If the five layers form the canyon, context engineering is the bridge that connects them. Andrej Karpathy describes it as “the delicate art and science of filling the context window with just the right information for the next step.”

Context engineering isn’t about clever, repeatable prompts. It’s about designing systems that automatically know what information to bring to the model, when to bring it, and how much of it is relevant. Production systems need to manage thousands of interactions while respecting permissions, handling versioning, and staying in sync with real-time data.
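The core decision, what fits in the window, can be sketched as a greedy packing problem. This toy approximates tokens as whitespace-delimited words; real systems use the model’s tokenizer and far richer relevance signals.

```python
def build_context(snippets, budget):
    """Fill a context window with the most relevant snippets that fit.

    `snippets` is a list of (relevance_score, text); `budget` is a rough
    token allowance. Greedy by relevance: the heart of context
    engineering is deciding what to include, not how to phrase a prompt.
    """
    chosen, used = [], 0
    for score, text in sorted(snippets, reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)
```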

Models have become astonishingly good: longer context windows, sharper reasoning, fewer hallucinations. But the infrastructure that feeds them context hasn’t kept up. Every company that tries “building with AI” eventually runs into this wall. The problem isn’t whether the model can think—it’s whether your systems can keep it informed.

This is why context engineering differentiates between a clever prototype and a production-grade platform. It’s not a few prompt tweaks. It’s an architectural discipline that unites data pipelines, retrieval frameworks, and observability loops into a single system.

So should you still build prototypes? Yes, you absolutely should. But keep in mind that you’ll eventually find yourself standing at the edge of the canyon—acutely aware of the deep, messy infrastructure, riddled with context engineering challenges, that you would slog through to build it yourself.

We’ve already built the bridge. All you have to do is walk across it.

Edited by Scott Tousley at Margin

Our newsletter

Subscribe to our blog to learn how the best sales teams are using AI to improve their performance

© 2025 Endgame. Automate deep research and prep with AI—100x faster