The Prototype Gap: How to build Endgame yourself
November 7, 2025
Alex Bilmes
Most Endgame demos end with the same question:
“This is cool. But couldn’t we just build this ourselves in GPT or Claude?”
You could. In some cases, you should. Start with a prototype. Upload a few Gong transcripts into ChatGPT, add custom instructions for deal reviews, and get helpful insights. Go a step further with n8n or Zapier to build automated workflows. It’s fast, inexpensive, and works well for a few deals or opportunities.
Plus, it’s easier than ever to build prototypes. Vibe-coding products like Replit and Lovable have exploded over the past two years, fueling our confidence in what we think we can build. But the more time we spend building, the more we uncover challenges that one-shot prototyping can’t solve.
This is an old human pattern with a modern name: the Dunning–Kruger effect. With limited experience, we tend to overestimate our ability until deeper complexities emerge. The same curve plays out today in AI GTM prototyping.

BetterUp ran into exactly this scenario: the false hope of AI prototypes. Their GTM teams began experimenting with ChatGPT and Claude. Custom projects and GPTs proliferated across the organization. After the excitement of tinkering wore off, the challenges surfaced. As Austin Johnsey, Director of GTM Systems, shared:
"I got on Claude, started connecting a few things, querying about one person or one account. It worked decently well. Then I tried to look at someone's entire book of business and Claude just crashed.”
The pattern repeated across the organization. Teams would get something working for a single use case, then hit architectural limits when trying to scale. Leadership wanted book-level risk analysis. Account executives managing 25–50 accounts and 20+ renewals each needed full visibility into blind spots. Single-use workflows worked, but scaling them didn’t.
"People were just learning what an MCP server was, and most couldn't even talk about it. Meanwhile, everybody was fully allocated to other projects," notes Austin. Without Salesforce data, they couldn't generate account briefs. Adding Slack and call transcripts was the next logical step, but the architecture wasn’t there.
Prototyping is easy. Production-grade infrastructure is hard. You need emails, Slack threads, Salesforce records, company news, 10-Ks. You want consistent deal insights, governance, and the ability to ask questions across your entire book of business.
And that’s the prototype gap.

This is the harsh gap between tinkering with GPT project files and building production-grade data infrastructure. It’s also why 95% of enterprise AI projects fail: not because the AI isn’t capable, but because teams underestimate what it takes to move from prototype to production.
But what shark-infested waters lurk at the bottom of that canyon? And what would it actually take to build Endgame yourself? A few weeks ago, the CTO of a fast-growing 800-person software company put the big question to his Head of RevOps:
“Can’t we just build this ourselves?”
So we partnered with their RevOps lead to clearly lay out the complexity ahead and what they’d need to build to cross the canyon.
Data infrastructure. You need to connect every data source—Salesforce, Slack, Google Drive, Gong, email—and manage authentication, rate limits, and permissions. Without proper data pipelines, the system collapses under its own API calls.
Retrieval. You must make that data searchable, context-aware, and properly governed. That means setting up a vector database to store the right embeddings (assuming you’re already fluent in vectors and embeddings) so the content is useful to LLMs, then layering on the metadata you’ll need for access control and data governance.
Reasoning. The AI needs orchestration logic: model routing, fine-tuning, guardrails, and multi-agent coordination that mimic how humans research, synthesize, and respond.
Interface. Reps won’t use something that breaks their workflow. You need UX layers that live inside Slack, maintain context, and respond instantly.
Evaluation. Production systems require observability—tracking accuracy, latency, costs, and user feedback in real time to keep improving.
The CTO’s answer after reviewing the requirements: “No f***** way I’m building all that. Fine, go ahead.”
That moment is the catalyst nearly every time. It’s why teams like Monte Carlo, Hex, BetterUp, Scale, Accuris, and Benchling chose to bet on Endgame. They recognized the deep complexity of the production-grade infrastructure that revenue superintelligence requires:

That said, if you’re still excited about building Endgame on your own, here’s a transparent, in-depth look at how we built it from scratch—shared directly by our CTO Kyle Wild.
How to rebuild revenue superintelligence on your own
Before anything else, you need the data foundation. This is the first layer of the canyon, where most teams realize the real work isn’t in prompting models but in moving, securing, and structuring information from dozens of disconnected systems.
Layer 1: Data infrastructure
You need to connect every data source, including Salesforce, Slack, Google Drive, Gong, and email. But you can’t query these directly: agents hitting Salesforce’s API constantly will burn through rate limits and collapse the system.
What this means:
Building data connectors for each source with authentication
Handling API retry logic and quota limits (every API works differently; see the sketch after this list)
Moving data into your own storage layer
Making it work reliably at scale
Implementing access controls and permissions at the data layer
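To make that concrete, here’s a minimal sketch of the retry-and-backoff pattern every connector needs. It assumes a generic REST source that signals throttling with HTTP 429 and a Retry-After header in seconds; Salesforce, Gong, and Slack each differ in the details, so treat this as an illustration rather than a production connector.

```python
import random
import time

import requests

def fetch_with_backoff(url: str, headers: dict, max_retries: int = 5) -> dict:
    """Fetch one page from a source API, backing off when rate-limited.

    Assumes the source signals throttling with HTTP 429 plus an optional
    Retry-After header (in seconds); every real API differs in the details.
    """
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 429:
            # Respect the server's hint if present, else back off
            # exponentially with jitter to avoid thundering herds.
            wait = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait + random.uniform(0, 1))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Now multiply that by every source, each with its own auth flow, pagination scheme, and quota rules, and schedule it all so your storage layer stays fresh without hammering the APIs.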
You also need entity resolution: creating clean representations of accounts, deals, and people from messy source data. One person might have 7 Salesforce contacts: which is the real one? You need to verify identities across different systems (LinkedIn, company databases, internal records) and resolve duplicates.
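The matching decision itself can start small. Here’s a toy scorer (the field names and the 0.9 threshold are illustrative); production entity resolution blends many more signals, like email domains, LinkedIn URLs, and CRM activity, and has to resolve matches transitively across systems.

```python
from difflib import SequenceMatcher

def same_person(a: dict, b: dict) -> bool:
    """Toy heuristic for matching two contact records from different systems."""
    # An exact email match is the strongest signal available here.
    email_a, email_b = (a.get("email") or "").lower(), (b.get("email") or "").lower()
    if email_a and email_a == email_b:
        return True
    # Otherwise fall back to fuzzy name similarity within the same company.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_company = a.get("company", "").lower() == b.get("company", "").lower()
    return name_sim > 0.9 and same_company
```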
55% of companies attempting GenAI run into data-related issues, making data quality the biggest inhibitor. While LLMs handle messy inputs better than traditional systems, poor quality still poses serious risks.
Why this is hard: This isn’t “AI work”; it’s data engineering, and that’s the part most companies underestimate. Most of Endgame’s team are senior data engineers with deep analytical backgrounds and experience.
Layer 2: Retrieval and search
This layer makes your data searchable and relevant: finding the right information from millions of documents in milliseconds while respecting permissions.
What this means:
Breaking large files into manageable pieces that AI can process
Provisioning & populating vector databases for semantic search
Query decomposition, breaking complex questions into searchable parts
Reranking to prioritize the most relevant information
Think of retrieval as a smart research assistant. They don’t just search: they understand what’s being asked, where to look, and how to assemble the right materials.
When a question comes in, they break it down into parts, scan thousands of documents, and pull out only what’s relevant. For simple queries, you get direct answers. For more complex ones, they gather supporting evidence from multiple sources, compare perspectives, and build a well-structured response, citing where every fact came from.
Behind the scenes, every document is embedded and given a semantic fingerprint. This means the system can instantly retrieve contextually similar information, without reading everything line by line. That’s what modern retrieval systems do: understand the question, surface the right context, and make sure it’s both fast and grounded in truth.
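Here’s a deliberately toy version of that flow. The embed function below is a stand-in hash rather than a real embedding model, and the permission check is a single set lookup; production retrieval adds chunking, query decomposition, reranking, and an actual vector database on top.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash words into a small vector so the sketch runs.
    A real system calls an embedding model and stores results in a vector DB."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def retrieve(question: str, docs: list[dict], allowed_sources: set, top_k: int = 3) -> list[dict]:
    """Embed the question, filter by permissions, rank by cosine similarity."""
    q = embed(question)
    visible = [d for d in docs if d["source"] in allowed_sources]
    return sorted(visible, key=lambda d: float(q @ embed(d["text"])), reverse=True)[:top_k]

docs = [
    {"text": "Renewal risk flagged on Acme after pricing objection", "source": "gong"},
    {"text": "Acme legal review of the DPA is still pending", "source": "email"},
]
print(retrieve("what are the risks on Acme?", docs, allowed_sources={"gong", "email"}))
```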
Why this is hard: Basic retrieval (search, return results) is easy. Retrieval at scale requires continuous evaluation to ensure accuracy and to keep AI responses faithful to source material. You need frameworks that measure whether retrieved information actually answers the question and whether the AI’s response matches what the sources say.
Layer 3: Reasoning and orchestration
This is where the AI lives: the brain that generates responses grounded in your data, with guardrails to keep it on track.
What this means:
Model selection, planning, and routing (which models for retrieval vs. composition)
Fine-tuning on sales playbooks, pitch decks, objection-handling scripts
Custom tools for sales tasks: account research, deal analysis, competitive intelligence
Multi-agent coordination, with different agents for different jobs
Memory systems managing context across conversations
Guardrails ensuring factuality, compliance, sales-appropriate tone
Hallucination detection and citation tracking
Optimizing for roughly 1,000 different types of sales questions
The hard part isn’t the AI models; those are commoditized and getting better every week. The difficulty is in building tools that work together effectively and in coordinating specialized agents.
Think of it like a newsroom: some reporters gather facts from the field, editors check accuracy, and producers turn it all into a coherent story. You want some agents gathering information: pulling data from Salesforce, reading emails, checking news sources, finding relevant documents. Other agents compose that information: taking all those facts and creating well-structured, digestible summaries.
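In code, the newsroom pattern might look like the sketch below. The gather agents here are stubs returning placeholder strings; in a real system each would wrap an LLM call plus the connectors and retrieval from Layers 1 and 2, and the composer would add guardrails, citations, and tone control.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[str], str]  # in production, wraps an LLM call plus tools

# Stub gather agents: one "reporter" per system of record.
gatherers = [
    Agent("crm", lambda q: f"[CRM facts about: {q}]"),
    Agent("email", lambda q: f"[email threads about: {q}]"),
    Agent("news", lambda q: f"[recent news about: {q}]"),
]

def compose(question: str, notes: list[str]) -> str:
    """The 'editor': in production, an LLM call with guardrails and citations."""
    return f"Answer to {question!r} built from {len(notes)} sources:\n" + "\n".join(notes)

def answer(question: str) -> str:
    # Newsroom flow: reporters gather independently, an editor composes.
    notes = [agent.run(question) for agent in gatherers]
    return compose(question, notes)

print(answer("Why is the Acme renewal at risk?"))
```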
Each type of question needs its own optimization. Some queries return good answers fast. Some return good answers slowly (requiring optimization of how the data is structured). Some produce poor results (requiring fundamental rework). A grid of quality versus speed determines where engineering effort goes.
Why this is hard: This requires deep sales domain expertise in combination with technical complexity. You need to understand sales workflows, methodologies, and what good answers look like.
Layer 4: Interface and user experience
How sales teams actually interact with the system in their daily workflows.
What this means:
Making interactions fast (reps won’t use slow tools)
Providing the right abstraction level: simple for users, complex underneath
Implementing session management and context persistence
Slack integrations and email reminders, so reps can access intelligence where they already work (see the sketch after this list)
MCP servers so other teams can also interface with the intelligence
Natural-language chat and Q&A interfaces
Exporting into CSVs, Google Drive, etc.
Search-plus-synthesis blending citations with narrative answers
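As one example, here is roughly what the Slack entry point could look like with Slack’s Bolt SDK. ask_endgame is a placeholder for the retrieval and reasoning layers above, and the handler is simplified: no deduplication, streaming, or error states.

```python
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

def ask_endgame(question: str, user: str) -> str:
    """Placeholder for the full retrieval + reasoning pipeline."""
    return f"(answer for <@{user}>: {question})"

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    # Reply in-thread so the rep keeps context where the deal discussion lives.
    say(text=ask_endgame(event["text"], user=event["user"]), thread_ts=event["ts"])

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```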
Why this is hard: Poor UX kills adoption regardless of how good the AI is. The interface needs to fit naturally into existing workflows rather than forcing reps to change how they work.
Layer 5: Evaluation and observability
Production systems need continuous monitoring to ensure they’re working correctly.
What this means:
Building provenance systems that verify claims are attributed to sources
Deploying observability tools for tracing, monitoring, cost tracking
Running continuous evaluations across accuracy, safety, performance, user experience
Collecting user feedback (upvotes/downvotes, corrections)
Model retraining and improvement based on real-world signals
A/B testing for retrieval strategies and reasoning approaches
Think of this like a quality-control inspector on an assembly line: constantly checking if the output is good, catching problems before they ship, and tracking what’s getting better or worse over time.
Production metrics that matter: request latency, error rates, token costs per query, hallucination frequency, user satisfaction scores. Alert systems for quality degradation.
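A first cut can be as simple as wrapping every AI call in telemetry. The sketch below only logs latency and errors to stdout; a real system would also pull token usage and cost from the model response, attach user feedback, and ship everything to a metrics store with alerting.

```python
import json
import time
from functools import wraps

def observe(query_fn):
    """Wrap an AI query function with the per-request telemetry Layer 5 needs."""
    @wraps(query_fn)
    def wrapped(question: str, **kwargs):
        start = time.time()
        error = None
        try:
            return query_fn(question, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            # Stand-in for a metrics pipeline; production would also record
            # token counts, cost per query, and citation/hallucination checks.
            print(json.dumps({
                "question": question,
                "latency_s": round(time.time() - start, 3),
                "error": error,
            }))
    return wrapped

@observe
def ask(question: str) -> str:
    return f"(answer to: {question})"  # placeholder for the full pipeline

ask("Which renewals are at risk this quarter?")
```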
Why this is hard: Without proper evaluation infrastructure, systems degrade over time without visibility into why. You’re flying blind. Building closed-loop systems that learn from outcomes requires sophisticated data pipelines connecting AI usage back to business results.
Context engineering: The bridge to revenue superintelligence

If the five layers form the canyon, context engineering is the bridge that connects them. Andrej Karpathy describes it as “the delicate art and science of filling the context window with just the right information for the next step.”
Context engineering isn’t about clever, repeatable prompts. It’s about designing systems that automatically know what information to bring to the model, when to bring it, and how much of it is relevant. Production systems need to manage thousands of interactions while respecting permissions, handling versioning, and staying in sync with real-time data.
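A minimal sketch of that selection step, assuming retrieval has already scored candidate snippets: the tokens-per-character estimate below is a crude heuristic standing in for a real tokenizer, and a production version also handles permissions, versioning, and freshness here.

```python
def build_context(candidates: list[dict], budget_tokens: int = 8000) -> str:
    """Pack the highest-value snippets into a fixed context budget."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    parts, used = [], 0
    for c in ranked:
        cost = len(c["text"]) // 4  # rough tokens-per-character estimate
        if used + cost > budget_tokens:
            continue  # skip rather than truncate a snippet mid-thought
        parts.append(f"[{c['source']} | {c['date']}] {c['text']}")
        used += cost
    return "\n\n".join(parts)

snippets = [
    {"score": 0.92, "text": "Acme flagged pricing risk on the Q4 renewal call.",
     "source": "gong", "date": "2025-10-28"},
    {"score": 0.74, "text": "Legal review of the Acme DPA is still pending.",
     "source": "email", "date": "2025-11-01"},
]
print(build_context(snippets, budget_tokens=200))
```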
Models have become astonishingly good: longer context windows, sharper reasoning, fewer hallucinations. But the infrastructure that feeds them context hasn’t kept up. Every company that tries “building with AI” eventually runs into this wall. The problem isn’t whether the model can think—it’s whether your systems can keep it informed.
This is why context engineering differentiates between a clever prototype and a production-grade platform. It’s not a few prompt tweaks. It’s an architectural discipline that unites data pipelines, retrieval frameworks, and observability loops into a single system.
So should you still build prototypes? Yes, you absolutely should. But keep in mind that you’ll eventually find yourself standing at the edge of the canyon, staring down at the deep, messy infrastructure and context engineering challenges you’d have to slog through to build it yourself.
We’ve already built the bridge. All you have to do is walk across it.
—
Edited by Scott Tousley at Margin