ASOC: Outbound Sales as a Substrate for Agent Research
A research note on what outbound sales teaches the field about production agent systems.
DataFrontier Team — Ongoing research — ~16 minute read
ASOC is operated by BeyondCodes. The architectural research described here is conducted by DataFrontier.
Agent benchmarks remain stuck measuring task completion in toy environments because the field has no substrate that produces graded outcomes at scale. Outbound sales is one — possibly the only deployed one with the combination of properties research needs. ASOC is the apparatus built on that substrate. This note states what outbound makes possible, classifies the eight production modules by where the research lives and where the engineering lives, and lays out the open problems the substrate has surfaced. Two architectural patterns — a meta-agent decision orchestrator and a role-specialized multi-agent team — are described at the level of what they do and why. Implementation specifics are held back.
1. The Research Claim
The product claim of ASOC lives on a different page.[1] This is the research claim.
The primary claim is that outbound sales is a uniquely productive substrate for studying multi-agent systems, and ASOC is the apparatus that turns that substrate into a usable evaluation regime for agent research. The argument is not that outbound is interesting. It is that outbound has properties academic agent benchmarks structurally cannot produce.
Agent actions are graded against real economic outcomes (picked-up call, verified identity, accepted transfer, booked meeting, closed deal), with signal arriving at latencies from seconds to weeks. The human who takes the warm transfer is competent to label the agent's work and is doing so anyway as part of their job. The volume is real and dense enough to be informative. The regulatory envelope (TCPA, GDPR, SOC 2) forces honest design choices that capability-maximizing benchmarks let researchers elide.
The contribution is the regime, not the platform. ASOC defines what counts as human-connected (strictly the AMD-confirmed live human, not the telephony provider's coarser answered superset that silently includes voicemail), what counts as a verified transfer, what counts as a human-labeled outcome, and persists each of those events at a granularity that supports per-turn agent evaluation rather than per-task pass/fail.
The claim a program chair should hear: agent benchmarks remain stuck measuring task completion in toy environments because the field has no substrate that produces graded outcomes at scale; outbound sales is one, possibly the only deployed one with this combination of properties, and we can show what an evaluation regime built on that substrate looks like in practice.
Two secondary claims fall out of the primary.
The first is the warm-transfer protocol. Live AI-to-human handoff with full lead context, completed in seconds, under regulatory constraint, with the receiving human having to trust what they are handed, is a structurally novel problem the field has not catalogued. The agent must compress an open-ended conversation into a context object the human can act on in the few seconds before they speak; the human's first move after the handoff is itself a label on the quality of that compression. Warm transfer is both an architectural artifact worth describing in its own right and the cleanest single point in the system where the substrate-and-eval claim is operationalized. Every transfer produces a labeled trace of agent reasoning, agent action, and immediate human consequence. § 4 unpacks it.
The second is that the orchestration architecture ASOC has converged on under production load is non-obvious from the agent-orchestration literature, and is what the substrate requires to produce signal at the rate research needs. The patterns are emergent from real failure modes. The architectural argument is that this is what production-grade multi-agent systems actually look like once outcomes are taken seriously, and that the deltas from academic agent stacks are themselves findings.
A compliance-shapes-architecture point is true and worth a sentence here — the system cannot quietly hoard data, cannot quietly retry, cannot quietly bypass consent, and the architecture reflects all three constraints — but is treated as a constraint the primary claim inherits rather than as a separate contribution.
2. Outbound Sales as a Research Substrate
Outbound sales creates a rare combination of dense outcome signals, embedded human evaluation, operational scale, and regulatory realism that makes it unusually suitable for studying agent reliability and adaptation in production environments.
The argument has five load-bearing properties, a negative case, and a transferability claim. A reviewer should be able to attack each one.
Outcome Density
Most agent environments have sparse supervision — a final success/failure, a benchmark score, a delayed human rating. Outbound is different. A single outbound interaction emits multiple outcome signals at multiple temporal resolutions: pickup or no pickup, voicemail detection, human engagement duration, objection category, interruption patterns, transfer acceptance, meeting scheduling, follow-up compliance, eventual pipeline conversion, downstream close or loss. Seconds, minutes, hours, weeks. A layered reward surface, naturally occurring.
The important insight is not merely that there are many labels. It is that the labels occur at multiple timescales, which enables short-horizon optimization, long-horizon attribution, and intermediate supervision simultaneously. Standard agent benchmarks supply one of these, rarely two, almost never all three. Code agents often receive only terminal correctness. Chat agents often receive subjective post-hoc ratings. Autonomous web agents frequently lack economically grounded outcomes entirely. Outbound gives behavioral telemetry, human judgment, and economic consequence in the same trajectory. That combination is rare.
Even short outbound calls routinely produce on the order of 10–50 structured events and several economically meaningful terminal outcomes.[2] The point is not exact cardinality. The point is that the supervision is dense, naturally occurring, production-generated, and does not require synthetic annotation pipelines.
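One way to picture the layered reward surface is as a stream of events tagged by latency and grouped by horizon. The sketch below is illustrative only: the `OutcomeEvent` type, the event names, and the horizon boundaries are assumptions, not ASOC's schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical horizon boundaries in seconds. The note only names the
# timescales (seconds, minutes, hours, weeks), not where they split.
HORIZONS = [
    ("seconds", 60),
    ("minutes", 3600),
    ("hours", 86400),
    ("weeks", float("inf")),
]


@dataclass(frozen=True)
class OutcomeEvent:
    call_id: str
    kind: str         # e.g. "pickup", "transfer_accepted", "meeting_booked"
    latency_s: float  # seconds between dial start and the event


def horizon_of(latency_s: float) -> str:
    """Return the name of the first horizon whose upper bound exceeds the latency."""
    for name, upper in HORIZONS:
        if latency_s < upper:
            return name
    return HORIZONS[-1][0]


def group_by_horizon(events):
    """Bucket event kinds by horizon so each timescale can be supervised separately."""
    grouped = defaultdict(list)
    for ev in events:
        grouped[horizon_of(ev.latency_s)].append(ev.kind)
    return dict(grouped)
```

Grouping this way keeps short-horizon signals (pickup, engagement) available for immediate optimization while the long-horizon ones (pipeline conversion, close) remain attached to the same call for later attribution.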
Embedded Human Evaluation
The human who accepts the transfer is not labeling for research. They are labeling because the business depends on it. That sharply changes incentive alignment.
The implicit evaluations are operationally consequential: accepting a transfer, rejecting a lead, continuing a sequence, escalating, marking disqualification, editing generated messaging. The argument is stronger than "humans score outputs." It is:
Outbound systems generate labels as a byproduct of operational necessity rather than explicit annotation labor.
The labels are noisy, heterogeneous, and strategy-dependent. ASOC does not assume universal consistency. The claim is weaker but more practical: operational decisions made repeatedly under economic incentives provide usable supervisory structure even when local judgments vary. The substrate compensates through scale and recurrence — thousands of calls, repeated objection classes, repeated outcomes — so that noisy local judgments aggregate into stable distributions. The substrate does not need perfect labels. It needs sufficiently correlated operational signals at sufficient volume. That is what outbound supplies.
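The aggregation argument can be made concrete with a standard confidence interval on a binary operational label, such as transfer accepted versus rejected within one objection class. The sketch below uses the Wilson score interval as one reasonable choice; the note does not say which estimator ASOC uses, and the counts in the usage note are invented for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

With the same observed rate, `wilson_interval(6, 20)` is much wider than `wilson_interval(600, 2000)`: individual judgments stay noisy, but the estimated rate for a recurring class tightens as volume grows.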
Volume and Statistical Power
At production volume, the substrate becomes qualitatively different from hand-curated eval suites, benchmark datasets, and small internal pilots. Long-tail failure modes emerge, ablations become measurable, distribution shift appears naturally, and rare conversational events become statistically observable.
Many agent failure modes are low-frequency phenomena that only emerge under sustained deployment pressure: conversational drift, prompt exploitation, escalation loops, retry pathologies, latency degradation, reward hacking, compliance edge cases. Small-scale evaluations systematically miss these. Large-scale outbound exposes them continuously.
Regulatory Constraints as Architectural Pressure
Compliance constraints prevent the system from behaving like an unconstrained benchmark agent. This is not a nuisance. It is what makes the environment realistic.
TCPA, GDPR, and SOC 2 rule out unrestricted memory accumulation, silent recontact loops, unbounded retries, opaque data retention, unconstrained experimentation, identity ambiguity, and hidden escalation behavior. The agent cannot optimize purely for task completion. It must optimize subject to operational legitimacy. Most benchmark agents are evaluated in unconstrained action spaces; deployed agents operate under institutional constraints. Outbound therefore becomes a useful substrate precisely because optimization pressure is bounded by governance.
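A minimal sketch of what "optimize subject to operational legitimacy" can look like at the dial decision. The 8 a.m. to 9 p.m. local-time window follows the calling-time restriction in the FCC's TCPA rules for telephone solicitations; everything else (the `may_dial` name, the `has_consent` flag, the attempt cap of 3) is an assumption made for illustration, not ASOC's gate.

```python
from dataclasses import dataclass
from datetime import datetime, time, tzinfo

CALL_WINDOW = (time(8, 0), time(21, 0))  # local time at the called party's location
MAX_ATTEMPTS = 3  # hypothetical retry cap; the note does not state one


@dataclass(frozen=True)
class GateResult:
    allowed: bool
    reason: str  # always populated, so a refusal is never silent


def may_dial(now_utc: datetime, prospect_tz: tzinfo, on_dnc: bool,
             has_consent: bool, attempts: int) -> GateResult:
    """Check every constraint before a dial and report which one blocked it."""
    if on_dnc:
        return GateResult(False, "dnc")
    if not has_consent:
        return GateResult(False, "no_consent")
    if attempts >= MAX_ATTEMPTS:
        return GateResult(False, "retry_cap")
    local = now_utc.astimezone(prospect_tz).time()
    if not (CALL_WINDOW[0] <= local < CALL_WINDOW[1]):
        return GateResult(False, "outside_calling_window")
    return GateResult(True, "ok")
```

Returning a reason with every decision is the design point: a blocked action leaves a record, which is what separates a bounded retry policy from a silent recontact loop.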
Mid-Task Human Handoff
Most agent systems evaluate fully autonomous completion or post-hoc human scoring. Outbound introduces synchronized mid-trajectory collaboration. The human enters while conversational state is live, user intent is evolving, and agent reasoning remains operationally relevant. That enables study of corrigibility, adaptive delegation, confidence calibration, interruptibility, and shared control under live economic pressure. The structure is qualitatively different from end-of-task review.
The Negative Case
The strongest objection is that outbound is too domain-specific to generalize. The objection is partly correct. Sales conversations have repetitive structures, measurable incentives, constrained goals, and partially scriptable flows. This makes them easier than many open-ended agent domains.
ASOC does not claim that outbound proves general intelligence. The claim is narrower: outbound provides a production environment for studying reliability, adaptation, supervision, and human-agent coordination under real operational constraints. The substrate generalizes better as an evaluation regime than as a task distribution. That distinction is crucial.
Transferability
Dense operational telemetry, hierarchical outcomes, embedded human feedback, compliance-constrained optimization, interruptible delegation, longitudinal adaptation — these likely generalize to customer support, healthcare coordination, recruiting, collections, operations workflows, and enterprise copilots.
Sales-specific persuasion dynamics, lead qualification structures, objection taxonomies, and SDR workflow assumptions probably do not.
The transferable contribution is therefore not sales agents. It is methods for evaluating and adapting agents in environments with continuous operational feedback and human collaboration. That is the real intellectual payload of this section.
3. System Overview
ASOC's research contribution at the module level is not located in any single module. It is located in two architectural patterns woven through them.
The first pattern is a meta-agent decision orchestrator. A reasoning component ingests heterogeneous real-time signals from a plurality of independent automation modules, maintains a per-prospect orchestration-graph state, and emits structured action directives that drive downstream module execution, with a closed-loop outcome-feedback mechanism that improves reasoning quality over time. The pattern has five elements: cross-module signal ingestion, per-prospect graph state as conditioning input, language-model reasoning at the decision step, structured action-directive dispatch, and outcome feedback. Six of the eight modules participate in this pattern as signal-emitters, action-executors, or both.
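The shape of the loop can be sketched in a few lines. This is a toy under stated assumptions: the class and field names are hypothetical, the per-prospect state is flattened to lists rather than a graph, and `rule_based_reasoner` is a stand-in for the language-model decision step. Only the `SEND_EMAIL` directive name comes from this note; `WAIT` and `intent_spike` are invented.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Signal:
    prospect_id: str
    module: str   # which module emitted the signal
    kind: str
    payload: dict


@dataclass(frozen=True)
class Directive:
    prospect_id: str
    action: str   # "SEND_EMAIL" is named in the note; "WAIT" is hypothetical
    params: dict


@dataclass
class ProspectState:
    prospect_id: str
    signals: list = field(default_factory=list)
    outcomes: list = field(default_factory=list)


class Orchestrator:
    def __init__(self, reason: Callable[[ProspectState], Directive]):
        self._reason = reason
        self._states = {}

    def state(self, prospect_id: str) -> ProspectState:
        return self._states.setdefault(prospect_id, ProspectState(prospect_id))

    def ingest(self, signal: Signal) -> Directive:
        """Fold a signal into per-prospect state, then decide conditioned on it."""
        st = self.state(signal.prospect_id)
        st.signals.append(signal)
        return self._reason(st)

    def record_outcome(self, directive: Directive, outcome: str) -> None:
        """Close the loop: later decisions can see what earlier directives produced."""
        self.state(directive.prospect_id).outcomes.append((directive.action, outcome))


def rule_based_reasoner(state: ProspectState) -> Directive:
    """Stand-in for the language-model reasoning step."""
    if state.signals[-1].kind == "intent_spike":
        return Directive(state.prospect_id, "SEND_EMAIL", {"template": "intro"})
    return Directive(state.prospect_id, "WAIT", {})
```

Because the reasoner is injected as a callable, a new signal modality only changes what arrives in `state.signals`; the dispatch and feedback plumbing stays the same.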
The second pattern is a role-specialized multi-agent team. Each prospect is engaged by a coordinated set of role-specialized agents — a research role, a caller role, an objection-handler role, a follow-up coordinator role — operating concurrently under a supervisor with a shared prospect-state object as the coordination substrate. This replaces the conventional single-agent AI SDR with a workforce simulation. Five of the eight modules host roles in this pattern.
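A minimal sketch of roles coordinating through a shared state object rather than calling each other directly. It assumes cooperative `asyncio` tasks and invented keys (`talking_points`, `opening_line`); ASOC's inter-agent message protocol is not described in this note and is not reproduced here.

```python
import asyncio


class SharedProspectState:
    """Coordination substrate: roles read and write here, never to each other."""

    def __init__(self):
        self._data = {}

    def write(self, role: str, key: str, value) -> None:
        # Record the author so the trail of who wrote what stays auditable.
        self._data[key] = {"value": value, "written_by": role}

    def read(self, key: str):
        return self._data.get(key)

    def snapshot(self) -> dict:
        return dict(self._data)


async def research_role(state: SharedProspectState) -> None:
    await asyncio.sleep(0)  # stands in for asynchronous enrichment work
    state.write("research", "talking_points", ["recent funding round"])


async def caller_role(state: SharedProspectState) -> None:
    # Yield to other roles until the research role has published its output.
    while state.read("talking_points") is None:
        await asyncio.sleep(0)
    point = state.read("talking_points")["value"][0]
    state.write("caller", "opening_line", f"Congratulations on the {point}.")


async def supervisor() -> dict:
    """Run the roles concurrently over one shared state and return the result."""
    state = SharedProspectState()
    await asyncio.gather(research_role(state), caller_role(state))
    return state.snapshot()
```

Running `asyncio.run(supervisor())` yields a snapshot in which each entry carries its author. In this sketch a human operator could fill either role by writing to the same keys, which is the property the hybrid embodiment relies on.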
The eight modules are the substrate; the two patterns are the research. The Warm Transfer module is where both patterns converge in production today, which is why it gets its own section (§ 4).
Module Classification
| Module | Role in the Orchestrator | Role in the Agent Team | State | Verdict |
|---|---|---|---|---|
| 1. Buyer Intent Radar | Signal-emitter | — | In build | Specification complete |
| 2. Prospect Research Agent | Signal-emitter | Research role | Partial | Enrichment live; autonomous role in build |
| 3. Email Orchestrator | Action-executor + signal-emitter | Follow-up coordinator | Live + partial | Engineering at production grade |
| 4. AI SDR Dialer | Action-executor | Caller role | Live | Production grade, one research sub-problem |
| 5. AI-to-Human Warm Transfers | Action-executor | Escalation convergence | Live | Research — § 4 |
| 6. Sales Agent Desktop | Human-operator surface + label loop | Human-AI hybrid receiving surface | Live | Production grade |
| 7. Conversation Intelligence | Signal-emitter | Per-utterance event source | Partial | Production grade, one research sub-problem |
| 8. Revenue Analytics | Outcome-feedback logger | Inter-agent audit trail | Live | Production grade |
The architecture diagram below shows three overlays on the same eight-module substrate: the orchestrator pattern, the agent-team pattern, and the production telephony spine that both rely on.
graph TB
subgraph Spine ["Production Spine (live)"]
LeadDB[(Lead DB)]
Queue[Power Dial Queue]
Tel[Telephony Providers<br/>load-balanced]
AMD[AMD Pipeline]
Bus[Event Bus + Pusher]
Desk[Sales Agent Desktop]
LeadDB --> Queue --> Tel --> AMD --> Bus --> Desk
end
subgraph Orch ["Orchestrator Pattern"]
Meta[Meta-Agent<br/>Reasoning Component]
State[Per-Prospect<br/>Orchestration Graph]
Feedback[Outcome-Feedback Logger]
Meta <--> State
Feedback --> State
end
subgraph Team ["Agent-Team Pattern"]
Sup[Supervisor]
Res[Research Role]
Call[Caller Role]
Obj[Objection Handler]
Foll[Follow-up Coordinator]
Shared[(Shared Prospect State)]
Sup --- Res & Call & Obj & Foll
Res & Call & Obj & Foll <--> Shared
end
Bus -.signals.-> Meta
Meta -.directives.-> Tel
Meta -.directives.-> Desk
Desk -.labels.-> Feedback
Call -.hosted by.-> Tel
Foll -.hosted by.-> Email[Email Orchestrator]
Sup -.escalates to.-> Desk
IntentRadar[Buyer Intent Radar<br/>in build]:::pending -.signals.-> Bus
ResearchAgent[Prospect Research Agent<br/>partial]:::pending -.signals.-> Bus
ConvIntel[Conversation Intelligence<br/>partial]:::pending -.per-utterance.-> Sup
Email -.signals.-> Bus
classDef pending stroke-dasharray: 5 5
Module-by-Module Notes
Buyer Intent Radar. Ingests intent signals from third-party sources — web visits, content downloads, ad engagement, funding rounds, hiring surges, technology-adoption events, executive moves — and emits structured intent events onto the shared bus that the meta-agent consumes. Signal payloads carry topic, score delta, source, and decay window. In active development; the signal contract and the event-payload schema are defined, the user-facing surface is not yet shipped. Once shipped, the module operates as a black-box signal-emitter — the rest of the architecture does not change, because the orchestrator pattern is designed so new signal modalities slot in without hand-authored workflow rework.
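The four payload fields named above can be sketched as a small typed event. The field names, the extra `prospect_id` and `observed_at` fields, and the linear decay are all assumptions made for illustration; the actual event-payload schema is not published in this note.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IntentEvent:
    prospect_id: str
    topic: str
    score_delta: float
    source: str
    observed_at: datetime
    decay_window: timedelta

    def weight_at(self, now: datetime) -> float:
        """Contribution of this event at `now`, decaying linearly to zero.

        Linear decay is an assumption; the note only says a decay window exists.
        """
        age = now - self.observed_at
        if age < timedelta(0) or age >= self.decay_window:
            return 0.0
        return self.score_delta * (1 - age / self.decay_window)
```

Carrying the decay window on the event itself means the consumer does not need per-source rules to know when a signal has gone stale.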
Prospect Research Agent. Builds per-prospect enrichment dossiers — company facts, persona analysis, technographic data, intent interpretation, competitive context, recent news, talking points — ahead of any outbound action, writing the structured intelligence into the shared prospect-state object so downstream roles read it without redundant re-research. Two roles in one module. As an orchestrator surface, it is an enrichment-class signal-emitter. As an agent-team surface, it hosts the research role: instantiated with a research-specific system prompt, operating asynchronously, writing structured output into the shared state via the inter-agent message protocol. Static enrichment via a third-party data API is live today; the autonomous research role that loops over enrichment tasks and reasons about persona is in build.
Email Orchestrator. Multi-step sequence engine chaining email, phone, SMS, wait, condition, and channel steps per enrollment. A cron-driven processor personalizes templates from lead fields, selects collateral by industry/persona/buying-stage metadata, sends through multiple providers, classifies replies, and branches on opens and replies. A natural-language campaign runner accepts a campaign description and generates the corresponding sequence graph. Two roles in one module. As an action-executor, it receives SEND_EMAIL directives from the meta-agent and dispatches them, returning open, click, reply, and bounce events. As a follow-up coordinator in the agent-team pattern, it receives the call-complete message and the conversation history from the supervisor and dispatches the appropriate post-call follow-up. Classification: engineering at production grade. The orchestration pattern itself (typed step graph, conditional branching, deferred wait nodes, deliverability-aware account rotation) is well-trodden in marketing automation. What makes the module load-bearing for the research claim is its role in both patterns.
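The typed step graph with conditional branching is a familiar structure, so a generic sketch is enough to fix ideas. The step kinds come from the list above; the `Step` fields, the step names, and the sample sequence are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    kind: str                       # "email", "phone", "sms", "wait", "condition"
    next: Optional[str] = None      # default successor for non-condition steps
    event: Optional[str] = None     # event a condition step checks for
    on_true: Optional[str] = None   # successor when the event was seen
    on_false: Optional[str] = None  # successor when it was not


# Hypothetical sequence: intro email, wait, then branch on whether a reply arrived.
SEQUENCE = {
    "intro": Step("email", next="pause"),
    "pause": Step("wait", next="check_reply"),
    "check_reply": Step("condition", event="reply", on_true="call", on_false="nudge"),
    "call": Step("phone"),
    "nudge": Step("email"),
}


def next_step(graph: dict, current: str, seen_events: set) -> Optional[str]:
    """Return the next step name for an enrollment, or None at a terminal step."""
    step = graph[current]
    if step.kind == "condition":
        return step.on_true if step.event in seen_events else step.on_false
    return step.next
```

A periodic processor would call `next_step` once per enrollment per tick, passing the events recorded since the last step ran.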
AI SDR Dialer. Browser-anchored predictive dialer that drives a softphone, runs parallel outbound dials at a configurable ratio through a dual-provider load balancer, performs Answering Machine Detection on each prospect leg, and bridges only AMD-confirmed humans into a per-lead conference with the receiving human — under TCPA timezone and DNC compliance gates. Two roles in one module. As an action-executor, it receives voice-call directives under compliance validation. As a caller role in the agent-team pattern, it conducts the real-time voice conversation with talking points loaded from the shared state and emits per-utterance state-update messages to the supervisor. Classification: engineering at production grade, with one research-grade sub-problem. The dialer as a system is a known pattern; the production-grade artifacts are the load-balancer's target-split parser, carrier stickiness, and per-receiver permanent DID with inbound routing. The research sub-problem inside the module is the AI-to-human latency-stitching pattern that pre-establishes the human-side bridge during the agent-side decision window so that the handoff completes inside the AMD verdict window rather than after it. The pattern is not in the human-in-the-loop literature; the telemetry that proves it works is the kind of artifact a research note exists to surface. § 4 covers it.
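The core idea of latency stitching, starting the human-side bridge while the machine-or-human verdict is still pending, can be illustrated with generic concurrency. This is not ASOC's implementation, which the note withholds: the function names and string verdicts are invented, and the sleeps stand in for telephony round-trips.

```python
import asyncio


async def amd_verdict(delay_s: float, verdict: str) -> str:
    """Stand-in for Answering Machine Detection on the prospect leg."""
    await asyncio.sleep(delay_s)
    return verdict


async def ring_human_bridge(delay_s: float) -> str:
    """Stand-in for establishing the receiving human's side of the bridge."""
    await asyncio.sleep(delay_s)
    return "bridge-ready"


async def stitched_handoff(amd_delay_s: float, bridge_delay_s: float, verdict: str) -> str:
    # Start the human-side bridge first so it overlaps with the AMD window.
    bridge_task = asyncio.create_task(ring_human_bridge(bridge_delay_s))
    result = await amd_verdict(amd_delay_s, verdict)
    if result != "human":
        # Not a live human: tear down the speculative bridge and release the leg.
        bridge_task.cancel()
        await asyncio.gather(bridge_task, return_exceptions=True)
        return "released"
    await bridge_task
    return "connected"
```

Under this sketch the handoff completes in roughly the longer of the two delays instead of their sum. The cost is a speculative bridge that must be cancelled cleanly whenever the verdict is not human.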
AI-to-Human Warm Transfers. § 4 is the deep-dive. One paragraph teaser: live AI-to-human handoff under regulatory constraint, with the receiving human having to trust what they are handed in the few seconds before they speak, is a structurally novel problem the literature has not catalogued; ASOC operates two distinct topologies for it; the human's first move after the handoff is itself a label on the quality of the upstream agent's compression. The warm-transfer module is where both architectural patterns converge in production today.
Sales Agent Desktop. The browser-anchored operating environment for the human operator. Softphone, dialpad, DID selection, warm-transfer reception with lead and campaign context, in-call controls, lead intelligence panel, lead timeline, real-time SMS inbox, disposition and wrap-up form with meeting scheduler, click-to-redial, audio-status diagnostic, recent-disconnects self-diagnostic, and a power-dial control strip with a data-backed ratio recommendation that gates on sample size. Two roles in one module. Under the orchestrator pattern, the desktop is the human-operator surface that receives escalation dispatches and is also the source of the highest-quality outcome-feedback signals: every disposition, every meeting, every objection note becomes a label on the upstream agent's work, persisted at per-event granularity. Under the agent-team pattern, the desktop is the surface for the human-AI hybrid embodiment in which any role position may be filled by a human operator while the shared state and the inter-agent message protocol remain unchanged. Classification: engineering at production grade. The interesting artifacts inside the module (the audio-diagnostic that treats a live call as ground truth, the wrap-up idle-timeout state machine with activity-token reset, the offline-banner and live-state veto wiring) are silent-failure prevention rather than research, but they are the conditions under which the labeling apparatus can produce trustworthy data.
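A recommendation that "gates on sample size" can be sketched as a filter applied before any comparison is made. The threshold of 200 dials, the choice of human-connect rate as the objective, and the input shape are all assumptions; the note does not state what the production recommender optimizes.

```python
MIN_DIALS = 200  # hypothetical sample-size gate


def recommend_ratio(stats: dict):
    """Pick a dial ratio from observed data, or decline if the data is too thin.

    `stats` maps dial ratio -> (dials, human_connects).
    Returns the eligible ratio with the highest human-connect rate, or None.
    """
    eligible = {
        ratio: connects / dials
        for ratio, (dials, connects) in stats.items()
        if dials >= MIN_DIALS
    }
    if not eligible:
        return None  # no recommendation is better than one drawn from noise
    return max(eligible, key=eligible.get)
```

For example, `recommend_ratio({2: (500, 60), 3: (50, 20)})` returns `2`: ratio 3 shows a higher raw rate, but on too few dials to be trusted.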
Conversation Intelligence. Today, post-call analysis over recorded transcripts, producing a quality score, sentiment, objection-handling score with handled and missed lists, agent-vs-prospect talk-ratio, key-moments extraction, and coaching suggestions. In development: per-utterance event emission during live calls to support real-time objection handling in the agent-team pattern. Classification: engineering at production grade, with one research-grade sub-problem. The post-call module itself is the standard LLM-over-transcript pattern. The research sub-problem worth surfacing is the strict definition of human-connected as a labeling primitive, distinct from the telephony provider's coarser answered notion. The provider's notion includes voicemail pickups and IVR trees; ASOC defines human-connected strictly as a downstream consequence of an AMD human verdict. That choice is small in code but load-bearing: it sets the unit of agent evaluation as the verified human conversation, not the dial attempt, and it persists enough raw provider payload that the labeling primitive remains auditable downstream — which is exactly the kind of audit trail both architectural patterns depend on.
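The distinction between the provider's answered notion and the strict human-connected label is small enough to show directly. The payload keys (`status`, `amd_verdict`) and their values are hypothetical and do not belong to any specific telephony provider's API; the point is that the strict label requires a human AMD verdict and that the raw payload is kept alongside it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLabel:
    call_id: str
    provider_answered: bool  # coarse superset: includes voicemail and IVR pickups
    human_connected: bool    # strict: only set downstream of an AMD human verdict
    raw_payload: str         # provider payload kept verbatim so the label is auditable


def label_call(call_id: str, provider_payload: dict) -> CallLabel:
    """Derive both labels from one payload and persist the payload itself."""
    answered = provider_payload.get("status") == "answered"
    is_human = provider_payload.get("amd_verdict") == "human"
    return CallLabel(
        call_id=call_id,
        provider_answered=answered,
        human_connected=answered and is_human,
        raw_payload=json.dumps(provider_payload, sort_keys=True),
    )
```

A voicemail pickup therefore counts as answered but not as human-connected, and a missing verdict defaults to not connected. The unit of evaluation stays the verified human conversation rather than the dial attempt.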
Revenue Analytics. Supervisor-facing dashboards over the event store: a live-ops view, per-ratio dial analytics with a recommendation engine, a speed-to-hello widget that A/B tests bridge paths in production, a per-campaign call-history filter and export, and a human connect rate leaderboard tied to the strict human-connected semantics. Classification: engineering at production grade. The dashboards themselves are standard supervisor-tool engineering. What makes the module load-bearing for the research claim is the per-event telemetry it consumes — persisted at a granularity (and, where appropriate, without foreign-key constraints) so that the substrate cannot lose labels to schema-evolution races. That stream is the apparatus both patterns' outcome-feedback and audit-trail requirements actually run on. Read-only by design, so the labeling apparatus is not contaminated by the observability tooling that consumes it.
The label loop is the substrate-eval primary claim from § 1 made concrete: human operator → dispositions, meetings, recordings → event bus → outcome-feedback logger and tamper-evident audit trail. Every arrow in the architecture diagram terminates here.
4. Deep Dive: Warm Transfer
This section is pending the Q4 interview. It will unpack the warm-transfer protocol as the convergence point of the two architectural patterns: the latency vs. completeness trade-offs, the context handoff as a compression problem, the receiving human's first move as a label on the compression quality, the two coexisting handoff topologies (AI-fronted conversation with tool-triggered bridge, vs. power-dial AMD-gated bridge), and the session state machine that mediates between them.
Target length: 600–900 words. The most technically dense section of the paper.
5. Orchestration
This section is pending the Q5 interview. It will describe how the eight modules coordinate: the queue and event-bus topology, the per-prospect graph state as the conditioning input to the meta-agent, the inter-agent message protocol as the coordination substrate for the agent-team pattern, fingerprint-gated state transitions for race-prone hand-offs, and the deliberate observability primitive of preserving raw event payloads separately from derived state.
Target length: 400–600 words.
6. Failure Modes and Learnings
This section is pending the Q6 interview. It will describe failure modes observed in production and the architectural learnings extracted from them — the kind of evidence the field rarely sees published. Treatment by failure category (race conditions in handoff paths, parameter divergence between telephony API surfaces, label-loss under schema evolution), not by fix mechanism.
Target length: 500–700 words. Honest about what did not work.
7. Compliance as Architecture
This section is pending the Q7 interview. It will describe how the regulatory envelope (SOC 2 Type II, GDPR, TCPA) shapes what the system can compute, store, and act on — not as a compliance checklist but as constraint pressure that produces architectural patterns a capability-maximizing system would not converge on.
Target length: 300–500 words.
8. Open Research Problems
This section is pending the Q8 interview. It will list 3–5 research problems the substrate has surfaced — the publishable open questions the system has revealed but not resolved.
Target length: 500–800 words. One paragraph per question.
Production Status
ASOC is in production. The orchestrator and agent-team patterns are live in the modules marked Live in the § 3 table; the modules marked Partial or In build are under active development and described above at the level of their architectural role rather than as implemented surfaces. The system is SOC 2 Type II compliant, GDPR compliant, and TCPA compliant. Integrations are live with Salesforce and HubSpot.
ASOC is operated and commercialized by BeyondCodes. The architectural research described in this note is conducted by DataFrontier. The commercial framing — modules, integrations, compliance posture, deployment — lives at /products/asoc.
Cite this work
@misc{datafrontier_asoc_2026,
title = {{ASOC}: Outbound Sales as a Substrate for Agent Research},
author = {{DataFrontier Team}},
year = {2026},
note = {Ongoing research. DataFrontier Innovations.
ASOC is operated by BeyondCodes.},
url = {https://datafrontier.co/research/asoc}
}
Comments and collaboration: research@datafrontier.co
Footnotes
1. See /products/asoc for the commercial framing. This page is a research write-up of the architecture and the open problems the substrate has surfaced.
2. Range estimated from production instrumentation; exact distribution depends on call duration, prospect engagement, and the configuration of the instrumentation pipeline at the time of measurement.