Key Takeaways:
- Many production AI workflows run on small, specialized models that are faster and more economical than frontier models
- Companies increasingly stitch together pipelines where a large model plans and smaller models execute repetitive steps
- Cost and latency are primary drivers: lightweight models lower unit economics and allow higher-volume automation
- Fine-tuned small models can outperform general-purpose systems on narrow tasks, especially when paired with domain data
- The mix-and-match approach might reshape AI budgets toward orchestration, data quality, and evaluation rather than single-model bets
There’s a growing disconnect between the attention frontier reasoning systems receive and the models that actually push work through corporate pipelines. In a recent column in the Wall Street Journal, the author highlights a pattern executives and engineers are now reporting: the models powering day-to-day outcomes in contact centers, sales operations, ad delivery, and content processing are frequently smaller, cheaper, and more specialized than the marquee systems garnering headlines. As the piece explains, the “AI factory” being assembled inside enterprises looks less like one giant brain and more like a conveyor belt where conventional software moves data from station to station and compact models do focused, repetitive jobs.
This factory pattern has a few consistent traits. First, teams separate “thinking” from “doing.” A larger, more capable model may draft an overall plan or outline an approach. From there, smaller models—often fine-tuned on proprietary signals—handle the bulk of extraction, classification, summarization, retrieval, or validation. Second, engineers route tasks based on difficulty and cost. If a question or document looks simple, a lightweight model handles it. If the system detects ambiguity or higher stakes, it escalates to a more capable model. Third, the pipeline is built on top of existing application logic and data stores, which makes it easier to monitor, cache, and audit.
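To make the routing idea concrete, here is a minimal sketch of difficulty-and-stakes routing with escalation. Everything in it is an assumption for illustration: the `Router` class, the word-count heuristic for "looks simple," the confidence threshold, and the stub models are hypothetical and are not taken from the Journal's reporting or from any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumption: a "model" is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Router:
    small: Model
    large: Model
    min_confidence: float = 0.8  # hypothetical escalation threshold
    max_simple_words: int = 50   # hypothetical "looks simple" heuristic

    def route(self, text: str, high_stakes: bool = False) -> Tuple[str, str]:
        """Return (tier_used, answer) for one request."""
        looks_simple = len(text.split()) <= self.max_simple_words
        if looks_simple and not high_stakes:
            answer, confidence = self.small(text)
            if confidence >= self.min_confidence:
                return "small", answer
        # Escalate when the input is long, the stakes are high,
        # or the small model was not confident enough.
        answer, _ = self.large(text)
        return "large", answer


# Stub models standing in for real inference calls.
def small_stub(text: str) -> Tuple[str, float]:
    confidence = 0.4 if "refund" in text.lower() else 0.95
    return "small-answer", confidence


def large_stub(text: str) -> Tuple[str, float]:
    return "large-answer", 0.99


router = Router(small=small_stub, large=large_stub)
print(router.route("What are your opening hours?"))
print(router.route("I want a refund for my last order"))
```

In a real pipeline the stubs would be replaced by calls to whichever models a team has chosen, and the thresholds would be tuned against evaluation data rather than picked by hand.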

Multiple examples in the Journal’s reporting underscore how this architecture is taking shape. Startups like Aurelian apply generative tools to automate responses to nonemergency 911 calls, while Hark Audio uses AI to identify, clip, and collect moments from podcasts at significant scale. Sales-enablement provider Gong processes large libraries of recorded calls to answer questions like “Why am I losing deals?” and produces structured reports that would otherwise take human analysts many hours. And Airbnb reportedly uses AI, including models from Alibaba, to resolve customer-service tickets faster than human agents. Even Meta’s advertising systems lean on smaller models in production, with larger systems used to transfer knowledge and guide targeting methods.
Of course, readers of Tehrani.com shouldn’t be surprised, as we wrote about this shift almost two years ago in a post titled “Unified Office Sees the Future of AI in SLMs, not LLMs”:
While there is tremendous interest in large language models or LLMs which are fed huge amounts of information, to then provide generalist AI – Unified Office has tapped into smaller language models. These are vertical-specific and solve a problem most companies have. Specifically, whisper coaching – allowing an agent to hear ways in which they can improve a call – while still speaking to the customer.
This can be useful in cases where call center agents turn over often or need to ramp up quickly. When asked about a specific situation where this could be useful, Founder and CEO Ray Pasquale responded to us in an in-person interview, “A car dealership salesperson could be told, ‘Don’t forget to offer the service special.’” He continued, “Humans can’t scale… Our technology is very close to real-time. We can do this because we use our own CPUs.”
The economic logic is straightforward. Token-based pricing and latency constraints put real pressure on unit costs in high-volume environments. As the column points out, small models are often orders of magnitude cheaper per million tokens than top-tier reasoning systems. They can also be steered with prompting or fine-tuning to perform a very specific job with high consistency, reducing the need for long chains of expensive reasoning. In production, that combination—lower cost and predictable performance—matters more than theoretical peak scores on a benchmark.
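A quick back-of-the-envelope calculation shows how this plays out at volume. The prices, task counts, and escalation rate below are made-up placeholders chosen for easy arithmetic, not figures from the column or from any provider's price list.

```python
# All numbers are hypothetical placeholders, not real vendor prices.
SMALL_PRICE_PER_M_TOKENS = 0.125    # USD per million tokens (assumed)
FRONTIER_PRICE_PER_M_TOKENS = 12.5  # USD per million tokens (assumed)

TASKS_PER_MONTH = 1_000_000
TOKENS_PER_TASK = 2_000


def monthly_cost(price_per_m_tokens: float, share_of_tasks: float = 1.0) -> float:
    """Cost of sending a share of the monthly tasks to one model tier."""
    million_tokens = TASKS_PER_MONTH * TOKENS_PER_TASK / 1_000_000
    return million_tokens * price_per_m_tokens * share_of_tasks


small_only = monthly_cost(SMALL_PRICE_PER_M_TOKENS)
frontier_only = monthly_cost(FRONTIER_PRICE_PER_M_TOKENS)

# Assumed mix: 90% of tasks stay on the small tier, 10% go to the frontier tier.
blended = (monthly_cost(SMALL_PRICE_PER_M_TOKENS, 0.9)
           + monthly_cost(FRONTIER_PRICE_PER_M_TOKENS, 0.1))

print(f"small only:    ${small_only:,.2f}")
print(f"frontier only: ${frontier_only:,.2f}")
print(f"90/10 blend:   ${blended:,.2f}")
```

Under these assumed prices the small tier is 100 times cheaper per token, and even a 90/10 blend costs roughly a tenth of sending everything to the frontier tier. The real ratio depends entirely on actual pricing and on how often tasks escalate.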
Technical leaders quoted in the Journal piece describe this division of labor in clear terms. “The reality is, for many of the operations that we need computing for today, we don’t need large language models,” says Kyle Lo of the Allen Institute for AI. Gong co-founder Eilon Reshef explains the routing strategy this way: “You might use the cheapest LLM to find out if a conversation is relevant, a reasonably cheap LLM to find the right information inside it, and then maybe a more-expensive frontier model to come up with the action document.” Hark Audio chief executive Don MacKinnon notes that giant general-purpose systems aren’t an efficient way to leverage unique proprietary data or incorporate editorial feedback, which is why his team fine-tuned a compact model on a large library of human-curated podcast clips.
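The three-step strategy Reshef describes can be sketched as a cascade in which each tier only sees what the cheaper tier before it let through. This is an illustrative sketch, not Gong's implementation: `run_cascade` and the three stub functions are hypothetical names, and simple keyword checks stand in for real model calls.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_cascade(
    conversations: List[str],
    is_relevant: Callable[[str], bool],         # cheapest tier: relevance filter
    extract: Callable[[str], Optional[str]],    # mid tier: pull out the key detail
    write_report: Callable[[List[str]], str],   # frontier tier: final document
) -> Tuple[Optional[str], Dict[str, int]]:
    """Run the cascade and count how many calls each tier received."""
    calls = {"cheap": 0, "mid": 0, "frontier": 0}
    findings: List[str] = []
    for convo in conversations:
        calls["cheap"] += 1
        if not is_relevant(convo):
            continue  # irrelevant conversations never reach pricier tiers
        calls["mid"] += 1
        finding = extract(convo)
        if finding:
            findings.append(finding)
    report = None
    if findings:
        calls["frontier"] += 1  # one expensive call over the distilled findings
        report = write_report(findings)
    return report, calls


# Keyword stubs standing in for model calls (assumptions for illustration).
def cheap_filter(convo: str) -> bool:
    return "price" in convo.lower()


def mid_extract(convo: str) -> Optional[str]:
    for sentence in convo.split("."):
        if "price" in sentence.lower():
            return sentence.strip()
    return None


def frontier_report(findings: List[str]) -> str:
    return "Action items:\n" + "\n".join(f"- {f}" for f in findings)


calls_log = [
    "Thanks for the demo. The price is too high for our team.",
    "Can we reschedule to Thursday?",
    "We liked it. Your competitor quoted a lower price.",
]
report, calls = run_cascade(calls_log, cheap_filter, mid_extract, frontier_report)
print(calls)
print(report)
```

The design point is that the most expensive model runs once, on a short list of extracted findings, rather than once per recorded call.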
Research trends are heading in the same direction. As summarized in the Journal article, work from Nvidia and Georgia Tech argues that many emerging “agent” applications rely on models performing a small number of specialized tasks repetitively and with little variation. From that vantage point, small models can be “sufficiently powerful, inherently more suitable, and necessarily more economical” for many agentic roles. In practice, this matches what enterprise teams report: the real lift comes from data engineering, retrieval, evaluation, and routing—while the model itself becomes one interchangeable component of a larger system.
For operators, a few implications stand out. First, orchestration is now a core competency. Teams need clear policies for when to use which model, how to escalate, and how to cache partial results. Second, data quality is strategic. Small models do their best work when grounded in well-structured context, whether through retrieval systems or fine-tuning. Third, evaluation moves from one-time benchmarking to continuous monitoring with task-level metrics: accuracy, latency, cost per task, and user-level outcomes. Finally, governance and risk management should be built into the pipeline—human-in-the-loop review for sensitive actions, red-teaming for prompt injection and jailbreaks, and logging for audits.
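As one way to picture continuous, task-level monitoring, the sketch below rolls logged results up into accuracy, latency, and cost per task. The `TaskResult` record, its field names, and the nearest-rank percentile choice are assumptions for illustration. The sample values are invented, and user-level outcomes are left out because they depend on each business.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    correct: bool      # did the output pass the task's check?
    latency_ms: float  # wall-clock time for the task
    cost_usd: float    # model spend attributed to the task


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate task-level metrics over a batch of logged results."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    latencies = sorted(r.latency_ms for r in results)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of observations at or below it.
    p95_index = math.ceil(0.95 * n) - 1
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_latency_ms": sum(latencies) / n,
        "p95_latency_ms": latencies[p95_index],
        "cost_per_task_usd": sum(r.cost_usd for r in results) / n,
    }


# Invented sample data; a real system would read these from pipeline logs.
batch = [
    TaskResult(correct=True, latency_ms=100.0, cost_usd=0.25),
    TaskResult(correct=True, latency_ms=200.0, cost_usd=0.25),
    TaskResult(correct=False, latency_ms=300.0, cost_usd=0.5),
    TaskResult(correct=True, latency_ms=400.0, cost_usd=1.0),
]
metrics = summarize(batch)
print(metrics)
```

Running the same summary per model tier and per task type, on a schedule, is what turns a one-time benchmark into ongoing monitoring.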
This isn’t to say frontier systems don’t matter. They remain valuable for complex reasoning, planning, and synthesis—particularly when a task spans heterogeneous data or calls for creative problem solving. Many teams use a two-tier structure: a large model drafts a plan or structures a final narrative, while small models grind through entity extraction, retrieval, summarization, and consistency checks. Over time, companies might shift more steps to compact models as they collect feedback and fine-tune, lowering cost while improving reliability.
Budget allocation might also shift. Instead of concentrating spend on a single frontier provider, leaders may distribute investments across three buckets: model diversity and routing, data pipelines and tooling, and evaluation plus safety. That mix tends to reduce vendor lock-in and creates a path to capture savings as small models improve. It also opens the door to open-source options where legal and compliance frameworks permit, particularly for on-prem or data-sensitive workflows.
The broader takeaway from the Journal’s reporting is pragmatic. The conversation about “AGI” and ever-larger parameter counts will continue, but the practical gains many companies are seeing come from disciplined engineering: clear task decomposition, high-quality retrieval, and small, specialized models tuned for throughput and consistency. That approach might not dominate the headlines, but it is shaping how work gets done.
If that trajectory holds, organizations could see AI programs mature the way earlier software automation did: from proofs of concept centered on a single tool to robust production systems built from modular parts. In that world, competitive advantage might come less from betting on the “smartest” model and more from building the most reliable factory.
Aside from his role as CEO of TMC and chairman of ITEXPO #TECHSUPERSHOW Feb 10-12, 2026, Rich Tehrani is CEO of RT Advisors and a Registered Representative (investment banker) with and offering securities through Four Points Capital Partners LLC (Four Points) (Member FINRA/SIPC). He handles capital/debt raises as well as M&A. RT Advisors is not owned by Four Points.
The above is not an endorsement or recommendation to buy/sell any security or sector mentioned. No companies mentioned above are current or past clients of RT Advisors.
The views and opinions expressed above are those of the participants. While believed to be reliable, the information has not been independently verified for accuracy. Any broad, general statements made herein are provided for context only and should not be construed as exhaustive or universally applicable.
Portions of this article may have been developed with the assistance of artificial intelligence, which may have contributed to ideation, content generation, factual review, or editing.






