White Paper Archives - Unigen

Guide to On-Prem AI Transcription Servers

Brett Patrick — Tue, 05 May 2026 18:05:28 +0000

Executive Summary: On-Premises AI Transcription for Contact Centers

What is the challenge with cloud-based call center transcription?

While enterprise call centers and BPOs rely heavily on speech-to-text AI for quality assurance and compliance, cloud-based services introduce three critical vulnerabilities:

Data Security Risks: Sensitive customer voice files must leave secure corporate boundaries for processing.
Unpredictable Cost Spikes: Operational pricing scales linearly with call volume, so monthly bills shift unpredictably as volumes change.
Strict Regulatory Demands: Complex frameworks like GDPR, HIPAA, and PCI-DSS mandate strict, auditable governance over how audio and biometric customer data is stored.

What is the secure alternative to cloud transcription?

An On-Premises AI Transcription Server moves the entire processing architecture back in-house. Running entirely within your local infrastructure, it achieves localized data sovereignty without sacrificing speed.

How does the Unigen server optimize localized speech-to-text?

Built on the Poundcake-LLM infrastructure, the system utilizes high-efficiency hardware to completely bypass the open internet:

Advanced AI Hardware: Driven by Unigen AI modules and powered by energy-efficient EdgeCortix SAKURA-II accelerators, the server delivers an industry-leading 6 TOPS per watt.
Simultaneous High Volume: Seamlessly runs resource-intensive OpenAI Whisper (medium and large) models across 32 concurrent real-time streams.
Unmatched TCO: Reduces local operational costs to an amortized rate of approximately $0.006 per minute, per channel.
Native Multilingual Support: Out-of-the-box support for English, Spanish, German, Japanese, and Dutch ensures cloud-level accuracy while guaranteeing that every byte of audio data remains safely enclosed inside your physical facility.

Poundcake LLM and Amaretti E1.S GenAI Module

Why Is AI Transcription Essential for Call Centers?

The global speech analytics market was valued at $4.94 billion in 2025 and is projected to grow from $5.70 billion in 2026 to $15.31 billion by 2034, a 13.15% compound annual growth rate (CAGR).

Image Source: Fortune Business Insights

The growth of this market should come as no surprise. As many business owners can attest, voice interactions are where the most complex (and often the most sensitive) customer issues are resolved.

For call centers handling thousands of daily interactions, AI transcription (the automated conversion of speech into text) is the backbone of modern operations because it allows businesses to:

Ensure compliance recording for financial regulators (MiFID II, Dodd-Frank)
Monitor quality across 100% of calls
Provide real-time coaching to call center employees
Analyze customer sentiment
Resolve disputes

Without accurate, timely transcription, these capabilities are impossible to deliver at scale.
Yet despite strong AI adoption in contact centers, a significant portion have not yet deployed speech analytics, primarily citing cost unpredictability, unclear ROI, and concerns about privacy and data security. This gap between adoption intent and actual deployment represents the core opportunity for a more cost-effective, easier-to-deploy solution.

The on-premises deployment model remains dominant in this market, accounting for approximately 70% of speech analytics market revenue (representing a segment value of $3.99 billion in 2026, growing to $10.71 billion by 2034). This trend is primarily driven by strict data privacy requirements in financial services, healthcare, government, and legal sectors.

Challenges with Cloud-Based Transcription

Security and Data Exposure

Voice recordings contain some of the most sensitive data a business handles, including customer financial details, health information, personal identifiers, and proprietary business conversations. Transmitting this data to third-party cloud providers creates exposure at every stage (transmission, processing, and storage).

The risks are not theoretical. In 2023, medical transcription provider Perry Johnson & Associates (PJ&A) suffered a breach that exposed 8.95 million patient records after hackers retained access to its systems for 36 days.

Image Source: Endecom Business IT Solutions

The breach impacted Cook County Health (1.2 million patients) and Northwell Health, New York’s largest healthcare provider. This incident demonstrated the risk of entrusting voice data to third-party transcription vendors.

Regulatory Complexity

Voice data occupies a uniquely sensitive position across multiple regulatory frameworks:

General Data Protection Regulation (GDPR): Under GDPR, voice recordings constitute personal data and can qualify as biometric data (Article 9 special category) when processed for speaker identification[2]. Sending voice data to cloud providers triggers additional compliance obligations including:
- Data Processing Agreements (Article 28 GDPR)
- Cross-border transfer safeguards
- Vendor security assessments
Health Insurance Portability and Accountability Act (HIPAA): Under HIPAA, patient voice recordings are protected health information.
Payment Card Industry Data Security Standard (PCI-DSS): Under PCI-DSS, call recordings containing payment card data must be encrypted and access-controlled, and CVV data must never be stored in any form.

Consequences for Regulatory Non-Compliance

The consequences of failing to comply can be severe. For example, Meta received a €1.2 billion fine in May 2023, the largest GDPR penalty ever, because of data transfers between the EU and the US that did not comply with regulations.

In August 2024, Uber was fined €290 million by the Dutch Data Protection Authority for transferring European driver data to the US without adequate safeguards. GDPR fines can reach up to 4% of worldwide annual turnover or €20 million, whichever is greater.
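The "whichever is greater" rule for the upper fine tier can be expressed as a one-line calculation. The following is a minimal sketch; the function name and the sample turnover figures are illustrative assumptions and do not come from any regulator's tooling.

```python
def gdpr_max_fine_eur(worldwide_annual_turnover_eur: float) -> float:
    """Upper-tier cap: the greater of 4% of turnover or EUR 20 million."""
    return max(0.04 * worldwide_annual_turnover_eur, 20_000_000.0)


# Hypothetical turnovers, for illustration only.
for turnover in (100_000_000, 10_000_000_000):
    print(f"turnover EUR {turnover:,}: cap EUR {gdpr_max_fine_eur(turnover):,.0f}")
```

For smaller companies the fixed €20 million figure is the binding limit; for large enterprises the 4% term dominates.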

Top 10 Largest Individual GDPR Fines

Data Controller	Fine	Year
Meta Platforms Ireland Limited	€1.2B	2023
TikTok Technology Limited	€530M	2025
Meta Platforms, Inc.	€405M	2022
Meta Platforms Ireland Limited	€390M	2023
TikTok Limited	€345M	2023
LinkedIn	€310M	2024
Uber Technologies Inc., Uber B.V.	€290M	2024
Meta Platforms Ireland Limited	€265M	2022
Meta Platforms Ireland Limited	€251M	2024
WhatsApp Ireland Ltd.	€225M	2021

Source: GDPR Enforcement Tracker

Expanding and Unpredictable Costs

Cloud transcription pricing appears modest at per-minute rates, but costs escalate rapidly at call center scale. The following table illustrates costs for a typical enterprise workload of 32 concurrent channels operating 24 hours per day across 30 days per month. Each channel accumulates approximately 43,200 minutes per month (24 × 60 × 30), and the monthly cost column is shown per channel.

Provider	Model/Tier	Cost per Minute per Channel	Monthly Cost per Channel (43.2K min)
AWS Transcribe	Standard	$0.015-$0.024	~$648
Google Cloud V2	Standard	$0.016	~$691
Azure Speech	Real-time	$0.0167	~$721
Deepgram Nova-3	Pay-as-you-go	$0.0077	~$333
Unigen On-Prem	Whisper Large	~$0.006*	~$259

*Amortized cost per minute per channel based on hardware lease/purchase over 36 months. Unlike cloud pricing, this cost does not increase with usage.
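The monthly figures follow from multiplying a per-minute rate by the minutes one always-on channel accumulates in a 30-day month. The sketch below reproduces that arithmetic for three rows of the table, assuming the monthly column is per channel and using the low end of the AWS range; the function name is illustrative, and volume discounts or negotiated rates are not modeled.

```python
MINUTES_PER_MONTH = 24 * 60 * 30  # 43,200 minutes for one always-on channel


def monthly_cost(rate_per_minute: float, channels: int = 1) -> float:
    """Monthly list-price cost in USD for `channels` streams running around the clock."""
    return rate_per_minute * MINUTES_PER_MONTH * channels


# Per-minute rates taken from the table above.
rates = {
    "AWS Transcribe (low end)": 0.015,
    "Azure Speech": 0.0167,
    "Unigen On-Prem (amortized)": 0.006,
}
for name, rate in rates.items():
    print(f"{name}: ${monthly_cost(rate):,.0f} per channel per month")
```

Passing `channels=32` scales the same list-price arithmetic to the full workload.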

Hidden costs further inflate cloud bills: data egress charges ($0.08-$0.23/GB), feature add-ons for speaker diarization and personally identifiable information (PII) redaction, medical transcription surcharges (3-5x base rates), and custom model endpoint hosting fees. At enterprise scale, the three major hyperscalers (AWS, Google, and Azure) typically cost from $6,000 to $8,000 a month for 32 concurrent channels operating in real time. This represents an annual cost of roughly $72,000 to $96,000 in perpetuity.

Solution: On-Prem AI Transcription Server

One solution is using an on-prem server for AI transcription. The Unigen On-Prem AI Transcription Server contains all speech processing within an air-gapped, on-premises environment. Voice data never leaves your facility. The system runs OpenAI Whisper, the industry’s leading open-source speech recognition model, on purpose-built AI accelerators, delivering cloud-quality accuracy at a fraction of the power consumption and cost of GPU-based alternatives.

How the On-Prem AI Transcription Server Works

The server integrates directly into your call center’s telephony infrastructure. Audio streams from your private branch exchange (PBX), SIP trunks, or contact center platform are routed to the transcription server over your internal network. The Whisper model processes each audio stream in real time, producing timestamped transcripts with speaker diarization. Without any data leaving your network, transcripts are delivered back to your analytics platform, quality management system, or compliance archive.
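To make "timestamped transcripts with speaker diarization" concrete, the sketch below shows one possible shape for that output. The `Segment` class, its field names, and the rendering format are hypothetical illustrations and do not describe the Unigen server's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float  # segment start, in seconds from the start of the call
    end_s: float    # segment end, in seconds from the start of the call
    speaker: str    # diarization label, e.g. "agent" or "caller"
    text: str       # transcribed words for this segment


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_transcript(segments: list[Segment]) -> str:
    """Return one line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return "\n".join(
        f"[{format_timestamp(s.start_s)} - {format_timestamp(s.end_s)}] "
        f"{s.speaker}: {s.text}"
        for s in ordered
    )


# Made-up example call.
call = [
    Segment(2.9, 5.1, "caller", "Hi, I have a billing question."),
    Segment(0.0, 2.4, "agent", "Thank you for calling."),
]
print(render_transcript(call))
```

A structure like this is what a downstream analytics platform or compliance archive would typically consume, since each line carries who spoke, when, and what was said.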

The system supports 32 concurrent transcription streams using 32 Unigen AI modules (with one SAKURA-II accelerator per module), with higher-performance systems being released later this year. The SAKURA-II delivers 60 TOPS at just 10 watts, yielding a power efficiency of 6 TOPS per watt, which is approximately 3x more efficient than the NVIDIA T4 GPUs commonly used for speech workloads[1].

Multilingual Support with Dialect Adaptation

The Unigen transcription server supports five production languages out of the box: English, Spanish, German, Japanese, and Dutch. Whisper’s multilingual architecture, trained on over 5 million hours of labeled and pseudo-labeled audio, provides strong baseline accuracy across all five languages.

However, production call center audio presents challenges that clean-speech benchmarks do not capture: regional dialects, accented speech, telephony-quality audio (8 kHz), background noise, and domain-specific terminology. The Unigen platform addresses these through on-premises fine-tuning with LoRA (Low-Rank Adaptation), which trains only 1-5% of model parameters while achieving accuracy near full fine-tuning. This approach enables:

Spanish dialect adaptation: Caribbean, Argentine, Mexican, and Castilian variants each present distinct phonological patterns. LoRA adapters can be trained and swapped per-call to match the caller’s dialect.
German regional handling: Standard German is well-handled by the base model, while Swiss German and Austrian variants benefit significantly from fine-tuning. Research shows Whisper achieves approximately 21.6% word error rate on Swiss German without fine-tuning.
Japanese dialect support: Standard Tokyo Japanese performs well out of the box, while regional dialects (Kansai-ben, Tohoku) require targeted fine-tuning. Research demonstrates that fine-tuning Whisper for Japanese can reduce character error rates by more than 50%.
Dutch and Flemish: The platform handles both Netherlandic Dutch and Belgian Flemish, with LoRA adapters addressing documented accuracy variations between regional dialects, particularly for speakers from West Flanders and Limburg.

Fine-tuning can be performed on-premises using as little as 8 hours of labeled dialect data, making customer-specific adaptation practical without sending any audio data offsite.
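The small trainable footprint of LoRA follows from its structure: instead of updating a full weight matrix, it trains two low-rank factors alongside the frozen weights. The sketch below counts those parameters for a single adapted matrix. The layer width of 1280 and the ranks are illustrative assumptions, and the whole-model fraction depends on which layers receive adapters.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the frozen d_out x d_in weight matrix."""
    return lora_added_params(d_in, d_out, rank) / (d_in * d_out)


# Illustrative square projection of width 1280; the ranks are assumptions.
for rank in (8, 16, 32):
    share = lora_fraction(1280, 1280, rank)
    print(f"rank={rank}: {share:.2%} of the adapted matrix")
```

Because each adapter is this small, keeping one per dialect and swapping it at call time is far cheaper than storing a separately fine-tuned copy of the full model for each variant.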

GDPR Compliance by Design

On-premises transcription dramatically simplifies compliance with the GDPR and associated national implementations. Rather than managing a complex web of third-party Data Processing Agreements, cross-border transfer mechanisms, and vendor audit requirements, on-premises processing collapses the compliance surface area to a single internal data processing operation.

How On-Prem Addresses Key GDPR Requirements

GDPR Requirement	Cloud Challenge	On-Prem Advantage
Data Minimization (Art. 5)	Audio may be retained by cloud provider for model improvement	Full control over data retention and deletion schedules
Cross-Border Transfers (Art. 44-49)	Requires SCCs, transfer impact assessments, adequacy decisions	Eliminated entirely, data never leaves the jurisdiction
Right to Erasure (Art. 17)	Must coordinate deletion across cloud provider systems	Direct, verifiable deletion from local storage
Data Processing Agreements (Art. 28)	Required with every cloud processor in the data chain	No third-party processors, internal processing only
Breach Notification (Art. 33-34)	Dependent on cloud provider’s detection and notification	Internal monitoring and immediate incident response
DPIA Requirement (Art. 35)	Complex assessment of third-party processing risks	Simplified assessment with full infrastructure control

The system also supports compliance with additional regulatory frameworks relevant to multinational call center operations: HIPAA (healthcare call centers handling Protected Health Information), PCI-DSS 4.0 (financial services call centers processing payment card data), and CCPA (California consumer privacy requirements, which explicitly classify audio recordings as personal information).

Transcription Performance

OpenAI Whisper has established itself as the de facto standard for open-source automatic speech recognition. In September 2025, MLCommons selected Whisper Large-v3 as the official ASR benchmark model for MLPerf Inference v5.1, further validating its position as an industry reference.

Accuracy Across Target Languages

Whisper’s word error rates on clean, read-speech datasets represent best-case accuracy. Real-world call center audio (8 kHz telephony, background noise, diverse accents) typically shows higher error rates, which fine-tuning can significantly reduce.

Language	Whisper Medium	Whisper Large-v2	Whisper Large-v3
English	4-5% WER	3-4% WER	2.7-5% WER
Spanish	5-7% WER	4-6% WER	4-5% WER
German	6-8% WER	5-7% WER	5-6% WER
Japanese (CER)	8-12% CER	6-9% CER	5-8% CER
Dutch	8-12% WER	7-10% WER	6-9% WER

WER = Word Error Rate (lower is better). CER = Character Error Rate (used for Japanese). Benchmarks from FLEURS and Common Voice datasets; actual call center performance varies.

On real-world 8 kHz telephony audio (the standard encoding for call centers), a 2025 Voicegain benchmark across 40 call center recordings found Whisper Large-v3 achieved 86.2% accuracy (13.8% WER), competitive with AWS Transcribe at 87.7% accuracy (12.3% WER) and significantly ahead of Google Video at only 68.4% accuracy.
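Word error rate is the word-level edit distance between a reference transcript and the system output, divided by the number of reference words; the accuracy figures quoted above are simply 1 minus WER. The sketch below is a minimal implementation. It assumes whitespace tokenization and applies no text normalization (casing, punctuation), which published benchmarks usually do apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of five reference words.
wer = word_error_rate(
    "please confirm your account number",
    "please confirm the account number",
)
print(f"WER = {wer:.1%}, accuracy = {1 - wer:.1%}")
```

For Japanese, the same procedure applied to characters instead of words gives the character error rate (CER).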

Hardware: Power Efficiency as Competitive Advantage

The Unigen On-Prem AI Transcription Server leverages EdgeCortix SAKURA-II accelerators, which deliver dramatically better power efficiency than the NVIDIA GPUs used by virtually all competing on-premises transcription solutions.

Accelerator	INT8 TOPS	Power (W)	TOPS/Watt	Typical Cost
Unigen AI	60	10	6	<$1,000
NVIDIA T4	130	70	1.86	$2,000-$3,000
NVIDIA L4	242	72	3.36	$2,500-$3,500
NVIDIA A100 PCIe	624	250	2.50	$10,000-$15,000

For 32 concurrent Whisper streams, the Unigen server’s estimated total power consumption is approximately 400-500 watts (32 SAKURA-II chips across 32 Unigen AI modules at roughly 256W, plus host CPU and system overhead). An equivalent GPU-based setup would require multiple NVIDIA T4 or A100 cards, consuming 1,000-2,500 watts. This 3-5x reduction in power consumption translates directly to lower operating costs and simplified power and cooling infrastructure requirements.
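The TOPS/Watt column in the accelerator table is the INT8 throughput divided by the power draw. The sketch below recomputes it from the table's TOPS and wattage values and expresses each accelerator relative to the NVIDIA T4. The dictionary layout and function name are illustrative.

```python
# name: (INT8 TOPS, power in watts), values from the accelerator table above
accelerators = {
    "Unigen AI (SAKURA-II)": (60, 10),
    "NVIDIA T4": (130, 70),
    "NVIDIA L4": (242, 72),
    "NVIDIA A100 PCIe": (624, 250),
}


def tops_per_watt(tops: float, watts: float) -> float:
    """Power efficiency: throughput per watt of power draw."""
    return tops / watts


baseline = tops_per_watt(*accelerators["NVIDIA T4"])
for name, (tops, watts) in accelerators.items():
    efficiency = tops_per_watt(tops, watts)
    print(f"{name}: {efficiency:.2f} TOPS/W ({efficiency / baseline:.1f}x the T4)")
```

Peak TOPS per watt is a hardware ceiling, so sustained Whisper throughput per watt on a given workload may differ from these ratios.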

Benefits of Unigen AI Transcription Server

Cost Predictability

Cloud transcription costs are linear and perpetual: at typical hyperscaler rates, a 32-channel workload costs approximately $72,000-$96,000 per year, indefinitely. On-premises costs are front-loaded (hardware CapEx plus installation) and then flatten to operating expenses: power, which runs $500 to $900 a year for a 400 to 500W system, and a partial IT staff allocation. By year three, on-premises total cost of ownership is typically 30-50% lower than cloud. By year five, the gap widens further.
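The annual power figure can be sanity-checked by converting a continuous power draw into kilowatt-hours and multiplying by an electricity tariff. In the sketch below, the tariffs of $0.15 and $0.20 per kWh are assumptions chosen for illustration; substitute your facility's actual rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def annual_energy_cost(power_watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost in USD for a system drawing `power_watts` continuously."""
    kwh_per_year = power_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


# Assumed tariffs, for illustration only.
for watts, tariff in [(400, 0.15), (500, 0.20)]:
    cost = annual_energy_cost(watts, tariff)
    print(f"{watts} W at ${tariff:.2f}/kWh: ${cost:,.0f} per year")
```

Under these assumed tariffs the result falls inside the $500 to $900 range quoted above; cooling overhead is not included.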

Zero Data Exposure

The entire platform runs on-premises and is fully air-gapped. Source audio, transcripts, fine-tuned models, and all intermediate processing data never leave your environment. This eliminates IP exposure, third-party vendor risk, and the compliance burden of managing external data processors.

Operational Reliability

On-premises systems operate independently of internet connectivity, cloud provider health, and third-party rate limits. Major cloud providers experience multi-hour regional outages multiple times per year. The Unigen server delivers consistent, predictable performance unaffected by network congestion, geographic distance, or external service disruptions. Modules are hot-swappable, so there is no downtime during hardware upgrades.

Customizable AI Models

The system continuously learns from approved improvements, enabling your organization to build proprietary fine-tuned transcription models over time. Industry-specific vocabularies (financial terminology, medical nomenclature, product names), company-specific jargon, and regional dialect adaptations all become part of your internal intellectual property—not shared with outside vendors or cloud providers. Companies can deploy Whisper medium or large models, selecting the optimal trade-off between accuracy and throughput for their specific workload.

Reduced Latency

Because the Unigen solution spreads work across multiple AI modules, a new incoming call waits less time for a free module than it would for capacity on a small number of large GPUs in a cloud server, or for another cloud server to be added under increased load. The same principles that improve operational reliability also apply to latency: hosting the server on-prem or nearby in a colocation center helps minimize transcription delays during a conversation.

Scalable Architecture

If capacity needs to grow, additional transcription servers can be added at a fixed cost. AI modules can be upgraded when higher-performance solutions are introduced, without replacing the entire server. The E1.S form factor supports hot-swappable modules, enabling capacity changes and hardware upgrades with zero downtime.

Conclusion

AI-powered speech transcription is rapidly becoming essential infrastructure for enterprise call centers and BPOs, but the path to deployment must balance accuracy, cost, security, and regulatory compliance. Cloud-based transcription services create ongoing exposure of sensitive voice data, unpredictable costs that scale linearly with call volume, and a mounting compliance burden across GDPR, HIPAA, PCI-DSS, and regional privacy regulations.

Unigen’s On-Prem AI Transcription Server gives enterprises a secure, private, and financially stable way to adopt state-of-the-art multilingual transcription without sacrificing performance. Companies can bring AI transcription safely in-house by running Whisper on power-efficient EdgeCortix SAKURA-II accelerators. This allows them to accelerate their speech analytics capabilities, safeguard customer data, ensure GDPR compliance across European operations, and keep costs low and predictable.

About Unigen AI Transcription Server: Poundcake-LLM

AI Capabilities

OpenAI Whisper Medium and Large models (up to 1.5B parameters)
32 concurrent real-time transcription streams
5 production languages: English, Spanish, German, Japanese, Dutch
On-premises dialect fine-tuning via LoRA adapters
Approximately $0.006/min/channel amortized cost

Technology

AIC EB202-CP Chassis, Motherboard, 2 x E3.S Boxes, Dual Power Supply
AMD Genoa CPU with 16-48 Cores and AVX Media Decoding
8-16 Unigen E1.S or E3.S AI Modules (up to 32 EdgeCortix SAKURA-II Processors)
256GB DDR5 Unigen RDIMMs
960GB Boot Drive (Data Drives Available)
2 x 1.92TB E1.S Unigen Data Drives
25GbE Networking
Less than 1200 Watts total power consumption
Ubuntu 22.04 Operating System

Compliance Support

GDPR-compliant air-gapped deployment (no cross-border data transfers)
HIPAA-ready infrastructure for healthcare call centers
PCI-DSS compatible architecture for financial services
Active Directory, LDAP, and SSO integration
Role-based access control and audit logging

About Unigen Corporation

Founded in 1991, Unigen is an established global leader in the design and manufacture of OEM products including SSDs, DRAM modules, NVDIMMs, Enterprise IO, and AI solutions. Unigen also offers a full array of Electronics Manufacturing Services (EMS), including design, quick-turn prototyping, new product introduction, volume production, supply chain management, assembly & test, and aftermarket services. Headquartered in Newark, California, the company operates state-of-the-art manufacturing facilities (ISO-9001/14001/13485 and IATF 16949) in the heart of Silicon Valley as well as offshore in Vietnam and Malaysia. Unigen offers its products and services to customers worldwide targeting a broad range of end markets including automotive, computing and storage, embedded, medical, AI, robotics, clean energy, defense, aerospace, and IoT. Learn more about Unigen’s products and services at unigen.com.

Glossary

Air-Gapped: A security measure in which a computer, network, or system is physically isolated from unsecured or public networks (such as the internet), reducing the risk of unauthorized access, data leakage, or cyberattacks.
BPO (Business Process Outsourcer): A company that performs specific business tasks (such as customer service, technical support, or back-office operations) on behalf of other organizations.
Compound Annual Growth Rate (CAGR): The annual rate of return that shows how an investment grows from its beginning value to its ending value over time, assuming reinvested profits.
CCPA: The California Consumer Privacy Act, a state privacy law that gives California residents rights over their personal information, including audio recordings.
GDPR: The General Data Protection Regulation, the EU’s comprehensive data protection law governing how personal data is collected, processed, and stored.
HIPAA: The Health Insurance Portability and Accountability Act, US federal law protecting the privacy and security of patient health information.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that trains a small number of additional parameters on top of a pre-trained model, enabling dialect and domain adaptation without retraining the full model.
PCI-DSS: The Payment Card Industry Data Security Standard, a set of security standards designed to ensure that all companies processing credit card information maintain a secure environment.
Personally Identifiable Information (PII): Any data that can distinguish, trace, or locate an individual’s identity, such as names, social security numbers, or biometric records.
Private Branch Exchange (PBX): A private telephone network used within companies to manage internal calls and connect to the public switched telephone network (PSTN) for external calls.
SIP (Session Initiation Protocol): A signaling protocol used for initiating, maintaining, and terminating real-time communication sessions including voice calls.
Speaker Diarization: The process of partitioning audio recordings into segments based on speaker identity, essentially answering “who spoke when”.
Whisper: An open-source automatic speech recognition model developed by OpenAI, capable of multilingual transcription across 99 languages.
WER (Word Error Rate): A standard metric for evaluating speech recognition accuracy, calculated as the number of insertions, deletions, and substitutions divided by the total number of words in the reference transcript.

Sources


Guide to On-Prem AI Coding Servers

Brett Patrick — Wed, 11 Feb 2026 23:22:38 +0000

AI coding assistants have become essential to modern software development, but cloud-based tools create risks for small and medium businesses (SMBs). These risks include exposing source code, increasing operational costs, and complicating compliance. An on-prem AI Coding Server provides the speed and productivity of cloud AI tools while keeping all source code, training data, and fine-tuned models securely inside a company’s environment. With predictable costs, air-gapped deployment, and a developer-friendly UX, software teams can write code with AI assistance safely and confidently.

Why AI Coding Tools Are Becoming Essential

It’s no secret that software teams are becoming increasingly reliant on AI coding assistants. In fact, according to SecondTalent, 41% of all code in 2025 is AI-generated or AI-assisted, and 76% of professional developers either use AI coding tools or are planning to adopt them soon. These AI coding assistants are popular for good reasons. They dramatically increase software development speed, improve accuracy, and free engineers from repetitive tasks. For example, one study found that developers using GitHub Copilot completed tasks up to 55% faster. As reliance on these AI tools grows, companies will require air-gapped coding servers capable of running generative AI models with a minimum of 20 billion parameters.

Image Source: SecondTalent

While the main cloud-based coding large language models (LLMs) that exist today (like Anthropic’s Claude) are capable of sophisticated code generation, Unigen is charting a different path. Unigen is building a new generation of multi-module servers using AI modules with inference silicon that can provide efficient performance in a low-power, economical, Open Compute Project (OCP) format. These systems can support generative AI LLMs, such as OpenAI’s GPT-OSS-20B.

Image Source: Fusion Chat

The Issue with Cloud-Based Coding

There are several issues corporations must consider before allowing their software teams to use the cloud for their coding.

Security

Every line of code is intellectual property (IP). Letting this code leave the company’s own environment for a shared cloud environment can directly expose a company’s deepest secrets to others who may not have the best intentions.

In fact, even major providers have stumbled. In 2024, Microsoft Copilot inadvertently exposed private GitHub repositories from 16,000+ organizations, leaking over 300 credentials and 100 internal software packages through Bing’s caching system.

Image Source: Grok

IP Contamination

When generative models are trained on or interact with mixed datasets, there is a risk that the company’s code becomes entangled with open-source code under unclear licenses. This can leave a company shipping open-source code without clean title to the IP.

According to TechTarget, “coding assistants might generate large chunks of licensed open source code verbatim, which leads to IP contamination in the new codebase.”

Expanding Costs

The cost of using cloud AI computing resources continues to grow with no built-in ceiling. On top of that, many teams find that their production costs are much higher than they expected before a critical project is complete.

According to Ficus Technologies, in 2025, companies spent $30,000 to $80,000 per year just to keep models running at scale. Additionally, there are hidden costs like MLOps operations and model retraining.

Image Source: Ficus Technologies

Solution: A Secure On-Prem Alternative

One solution to these issues is to contain the AI coding to an air-gapped, on-prem server. This solution protects source code and IP from being intercepted by bad actors. Additionally, an on-prem server can have its code managed and contained to prevent it from using any other data or Gen AI solutions except those that are managed by the corporation and its IT experts. Finally, through a wholly owned or leased server, the costs are well understood at the outset, eliminating surprise token spikes or compute overage charges. The result is a cloud-quality AI coding platform that delivers security, control, and financial clarity.

How the Unigen AI Coding Server Works

IT Setup and Configuration of AI Coding Server

Deployment begins with standard enterprise hardware procedures. IT teams unpack and mount the server hardware, then connect it to the local private network according to organizational standards. Network security configuration follows established enterprise practices, including firewall rules, VLAN segmentation, and port allowlisting, to ensure the server operates within defined security boundaries. The system arrives with a pre-installed operating system and software stack designed to simplify diagnostics and streamline installation of the AI coding environment.

The server integrates directly with your organization’s identity management infrastructure, supporting Active Directory, Lightweight Directory Access Protocol (LDAP), and single sign-on (SSO) solutions to maintain consistent authentication protocols across your environment. IT administrators establish resource allocation policies that define parameters such as concurrent user limits and token consumption thresholds, ensuring optimal performance and fair resource distribution. Access privileges are granted based on specific project requirements, allowing granular control over who can utilize the AI coding capabilities. Security policies and data governance rules are configured to align with organizational compliance requirements, while code repository access and permissions are established to control how the AI interacts with existing codebases. Following configuration, IT performs connectivity testing and initial validation to ensure the system is ready for production use.
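A resource allocation policy of the kind described above can be pictured as a small admission check run before each request. The sketch below is purely illustrative: the `ResourcePolicy` class, its fields, and the `admit_request` function are hypothetical and do not represent the Unigen server's configuration format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePolicy:
    max_concurrent_users: int  # hypothetical cap on simultaneous sessions
    daily_token_limit: int     # hypothetical per-user token threshold


def admit_request(
    policy: ResourcePolicy,
    active_users: int,
    tokens_used_today: int,
    tokens_requested: int,
) -> tuple[bool, str]:
    """Return (allowed, reason) for a new request under the policy."""
    if active_users >= policy.max_concurrent_users:
        return False, "concurrent user limit reached"
    if tokens_used_today + tokens_requested > policy.daily_token_limit:
        return False, "token threshold exceeded"
    return True, "ok"


# Made-up limits, for illustration only.
policy = ResourcePolicy(max_concurrent_users=20, daily_token_limit=500_000)
print(admit_request(policy, active_users=5, tokens_used_today=100_000, tokens_requested=4_000))
print(admit_request(policy, active_users=20, tokens_used_today=0, tokens_requested=4_000))
```

Returning a reason alongside the decision makes it easy to log why a request was refused, which supports the audit requirements discussed elsewhere in this guide.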

User Experience

The developer experience is designed to be intuitive and familiar, requiring minimal training or adjustment to existing workflows.

On-prem AI Coding Server Process Flow

Onboarding

Users begin by installing a lightweight extension for their preferred integrated development environment (IDE) on their local client machine. Authentication occurs through a login process using credentials provided by IT for the AI Coding Server, leveraging either SSO or standard enterprise credentials to maintain security consistency. Once authenticated, developers interact with the Coding Agent through a chat interface reminiscent of ChatGPT or Cursor, positioned conveniently within their IDE workspace.

AI Coding Configuration

Developers configure their workspace preferences and model parameters according to their specific needs and coding style. The system requires explicit permission grants for accessing the tools and code involved in the current workflow, including file system access, Git repository interaction, and terminal command execution. This permission model ensures transparency and maintains security boundaries while enabling the AI to function effectively.

Code Generation Planning

When a developer poses a question or task to the Unigen AI Code Agent, the system demonstrates its capabilities primarily in Python, TypeScript, and JavaScript. The AI analyzes the existing codebase context and tools to understand the project structure and conventions. It then creates a comprehensive plan of action outlining the optimal approach to accomplish the requested task.

Human Review

Before any changes are applied, developers can preview proposed modifications in a side-by-side view, allowing careful review of what the AI intends to change. Developers maintain complete control, approving changes or requesting adjustments through additional prompts to refine the output.

Built-In Testing and Automation

The system automatically generates unit test cases to verify code correctness and functionality, but all code execution requires explicit user approval at every step. This human-in-the-loop approach ensures developers maintain oversight throughout the coding process. Generated code undergoes security scanning, with results presented for review before integration into the codebase. This iterative process of requesting, reviewing, and refining continues as needed, creating a collaborative relationship between developer and AI agent.
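The human-in-the-loop rule described above (nothing executes without explicit approval) can be sketched as a gate between the agent's proposals and the executor. Everything here is a hypothetical illustration: `run_with_approval` and the callback signatures are assumed names, not part of the Unigen product.

```python
from typing import Callable


def run_with_approval(
    proposed_commands: list[str],
    approve: Callable[[str], bool],
    execute: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Execute only the commands the user approves; record every outcome."""
    results = []
    for command in proposed_commands:
        if approve(command):
            results.append((command, execute(command)))
        else:
            results.append((command, "skipped: not approved"))
    return results


# Stand-ins for a real approval prompt and a real executor.
approved = {"pytest tests/"}
outcome = run_with_approval(
    ["pytest tests/", "rm -rf build/"],
    approve=lambda cmd: cmd in approved,
    execute=lambda cmd: f"ran: {cmd}",
)
for command, status in outcome:
    print(f"{command!r} -> {status}")
```

Keeping the approval decision in a separate callback means the same loop works whether approval comes from an IDE dialog, a chat prompt, or a pre-approved allowlist set by IT.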

Benefits of Unigen AI Coding Server

Unigen’s AI Coding Server offers a set of advantages designed with SMB software engineering teams in mind.

Poundcake LLM and Tiramisu E3.S AI Module

The system continuously learns from approved improvements, enabling your company to build proprietary fine-tuned AI coding agents over time. To further enrich this knowledge base, the platform supports the integration of code from pre-approved outside sources and libraries, allowing you to leverage industry-standard frameworks within your secure environment. This means your organization’s workflows, coding standards, and architectural preferences become part of your internal IP and aren’t shared with outside vendors or cloud models. Because the entire platform runs on-prem and is fully air-gapped, source code never leaves your environment, eliminating IP exposure and compliance risk.

Cost predictability is another key benefit. Unlike cloud-based AI tools with unpredictable token usage or abrupt pricing changes, Unigen provides a stable cost structure whether you lease or purchase the hardware. Development performance matches popular cloud tools such as Cursor, Windsurf, and Kiro, but with unlimited tokens and no API errors caused by network or rate-limit issues.

The platform significantly improves engineering efficiency, saving both senior and junior developers 10+ hours per week through automated unit testing, documentation, and CI/CD assistance. Productivity increases of 26% – 40% are common, contributing to a strong return on investment, often estimated at 30x or more. If a customer wants to expand their capabilities, they can add additional coding servers at a fixed cost or upgrade their AI modules when higher performance solutions are introduced. Modules are hot-swappable so there is no downtime during upgrade.

Finally, companies can customize the system’s AI agents or deploy specialized LLMs up to 20B parameters, enabling tailored workflows for unique software development needs.

Conclusion

The use of AI coding assistants is expected to grow significantly as the technology matures, but security, IP control, and cost stability must be prioritized. Unigen’s AI Coding Server gives companies a secure, private, and financially stable way to adopt state-of-the-art AI coding capabilities without sacrificing performance or developer experience. By bringing AI safely in-house, companies can accelerate software development, safeguard their IP, and keep costs low and predictable.

About Unigen’s Secure AI Coding Server: Poundcake-LLM

AI Capabilities

Up to 20B-parameter Generative AI (LLM/VLM)
240 tokens/sec with 16 Unigen Tiramisu modules

Technology

AIC EB202-CP Chassis, Motherboard, 2 x E3.S Boxes, Dual Power Supply
AMD Genoa CPU with 16-48 Cores and AVX Media Decoding
8 – 16 Unigen E3.S Tiramisu AI Modules (up to 32 EdgeCortix SAKURA-II Processors)
256GB DDR5 Unigen RDIMMs
960GB Boot Drive (Data Drives Available)
2 x 1.92TB E1.S Unigen Data Drives
25GbE Networking
Less than 1200 Watts total power consumption
Ubuntu 22.04 Operating System

About Unigen Corporation

Glossary

Air-Gapped: A security measure in which a computer, network, or system is physically isolated from unsecured or public networks (such as the internet). This separation reduces the risk of unauthorized access, data leakage, or cyberattacks.
ChatGPT: An AI-powered language model developed by OpenAI that can understand and generate human-like text. It is used for tasks such as drafting content, answering questions, generating code, and assisting with research.
Cursor: An AI-enhanced code editor that predicts your next edit, answers questions about your codebase, and writes or modifies code using natural-language prompts.
Generative AI (GenAI): A type of artificial intelligence designed to create new content such as text, images, music, or even code by learning patterns from existing data.
Git Repository: A version-controlled storage location that contains a project’s files as well as the complete history of changes. It enables collaborative development, tracking of modifications, branching, and rollback to previous versions.
Integrated Development Environment (IDE): Software that combines commonly used developer tools into a compact GUI (graphical user interface) application. It typically brings together a code editor, compiler, and debugger with an integrated terminal.
Intellectual Property (IP): Creations of the mind, including inventions, literary and artistic works, designs, symbols, names, and images used in commerce. IP is protected by law (e.g., patents, copyrights, trademarks) to provide recognition and financial benefit to creators.
JavaScript: A dynamic, high-level programming language commonly used to build interactive and responsive features on websites and web applications.
Kiro: AWS’s AI-powered Integrated Development Environment (IDE). Unlike tools such as Cursor or Windsurf, Kiro is specification-driven: it converts prompts into requirements, designs, and validated code.
Large Language Models (LLMs): Advanced machine-learning models trained on vast amounts of text data to understand, generate, and manipulate natural language. They support tasks such as summarization, reasoning, coding, translation, and conversational interaction.
Lightweight Directory Access Protocol (LDAP): A protocol used to access and manage directory information over a network. It provides a lightweight alternative to the X.500 directory service, enabling centralized authentication, user management, and resource lookup.
On-Premises (On-Prem): Software, hardware, or infrastructure that is installed and operated within an organization’s physical location, offering increased control over data, privacy, and customization compared to cloud-hosted solutions.
Open Compute Project (OCP): An open-source initiative that develops and shares designs for energy-efficient, scalable data center hardware. OCP promotes innovation and cost savings in server, storage, and networking infrastructure.
Python: A high-level, general-purpose programming language known for its readability, simplicity, and extensive ecosystem. It is widely used in fields such as web development, automation, data science, and AI.
Single Sign-On (SSO): An authentication method that enables users to log in once and gain access to multiple systems or applications without re-entering credentials.
TypeScript: A typed superset of JavaScript that adds static type checking and modern language features. It compiles to JavaScript and improves maintainability and reliability in large codebases.
User Experience (UX): The overall quality of a user’s interaction with a product, system, or service, including ease of use, accessibility, efficiency, and satisfaction.
VLAN Segmentation: A network design technique that divides a physical network into multiple virtual local area networks (VLANs). This enhances performance, improves security, and isolates traffic between groups of devices.
Windsurf: A company that provides an AI-powered code editor designed to help developers write, understand, and modify code more efficiently through intelligent automation.

The post Guide to On-Prem AI Coding Servers appeared first on Unigen.

How On-Farm AI Is Giving Farmers an Edge in Early Disease Detection

Brett Patrick — Tue, 14 Oct 2025 17:59:35 +0000

The Problem: Delayed Illness Detection in Livestock

As most people in the agriculture industry know, livestock living in close quarters creates an environment where diseases can spread quickly. Swine producers must constantly be on the lookout for the spread of respiratory illness, which, according to MACSO, accounts for 60% of swine deaths globally.

Traditional methods of detecting disease, such as visual inspections and veterinary checkups, often fail to catch the early and more subtle signs of illness. Typically, by the time a swine’s symptoms are noticeable, the illness has already begun to spread, causing massive ripple effects.

Economic Impact

In China, swine diseases are projected to cause 1.40 – 2.07% losses to GDP ($186 – 286 billion). In the US, a Porcine Reproductive and Respiratory Syndrome Virus (PRRS) outbreak would cost $50 billion in direct and indirect losses.

Livestock and Human Health Impact

Not only is the spread of disease costly, but it also undermines farmers’ efforts to reduce antibiotic use, which is intended to lower the risk of bacteria becoming resistant to treatment (known as antimicrobial resistance). Resistant bacteria threaten not only swine but also humans, since crossover diseases in humans are also treated with antibiotics. Humans would also be at greater risk from consuming contaminated meat products.

Why Traditional Methods of Illness Detection are Impractical

Visual Inspection

Visual inspections across multiple large barns, which often hold thousands of swine, are difficult to manage, making human observation alone an ineffective way to consistently spot a sick animal. Additionally, there aren’t enough trained personnel to complete these inspections. The number of farms and farm workers is declining, while livestock populations continue to rise.

Frequent Vet Visits

Periodic checkups from a veterinarian can be helpful, but this method of disease detection often fails to catch illness early because checkups are too infrequent. The cost of increasing the frequency of these visits will add up quickly, and costs may be passed down through the supply chain to the end customer.

On-Farm AI Is Giving Farmers an Edge in Early Disease Detection

A new solution addresses the challenge swine producers face in balancing rising demand for pork against fewer veterinary resources and increasing pressure to reduce antibiotic use. Adding an on-premises AI inference server and AI-enabled audio sensors to the barn gives farmers the tools to detect illness early and act quickly without relying on cloud connectivity.

Image Source: EdgeIR

Training AI to Hear What Farmers Can’t

These audio-based AI sensors listen for signs of distress, coughing patterns, or environmental changes that may indicate signs of illness. Integrated with an AI inference server, the system enables early illness detection and real-time herd monitoring. In fact, MACSO studies have shown that this method can identify respiratory problems in pigs days before human detection would be possible.
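To make the alerting idea concrete, here is a minimal sketch that flags hours in which a cough count rises well above a rolling baseline. This is not MACSO’s or Unigen’s detection method, which is not described in this article. The function name, the 24-hour window, and the threshold are all assumptions chosen for illustration, and the counts are made-up sample data.

```python
from statistics import mean, stdev


def cough_alert(hourly_counts, window=24, threshold=3.0):
    """Return indices of hours whose count exceeds the rolling baseline.

    The baseline is the mean of the preceding `window` hours; an hour is
    flagged when it exceeds mean + threshold * standard deviation.
    """
    alerts = []
    for i in range(window, len(hourly_counts)):
        baseline = hourly_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Floor sigma at 1.0 so a perfectly flat baseline does not make
        # every tiny change look like an outlier.
        if hourly_counts[i] > mu + threshold * max(sigma, 1.0):
            alerts.append(i)
    return alerts


# Hypothetical data: a stable day of 4 to 6 coughs per hour, then a spike.
counts = [4, 6] * 12 + [5, 20]
print(cough_alert(counts))  # [25]
```

A production system would classify the audio itself with a trained model on the inference server. The sketch only shows how per-barn counts could be turned into an early-warning signal.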

AI That Stays on the Farm

To power this smart monitoring system, audio-based AI sensors are deployed across the barn, and they connect locally to an AI inference server. By using an on-premises solution and avoiding cloud dependency, farms gain reliability, speed, and data privacy, especially in areas with limited internet access.

In addition to detecting disease, the same AI inference server can also support other farm management applications, including feed optimization, environmental control, and behavioral tracking.

Turning Real-Time Insights into Results

For large-scale operations, early detection and rapid response can deliver measurable returns. Lower mortality rates, reduced antibiotic usage, and fewer disruptions all contribute to a stronger bottom line. By keeping compute and analytics on-site, this new AI solution helps farms act faster, operate more efficiently, and make better-informed decisions.

Key Takeaways

On-farm AI is changing how farmers detect and manage livestock disease, making detection faster, more accurate, and less expensive. By combining MACSO’s Integrated Audio Sensors with Unigen’s Edge Computing, we are proud to offer a solution that helps farmers detect animal illnesses earlier.

Early disease detection helps stop outbreaks before they spread
Audio-based AI can detect illness days earlier than visual inspections
On-premise AI systems work without relying on cloud connectivity
Faster detection means lower mortality, reduced antibiotic use, and higher profitability

Quick Answers

What is on-farm AI?
On-farm AI refers to artificial intelligence systems that operate directly at the farm, using local sensors and servers to analyze data in real time without sending it to the cloud.

How does AI detect illness in livestock?
AI-enabled audio sensors listen for coughing, distress sounds, and changes in herd behavior. These patterns are analyzed by an on-site AI inference server to identify early signs of disease.

Why is early disease detection important for farmers?
Catching illness early reduces disease spread, lowers mortality rates, minimizes antibiotic use, and helps farmers avoid costly outbreaks.

Does on-farm AI require internet access?
No. Because the AI inference server runs on-premises, farms can monitor livestock health reliably even in areas with limited or no internet connectivity.

Is this technology only for swine farms?
While this article focuses on swine production, similar AI-based monitoring systems can be applied to poultry, cattle, and other livestock operations.

About Unigen Corporation

Founded in 1991, Unigen is an established global leader in the design and manufacture of OEM products including SSDs, DRAM modules, NVDIMMs, Enterprise IO and AI solutions. Unigen also offers a full array of Electronics Manufacturing Services (EMS), including design, quick-turn prototyping, new product introduction, volume production, supply chain management, assembly & test, and aftermarket services. Headquartered in Newark, California, the company operates state-of-the-art manufacturing facilities (ISO-9001/14001/13485 and IATF 16949) in the heart of Silicon Valley as well as offshore in Vietnam. Unigen offers its products and services to customers worldwide targeting a broad range of end markets including automotive, computing and storage, embedded, medical, AI, robotics, clean energy, defense, aerospace, and IoT. Learn more about Unigen’s products and services at unigen.com.

About the Author

Jeffrey Schmitz is Senior Director of Strategic Accounts at Unigen Corporation, where he leads initiatives in AI-enabled edge computing, memory, and storage solutions. With a focus on OEM partnerships and real-world applications of AI, he’s bringing Unigen’s innovative technologies into industries such as agriculture, healthcare, and smart cities.

Improving the Efficiency of Medical Image Analysis with AI at the Edge

Brett Patrick — Tue, 03 Jun 2025 23:05:49 +0000

By EOVIsion.ai, GenUI, and Unigen

AI medical image analysis is transforming healthcare by enhancing diagnostic speed, accuracy, and cost-efficiency without replacing medical professionals. Traditional cloud-based workflows for analyzing large pathology images are often slow, expensive, and resource-intensive. A new on-premise AI inference approach leverages high-performance, low-power servers positioned directly within hospitals and diagnostic centers to dramatically reduce latency, lower costs, and strengthen data privacy. By processing ultra-high-resolution images locally and integrating an intuitive user interface, this solution enables healthcare providers to deliver faster diagnoses (often in under 30 minutes) while maintaining clinical oversight and improving patient outcomes.

Advancements in medical image analysis

Medical image analysis represents one of the most revolutionary advancements in healthcare, significantly improving the diagnosis and treatment of a myriad of conditions. The field has seen rapid technological innovation since the first X-ray in 1895, progressing to more sophisticated 3D imaging methods like CT scans and MRI technology. Today, with the advancement of Artificial Intelligence (AI), the capabilities of healthcare professionals can be enhanced (not replaced) to improve productivity and diagnostic accuracy. We are already seeing the benefits of AI technology, and we are just beginning to scratch the surface.

The problem: slow and expensive medical image analysis

Every day, pathologists and researchers spend thousands of hours screening images for abnormalities or cells that could be signs of a disease such as cancer. Not only is these specialists’ time a very expensive resource, but the process also delays the diagnosis for the physician and the patient. The current solution to this issue requires sending large images to data centers, where the images are tiled into thousands of smaller images and inspected by powerful servers, with the data stored in the data centers’ data lakes. This process is also problematic because it adds latency to transmit the data and incurs extra costs for data storage and for spinning up and taking down costly GPU server instances in the cloud to perform each of the computational steps.

A new approach to medical image analysis is emerging, aiming to establish the foundation for how clinical decisions are made across the healthcare industry. One such platform enables the detection and identification of anomalous cells with speed and precision, empowering healthcare institutions to improve diagnosis and deliver more focused treatments. While the platform delivers results efficiently through its secure cloud-based system, some institutions may opt to incorporate new server technology for on-premise AI inference to further reduce latency, strengthen data privacy, or manage long-term cloud-related costs. These options can provide additional value depending on institutional needs and operational preferences.

A more efficient solution: on-premise AI inference for medical image analysis

A new solution to the problem of slow and expensive medical diagnostic imaging is to add an inference server equipped with high-performance AI inference modules. These modules are already being adopted for high-volume security camera operations and can now be adapted for on-premise operation in hospitals and diagnostic centers.

AIC’s EB202-CP Server and Unigen’s Biscotti AI E1.S Module

These servers, which use less than 20% of the power of a GPU server, can be co-located near the diagnostic imaging microscope to avoid the issues of transmission and costs associated with data ingress/egress and cloud storage. This solution, which costs less than 20% of a GPU server, can significantly shorten the time for a diagnosis from hours or even days to less than 30 minutes.

Copying huge files to a local server, rather than sending them over the internet, reduces latency and simplifies the data path. It also affords the luxury of a large-memory server and ultrafast NVMe solid-state drive storage, which can tile 100Kx100K pixel images into smaller 640×640 pixel frames and process them on the AI inference modules through 32 lanes of PCIe bandwidth directly to memory, significantly reducing computation time. A diagnostician can also connect to the server through a VPN (Virtual Private Network) to securely view the images and see whether there are anomalous cells (highlighted by red YOLO bounding boxes) or an all-clear signal for a healthy image.
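To give a sense of scale for the tiling step, the following sketch counts how many 640×640 frames a 100Kx100K image produces. It assumes "100K" means 100,000 pixels per side and that tiles do not overlap. Neither detail is stated in the source, and the function names are illustrative.

```python
import math


def tile_grid(width, height, tile=640):
    """Return (columns, rows, total tiles) for non-overlapping tiling.

    Tiles on the right and bottom edges are partial and would need padding.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows


def tile_origins(width, height, tile=640):
    """Yield the top-left (x, y) pixel coordinate of each tile, row by row."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y


cols, rows, total = tile_grid(100_000, 100_000)
print(cols, rows, total)  # 157 157 24649
```

Under these assumptions a single slide becomes roughly 25,000 inference requests, which is why keeping the image, the tiling, and the AI modules on the same server avoids so much data movement.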

Increasing accessibility through intuitive user interface

In addition to dramatically reducing the latency and cost of medical diagnosis, this product also makes it easy for a physician to see the result. A custom-built User Interface (UI) dedicated to medical health specialists turns it from a device into a complete solution. Through the addition of a state-of-the-art visual agent, the solution presents the information intuitively, giving clinicians straightforward access to and navigation through the images.

Conclusion

In conclusion, when companies with disparate technological core competencies (like EOVIsion.ai, GenUI, and Unigen) work together using the latest tools towards a common cause, the result can be much greater than the sum of its parts. Our goal isn’t simply about building the latest and greatest technology. We are passionate about designing solutions that can impact people’s everyday lives. By using our combined technology, we are not just improving the efficiency of medical diagnostic imaging. We’re making sure every patient can get the timely and accurate diagnosis they deserve.

Key Takeaways: On-Prem AI for Medical Image Analysis

Faster Diagnoses: On-premise AI inference reduces turnaround time from hours or days to under 30 minutes.
Lower Costs: High-performance inference servers use less than 20% of the power and cost of traditional GPU cloud servers.
Improved Data Privacy: Local processing minimizes data transmission and reduces cloud storage and egress fees.
Enhanced Clinical Productivity: AI highlights anomalous cells to support pathologists, increasing efficiency without replacing expertise.
User-Friendly Interface: A dedicated medical UI simplifies image navigation and interpretation for healthcare professionals.

Glossary

AI Medical Image Analysis: The use of artificial intelligence to examine medical images (e.g., pathology slides, CT scans) to detect abnormalities and support diagnosis.
Inference Server: A server optimized to run trained AI models and generate predictions from new data.
GPU (Graphics Processing Unit): A specialized processor often used in AI workloads due to its high parallel processing capability.
On-Premise: Infrastructure hosted locally within a hospital or facility rather than in the cloud.
Latency: The delay between sending data and receiving results.
NVMe SSD: A high-speed solid-state storage device that enables rapid data access and processing.
PCIe (Peripheral Component Interconnect Express): A high-bandwidth hardware interface used to connect components like AI modules to a server.
Tiling: Breaking large medical images into smaller segments for AI processing.
VPN (Virtual Private Network): A secure connection that allows remote access to systems and data.
YOLO (You Only Look Once): A real-time object detection algorithm often used to identify and highlight features within images.

TPU Inference Servers for Efficient Data Centers

Brett Patrick — Wed, 18 Sep 2024 16:23:20 +0000

According to the Annual Electricity Report from the International Energy Agency (IEA): “After globally consuming an estimated 460 terawatt-hours (TWh) in 2022, data centres’ total electricity consumption could reach more than 1000 TWh in 2026. This demand is roughly equivalent to the electricity consumption of Japan.” A large fraction of this power can be attributed to the growth of Artificial Intelligence, where training and inference operations require trillions of calculations. While Graphics Processing Units (GPUs), with their Floating Point Units (FPUs), are required for the training portion of AI, FPUs are not required for inference, where Tensor Processing Units (TPUs) are a better fit and are up to 85% more power efficient. Since inference is up to 90% of the work done by an AI data center, inference-only servers using TPUs may help to relieve some of the power consumption concerns highlighted by the IEA.

Summary

This whitepaper explores the concept of developing data centers that are solely focused on AI inference. Up to 90% of AI operations are inference, versus 10% training. Training requires specialized processing to create the neural networks that are then used for inference operations. Training is the primary driver for the power requirements mentioned by the IEA above. On the flip side, inference can be done much more power efficiently. The benefits of developing inference-only data centers can be significant:

Reduced initial cost for inference servers compared with training servers
Reduced Total Cost of Ownership (TCO) over the lifetime for inference servers
Inference servers with TPUs can be air-cooled, avoiding expensive and difficult to deploy liquid cooling schemes
Data centers with air-cooled servers use far less resources, reducing strain on local power and water

The sections below will compare the different requirements for cooling, electrical systems, HVAC, power and the infrastructure between training servers and inference servers.

The Problem

Training vs. Inference

Training requires semiconductors with floating-point units (FPUs). These are used for complex calculations that take many clock cycles to complete. GPUs were originally used for calculating graphics shaders and were later adopted for general-purpose GPU computing. GPUs are more specialized than CPUs but are fairly large logical blocks that use a significant amount of energy. In contrast, inference requires 8-bit integer (INT8) calculations. These are far simpler and extremely repetitive tensor/matrix multiplications that can be performed by logical blocks designed specifically for the operation, completing in very few clock cycles.

Tensor Processing for inference is more efficient per clock cycle vs. GPU or CPU Processing

Image Source: Opengenus Foundation

Training is used to create a neural network. Then, once it has been created, inference is used for all subsequent tasks (e.g. a model for identifying cars or a Large Language Model (LLM) for handling records in an accounting office). A GPU that is used for training can have over two thousand shader units/FPUs. Tensor Processing Units (also called neuromorphic units, AI Accelerators, etc.) can have 10k modules or greater available for matrix multiplication. Current TPUs have performance of up to 50 trillion operations per second (TOPS) with 12 TOPS/Watt. Current GPUs (although required for training) can be less efficient at inference tasks and may only provide 1 to 2 TOPS/Watt, requiring 6 to 12 times as much power to achieve the same inference performance.
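The "6 to 12 times as much power" figure follows directly from the efficiency numbers quoted above. A short sketch of the arithmetic, with illustrative function names:

```python
def watts_needed(tops_required, tops_per_watt):
    """Power needed to sustain a given throughput at a given efficiency."""
    return tops_required / tops_per_watt


def power_ratio(tpu_tops_per_watt, gpu_tops_per_watt):
    """How many times more power the GPU needs for the same TOPS."""
    return tpu_tops_per_watt / gpu_tops_per_watt


# Figures from the text: TPU at 12 TOPS/Watt, GPU at 1 to 2 TOPS/Watt.
print(power_ratio(12, 2), power_ratio(12, 1))  # 6.0 12.0

# Power for a 50 TOPS workload on each kind of device.
print(round(watts_needed(50, 12), 2))              # 4.17
print(watts_needed(50, 2), watts_needed(50, 1))    # 25.0 50.0
```

Because throughput cancels out, the ratio depends only on the two efficiency figures, so it holds for any workload size.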

Cooling and Power

Die Cooling

Due to the increased power usage of GPUs during both training and inference operations, they require an extraordinary amount of cooling to remain at operational temperatures. Dissipating 250 Watts of heat from a 20x20mm die area is a difficult technical problem and generally requires some kind of liquid cooling (or even immersion cooling).

Image Source: AnandTech

Conversely, inference-only silicon may dissipate only 4 or 5 Watts from a 5x5mm die area due to lower frequencies. With only 2% of the power spread over one-sixteenth of the die area (25 mm² versus 400 mm²), the heat per square millimeter is about a third of the GPU’s or less and can be handled by conventional air-cooled heat sinks.
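The heat-density comparison can be checked with a few lines of arithmetic. The sketch below uses the die sizes and power figures quoted in this section, taking 5 Watts as the upper figure for the inference die and treating both dies as squares. The function name is illustrative.

```python
def heat_density(watts: float, die_edge_mm: float) -> float:
    """Heat flux in Watts per square millimeter for a square die."""
    return watts / (die_edge_mm ** 2)


gpu_density = heat_density(250, 20)  # 250 W over 400 mm^2
tpu_density = heat_density(5, 5)     # 5 W over 25 mm^2

print(f"GPU die:        {gpu_density:.3f} W/mm^2")   # 0.625
print(f"Inference die:  {tpu_density:.3f} W/mm^2")   # 0.200
print(f"Power ratio:    {5 / 250:.0%}")              # 2%
print(f"Area ratio:     {25 / 400:.4f}")             # 0.0625
print(f"Density ratio:  {tpu_density / gpu_density:.2f}")  # 0.32
```

The smaller die concentrates its heat into a smaller area, so the density advantage (about 3x) is much smaller than the raw power advantage (50x), but it is still low enough for a conventional heat sink.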

Servers/Racks

Current training servers using GPUs may need to cool up to 8 GPUs at 700 Watts/GPU for a modern enterprise GPU built for general-purpose operation in a 4U server chassis. At 5600 Watts for the GPUs and an additional amount for the CPUs and memory, this can add up to 8 kW/server. In a 48U rack configuration, that amounts to 96 kW per rack for 12 servers, providing 192 KTOPS/rack. The chilled liquid is pumped through the racks and servers and then sent to refrigerated chillers to be recirculated. The chillers use either evaporated water from the infrastructure or specific equipment that also needs to be cooled to operate efficiently. All of this requires custom heat sinks, custom cooling liquid, custom motherboards, and custom dual power for the racks, since it exceeds the normal budget of 65 kW for a 48U rack.

An inference server with 64 TPUs uses up to 250 Watts for up to 3200 TOPS (configuration with Unigen Biscotti AI modules). That heat is spread out over the entire width of a 1U or 2U server and can be cooled efficiently with laminar flow fans blowing air over E1.S or E3.S enclosures. These servers can be arranged in either hot- or cold-aisle containment arrangements, with the heat managed through returns to simpler air conditioning or less costly evaporative cooling units. Using less than 750 Watts per server, a 48U rack would use less than 35 kW for 153 KTOPS. At well under 65 kW, this setup can use standard power and does not require any custom server hardware, significantly reducing overall costs.
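The rack-level figures in the two paragraphs above can be reproduced with simple arithmetic. The sketch below assumes 48 1U inference servers per rack, which is inferred from the 153 KTOPS figure rather than stated in the source, and applies the 2 TOPS/Watt GPU efficiency quoted earlier in this paper to the GPU rack.

```python
def rack_power_kw(servers, watts_per_server):
    """Total rack power in kilowatts."""
    return servers * watts_per_server / 1000


# GPU training rack: 12 x 4U servers at up to 8 kW each.
gpu_rack_kw = rack_power_kw(12, 8000)
gpu_rack_ktops = gpu_rack_kw * 2          # kW x TOPS/Watt = KTOPS

# Inference rack: assumed 48 x 1U servers at 3200 TOPS each.
inf_servers = 48
inf_rack_ktops = inf_servers * 3200 / 1000
inf_rack_kw_ceiling = rack_power_kw(inf_servers, 750)  # at the 750 W ceiling
watts_at_35kw = 35_000 / inf_servers      # per-server draw implied by 35 kW

print(gpu_rack_kw, gpu_rack_ktops)          # 96.0 192.0
print(inf_rack_ktops, inf_rack_kw_ceiling)  # 153.6 36.0
print(round(watts_at_35kw, 1))              # 729.2
```

At the full 750 Watt ceiling the inference rack would draw 36 kW, so the sub-35 kW rack figure corresponds to servers averaging about 729 Watts or less. Either way the rack stays well inside the 65 kW budget.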

Heating, Ventilation and Air Conditioning (HVAC)

Once the heat has been drawn away from the training servers and racks, the liquid or air needs to be cooled. With liquid cooling, a data center requires advanced piping and pumps to bring the heated liquid to heat exchangers in the HVAC cooling systems, which then need to be cooled using evaporative cooling or refrigeration. Standard air-cooling can use vents and returns to channel heated air to standard air conditioners at a significantly reduced capital expense. (source)

Image Source: Western Cooling Efficiency Center, UC Davis

Image Source: Medium

Cost

By using standard off-the-shelf commodity servers built by high-quality and high-volume manufacturers, the cost of the AI server hardware can be lowered dramatically. The main components of a storage server and an AI inference server are nearly the same. They both use a standard server motherboard, standard server CPUs/heatsinks (AMD or Intel) with 16 cores, standard DDR5 RDIMMs, a standard 1U or 2U chassis with high-quality, lower-power (500 Watt) power supplies, and cables/interfaces/backplanes for Open Compute Project EDSFF (E1.S or E3.S) 4-lane PCIe storage. The only difference for the AI servers is that the SSDs are swapped out for AI modules using the identical 9.5mm E1.S or 7.5mm E3.S enclosures. All of these are cooled by standard fans. Since all these components are standard and produced in large volumes for storage and other segments, costs can be greatly reduced. For example, the chassis is less expensive because the machines that cut and bend the metal can be amortized over a large quantity, making each one less expensive. The same is true for many of the other components.

In comparison, a GPU server will use custom chassis, kilowatt power supplies, custom heatsinks for liquid-cooled processors, and sometimes immersion tanks. Additionally, the cost of GPUs and extra DRAM memory needed to support each of the GPUs increases the overall cost. To be able to host these GPUs, custom motherboards, special power cables, and interfaces capable of hosting the extra power are required. Since most of these components are custom built specifically for individual server types, they cannot interchange components with high volume servers. This adds dramatically to both the cost and the lead time to deliver the servers.

The last element of hardware cost is the replacement parts that may be required. Custom GPU servers do not use off-the-shelf components which are readily stocked and available. If a component fails, it must be ordered and may need to be built as a one-off, which would be time consuming and very expensive.

Data Center Power Footprint

The overall power consumed by a data center using inference only servers vs. training servers can be looked at by combining the following:

*Hot or Cold Aisle Containment and Vent/Returns

A training server data center may require up to five times the amount of power, and therefore requires significantly more infrastructure for liquid cooling, HVAC, and power lines. Additionally, training servers need up to five times more backup power to keep them running in the case of a power failure. Not only is this more expensive in terms of operating cost, but it also places greater demands on other utilities, such as higher water usage and transformers to deliver the required current.

Image Source: Data Center Frontier

This may be acceptable for a custom-built facility, but for smaller colocation data centers, and especially for on-premises usage in a corporate facility’s data center room, it can be a showstopper.

Infrastructure (Weight, Financing, Location)

By reducing the total cooling required by switching designs from a training data center with liquid cooled GPU servers to an inference data center with air-cooled TPUs, a set of additional savings can occur.

First, the weight that the racks need to support is reduced, lowering the cost of the racks themselves. The weight-bearing floor infrastructure can also be reduced in size and mechanical strength because the weight of both the servers and the supporting racking structures has been lessened.

Second, the complexity of the piping for the chilled liquid is eliminated. This piping is costly both to purchase and to maintain, and it requires extra space to route through the data center.

Third, finding a suitable location for the data center becomes easier. The power required from local utilities can decrease by up to 80%, significantly reducing the strain on the power grid.

Finally, the backup power required can be up to 80% lower, both in immediate battery power and in the generators that come online during a local power failure caused by weather or maintenance issues.
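The "up to five times" and "up to 80%" figures above describe the same ratio from two directions. The short sketch below shows that arithmetic; the function name is illustrative only and is not taken from the original paper.

```python
def reduction_fraction(power_ratio: float) -> float:
    """Fractional saving when moving from a facility that draws
    `power_ratio` times the baseline power down to the baseline itself."""
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return 1.0 - 1.0 / power_ratio


# "Up to five times" is the upper bound quoted in the text.
training_to_inference_ratio = 5.0
saving = reduction_fraction(training_to_inference_ratio)
print(f"Power reduction: {saving:.0%}")  # Power reduction: 80%
```

A smaller ratio gives a smaller saving: a training facility drawing twice the power of its inference equivalent would correspond to a 50% reduction.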

Conclusion

The benefits of developing inference-only data centers can be significant, both through the reduced initial cost when compared to training data centers and through the reduced Total Cost of Ownership (TCO) over time. There are also significant environmental savings, since inference servers with TPUs can be air-cooled and therefore consume far fewer resources, reducing the strain on local power and water.

There is no doubt that Artificial Intelligence will continue to grow at an exponential rate as corporations and individuals find new beneficial use cases across all market segments. The smaller physical, environmental, and financial footprint of inference-only data centers provides an opportunity to keep up with this growth in a responsible and sustainable way.

About Unigen

What is an E1.S/E3.S?

E1.S and E3.S are mechanical and electrical form-factor standards in the EDSFF family, maintained by SNIA's SFF Technology Affiliate and widely adopted within the Open Compute Project. Both can interface directly with a server’s PCIe bus through a 4-lane connector. An E3.S module can optionally use an 8-lane connector and draw up to 20 watts. The form factors were initially intended for enterprise SSD storage, integrating storage controllers, NAND, DRAM, and power loss protection (PLP) circuitry. Although originally intended for storage, the standard lends itself directly to any technology that can use the PCIe bus, including AI, networking, and security processors. The website for the Open Compute Project can be found here.

Upgrades?

One of the side benefits of using E1.S or E3.S is the easy path to upgrades and maintenance. The standard allows modules to be hot swapped, so a module can be removed, replaced, or even upgraded without stopping the server, provided the drivers for the new module have already been installed. This is simply not possible for liquid-cooled modules without significant custom design. Also, since most GPU add-in cards use PCIe connectors (x8 or x16) directly on the motherboard, they do not sit in plug n’ play sockets that would allow removal without shutting off the server’s power.

Glossary:

CPU: Central Processing Unit (sometimes referred to as Intel x86)
GPU: Graphics Processing Unit. First used to describe a processor with multiple Floating Point Units for processing transform and lighting algorithms, later referred to as shader units.
TPU: Tensor Processing Unit
Tensor Calculations: Matrix multiplication, usually using Integer 8 processing or Integer 4 for large language models.
Inference AI: In the field of artificial intelligence (AI), inference is the process that a trained machine learning model uses to draw conclusions from new data.
Training AI: AI training is the process that enables AI models to make accurate inferences using large data sets (either language or images).
Server: A large enterprise computer capable of running complex calculations and accessed by multiple individuals concurrently.
Racking Units: A cabinet or vertical metal structure used to hold servers with capabilities for delivering power, cooling, and networking.
Data Center: A single function building constructed to house many servers fitted into racking units. A small data center might house 100 or 200 servers. A very large data center may house 100,000 servers in a single massive building.
Colocation Data Center: A custom designed data center where space, power, and networking are rented to corporations who wish to house their individual or multiple racks without having to maintain a space themselves. They are also used by smaller companies or companies who need a smaller number of servers located closer to their users.
HVAC: Heating, Ventilation, and Air Conditioning
Plug n’ Play: The ability to plug a module into a computer and have it be instantly recognized by the computer without the need for keystrokes.
Hot Swap: The ability to remove a module from, or insert a module into, a computer that is powered on, without first turning off the computer and removing it from a racking unit. Without this hot-swap ability, either removing or inserting the module could damage both the module and the computer through a short circuit.
EDSFF: The Enterprise and Data Center Standard Form Factor (EDSFF), previously known as the Enterprise and Data Center SSD Form Factor, is a family of solid-state drive (SSD) form factors for use in data center servers.
Neural Network: A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.
TCO: Total Cost of Ownership (TCO) can be calculated as the initial purchase price plus costs of operation across the asset’s lifespan.
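
The TCO definition above can be written as a short calculation. This is a minimal sketch that assumes a constant annual operating cost and no discounting. The function name and the dollar figures are hypothetical placeholders, not values from this paper.

```python
def total_cost_of_ownership(purchase_price: float,
                            annual_operating_cost: float,
                            lifespan_years: float) -> float:
    """Initial purchase price plus operating costs over the asset's lifespan.

    Assumes the operating cost is the same every year (a simplification)."""
    return purchase_price + annual_operating_cost * lifespan_years


# Hypothetical example values, chosen only to illustrate the formula.
tco = total_cost_of_ownership(purchase_price=10_000,
                              annual_operating_cost=2_000,
                              lifespan_years=5)
print(tco)  # 20000
```

In practice the operating term would include power, cooling, maintenance, and replacement parts, each of which the sections above argue is lower for air-cooled inference servers.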

The post TPU Inference Servers for Efficient Data Centers appeared first on Unigen.