| Aspect | On-Premise AI (Your Own Infrastructure) | External Cloud AI (OpenAI, Google Vertex, AWS Bedrock, etc.) |
|---|---|---|
| Cost Structure | Fixed & predictable • One-time hardware/software investment • Ongoing electricity & maintenance (usually lower than cloud at scale) • No per-token fees | Pay-per-token model • Costs grow steadily with usage • High-volume enterprise usage can reach millions of dollars per year • Hidden costs from rate limits, caching tiers, fine-tuning fees |
| Privacy & Data Security | 100% control: data never leaves your network • No external transmission • Full compliance with internal policies & regulations • Company secrets, intellectual property, & client data remain protected | All prompts & documents must be sent to the external provider • Real security & legal risks of leaking company trade secrets, client confidential data, and personally identifiable information (PII) • Even with "enterprise" agreements, data may be used for training or retained temporarily |
| Vendor Lock-in Risk | No lock-in • You own the models, data, and infrastructure • Switch vendors or go fully open-source anytime | Severe lock-in over time • External provider holds the context & fine-tuning history of your internal data • Moving away means losing years of learned patterns & paying to re-train elsewhere • Switching costs become prohibitive |
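As a back-of-envelope illustration of the cost trade-off, the sketch below compares a fixed on-premise budget against a pay-per-token cloud rate. All dollar figures, including the $10-per-million-token blended rate, are hypothetical assumptions for illustration, not vendor quotes.

```python
# Illustrative break-even sketch: fixed on-premise cost vs. pay-per-token cloud fees.
# Every figure below is a hypothetical assumption, not a quote from any vendor.

def breakeven_tokens(hardware_cost_usd, monthly_opex_usd, months,
                     cloud_price_per_1m_tokens_usd):
    """Token volume at which total on-premise cost equals cloud spend."""
    on_prem_total = hardware_cost_usd + monthly_opex_usd * months
    return on_prem_total / cloud_price_per_1m_tokens_usd * 1_000_000

# Example: $60k hardware plus $1k/month power & maintenance over 36 months,
# versus an assumed cloud rate of $10 per million tokens.
tokens = breakeven_tokens(60_000, 1_000, 36, 10)
print(f"Break-even at ~{tokens / 1e9:.1f}B tokens over 3 years")  # -> ~9.6B tokens
```

Below that token volume the cloud is cheaper; above it, on-premise wins, which is why the comparison favours on-premise mainly at sustained high volume.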
| Component | Technology | Notes |
|---|---|---|
| Hardware | Nvidia TU104GL [Tesla T4] 16GB; Intel Xeon Platinum 8259CL (4 cores), 16GB RAM | Entry-level hardware for testing the prototype system; ~10 tokens/second per prompt. For an enterprise setup, see the recommended upgrades in section "Hardware Options". |
| OS | Ubuntu | |
| Web Server | Nginx | |
| Web Framework | Flask | For an enterprise setup, recommend upgrading to Django (Python) or Blazor (.NET) for improved scalability. |
| Inference Framework | Ollama | For an enterprise setup, recommend upgrading to LocalAI or vLLM (continuous batching) for improved scalability. |
| Reasoning Model | DeepSeek-R1-0528-Qwen3-8B-Q4_K_M (4-bit quantization) | Handles complex tasks in the workflow. A distilled model that transfers DeepSeek-R1's reasoning onto the Qwen3-8B base, giving a good balance of speed and quality on entry-level hardware. For an enterprise setup, see the recommended upgrades in section "Model Options". |
| General Purpose Model | Qwen2:0.5B-Instruct-Q4_0 (4-bit quantization) | Handles simple tasks in the workflow thanks to its small 0.5B parameter size and fast inference. |
| Embedding Model | Nomic-Embed-Text:Latest | Used with the vector DB for data ingestion and querying. Extensible to other open-source embedding models. |
| RAG Framework | LlamaIndex | Enables the LLM to process company-internal data sources |
| Agent Framework | LlamaIndex | Enables the LLM to request actions, such as calling a web API or retrieving the latest data from external websites, to automate business workflows |
| Vector DB | Chroma | For an enterprise setup, recommend upgrading to PostgreSQL w/ pgvector for improved scalability. |
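To make the Embedding Model + Vector DB pairing concrete: at query time the vector store ranks stored document embeddings by similarity to the query embedding and returns the closest documents for the LLM to read. Below is a minimal pure-Python sketch of that lookup; the 3-dimensional toy vectors and file names are hypothetical (real Nomic embeddings have hundreds of dimensions), and production systems delegate this search to Chroma or pgvector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=1):
    """Rank stored document vectors by similarity to the query vector."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" for two hypothetical internal documents.
docs = {
    "leave_policy.md":  [0.9, 0.1, 0.0],
    "expense_rules.md": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # stand-in embedding of an annual-leave question
print(top_k(query, docs, k=1))  # -> ['leave_policy.md']
```

The retrieved document text is then placed into the LLM prompt, which is the core of the RAG pattern the stack above implements.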
| Component | Option 1: 4x Mac Studio M3 Ultra Cluster | Option 2: Nvidia HGX H20 (4x H20 GPUs) |
|---|---|---|
| Overview | Lower cost; simple setup; supports ~250 users | Higher cost; complex setup; supports ~500 users |
| Total VRAM / Unified Memory | 2TB Unified Memory (per unit 512GB) | 384GB VRAM (per unit 96GB) |
| GPU/Chip Price USD (Est.) | ~$11,500 per unit | ~$15,000 per unit |
| Additional Build Components per Unit | Minimal (Mac Studio is complete workstation; only minor extras like 10GbE NIC if needed) | ~$13,000–$30,000 (full server build: motherboard / CPU / RAM / storage / PSU / cooling / case for 4× H20 node) |
| Networking & Integration (total for cluster) | ~$2,000–$5,000 (10GbE switch, cables, basic rack/stand) | ~$5,000–$12,000 (25/100GbE switch, cabling, rackmount) |
| Total Hardware Cost USD (Est.) (from scratch) | ~$50,000–$75,000 (4 units + extras; realistic pricing incl. assembly/markup) | ~$100,000–$200,000+ (1 full server + networking; incl. server build costs) |
| Performance Profile | Decent inference; no fine-tuning | Fast inference & fine-tuning |
| Memory Bandwidth (GPU ↔ VRAM) | ~819 GB/s | ~3 TB/s |
| Per-Prompt Speed | ~12-15 tokens/second | ~15-25 tokens/second |
| Total Cluster Throughput | ~150-400 tokens/second (w/ batching) | ~400-1200 tokens/second (w/ batching) |
| Enterprise Support | Limited (self-managed) | Strong (NVIDIA/vendor) |
| Availability (HK) | Available | Available (Compliant) |
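The user-capacity figures above can be sanity-checked with simple throughput arithmetic. The 1.5 tokens/second average per active user below is a hypothetical utilisation assumption (users spend most of their time reading and typing, not generating), and real sizing should keep extra headroom for peak load, which is why the table's figures sit below these upper bounds.

```python
def concurrent_users(cluster_tokens_per_s, avg_tokens_per_user_per_s):
    """Rough upper bound on simultaneously active users a cluster can serve."""
    return int(cluster_tokens_per_s / avg_tokens_per_user_per_s)

# Assumed average demand: ~1.5 tokens/s per active user (hypothetical figure).
print(concurrent_users(400, 1.5))   # Mac Studio cluster upper bound -> 266
print(concurrent_users(1200, 1.5))  # HGX H20 upper bound -> 800
```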
| Model | Details (8-bit Quantization) | Context Window (tokens) |
|---|---|---|
| DeepSeek R1 Distill 70B | A strong distilled variant of the flagship DeepSeek-R1 series, offering excellent reasoning, math, code, and agentic capabilities at a much more practical inference footprint. (~70–90 GB VRAM) | 128,000 |
| Qwen2.5-72B-Instruct | One of the strongest open-source models available in 2026, excelling in reasoning, math, code, long-context understanding, and tool-calling performance. (~75–95 GB VRAM) | 128,000 |
| Llama-3.3-70B-Instruct | A highly reliable, widely supported model with excellent general reasoning, instruction following, and agentic task performance across various applications. (~70–90 GB VRAM) | 128,000 |
| Nemotron-4 ~70B distill | NVIDIA's top-tier open reasoning model (distilled/pruned variant), delivering outstanding agent performance and complex reasoning for robust automation. (~75–95 GB VRAM) | 128,000 |
| Mixtral-8x22B-Instruct-v0.1 | A highly efficient Mixture-of-Experts (MoE) model known for its fast inference, strong reasoning capabilities, and effective tool-calling performance. (~60–75 GB VRAM) | 64,000–128,000 |
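The VRAM ranges above follow a common rule of thumb: weight memory is roughly parameter count times bits-per-weight divided by 8, plus extra room for the KV cache and activations. A minimal sketch, assuming a flat ~20% overhead factor (the real overhead varies with context length and batch size):

```python
def vram_estimate_gb(params_billion, bits_per_weight, overhead_factor=1.2):
    """Back-of-envelope VRAM: weights plus ~20% KV-cache/activation overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead_factor

# A 70B model at 8-bit quantization:
print(f"{vram_estimate_gb(70, 8):.0f} GB")  # -> 84 GB, within the ~70-90 GB range
```

The same arithmetic shows why 4-bit quantization roughly halves the footprint, which is how the prototype stack fits an 8B model on a 16 GB T4.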
| Milestone | Description & Objectives |
|---|---|
| Business Requirements Gathering & Stakeholder Alignment (Pre-Technical Phase) | Discuss with key stakeholders (executives, department heads, IT, finance, end-users) to identify the highest-potential use cases, understand management's vision for AI integration, define success criteria, agree on realistic timelines, and estimate required investments and resources. |
| Establish Foundational AI Capabilities (Starting from No AI) | Build basic infrastructure and skills to support AI initiatives. The goal is to create awareness, secure buy-in, and set up the technical backbone without overwhelming the team. |
| Data Preparation: Cleaning, Structuring & Access Controls (Data Readiness) | Transform raw company data into secure, compliant, AI-ready formats. Key activities: inventory and assess all data sources, clean data (remove duplicates, fix errors), structure unstructured documents, implement access controls and data governance, build data pipelines, ensure regulatory compliance. |
| Security & Compliance (Secure Framework) | Implement a security framework with data classification, role-based access, and audit trails. Establish incident response and regular audits. Success criteria: 100% access control, zero privacy violations, full auditability of all AI data-processing activities. |
| Introduce Passive AI Chatbot (Employee Access to Basic Query Response) | Deploy a simple chatbot that responds to user queries on demand, such as answering FAQs or retrieving information from internal docs. It is "passive" in that it only activates when prompted. |
| Upgrade to Passive AI Assistance (Enhanced Support Tools) | Evolve the chatbot into a more capable assistant by integrating it into applications and business workflows so it can handle complex tasks, such as summarizing reports or suggesting resources, but still only on user initiation. |
| Develop Proactive AI Agent (Anticipatory Actions) | Introduce an AI agentic workflow and transform the assistant into a proactive entity (AI agent) that anticipates needs, such as sending reminders, flagging issues, or automating routine workflows without explicit prompts. |
| Enable AI Agent Collaboration (Autonomous Multi-Agent System) | Create an AI agent discovery service and a network of AI agents that can discover and interact with each other to solve complex problems, such as coordinating tasks across departments or optimizing processes collaboratively. |
| Business | Use cases |
|---|---|
| Head-Hunting Firm | |
| Law Firm | |
| Hospital / Healthcare Provider | |
| Government Agency | |
| Financial Companies (Banks, Investment Firms, Insurance) | |
| Technology Companies | |
| Stock Exchanges | |
Self-attention input for the phrase "capital of italy": each token's embedding vector is projected into a query (Q), a key (K), and a value (V) vector.

Vector (capital) = [4, 7, 1, 0, 3, ...] -> Q (query), K (key), V (value)
Vector (of)      = [2, 7, 2, 4, 6, ...] -> Q (query), K (key), V (value)
Vector (italy)   = [8, 1, 4, 9, 3, ...] -> Q (query), K (key), V (value)
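The Q/K/V listing above can be turned into a minimal, self-contained sketch of scaled dot-product attention, the operation at the heart of these models. The toy 2-dimensional Q, K, and V matrices below are illustrative stand-ins: real models derive them by multiplying each token embedding with learned weight matrices (W_Q, W_K, W_V) at much higher dimensions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Each query scores every key, then mixes the values by those weights.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy 2-dimensional Q/K/V for the three tokens "capital", "of", "italy".
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
print([[round(x, 2) for x in row] for row in out])
```

Each output row is a weighted blend of the value vectors, so "italy" ends up represented partly by information drawn from "capital", which is how attention lets tokens share context.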