A prototype on-premise Generative AI system, protecting 100% of your company's sensitive data and intellectual property. See 'About' for details
Retrieval-Augmented Generation (RAG) - Connects the LLM to internal knowledge bases for accurate, up-to-date, and context-aware responses, without having to re-train or fine-tune the entire model.
Powered by battle-tested open-source frameworks and the latest open-source LLMs, utilizing RAG and Agents, and protecting 100% of your company's sensitive data and intellectual property behind your firewall. Contact contact@ai-evolutions.com to integrate this system into your company

Advantages of this On-Premise Generative AI Platform
On-Premise AI vs External Cloud AI - Cost & Privacy Comparison
Aspect: Cost Structure
On-Premise AI (Your Own Infrastructure): Fixed & predictable
• One-time hardware/software investment
• Ongoing electricity & maintenance (usually lower than cloud at scale)
• No per-token fees
External Cloud AI (OpenAI, Google Vertex, AWS Bedrock, etc.): Pay-per-token model; costs grow rapidly as usage scales
• High-volume enterprise usage can reach millions of dollars per year
• Hidden costs from rate limits, caching tiers, and fine-tuning fees

Aspect: Privacy & Data Security
On-Premise AI: 100% control; data never leaves your network
• No external transmission
• Full compliance with internal policies & regulations
• Company secrets, intellectual property, & client data remain protected
External Cloud AI: All prompts & documents must be sent to the external provider
• Creates real security & legal risks of leaking:
  – Company trade secrets
  – Client confidential data
  – Personally identifiable information (PII)
• Even with "enterprise" agreements, data may be used for training or retained temporarily

Aspect: Vendor Lock-in Risk
On-Premise AI: No lock-in
• You own the models, data, and infrastructure
• Switch vendors or go fully open-source anytime
External Cloud AI: Severe lock-in over time
• External provider owns the context & fine-tuning history of your internal data
• Moving away means losing years of learned patterns & paying to re-train elsewhere
• Switching costs become prohibitive
Technical Overview
This prototype on-premise Generative AI system utilizes the following open-source frameworks and models. The setup is cost-efficient, customizable to meet specific requirements, and maximizes ROI for your company.

Hardware: Nvidia TU104GL [Tesla T4] 16GB; Intel Xeon Platinum 8259CL (4 cores), 16GB RAM. Entry-level hardware for testing the prototype system; ~10 tokens/second per prompt. For enterprise setups, see the recommended upgrades in the "Hardware Options" section.
OS: Ubuntu
Web Server: Nginx
Web Framework: Flask. For enterprise setups, upgrading to Django (Python) or Blazor (.NET) is recommended for improved scalability.
Inference Framework: Ollama. For enterprise setups, upgrading to LocalAI or vLLM (continuous batching) is recommended for improved scalability.
Reasoning Model: DeepSeek-R1-0528-Qwen3-8B-Q4_K_M (4-bit quantization). Handles complex tasks in the workflow. A distilled model that combines the strengths of DeepSeek R1 and Qwen3, with a good balance of speed and quality on entry-level hardware. For enterprise setups, see the recommended upgrades in the "Model Options" section.
General-Purpose Model: Qwen2:0.5B-Instruct-Q4_0 (4-bit quantization). Handles simple tasks in the workflow, thanks to its small parameter count (0.5B) and fast performance.
Embedding Model: Nomic-Embed-Text:Latest. Used with the vector DB for data ingestion and querying; extensible to other open-source models.
RAG Framework: LlamaIndex. Enables the LLM to process company-internal data sources.
Agent Framework: LlamaIndex. Enables the LLM to request actions, such as calling web APIs or retrieving the latest data from external websites, to automate business workflows.
Vector DB: Chroma. For enterprise setups, upgrading to PostgreSQL with pgvector is recommended for improved scalability.
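
To make the wiring concrete, below is a minimal sketch of how this stack could be assembled with LlamaIndex, Ollama, and Chroma. It assumes the modular llama-index packages and a locally running Ollama server; exact import paths and model tags vary by version, and the "./internal_docs" path and query are placeholders.

```python
# Minimal RAG wiring for the stack above: LlamaIndex + Ollama + Chroma.
# Import paths follow the modular llama-index packages and may differ by version;
# model tags and the "./internal_docs" path are assumptions for illustration.
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

llm = Ollama(model="deepseek-r1:8b", request_timeout=120.0)   # local reasoning model (tag assumed)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")  # embedding model

# Persistent Chroma collection as the vector DB.
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("company_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest internal documents, embed them, and store the vectors.
documents = SimpleDirectoryReader("./internal_docs").load_data()
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Query: retrieve the most relevant chunks, then answer with the local LLM.
query_engine = index.as_query_engine(llm=llm)
print(query_engine.query("Summarize our internal security policy."))
```
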
Hardware Options
The options below assume running a ~70B-parameter model at 8-bit quantization.

Option 1: 4x Mac Studio M3 Ultra Cluster
• Profile: Lower cost; simple setup; supports ~250 users
• Total Unified Memory: 2TB (512GB per unit)
• Chip Price USD (est.): ~$11,500 per unit
• Additional Build Components per Unit: Minimal (the Mac Studio is a complete workstation; only minor extras like a 10GbE NIC if needed)
• Networking & Integration (total for cluster): ~$2,000–$5,000 (10GbE switch, cables, basic rack/stand)
• Total Hardware Cost USD (est., from scratch): ~$50,000–$75,000 (4 units + extras; realistic pricing incl. assembly/markup)
• Performance Profile: Decent inference; no fine-tuning
• Memory Bandwidth: ~819 GB/s
• Per-Prompt Speed: ~12–15 tokens/second
• Total Cluster Throughput: ~150–400 tokens/second (with batching)
• Enterprise Support: Limited (self-managed)
• Availability (HK): Available

Option 2: Nvidia HGX H20 (4x H20 GPUs)
• Profile: Higher cost; complex setup; supports ~500 users
• Total VRAM: 384GB (96GB per GPU)
• GPU Price USD (est.): ~$15,000 per unit
• Additional Build Components per Unit: ~$13,000–$30,000 (full server build: motherboard / CPU / RAM / storage / PSU / cooling / case for a 4x H20 node)
• Networking & Integration (total): ~$5,000–$12,000 (25/100GbE switch, cabling, rackmount)
• Total Hardware Cost USD (est., from scratch): ~$100,000–$200,000+ (1 full server + networking; incl. server build costs)
• Performance Profile: Fast inference & fine-tuning
• Memory Bandwidth: ~3 TB/s
• Per-Prompt Speed: ~15–25 tokens/second
• Total Cluster Throughput: ~400–1,200 tokens/second (with batching)
• Enterprise Support: Great (NVIDIA/vendor)
• Availability (HK): Available (compliant)
Model Options
All model details assume 8-bit quantization.

DeepSeek R1 Distill 70B: A strong distilled variant of the flagship DeepSeek-R1 series, offering excellent reasoning, math, code, and agentic capabilities at a much more practical inference footprint. (~70–90 GB VRAM; 128,000-token context window)
Qwen2.5-72B-Instruct: One of the strongest open-source models currently available, excelling in reasoning, math, code, long-context understanding, and tool-calling performance. (~75–95 GB VRAM; 128,000-token context window)
Llama-3.3-70B-Instruct: A highly reliable, widely supported model with excellent general reasoning, instruction following, and agentic task performance across various applications. (~70–90 GB VRAM; 128,000-token context window)
Nemotron-4 ~70B distill: NVIDIA's top-tier open reasoning model (distilled/pruned variant), delivering outstanding agent performance and complex reasoning for robust automation. (~75–95 GB VRAM; 128,000-token context window)
Mixtral-8x22B-Instruct-v0.1: A highly efficient Mixture-of-Experts (MoE) model known for its fast inference, strong reasoning capabilities, and effective tool-calling performance. (~60–75 GB VRAM; 64,000–128,000-token context window)
AI Integration Roadmap
This roadmap outlines a step-by-step progression from zero AI capabilities to a sophisticated ecosystem where AI teammates collaborate autonomously. Each milestone builds on the previous, focusing on incremental adoption to minimize disruption while maximizing value.

When this roadmap is fully implemented, every employee will be empowered with their own AI agent: an intelligent, proactive partner that works alongside them 24/7. This AI agent will dramatically increase individual capabilities by handling repetitive tasks, providing instant insights, accelerating decision-making, and enabling focus on high-value creative and strategic work, ultimately driving sustainable competitive advantage and long-term revenue growth for the company.
Milestone Description & Objectives
Business Requirements Gathering & Stakeholder Alignment
(Pre-Technical Phase)
Discuss with key stakeholders (executives, department heads, IT, finance, end-users) to determine the most high-potential use cases, understand management's vision for AI integration, define success criteria, agree on realistic timelines, and estimate required investments and resources.
Establish Foundational AI Capabilities
(Starting from No AI)
Build basic infrastructure and skills to support AI initiatives. The goal is to create awareness, secure buy-in, and set up the technical backbone without overwhelming the team.
Data Preparation: Cleaning, Structuring & Access Controls
(Data Readiness)
Transform raw company data into secure, compliant, AI-ready formats. Key activities: inventory and assess all data sources, clean data (remove duplicates, fix errors), structure unstructured documents, implement access controls and data governance, build data pipelines, and ensure regulatory compliance.
Security & Compliance
(Secure Framework)
Implement security framework with data classification, role-based access, and audit trails. Establish incident response and regular audits. Success: 100% access control, zero privacy violations, full auditability of all AI data processing activities.
Introduce Passive AI Chatbot
(Employee Access to Basic Query Response)
Deploy a simple chatbot that responds to user queries on demand, like answering FAQs or retrieving info from internal docs. This is "passive" as it only activates when prompted.
Upgrade to Passive AI Assistance
(Enhanced Support Tools)
Evolve the chatbot into a more capable assistant by integrating it into applications and business workflows, so it can handle complex tasks passively, such as summarizing reports or suggesting resources, but still only on user initiation.
Develop Proactive AI Agent
(Anticipatory Actions)
Introduce an AI agentic workflow and transform the assistant into a proactive entity (an AI agent) that anticipates needs, such as sending reminders, flagging issues, or automating routine workflows without explicit prompts.
Enable AI Agents Collaboration
(Autonomous Multi-Agent System)
Create an AI agent discovery service, and a network of AI agents that can discover and interact with each other to solve complex problems, like coordinating tasks across departments or optimizing processes collaboratively.
This on-premise Generative AI system can be enhanced to support the following potential use cases:
Business Use cases
Head-Hunting Firm
  • Intelligent Resume & Candidate Screening
  • Automated Interview Question Generation & Candidate Scoring
  • AI-Powered Predictive Talent Sourcing & Outreach
Law Firm
  • Contract Review & Redaction
  • Legal Research & Case Law Summarization
  • Rapid Proposal & Pitch Automation for New Business
Hospital / Healthcare Provider
  • Clinical Decision Support & Risk Prediction
  • Automated Medical Report Summarization
  • Personalized Patient Acquisition & Retention Marketing
Government Agency
  • Sensitive Document Classification & Redaction
  • Policy Impact Simulation & Scenario Analysis
  • Predictive Resource Allocation for Economic Development
Financial Companies (Banks, Investment Firms, Insurance)
  • Real-time Fraud Detection & Anti-Money Laundering (AML)
  • Trade Surveillance & Market Abuse Detection
  • Predictive Trading & Investment Strategy Optimization
Technology Companies
  • Automated Code Review, Bug Detection & Technical Debt Prioritization
  • Intelligent Internal Documentation Generation & Knowledge Base Maintenance
  • AI-Driven Product Feature Prioritization & Revenue Impact Forecasting
Stock Exchanges
  • AI-Enhanced Trade Surveillance & Market Abuse Detection
  • Automated IPO Document Generation & Due Diligence Acceleration
  • AI-Driven Financial Product Innovation & Premium Revenue Streams
Notes:
Explore a simple example of the question "Capital of Italy", and how a GPT-style LLM predicts the answer "Rome"
Step 1: Tokenization & Embedding Lookup
Each word in the question ("capital", "of", "italy") is converted into a token. For each token, the model looks up the LLM Token Embedding Table to find a vector. This vector is a multi-dimensional structure containing the learned embedding for the token, and it is the initial state of the token vector. At this stage we have 3 token vectors:

Embedding Vector (capital) = [4,7,1,0,3,...]
Embedding Vector (of) = [2,7,2,4,6,...]
Embedding Vector (italy) = [8,1,4,9,3,...]
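
A minimal sketch of this lookup, assuming a toy 3-word vocabulary and a small embedding dimension (real models use vocabularies of roughly 100k tokens and dimensions in the thousands; all values here are illustrative):

```python
import numpy as np

# Toy vocabulary and embedding table; real LLMs use ~100k-token vocabularies
# and embedding dimensions in the thousands. All values are illustrative.
vocab = {"capital": 0, "of": 1, "italy": 2}
d_model = 8                                    # toy embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["capital", "of", "italy"]            # Step 1a: tokenization
token_ids = [vocab[t] for t in tokens]
x = embedding_table[token_ids]                 # Step 1b: embedding lookup
print(x.shape)                                 # (3, 8): one vector per token
```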
Step 2: Positional Encoding
Before passing the Embedding Vectors to the transformer layers, a Positional Encoding is added to each Embedding Vector. This gives the model a sense of the sequence of the vectors, since transformers by themselves do not know the order of the tokens "capital of italy"

Vector (capital) = Embedding Vector (capital) + Positional (1)

Vector (of) = Embedding Vector (of) + Positional (2)

Vector (italy) = Embedding Vector (italy) + Positional (3)

With the Positional Encoding added, the model now knows Vector (capital) comes first, Vector (of) second, and Vector (italy) third, so the transformer layers can respect the sequence
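
A sketch of the classic sinusoidal positional encoding from the original transformer paper; this is one common scheme, and many modern LLMs use alternatives such as rotary embeddings:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# x stands in for the (3, d_model) embedding matrix from Step 1.
d_model = 8
x = np.zeros((3, d_model))
x = x + sinusoidal_positions(3, d_model)   # position 0 = "capital", 1 = "of", 2 = "italy"
```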
Step 3: Transformer Layer - Self-Attention
Step 3a - Q, K, V vector projection - Each transformer layer has multiple attention heads, and each attention head contains 3 trained weight matrices (a combined code sketch for Steps 3a-3d follows at the end of Step 3):

Wq - this projects a token vector into a "query space" -> Q = what context the token is seeking
Wk - this projects a token vector into a "key space" -> K = what context it can provide
Wv - this projects a token vector into a "value space" -> V = the information of the context

Each Wq, Wk, Wv are applied to each token vector to project a Q, K, and V vector :
Vector (capital) = [4,7,1,0,3,...]
                   -> Q (query)
                   -> K (key)
                   -> V (value)
Vector (of) = [2,7,2,4,6,...]
              -> Q (query)
              -> K (key)
              -> V (value)
Vector (italy) = [8,1,4,9,3,...]
               -> Q (query)
               -> K (key)
               -> V (value)

Step 3b - Q-K similarity check - For each token vector, its Q vector is compared against the K vector of every token vector at or before its position, including itself

Vector (capital) Q -> compare against Vector (capital) K -> calculate Score (capital,capital)

Vector (of) Q -> compare against Vector (capital) K -> calculate Score (of,capital)
Vector (of) Q -> compare against Vector (of) K -> calculate Score (of,of)

Vector (italy) Q -> compare against Vector (capital) K -> calculate Score (italy,capital)
Vector (italy) Q -> compare against Vector (of) K -> calculate Score (italy,of)
Vector (italy) Q -> compare against Vector (italy) K -> calculate Score (italy,italy)

Step 3c - Softmax Normalization - Convert raw scores into attention weights for computation in the next stage

Score (capital,capital) -> Softmax Normalization -> Weight (capital,capital)

Score (of,capital) -> Softmax Normalization -> Weight (of,capital)
Score (of,of) -> Softmax Normalization -> Weight (of,of)

Score (italy,capital) -> Softmax Normalization -> Weight (italy,capital)
Score (italy,of) -> Softmax Normalization -> Weight (italy,of)
Score (italy,italy) -> Softmax Normalization -> Weight (italy,italy)

Step 3d - Self-Attention Vector - After the comparison, each token generates a new context-aware representation by weighting and combining the V vectors of all tokens at or before its position

Self-Attention vector (capital) = Weight (capital,capital) * V (capital)

Self-Attention vector (of) = Weight (of,capital) * V (capital) + Weight (of,of) * V (of)

Self-Attention vector (italy) = Weight (italy,capital) * V (capital) + Weight (italy,of) * V (of) + Weight (italy,italy) * V (italy)

This process of comparing each token vector's Q (what context it is seeking) against every preceding token vector's K (what context it can provide), and generating a self-attention output vector, is the Self-Attention process. It allows every token vector (word) in the prompt to acquire context and meaning based on its relationship with every other token vector (word) in the prompt

This describes the process for one attention head. Each transformer layer can have multiple attention heads, and this process is repeated independently for each attention head, and the outputs are combined
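
Below is a minimal single-head sketch of Steps 3a-3d, with toy dimensions and random weights standing in for trained ones. Note that standard attention also scales the Q-K scores by the square root of the head dimension before the softmax, a detail omitted in the walkthrough above:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
x = rng.normal(size=(3, d_model))                # vectors for "capital", "of", "italy"

# Step 3a: trained projection matrices (random placeholders here)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Step 3b: Q-K similarity, scaled by sqrt(d_head) as in standard attention
scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: each token attends only to itself and earlier tokens
scores[np.triu(np.ones((3, 3), dtype=bool), k=1)] = -np.inf

# Step 3c: softmax normalization -> attention weights
weights = softmax(scores)

# Step 3d: weighted combination of V vectors -> context-aware representations
attn_out = weights @ V
print(weights.round(2))   # row i holds Weight(token_i, token_j) for j <= i
```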
Step 4: Transformer Layer - Residual Connections & Layer Normalization
Step 4a - Residual Connections - combines the original Embedding Vector (capital) and the Self-Attention Vector (capital). This ensures the model keeps both the original meaning and the contextual enrichment

ResidualOutput Vector (capital) = Original Embedding Vector (capital) + Self-Attention Vector (capital)

Step 4b - Layer Normalization - ResidualOutput Vector (capital) is normalized across dimensions. This stabilizes the values and prepares them for the next step

ResidualOutput Vector (capital) -> Normalization -> Normalized Vector (capital)

Repeat the above steps to generate Normalized Vector (of) and Normalized Vector (italy)
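
A sketch of Steps 4a-4b with stand-in inputs; production layer norms also learn a per-dimension scale (gamma) and shift (beta), omitted here:

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize each vector across its dimensions (zero mean, unit variance)."""
    return (v - v.mean(axis=-1, keepdims=True)) / np.sqrt(v.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))           # token vectors entering the layer (Step 2 output)
attn_out = rng.normal(size=(3, 8))    # stand-in for the Step 3 self-attention output

residual = x + attn_out               # Step 4a: residual connection keeps the original signal
normalized = layer_norm(residual)     # Step 4b: layer normalization stabilizes the values
```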
Step 5: Transformer Layer - Feed-Forward Network (FFN)
Step 5a - Linear expansion - Apply the Expanded Weight Matrix to the Normalized Vector (capital), and expand the vector into a higher-dimensional space

Expanded Vector (capital) = Expanded Weight Matrix * Normalized Vector (capital)

The Expanded Vector (capital) now contains enriched representations (syntax, semantics, positional cues, ...)

Step 5b - Bias addition - Apply the transformer layer's Bias Vector to the Expanded Vector (capital), shifting it by a baseline offset. This enables a richer signal to propagate into the next stage

Expanded Vector w/ Bias (capital) = Expanded Vector (capital) + Bias Vector

Step 5c - Activation GELU (Gaussian Error Linear Unit) - GELU smoothly gates each dimension of Expanded Vector w/ Bias (capital), preserving strong signals while attenuating weaker ones, thereby refining the representation for the next stage

EV = Expanded Vector w/ Bias (capital)
CDF = Cumulative Distribution Function

GELU Smoothed Vector (capital) = EV * CDF(EV)

Step 5d - Linear compression - Apply the Compressed Weight Matrix to compress the vector back to the original dimensions, and generate the FFN (Feed-Forward Network) Vector, so it can be processed by the next stage

FFN Vector (capital) = Compressed Weight Matrix * GELU Smoothed Vector (capital) + Bias Vector

Repeat the above steps to generate FFN Vector (of) and FFN Vector (italy)
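
A sketch of Steps 5a-5d with toy dimensions (the expansion is typically ~4x the model dimension) and the common tanh approximation of the Gaussian CDF for GELU:

```python
import numpy as np

def gelu(v):
    """GELU via the common tanh approximation of the Gaussian CDF."""
    return 0.5 * v * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                        # expansion is typically ~4x d_model
normalized = rng.normal(size=(3, d_model))   # Step 4 output, one row per token

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

expanded = normalized @ W1 + b1              # Steps 5a-5b: linear expansion + bias
activated = gelu(expanded)                   # Step 5c: GELU gating
ffn_out = activated @ W2 + b2                # Step 5d: compression back to d_model
print(ffn_out.shape)                         # (3, 8)
```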
Step 6: Transformer Layer - Residual Connections & Layer Normalization
Step 6a - Residual Connections - combines the original Normalized Vector (capital) and the FFN Vector (capital). This ensures the model keeps both the original meaning and the contextual enrichment

ResidualOutput Vector (capital) = Original Normalized Vector (capital) + FFN Vector (capital)

Step 6b - Layer Normalization - ResidualOutput Vector (capital) is normalized across dimensions. This stabilizes the values and prepares them for the next step

ResidualOutput Vector (capital) -> Normalization -> Normalized Vector (capital)

Repeat the above steps to generate Normalized Vector (of) and Normalized Vector (italy)
Step 7: Stacking Layers
An LLM has multiple transformer layers, with each layer adding Depth and Hierarchical Learning to refine the representation

Depth - Each transformer layer allows the model to capture increasingly complex relationships (syntax, semantics, context)

Hierarchical Learning - Early transformer layers learn local patterns (word proximity, short dependencies), while deeper layers learn global patterns (long-range dependencies, abstract meaning)

For each transformer layer, the following steps are repeated :

Input : Vector (capital), Vector (of), Vector (italy)

Step 3: Transformer Layer - Self-Attention

Step 4: Transformer Layer - Residual Connections & Layer Normalization

Step 5: Transformer Layer - Feed-Forward Network (FFN)

Step 6: Transformer Layer - Residual Connections & Layer Normalization

Output : Normalized Vector (capital), Normalized Vector (of), Normalized Vector (italy)
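
A compact sketch of this loop, with simplified stand-ins for the attention and FFN sub-blocks from Steps 3 and 5 (real layers carry their own trained weights):

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    return (v - v.mean(axis=-1, keepdims=True)) / np.sqrt(v.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_model = 8

# Simplified stand-ins for the sub-blocks sketched in Steps 3 and 5.
def self_attention(v):
    return v @ rng.normal(size=(d_model, d_model)) * 0.1

def ffn(v):
    return np.maximum(v @ rng.normal(size=(d_model, d_model)) * 0.1, 0.0)

def transformer_stack(x, num_layers=4):          # real LLMs stack ~30-80 such layers
    for _ in range(num_layers):
        x = layer_norm(x + self_attention(x))    # Steps 3-4
        x = layer_norm(x + ffn(x))               # Steps 5-6
    return x

x = rng.normal(size=(3, d_model))                # "capital", "of", "italy"
hidden = transformer_stack(x)                    # final hidden vectors for Step 8
```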
Step 8: Final Layer & Output Projection
Step 8a - Final Layer - After stacking multiple transformer layers (Step 7), the last layer produces the final hidden representation of each token in the sequence :

Final Hidden Vector (capital) = the result of refining and integrating Normalized Vector (capital) across all transformer layers

Final Hidden Vector (of) = the result of refining and integrating Normalized Vector (of) across all transformer layers

Final Hidden Vector (italy) = the result of refining and integrating Normalized Vector (italy) across all transformer layers

Step 8b - Output Projection - Projects the Final Hidden Vector into the LLM vocabulary space, to calculate the Logit Vector, which is the relevance score of each vocabulary to the Final Hidden Vector :

Logit Vector (capital) = Vocabulary Weight Matrix * Final Hidden Vector (capital) + Bias Vector

Logit Vector (of) = Vocabulary Weight Matrix * Final Hidden Vector (of) + Bias Vector

Logit Vector (italy) = Vocabulary Weight Matrix * Final Hidden Vector (italy) + Bias Vector
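
A sketch of Step 8b with toy dimensions; real vocabularies contain tens of thousands to roughly 150k entries, and the weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 50_000                 # toy dims; real vocabs are ~30k-150k
final_hidden = rng.normal(size=(3, d_model))    # Step 8a output for the 3 tokens

# Step 8b: project each final hidden vector into vocabulary space
W_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)
logits = final_hidden @ W_vocab + b_vocab       # (3, vocab_size)
# logits[2] is Logit Vector (italy): one relevance score per vocabulary entry
```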
Step 9: Softmax & Token Selection
A GPT (Generative Pre-Trained Transformer) model uses only the last token's Logit Vector to predict the next vocabulary entry. Since "italy" is the last token in the question "Capital of Italy", only Logit Vector (italy) will be used to predict the next vocabulary entry.

Note that although Logit Vector (capital) and Logit Vector (of) are not used to predict the next vocabulary entry, they have contributed to the final representation of Logit Vector (italy) through the transformer layers

Step 9a - Softmax Probabilities - A softmax function is applied to the Logit Vector (italy) to convert raw scores into probabilities. This produces a probability distribution across the entire vocabulary in the LLM :

Probability Distribution (all vocabulary) = SoftMax [Logit Vector (italy)]

Step 9b - Prediction - Predict the next vocabulary entry with the highest probability
The Probability Distribution (all vocabulary) contains the probability of each vocabulary entry in the LLM relative to Logit Vector (italy). For example, the distribution might look like:

Probability (rome) = 0.55

Probability (milan) = 0.18

Probability (naples) = 0.12

Probability (turin) = 0.09

Probability (florence) = 0.06

...

Since Probability (rome) is the highest, "Rome" is selected as the next predicted vocabulary entry, answering the question "Capital of Italy"
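
A sketch of Steps 9a-9b over a toy 5-word vocabulary with illustrative logits; this shows greedy decoding, whereas real deployments often sample with temperature, top-k, or top-p instead:

```python
import numpy as np

# Toy logits for a 5-word vocabulary, standing in for Logit Vector (italy).
vocab = ["rome", "milan", "naples", "turin", "florence"]
logits_italy = np.array([2.2, 1.1, 0.7, 0.4, 0.0])

# Step 9a: softmax turns raw scores into a probability distribution
probs = np.exp(logits_italy - logits_italy.max())
probs /= probs.sum()

# Step 9b: greedy decoding picks the highest-probability entry
print(dict(zip(vocab, probs.round(2))))              # roughly matches the example above
print("next token:", vocab[int(np.argmax(probs))])   # -> "rome"
```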
Explore a simple example of an LLM agentic workflow
Step 1: User Request
Step 2: Orchestration Layer processes the user request
Step 3: Orchestration Layer provides the Tool Registry to the LLM
Step 4: LLM reasons and produces an execution plan & tool call
Step 5: Orchestration Layer validates & executes the tool call
Step 6: LLM validates the results and provides the next tool call
Step 7: Steps 5-6 repeat until the execution plan & tool calls are complete
Step 8: LLM Final Reasoning & Synthesis
Step 9: Orchestration Layer final output to user
Step 10: Memory/Context updated
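
As a concrete illustration of Steps 1-10 on the stack described above, here is a minimal sketch using LlamaIndex's ReAct agent with Ollama. The tool, model tag, and question are hypothetical placeholders, and the agent interface shown follows the older llama-index agent API; newer versions expose a workflow-based equivalent:

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.ollama import Ollama

# Hypothetical tool registered with the agent (Steps 3 and 5: the framework
# validates arguments against the function signature and executes the call).
def get_stock_price(ticker: str) -> str:
    """Return the latest price for a ticker (stubbed for illustration)."""
    return f"{ticker}: 123.45 USD"

llm = Ollama(model="deepseek-r1:8b")   # local reasoning model via Ollama (tag assumed)
tools = [FunctionTool.from_defaults(fn=get_stock_price)]

# The ReAct loop interleaves LLM reasoning with tool calls (Steps 4-7),
# then synthesizes a final answer (Steps 8-9); chat history acts as memory (Step 10).
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)
print(agent.chat("What is the latest stock price of ACME?"))   # Step 1: user request
```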