A prototype on-premise Generative AI system, protecting 100% of your company's sensitive data and intellectual property. See 'About' for details
Retrieval-Augmented Generation (RAG) - Connects the LLM to internal knowledge bases for accurate, up-to-date, and context-aware responses, without having to re-train or fine-tune the entire model.
Powered by battle-tested open-source frameworks and the latest open-source LLMs, utilizing RAG and Agents, and protecting 100% of your company's sensitive data and intellectual property behind your firewall. Contact contact@ai-evolutions.com to integrate this system into your company

Advantages of this On-Premise Generative AI Platform
On-Premise AI vs External Cloud AI - Cost & Privacy Comparison
Aspect: Cost Structure
On-Premise AI (Your Own Infrastructure): Fixed & predictable
• One-time hardware/software investment
• Ongoing electricity & maintenance (usually lower than cloud at scale)
• No per-token fees
External Cloud AI (OpenAI, Google Vertex, AWS Bedrock, etc.): Pay-per-token model; costs grow rapidly as usage scales
• High-volume enterprise usage can reach millions of dollars per year
• Hidden costs from rate limits, caching tiers, and fine-tuning fees

Aspect: Privacy & Data Security
On-Premise AI: 100% control; data never leaves your network
• No external transmission
• Full compliance with internal policies & regulations
• Company secrets, intellectual property, & client data remain protected
External Cloud AI: All prompts & documents must be sent to the external provider
• Creates real security & legal risks of leaking:
  – Company trade secrets
  – Client confidential data
  – Personally identifiable information (PII)
• Even with "enterprise" agreements, data may be used for training or retained temporarily

Aspect: Vendor Lock-in Risk
On-Premise AI: No lock-in
• You own the models, data, and infrastructure
• Switch vendors or go fully open-source anytime
External Cloud AI: Severe lock-in over time
• External provider owns the context & fine-tuning history of your internal data
• Moving away means losing years of learned patterns & paying to re-train elsewhere
• Switching costs become prohibitive
Technical Overview
This prototype on-premise Generative AI system utilizes the following open-source frameworks and models. The setup is cost-efficient, customizable to meet specific requirements, and maximizes ROI for your company.

Hardware: Nvidia TU104GL [Tesla T4] 16GB; Intel Xeon Platinum 8259CL (4 cores), 16GB RAM. Entry-level hardware for testing the prototype system; ~10 tokens/second per prompt. For enterprise setups, see the recommended upgrades in the "Hardware Options" section.
OS: Ubuntu
Web Server: Nginx
Web Framework: Flask. For enterprise setups, upgrading to Django (Python) or Blazor (.NET) is recommended for improved scalability.
Inference Framework: Ollama. For enterprise setups, upgrading to LocalAI or vLLM (continuous batching) is recommended for improved scalability.
Reasoning Model: DeepSeek-R1-0528-Qwen3-8B-Q4_K_M (4-bit quantization). Handles complex tasks in the workflow. A distilled model that combines the strengths of DeepSeek R1 and Qwen3, with a good balance of speed and quality on entry-level hardware. For enterprise setups, see the recommended upgrades in the "Model Options" section.
General-Purpose Model: Qwen2:0.5B-Instruct-Q4_0 (4-bit quantization). Handles simple tasks in the workflow, thanks to its small parameter count (0.5B) and fast performance.
Embedding Model: Nomic-Embed-Text:Latest. Used with the vector DB for data ingestion and querying; extensible to other open-source models.
RAG Framework: LlamaIndex. Enables the LLM to process company-internal data sources.
Agent Framework: LlamaIndex. Enables the LLM to request actions, such as calling web APIs or retrieving the latest data from external websites, to automate business workflows.
Vector DB: Chroma. For enterprise setups, upgrading to PostgreSQL with pgvector is recommended for improved scalability.
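
To make the wiring concrete, below is a minimal sketch of how this stack could be assembled with LlamaIndex, Ollama, and Chroma. It assumes the modular llama-index packages and a locally running Ollama server; exact import paths and model tags vary by version, and the "./internal_docs" path and query are placeholders.

```python
# Minimal RAG wiring for the stack above: LlamaIndex + Ollama + Chroma.
# Import paths follow the modular llama-index packages and may differ by version;
# model tags and the "./internal_docs" path are assumptions for illustration.
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

llm = Ollama(model="deepseek-r1:8b", request_timeout=120.0)   # local reasoning model (tag assumed)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")  # embedding model

# Persistent Chroma collection as the vector DB.
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("company_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest internal documents, embed them, and store the vectors.
documents = SimpleDirectoryReader("./internal_docs").load_data()
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Query: retrieve the most relevant chunks, then answer with the local LLM.
query_engine = index.as_query_engine(llm=llm)
print(query_engine.query("Summarize our internal security policy."))
```
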
Hardware Options
The options below assume running a ~70B-parameter model at 8-bit quantization.

Option 1: 4x Mac Studio M3 Ultra Cluster
• Profile: Lower cost; simple setup; supports ~250 users
• Total Unified Memory: 2TB (512GB per unit)
• Chip Price USD (est.): ~$11,500 per unit
• Additional Build Components per Unit: Minimal (the Mac Studio is a complete workstation; only minor extras like a 10GbE NIC if needed)
• Networking & Integration (total for cluster): ~$2,000–$5,000 (10GbE switch, cables, basic rack/stand)
• Total Hardware Cost USD (est., from scratch): ~$50,000–$75,000 (4 units + extras; realistic pricing incl. assembly/markup)
• Performance Profile: Decent inference; no fine-tuning
• Memory Bandwidth: ~819 GB/s
• Per-Prompt Speed: ~12–15 tokens/second
• Total Cluster Throughput: ~150–400 tokens/second (with batching)
• Enterprise Support: Limited (self-managed)
• Availability (HK): Available

Option 2: Nvidia HGX H20 (4x H20 GPUs)
• Profile: Higher cost; complex setup; supports ~500 users
• Total VRAM: 384GB (96GB per GPU)
• GPU Price USD (est.): ~$15,000 per unit
• Additional Build Components per Unit: ~$13,000–$30,000 (full server build: motherboard / CPU / RAM / storage / PSU / cooling / case for a 4x H20 node)
• Networking & Integration (total): ~$5,000–$12,000 (25/100GbE switch, cabling, rackmount)
• Total Hardware Cost USD (est., from scratch): ~$100,000–$200,000+ (1 full server + networking; incl. server build costs)
• Performance Profile: Fast inference & fine-tuning
• Memory Bandwidth: ~3 TB/s
• Per-Prompt Speed: ~15–25 tokens/second
• Total Cluster Throughput: ~400–1,200 tokens/second (with batching)
• Enterprise Support: Great (NVIDIA/vendor)
• Availability (HK): Available (compliant)
Model Options
All model details assume 8-bit quantization.

DeepSeek R1 Distill 70B: A strong distilled variant of the flagship DeepSeek-R1 series, offering excellent reasoning, math, code, and agentic capabilities at a much more practical inference footprint. (~70–90 GB VRAM; 128,000-token context window)
Qwen2.5-72B-Instruct: One of the strongest open-source models currently available, excelling in reasoning, math, code, long-context understanding, and tool-calling performance. (~75–95 GB VRAM; 128,000-token context window)
Llama-3.3-70B-Instruct: A highly reliable, widely supported model with excellent general reasoning, instruction following, and agentic task performance across various applications. (~70–90 GB VRAM; 128,000-token context window)
Nemotron-4 ~70B distill: NVIDIA's top-tier open reasoning model (distilled/pruned variant), delivering outstanding agent performance and complex reasoning for robust automation. (~75–95 GB VRAM; 128,000-token context window)
Mixtral-8x22B-Instruct-v0.1: A highly efficient Mixture-of-Experts (MoE) model known for its fast inference, strong reasoning capabilities, and effective tool-calling performance. (~60–75 GB VRAM; 64,000–128,000-token context window)
AI Integration Roadmap
This roadmap outlines a step-by-step progression from zero AI capabilities to a sophisticated ecosystem where AI teammates collaborate autonomously. Each milestone builds on the previous, focusing on incremental adoption to minimize disruption while maximizing value.

When this roadmap is fully implemented, every employee will be empowered with their own AI agent: an intelligent, proactive partner that works alongside them 24/7. This AI agent will dramatically increase individual capabilities by handling repetitive tasks, providing instant insights, accelerating decision-making, and enabling focus on high-value creative and strategic work, ultimately driving sustainable competitive advantage and long-term revenue growth for the company.
Milestone Description & Objectives
Business Requirements Gathering & Stakeholder Alignment
(Pre-Technical Phase)
Discuss with key stakeholders (executives, department heads, IT, finance, end-users) to determine the most high-potential use cases, understand management's vision for AI integration, define success criteria, agree on realistic timelines, and estimate required investments and resources.
Establish Foundational AI Capabilities
(Starting from No AI)
Build basic infrastructure and skills to support AI initiatives. The goal is to create awareness, secure buy-in, and set up the technical backbone without overwhelming the team.
Data Preparation: Cleaning, Structuring & Access Controls
(Data Readiness)
Transform raw company data into secure, compliant, AI-ready formats. Key activities: inventory and assess all data sources, clean data (remove duplicates, fix errors), structure unstructured documents, implement access controls and data governance, build data pipelines, and ensure regulatory compliance.
Security & Compliance
(Secure Framework)
Implement security framework with data classification, role-based access, and audit trails. Establish incident response and regular audits. Success: 100% access control, zero privacy violations, full auditability of all AI data processing activities.
Introduce Passive AI Chatbot
(Employee Access to Basic Query Response)
Deploy a simple chatbot that responds to user queries on demand, like answering FAQs or retrieving info from internal docs. This is "passive" as it only activates when prompted.
Upgrade to Passive AI Assistance
(Enhanced Support Tools)
Evolve the chatbot into a more capable assistant by integrating it into applications and business workflows, so it can handle complex tasks passively, such as summarizing reports or suggesting resources, but still only on user initiation.
Develop Proactive AI Agent
(Anticipatory Actions)
Introduce an AI agentic workflow and transform the assistant into a proactive entity (an AI agent) that anticipates needs, such as sending reminders, flagging issues, or automating routine workflows without explicit prompts.
Enable AI Agents Collaboration
(Autonomous Multi-Agent System)
Create an AI agent discovery service, and a network of AI agents that can discover and interact with each other to solve complex problems, like coordinating tasks across departments or optimizing processes collaboratively.
This on-premise Generative AI system can be enhanced to support the following potential use cases:
Business Use cases
Head-Hunting Firm
  • Intelligent Resume & Candidate Screening
  • Automated Interview Question Generation & Candidate Scoring
  • AI-Powered Predictive Talent Sourcing & Outreach
Law Firm
  • Contract Review & Redaction
  • Legal Research & Case Law Summarization
  • Rapid Proposal & Pitch Automation for New Business
Hospital / Healthcare Provider
  • Clinical Decision Support & Risk Prediction
  • Automated Medical Report Summarization
  • Personalized Patient Acquisition & Retention Marketing
Government Agency
  • Sensitive Document Classification & Redaction
  • Policy Impact Simulation & Scenario Analysis
  • Predictive Resource Allocation for Economic Development
Financial Companies (Banks, Investment Firms, Insurance)
  • Real-time Fraud Detection & Anti-Money Laundering (AML)
  • Trade Surveillance & Market Abuse Detection
  • Predictive Trading & Investment Strategy Optimization
Technology Companies
  • Automated Code Review, Bug Detection & Technical Debt Prioritization
  • Intelligent Internal Documentation Generation & Knowledge Base Maintenance
  • AI-Driven Product Feature Prioritization & Revenue Impact Forecasting
Stock Exchanges
  • AI-Enhanced Trade Surveillance & Market Abuse Detection
  • Automated IPO Document Generation & Due Diligence Acceleration
  • AI-Driven Financial Product Innovation & Premium Revenue Streams
Notes:
Explore a simple example of the question "Capital of Italy", and how a GPT-style LLM predicts the answer "Rome"
Step 1: Tokenization & Embedding Lookup
Each word in the question ("capital", "of", "italy") is converted into a token. For each token, the model looks up the LLM Token Embedding Table to find a vector. This vector is a multi-dimensional structure containing the learned embedding for the token, and it is the initial state of the token vector. At this stage we have 3 token vectors:

Embedding Vector (capital) = [4,7,1,0,3,...]
Embedding Vector (of) = [2,7,2,4,6,...]
Embedding Vector (italy) = [8,1,4,9,3,...]
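
A minimal sketch of this lookup, assuming a toy 3-word vocabulary and a small embedding dimension (real models use vocabularies of roughly 100k tokens and dimensions in the thousands; all values here are illustrative):

```python
import numpy as np

# Toy vocabulary and embedding table; real LLMs use ~100k-token vocabularies
# and embedding dimensions in the thousands. All values are illustrative.
vocab = {"capital": 0, "of": 1, "italy": 2}
d_model = 8                                    # toy embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["capital", "of", "italy"]            # Step 1a: tokenization
token_ids = [vocab[t] for t in tokens]
x = embedding_table[token_ids]                 # Step 1b: embedding lookup
print(x.shape)                                 # (3, 8): one vector per token
```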
Step 2: Positional Encoding
Before passing the Embedding Vectors to the transformer layers, a Positional Encoding is added to each Embedding Vector. This gives the model a sense of the sequence of the vectors, since transformers by themselves do not know the order of the tokens "capital of italy"

Vector (capital) = Embedding Vector (capital) + Positional (1)

Vector (of) = Embedding Vector (of) + Positional (2)

Vector (italy) = Embedding Vector (italy) + Positional (3)

With the Positional Encoding added, the model now knows Vector (capital) comes first, Vector (of) second, and Vector (italy) third, so the transformer layers can respect the sequence
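
A sketch of the classic sinusoidal positional encoding from the original transformer paper; this is one common scheme, and many modern LLMs use alternatives such as rotary embeddings:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# x stands in for the (3, d_model) embedding matrix from Step 1.
d_model = 8
x = np.zeros((3, d_model))
x = x + sinusoidal_positions(3, d_model)   # position 0 = "capital", 1 = "of", 2 = "italy"
```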
Step 3: Transformer Layer - Self-Attention
Step 3a - Q, K, V vector projection - Each transformer layer has multiple attention heads, and each attention head contains 3 trained weight matrices (a combined code sketch for Steps 3a-3d follows at the end of Step 3):

Wq - this projects a token vector into a "query space" -> Q = what context the token is seeking
Wk - this projects a token vector into a "key space" -> K = what context it can provide
Wv - this projects a token vector into a "value space" -> V = the information of the context

Each Wq, Wk, Wv are applied to each token vector to project a Q, K, and V vector :
Vector (capital) = [4,7,1,0,3,...]
                   -> Q (query)
                   -> K (key)
                   -> V (value)
Vector (of) = [2,7,2,4,6,...]
              -> Q (query)
              -> K (key)
              -> V (value)
Vector (italy) = [8,1,4,9,3,...]
               -> Q (query)
               -> K (key)
               -> V (value)

Step 3b - Q-K similarity check - For each token vector, its Q vector is compared against the K vector of every token vector at or before its position, including itself

Vector (capital) Q -> compare against Vector (capital) K -> calculate Score (capital,capital)

Vector (of) Q -> compare against Vector (capital) K -> calculate Score (of,capital)
Vector (of) Q -> compare against Vector (of) K -> calculate Score (of,of)

Vector (italy) Q -> compare against Vector (capital) K -> calculate Score (italy,capital)
Vector (italy) Q -> compare against Vector (of) K -> calculate Score (italy,of)
Vector (italy) Q -> compare against Vector (italy) K -> calculate Score (italy,italy)

Step 3c - Softmax Normalization - Convert raw scores into attention weights for computation in the next stage

Score (capital,capital) -> Softmax Normalization -> Weight (capital,capital)

Score (of,capital) -> Softmax Normalization -> Weight (of,capital)
Score (of,of) -> Softmax Normalization -> Weight (of,of)

Score (italy,capital) -> Softmax Normalization -> Weight (italy,capital)
Score (italy,of) -> Softmax Normalization -> Weight (italy,of)
Score (italy,italy) -> Softmax Normalization -> Weight (italy,italy)

Step 3d - Self-Attention Vector - After the comparison, each token generates a new context-aware representation by weighting and combining the V vectors of all tokens at or before its position

Self-Attention vector (capital) = Weight (capital,capital) * V (capital)

Self-Attention vector (of) = Weight (of,capital) * V (capital) + Weight (of,of) * V (of)

Self-Attention vector (italy) = Weight (italy,capital) * V (capital) + Weight (italy,of) * V (of) + Weight (italy,italy) * V (italy)

This process of comparing each token vector's Q (what context it is seeking) against every preceding token vector's K (what context it can provide), and generating a self-attention output vector, is the Self-Attention process. It allows every token vector (word) in the prompt to acquire context and meaning based on its relationship with every other token vector (word) in the prompt

This describes the process for one attention head. Each transformer layer can have multiple attention heads, and this process is repeated independently for each attention head, and the outputs are combined
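
Below is a minimal single-head sketch of Steps 3a-3d, with toy dimensions and random weights standing in for trained ones. Note that standard attention also scales the Q-K scores by the square root of the head dimension before the softmax, a detail omitted in the walkthrough above:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
x = rng.normal(size=(3, d_model))                # vectors for "capital", "of", "italy"

# Step 3a: trained projection matrices (random placeholders here)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Step 3b: Q-K similarity, scaled by sqrt(d_head) as in standard attention
scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: each token attends only to itself and earlier tokens
scores[np.triu(np.ones((3, 3), dtype=bool), k=1)] = -np.inf

# Step 3c: softmax normalization -> attention weights
weights = softmax(scores)

# Step 3d: weighted combination of V vectors -> context-aware representations
attn_out = weights @ V
print(weights.round(2))   # row i holds Weight(token_i, token_j) for j <= i
```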
Step 4: Transformer Layer - Residual Connections & Layer Normalization
Step 4a - Residual Connections - combines the original Embedding Vector (capital) and the Self-Attention Vector (capital). This ensures the model keeps both the original meaning and the contextual enrichment

ResidualOutput Vector (capital) = Original Embedding Vector (capital) + Self-Attention Vector (capital)

Step 4b - Layer Normalization - ResidualOutput Vector (capital) is normalized across dimensions. This stabilizes the values and prepares them for the next step

ResidualOutput Vector (capital) -> Normalization -> Normalized Vector (capital)

Repeat the above steps to generate Normalized Vector (of) and Normalized Vector (italy)
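
A sketch of Steps 4a-4b with stand-in inputs; production layer norms also learn a per-dimension scale (gamma) and shift (beta), omitted here:

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize each vector across its dimensions (zero mean, unit variance)."""
    return (v - v.mean(axis=-1, keepdims=True)) / np.sqrt(v.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))           # token vectors entering the layer (Step 2 output)
attn_out = rng.normal(size=(3, 8))    # stand-in for the Step 3 self-attention output

residual = x + attn_out               # Step 4a: residual connection keeps the original signal
normalized = layer_norm(residual)     # Step 4b: layer normalization stabilizes the values
```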
Step 5: Transformer Layer - Feed-Forward Network (FFN)
Step 5a - Linear expansion - Apply the Expanded Weight Matrix to the Normalized Vector (capital), and expand the vector into a higher-dimensional space

Expanded Vector (capital) = Expanded Weight Matrix * Normalized Vector (capital)

The Expanded Vector (capital) now contains enriched representations (syntax, semantics, positional cues, ...)

Step 5b - Bias addition - Apply the transformer layer's Bias Vector to the Expanded Vector (capital), shifting it by a baseline offset. This enables a richer signal to propagate into the next stage

Expanded Vector w/ Bias (capital) = Expanded Vector (capital) + Bias Vector

Step 5c - Activation GELU (Gaussian Error Linear Unit) - GELU smoothly gates each dimension of Expanded Vector w/ Bias (capital), preserving strong signals while attenuating weaker ones, thereby refining the representation for the next stage

EV = Expanded Vector w/ Bias (capital)
CDF = Cumulative Distribution Function

GELU Smoothed Vector (capital) = EV * CDF(EV)

Step 5d - Linear compression - Apply the Compressed Weight Matrix to compress the vector back to the original dimensions, and generate the FFN (Feed-Forward Network) Vector, so it can be processed by the next stage

FFN Vector (capital) = Compressed Weight Matrix * GELU Smoothed Vector (capital) + Bias Vector

Repeat the above steps to generate FFN Vector (of) and FFN Vector (italy)
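
A sketch of Steps 5a-5d with toy dimensions (the expansion is typically ~4x the model dimension) and the common tanh approximation of the Gaussian CDF for GELU:

```python
import numpy as np

def gelu(v):
    """GELU via the common tanh approximation of the Gaussian CDF."""
    return 0.5 * v * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                        # expansion is typically ~4x d_model
normalized = rng.normal(size=(3, d_model))   # Step 4 output, one row per token

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

expanded = normalized @ W1 + b1              # Steps 5a-5b: linear expansion + bias
activated = gelu(expanded)                   # Step 5c: GELU gating
ffn_out = activated @ W2 + b2                # Step 5d: compression back to d_model
print(ffn_out.shape)                         # (3, 8)
```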
Step 6: Transformer Layer - Residual Connections & Layer Normalization
Step 6a - Residual Connections - combines the original Normalized Vector (capital) and the FFN Vector (capital). This ensures the model keeps both the original meaning and the contextual enrichment

ResidualOutput Vector (capital) = Original Normalized Vector (capital) + FFN Vector (capital)

Step 6b - Layer Normalization - ResidualOutput Vector (capital) is normalized across dimensions. This stabilizes the values and prepares them for the next step

ResidualOutput Vector (capital) -> Normalization -> Normalized Vector (capital)

Repeat the above steps to generate Normalized Vector (of) and Normalized Vector (italy)
Step 7: Stacking Layers
An LLM has multiple transformer layers, with each layer adding Depth and Hierarchical Learning to refine the representation

Depth - Each transformer layer allows the model to capture increasingly complex relationships (syntax, semantics, context)

Hierarchical Learning - Early transformer layers learn local patterns (word proximity, short dependencies), while deeper layers learn global patterns (long-range dependencies, abstract meaning)

For each transformer layer, the following steps are repeated :

Input : Vector (capital), Vector (of), Vector (italy)

Step 3: Transformer Layer - Self-Attention

Step 4: Transformer Layer - Residual Connections & Layer Normalization

Step 5: Transformer Layer - Feed-Forward Network (FFN)

Step 6: Transformer Layer - Residual Connections & Layer Normalization

Output : Normalized Vector (capital), Normalized Vector (of), Normalized Vector (italy)
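
A compact sketch of this loop, with simplified stand-ins for the attention and FFN sub-blocks from Steps 3 and 5 (real layers carry their own trained weights):

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    return (v - v.mean(axis=-1, keepdims=True)) / np.sqrt(v.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_model = 8

# Simplified stand-ins for the sub-blocks sketched in Steps 3 and 5.
def self_attention(v):
    return v @ rng.normal(size=(d_model, d_model)) * 0.1

def ffn(v):
    return np.maximum(v @ rng.normal(size=(d_model, d_model)) * 0.1, 0.0)

def transformer_stack(x, num_layers=4):          # real LLMs stack ~30-80 such layers
    for _ in range(num_layers):
        x = layer_norm(x + self_attention(x))    # Steps 3-4
        x = layer_norm(x + ffn(x))               # Steps 5-6
    return x

x = rng.normal(size=(3, d_model))                # "capital", "of", "italy"
hidden = transformer_stack(x)                    # final hidden vectors for Step 8
```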
Step 8: Final Layer & Output Projection
Step 8a - Final Layer - After stacking multiple transformer layers (Step 7), the last layer produces the final hidden representation of each token in the sequence :

Final Hidden Vector (capital) = the result of refining and integrating Normalized Vector (capital) across all transformer layers

Final Hidden Vector (of) = the result of refining and integrating Normalized Vector (of) across all transformer layers

Final Hidden Vector (italy) = the result of refining and integrating Normalized Vector (italy) across all transformer layers

Step 8b - Output Projection - Projects the Final Hidden Vector into the LLM vocabulary space, to calculate the Logit Vector, which is the relevance score of each vocabulary to the Final Hidden Vector :

Logit Vector (capital) = Vocabulary Weight Matrix * Final Hidden Vector (capital) + Bias Vector

Logit Vector (of) = Vocabulary Weight Matrix * Final Hidden Vector (of) + Bias Vector

Logit Vector (italy) = Vocabulary Weight Matrix * Final Hidden Vector (italy) + Bias Vector
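
A sketch of Step 8b with toy dimensions; real vocabularies contain tens of thousands to roughly 150k entries, and the weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 50_000                 # toy dims; real vocabs are ~30k-150k
final_hidden = rng.normal(size=(3, d_model))    # Step 8a output for the 3 tokens

# Step 8b: project each final hidden vector into vocabulary space
W_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)
logits = final_hidden @ W_vocab + b_vocab       # (3, vocab_size)
# logits[2] is Logit Vector (italy): one relevance score per vocabulary entry
```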
Step 9: Softmax & Token Selection
A GPT (Generative Pre-Trained Transformer) model uses only the last token's Logit Vector to predict the next vocabulary entry. Since "italy" is the last token in the question "Capital of Italy", only Logit Vector (italy) will be used to predict the next vocabulary entry.

Note that although Logit Vector (capital) and Logit Vector (of) are not used to predict the next vocabulary entry, they have contributed to the final representation of Logit Vector (italy) through the transformer layers

Step 9a - Softmax Probabilities - A softmax function is applied to the Logit Vector (italy) to convert raw scores into probabilities. This produces a probability distribution across the entire vocabulary in the LLM :

Probability Distribution (all vocabulary) = SoftMax [Logit Vector (italy)]

Step 9b - Prediction - Predict the next vocabulary entry with the highest probability
The Probability Distribution (all vocabulary) contains the probability of each vocabulary entry in the LLM relative to Logit Vector (italy). For example, the distribution might look like:

Probability (rome) = 0.55

Probability (milan) = 0.18

Probability (naples) = 0.12

Probability (turin) = 0.09

Probability (florence) = 0.06

...

Since Probability (rome) is the highest, "Rome" is selected as the next predicted vocabulary entry, answering the question "Capital of Italy"
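
A sketch of Steps 9a-9b over a toy 5-word vocabulary with illustrative logits; this shows greedy decoding, whereas real deployments often sample with temperature, top-k, or top-p instead:

```python
import numpy as np

# Toy logits for a 5-word vocabulary, standing in for Logit Vector (italy).
vocab = ["rome", "milan", "naples", "turin", "florence"]
logits_italy = np.array([2.2, 1.1, 0.7, 0.4, 0.0])

# Step 9a: softmax turns raw scores into a probability distribution
probs = np.exp(logits_italy - logits_italy.max())
probs /= probs.sum()

# Step 9b: greedy decoding picks the highest-probability entry
print(dict(zip(vocab, probs.round(2))))              # roughly matches the example above
print("next token:", vocab[int(np.argmax(probs))])   # -> "rome"
```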
Explore a simple example of an LLM agentic workflow
Step 1: User Request
Step 2: Orchestration Layer processes the user request
Step 3: Orchestration Layer provides the Tool Registry to the LLM
Step 4: LLM reasons and produces an execution plan & tool call
Step 5: Orchestration Layer validates & executes the tool call
Step 6: LLM validates the results and provides the next tool call
Step 7: Steps 5-6 repeat until the execution plan & tool calls are complete
Step 8: LLM Final Reasoning & Synthesis
Step 9: Orchestration Layer final output to user
Step 10: Memory/Context updated
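
As a concrete illustration of Steps 1-10 on the stack described above, here is a minimal sketch using LlamaIndex's ReAct agent with Ollama. The tool, model tag, and question are hypothetical placeholders, and the agent interface shown follows the older llama-index agent API; newer versions expose a workflow-based equivalent:

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.ollama import Ollama

# Hypothetical tool registered with the agent (Steps 3 and 5: the framework
# validates arguments against the function signature and executes the call).
def get_stock_price(ticker: str) -> str:
    """Return the latest price for a ticker (stubbed for illustration)."""
    return f"{ticker}: 123.45 USD"

llm = Ollama(model="deepseek-r1:8b")   # local reasoning model via Ollama (tag assumed)
tools = [FunctionTool.from_defaults(fn=get_stock_price)]

# The ReAct loop interleaves LLM reasoning with tool calls (Steps 4-7),
# then synthesizes a final answer (Steps 8-9); chat history acts as memory (Step 10).
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)
print(agent.chat("What is the latest stock price of ACME?"))   # Step 1: user request
```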