Architecture

Technical Deep Dive

Full technical specification. Written for engineers, operators, and architects evaluating the platform.

Layer 1

Hardware

Compute

GPU Nodes

NVIDIA GPU cluster — dedicated VRAM per inference session, no multi-tenant GPU sharing on active workloads

CPU Configuration

Multi-core x86-64 host CPUs for orchestration, pre/post-processing, and API routing layers

System RAM

High-capacity ECC RAM. Inference context loaded into RAM at session start — no swap, no disk writes during active processing

Storage

Model Storage

NVMe SSD for model weights and vector index persistence. Read-optimized, not used as inference scratch space

Client Data

No client payload written to disk. Prompt content, context windows, and intermediate vectors are held in RAM only and discarded at session teardown

Audit Records

Structured receipt log (session ID, timestamp, token count, model version) — no content, client-held copy on request
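A minimal sketch of what such a content-free receipt could look like. Field names and the serialization format here are illustrative assumptions, not the platform's actual schema:

```python
# Sketch of the receipt described above: metadata only, no payload content.
# Field names are illustrative, not the platform's actual schema.
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AuditReceipt:
    session_id: str
    timestamp: float      # Unix time at session teardown
    token_count: int      # total tokens processed in the session
    model_version: str    # pinned model identifier

def make_receipt(token_count: int, model_version: str) -> str:
    """Serialize one structured receipt line for the audit log."""
    receipt = AuditReceipt(
        session_id=str(uuid.uuid4()),
        timestamp=time.time(),
        token_count=token_count,
        model_version=model_version,
    )
    return json.dumps(asdict(receipt))

line = make_receipt(1847, "model-v1.2-q4")
# The log line carries counts and identifiers only; no prompt or output text.
```

The point of the structure is what is absent: nothing in the record can reconstruct client content.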

Physical Hosting

Provider

Dedicated hardware in a Tier-grade datacenter facility — not shared cloud. No hypervisor tenant isolation required because the hardware itself is not shared

Regional Gateways

Regional PoP nodes (dedicated hardware in a Tier-grade datacenter facility) used as WireGuard ingress endpoints — geographically distributed entry with centralized compute

Power & Redundancy

Tier-grade facility power, UPS, cooling, and redundant uplinks — operator-level SLA applied to the physical layer

Layer 2

Network Topology

Client ──TLS──▶ Regional PoP ──WireGuard──▶ Private Backbone ──100 GbE NIC-to-NIC──▶ GPU Cluster

External Surface

Ingress

2.5 GbE external-facing link per regional gateway. All non-WireGuard traffic is dropped at iptables before it reaches the application layer
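A sketch of that edge policy in iptables terms. The interface name and port are assumptions (WireGuard's default UDP port is 51820); the live ruleset is not shown here:

```shell
# Illustrative only: interface name and port are assumptions.
# Default-drop on the external link; admit only WireGuard UDP.
iptables -P INPUT DROP
iptables -A INPUT -i eth0 -p udp --dport 51820 -j ACCEPT
# Allow return traffic for established flows (the tunnel itself).
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
```

With a default DROP policy, probes to any other port receive no reply at all, so the application layer is never reached.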

Port Policy

NGINX configured to return 444 (close the connection without sending any response) on all connections not originating from an authenticated WireGuard peer: no banner, no headers, no error page. Combined with the iptables drop policy at the edge, unauthenticated scans see a port that appears filtered rather than a running service
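A minimal sketch of how that policy can be expressed in NGINX. Addresses and the backend are placeholders, not the live configuration:

```nginx
# Illustrative sketch, not the live config. The API is served only on the
# WireGuard-side address; everything else falls through to the default server.
server {
    listen 10.0.0.1:443 ssl;   # tunnel-side address (assumed)
    # proxy_pass to the FastAPI backend goes here
}
server {
    listen 443 default_server;
    # 444 closes the connection with no response: no status line,
    # no headers, no banner.
    return 444;
}
```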

Tunnel Protocol

WireGuard over UDP — modern cryptography (Curve25519, ChaCha20-Poly1305, BLAKE2s), stateless handshake, no persistent session tables
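The peer model can be sketched as a standard wg0.conf. Keys, addresses, and the port below are placeholders; only explicitly provisioned public keys can complete a handshake:

```ini
# Illustrative wg0.conf sketch -- keys, addresses, and port are placeholders.
[Interface]
PrivateKey = <gateway-private-key>
Address = 10.0.0.1/24
ListenPort = 51820

[Peer]
# One block per provisioned client; unknown keys never get a reply.
PublicKey = <client-public-key>
AllowedIPs = 10.0.0.2/32
```

Because the handshake is stateless, an unrecognized initiation packet is silently ignored, which is what makes the gateway invisible to scans.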

Internal Fabric

Topology

NIC-to-NIC direct attach — nodes are connected at the physical layer with no intermediate switch, router, or hub. No device exists in the path that can be tapped, port-mirrored, or compromised

Throughput

100 GbE at wire speed (~12.5 GB/s) — exceeds NVMe write throughput (~7 GB/s), which is the architectural basis for RAM-only processing: moving data between nodes over the wire is faster than writing it to disk
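The arithmetic behind that claim as a quick sanity check. The NVMe figure is a nominal sequential-write number for current drives, not a measured value from this deployment:

```python
# Back-of-envelope check: 100 GbE line rate vs. nominal NVMe write speed.
wire_gbps = 100                 # 100 GbE, gigabits per second
wire_gbytes = wire_gbps / 8     # bytes on the wire: 12.5 GB/s
nvme_write_gbytes = 7.0         # assumed NVMe sequential write, GB/s

assert wire_gbytes > nvme_write_gbytes
print(f"wire: {wire_gbytes} GB/s, NVMe: {nvme_write_gbytes} GB/s, "
      f"ratio: {wire_gbytes / nvme_write_gbytes:.2f}x")
```

At these figures the fabric carries roughly 1.8x what the fastest local disk could absorb, so persisting intermediate data to disk would only slow the pipeline down.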

Security Properties

No shared broadcast domain — ARP poisoning and VLAN hopping attacks have no surface. No switch firmware attack vector. Passive interception requires physical cable access, not network access

Segmentation

Compute, storage, and management planes on separate direct-attach links. No lateral movement path between planes — they are physically separate cables, not logical VLANs on a shared switch

Layer 3

Software Runtime

Operating System & Base

Debian/Ubuntu LTS — minimal installs, no GUI, no unnecessary daemons
NVIDIA CUDA drivers — matched to model runtime requirements, pinned versions
WireGuard kernel module — native, no userspace overhead
NGINX — API gateway and external surface, 444 policy enforced at this layer

Inference Runtime

llama.cpp — GGUF quantized models, GPU-offloaded layers, minimal memory footprint per active session
FastAPI — async Python API layer between NGINX and the inference backend
RAG pipeline — sentence-transformers for embedding, FAISS / Qdrant for vector retrieval, injected at context assembly before inference
Session isolation — each inference request runs in a scoped context; no shared KV-cache between clients
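The retrieve-then-assemble step of the RAG pipeline can be sketched as follows. The real pipeline uses sentence-transformers embeddings and a FAISS/Qdrant index; here a toy bag-of-words vector and brute-force cosine similarity stand in for both, purely to show where retrieval is injected before inference:

```python
# Toy sketch of retrieval-augmented context assembly. Bag-of-words
# vectors and brute-force cosine stand in for the real embedding
# model and vector index.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def assemble_context(query: str, docs: list[str]) -> str:
    # Retrieved passages are injected ahead of the user prompt; the
    # assembled context then goes to the inference backend in RAM.
    passages = retrieve(query, docs)
    return "\n".join(passages) + "\n\nQuery: " + query

docs = ["wireguard handles tunnel ingress",
        "nvme stores model weights",
        "ram holds client context only"]
print(assemble_context("where is client context held", docs))
```

In the deployed pipeline the assembled context lives only in the session's scoped RAM allocation and is discarded at teardown, consistent with the storage policy above.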

Orchestration & Management

Django — client management, audit receipt generation, admin surfaces (management network only, not reachable from compute plane)
Python scripting — node health checks, model version pinning, WireGuard peer provisioning automation
Bash — system automation, boot provisioning, iptables reload on config change, log rotation
No Kubernetes — deliberate. Orchestration overhead adds attack surface not justified by current scale
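One of the health checks above could look like this sketch: it parses the tab-separated output of `wg show <iface> dump` and flags peers whose last handshake is stale. The field positions follow the wg(8) dump format and the threshold is an assumption; both should be verified against the deployed WireGuard version:

```python
# Sketch of a peer-health check over `wg show <iface> dump` output.
# Field positions follow the wg(8) dump format (peer lines carry
# latest-handshake as the 5th field); verify before relying on this.

STALE_AFTER = 300  # seconds without a handshake before a peer is flagged

def stale_peers(dump: str, now: float) -> list[str]:
    flagged = []
    lines = dump.strip().splitlines()
    for line in lines[1:]:            # line 0 describes the interface
        fields = line.split("\t")
        pubkey, latest_handshake = fields[0], int(fields[4])
        if latest_handshake == 0 or now - latest_handshake > STALE_AFTER:
            flagged.append(pubkey)
    return flagged
```

In production the dump text would come from a subprocess call to `wg show wg0 dump` on each gateway, with flagged peers fed into the provisioning automation.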


Next: the workloads and clients this environment is designed to serve.
