Architecture

Technical Deep Dive

Full technical specification. Written for engineers, operators, and architects evaluating the platform.

Supported Workloads

The workloads the platform is built to run.

The platform is purpose-built for AI inference workloads that require privacy, low latency, and predictable throughput — not general-purpose cloud compute.

LLM Inference

Quantised large language model inference via llama.cpp. GGUF model files are loaded from NVMe into GPU VRAM at startup; subsequent requests are served from resident weights with no reload latency.

Streaming token output over persistent HTTP connection
Context window configurable per client agreement
No cross-session KV-cache leakage — context isolated per request
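The streaming output above follows the OpenAI-compatible wire format, so a client only needs to parse server-sent-event lines. A minimal sketch, assuming a hypothetical tunnel-local endpoint and placeholder token (both illustrative, not the platform's actual values):

```python
import json

def parse_sse_chunk(line: str):
    """Extract the token text from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    delta = json.loads(payload)["choices"][0].get("delta", {})
    return delta.get("content")

if __name__ == "__main__":
    import requests
    # Hypothetical tunnel-local address and token -- replace with your
    # deployment's values from onboarding.
    with requests.post(
        "http://10.0.0.1:8000/v1/chat/completions",
        headers={"Authorization": "Bearer <api-token>"},
        json={"model": "local", "stream": True,
              "messages": [{"role": "user", "content": "Hello"}]},
        stream=True,
    ) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            token = parse_sse_chunk(raw)
            if token:
                print(token, end="", flush=True)
```

Because the format matches OpenAI's streaming API, existing OpenAI SDKs can be pointed at the tunnel endpoint without the manual parsing shown here.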

RAG Pipelines

Retrieval-augmented generation using client-supplied document corpora. Embeddings are generated on-cluster and indexed into a vector store scoped to the client's corpus — the index persists per deployment as required for retrieval to function, but is isolated per client with no shared index between tenants. Raw document content is not retained after indexing.

sentence-transformers for embedding generation
FAISS or Qdrant for vector index depending on corpus size
Client corpus isolated — no shared index between tenants
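The isolation property above is structural: each client gets its own index object holding only embeddings and opaque IDs. A minimal sketch using a brute-force NumPy search as a stand-in for the FAISS/Qdrant index (class and ID names are illustrative):

```python
import numpy as np

class ClientIndex:
    """Per-client vector index: embeddings persist, raw documents do not."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.doc_ids: list = []

    def add(self, doc_id: str, embedding: np.ndarray) -> None:
        # Only the embedding and an opaque ID are retained -- never
        # the document text itself.
        v = embedding / np.linalg.norm(embedding)
        self.vectors = np.vstack([self.vectors, v.astype(np.float32)])
        self.doc_ids.append(doc_id)

    def search(self, query: np.ndarray, k: int = 3) -> list:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q          # cosine similarity on unit vectors
        top = np.argsort(scores)[::-1][:k]
        return [self.doc_ids[i] for i in top]
```

In production the brute-force search is replaced by a FAISS or Qdrant index, but the tenancy boundary is the same: one index per client, no shared namespace.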

Computer Vision / ASL Recognition

GPU-accelerated frame inference for sign language recognition models. Video frames are processed in-memory and classification is returned per frame or sequence; no video is stored at rest. This is the primary workload driving SignaVision's own Deaf accessibility platform, so the infrastructure runs in production for this purpose rather than being offered speculatively.

Direct ingest from application layer over WireGuard tunnel
PyTorch / ONNX model runtimes supported
Primary workload for SignaVision's own accessibility platform
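The in-memory frame path above can be sketched as a pure preprocessing step feeding an ONNX runtime session. The model filename and input shape below are hypothetical; the point is that frames exist only as in-memory tensors:

```python
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """HWC uint8 frame -> NCHW float32 tensor normalised to [0, 1].
    Frames live only in memory; nothing is written to disk."""
    x = frame.astype(np.float32) / 255.0
    x = np.transpose(x, (2, 0, 1))      # HWC -> CHW
    return x[np.newaxis, ...]           # add batch dimension

if __name__ == "__main__":
    import onnxruntime as ort
    # Hypothetical model path and provider selection.
    sess = ort.InferenceSession("asl_classifier.onnx",
                                providers=["CUDAExecutionProvider"])
    frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a decoded frame
    inputs = {sess.get_inputs()[0].name: preprocess_frame(frame)}
    logits = sess.run(None, inputs)[0]
    print(int(logits.argmax()))
```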

Batch & Embedding Jobs

Scheduled or triggered batch jobs for document ingestion, model fine-tuning preparation, and corpus embedding generation. Runs during off-peak inference windows to avoid throughput contention.

Job submission via authenticated API endpoint
Output delivered to client — not retained on-cluster after delivery
Audit receipt generated per job with hash of output payload
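The audit receipt above lets a client verify what was delivered without the platform retaining the payload. A minimal sketch of receipt generation (field names are illustrative):

```python
import hashlib
import time

def audit_receipt(job_id: str, output: bytes) -> dict:
    """Build a per-job receipt: the hash proves what was delivered,
    while the payload itself is not retained on-cluster."""
    return {
        "job_id": job_id,
        "sha256": hashlib.sha256(output).hexdigest(),
        "size_bytes": len(output),
        "completed_at": int(time.time()),
    }
```

On delivery, the client recomputes the SHA-256 of the received payload and compares it to the receipt; a match confirms integrity end-to-end.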

Integration Model

Integration profile.

The platform assumes a technical operator on the client side. Access is machine-to-machine by default — no browser-based UI, no managed console. Clients connect programmatically and own their integration layer.

Access Pattern

Connection

WireGuard peer provisioned per client. Public key exchange completed during onboarding. Tunnel is persistent and encrypted end-to-end
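A client-side peer configuration follows the standard WireGuard format. The keys, addresses, and gateway hostname below are placeholders; actual values are exchanged during onboarding:

```ini
[Interface]
# Client's private key, generated locally and never shared
PrivateKey = <client-private-key>
Address = 10.0.0.2/32

[Peer]
# Cluster public key, exchanged during onboarding
PublicKey = <cluster-public-key>
# Hypothetical regional gateway endpoint
Endpoint = gateway.example:51820
# Route only the inference endpoint through the tunnel
AllowedIPs = 10.0.0.1/32
# Keep the tunnel persistent through NAT
PersistentKeepalive = 25
```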

API Surface

REST endpoints via FastAPI over the tunnel. OpenAI-compatible completion endpoint available for drop-in SDK compatibility

Authentication

Mutual: WireGuard peer key at network layer + API token at application layer. Both required for any accepted request
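By the time a request reaches the application layer, the WireGuard peer key has already gated it at the network layer; what remains is the token check. A minimal sketch of that second half, using a constant-time comparison (function and header names are illustrative, not the platform's actual code):

```python
import hmac
from typing import Optional

def authorize(auth_header: Optional[str], expected_token: str) -> bool:
    """Application-layer half of the mutual scheme. The network layer
    (WireGuard peer key) has already been enforced before this runs."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(presented, expected_token)
```

In a FastAPI service this check would typically sit in a dependency applied to every route, so no endpoint is reachable with only one of the two credentials.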

Integration Requirements

Client Side

WireGuard client (Linux, macOS, Windows, mobile all supported). Any HTTP client capable of hitting a local tunnel endpoint

No SDK Required

Standard HTTP/JSON — works with curl, Python requests, any OpenAI-compatible client, or direct TCP from application code

Latency profile

Regional gateway minimises WireGuard round-trip. First-token latency primarily determined by model size and context length, not network overhead

Ideal Fit

Organizations that must retain control over their data and execution
Teams capable of integrating and operating workloads programmatically
Environments requiring private infrastructure for compliance or policy
Use cases where data cannot leave organizational control

Not a Fit For

Consumer-facing products requiring multi-tenant browser sessions with no engineering team
Workloads requiring persistent training runs or full model fine-tuning at scale
Clients who need a managed UI dashboard rather than an API
Burst-only workloads where cost beats privacy — public cloud inference is cheaper at small scale

Who It's For

Organizations where data cannot leave.

Law firms

Client privilege

Routing client communications or case materials through external inference infrastructure creates discovery exposure and privilege risk. Execution stays inside the firm's control boundary.

Healthcare providers

Patient data compliance

PHI cannot be processed on shared cloud infrastructure without contractual and technical controls that most providers cannot verify. This keeps patient data inside the clinical environment — no external processor in the chain.

Government agencies

Data sovereignty

CUI, ITAR, and jurisdictional data residency requirements often prohibit processing on foreign-owned or multi-tenant infrastructure. Dedicated private compute eliminates the compliance ambiguity.

Universities and research institutions

Restricted datasets

IRB agreements, export control obligations, and funding body data use agreements routinely restrict where research data can be processed. Cloud inference APIs cannot satisfy these contractual requirements.

Financial and engineering teams

IP protection

Sending proprietary models, trading logic, or engineering specifications to external inference APIs creates IP exposure that legal review rarely approves. Private infrastructure removes the external dependency entirely.

Accessibility and Deaf education programs

Sensitive communication data

Signing video from students — particularly minors — carries consent and privacy constraints that third-party cloud processing cannot meet. Data is evaluated in-session and never routed externally.

Continue

Next: the principles and philosophy behind how this platform is designed.

Design Principles →