Zero-Knowledge Proofs at Lunal

Overview

Lunal uses zero-knowledge (ZK) proofs to complement TEE attestations and prove that Lunal's infrastructure operates correctly while preserving customer privacy. This documentation explains our rationale for using ZK and the specific use cases we've implemented.

Why Lunal Uses ZK

The Fundamental Limitation of Attestation

Confidential computing attestation proves identity, not behavior.

A TEE attestation cryptographically answers: "What software was loaded?" It does not answer: "Did that software behave correctly?" Behavior can be implied from identity. If you know exactly what code was loaded and trust that code, you can reason about what it should do. But implication is not proof. The attestation itself says nothing about what actually happened at runtime.

This distinction matters enormously when things go wrong. When things security is breached, debugging, incident response, and security audits all ultimately require inspecting behavior: What did the system actually do? What data did it process? What decisions did it make? Attestation alone cannot answer these questions. It can only confirm that the right code was present when the system started.

The Trust Challenge in Lunal's Architecture

This limitation is particularly acute in Lunal's architecture. Lunal TEEs run two distinct types of software:

  1. Customer applications: software that customers load and run inside the TEE
  2. Lunal software: software that Lunal runs inside the customer's TEE for operations such as attestation, performance monitoring, and autoscaling

Both become part of the TEE's trusted boundary. While customers maintain full control over their own applications, they must extend trust to Lunal's software running alongside their workloads. Even though Lunal's software is open source and externally audited, this still requires customers to trust that the software behaves as documented every time it runs.

From Trust to Proof

ZK is the nuclear option for proving behavior. Unlike attestation (which proves identity) or logging (which can be forged), ZK proofs are cryptographically bounded. They mathematically guarantee that a specific computation occurred correctly, with no room for tampering or misrepresentation.

With ZK, instead of asking customers to trust that Lunal's software behaves correctly, we can prove it every time the software runs.

What Are Zero-Knowledge Proofs?

A zero-knowledge proof (ZKP) is a cryptographic protocol that allows one party (the prover) to prove to another party (the verifier) that a statement is true, without revealing any information beyond the validity of the statement itself.

Imagine you want to prove you know the solution to a Sudoku puzzle without revealing the solution. With a ZKP, you could mathematically demonstrate that you possess a valid solution, and the verifier would be cryptographically certain you're telling the truth, yet they'd learn nothing about the actual numbers in your grid. ZK proofs satisfy three properties:

  1. Completeness: If the statement is true, the verifier will be convinced.

  2. Soundness: If the statement is false, no cheating prover can convince the verifier.

  3. Zero-Knowledge: If the statement is true, the verifier learns nothing beyond that fact. No information about the underlying data leaks.

The Cost of Cryptographic Certainty

However, ZK is not a general-purpose solution. If your security architecture is a castle, ZK is a super expensive precision cannon: devastatingly effective where deployed, but extremely expensive to fire and limited in ammunition.

Integration complexity. ZK requires representing computations as arithmetic circuits. This is a fundamentally different programming model that requires specialized expertise and significant engineering investment.

Performance overhead. Proof generation is computationally expensive. ZK cannot be used in hot paths. It is limited to background processes where proving times of seconds (not milliseconds) are acceptable.

Scope limitations. Not everything can be proven. System calls, network operations, file I/O, and hardware interactions cannot be represented as arithmetic circuits. ZK proves the processing of data, not its collection.

This means ZK must be deployed strategically at high-value chokepoints rather than comprehensively across an entire system.

Use Cases

1. Autoscaling Metrics Extraction

The Challenge

Lunal needs to measure system metrics for autoscaling decisions, but hypervisors have limited visibility into guest VMs:

MetricHypervisor VisibilityLimitation
MemoryAllocated memory onlyCannot see actual usage within guest
CPUGlobal utilizationCannot break down by individual processes
DiskStorage allocationBlind to filesystem utilization
NetworkPacket flowCannot see application-level metrics

The Solution

Lunal runs telemetry extraction software inside the TEE and uses ZK proofs to verify autoscaling decisions without exposing sensitive process data.

Example: Memory-Based Autoscaling

Private Inputs:
- Per-process memory usage
- Swap activity
- Page fault rates

Public Outputs:
- Scaling decision (scale out: yes/no)
- Threshold percentage (e.g., 87%)

ZK Constraint:
IF p90_memory_usage > 85% of allocated_memory
THEN scaling_out_required = true

The proof verifies the scaling logic executed correctly without revealing which processes are running or their individual memory consumption.

Example: CPU-Based Autoscaling

Private Inputs:
- Per-process CPU usage
- Process ownership (Lunal vs. customer)

Public Outputs:
- Aggregate utilization band (e.g., "70-80%")
- Scaling decision
- Lunal overhead confirmation (< 10%)

ZK Constraints:
IF customer_process_cpu_p95 > 80%
THEN trigger_scale_out = true

AND lunal_overhead < 10% of total_cpu

This proves that scaling decisions are based on actual customer workload needs, not Lunal's own overhead.

Example: Disk-Based Autoscaling

Private Inputs:
- Filesystem usage per mount point
- Inode usage
- I/O queue depths

Public Outputs:
- Storage pressure indicator
- Scaling decision

ZK Constraints:
IF disk_usage > 85% OR inode_usage > 90%
THEN trigger_scale_out = true

IF io_queue_depth > threshold
THEN performance_issue = true

2. Software Inclusion Proofs

The Challenge

When customers license Lunal software for on-premise deployment, they run it alongside proprietary applications Lunal cannot access. TEEs provide a single measurement hash for all software, but since Lunal cannot reproduce measurements that include customer software, we have no way to verify which Lunal components were actually included.

The Solution

Lunal uses Merkle trees to generate inclusion proofs that verify specific software components were loaded correctly.

How It Works

Deployment Structure:
├── Customer Application A (proprietary)
├── Customer Application B (proprietary)
├── Lunal Component X (can prove this)
├── Lunal Component Y (can prove this)
└── Customer Application C (proprietary)

Merkle Tree:
         Root Hash (in TEE attestation)
              /          \
            ...          ...
           /   \        /   \
        Hash(A) Hash(B) Hash(X) Hash(Y) ...

ZK Inclusion Proof:
- Public Input: Root hash from TEE attestation
- Private Input: Component X binary + Merkle path
- Public Output: "Component X included"

Important Note: Proving software was loaded correctly does not prove the software ran correctly. It only verifies the initial state.

3. Performance Monitoring and SLAs

The Challenge

Lunal runs application services (load balancers, API gateways, proxies) inside TEEs to maintain privacy guarantees. We need to monitor these services for operational health and SLA compliance without exposing customer request data or traffic patterns.

The Solution

ZK proofs verify service performance and SLA compliance with zero visibility into actual customer traffic.

Example: API Gateway Health Monitoring

Private Inputs:
- HTTP status codes per request
- Error logs
- Connection states
- Backend health checks

Public Outputs:
- Health score: 99.95%
- SLA compliance: yes
- Reporting period: 2024-01-15 to 2024-01-22

ZK Constraints:
5xx_error_rate < 0.1%
backend_availability > 99%
connection_success_rate > 99.5%
rate_limit_violations = 0

The proof confirms the gateway is healthy and meeting SLAs without revealing which endpoints were called, request payloads, or customer identifiers.

Example: Latency SLA Verification

Private Inputs:
- Request latencies (array of timestamps)
- Request types (GET, POST, etc.)
- Timestamp pairs (request_start, request_end)

Public Outputs:
- SLA status: COMPLIANT
- Reporting period: Week 3, January 2024
- Breach count: 0

ZK Constraints:
p95_latency < 100ms (SLA threshold)
p99_latency < 500ms (ceiling limit)
sample_size > 10,000 requests

Example: Percentile Calculation Without Data Exposure

Lunal calculates latency percentiles for internal performance improvements:

Private Inputs:
- Individual request latencies: [23ms, 45ms, 67ms, ...]
- Request count: 50,000

Public Outputs:
- p50: 45ms
- p95: 98ms
- p99: 234ms

ZK Proof:
Verifies correct percentile calculation without revealing:
- Individual request latencies
- Request patterns or distributions
- Traffic volume fluctuations
- Customer-specific performance characteristics

This allows Lunal to optimize platform performance while maintaining complete customer privacy.

Caveats and Limitations

Performance Constraints

Most ZK proofs cannot be used in the hot path of customer requests due to performance overhead. All Lunal ZK use cases operate asynchronously and out-of-band from customer traffic. Therefore, Lunal's usage of ZK is limited to background processes where proving times on the order of magnitude of seconds are acceptable. The verification of such proofs take order of magnitude 10s of ms.

Arithmetic Circuit Limitations

ZK can only prove operations that can be represented as arithmetic circuits. This creates fundamental constraints on what we can verify.

System interactions that cannot be represented as arithmetic circuits:

In practice, this means Lunal's ZK proofs verify the processing and analysis of data rather than the collection of that data itself. Lunal can prove "given these CPU usage measurements, Lunal's autoscaling logic correctly determined scaling was needed" but cannot prove "these CPU usage measurements were collected correctly from the system." The data collection would still occur under the trust boundary of the TEE though, which provides acceptable guarantees.