Profile guide

Base ISL Patterns

These profiles summarize the input sequence length distributions that drive prefill pressure, cache reuse opportunity, and replay behavior. Use them as visual shorthand when choosing a trace shape for planning runs.

Figure 3

Benchmarking workflow

Figure 3: Benchmarking workflow. The stages, in order: capture traces, normalize schema, run baselines, replay segments, set thresholds, sweep sensitivity, and output a sizing and cost model.
Benchmarking starts with representative traces and ends with sizing guidance that holds up under changes to configuration, cache tier, and network conditions.

Efficiency model

Effective efficiency

Successful output volume tells only part of the story. To get a useful throughput signal, divide it by all of the system costs that consume the run envelope.

Effective efficiency: useful work divided by the full cost envelope.
Numerator (successful outputs): completed, valid, delivered responses.
Denominator (system cost envelope): accelerator time + memory pressure + network + storage I/O + power.
When any denominator bucket grows, effective efficiency falls even if raw accelerator utilization looks healthy.
Effective efficiency drops as memory, network, storage I/O, and power costs expand around accelerator time.
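The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the guide's own implementation. It assumes every denominator bucket has already been normalized to one shared cost unit, since the source does not say how time, memory pressure, and power are made commensurable. The names CostEnvelope and effective_efficiency and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEnvelope:
    """Denominator buckets, assumed pre-normalized to one shared cost unit."""
    accelerator_time: float
    memory_pressure: float
    network: float
    storage_io: float
    power: float

    def total(self) -> float:
        # Sum of all buckets that consume the run envelope.
        return (self.accelerator_time + self.memory_pressure
                + self.network + self.storage_io + self.power)


def effective_efficiency(successful_outputs: float, costs: CostEnvelope) -> float:
    """Successful outputs divided by the full system cost envelope."""
    total = costs.total()
    if total <= 0:
        raise ValueError("cost envelope must be positive")
    return successful_outputs / total


# Hypothetical numbers: same outputs and accelerator time, heavier storage I/O.
baseline = CostEnvelope(accelerator_time=100.0, memory_pressure=10.0,
                        network=5.0, storage_io=5.0, power=20.0)
heavier_io = CostEnvelope(accelerator_time=100.0, memory_pressure=10.0,
                          network=5.0, storage_io=45.0, power=20.0)

baseline_eff = effective_efficiency(1400.0, baseline)
heavier_eff = effective_efficiency(1400.0, heavier_io)
```

In this sketch accelerator time is identical in both runs, yet the second run scores lower because the storage I/O bucket grew, which is the effect the caption describes.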
Figure 2

KV cache hierarchy latencies

Log-scale bar chart showing GPU HBM at about 0.001 ms, CPU DRAM at about 0.05 ms, local flash at about 1 ms, and shared storage at about 10 ms.
Once KV state leaves HBM, the dominant decision is whether retrieving it is faster than recomputing it.
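The retrieve-or-recompute decision can be sketched as a simple comparison. The tier latencies below are the approximate values from the chart. The function name should_retrieve, the accesses parameter, and the 5 ms recompute estimate are hypothetical. The sketch also assumes that per-access latency is the whole retrieval cost, which ignores bandwidth and the size of the cached state.

```python
# Approximate per-access latencies from the chart, in milliseconds.
TIER_LATENCY_MS = {
    "gpu_hbm": 0.001,
    "cpu_dram": 0.05,
    "local_flash": 1.0,
    "shared_storage": 10.0,
}


def should_retrieve(tier: str, recompute_ms: float, accesses: int = 1) -> bool:
    """Return True when fetching cached state from `tier` beats recomputing it.

    Assumption: retrieval cost is latency times the number of accesses;
    transfer time for large KV blocks is not modeled here.
    """
    retrieval_ms = TIER_LATENCY_MS[tier] * accesses
    return retrieval_ms < recompute_ms


# Hypothetical recompute estimate of 5 ms for a short prefix.
decisions = {
    tier: should_retrieve(tier, recompute_ms=5.0)
    for tier in TIER_LATENCY_MS
}
```

With that hypothetical 5 ms estimate, every tier down to local flash favors retrieval and only shared storage favors recompute. A real planner would replace both sides of the comparison with measured values.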