The Client-Side Inference Pipeline
The Client-Side Inference Pipeline
Everything in this doc runs on the user's machine, before encryption. It's the bridge between human-readable text and the encrypted 16-dimensional input that enters the MPC cluster.
Architecture
The Tokenizer
What It Does
Takes a string and produces token IDs that the transformer model understands.
all-MiniLM-L6-v2 uses WordPiece tokenization (same as BERT):
- Common words stay intact: "love" →
[2293] - Uncommon words split: "unconstitutionally" →
["un", "##con", "##stitution", "##ally"] - Special tokens added:
[CLS](101) at start,[SEP](102) at end
Implementation Choice: Disable Padding
The tokenizer.json from sentence-transformers ships with padding enabled. This means every input gets padded to a fixed length with [PAD] tokens. I disabled this because:
- Wasted computation: Padding a 5-token sentence to 128 tokens means 123 useless tokens flowing through the transformer. At ~0.7s per inference, that's pure overhead.
- tract handles variable shapes: I re-optimize the tract model for each input's actual sequence length, so padding provides no benefit.
- Cleaner API: Without padding, the attention mask is always all-1s, and token count accurately reflects the input text.
let mut inner = Tokenizer::from_file(path)?;
inner.with_padding(None); // No padding
inner.with_truncation(None); // No truncationThe ONNX Model
all-MiniLM-L6-v2
A 22M-parameter sentence transformer from Microsoft. It takes tokenized text and produces 384-dimensional embeddings where similar sentences are close in vector space.
ONNX inputs (discovered by inspecting the model, not from docs):
| Input | Type | Shape |
|---|---|---|
input_ids | i64 | [1, seq_len] |
attention_mask | i64 | [1, seq_len] |
token_type_ids | i64 | [1, seq_len] |
token_type_ids is all zeros for single-sentence input. I initially tried passing only 2 inputs (as the documentation suggested), but the model requires all 3.
Output: [1, seq_len, 384] — per-token hidden states.
Mean Pooling
The transformer gives per-token embeddings, not a sentence embedding. To get a single vector:
sentence_embedding = sum(hidden_states[t] for t where mask[t]==1) / count(mask==1)
This averages all real token embeddings, ignoring padding. It's the standard approach for sentence-transformers models.
I implemented mean pooling as a standalone function so it can be tested with synthetic data (no model needed):
pub fn mean_pool(hidden_states: &[f32], seq_len: usize, embed_dim: usize, mask: &[i64]) -> Vec<f32>tract and Dynamic Shapes
tract needs concrete tensor shapes to optimize the computation graph. Since different sentences produce different token counts, I re-optimize per call:
let model = tract_onnx::onnx()
.model_for_read(&mut Cursor::new(&self.model_bytes))?
.with_input_fact(0, InferenceFact::dt_shape(i64::datum_type(), tvec!(1, seq_len as i64)))?
.with_input_fact(1, InferenceFact::dt_shape(i64::datum_type(), tvec!(1, seq_len as i64)))?
.with_input_fact(2, InferenceFact::dt_shape(i64::datum_type(), tvec!(1, seq_len as i64)))?
.into_optimized()?
.into_runnable()?;This adds ~0.5s overhead per call. For a production system, I'd either:
- Pre-optimize for common sequence lengths (8, 16, 32, 64, 128) and pick the closest
- Use
tract_onnx::onnx().model_for_path()with symbolic dimensions
For a demo, re-optimization per call is acceptable and simpler.
The Pipeline Struct
The InferencePipeline is the public API. It holds the tokenizer + model and exposes three methods at different stages:
impl InferencePipeline {
// Raw embedding — useful for generating PCA training data
fn embed_f32(&self, text: &str) -> Vec<f32> // 384-dim
// After PCA — useful for debugging
fn embed_pca(&self, text: &str, ...) -> [f64; 16] // 16-dim
// Final output — what gets encrypted for MPC
fn embed_quantized(&self, text: &str, ...) -> [i32; 16] // Q16.16
}Why f64 for PCA?
PCA is done in f64 precision even though the embedding is f32. Two reasons:
- The PCA matrix comes from Python (numpy/sklearn), which uses f64 by default
- We want maximum precision before quantization — quantization to Q16.16 is the lossy step, not the projection
PCA Parameters as Arguments
The pipeline accepts the PCA projection matrix and mean vector as parameters rather than bundling them. This is deliberate — the PCA matrix comes from running sklearn on the training set in Python. Tests use an identity projection that picks the first 16 dimensions.
In the final system, these parameters would be loaded from a file exported by the Python training script.
What the Tests Prove
Semantic Quality
The test_similar_sentences_closer_than_different test verifies that the embedding model produces meaningful representations:
let emb_a = model.embed(&tok.encode("The food was delicious")).unwrap();
let emb_b = model.embed(&tok.encode("The meal was tasty")).unwrap();
let emb_c = model.embed(&tok.encode("The car broke down on the highway")).unwrap();
assert!(cosine_similarity(&emb_a, &emb_b) > cosine_similarity(&emb_a, &emb_c));This isn't just a smoke test — it proves the model carries real semantic information that a downstream classifier can work with.
Quantization Roundtrip
The test_pipeline_full_quantized test verifies that the f32→Q16.16 conversion doesn't destroy the signal:
let raw = pipeline.embed_f32("Hello world");
let quantized = pipeline.embed_quantized("Hello world", &mean, &projection);
for i in 0..16 {
let recovered = to_f64(quantized[i]);
let error = (recovered - raw[i] as f64).abs();
assert!(error < 0.001);
}If the quantization error were too large, the MPC classifier would receive garbage input. This test proves the signal survives the conversion.
What's Still Missing
- Performance optimization: Re-optimizing tract per call is slow. A production version would cache optimized models for common sequence lengths.
The PCA matrix and classifier weights are trained and exported — the pipeline loads them from models/pca.json and models/classifier_weights.json.