Tech Cottage

From Bytes to Gradients: Tracing a Neural Network Through Tenmo, One Layer at a Time

2026-06-30T00:00:00+00:00

When you call loss.backward() in PyTorch, a C++ autograd engine climbs the computation graph in reverse, multiplying Jacobians until every leaf tensor has its gradient filled in. It works. It’s fast. But the graph lives in C++ libraries you never see — torch::autograd::Engine, THPVariable, VariableType — hundreds of thousands of lines built over a decade.

What if you could read every line of the system between loss.backward() and the weight update? That’s the premise of Tenmo, a tensor library and neural network framework written entirely in Mojo. Every autograd dispatch, every SIMD matmul kernel, and every GPU launch lives in one repository of fewer than 100 source files.

This post traces one MNIST training step — matmul → bias_add → relu → matmul → bias_add → relu → matmul → bias_add → cross_entropy — through every layer of the system. We’ll start with raw memory allocation and end with the final parameter update, showing the real code at each stage.

1. The Memory Model — Buffer

Every tensor operation eventually reads or writes a flat array of scalars. In Tenmo, that flat array is a Buffer[dtype] — a CPU-only, shape-agnostic block of memory with one optional feature: reference counting.

struct Buffer[dtype: DType = DType.float32]:
    var size: Int
    var data: Optional[UnsafePointer[Scalar[Self.dtype], MutAnyOrigin]]
    var _refcount: Optional[UnsafePointer[Atomic[DType.uint64], MutAnyOrigin]]
    var external: Bool

A Buffer has two modes. Unshared: a single allocated block of Scalar[dtype] elements with no reference counting. __init__(*, copy:) deep-copies the data — malloc + memcpy. Shared: the allocation layout is [refcount: Atomic(UInt64)] | [data array], and __init__(*, copy:) merely bumps the atomic counter. __del__ decrements; when it hits zero, the combined allocation is freed in one shot.

The shared() method transforms an unshared buffer in-place (line 122 of buffers.mojo):

def shared(mut self):
    if self.is_shared():
        return
    var refcount_size = size_of[Atomic[DType.uint64]]()
    var data_size = self.size * size_of[Scalar[Self.dtype]]()
    var total_size = refcount_size + data_size
    var new_alloc = alloc[UInt8](total_size)
    var refcount_ptr = new_alloc.bitcast[Atomic[DType.uint64]]()
    refcount_ptr[] = Atomic[DType.uint64](1)
    var new_data = (new_alloc + refcount_size).bitcast[Scalar[Self.dtype]]()
    memcpy(dest=new_data, src=self.data, count=self.size)
    self.data.unsafe_value().free()
    self.data = new_data
    self._refcount = refcount_ptr

This allocation layout matters because views share the same Buffer via refcount bump. When we slice a tensor, the new tensor’s NDBuffer points to the same underlying Buffer with a refcount of 2. The memory stays alive as long as any view holds a reference, regardless of Mojo’s aggressive destruction of intermediate tensors.

There’s also a static Buffer.shared(size) constructor that allocates the combined layout from the start, avoiding the O(n) reallocation that the instance shared() method performs. This is the fast path used by Gradbox.__init__.
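To make the two copy modes concrete, here is a minimal Python sketch. It is a toy model, not Tenmo’s code: the Buffer class, its release() method (standing in for __del__), and the one-element list used as a counter cell are all assumptions made for illustration.

```python
class Buffer:
    """Toy model of the two copy modes (hypothetical, not Tenmo's code)."""

    def __init__(self, data):
        self.data = list(data)
        self.refcount = None  # None means "unshared"

    def shared(self):
        # One-time transform: attach a counter cell that every handle shares.
        if self.refcount is None:
            self.refcount = [1]

    def copy(self):
        if self.refcount is None:
            return Buffer(self.data)  # unshared: deep copy of the storage
        handle = Buffer.__new__(Buffer)
        handle.data = self.data  # shared: alias the same storage
        handle.refcount = self.refcount
        handle.refcount[0] += 1  # bump the shared counter
        return handle

    def release(self):
        """Stand-in for __del__; True means the storage would be freed now."""
        if self.refcount is None:
            return True
        self.refcount[0] -= 1
        return self.refcount[0] == 0


a = Buffer([1.0, 2.0, 3.0])
deep = a.copy()
deep.data[0] = 99.0  # unshared copy: a is unaffected

a.shared()
view = a.copy()
view.data[1] = 42.0  # shared copy: a sees this write
```

After these calls the shared counter holds 2, and the storage is only released once both handles have let go.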

2. Shape + Strides + Views — NDBuffer

A flat Buffer doesn’t know about dimensions. That’s the job of NDBuffer[dtype] — the single source of truth for shape, strides, offset, and device location.

struct NDBuffer[dtype: DType]:
    var shape: Shape
    var strides: Strides
    var offset: Int
    var _contiguous: Bool
    var buffer: Buffer[dtype]      # CPU data
    var device_state: Optional[DeviceState]  # GPU data

The key insight: NDBuffer doesn’t own the data. It points into a Buffer at some offset, interpreting the flat memory through strides. A contiguous tensor (3, 4) with strides (4, 1) and offset 0 maps element (i, j) to buffer[i*4 + j]. A transposed view of the same tensor has strides (1, 4) and offset 0 — element (i, j) maps to buffer[i*1 + j*4].
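The stride arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: flat_index is a hypothetical helper, and the 12-element list stands in for the underlying Buffer.

```python
def flat_index(index, strides, offset=0):
    """Map a multi-dimensional index to a position in flat storage."""
    return offset + sum(i * s for i, s in zip(index, strides))


buffer = list(range(12))  # flat storage behind a (3, 4) tensor

# Contiguous (3, 4) view: strides (4, 1)
row_major = [
    [buffer[flat_index((i, j), (4, 1))] for j in range(4)] for i in range(3)
]

# Transposed (4, 3) view over the same storage: strides (1, 4)
transposed = [
    [buffer[flat_index((i, j), (1, 4))] for j in range(3)] for i in range(4)
]
```

Both views read the same twelve values; only the index arithmetic differs.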

Zero-copy slicing uses share():

def share(
    self, new_shape: Shape, new_strides: Strides, new_offset: Int
) -> NDBuffer[Self.dtype]:
    # Enables refcounting on the CPU Buffer (first call does the transform)
    self.buffer.shared()
    # Returns a new NDBuffer pointing at the same Buffer
    return NDBuffer(...)

On GPU, there’s no separate sharing step — DeviceBuffer (Mojo’s GPU built-in) is always refcounted. The device_state is simply copied by pointer.

reshape() exploits this: if the new shape’s max_index fits within the underlying buffer_size, it returns a zero-copy view with new strides and offset. Only when the view would require discontiguous access does it materialize a contiguous copy.

This is the foundation for the “reshape is free” property of the autograd graph. A ReshapeBackward handler (in reshape.mojo) does nothing but reshape the gradient tensor to the parent’s shape — no data transformation, just a new Shape and Strides object.
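A simplified version of that decision can be sketched in Python. Note the assumption: this sketch treats only fully contiguous inputs as zero-copy, which is stricter than the max_index test described above, and reshape_plan is a hypothetical name.

```python
from math import prod


def contiguous_strides(shape):
    """Row-major strides (in elements) for a given shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def reshape_plan(shape, strides, new_shape):
    """Return (new_strides, needs_copy) for a reshape request."""
    if prod(shape) != prod(new_shape):
        raise ValueError("element counts differ")
    # Simplification: only a contiguous source is reshaped as a view.
    needs_copy = tuple(strides) != contiguous_strides(shape)
    return contiguous_strides(new_shape), needs_copy


view_plan = reshape_plan((2, 6), (6, 1), (3, 4))  # contiguous source
copy_plan = reshape_plan((4, 3), (1, 4), (2, 6))  # transposed source
```

The contiguous case returns new strides with no copy; the transposed case reports that the data would have to be materialized first.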

3. Tensor — The User-Facing Type

The Tensor[dtype] struct bundles an NDBuffer with autograd metadata:

struct Tensor[dtype: DType]:
    var _id: UInt
    var buffer: NDBuffer[Self.dtype]
    var requires_grad: Bool
    var gradbox: Optional[Gradbox[Self.dtype]]
    var ancestors: Optional[Ancestors[Self.dtype]]

Two of these fields deserve a closer look.

Gradbox — this is not Tensor, and that matters. Tensor is 4543 lines of code; Gradbox is 1526. Gradbox doesn’t need reductions, trig, comparisons, or many of the 200-odd operations Tensor supports. It only needs gradient storage shapes, accumulation (add, subtract, zero), reshape, broadcast, and device transfer. That’s it. A lean container specialized for one job.

Technically, Gradbox is a combined heap allocation of [Atomic(UInt64)] | [NDBuffer]. The atomic refcount is independent of the Tensor’s refcount. When Mojo’s ASAP destruction drops an intermediate tensor, the Gradbox survives if other handles (Ancestor copies in the graph) still reference it. This prevents dangling pointers in the autograd graph.

struct Gradbox[dtype: DType]:
    var _ndb_ptr: Optional[UnsafePointer[NDBuffer, MutAnyOrigin]]
    var _refcount: Optional[UnsafePointer[Atomic[DType.uint64], MutAnyOrigin]]

In __init__(shape) (line 33 of gradbox.mojo), it allocates one block, initializes the atomic to 1, and constructs the NDBuffer via move-init. __init__(*, copy:) bumps the atomic via fetch_add[RELAXED](1). __del__ decrements via fetch_sub[RELEASE](1); because fetch_sub returns the value held before the decrement, a return value of 1 means this was the last handle, and it then destroys the NDBuffer and frees the combined allocation.

When you need to convert between the two, Gradbox.as_tensor() (gradbox.mojo:118) materializes a contiguous copy of the gradient data as a Tensor, and Tensor.as_gradbox() (tensor.mojo:135) consumes the Tensor’s NDBuffer to produce a Gradbox. This metamorphosis between types is explicit — you don’t accidentally use a gradient storage container as a full tensor.

Ancestor — The old Tenmo design stored full Tensor copies at every add_ancestry call, triggering recursive deep copies, gradbox allocations, and heap blocks. The current design uses a lightweight handle:

struct Ancestor[dtype: DType]:
    var _id: UInt
    var requires_grad: Bool
    var gradbox: Optional[Gradbox[Self.dtype]]
    var ndb: Optional[NDBuffer[Self.dtype]]
    var parents: Optional[Ancestors[Self.dtype]]

The ndb field is only populated when needs_parent_data=True — most operations don’t need it. Addition doesn’t need the parent’s buffer; it just passes the gradient through unchanged. Matmul does need the parent’s data (to compute grad × B^T), so needs_parent_data=True is set on its BackwardFnArg.
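The difference between the two cases shows up directly in the gradient formulas. The Python sketch below uses plain lists and hypothetical helpers (matmul, transpose) rather than Tenmo’s types, with an all-ones upstream gradient, as if the loss were the sum of C.

```python
def matmul(A, B):
    """Naive matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


A = [[1.0, 2.0], [3.0, 4.0]]              # shape (2, 2)
B = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]   # shape (2, 3)
grad_C = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # upstream gradient, shape (2, 3)

# Matmul backward reads the *values* of the other operand.
grad_A = matmul(grad_C, transpose(B))  # grad x B^T
grad_B = matmul(transpose(A), grad_C)  # A^T x grad

# Addition backward needs no parent data: the gradient passes through.
grad_x = grad_C
grad_y = grad_C
```

Only the matmul lines touch A or B, which is why an addition node can get by with an Ancestor that carries no ndb.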

4. Forward Pass — A Real MNIST Step

With the data structures in hand, let’s trace one batch through the MNIST model. The architecture is 784 → 128 → ReLU → 32 → ReLU → 10, built as a Sequential:

var model = Sequential[dtype]()
model.append(
    Linear[dtype](784, 128).into(),
    ReLU[dtype]().into(),
    Linear[dtype](128, 32).into(),
    ReLU[dtype]().into(),
    Linear[dtype](32, 10).into(),
)

A forward call model(x) dispatches through each layer in sequence. The heaviest operation by far is matmul — three of them per batch, each computing (batch_size, in_features) × (in_features, out_features).

Matmul — The CPU Kernel

The CPU matmul lives in matmul_cpu.mojo, struct MmCpu2d. It selects from 18 tile configurations based on the matrix dimensions (m, n, p):

var tile_m = 128 if m > 256 else (64 if m > 64 else 32)
var tile_n = 64  if n > 64  else 32
var tile_p = 256 if p > 256 else (128 if p > 64 else 64)

For the first layer (64, 784) × (784, 128), m=64, n=784, p=128. Tracing through the selection (matmul_cpu.mojo:87–89):

tile_m = 128 if m > 256 else (64 if m > 64 else 32) — m=64: 64 > 256 false → 64 > 64 false → tile_m=32
tile_n = 64 if n > 64 else 32 — n=784 > 64 → tile_n=64
tile_p = 256 if p > 256 else (128 if p > 64 else 64) — p=128: 128 > 256 false → 128 > 64 true → tile_p=128

Result: MmCpu2d[float32, 32, 64, 128] — the tile_m=32 branch of the 18-way dispatch table.
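The same selection logic, transcribed to Python, makes it easy to see what each of the three MNIST matmuls picks. The select_tiles name is an assumption; the three conditionals are copied from the snippet above, and the batch size of 64 is the one used in this post.

```python
def select_tiles(m, n, p):
    """Python transcription of the three tile conditionals."""
    tile_m = 128 if m > 256 else (64 if m > 64 else 32)
    tile_n = 64 if n > 64 else 32
    tile_p = 256 if p > 256 else (128 if p > 64 else 64)
    return tile_m, tile_n, tile_p


# (m, n, p) for the three Linear layers at batch_size=64
layers = [(64, 784, 128), (64, 128, 32), (64, 32, 10)]
choices = [select_tiles(*dims) for dims in layers]

# One representative value per branch: 3 x 2 x 3 = 18 combinations
all_configs = {
    select_tiles(m, n, p)
    for m in (1, 65, 257)
    for n in (1, 65)
    for p in (1, 65, 257)
}
```

Here choices comes out as (32, 64, 128), (32, 64, 64) and (32, 32, 64), and all_configs contains the 18 distinct tile triples.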

Note the tile_p=128 choice. For p=128 the p > 256 test fails and the p > 64 test selects 128 rather than 256; the motivation is L1 cache capacity, not SIMD utilization. TILE_P controls the outer j_tile stride, that is, how many columns of B are loaded per k_tile pass and reused across all rows in the tile. With TILE_N=64 and TILE_P=256, the B j-tile is 64 × 256 × 4 bytes = 64 KB, which overflows a 32 KB L1 data cache. With TILE_P=128, it’s 64 × 128 × 4 bytes = 32 KB, which just fits. The inner SIMD unrolled loop (32 columns per iteration) is equally efficient in either case: j_end = min(j_tile + TILE_P, p) caps it at the actual 128 columns regardless of TILE_P, so 4 iterations of 32 columns fully cover the output with no tail.
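The cache arithmetic is simple enough to verify in a few lines of Python. The 32 KB L1 size is the figure assumed in the text, and b_tile_bytes is a hypothetical helper.

```python
L1_BYTES = 32 * 1024  # L1 data cache size assumed in the text


def b_tile_bytes(tile_n, tile_p, elem_bytes=4):
    """Footprint of one B tile: tile_n rows x tile_p columns of float32."""
    return tile_n * tile_p * elem_bytes


wide = b_tile_bytes(64, 256)    # TILE_P=256
narrow = b_tile_bytes(64, 128)  # TILE_P=128

# Inner loop coverage for p=128 with 32 columns per iteration
full_iterations, tail_columns = divmod(128, 32)
```

The wide tile is twice the cache size, the narrow tile equals it, and the 128 output columns split into exactly 4 unrolled iterations with no tail.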

Inside the selected tile configuration, the hot loop processes columns in groups of simd_unroll = simdwidth × UNROLL (for float32 with AVX2: 8 × 4 = 32 columns per iteration):

# Unrolled SIMD: 4 independent accumulators fill the FMA pipeline
var acc0: SIMD[Self.dtype, simdwidth]
var acc1: SIMD[Self.dtype, simdwidth]
var acc2: SIMD[Self.dtype, simdwidth]
var acc3: SIMD[Self.dtype, simdwidth]

if k_tile == 0:
    acc0 = SIMD[Self.dtype, simdwidth](0)  # C is zeroed, skip load
else:
    acc0 = C_data.load[width=simdwidth](cj)

for k in range(k_tile, k_end):
    var a_ik = SIMD[Self.dtype, simdwidth](A_data[a_row_base + k])
    var b_base = k * B_stride0 + B_offset + j
    acc0 = math.fma(a_ik, B_data.load[width=simdwidth](b_base), acc0)
    acc1 = math.fma(a_ik, B_data.load[width=simdwidth](b_base + simdwidth), acc1)
    acc2 = math.fma(a_ik, B_data.load[width=simdwidth](b_base + simdwidth * 2), acc2)
    acc3 = math.fma(a_ik, B_data.load[width=simdwidth](b_base + simdwidth * 3), acc3)

Each iteration: one broadcast of a_ik (scalar→SIMD), four SIMD loads from B, four FMA instructions. For float32 with simdwidth=8, that is 4 × 8 = 32 scalar multiply-adds per inner iteration. The k_tile == 0 optimization skips loading C (it starts zeroed), saving 4 vector reads on the first tile pass.
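To see that four independent accumulators compute the same result as a plain dot product, here is a Python emulation of the loop. It is only a sketch: each "SIMD register" is modeled as an 8-element list, row_block is a hypothetical name, and there is no tiling or parallelism.

```python
SIMD_WIDTH, UNROLL = 8, 4  # 8 lanes x 4 accumulators = 32 columns


def row_block(a_row, B, j):
    """Compute C[i, j:j+32] for one row of A using 4 lane-vector accumulators."""
    accs = [[0.0] * SIMD_WIDTH for _ in range(UNROLL)]
    for k, a_ik in enumerate(a_row):  # a_ik is broadcast across all lanes
        for u in range(UNROLL):
            base = j + u * SIMD_WIDTH
            for lane in range(SIMD_WIDTH):
                accs[u][lane] += a_ik * B[k][base + lane]  # one FMA lane
    return [value for acc in accs for value in acc]


a_row = [1.0, 2.0, 3.0]
B = [[float(k * 100 + col) for col in range(32)] for k in range(3)]

block = row_block(a_row, B, 0)
reference = [sum(a_row[k] * B[k][col] for k in range(3)) for col in range(32)]
```

The accumulators never depend on each other inside the k loop, which is what lets real hardware keep several FMAs in flight at once.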

Rows are parallelized across physical cores using parallelize from Mojo’s standard library — each core processes a contiguous block of TILE_M rows with its own cache-hot k-strip and j-tile.

Bias Add — Broadcast Arithmetic

After matmul, bias addition broadcasts a (128,) vector across the batch dimension. This dispatches through CpuArithmeticOps.broadcast (cpu_arithmetics.mojo) which selects Tier 2: one operand has unit stride in the last dimension, the other broadcasts (stride 0).

# Tier 2: SIMD splat from broadcasting side
var scalar_vec = SIMD[Self.dtype, simd_width](scalar_v)
while j + simd_width <= last_dim:
    var vec = b.buffer.load[simdwidth=simd_width](b_off + j)
    var op_result = simd_op[op_code, Self.dtype, simd_width](vec, scalar_vec)
    buffer.store[simdwidth=simd_width](out_base + j, op_result)
    j += simd_width

A single scalar is splatted into a SIMD register, then the contiguous side is SIMD-loaded and vector-added. This is the same mechanism used by every broadcasting op in the system — bias add, layer norm, cross-entropy sub-ops.
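Broadcasting via a zero stride can be illustrated with explicit index arithmetic. This Python sketch is an assumption-laden model, not the Tier 2 kernel: bias_add is a hypothetical function, and it shows the bias being reused for every row by giving it stride 0 along the batch dimension.

```python
def bias_add(x, rows, cols, bias):
    """Add a (cols,) bias to a flat row-major (rows, cols) array."""
    x_strides = (cols, 1)
    bias_strides = (0, 1)  # stride 0: every row reads the same bias values
    out = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            x_off = i * x_strides[0] + j * x_strides[1]
            b_off = i * bias_strides[0] + j * bias_strides[1]
            out[x_off] = x[x_off] + bias[b_off]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # shape (2, 3)
out = bias_add(x, 2, 3, [10.0, 20.0, 30.0])
```

No expanded copy of the bias is ever built; the stride does the repetition.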

Cross-Entropy — Fused GPU Kernel

The final layer produces logits (64, 10). CrossEntropyLoss dispatches through CrossEntropyFusedKernel on GPU (at tenmo/kernels/crossentropy_fused_kernel.mojo). This fused kernel computes max-reduce, exp, sum-exp, softmax, and NLL in a single GPU launch:

Thread-block-per-row pattern (M = 64 blocks)
Shared-memory tree reduction for max and sum_exp
Register-level log_softmax computation
Single scalar write per block for the loss value

Without this fusion, cross_entropy would trigger ~18 separate kernel launches plus a CPU onehot fallback. The fused kernel reduces it to 1 launch + 4 backward arithmetic ops.

On CPU, cross-entropy uses an analogous fused path that walks rows with SIMD vectorization, computing the max, exp, sum, log, and NLL in a single row loop.
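The per-row computation can be sketched in Python as follows. This is scalar code with no SIMD, and averaging over the batch is an assumption; the source does not say which reduction Tenmo’s CrossEntropyLoss uses.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood, one pass over each row."""
    total = 0.0
    for row, label in zip(logits, labels):
        row_max = max(row)  # subtract the max for numerical stability
        sum_exp = sum(math.exp(v - row_max) for v in row)
        log_softmax_at_label = row[label] - row_max - math.log(sum_exp)
        total -= log_softmax_at_label
    return total / len(labels)


uniform_loss = cross_entropy([[0.0] * 10], [3])       # 10 equally likely classes
confident_loss = cross_entropy([[1000.0, 0.0]], [0])  # no overflow thanks to the max shift
```

With uniform logits over 10 classes the loss is log(10), and the large-logit case stays finite because the exponent is never positive.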

5. The Backward Graph

Every forward operation that needs gradient tracking registers a BackwardFnArg and parent Ancestor handles on the output tensor. Let’s see what happens when we call loss.backward().

What `add_ancestry` Stores

When Multiplicator.forward() registers c = a * b, it creates:

var backwardFnArg = BackwardFnArg[Self.dtype].null_arg(BACKWARD_MULTIPLY)
backwardFnArg.needs_parent_data = True  # backward needs parent buffer
out.add_ancestry(backwardFnArg^, self, other)

The BackwardFnArg is the dispatch key — a type-erased container packing the integer op_code together with a destructor function and copier function for whatever payload it carries. The 58 operation codes are defined as comptime constants in backpropagation.mojo (e.g. BACKWARD_ADD = 0, BACKWARD_MATMUL_2D = 4, BACKWARD_SIGMOID = 7).

add_ancestry() (tensor.mojo:1080) converts each parent Tensor into an Ancestor handle. When needs_parent_data=True, it copies the parent’s NDBuffer and calls buffer.share() to enable refcounting. When False (most ops), it creates the ancestor with no ndb — just the _id, requires_grad flag, and gradbox pointer.

The Backward Pass — Phase by Phase

The backward() method at tensor.mojo:3160 proceeds in three phases:

Phase 1: Seed gradient. output.seed_grad(1.0) allocates the output’s gradbox (if needed) and fills it with 1.0. On GPU, sync=True fences all pending GPU work before the seed — ensuring forward kernel outputs are visible before backward reads them.

Phase 2: DFS graph collection. Starting from the output’s Ancestor, the code walks parent references recursively, building three parallel structures:

var node_list = List[Ancestor[Self.dtype]]()
var fanin = Dict[UInt, Int]()
var id_to_index = Dict[UInt, Int]()

# DFS: push root, pop, visit parents
var root = output.to_ancestor()
root.ndb = output.buffer.copy()  # root always gets data
dfs_stack.append(root._id)
while len(dfs_stack) > 0:
    var node_id = dfs_stack.pop()
    if node_id in visited:
        continue
    visited.add(node_id)
    topo_ids.append(node_id)
    if node.has_ancestry():
        for parent in node.ancestry():
            var parent_id = parent._id
            fanin[parent_id] = fanin.get(parent_id, 0) + 1
            if parent_id not in id_to_index:
                node_list.append(parent.copy())
                id_to_index[parent_id] = new_idx
                dfs_stack.append(parent_id)

fanin counts how many children depend on each node. The root has fanin 0. A matmul node may have fanin 0 (no one depends on its gradient) or 1 (a ReLU sits on top).

Phase 3: Reverse topological execution. A ready_queue starts with the root. For each popped node:

Backward.invoke(node, parent_ids) dispatches via a 58-way jump table on op_code to the appropriate backward handler
The handler reads output.gradients(), computes parent gradient contributions, calls parent.update_grad(grad, op_code, extra_arg) to accumulate into each parent’s gradbox
For each parent that received gradient, its _id is appended to parent_ids
Each parent’s fanin is decremented; when it hits 0 and the parent has ancestry, it’s enqueued
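Phases 2 and 3 together amount to a fan-in counted traversal. The Python sketch below uses a toy graph of string node names and a hypothetical backward_order function; it records the order in which backward handlers would run rather than computing any gradients.

```python
from collections import deque


def backward_order(root, parents):
    """parents maps each node to the list of nodes it was computed from."""
    # Phase 2: DFS from the root, counting how many children feed each node.
    fanin = {root: 0}
    stack, visited = [root], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        for parent in parents.get(node, []):
            fanin[parent] = fanin.get(parent, 0) + 1
            stack.append(parent)

    # Phase 3: run a node only after all of its children have contributed.
    ready, order = deque([root]), []
    while ready:
        node = ready.popleft()
        order.append(node)
        for parent in parents.get(node, []):
            fanin[parent] -= 1
            if fanin[parent] == 0 and parents.get(parent):
                ready.append(parent)  # leaves have no backward to run
    return order


# Diamond: h feeds both a and b, which both feed the loss.
graph = {"loss": ["a", "b"], "a": ["h"], "b": ["h"], "h": ["w"]}
order = backward_order("loss", graph)
```

Node h runs only after both a and b have accumulated into its gradient, and the leaf w is never enqueued.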

Example: Multiply Broadcast Backward

When c = a * b with broadcasting (e.g. a is (3, 1) and b is (1, 4)), the backward handler at multiplication.mojo:85 is aliased to BroadcastBackward. This handler:

Extracts the upstream gradient ∂loss/∂c from the output’s gradbox
Broadcasts/unbroadcasts it to each parent’s shape
If the op is multiplication, scales by the other parent’s values and sum-reduces over any broadcast axes: ∂loss/∂a = Σ(∂loss/∂c * b), reduced to a’s shape
Calls ancestor.update_grad(grad_contrib, AddTensor, None) for each parent
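For the (3, 1) × (1, 4) example, the gradients can be worked out by hand in Python. This is a sketch with plain lists and an all-ones upstream gradient; it shows the math the handler has to implement, not the handler itself.

```python
a = [[1.0], [2.0], [3.0]]              # shape (3, 1)
b = [[10.0, 20.0, 30.0, 40.0]]         # shape (1, 4)
grad_c = [[1.0] * 4 for _ in range(3)]  # upstream gradient, shape (3, 4)

# d(loss)/da: scale by b, then sum over the axis a was broadcast along.
grad_a = [[sum(grad_c[i][j] * b[0][j] for j in range(4))] for i in range(3)]

# d(loss)/db: scale by a, then sum over the axis b was broadcast along.
grad_b = [[sum(grad_c[i][j] * a[i][0] for i in range(3)) for j in range(4)]]
```

Each parent gets a gradient of its own shape: (3, 1) for a and (1, 4) for b.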

The update_grad method at ancestry.mojo:72 dispatches on the op_code parameter:

AddTensor: gradbox += incoming (in-place addition)
ScatterAddTensor: Filler.scatter_add() for sparse gradient accumulation (used by Gather backward)
ZeroGrad: gradbox.zero_grad()

The “Aha” Moment — Reshape Backward

ReshapeBackward (reshape.mojo:13) is the simplest backward in the system:

def backward(output, mut parent_ids, retain_graph=False):
    ref gradbox = output.gradients()
    var ancestor = output.ancestry().get(0)
    if ancestor.requires_grad:
        var reshaped = gradbox.reshape(ancestor.shape())
        ancestor.update_grad(reshaped^, AddTensor, None)

It just reshapes the gradient tensor to the parent’s shape. No data transformation — a new Shape and Strides object, same Buffer, same values. If your forward was (2,6) → reshape(3,4), backward is just gradient(3,4) → reshape(2,6). The gradient values pass through unchanged.

This contradicts the naive intuition that “reshape is a math op that rearranges data”. It’s a metadata op. The backward proves it.

6. The Optimizer — SGD Step

After backward fills every gradient, SGD.step() updates the parameters. The optimizer struct at optim.mojo:10 holds pointers to parameters, velocity buffers (for momentum), and hyperparameters.

struct SGD[dtype: DType, //]:
    var parameters: List[UnsafePointer[Tensor[Self.dtype], MutAnyOrigin]]
    var lr: Scalar[Self.dtype]
    var momentum: Scalar[Self.dtype]
    var weight_decay: Scalar[Self.dtype]
    var velocities: List[Gradbox[Self.dtype]]

The step() method iterates each parameter, checks requires_grad && has_grad(), and runs the update. On CPU, it’s SIMD-vectorized:

def _step_no_momentum[simd_w: Int](self, param_ptr, grad_ptr, num_elements):
    var lr_vec = SIMD[Self.dtype, simd_w](self.lr)
    var wd_vec = SIMD[Self.dtype, simd_w](self.weight_decay)
    for j in range(0, vec_end, simd_w):
        var p_vec = param_ptr.load[width=simd_w](j)
        var g_vec = grad_ptr.load[width=simd_w](j)
        if self.weight_decay > 0:
            g_vec += p_vec * wd_vec
        p_vec -= lr_vec * g_vec
        param_ptr.store[width=simd_w](j, p_vec)

On GPU, the update launches an in-place kernel (sgd_kernel.mojo) without any CPU round-trip. The kernel reads param and grad from GPU memory, applies the update, and writes back — all on-device:

def sgd_step_no_momentum_kernel[dtype: DType](
    param: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    grad: UnsafePointer[Scalar[dtype], ImmutAnyOrigin],
    num_elements: Int, lr: Scalar[dtype], weight_decay: Scalar[dtype],
):
    var gtid = Int(thread_idx.x) + Int(block_idx.x) * Int(block_dim.x)
    var stride = Int(block_dim.x) * Int(grid_dim.x)
    var i = gtid
    while i < num_elements:
        var p = param[i]
        var g = grad[i]
        if weight_decay > 0:
            g += p * weight_decay
        param[i] = p - lr * g
        i += stride

Each thread handles strided elements across the parameter array — a classic GPU element-wise pattern. The momentum variant adds a velocity buffer read/write and the momentum term v = momentum * v + g.
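The update rule, including the momentum variant, fits in a short scalar Python function. One assumption to flag: the text gives v = momentum * v + g but not the final parameter update, so this sketch applies p -= lr * v, the convention PyTorch’s SGD uses. The sgd_step name is hypothetical.

```python
def sgd_step(params, grads, velocity, lr, momentum=0.0, weight_decay=0.0):
    """In-place SGD update on flat lists of scalars."""
    for i in range(len(params)):
        g = grads[i]
        if weight_decay > 0:
            g += params[i] * weight_decay  # L2 penalty folded into the gradient
        if momentum > 0:
            velocity[i] = momentum * velocity[i] + g
            g = velocity[i]  # assumed: step along the velocity
        params[i] -= lr * g


params, velocity = [1.0, -2.0], [0.0, 0.0]
sgd_step(params, [0.5, 0.5], velocity, lr=0.1, momentum=0.9)
sgd_step(params, [0.5, 0.5], velocity, lr=0.1, momentum=0.9)

decayed = [1.0]
sgd_step(decayed, [0.0], [0.0], lr=1.0, weight_decay=0.1)
```

After two identical gradients the velocity has grown from 0.5 to 0.95, so the second step is larger than the first.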

The optimizer supports sparse row-wise updates for embedding layers: when indices are provided, only specific rows of 2D parameters are updated. This was critical for word2vec-style training where only ~10 rows out of 252K receive gradient each step — a 25000× reduction in write traffic.
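A row-wise sparse update can be sketched in Python like this. The sparse_sgd_step function and its argument layout are assumptions for illustration; Tenmo’s actual signature is not shown in this post.

```python
def sparse_sgd_step(weight, indices, row_grads, lr):
    """Update only the listed rows of a 2D parameter, in place."""
    for row, grad_row in zip(indices, row_grads):
        for col, g in enumerate(grad_row):
            weight[row][col] -= lr * g


weight = [[1.0, 1.0] for _ in range(5)]
sparse_sgd_step(weight, [1, 3], [[0.5, 0.5], [1.0, 0.0]], lr=0.1)

# Write-traffic ratio quoted in the text: 252K rows vs about 10 touched
reduction = 252_000 / 10
```

Rows 0, 2 and 4 are never read or written, which is where the saving comes from.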

7. GPU Transfer

Tensor transfer between CPU and GPU goes through DeviceState at device.mojo:229:

CPU → GPU: DeviceState.fill(ndb) copies data from the CPU NDBuffer’s logical view to a GPU device buffer. If the source is contiguous, it’s a direct memcpy to a mapped device buffer. If strided, it iterates via index_iterator() and writes each element.

GPU → CPU: DeviceState.into(shape) calls map_to_host() to bring the GPU buffer to host-accessible memory, then memcpy back to a CPU Buffer.

DType.bool is stored as uint8 internally — a limitation of Mojo’s DeviceBuffer which doesn’t support DType.bool. The datatype comptime field on DeviceState handles the cast transparently.

The stop_grad parameter controls whether a device transfer registers a backward node. With stop_grad=False (default), the transfer creates a DeviceTransferBackward node, so gradients tunnel transparently across device boundaries. With stop_grad=True, no backward node is registered — the destination becomes a new leaf on the target device.

The recommended training pattern transfers model weights to GPU once:

model = model.to_gpu(stop_grad=True)    # weights become GPU leaves
# ... entire training loop on GPU ...
model = model.to_cpu(stop_grad=True)    # persist back to CPU

8. Putting It All Together

The unified MNIST example at examples/mnist_unified.mojo (151 lines) ties everything together:

def train_mnist() raises:
    comptime dtype = DType.float32
    # ... data loading via numpy interop ...

    var model = Sequential[dtype]()
    model.append(
        Linear[dtype](784, 128).into(),
        ReLU[dtype]().into(),
        Linear[dtype](128, 32).into(),
        ReLU[dtype]().into(),
        Linear[dtype](32, 10).into(),
    )
    comptime if has_accelerator():
        model = model.to_gpu(stop_grad=True)

    var opt = SGD(model.parameters(), lr=0.01, momentum=0.9)
    var loss_fn = CrossEntropyLoss[dtype]()

    for epoch in range(epochs):
        train_loader.reset()
        while train_loader.__has_next__():
            ref batch = train_loader.__next__()
            var x = batch.features
            var y = batch.labels
            comptime if has_accelerator():
                x = x.to_gpu(sync=False)
                y = y.to_gpu(sync=False)
            var pred = model(x)
            var loss = loss_fn(pred, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

The loop is under 80 lines. Everything we traced — Buffer allocation, NDBuffer strides, Gradbox refcounting, SIMD matmul, broadcast arithmetic, fused CE kernel, autograd graph traversal, SGD vectorized update — collapses into this tight loop.

The comptime if has_accelerator() pattern is key: on a CPU-only system, the GPU branch compiles away entirely. No runtime dispatch, no dead code. The same source file runs on both platforms.

What the Benchmarks Say

Training the same 4-layer MLP on identical hardware (15 epochs, batch_size=64, all runs sequential):

Platform	Device	Avg Epoch Time	Total Time	Final Val Acc
Tenmo	CPU (Mojo)	5.5s	82.3s	98.14%
Tenmo	GPU (Mojo)	6.0s	90.1s	98.00%
PyTorch	GPU (CUDA)	14.5s	217.2s	98.18%
PyTorch	CPU	15.4s	231.5s	98.12%

Tenmo on CPU is 2.8× faster than PyTorch on CPU, and Tenmo on GPU is 2.4× faster than PyTorch on GPU. The CPU result is the headline: pure Mojo SIMD on a 104K-parameter model saturates the machine¹ before GPU launch overhead pays off. On a model this small, each GPU kernel launch has too few elements to amortize its dispatch cost: the MNIST MLP runs 13 kernels per forward/backward step, each with 64 rows or fewer, and the cumulative launch latency exceeds the compute time. We include the GPU number because it’s an honest measurement: Tenmo’s GPU path is correct and matches PyTorch GPU behavior, but small models don’t benefit. The fusion work described in the Cross-Entropy section is the strategy intended to close this gap.
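The ratios and the parameter count follow directly from the table and the architecture. A short Python check, using only the numbers reported above:

```python
# Total training times in seconds, copied from the table
total_seconds = {
    "tenmo_cpu": 82.3,
    "tenmo_gpu": 90.1,
    "torch_gpu": 217.2,
    "torch_cpu": 231.5,
}

cpu_speedup = total_seconds["torch_cpu"] / total_seconds["tenmo_cpu"]
gpu_speedup = total_seconds["torch_gpu"] / total_seconds["tenmo_gpu"]

# Weights plus biases for 784 -> 128 -> 32 -> 10
layer_sizes = [(784, 128), (128, 32), (32, 10)]
param_count = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)
weight_bytes = param_count * 4  # float32
```

The speedups round to 2.8 and 2.4, and the model has 104,938 parameters, or about 420 KB of float32 weights.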

Each design choice has a measurable payoff:

Choice	Payoff
Ref-counted Buffer sharing	Reshape is free — no alloc, no copy
SIMD-tiled matmul + FMA + UNROLL=4	32 FMAs per iteration, saturates the CPU
Lightweight Ancestor handles	No Tensor copy in the graph — just `_id` + gradbox
Fused CE GPU kernel	1 launch instead of 18
In-place GPU SGD step	No CPU round-trip for parameter updates
Gradbox independent refcount	Survives Mojo’s ASAP destruction — gradients persist
Comptime graph elimination	Zero backward overhead in eval mode

These aren’t abstract architectural claims. Every line of code is in the repository.

Common Pitfalls

Gradbox lifespan confusion. Gradboxes have their own refcount. If you save tensor.grad() to a variable, it returns a deep copy via Gradbox.detach(): a fresh allocation with independent data. Subsequent zero_grad() calls clear the internal gradbox but leave your copy untouched. The detached copy is safe to use, but it’s not linked to the parameter anymore.

stop_grad=True breaks graph flow. If you transfer weights to GPU with stop_grad=True, the model’s parameters become GPU leaves. Input tensors transferred with stop_grad=False (default) can still carry gradients from the loss back to their CPU origin, but the weights’ gradients accumulate on the GPU parameters. This is usually what you want, but it means model.to_cpu(stop_grad=True) creates new CPU leaves — the GPU weight values are copied, but the CPU copy won’t receive future gradients.

Try It Yourself

The complete source is on GitHub at ratulb/tenmo. To train the MNIST model from this post without building from source:

docker run -it ratulb/tenmo:latest /app/bin/mnist

This runs the MNIST CPU example from examples/mnist.mojo (the same 784→128→ReLU→32→ReLU→10 architecture traced above), compiled into a static binary inside the container. A corresponding PyTorch script is available for comparison.

¹ The CPU’s SIMD vector units sustain peak arithmetic throughput, with no stalls from cache misses or memory bandwidth, because the entire 104K-parameter model (~1 MB) fits in L3 cache, so every cycle does useful FMA. On GPU, the same model dispatches 13 kernels per step with at most 64 rows each; kernel launch latency (~10–50 μs per launch) exceeds the GPU’s compute time, leaving the hardware underutilized. For larger models (millions of parameters), the GPU’s massive parallelism eventually dominates.

From Raw Text to Word Vectors: Building a Tokenizer and Word Embeddings with Tenmo

2026-06-30T00:00:00+00:00

“king − man + woman ≈ queen.”

This single equation — the notion that arithmetic on word vectors reveals semantic relationships — is what made word embeddings famous. It suggests that somewhere inside a high-dimensional vector space, directions like “royalty” and “gender” actually exist as learned features. A computer trained only on raw text, with no dictionary or grammar, can learn that king and queen differ by the same vector as man and woman.

How does that work? And more importantly, how do we build it from scratch?

In this post, we’ll implement the full pipeline using Tenmo — a tensor library and neural network framework built in Mojo with full autograd, SIMD-optimized kernels, and GPU support. We’ll build a tokenizer that converts raw movie reviews into integer IDs, a CBOW training loop with negative sampling, and a similarity probe that lets us query the learned embedding space. The entire implementation lives in a single file — around 750 lines with the model encapsulated in a compact Word2Vec struct — and trains on the IMDB review dataset.

The Problem: Computers Don’t Read

A computer sees strings. "king", "queen", "man", "woman" are just sequences of bytes. Nothing in their byte representation suggests that king and queen are related, or that man and woman share a semantic axis.

To make words computable, we need vector representations — each word mapped to a list of floating-point numbers where distance in vector space corresponds to semantic similarity.

But what kind of vector?

One-Hot Encoding

The simplest approach: assign each word a unique V-dimensional vector with a single 1 and V−1 zeros.

# Pseudo-code for one-hot encoding
var V = 100_000  # vocabulary size
var id = word_to_idx["king"]   # say, 42
var one_hot = Tensor[dtype].zeros(V)
one_hot[id] = 1

The problems are immediate:

Semantically blind. The dot product between any two distinct one-hot vectors is always 0: they’re orthogonal by construction. King and queen are as unrelated as king and aardvark.
High-dimensional, sparse. A 100K-dimensional vector with a single non-zero element wastes memory and fails in any ML model that expects dense features.
No generalization. The model can’t leverage the fact that king and queen behave similarly in text — they’re treated as completely independent symbols.
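The orthogonality problem is easy to demonstrate. A small Python sketch with a three-word toy vocabulary (the one_hot and dot helpers are hypothetical):

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


vocab = ["aardvark", "king", "queen"]
king = one_hot(vocab.index("king"), len(vocab))
queen = one_hot(vocab.index("queen"), len(vocab))
aardvark = one_hot(vocab.index("aardvark"), len(vocab))

king_queen = dot(king, queen)
king_aardvark = dot(king, aardvark)
king_king = dot(king, king)
```

Both cross products are 0, so by this measure king is exactly as close to queen as it is to aardvark.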

Bag-of-Words and TF-IDF

The next refinement: count how often each word appears in a document. A vector of term frequencies is denser than one-hot, but it’s still V-dimensional and ignores word order. TF-IDF improves on raw counts by down-weighting common words (the, a, in), but the representation remains sparse, high-dimensional, and incapable of capturing synonymy.
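As a rough illustration, here is TF-IDF on three tiny documents in Python. The formula used, term frequency times log(N / document frequency) with no smoothing, is one common variant chosen for this sketch; libraries differ in the details.

```python
import math
from collections import Counter

docs = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "the acting was great".split(),
]


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return weights


weights = tf_idf(docs)
```

Words that appear in every document, such as "the", get weight 0, while the rare word "terrible" scores highest in its document. The vectors are still indexed by vocabulary position, so great and terrible remain unrelated dimensions.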

Co-Occurrence Matrices (GloVe)

GloVe builds a word-word co-occurrence matrix: count how often word i appears near word j across the entire corpus, then factorize that matrix to produce dense vectors. The intuition is simple — words that occur in similar contexts have similar vectors — but the co-occurrence matrix is O(V²), making it impractical for large vocabularies without heavy approximation.
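The counting step can be sketched in a few lines of Python. This covers only the raw symmetric window counts; GloVe’s distance weighting and the factorization itself are left out, and cooccurrence is a hypothetical helper.

```python
from collections import Counter


def cooccurrence(tokens, window=2):
    """Count (word, neighbor) pairs within a symmetric window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the king rules the land".split()
counts = cooccurrence(tokens, window=1)

vocab_size = len(set(tokens))
dense_cells = vocab_size ** 2  # size of the full V x V matrix
```

Even this five-word sentence with four distinct words implies a 16-cell matrix; at V = 100,000 the dense matrix would have 10 billion cells.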

Prediction-Based Embeddings (word2vec)

word2vec flips the problem around. Instead of counting co-occurrences, we train a neural network to predict whether a word appears in a given context. The vectors emerge as a byproduct — the hidden layer weights of this prediction network become the word embeddings.
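For a single training example, the CBOW objective with negative sampling can be written out in Python as below. This is a sketch of the standard formulation, not the Word2Vec struct from this post: vectors are plain lists, and cbow_neg_loss is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cbow_neg_loss(context_vecs, target_vec, negative_vecs):
    """Loss for one (context, target) pair with sampled negatives."""
    dim = len(target_vec)
    # Hidden state: the mean of the context word vectors.
    hidden = [sum(v[d] for v in context_vecs) / len(context_vecs) for d in range(dim)]
    # Pull the true target toward the context...
    loss = -math.log(sigmoid(dot(hidden, target_vec)))
    # ...and push each sampled negative away from it.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(hidden, neg)))
    return loss


untrained = cbow_neg_loss([[0.0, 0.0]], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
aligned = cbow_neg_loss([[1.0, 0.0]], [5.0, 0.0], [[-5.0, 0.0]])
misaligned = cbow_neg_loss([[1.0, 0.0]], [-5.0, 0.0], [[5.0, 0.0]])
```

With all-zero vectors every term is log 2; the loss falls when the target points along the context direction and the negatives point away from it.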

This is what we’ll implement. But before we can train embeddings, we need to turn raw text into numbers. That means building a tokenizer.

Stage 1: Building a Tokenizer from Scratch

A tokenizer converts text into integer IDs. It’s the gateway between raw strings and any NLP model. Our tokenizer needs to:

Clean raw text — strip HTML, URLs, punctuation artifacts, and digit sequences.
Build a vocabulary — collect every unique word from the training corpus, sort it, and assign each word a unique integer.
Encode new text into those IDs, with a fallback for words not seen during training.

Cleaning Text

The IMDB dataset contains movie reviews with HTML tags (such as <br /> line breaks), URLs, ratings, and other noise. We clean it in a single pass using Python’s re module, which Mojo’s Python interop handles cleanly:

@staticmethod
def clean_text(raw_text: String) raises -> PythonObject:
    var py = Python.import_module("builtins")
    var regex = Python.import_module("re")
    var text = py.str(raw_text)

    # Remove HTML tags
    text = regex.sub(r"<[^>]+>", " ", text)
    # Remove URLs
    text = regex.sub(r"http\S+|www\.\S+", " ", text)
    # Remove digit sequences
    text = regex.sub(r"\d+", " ", text)
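    # Remove stray apostrophes (preserve contractions like "don't")
    # (pattern and lambda reconstructed; the source lines were garbled)
    text = regex.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    var filter_fn = Python.evaluate(
        "lambda text: [w for w in text.lower().split() if len(w) >= 2]"
    )
    return filter_fn(text)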

Every step handles a real data problem:

HTML tags appear throughout IMDB reviews (especially for line breaks).
URLs appear in user-written reviews (“I saw this at http://example.com”).
Ratings like “10/10” would leak numeric patterns unrelated to sentiment.
Leading/trailing apostrophes ('hello') are punctuation, but contractions (don't) are real words.
Single-character tokens like “a” and “I” are filtered because they add noise without semantic signal.

The use of Python.evaluate to define a lambda is worth noting. Mojo’s Python interop means we can write Python logic inline without leaving the language — perfect for text processing where Mojo’s standard library doesn’t yet have a regex engine.
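Because the cleaning logic is Python under the hood, it can be prototyped directly in Python before wiring it through the interop layer. The sketch below follows the steps listed above; the apostrophe pattern and the final tokenizing regex are assumptions, since the post does not show them in full.

```python
import re


def clean_text(raw_text):
    """Return a list of lowercase tokens with at least two characters."""
    text = re.sub(r"<[^>]+>", " ", raw_text)           # HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)      # URLs
    text = re.sub(r"\d+", " ", text)                   # digit sequences
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)      # stray apostrophes (assumed pattern)
    words = re.findall(r"[a-z']+", text.lower())       # assumed tokenizer
    return [w for w in words if len(w) >= 2]


sample = "I don't like it<br />See http://example.com 10/10 'great'"
tokens = clean_text(sample)
```

The contraction survives, the quoted word loses its quotes, and the tag, URL, rating, and single-letter "I" are all gone.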

Building the Vocabulary

Once we’ve cleaned every review, we collect the unique words across the entire dataset:

@staticmethod
def from_text_lines(text_lines: List[String]) raises -> Self:
    var py = Python.import_module("builtins")
    var all_words: PythonObject = []

    # Collect all words from all text lines
    for line in text_lines:
        all_words.extend(Tokenizer.clean_text(line))

    # Create unique, sorted vocabulary
    all_words = py.list(py.set(all_words))
    all_words = py.sorted(all_words)

    # Add UNKNOWN token for out-of-vocabulary words
    var vocab_with_unknown: PythonObject = [UNKNOWN_TOKEN]
    vocab_with_unknown.extend(all_words)

    # Map each word to a unique integer ID
    var vocabulary = {
        String(token): Int(index)
        for index, token in enumerate(vocab_with_unknown.__iter__())
    }

    return Self(vocabulary^)

Key design decisions:

UNKNOWN token at position 0. Any word seen at test time but not in training gets mapped to ID 0. This is a standard practice — it acts as a catch-all, preventing the model from crashing on novel words.
Alphabetical sort. Sorting the vocabulary before assigning IDs ensures deterministic behavior across runs. The word with ID 1 is always "aaron", not a random word depending on Python’s set iteration order.
Dict[String, Int] for lookup, Dict[Int, String] for decoding. The tokenizer stores both mappings so we can go from text → IDs and back.
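
The three decisions above fit in a few lines of plain Python. This is only a sketch: build_vocabulary is a hypothetical name, and "<unk>" is a placeholder because the post does not show the real value of UNKNOWN_TOKEN.

```python
UNKNOWN_TOKEN = "<unk>"  # hypothetical placeholder value


def build_vocabulary(token_lists):
    """Return (word_to_id, id_to_word) with UNKNOWN_TOKEN fixed at ID 0."""
    # Deduplicate, then sort so IDs do not depend on set iteration order
    unique_words = sorted({word for tokens in token_lists for word in tokens})
    ordered = [UNKNOWN_TOKEN] + unique_words
    word_to_id = {word: index for index, word in enumerate(ordered)}
    id_to_word = {index: word for word, index in word_to_id.items()}
    return word_to_id, id_to_word


word_to_id, id_to_word = build_vocabulary([["great", "film"], ["bad", "film"]])
print(word_to_id)
```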

Encoding and Decoding

With the vocabulary built, encoding new text is straightforward:

def encode(self, text: String) raises -> List[Int]:
    var words = Tokenizer.clean_text(text)
    var token_ids = List[Int](capacity=len(words))
    for word in words:
        var word_str = String(word)
        token_ids.append(
            self.word_to_id[word_str] if word_str in self.word_to_id
            else self.word_to_id[UNKNOWN_TOKEN]
        )
    return token_ids^

def decode(self, token_ids: List[Int]) raises -> String:
    return " ".join([self.id_to_word[id] for id in token_ids])

The encode step reuses the cleaning step: the same clean_text function that prepared training data also processes new input. Consistency between training and inference is critical. If your tokenizer cleans text one way during training but differently during inference, your model will see a distribution mismatch.
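
A small Python sketch makes the fallback behaviour concrete. The vocabulary below is a toy one and "<unk>" is a hypothetical stand-in for UNKNOWN_TOKEN. Note that the round trip is lossy: an unseen word decodes back to the unknown token, not to the original word.

```python
UNKNOWN_TOKEN = "<unk>"  # hypothetical placeholder value
word_to_id = {UNKNOWN_TOKEN: 0, "bad": 1, "film": 2, "great": 3}
id_to_word = {index: word for word, index in word_to_id.items()}


def encode(words):
    """Map already-cleaned words to IDs, falling back to the unknown ID."""
    unknown_id = word_to_id[UNKNOWN_TOKEN]
    return [word_to_id.get(word, unknown_id) for word in words]


def decode(token_ids):
    """Map IDs back to a space-joined string."""
    return " ".join(id_to_word[i] for i in token_ids)


ids = encode(["great", "zombie", "film"])
text = decode(ids)
print(ids, text)
```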

Loading the IMDB Dataset

The dataset lives at /tmp/aclImdb/train/ with pos/ and neg/ subdirectories. Each file is named like 1234_8.txt — the number after the underscore is the rating from 1 to 10. We filter for strong reviews (rating ≥ 7 positive, ≤ 4 negative) to get cleaner signal:

def init_tokenizer_and_datasets(mut self, dataset_folder: String) raises -> Tokenizer:
    # Ensure dataset is downloaded
    self._download_imdb_dataset()

    var positive_path = Path("/tmp") / dataset_folder / "pos"
    var negative_path = Path("/tmp") / dataset_folder / "neg"
    var all_comments = List[String](capacity=50000)

    # Load positive reviews (rating 7-10)
    if positive_path.exists():
        for file in positive_path.listdir():
            var rating = self._extract_rating_from_filename(file.name())
            if rating >= 7:
                var comment = positive_path.joinpath(file.name()).read_text()
                all_comments.append(comment)

    # Load negative reviews (rating 1-4)
    if negative_path.exists():
        for file in negative_path.listdir():
            var rating = self._extract_rating_from_filename(file.name())
            if rating <= 4:
                var comment = negative_path.joinpath(file.name()).read_text()
                all_comments.append(comment)

    # Build tokenizer from all loaded comments
    var tokenizer = Tokenizer.from_text_lines(all_comments)

    # Tokenize everything and build datasets
    for comment in all_comments:
        var token_ids = tokenizer.encode(comment)
        if len(token_ids) == 0:
            continue
        self.tokenized_reviews.append(token_ids.copy())
        self.concatenated_tokens.extend(token_ids^)

    return tokenizer

We store two views of the data:

tokenized_reviews: each review as a separate list of token IDs. This lets us build context windows within a single review (we never want context crossing review boundaries).
concatenated_tokens: every token ID from every review concatenated into one flat list. This is used for random negative sampling — we draw negative samples uniformly from the entire corpus.
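
The loader relies on a helper, _extract_rating_from_filename, that the listing does not show. A plausible Python equivalent, assuming filenames always follow the id_rating.txt pattern described above, could look like this (both function names are hypothetical):

```python
from pathlib import Path


def extract_rating_from_filename(filename: str) -> int:
    """'1234_8.txt' -> 8. Assumes the '<id>_<rating>.txt' naming scheme."""
    stem = Path(filename).stem             # '1234_8'
    return int(stem.rsplit("_", 1)[1])     # text after the last underscore


def keep_review(filename: str, is_positive: bool) -> bool:
    """Apply the same thresholds as the loader: >= 7 positive, <= 4 negative."""
    rating = extract_rating_from_filename(filename)
    return rating >= 7 if is_positive else rating <= 4


print(extract_rating_from_filename("1234_8.txt"))
```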

Let’s trace where each number comes from in the code.

Vocabulary size: 252,001. The NegativeSampler.init_tokenizer_and_datasets() method loads every review from aclImdb/train/pos/ and aclImdb/train/neg/, filtering by rating: only reviews with ratings ≥7 or ≤4 qualify. IMDB has 12,500 positive and 12,500 negative training reviews, and the labeled train split already contains only ratings of 7 or higher under pos/ and 4 or lower under neg/, so all 25,000 reviews pass the filter. All of them are passed to Tokenizer.from_text_lines(all_comments), which collects every unique word via Python’s set():

all_words = py.list(py.set(all_words))     # unique words only
all_words = py.sorted(all_words)

Then UNKNOWN_TOKEN is prepended at index 0. The result is 252,001 unique word types (every rare name, typo, number, and foreign word from 25,000 movie reviews), all sorted alphabetically.

5,000 reviews for training, not 25,000. The constant MAX_REVIEWS_TO_USE = 5000 (line 470) limits the training loop to the first 5,000 tokenized reviews. The vocabulary is built before this limit, so the embedding tables are dimensioned for the full 252K vocabulary even though we only iterate over 5K reviews.

50 million parameters. The embedding matrices are created with the full vocabulary size:

var input_embeddings = Tensor[dtype].rand(
    Shape(vocabulary_size, EMBEDDING_DIMENSION), ...
)
var output_embeddings = Tensor[dtype].rand(
    Shape(vocabulary_size, EMBEDDING_DIMENSION), ...
)

Each is 252,001 × 100 = 25,200,100 elements. Two tables → 50,400,200 parameters (~50.4M). The console confirms:

Vocabulary size:  252001
Embedding Dimension:   100
Reviews Used:          5000 of 25000
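
The parameter count is easy to verify, and it also gives a feel for the memory footprint. The byte figure below assumes float32 storage (4 bytes per element), which the listing does not state explicitly.

```python
vocabulary_size = 252_001
embedding_dimension = 100

per_table = vocabulary_size * embedding_dimension   # one embedding matrix
total_parameters = 2 * per_table                    # input + output tables
float32_bytes = total_parameters * 4                # assumption: float32 storage

print(f"per table:  {per_table:,}")
print(f"total:      {total_parameters:,}")
print(f"memory:     {float32_bytes / 1e6:.1f} MB")
```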

Stage 2: Token Embedding Approaches — A Landscape

Before we dive into our training algorithm, it’s worth stepping back and asking: what approaches exist for turning tokens into vectors, and where does our method fit?

Approach	Dimensionality	Semantics	Training Cost	Inference Cost
One-hot	V (huge)	None	None	O(V)
TF-IDF	V (huge)	Word frequency	O(N)	O(V)
Co-occurrence (GloVe)	d (small)	Context statistics	O(V²)	O(1)
Prediction (word2vec)	d (small)	Context prediction	O(N × d × K)	O(1)

One-hot is the baseline with zero learning — each word is a distinct symbol with no inherent relationship to others.

TF-IDF adds frequency weighting but stays in the V-dimensional space. “King” and “queen” are still treated as completely unrelated dimensions.

Co-occurrence methods (like GloVe) are the closest competitor to prediction-based methods. They count how often each pair of words co-occurs in a context window, then factorize that count matrix. The resulting vectors capture semantics well, but a dense co-occurrence matrix has O(V²) entries, which is infeasible for a 100K vocabulary (10 billion cells). GloVe works around this by storing and fitting only the non-zero entries of what is in practice a very sparse matrix, but building those counts still requires a pass over every word pair in every context window.

Prediction-based methods (word2vec and its variants) take a different route: instead of counting co-occurrences, they train a classifier to predict them. This is the approach we’ll implement. The key insight is that predicting whether a word appears in a given context forces the model to learn vector geometry that captures semantic relationships — as a side effect of optimizing classification accuracy, not as an explicit goal.

Within prediction-based methods, there are two main architectures:

CBOW (Continuous Bag of Words): Given the context words, predict the target word. Fast to train, but less effective for rare words.
Skip-gram: Given the target word, predict the context words. Slower to train, but produces better vectors for rare words.

We’ll use CBOW. The intuition: given “the, cat, on, the”, predict “sat”. CBOW averages the context word embeddings into a single vector, then scores candidate words against it. It’s simpler to implement with manual gradients — a single average instead of per-context-word gradient distribution — and faster to train per step since each training example processes one target word instead of C context words.

Stage 3: The CBOW Idea

CBOW (Continuous Bag of Words) is built on a simple intuition from linguistics: “a word is known by the company it keeps.” Words that appear in similar contexts have similar meanings.

The CBOW training objective:

Given context words w_{t-C}, ..., w_{t-1}, w_{t+1}, ..., w_{t+C},
maximize the probability of seeing the target word w_t.

In the sentence “The cat sat on the mat”, with a window size of 2 around sat:

Context: [the, cat, on, the]
Target: sat

For every target position in every review, we collect the surrounding words within the window:

var left_context = slice(
    max(0, word_position - CONTEXT_WINDOW_SIZE),
    word_position
)
var right_context = slice(
    word_position + 1,
    # +1 because the slice end is exclusive
    min(len(review), word_position + CONTEXT_WINDOW_SIZE + 1)
)

var context_indices = review[left_context].copy()
context_indices.extend(review[right_context].copy())

This produces a variable-length context window centered on each target word, with up to CONTEXT_WINDOW_SIZE words on each side. Targets near the start or end of a review naturally get fewer context words, which is fine: the model learns to handle varying amounts of context.
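
Here is the window logic as a small Python sketch, assuming a symmetric window of window_size words on each side (as in the “sat” example). context_window is a hypothetical helper name; Python slicing clamps the right edge for us.

```python
def context_window(review, position, window_size):
    """Return the words around `position`, excluding the target itself."""
    left = review[max(0, position - window_size):position]
    right = review[position + 1:position + 1 + window_size]
    return left + right


review = ["the", "cat", "sat", "on", "the", "mat"]
middle = context_window(review, 2, 2)   # target: "sat"
start = context_window(review, 0, 2)    # target: first "the"
end = context_window(review, 5, 2)      # target: "mat"
print(middle, start, end)
```

Targets at the edges get only two context words here instead of four.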

The probability of the target word given the context words is computed using the softmax over the entire vocabulary:

\[P(w_{\text{target}} \mid \text{context}) = \frac{\exp(\text{score}(w_{\text{target}}, \text{context}))}{\sum_v \exp(\text{score}(v, \text{context}))}\]

Here, score(w_t, context) is a measure of compatibility between the target word and the averaged context. Word2vec uses two embedding matrices to compute this:

Input embeddings (vocab_size × hidden_size): used to represent the context words. We gather the embeddings for every context word in the window and average them into a single context vector. These are what we’ll eventually use as our word vectors.
Output embeddings (vocab_size × hidden_size): used to represent the candidate word (either the target or a negative sample). Each candidate gets its own embedding, and the score is the dot product between this output embedding and the averaged context vector.

In our code, the context words are looked up from input_embeddings and the target + negatives from output_embeddings:

var context_embedding = input_embeddings.gather[track_grad=False](
    context_indices, reduction=Reduction(1)
)
var averaged_context = context_embedding / Float32(context_length)

var sample_embeddings = output_embeddings.gather[track_grad=False](
    sample_indices
)

var predicted_scores = sample_embeddings.matmul[
    mode=mv, track_grad=False
](averaged_context).sigmoid()

The asymmetry is intentional. Each word has two representations — one for when it acts as surrounding context and one for when it’s the candidate being scored. Having separate parameters makes the optimization easier, and the input embeddings end up as our final word vectors.

The Softmax Wall

The softmax denominator sums over every word in the vocabulary. For each training step, computing this requires:

V dot products (one per vocabulary word)
V exponentiations
V additions for the denominator
V divisions for the final probabilities

With V ≈ 100K, that’s 100K dot products per step. With 5 million training tokens and 5 iterations (epochs), that’s 2.5 trillion dot products. Even at 1 microsecond per dot product, that’s 2.5 million seconds, or about a month of computation.

This is the softmax wall — the fundamental computational bottleneck that prevented early neural language models from scaling to large vocabularies.
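
The back-of-the-envelope estimate can be reproduced in a few lines. The one-microsecond cost per dot product is the same illustrative assumption used in the text, not a measurement.

```python
vocab_size = 100_000
training_tokens = 5_000_000
epochs = 5

# One dot product per vocabulary word, per training token, per epoch
dot_products = vocab_size * training_tokens * epochs

seconds = dot_products / 1_000_000   # assumption: 1 microsecond per dot product
days = seconds / 86_400

print(f"{dot_products:,} dot products, about {days:.1f} days")
```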

Stage 4: Negative Sampling

The critical insight from Mikolov et al. (2013) is that we don’t need the full softmax. We don’t care about the exact probability distribution over all words — we only care that the model learns good vector representations. And for that, we can replace the multi-class softmax with a much cheaper binary classification task.

The idea: Instead of computing “how likely is this target word given this context, out of all possible words?”, train a binary classifier that answers “did this (context, target) pair come from real data or random noise?”

For each real (context, target) pair (a positive sample), we generate K negative samples: random words drawn from the corpus that are unlikely to be the real target. The model then learns to assign high probability to positive pairs and low probability to negative pairs.

The objective function for a single training example:

\[J = \log \sigma(\mathbf{u} \cdot \mathbf{v}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n}[\log \sigma(-\mathbf{u}_k \cdot \mathbf{v})]\]

Where:

$\mathbf{u}$ is the embedding of the candidate word (target or negative sample) — looked up from output_embeddings
$\mathbf{v}$ is the averaged context embedding — computed from input_embeddings
$\sigma(\cdot)$ is the sigmoid function
$P_n(w)$ is the noise distribution — we draw negative samples from it

The first term pushes the target word’s output embedding and the context vector together. Each term in the second sum pushes a random noise word’s output embedding and the context vector apart.

This equation is binary cross-entropy in disguise. Every $\log \sigma(\cdot)$ term is paired with an implicit label. The positive term has label 1: $\log \sigma(\mathbf{u} \cdot \mathbf{v})$ is largest when the dot product is large and positive. The negative terms have label 0: $\log \sigma(-\mathbf{u}_k \cdot \mathbf{v})$ equals $\log(1 - \sigma(\mathbf{u}_k \cdot \mathbf{v}))$ by the sigmoid symmetry $\sigma(-x) = 1 - \sigma(x)$, and is largest when the dot product is large and negative. Each expectation $\mathbb{E}_{w_k \sim P_n}$ is approximated by a single Monte Carlo draw: instead of summing over the full vocabulary (which is what the softmax does), we draw $K$ random words from the noise distribution and add up their contributions. With $K$ typically between 5 and 20, we replace an $O(V)$ sum with $O(K)$ samples, which is the entire point of negative sampling.
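
The equivalence is easy to confirm numerically. The sketch below evaluates the objective J and the binary cross-entropy loss on the same made-up dot products (one positive, three negatives) and checks that J is exactly the negated loss. The function names and input values are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_objective(dots):
    """dots[0] is the positive pair's dot product, dots[1:] the K negatives."""
    return math.log(sigmoid(dots[0])) + sum(
        math.log(sigmoid(-d)) for d in dots[1:]
    )


def binary_cross_entropy(dots, labels):
    """Summed BCE over all K+1 candidates, with logits `dots`."""
    return -sum(
        t * math.log(sigmoid(d)) + (1 - t) * math.log(1 - sigmoid(d))
        for d, t in zip(dots, labels)
    )


dots = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 0]
objective = negative_sampling_objective(dots)
loss = binary_cross_entropy(dots, labels)
print(objective, loss)
```

Maximizing J and minimizing the cross-entropy loss are the same optimization.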

K+1 Binary Classifications Instead of One V-Way Classification

Instead of one V-way softmax (V dot products per step), we now have K+1 binary classifications (K+1 dot products per step). With V ≈ 100K and K between 5 and 20, the saving is a factor of V/(K+1): roughly a 17,000x reduction at K = 5 and a 5,000x reduction at K = 20.

The Noise Distribution

Mikolov found empirically that the best noise distribution is the unigram distribution raised to the 3/4 power:

P_n(w) = count(w)^(3/4) / Z

Where Z is a normalization constant. Raising to the 3/4 power has the effect of giving rare words a higher chance of being selected as negatives than they would under the raw unigram distribution. This prevents the model from seeing only common words as negatives, which would make the task too easy.
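
To see the effect of the 3/4 exponent, here is a small Python sketch with made-up word counts. noise_distribution is a hypothetical helper; power=1.0 gives the raw unigram distribution for comparison.

```python
def noise_distribution(counts, power=0.75):
    """P_n(w) = count(w)^power / Z, returned as a dict of probabilities."""
    weights = {word: count ** power for word, count in counts.items()}
    z = sum(weights.values())   # normalization constant Z
    return {word: weight / z for word, weight in weights.items()}


counts = {"the": 10_000, "movie": 1_000, "dreadful": 10}   # illustrative counts
unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts)

for word in counts:
    print(f"{word:10s} unigram={unigram[word]:.4f} smoothed={smoothed[word]:.4f}")
```

The rare word gains probability mass and the most frequent word loses some, which is exactly the flattening described above.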

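Our implementation uses a simpler scheme: it picks a uniformly random position in the concatenated token list. Because frequent words occupy more positions, this samples words in proportion to their raw corpus frequency, that is, the plain unigram distribution (an exponent of 1 instead of 3/4), which is a common approximation: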

def generate_negative_samples(
    current_review: List[Int],
    target_position: Int,
    all_tokens: List[Int],
    num_negative_samples: Int,
) -> List[Int]:
    var corpus_length = Float64(len(all_tokens))
    var negative_samples = [
        all_tokens[
            min(Int(random_float64() * corpus_length), len(all_tokens) - 1)
        ]
        for _ in range(num_negative_samples)
    ]

    # Insert the target word at position 0 (positive sample)
    negative_samples.insert(0, current_review[target_position])

    return negative_samples^

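The result is a list of K+1 token IDs: position 0 is the positive sample (the real target word), and positions 1 through K are random negatives. A random draw can occasionally coincide with the target itself; the function does not filter such collisions out.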

This is the heart of negative sampling — a few lines of code that turn an intractable O(V) problem into a tractable O(K) one.
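
A Python rendering of the same sampler shows the sampling behaviour directly. It is a sketch with a seeded random.Random so the run is repeatable, and the toy corpus (90 copies of token 1, 10 copies of token 2) is invented for illustration.

```python
import random


def generate_negative_samples(review, target_position, all_tokens,
                              num_negative_samples, rng):
    """Return [target, neg_1, ..., neg_K], negatives drawn by corpus position."""
    negatives = [
        all_tokens[rng.randrange(len(all_tokens))]
        for _ in range(num_negative_samples)
    ]
    return [review[target_position]] + negatives


rng = random.Random(0)
all_tokens = [1] * 90 + [2] * 10          # token 1 is 9x more frequent
samples = generate_negative_samples([7, 8, 9], 0, all_tokens, 5, rng)

# Drawing positions uniformly means drawing words by frequency
draws = [all_tokens[rng.randrange(len(all_tokens))] for _ in range(10_000)]
share_of_frequent = draws.count(1) / len(draws)
print(samples, share_of_frequent)
```

The frequent token is drawn about 90% of the time, matching its share of the corpus.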

Stage 5: The Training Loop

With the theory in place, the training loop ties everything together. The model is encapsulated in a Word2Vec struct that holds both embedding tables and exposes forward() and step() methods. The inner loop simplifies to two calls:

var scores = model.forward(ctx, tgt)
model.step(scores, fixed_target, ctx, tgt, Float32(LEARNING_RATE))

For each word in each review, the loop:

Builds a context window around the target word.
Draws K negative samples and prepends the real target word, giving the K+1 candidate IDs in tgt.
Calls model.forward(ctx, tgt), which averages context embeddings, scores the candidates, and applies sigmoid, caching intermediates for the next step.
Calls model.step(scores, labels, ctx, tgt, lr), which does the backward pass (gradient = scores − labels, chain rule through matmul) and uses Tenmo’s scatter_add to apply sparse updates to both embedding tables, touching only the rows that participated in the forward pass.

The full inner loop:

for word_position in range(len(review)):
    var left = slice(max(0, word_position - CONTEXT_WINDOW_SIZE), word_position)
    var right = slice(word_position + 1,
        min(len(review), word_position + CONTEXT_WINDOW_SIZE + 1))  # end is exclusive
    if left.start == left.end and right.start == right.end:
        continue

    var ctx = review[left].copy()
    ctx.extend(review[right].copy())
    if len(ctx) == 0:
        continue

    var tgt = generate_negative_samples(review, word_position,
        all_tokens, NUM_NEGATIVE_SAMPLES)

    var scores = model.forward(ctx, tgt)
    model.step(scores, fixed_target, ctx, tgt, Float32(LEARNING_RATE))

Let’s look at what happens inside those two method calls.

Forward Pass

The forward pass is encapsulated in Word2Vec.forward():

def forward(
    mut self,
    context_indices: List[Int],
    target_indices: List[Int],
) -> Tensor[Self.dt]:
    self.cached_avg = self.input_embeddings.gather[track_grad=False](
        context_indices, reduction=Reduction(0)
    )
    self.cached_tgt_emb = self.output_embeddings.gather[track_grad=False](
        target_indices
    )
    var scores = self.cached_tgt_emb.matmul[mode=mv, track_grad=False](
        self.cached_avg
    )
    return scores.sigmoid[track_grad=False]()

The same three operations, now in one place:

Gather with reduction. gather(context_indices, reduction=Reduction(0)) looks up the embedding for each context word ID and averages them (Reduction(0) means “mean”). This turns, say, 6 context words into a single 100-dimensional vector. The result is cached as cached_avg for the subsequent step() call.

Matmul with mode=mv. cached_tgt_emb is shape (K+1, hidden_size); cached_avg is shape (hidden_size,). mode=mv tells matmul to treat this as matrix-vector multiplication, producing shape (K+1,). Each entry is the dot product between one sample’s embedding and the averaged context.

Sigmoid. The dot products are raw scores in (-∞, ∞). Sigmoid squashes them to (0, 1) so they can be interpreted as probabilities.

The method also caches cached_tgt_emb for the backward pass to use. These cached intermediates let step() avoid re-running the gather operations when computing gradients.
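
For readers without a Mojo toolchain, the same forward pass can be sketched with plain Python lists. This is an illustration of the math only, not Tenmo’s implementation; the tiny 2-dimensional embedding tables are made up so the numbers are easy to follow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(input_embeddings, output_embeddings, context_indices, sample_indices):
    dim = len(input_embeddings[0])
    # Gather + mean: one averaged context vector
    averaged = [
        sum(input_embeddings[i][d] for i in context_indices) / len(context_indices)
        for d in range(dim)
    ]
    # Gather the K+1 candidate rows from the output table
    sample_rows = [list(output_embeddings[i]) for i in sample_indices]
    # Matrix-vector product followed by sigmoid
    scores = [
        sigmoid(sum(r * a for r, a in zip(row, averaged))) for row in sample_rows
    ]
    return averaged, sample_rows, scores


input_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_embeddings = [[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0], [0.0, 0.0]]
averaged, sample_rows, scores = forward(
    input_embeddings, output_embeddings, [1, 2], [1, 2, 0]
)
print(averaged, scores)
```

The candidate aligned with the context scores near 0.88, the opposed one near 0.12, and the zero vector lands at exactly 0.5.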

Training Target

var fixed_target = Tensor[dtype].zeros(NUM_NEGATIVE_SAMPLES + 1)
fixed_target[0] = 1

The target vector is [1, 0, 0, 0, 0, 0] (when K=5). The 1 at position 0 tells the model “the word at index 0 (the positive sample) should have high probability.” The 0s at positions 1–5 say “these random words should have low probability.”

This is a binary cross-entropy setup: each of the K+1 positions is an independent binary classification. The target is created once and reused across every training step.

Backward + Update: The step() Method

The backward pass and parameter update are combined in Word2Vec.step(). The gradient of binary cross-entropy with respect to the logits simplifies to a single subtraction — scores - labels — so the autograd graph would be pure overhead here. Instead, we compute gradients by hand and apply them directly with scatter_add:

def step(
    mut self,
    scores: Tensor[Self.dt],
    labels: Tensor[Self.dt],
    context_indices: List[Int],
    target_indices: List[Int],
    lr: Scalar[Self.dt],
):
    var context_length = len(context_indices)
    var gradient = scores - labels
    var grad_ctx = self.cached_tgt_emb.transpose[track_grad=False]().matmul[
        mode=mv, track_grad=False
    ](gradient)

    # Input embeddings — rank-1 source broadcasts to all context rows
    var ctx_update = -grad_ctx * lr / Scalar[Self.dt](context_length)
    Filler[Self.dt].scatter_add(
        self.input_embeddings.buffer,
        ctx_update.buffer,
        IntArray(context_indices),
    )

    # Output embeddings — outer product, each target row gets its own
    var out_update = -gradient.unsqueeze(1) * self.cached_avg.unsqueeze(0) * lr
    Filler[Self.dt].scatter_add(
        self.output_embeddings.buffer,
        out_update.buffer,
        IntArray(target_indices),
    )

Three distinct computations happen here:

1. The gradient formula

scores - labels is the gradient of binary cross-entropy with respect to pre-sigmoid logits. For L = -[t log(p) + (1-t) log(1-p)] with p = σ(x), the gradient simplifies to dL/dx = p - t. No exponentials, no logarithms — just a subtraction.

We’re computing this by hand intentionally. Tenmo has a complete autograd engine — you can set track_grad=True on any tensor, call .backward() on the loss, and the framework will unroll the full computation graph, compute all gradients, and feed them to an optimizer. But here, the gradient formula collapses to a single element-wise subtraction. Dispatching that through graph construction, tape recording, and jump-table dispatch would add 10-100x overhead for no benefit. The manual path isn’t a workaround — it’s the right tool for this job.
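
If the p - t shortcut feels too good to be true, a finite-difference check settles it. The sketch below compares the analytic gradient against a central-difference estimate of the loss at a few arbitrary logits; the test points are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_logit(x, t):
    """L = -[t log(p) + (1 - t) log(1 - p)] with p = sigmoid(x)."""
    p = sigmoid(x)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def numeric_gradient(x, t, eps=1e-6):
    """Central-difference estimate of dL/dx."""
    return (bce_from_logit(x + eps, t) - bce_from_logit(x - eps, t)) / (2.0 * eps)


cases = [(0.3, 1.0), (-1.2, 0.0), (2.0, 0.0)]   # (logit, label) pairs
max_error = max(
    abs(numeric_gradient(x, t) - (sigmoid(x) - t)) for x, t in cases
)
print(f"largest gap between numeric and analytic gradient: {max_error:.2e}")
```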

2. Chain rule through matmul

grad_ctx = cached_tgt_emb^T @ gradient is the chain rule through the dot product. If score = u · v and dL/dscore = gradient, then dL/dv = u^T · gradient. We transpose the cached target embeddings (shape (hidden_size, K+1)) and multiply by the gradient (shape (K+1,)), getting the gradient for the averaged context vector (shape (hidden_size,)).

3. Sparse updates with scatter_add

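Both embedding updates use Filler.scatter_add, Tenmo’s sparse update primitive that adds gradient contributions to specific rows of a tensor buffer, leaving all other rows untouched. This avoids materializing a full (vocab_size, hidden_size) gradient matrix: each step touches only the K+1 sampled rows of the output table and the handful of context rows of the input table, out of 252,001 rows in each, so a dense gradient would be almost entirely zeros.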

The input embedding update uses rank-1 broadcast: scatter_add detects that ctx_update has rank 1 and broadcasts it uniformly across all indices. Every context word gets the same gradient vector added to its row, without needing unsqueeze + repeat to tile it into a matrix first.

The output update is different. Each of the K+1 samples gets its own update proportional to how wrong its prediction was:

out_update[sample_i] = -gradient[i] * cached_avg * lr

The unsqueeze operations handle broadcasting: gradient is shape (K+1,), cached_avg is shape (hidden_size,). After unsqueezing, gradient.unsqueeze(1) is (K+1, 1) and cached_avg.unsqueeze(0) is (1, hidden_size). The element-wise multiplication broadcasts to (K+1, hidden_size) — exactly the shape needed to update all K+1 sample embeddings in one scatter_add call.

The division by context_length in the input update is critical: in the forward pass, we averaged the context embeddings, so the chain rule requires dividing the gradient by context_length. Without this, longer context windows would get disproportionately large updates.
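
The whole update can be mirrored in plain Python to see which rows move. This is a sketch of the arithmetic only, with nested lists standing in for tensors and explicit loops standing in for scatter_add; the toy tables and learning rate are invented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step(input_emb, output_emb, averaged, sample_rows, scores, labels,
         context_indices, sample_indices, lr):
    dim = len(averaged)
    gradient = [s - t for s, t in zip(scores, labels)]          # p - t
    # Chain rule through the dot products: sum_i gradient[i] * sample_rows[i]
    grad_ctx = [
        sum(g * row[d] for g, row in zip(gradient, sample_rows))
        for d in range(dim)
    ]
    # Input table: every context row gets the same vector, scaled by 1/len
    for idx in context_indices:
        for d in range(dim):
            input_emb[idx][d] -= lr * grad_ctx[d] / len(context_indices)
    # Output table: row i gets its own update, gradient[i] * averaged
    for g, idx in zip(gradient, sample_indices):
        for d in range(dim):
            output_emb[idx][d] -= lr * g * averaged[d]


input_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_emb = [[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0], [0.0, 0.0]]
context_indices, sample_indices = [1, 2], [1, 3]
labels = [1.0, 0.0]

# Values a forward pass would have cached
averaged = [0.5, 0.5]
sample_rows = [list(output_emb[i]) for i in sample_indices]
scores = [sigmoid(sum(r * a for r, a in zip(row, averaged))) for row in sample_rows]

step(input_emb, output_emb, averaged, sample_rows, scores, labels,
     context_indices, sample_indices, lr=0.1)
print(input_emb)
print(output_emb)
```

Rows 0 and 3 of the input table and rows 0 and 2 of the output table are untouched. If a word appears twice in the context, its row simply receives the update twice, which is the accumulate behaviour expected of a scatter-add.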

Gradient Flow Verification

After each epoch, we check that gradients are actually flowing by comparing the weight sum against the initial value captured before training began:

var final_sum = model.input_embeddings.sum[track_grad=False]().item()
print(
    "\nWeight sum change:", final_sum - initial_weight_sum,
    "(should be != 0 — proves gradients are flowing!)",
)

If the weight sum hasn’t changed, something is wrong with the gradient computation or the update. This is a cheap sanity check that catches bugs like a zero learning rate, a disconnected graph, or a failed scatter_add. A non-zero change shows that updates are reaching the embedding table. It does not by itself prove the gradients are correct, but it rules out a silently dead pipeline.

Stage 6: Probing the Learned Embeddings

Training yields an embedding matrix of shape (vocab_size, 100). To test whether these vectors actually capture semantics, we write a function that finds words closest to a given query:

def find_similar_words(
    tokenizer: Tokenizer,
    ref embeddings: Tensor[DType.float32],
    query_word: String = "beautiful",
    top_n: Int = 10,
) raises -> List[Tuple[String, Float32]]:

    # Get embedding for the query word
    var query_ids = tokenizer.encode(query_word)
    var query_embedding = embeddings.gather[track_grad=False](query_ids)

    # If multiple tokens (unlikely for single word), average them
    if len(query_ids) > 1:
        query_embedding = query_embedding.mean[track_grad=False](
            IntArray(0), keepdims=True
        )

    # Compute Euclidean distance to all other words
    var differences = embeddings - query_embedding
    var distances = (
        (differences * differences)
        .sum[track_grad=False](IntArray(1))
        .sqrt[track_grad=False]()
    )

    # Build results and sort by similarity
    var results = List[Tuple[String, Float32]](capacity=len(tokenizer))
    for ref pair in tokenizer.word_to_id.items():
        var word = pair.key
        var index = pair.value
        if word == query_word or "_" in word:
            continue
        results.append((word, -distances[index]))

    sort[cmp_fn=compare_by_similarity](results)

    var top_results = List[Tuple[String, Float32]](capacity=min(top_n, len(results)))
    for k in range(min(top_n, len(results))):
        top_results.append(results[k])
    return top_results^

The similarity metric is negative Euclidean distance — we compute -||v_query - v_word|| for every word in the vocabulary, then sort descending. Negative distance means “closer is more similar,” which makes sorting natural (highest first).

The steps are worth noting:

embeddings - query_embedding computes a (vocab_size, hidden_size) difference matrix in a single broadcast operation.
(differences * differences).sum(axis=1) squares and sums along the hidden dimension, producing a (vocab_size,) distance vector.
.sqrt() converts squared distances to actual Euclidean distances.
We iterate over the vocabulary, skip the query word itself and any word containing an underscore, and build a (String, Float32) result list.
The results are sorted and the top N returned.

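This is intentionally simple: we use Euclidean distance rather than cosine similarity because it skips the normalization step. For unit-length vectors the two produce identical rankings, since ||a − b||² = 2 − 2·cos(a, b). Our embeddings are not normalized, so the ranking here can differ from a cosine ranking, with vector length playing a role as well as direction.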
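
The relationship between the two metrics is worth a quick experiment. The sketch below uses two made-up 2-dimensional candidates: on raw vectors the rankings disagree, and after normalizing to unit length they agree. All helper names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    n = norm(v)
    return [x / n for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def rank_by_distance(query, candidates):
    return sorted(candidates, key=lambda w: euclidean(query, candidates[w]))


def rank_by_cosine(query, candidates):
    return sorted(candidates, key=lambda w: -cosine(query, candidates[w]))


query = [1.0, 0.0]
raw = {"long": [10.0, 0.0], "short": [0.9, 0.5]}       # illustrative vectors
unit = {word: normalize(vec) for word, vec in raw.items()}

print("raw:  ", rank_by_distance(query, raw), rank_by_cosine(query, raw))
print("unit: ", rank_by_distance(query, unit), rank_by_cosine(query, unit))
```

“long” points in exactly the query’s direction but sits far away, so Euclidean distance on raw vectors ranks it last while cosine ranks it first.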

The demo output, when the training converges, shows:

🔍 Words similar to 'terrible':
   horrible → similarity: -1.4567126
   boring → similarity: -2.1396909
   wonderful → similarity: -2.1462088
   ridiculous → similarity: -2.1734316
   weak → similarity: -2.276786
   stupid → similarity: -2.280788
   fantastic → similarity: -2.2870705
   lame → similarity: -2.2934372
   simple → similarity: -2.2952878
   poor → similarity: -2.3172371

Most neighbors are negative-sentiment words (horrible, boring, ridiculous), which is expected — “terrible” lives in negative semantic space. A couple of positive words (wonderful, fantastic) also appear, which may reflect shared intensity or syntactic patterns in the training data. If the embeddings were random or poorly trained, we’d see unrelated words like “the”, “movie”, or “and” clustering at the top. The fact that the nearest neighbors are mostly semantically related is evidence that the training worked.

Why Tenmo?

This implementation highlights a few of Tenmo’s design strengths:

First-class scatter_add primitive. Most tensor libraries treat row-scatter as an afterthought or don’t expose it at all. PyTorch has index_add_, but unless you wrap it in torch.no_grad() it is tracked by the autograd engine, adding graph-tracking overhead that sparse updates don’t need. Tenmo’s Filler.scatter_add is a direct buffer operation: no graph, no tape, no dispatch. It’s the right primitive for word2vec, and Tenmo exposes it directly.

Autograd when you need it, not when you don’t. Tenmo has full autograd: track_grad=True, .backward(), optimizers like SGD, everything you’d expect. But when your gradient simplifies to p - t, the autograd path is pure overhead. Tenmo doesn’t force you through it — you can call Filler.scatter_add on raw buffers, compute gradients by hand, and skip the graph entirely. The choice is yours per operation, not all-or-nothing.

Ownership without GC pauses. Each training step allocates intermediate tensors (gather outputs, scores, gradients). In a garbage-collected language, these allocations trigger the GC to track and reclaim them. Mojo’s ownership system (which Tenmo is built on) lets us control exactly when temporaries are destroyed — or reuse buffers explicitly.

CPU-first with optional GPU. The code runs on CPU without modification. Tenmo detects GPU availability at compile time via has_accelerator(). When a GPU is present, tensors are transparently moved and operations dispatched to GPU kernels. Same code, one compile flag.

Conclusion

We built the full pipeline from raw text to word vectors using Tenmo:

A text tokenizer that cleans HTML-laden reviews, builds a vocabulary, and encodes text into integer IDs with an unknown-word fallback.
A CBOW training loop that predicts the target word from averaged context embeddings, with context window construction and embedding averaging.
Negative sampling that turns a V-way softmax into K+1 binary classifications — the key algorithmic insight that makes word2vec practical.
A Word2Vec struct whose forward() and step() methods encapsulate manual gradient computation and sparse scatter_add updates — optimizing only the embedding rows that actually participated in each training step.
A similarity probe that validates the learned embeddings by finding nearest neighbors in vector space.

The final implementation trains on 5,000 IMDB reviews, producing word vectors where “terrible” is close to “horrible”, “boring”, and “ridiculous”, without ever being told that these words are related. The model learned it purely from the statistics of word co-occurrence in raw text.

Next steps to explore:

Swap negative sampling for hierarchical softmax and compare training speed and embedding quality.
Move to a larger corpus (Wikipedia dumps are a common next step) and use subword tokenization (BPE) instead of word-level tokens.

The full code (around 760 lines) is available in the tenmo repo’s examples/word2vec_cbow.mojo. It’s MIT-licensed and ready to run — just mojo -I . examples/word2vec_cbow.mojo with the IMDB dataset in /tmp/aclImdb/.

A solana on-chain contract and off-chain client in rust

2022-04-16T23:09:00+00:00

https://github.com/ratulb/solana_program_and_rust_client

Originally published on https://rbsomeg.blogspot.com

Migrate kubernetes embedded etcd to external etcd - easy back and forth switch

2021-07-01T01:18:00+00:00

Gist:

Create a multi-master kubernetes cluster from the comfort of a shell menu without tweaking a thing. Front the apiservers with a load balancer of your choice - namely haproxy/nginx/envoy. Do hassle free back and forth switch between embedded etcd and external etcd.

In this post, we discuss kube-etcd-switch - which is not so much a tool as a bunch of scripts behind a shell menu that help us do all of the above in a hassle free manner.

Curious? Read on then. But you have been forewarned - it might not be your cup of tea.

Kubernetes treats pods as cattle - they are discarded if not healthy. No effort is wasted on reviving unhealthy pods - instead new ones are created to replace the bad ones.

Kubernetes is conjoined with etcd by an umbilical cord. Etcd stores kubernetes schema and state. Kubernetes is useless without etcd (as things stand currently). At times it can be quite a challenge to bring up a kubernetes cluster if etcd starts throwing tantrums. For example - you want to remove an etcd node because it has gone bad - but the etcd cluster would not let you do that because the node is not up yet. Quite a vexatious situation to be in.

So, what do we do in such a chicken and egg situation? Well, follow the same kubernetes philosophy - we discard the etcd cluster ( Not the cluster itself - we have compunction - mechanical sympathy. Instead we scrap etcd ) - create a new one to replace the faulty one. We treat everything as cattle - no pets. If a piece of software is not crunching data and providing information - it is not serving its cause - it’s redundant. Below we provide a glimpse of how we do that. That is, of course, as long as we have data at our hands, a backup or a snapshot - we care for data - it’s valuable - amorphous gold.

First up, we need a kubernetes cluster - kube-etcd-switch can interface with any existing kubernetes cluster - but here we show how to set up a k8s cluster as well, because we don’t have one at hand currently and we need a cluster for the show to go on.

Requirements: A set of machines (Debian buster/ubuntu16/18/20 flavor) with root SSH access.

Here, we use four machines - one for load balancer(lb), two for kubernetes master nodes(m-1,m-2), one worker(w-1) node.

We run everything from the load balancer node.

1) Clone the following GitHub repository - go inside and launch the ‘cluster.sh’ script.

git clone https://github.com/ratulb/kube-etcd-switch

cd kube-etcd-switch/

./cluster.sh

We would be presented with a menu which has quite a few choices.

We need a cluster - hence we make the appropriate selection and get on with the cluster setup process driven by the menu choices.

2) We enter the cluster details such as load balancer, master nodes and worker node. The following few snaps capture the steps.

3) Load balancer details

4) Next we enter master and worker details:

5) Next we select option to launch the cluster creation process. This would provide us with running kubernetes cluster in a matter of minutes with weave CNI plugin and demo nginx pod deployed.

6) Following snap shows the end result of cluster creation:

7) Next is the initialization step. For kube-etcd-switch to work with any cluster, it needs to be initialized first. We need to provide the IP (or name) of any one of the masters for this. kube-etcd-switch will query the cluster, gather information such as master members, copy the etcd CA cert and set up kubectl, Cloudflare CFSSL and other required binaries to perform its duties. The initialization process is idempotent and can be repeated. It needs to be run at least once per rotation of the servers’ certificates.

Following snaps show the initialization choices.

Note: Above we see that master endpoints are already detected - that is because k8s setup has already configured kube config. It will not be so for a pre-existing cluster. Initialization would be needed in either case.

8) Post initialization, kube-etcd-switch shows the cluster’s system pod states. Now it can talk to the kubernetes cluster.

9) At this point - our cluster is pristine (it would not be so for an existing cluster). Let’s go ahead and deploy a demo nginx pod in the default namespace. We select console and deploy the pod.

10) We see that our nginx pod is running along with demo pods that were deployed during the cluster creation process.

11) We want to survive cluster failure, whether kubernetes or etcd. Kubernetes is a done deal - we have shown it above. Etcd would not be worth its salt if it did not have data. But now it has data - the whole kubernetes cluster’s schema and state - which also contains our freshly deployed nginx pod’s information. We need that data - we want to preserve it to survive cluster failure - computation calamity.

We exit out of the console - that would take us back to where we were before. We select snapshot view from the menu - we would be presented with an option to choose between embedded and external etcd cluster. Presently, we do not have an external cluster. We choose embedded and take a snapshot.

12) With a snapshot in hand - we are safe. We heave a sigh of relief. We are ready to combat disaster. We want to put our conviction to the test - we want to simulate a catastrophe and survive through it - making ourselves doubly reassured that we can infuse life back into etcd in the event of a cluster failure.

We head back to the main menu - choose console (this can be done from a usual terminal - there is no difference - but we want to be in the context of the menu - hence we choose console anyway) and then run the script shown in the following snap. This script will wreak havoc on our cluster - it will wipe out our cluster and render it useless. All data would be expunged. Only the static pods would be running meekly with utter indifference. Had it been a production cluster - business would have come to a grinding halt. Some may be updating their resumes - freshening up on the tricks of the trade. Yet some others may be philosophizing about what life is all about - consequences may be far beyond one’s imagination - all due to a failed etcd cluster (pun intended 😜).

Cluster demolition in progress:

Total annihilation:

13) Now that our cluster is decimated, we want to bring it back to life using the snapshot that we had taken. We can - and we would restore the snapshot on top of embedded etcd cluster - but first we would launch an external etcd cluster and restore the snapshot on top of it and verify that api servers are responding as expected.

We exit from the console and go back to main menu and choose ‘Manage external etcd’

14) We proceed with external etcd cluster setup process. For this post, we choose to host the cluster on the load balancer and the worker node ( Digression: we can also imagine kubernetes master nodes being part of external etcd cluster. For that to happen - the stacked/embedded etcd would need to bottom out one by one giving external etcd space to be hosted as separate processes on the master nodes).

15) The external etcd cluster is ready with the required configurations and binaries but not yet started. It would be up once we restore the snapshot.

16) Let’s go ahead and restore the snapshot. The following snaps capture the steps. We go back to snapshot view and select the restore option.

17) We choose external etcd as target cluster and select the snapshot that we had saved earlier.

18) We see snapshot restoration on external etcd cluster in progress.

19) Snapshot restoration on external etcd cluster is complete and system pods are up and running in a couple of minutes.

20) We have survived a disaster without a scratch. That was easy! Let’s go ahead and take out an etcd node for repair. The kubernetes cluster should suffer no hiccups.

21) There have been no hiccups for the cluster, as we can see from the kube system pods. The embedded etcd cluster is still running but the api servers are not pointing at it. Its members will have nothing in them - because when the disaster struck - they were hollowed out.

22) Node repaired. Let’s add it back to the cluster again.

23) Repaired node has become a member of the cluster again.

24) Let’s bring the embedded etcd cluster back to life. We go back to snapshots view and select the embedded cluster as the restore target.

25) We see that our embedded cluster is back - and system pods are back too.

26) Our nginx pod should be back in the default namespace. Let’s check that.

This effortless switch between two environments using snapshots opens the door for a lot of use cases - disaster recovery, cluster replication, fail over, rapid development and testing, preview releases, to name just a few.

What about the situation where we have just restored a snapshot but would like to go back to the previous state we were in? Well, we would definitely take a backup snapshot before migration - and use that as a fallback option. But in reality - a snapshot always takes us to a new state - it creates new data directories, new configurations - it’s not exactly the same setup as before.

But we want to go back to the exact setup we were in. Can we do that? Of course we can. We would need to manually alter settings and configurations. That would involve rounds of testing and verification. That is going to be error prone and not hassle free. Well, freedom from hassle is what kube-etcd-switch strives for.

As it turns out, these scripts can help us go back to not only the previous state, but any previous state. As said, when we are restoring a snapshot, we are creating new restore paths and configurations and moving on to them - whether it is embedded or external etcd. We are leaving behind a trail of data directories and configurations. Any time we restore a snapshot, the scripts look at the current settings and data directories across nodes, back them all up in a single archive and save it (Where? Currently underneath a directory called kube_vault - on the node where kube-etcd-switch runs. These archives can easily be pushed to safe storage and duplicated to prevent data loss).

We have not talked about states so far. States is the mechanism that helps us go back to any last good state. But it has challenges of its own. We are good if the cluster topology remains the same. We can just spread out the archived state across the nodes and resume etcd and the kubernetes api servers. But what if nodes leave or new nodes are added to the etcd cluster? As we know - etcd does not like it if a node does not leave the cluster on good terms - it will not bury that hatchet otherwise. And talk of adding a node surreptitiously to the cluster - you have to dance a new dance to calm etcd’s tantrums. States is a topic for another post, another day.

Conclusion:

We have covered a lot. We started with a fresh cluster setup, took a snapshot, brought the cluster to its knees, created an external etcd cluster, restored a snapshot on it and brought the cluster back to life, took a node out of the etcd cluster, added it back - and finally switched the kubernetes cluster back to embedded etcd. We have also touched upon states.

Behind all this are a bunch of shell scripts. We can see what they are doing because we are close to the metal. They enable experimentation - we can choose the console option - tweak/improve/cookie cut the scripts to suit our needs - exit the console - refresh the view and see the effects.

Happy experimentation - if you wish.

Source: https://github.com/ratulb/kube-etcd-switch/blob/main/cluster.sh

Originally published on https://rbsomeg.blogspot.com

VPC native kubernetes cluster in GCP

2021-06-15T20:11:00+00:00

VPC native k8s clusters have quite a few advantages:

POD IPs are directly routable. This eliminates the need for a load balancer to hop from node to pod. Instead traffic can reach PODs directly minimizing latency.
POD IPs are reserved before PODs are created. This helps avoid POD IP collision with existing resource IPs.
Firewall rules can be configured for POD IP ranges instead of node IP ranges.
POD IPs can be accessed from on-premise connected networks via VPN or cloud inter-connect.

A VPC native cluster requires a subnet for cluster nodes and two secondary ranges inside that subnet - one for POD IPs and another for service IPs.
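The node, POD and service ranges should not overlap with one another. As a quick sanity check before running the commands in this post, here is a minimal Rust sketch that parses the three CIDR blocks used further down (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and confirms they are disjoint. The helper names `parse_cidr` and `overlaps` are made up for illustration and are not part of any GCP tooling.

```rust
use std::net::Ipv4Addr;

/// Network mask for a prefix length (0..=32).
fn mask(len: u8) -> u32 {
    if len == 0 { 0 } else { u32::MAX << (32 - len) }
}

/// Parse "a.b.c.d/len" into (network address as u32, prefix length).
/// Hypothetical helper, for illustration only.
fn parse_cidr(cidr: &str) -> Option<(u32, u8)> {
    let (addr, len) = cidr.split_once('/')?;
    let addr: Ipv4Addr = addr.parse().ok()?;
    let len: u8 = len.parse().ok()?;
    if len > 32 {
        return None;
    }
    Some((u32::from(addr) & mask(len), len))
}

/// Two CIDR blocks overlap when they agree on the bits of the shorter prefix.
fn overlaps(a: &str, b: &str) -> Option<bool> {
    let (net_a, len_a) = parse_cidr(a)?;
    let (net_b, len_b) = parse_cidr(b)?;
    let m = mask(len_a.min(len_b));
    Some(net_a & m == net_b & m)
}

fn main() {
    // Ranges taken from the subnet creation command in this post.
    let ranges = [
        ("nodes", "10.0.0.0/8"),
        ("pods", "172.16.0.0/12"),
        ("services", "192.168.0.0/16"),
    ];
    for i in 0..ranges.len() {
        for j in (i + 1)..ranges.len() {
            let clash = overlaps(ranges[i].1, ranges[j].1).expect("valid CIDR");
            println!("{} vs {}: overlap = {}", ranges[i].0, ranges[j].0, clash);
        }
    }
}
```

The same check can be pointed at any other range already in use in the VPC before picking the secondary ranges.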

Commands to launch a VPC native k8s cluster quickly:

Create VPC network:

gcloud compute networks create gke --project=[project_id] --subnet-mode=custom --mtu=1460 --bgp-routing-mode=regional

Create subnet and secondary ranges for POD and services:

gcloud compute networks subnets create primary-subnet --project=[project_id] --range=10.0.0.0/8 \
    --network=gke --region=asia-south1 --secondary-range=pod-subnet=172.16.0.0/12 --secondary-range=service-subnet=192.168.0.0/16

Launch the cluster:

gcloud container clusters create gke-cluster \
    --network gke \
    --enable-ip-alias \
    --subnetwork=primary-subnet \
    --cluster-secondary-range-name=pod-subnet \
    --services-secondary-range-name=service-subnet \
    --num-nodes 3 \
    --zone asia-south1-b

Initialize kubeconfig:

gcloud container clusters get-credentials gke-cluster --zone asia-south1-b

Deploy a nginx POD:

kubectl run nginx --image nginx

Expose POD via cloud load balancer:

kubectl expose pod nginx -l run=nginx --port 80 --type LoadBalancer

Access exposed POD via load balancer IP:

curl [load balancer IP]

Originally published on https://rbsomeg.blogspot.com

grpc connect — rust, java and grpc-web

2021-04-18T22:31:00+00:00

Gist: Route calls from browser(using grpc-web) to rust grpc application(implemented using tonic), which in turn delegates to java grpc and vice versa.

Note : We use latest versions of various libraries/binaries for this demonstration. One would be well advised to use disposable cloud VMs to carry out the steps demonstrated in this post. Verified for debian buster and various flavors of ubuntu.

Grpc offers many advantages — schema first design enforces well-defined interfaces, protobuf based binary protocol is performant, multiple requests over a single connection, implementation of clients and servers in multiple languages based on language specific artifacts generated by protoc compiler, bi-directional streaming etc.

In this post, however, we stick to a simple example of request and reply since our focus is on connectivity between different pieces. Following figure captures the request and response flow:

Note: It will help to clone the following GitHub project to follow along the steps described:

git clone https://github.com/ratulb/grpc-rust-java-web.git

Part 1 : java and rust grpc connectivity

1. Following is the protobuf interface definition that rust/java/grpc-web use to generate language specific protocol buffer artifacts, clients and services.

2. We implement the rust service first. We assume that rust is already installed.

3. We create the rust grpc server implementation within ../rust/server (refer to https://github.com/ratulb/grpc-rust-java-web/tree/main/rust/server).

cargo new server

4. We create a new folder called ‘proto’ inside the ‘server’ project created above and place the protobuf definition file ‘echo.proto’ inside that.

5. There are multiple grpc frameworks available in rust. We use tonic as rust grpc framework because of its feature completeness, contributor count and production readiness. Hence we edit the Cargo.toml file to include tonic with its dependencies.

6. To trigger the protobuf code generation we need to add a file named ‘build.rs’ inside the server folder with the following content.

7. At this point, we are ready to build the project. We run ‘cargo build’. Post build, we find that there is an echo.rs file generated inside the target directory.

8. We add a src/echo.rs with content of the file as shown below:

tonic::include_proto!("echo");

9. Next we modify the src/main.rs file with content shown as below:

Note : The content of the https://github.com/ratulb/grpc-rust-java-web/blob/main/rust/server/src/main.rs file differs from the one shown above. That is because, once the rust grpc server receives a request, it will try to pass the request on to a java delegate if one is registered. Also, we need to make sure there is no endless delegation cycle. The rust implementation uses grpc request headers, and the java implementation (https://github.com/ratulb/grpc-rust-java-web/blob/main/java/server/src/main/java/grpc/java/server/EchoServer.java) uses a request header along with a request interceptor, to break the cycle.
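To make the cycle-breaking idea concrete, here is a minimal sketch that models request headers as a plain map rather than tonic metadata. It is not the repository’s code: the header name `x-delegated` and the functions `should_delegate` and `mark_delegated` are hypothetical, chosen only to show how a marker header stops a request from bouncing between the two servers.

```rust
use std::collections::HashMap;

// Hypothetical marker header; the repository may use a different name.
const DELEGATED_HEADER: &str = "x-delegated";

type Headers = HashMap<String, String>;

/// Delegate only if a delegate is registered and the request
/// has not already been forwarded once.
fn should_delegate(headers: &Headers, delegate_registered: bool) -> bool {
    delegate_registered && !headers.contains_key(DELEGATED_HEADER)
}

/// Copy the incoming headers and stamp the marker before forwarding.
fn mark_delegated(headers: &Headers) -> Headers {
    let mut out = headers.clone();
    out.insert(DELEGATED_HEADER.to_string(), "true".to_string());
    out
}

fn main() {
    let incoming = Headers::new();

    // First hop (say, the rust server): no marker yet, so it forwards.
    assert!(should_delegate(&incoming, true));
    let forwarded = mark_delegated(&incoming);

    // Second hop (say, the java server): marker present, so it answers locally.
    assert!(!should_delegate(&forwarded, true));

    println!("request is delegated at most once");
}
```

In a real tonic service the same check would presumably read from the request metadata, and the marker would be added to the outgoing request before calling the delegate.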

10. At this point, we are ready to launch the rust grpc server implementation by running “cargo run”.

11. Our rust server should be running at this point. We would be using ‘grpcurl’ to invoke the server.

12. We run the “grpc-curl.sh” script as shown below:

./grpc-curl.sh 0.0.0.0:30031

13. We should get back a response from the server.

14. At this point we should be able to navigate to the ./rust/client folder and run the rust client implementation (https://github.com/ratulb/grpc-rust-java-web/blob/main/rust/client/src/main.rs) as shown below:

cargo run or just call ./run.sh

15. At this point — we should be able to navigate to ./java/server/ and ./java/client/ folders and run the ‘run.sh ’ script in respective folders.

16. If both rust and java grpc servers are running, then running the rust client should get a response from the java grpc server and vice versa. This would mean that rust and java grpc connectivity is working as expected.

Part 2 : Envoy proxy

Note : Rust and java grpc do not need envoy proxy to connect to each other. They talk proper grpc which makes use of HTTP2 as the underlying transport protocol. We are just setting things up for what is coming next- Grpc-web.

1. Navigate to the ./envoy folder and run ‘./setup.sh’ - this would install envoy proxy locally.
2. Next run ‘./runs.sh’. Envoy would start listening at port 10000. Envoy is configured to route requests based on a request header called “target_cluster”. So a grpc payload sent to envoy should carry the request header called “target-cluster” as part of the grpc request metadata. Later we would see that the grpc-web client sends this header from the browser request. Based on this header, the incoming request is routed to the upstream rust or java grpc server.

3. For now we can navigate to ./java/server or ./rust/server folder and execute the ‘grpc-curl.sh’ script. We should be able to get a response back because these scripts are configured to send the target_cluster request header as shown below:

4. So far we have made sure that if we can deliver a grpc request payload to the envoy listening address, the request would be answered by either java or rust grpc server. Next, we would look at sending a grpc request from the browser.

Part 3 : Grpc-web

As things stand currently, the browser does not talk grpc (though it supports HTTP2 - and remember grpc != HTTP2). Also, the browser does not expose APIs with enough control over the request to make an outgoing grpc call. That’s where grpc-web comes in: it is a JavaScript client library that facilitates connectivity between a browser application and a grpc server. But grpc-web does not talk proper grpc either. It talks in terms of a protocol that is easy to translate into proper grpc, which is what the envoy proxy does (by making use of a filter, “envoy.filters.http.grpc_web”, in ./envoy/envoy.yaml & ./envoy/envoy-local.yaml).
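One reason the translation is cheap is that grpc-web keeps grpc’s length-prefixed message framing: a one-byte flag, a four-byte big-endian length, then the serialized message. In grpc-web, trailers travel in the response body as a final frame whose flag byte has its most significant bit set. The sketch below illustrates that framing as I understand the published wire format; it is not code from this project, and `encode_frame` and `decode_frame` are illustrative names. Real clients and the envoy filter handle all of this for us.

```rust
/// Wrap a serialized message in the 5-byte prefix:
/// 1 flag byte + 4-byte big-endian payload length.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(5 + payload.len());
    frame.push(0x00); // flag byte: uncompressed data frame
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Split one frame off the front of `buf`.
/// Returns (is_trailers, payload, remaining bytes), or None if `buf` is incomplete.
fn decode_frame(buf: &[u8]) -> Option<(bool, &[u8], &[u8])> {
    if buf.len() < 5 {
        return None;
    }
    // In grpc-web, the top bit of the flag byte marks a trailers frame.
    let is_trailers = (buf[0] & 0x80) != 0;
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    let payload = buf.get(5..5 + len)?;
    Some((is_trailers, payload, &buf[5 + len..]))
}

fn main() {
    // The payload would normally be a protobuf-encoded message; plain bytes here.
    let frame = encode_frame(b"hello");
    let (is_trailers, payload, rest) = decode_frame(&frame).expect("complete frame");
    println!(
        "trailers = {}, payload = {:?}, remaining = {}",
        is_trailers,
        String::from_utf8_lossy(payload),
        rest.len()
    );
}
```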

The overall process of making a grpc application available in the browser is as follows:

a) Generate JavaScript protobuf message classes and client stub for the client using protoc compiler from protobuf definition file.

b) Compile all the required libraries along with the generated protobuf message classes and stub into one javascript library compatible with browsers. This can be achieved using tools like browserify, webpack etc. Optionally, we can minify the compiled library. We are using webpack in this example.

c) Host client app(index.html) in a webserver (tomcat in our example).

d) Set up a proxy (envoy proxy) to intercept grpc-web request from the browser. Delegate the intercepted request to grpc server, gather response and send it back to the browser.

Detailed steps:

Note: We are using NodeJS packages npx and webpack-cli along with dependencies to compile required libraries and protobuf message classes and client stub into one single library. That’s why the installation of NodeJS and the dependencies.

Navigate to the ./web folder and run the ‘./install-protoc.sh’ script. This would install ‘protoc’ and ‘protoc-gen-grpc-web’, which are required for generating javascript protobuf message classes and the client stub from the protobuf definition.
Next, run the ‘./gen-js-proto.sh’ script. This would compile the proto/echo.proto definition and generate two output files, namely ‘echo_pb.js’ and ‘echo_grpc_web_pb.js’. We are using definitions from these two files in ‘client.js’.
Change the IP address in line 9 of ‘client.js’ to that of the envoy proxy (if required). The javascript function “main” defined in client.js is used in index.html. Note: the IP address change is not required if everything is running locally.
We are using NodeJS npx and webpack-cli along with dependencies to compile the required libraries, protobuf message classes and client stub into one single library. Execute the “./setup-node-wp.sh” script to install NodeJS and the dependencies.
We would need a webserver to host our grpc-web client app (index.html). Navigate to the ./web/tomcat/ directory and run ‘./setup.sh’. This would install the tomcat server.
At this point, we are ready to deploy our client app (index.html) to the tomcat server. We navigate to the ./web folder and run “./deploy-app.sh”. This would compile all the javascript files into one single ./web/dist/main.js file and then copy the resources to the ./web/tomcat../webapp/client directory.
At this point, we can navigate back to the project root folder and execute ‘./run.sh’. This would run the rust and java grpc servers, tomcat and envoy proxy. We should be able to access the webpage at http://IP:8080/client (http://127.0.0.1:8080/client if running locally), where IP is the address of the tomcat server.
The browser should display a page as shown below. We should be able to select rust or java from the drop down and call the grpc servers.

Originally published on https://rbsomeg.blogspot.com

Format shell script

2021-04-13T08:15:00+00:00

snap install shfmt

shfmt -i 2 -ci -w ./*.sh

Originally published on https://rbsomeg.blogspot.com

Linus Torvalds on rust in linux

2021-03-24T22:27:00+00:00

https://www.zdnet.com/article/linus-torvalds-on-where-rust-will-fit-into-linux/

Originally published on https://rbsomeg.blogspot.com

Algorithmic Muscle Exercise - maximum subsequence length in rust

2021-03-21T13:45:00+00:00

Maximum sub-sequence length of 3 strings - bottom up approach:

Source: https://github.com/ratulb/algos_in_rust/blob/master/max_sub_sequence_bottom_up/src/lib.rs
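For readers who want the idea without opening the repository, here is a minimal bottom-up sketch. It assumes the problem is the length of the longest subsequence common to all three strings; it is not the repository’s implementation, and the function name `lcs3` is my own. The table entry `dp[i][j][k]` holds the answer for the first `i`, `j` and `k` characters of the three inputs.

```rust
/// Length of the longest subsequence common to all three strings,
/// computed bottom-up over a 3-D table.
fn lcs3(a: &str, b: &str, c: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let c: Vec<char> = c.chars().collect();

    // dp[i][j][k] = answer for prefixes a[..i], b[..j], c[..k].
    // Row/column/plane 0 stays 0: an empty prefix shares nothing.
    let mut dp = vec![vec![vec![0usize; c.len() + 1]; b.len() + 1]; a.len() + 1];

    for i in 1..=a.len() {
        for j in 1..=b.len() {
            for k in 1..=c.len() {
                dp[i][j][k] = if a[i - 1] == b[j - 1] && b[j - 1] == c[k - 1] {
                    // All three prefixes end in the same character: extend.
                    dp[i - 1][j - 1][k - 1] + 1
                } else {
                    // Otherwise drop the last character of one string at a time.
                    dp[i - 1][j][k].max(dp[i][j - 1][k]).max(dp[i][j][k - 1])
                };
            }
        }
    }
    dp[a.len()][b.len()][c.len()]
}

fn main() {
    // "b1e" is common to all three inputs.
    println!("{}", lcs3("abcd1e2", "bc12ea", "bd1ea"));
}
```

Time and memory both grow with the product of the three lengths, which is fine for short strings but worth keeping in mind for long ones.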

Originally published on https://rbsomeg.blogspot.com

Algorithmic Muscle Exercise - Word Search In Rust

2021-03-20T00:08:00+00:00

Word search in a grid:

Source: https://github.com/ratulb/algos_in_rust/blob/master/word_search_in_matrix/src/lib.rs
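As a rough illustration of the technique, here is a depth-first search with backtracking. The rules are an assumption on my part: the word must be traced through horizontally or vertically adjacent cells, each cell may be used at most once, and the grid is rectangular. The repository’s version may differ, and the names `exists` and `dfs` are my own.

```rust
/// True if `word` can be traced through horizontally or vertically
/// adjacent cells of a rectangular grid, using each cell at most once.
fn exists(grid: &[Vec<char>], word: &str) -> bool {
    let word: Vec<char> = word.chars().collect();
    if word.is_empty() {
        return true;
    }
    if grid.is_empty() || grid[0].is_empty() {
        return false;
    }
    let mut visited = vec![vec![false; grid[0].len()]; grid.len()];
    for r in 0..grid.len() {
        for c in 0..grid[0].len() {
            if dfs(grid, &word, 0, r, c, &mut visited) {
                return true;
            }
        }
    }
    false
}

/// Try to match word[idx..] starting at cell (r, c).
fn dfs(
    grid: &[Vec<char>],
    word: &[char],
    idx: usize,
    r: usize,
    c: usize,
    visited: &mut Vec<Vec<bool>>,
) -> bool {
    if visited[r][c] || grid[r][c] != word[idx] {
        return false;
    }
    if idx + 1 == word.len() {
        return true;
    }
    visited[r][c] = true;
    let dirs: [(isize, isize); 4] = [(-1, 0), (1, 0), (0, -1), (0, 1)];
    let mut found = false;
    for &(dr, dc) in dirs.iter() {
        let nr = r as isize + dr;
        let nc = c as isize + dc;
        if nr < 0 || nc < 0 || (nr as usize) >= grid.len() || (nc as usize) >= grid[0].len() {
            continue;
        }
        if dfs(grid, word, idx + 1, nr as usize, nc as usize, visited) {
            found = true;
            break;
        }
    }
    visited[r][c] = false; // backtrack so other paths may reuse this cell
    found
}

fn main() {
    let grid: Vec<Vec<char>> = ["ABCE", "SFCS", "ADEE"]
        .iter()
        .map(|row| row.chars().collect())
        .collect();
    for word in ["ABCCED", "SEE", "ABCB"].iter() {
        println!("{} -> {}", word, exists(&grid, word));
    }
}
```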

Originally published on https://rbsomeg.blogspot.com
