[{"data":1,"prerenderedAt":3309},["ShallowReactive",2],{"writing-\u002Fwriting\u002F2026-03-16-anton-04-plumbing-matures":3,"all-writing":113,"related-work-\u002Fwriting\u002F2026-03-16-anton-04-plumbing-matures":3273},{"id":4,"title":5,"body":6,"canonical_url":97,"date":98,"description":12,"extension":99,"meta":100,"navigation":101,"path":102,"seo":103,"series":104,"stem":105,"summary":106,"tags":107,"work_slug":108,"__hash__":112},"writing\u002Fwriting\u002F2026-03-16-anton-04-plumbing-matures.md","Anton, chapter 4: Plumbing matures",{"type":7,"value":8,"toc":88},"minimark",[9,13,18,21,25,28,31,35,51,55,62,69,72,75,78,82,85],[10,11,12],"p",{},"The week opens with one line of config and a quiet decision: the local inference path moves off Ollama onto vLLM. One environment variable changes, and the serving stack underneath changes with it. Real concurrency, proper batching, something that can actually take production load. Ollama got me through the first two weeks; vLLM is what I want sitting under a system that several people are going to lean on every day. The kind of swap that looks trivial in a diff and reshapes everything that runs on top of it.",[14,15,17],"h2",{"id":16},"memory-rebuild","Memory rebuild",[10,19,20],{},"Then memory. The simple Postgres-rows-tagged-by-user store from week one is fine until it isn't, and around now it isn't. I rebuild it around proper retrieval semantics. pg_trgm scored matching instead of vector embeddings, because at personal-knowledge-base scale trigrams are enough and an embedding pipeline is a tax I don't want to pay. Provenance tags so every fact can be traced to its source. Domain-biased retrieval: a calendar query weights calendar facts, a media query weights media facts. An adaptive recall window, short for chit-chat and long for things that look like research. A working-memory scratchpad for the live conversation. 
An LLM pass that paraphrases the query before searching, because the way a question is asked is rarely the way the answer is stored. A few days later I add an episodic layer, summaries of past conversations, but only injected when the query has temporal intent. Always-on episodic memory is context bloat dressed up as helpfulness. The principle that crystallises out of all this is one I keep coming back to: filter at load time, not write time. Store generously, retrieve narrowly.",[14,22,24],{"id":23},"schedules-and-codenames","Schedules and codenames",[10,26,27],{},"Schedules go from a thin BullMQ wrapper to an actual domain. Timezone support, daily-duplicate merging, an execution log, on-demand run, silent mode, a UI that doesn't make me wince. Scheduling deserves to be a first-class citizen because half of what I want Anton to do is recurring: the morning briefing, the weekly review, the reminder to call my godfather. In the same commit LiteLLM lands as the gateway in front of every model provider, because the schedule plus agent combination needs one place to route LLM calls, not many. With LiteLLM comes the codename roster: sunny, haiku, oscar, gandalf, gizmo, gatsby, merlin, gustav. Names instead of provider IDs. It feels mildly silly the first time I type \"ask gandalf\" in a config and it sticks immediately. Models get swapped, deprecated, repriced; the codename is stable. The indirection costs nothing and pays back every time a provider changes something.",[10,29,30],{},"A naming sin from last week catches up with me. I'd called two different things \"skills\": a typed function in code and a reusable prompt template in the database. Two commits clean it up. First, \"skills\" become \"saved prompts\" everywhere. Then I drop the \"saved\" prefix because it adds nothing. Now skill means a typed function with a runtime contract, prompt means a template stored in the database. 
The cost of the cleanup is a couple of hours; the cost of letting the collision live another month would have been far worse. Naming is one of those things where the bill compounds.",[14,32,34],{"id":33},"skill-contracts","Skill contracts",[10,36,37,38,42,43,46,47,50],{},"The runtime contract for skills lands the same day as ",[39,40,41],"code",{},"defineSkill()",". Every skill declares its inputs, outputs, scopes, and handler in one shape, and the skill-runner becomes its own service: a separate container hosting skills as individually deployable units, callable over HTTP. I split it out for four reasons. Hot reload, so a single skill updates without restarting the world. Per-skill metrics: invocation counts, error rates, p50 and p95. Scope isolation, because a leaf skill should not need the parent agent's whole context surface to do one job. And the option of sandboxing later, which is much cheaper to add to a service that's already separate than to one tangled into the worker. A day later every domain unifies onto ",[39,44,45],{},"callSkill()"," and ",[39,48,49],{},"createSkillTool()",". One contract, one entry point, one way to add a new capability.",[14,52,54],{"id":53},"traces-and-permissions","Traces and permissions",[10,56,57,58,61],{},"Traces become first-class. The trace viewer from week one was reading checkpoints out of Redis, which is fine for live debugging and useless for anything historical. I move execution traces into Postgres: every agent run is a row in ",[39,59,60],{},"agent_traces",", queryable from the UI, surviving failures and retries. The next day a small follow-up locks down the invariant: one trace per request, no matter what. Now I can ask questions of the system's own behaviour. What happened in this run, last night, last week. 
The dev loop tightens again.",[10,63,64,65,68],{},"Permission filtering moves into ",[39,66,67],{},"runAgent()"," itself, which closes a class of bugs I keep almost shipping: a scheduled job inheriting wildcard permissions because the filter lived in the wrong place. Every caller (parent agent, scheduled job, mesh probe, the \u002Finvoke endpoint) gets the same filter applied at the same point. The right place to enforce a rule is the place every path has to go through.",[10,70,71],{},"The Invoke tab grows a VueFlow graph that draws the live agent architecture from the runtime configs. It is the first time I can look at Anton and see his shape, not just his logs. Agents on the canvas, tools as edges, the whole thing redrawing as I add or remove a domain. The visualisation pushes a mental model into my head before it's anywhere in the code: agents are the organising principle. Everything else hangs off them.",[10,73,74],{},"Then a refactor I've been wanting for a while. The parent agent had 63 tools attached to it and had started making the kind of mistakes you make when there's too much choice on the table. I restructure it into 10 subsystem delegates, each a small agent of its own. The parent's job becomes routing and synthesis, not calling everything directly. Six times fewer tools at the top level, and the answers get sharper immediately. It's the same lesson as last weekend's calendar saga, scaled up: the LLM is good at picking the right thing from a small menu and bad at picking the right thing from a long one. Give it a small menu.",[10,76,77],{},"Output validation lands in the agent loop, with intermediate messages and deterministic tool confirmations. It kills an entire family of \"tool succeeded, response is empty\" bugs that used to surface as a silent Anton, which is the worst kind of failure mode in a chat interface. Then prime directives: a small set of immutable rules the agent loop enforces above any individual prompt. 
The first version is verbose; I rewrite each one to a single line. Directives sit at the top of the prompt-precedence stack: directives, agent prompt, prompt template, user message. The non-negotiable rules live in code; everything else is editable.",[14,79,81],{"id":80},"prompts-as-data","Prompts as data",[10,83,84],{},"Late in the week, the move that ties the rest together. All agent prompts go into the database. No hardcoded fallbacks. Editable from the UI, versioned, one row per agent. Anton's behaviour stops being something I deploy and starts being something I configure. Want a different tone for the calendar agent? Edit the row. Want to test a new prompt for research? Save a version, run the suite, keep it or roll back. The principle is the same one as memory: keep behaviour as data, not code, and you keep the option to change your mind cheaply.",[10,86,87],{},"By Sunday night Anton is a different shape than he was on Monday. Inference is on vLLM. Models route through LiteLLM under codenames I picked in an evening. Memory has retrieval that respects domain and intent. Schedules are a real domain. Skills are a typed contract running in their own service with hot reload and metrics. Traces are queryable rows. Permissions are enforced at one chokepoint. Prompts live in the database. The parent has ten delegates instead of sixty-three tools. The week's lesson, the one I'm taking forward: the work that pays the most is the work that turns implicit conventions into explicit contracts. Once a thing has a shape on disk and an entry point in code, you can change anything around it without fear. 
Plumbing is invisible until it isn't, and this week was almost entirely plumbing.",{"title":89,"searchDepth":90,"depth":90,"links":91},"",2,[92,93,94,95,96],{"id":16,"depth":90,"text":17},{"id":23,"depth":90,"text":24},{"id":33,"depth":90,"text":34},{"id":53,"depth":90,"text":54},{"id":80,"depth":90,"text":81},null,"2026-03-16","md",{},true,"\u002Fwriting\u002F2026-03-16-anton-04-plumbing-matures",{"title":5,"description":12},"anton-journey","writing\u002F2026-03-16-anton-04-plumbing-matures","A week of turning implicit conventions into explicit contracts: vLLM, LiteLLM, memory, traces, prompts as data.",[108,109,110,111],"anton","agents","ai","infrastructure","8WcF6rDf6JZIZxtfBjGvTjohwAVtmLei6lhFHb6PkAI",[114,201,296,929,1002,1059,1193,1333,1414,1453,1584,1793,1897],{"id":115,"title":116,"body":117,"canonical_url":97,"date":193,"description":121,"extension":99,"meta":194,"navigation":101,"path":195,"seo":196,"series":104,"stem":197,"summary":198,"tags":199,"work_slug":108,"__hash__":200},"writing\u002Fwriting\u002F2026-03-07-anton-01-genesis.md","Anton, chapter 1: Genesis",{"type":7,"value":118,"toc":187},[119,122,132,136,143,147,154,158,161,164,167,171,178,181,184],[10,120,121],{},"I have a simple idea: build an assistant more reliable and secure than OpenClaw, which I find frustrating and terrifying at the same time. What better use case than a family assistant that handles the complexities of running a family of 6. If I can reliably replace myself for some common chores then I've won. What's more, I have this DGX Spark sitting at home that seems like the perfect host for my new assistant. I call him Anton. In reference to the series Silicon Valley.",[10,123,124,125,46,128,131],{},"The day starts where I tend to start: with a contract, not code. 
By 09:25 the repo has a pnpm + Turborepo monorepo, TypeScript throughout, empty ",[39,126,127],{},"apps\u002F",[39,129,130],{},"packages\u002F",", an implementation plan, and a shared types package as the central contract surface. Nothing that runs. But the shape of what runs is written down first. I always do this. Agreeing with yourself on the interfaces upfront is dirt cheap; refactoring three weeks later is not.",[14,133,135],{"id":134},"the-household","The household",[10,137,138,139,142],{},"Then the household map. WhatsApp as the entry point because that's where the family already lives. The Spark over Tailscale as the host, because everything about Anton (his memory, his data, his voice) is going to be physically close to me, not in someone else's cloud. The TNAS, Plex, Transmission, the ",[39,140,141],{},"gws"," CLI for Google Workspace: all the things he needs to be useful. Docker Compose with three services: Postgres, WhatsApp, Worker. A deploy script that rsyncs the lot to the Spark. No staging, no laptop. I want to be in production from hour one. The real thing.",[14,144,146],{"id":145},"identity-as-data","Identity as data",[10,148,149,150,153],{},"Then a question I've been chewing on for weeks: identity. Is Anton a tool or a character? I don't want a system prompt baked into a string somewhere in the code. I want the personality to be a document (",[39,151,152],{},"identity.md",") loaded at runtime. Anton's personality and my context as data, not code. The reason is practical more than philosophical: if I change my mind about who Anton is, or my family does, or a kid moves out, I want to edit a file, not push code. Treating context as data feels like the kind of decision that compounds.",[14,155,157],{"id":156},"afternoon-capabilities","Afternoon capabilities",[10,159,160],{},"Afternoon is capabilities, three commits between 16:00 and 17:30. Media first: Plex search, Transmission control. Then calendar, Google Workspace, voice transcription. 
I almost cut voice. I'm not sure the family will actually use it, and the day's budget is tight. In the end I leave it in, on the principle that when you have room to integrate a maybe, you integrate it. The cost of finding out is the same as the cost of guessing wrong, and the upside is asymmetric.",[10,162,163],{},"I also add a memory layer. Deliberately simple: Postgres rows tagged by user. No embeddings, no fancy retrieval. The rule I have in mind is to build the simplest version that lets me see what the system actually wants, then design for that. Memory will need real retrieval semantics eventually, but I don't know what shape yet, and the worst thing I can do is guess.",[10,165,166],{},"For the orchestration question (how the parent agent should route between domains) I pick LangGraph. It's a great choice. Structure, an observable state machine, checkpointing, a clean way to express subgraphs per domain with classify-and-dispatch at the top. Reliable, traceable, well thought out. I'm glad to have a framework that already has answers for problems I haven't yet hit.",[14,168,170],{"id":169},"first-production-bug","First production bug",[10,172,173,174,177],{},"Then the first production bug, four hours in. WhatsApp is throwing \"bad encryption\" errors on incoming messages (a Baileys quirk). Five minutes reading the library, the fix is wiring ",[39,175,176],{},"getMessage"," for retry decryption. What I notice isn't the fix. It's that the system is real enough by mid-afternoon to throw production bugs in the first place. Stub systems don't fail like that.",[10,179,180],{},"Evening is orchestration. BullMQ + Redis schedule queue. And then a refactor I'm framing as a cleanup: a commit that makes WhatsApp a thin shim that enqueues jobs, and lets the worker own all the agent logic. It just feels right to keep transport-layer concerns out of the agent. 
Cleaner that way.",[10,182,183],{},"By midnight Anton can take a WhatsApp message, route it through Postgres-tracked history, dispatch to a domain, run scheduled jobs from the database, self-update via an API endpoint, and redeploy from one shell script. Usable end to end.",[10,185,186],{},"I go to bed satisfied. The system is real, deployed on the Spark, talking to the family group on WhatsApp, with seven domains' worth of subgraphs registered and a quiet schedule queue waiting for work. It has the shape I wanted going in: a personal assistant living close to my data, on my hardware, with a personality I can edit as a file.",{"title":89,"searchDepth":90,"depth":90,"links":188},[189,190,191,192],{"id":134,"depth":90,"text":135},{"id":145,"depth":90,"text":146},{"id":156,"depth":90,"text":157},{"id":169,"depth":90,"text":170},"2026-03-07",{},"\u002Fwriting\u002F2026-03-07-anton-01-genesis",{"title":116,"description":121},"writing\u002F2026-03-07-anton-01-genesis","Building Anton, a personal agent OS for my family on a DGX Spark, day one.",[108,109,110],"joDG0mX9JDmKMkdmV0TLA8ToxC2iEqCEtvgDfVrzvpU",{"id":202,"title":203,"body":204,"canonical_url":97,"date":288,"description":208,"extension":99,"meta":289,"navigation":101,"path":290,"seo":291,"series":104,"stem":292,"summary":293,"tags":294,"work_slug":108,"__hash__":295},"writing\u002Fwriting\u002F2026-03-08-anton-02-first-weekend.md","Anton, chapter 2: The first weekend",{"type":7,"value":205,"toc":281},[206,209,213,216,220,226,229,233,248,252,255,258,261,265,272,275,278],[10,207,208],{},"I wake up Saturday morning and the first thing I want to fix is the parent. The classify-and-dispatch graph from yesterday does the job, but it asks the LLM to make routing decisions inside a state machine that's already trying to do the routing itself. Two layers fighting over the same job. 
I want one.",[14,210,212],{"id":211},"agent-with-tools","Agent with tools",[10,214,215],{},"The first commit of day two rips the parent out and rewrites it as an agent-with-tools: a single LLM with all the subgraphs exposed as tools, choosing what to call from natural language. The classify-then-route pattern stays useful for individual domain classifiers, but the parent is done with it. By breakfast the system feels lighter. The LLM at the top is doing what it's good at (picking the right tool for the job), and the framework underneath is doing what it's good at (holding everything else).",[14,217,219],{"id":218},"family-grade-polish","Family-grade polish",[10,221,222,223,225],{},"The next handful of commits are the small things that turn an assistant into something the family can actually use. Typing indicator and a startup message so people see acknowledgement before the LLM finishes thinking. Language matching: Anton replies in whatever language the current message is in, not the historical thread language. French and English freely mixed in this house. The ",[39,224,141],{}," Google Workspace auth gets an onboarding guide so someone other than me can wire it up. Schedules become natural language: \"remind me to X every morning\" is just an Anton command now, not a separate language. Group chat support, daily summaries, conversation browsing. The honesty guidelines land too: Anton should never make things up to sound competent.",[10,227,228],{},"The first real browser-driven domain lands the same day: Doctolib login with 2FA. The interactive input queue lets Anton ask the user mid-flow for the SMS code, then resume where he left off. It's a small mechanism. It feels right immediately, in the way something does when it solves a class of problem you didn't quite know how to name yet.",[14,230,232],{"id":231},"saturday-observability","Saturday observability",[10,234,235,236,239,240,243,244,247],{},"Saturday night is observability. 
Three commits between 22:30 and midnight. A debug→issue pipeline so error traces auto-draft GitHub issues, because I'd rather have the issue file itself than wake up Sunday to no record of what broke. A ",[39,237,238],{},"\u002Flogs"," endpoint backed by an in-memory ring buffer instead of ",[39,241,242],{},"docker logs",", because reaching into Docker every time I want to see what's happening is friction that adds up. And the trigger-file deploy protocol: instead of ",[39,245,246],{},"\u002Fupdate"," shelling out synchronously, a systemd watcher polls for a trigger file, runs the deploy, reports back. Decoupled, boring, reliable. The kind of plumbing that disappears the moment it works.",[14,249,251],{"id":250},"the-lcars-dashboard","The LCARS dashboard",[10,253,254],{},"Sunday morning is the LCARS dashboard. Star Trek themed Nuxt UI in one commit. Service probes, log viewer, conversation browser, mobile-first layout. The next nine commits are mostly the UI fixing itself: proxy semantics, env prefix, mobile layout, trace viewer, modal overlay parsing checkpoints, expandable drill-down. By lunch I can look at any conversation, drill into any agent run, and see what the LLM is thinking at each step. The trace viewer is what I'll lean on for the rest of the weekend. Without it, the next twelve hours don't happen.",[10,256,257],{},"Then knowledge surface expansion in two hours. Anton can grep his own source code now. He can do live web research with Grok. Research becomes its own subgraph with budget control, because I don't want it living inside other domain agents (research has its own concerns: budget, citation, fact-check). Web browsing and a document knowledge store. Google OAuth wired through the worker with a UI settings page. By Sunday afternoon Anton can read the web, read his own code, and remember what he reads.",[10,259,260],{},"Sunday evening: the quality test suite. 28 test cases across all domains. 
This is the inflection point where shipping changes stops being \"did the WhatsApp message look right\" and starts being \"did the regression battery still pass.\" I should have done this on day one. I didn't, because day one was already too full. The gap between not having a test suite and having one is the gap between hoping and knowing.",[14,262,264],{"id":263},"the-calendar-saga","The calendar saga",[10,266,267,268,271],{},"The next ten hours are the suite finding bugs and the bugs getting fixed. The big one is the calendar saga. Six commits chasing the same failure mode: the calendar agent keeps producing wrong answers when I ask it to do anything multi-step (\"delete the event titled X\"). Two root causes. First, the parent's full conversation history is being passed into the calendar agent, contaminating its context with everything else that's been said. Second, the LLM can't reliably chain a search-then-delete in one go: it does the search, returns the result, and stops. The fix on the first is a strict rule: don't pass conversationHistory to domain agents. The fix on the second is structural: don't rely on the LLM to chain multi-step tool calls; build composite skills (one ",[39,269,270],{},"findAndDeleteEvent"," instead of two separate tools). Both rules go into MEMORY.md the same evening. They're the kind you only learn by getting burned.",[10,273,274],{},"Sunday night through Monday morning, the domains broaden. A wine collection lands as a typed table, the first user-facing collection. School messages domain. Group images and audio when @mentioned, with an \"ingest cheap, process lazy\" pattern: store the raw blob, only invoke vision or transcription when someone actually @-asks for it. Smarter media download flow with release preferences and stop\u002Fresume. 
Torrent searches routed through Tor SOCKS proxy, the only network egress decision driven by operational caution rather than feature need.",[10,276,277],{},"By Monday morning Anton has 73 quality tests, seven domains, a UI that drills into trace checkpoints, and a dev loop I can actually trust: trace viewer plus quality suite plus auto-issue pipeline plus ring-buffer logs plus LCARS. I can change anything and see what breaks.",[10,279,280],{},"The weekend's lesson, the one I'm taking into next week: write the tests before the bugs do. The calendar saga cost me an evening of debugging that the suite would have caught in seconds. From now on, every domain ships with regression coverage. Not because I want to be disciplined. Because I've now experienced the cost of not being.",{"title":89,"searchDepth":90,"depth":90,"links":282},[283,284,285,286,287],{"id":211,"depth":90,"text":212},{"id":218,"depth":90,"text":219},{"id":231,"depth":90,"text":232},{"id":250,"depth":90,"text":251},{"id":263,"depth":90,"text":264},"2026-03-08",{},"\u002Fwriting\u002F2026-03-08-anton-02-first-weekend",{"title":203,"description":208},"writing\u002F2026-03-08-anton-02-first-weekend","A first full weekend of building turns Anton into something the family can actually use.",[108,109,110],"sUIp73ttugbZ0i1HqI2logHtd14nhBXW7ig0vP2e__g",{"id":297,"title":298,"body":299,"canonical_url":97,"date":916,"description":917,"extension":99,"meta":918,"navigation":101,"path":919,"seo":920,"series":921,"stem":922,"summary":923,"tags":924,"work_slug":97,"__hash__":928},"writing\u002Fwriting\u002F2026-03-10-dgx-spark-vllm.md","50+ tokens per second on a desktop: running LLMs on the NVIDIA DGX Spark",{"type":7,"value":300,"toc":903},[301,308,311,314,317,321,324,343,346,350,353,373,376,379,383,390,393,413,416,420,423,449,456,460,471,485,496,500,596,603,606,610,613,786,792,796,802,808,814,820,833,839,843,850,853,857,886,890,899],[10,302,303,307],{},[304,305,306],"strong",{},"TL;DR:"," We got a 
30-billion-parameter LLM running at 51-54 tokens\u002Fsec on the NVIDIA DGX Spark by combining Mixture-of-Experts architecture, FP8 quantization, and a community Docker image that fixes Blackwell-specific issues. Here's what we learned.",[309,310],{},[10,312,313],{},"The NVIDIA DGX Spark is an interesting machine. It packs a Blackwell GB10 GPU with 128GB of unified LPDDR5X memory into a desktop form factor. For XRPL Commons, we wanted local LLM inference for our development workflow: fast enough to be usable, private enough to run on-premises, and simple enough to replicate across machines.",[10,315,316],{},"Getting there was not straightforward. This post documents the journey from 3.7 tok\u002Fs (unusable) to 54 tok\u002Fs (excellent), and the key technical decisions that made the difference.",[14,318,320],{"id":319},"the-hardware","The Hardware",[10,322,323],{},"The DGX Spark ships with:",[325,326,327,331,334,337,340],"ul",{},[328,329,330],"li",{},"NVIDIA GB10 Blackwell GPU (SM 12.1)",[328,332,333],{},"128GB unified LPDDR5X at 273 GB\u002Fs bandwidth",[328,335,336],{},"ARM Grace CPU (aarch64), 10 cores",[328,338,339],{},"3.7TB NVMe storage",[328,341,342],{},"DGX OS (Ubuntu 24.04)",[10,344,345],{},"128GB of unified memory means you can fit very large models. But there's a catch.",[14,347,349],{"id":348},"the-bandwidth-wall","The Bandwidth Wall",[10,351,352],{},"Single-stream LLM inference is memory-bandwidth-bound. During autoregressive decoding, each token requires reading every active weight from memory once. At 273 GB\u002Fs, the math is simple:",[325,354,355,364],{},[328,356,357,360,361],{},[304,358,359],{},"Dense 32B model (bf16):"," 64GB of weights \u002F 273 GB\u002Fs = ~234ms per token = ",[304,362,363],{},"~4 tok\u002Fs",[328,365,366,369,370],{},[304,367,368],{},"Dense 8B model (bf16):"," 16GB \u002F 273 GB\u002Fs = ~59ms = ",[304,371,372],{},"~17 tok\u002Fs",[10,374,375],{},"No amount of compute optimization changes this. 
The Spark can hold a 70B model in FP8, but it will generate tokens at walking pace. The memory is large but not fast.",[10,377,378],{},"We learned this the hard way. Our first attempt, Qwen3-32B at bf16, produced 3.7 tokens per second. Qwen3-8B was better at 13.1 tok\u002Fs, but still below the threshold for interactive use.",[14,380,382],{"id":381},"the-moe-breakthrough","The MoE Breakthrough",[10,384,385,386,389],{},"The solution is ",[304,387,388],{},"Mixture-of-Experts (MoE)"," models. An MoE model has many total parameters but only activates a fraction per token. Qwen3-30B-A3B has 30 billion parameters but only 3 billion active ones: the router activates a small subset of experts per token, leaving the rest idle in memory.",[10,391,392],{},"The bandwidth math changes completely:",[325,394,395,404],{},[328,396,397,400,401],{},[304,398,399],{},"MoE 30B, 3B active (bf16):"," ~6GB active weights \u002F 273 GB\u002Fs = ~22ms = ",[304,402,403],{},"~45 tok\u002Fs theoretical",[328,405,406,409,410],{},[304,407,408],{},"MoE 30B, 3B active (FP8):"," ~3GB active weights \u002F 273 GB\u002Fs = ~11ms = ",[304,411,412],{},"~90 tok\u002Fs theoretical",[10,414,415],{},"You get close to the quality of a 30B model at roughly the speed of a 3B model.",[14,417,419],{"id":418},"the-software-stack-problem","The Software Stack Problem",[10,421,422],{},"The DGX Spark's Blackwell GPU (SM 12.1) is new enough that upstream tooling doesn't fully support it:",[325,424,425,431,437,443],{},[328,426,427,430],{},[304,428,429],{},"Flash Attention 2"," crashes with a PTX toolchain error",[328,432,433,436],{},[304,434,435],{},"vLLM's MoE CUTLASS kernels"," don't include SM 12.1 in their architecture intersection lists",[328,438,439,442],{},[304,440,441],{},"PyTorch"," officially supports up to SM 12.0",[328,444,445,448],{},[304,446,447],{},"CUDA graphs",", critical for throughput, simply don't work with a standard vLLM build",[10,450,451,452,455],{},"We spent considerable time on a manual vLLM build from source: 
patching CMakeLists.txt, building Triton from a specific commit, working around setuptools license field validation bugs, pinning transformers below 5.0 to avoid tokenizer breakage. The manual build worked but required ",[39,453,454],{},"--enforce-eager"," mode (no CUDA graphs), capping throughput at ~30 tok\u002Fs.",[14,457,459],{"id":458},"the-avarok-docker-image","The Avarok Docker Image",[10,461,462,463,470],{},"The ",[464,465,469],"a",{"href":466,"rel":467},"https:\u002F\u002Fgithub.com\u002FAvarok-Cybersecurity\u002Fdgx-vllm",[468],"nofollow","Avarok dgx-vllm project"," solves all of this in a single Docker image. It includes:",[325,472,473,476,479,482],{},[328,474,475],{},"A patched vLLM v0.16.0rc2 with SM 12.1 support",[328,477,478],{},"Software E2M1 conversion for the missing NVFP4 PTX instruction",[328,480,481],{},"Custom CUTLASS kernels for the GB10",[328,483,484],{},"Working CUDA graphs and Flash Attention",[10,486,487,488,491,492,495],{},"One ",[39,489,490],{},"docker pull"," and one ",[39,493,494],{},"docker run"," command replaces hours of manual compilation.",[14,497,499],{"id":498},"results","Results",[501,502,503,522],"table",{},[504,505,506],"thead",{},[507,508,509,513,516,519],"tr",{},[510,511,512],"th",{},"Model",[510,514,515],{},"Quantization",[510,517,518],{},"Engine",[510,520,521],{},"Tokens\u002Fsec",[523,524,525,540,552,564,576],"tbody",{},[507,526,527,531,534,537],{},[528,529,530],"td",{},"Qwen3-32B (dense)",[528,532,533],{},"bf16",[528,535,536],{},"Manual vLLM",[528,538,539],{},"3.7",[507,541,542,545,547,549],{},[528,543,544],{},"Qwen3-8B (dense)",[528,546,533],{},[528,548,536],{},[528,550,551],{},"13.1",[507,553,554,557,559,561],{},[528,555,556],{},"Qwen3-30B-A3B (MoE)",[528,558,533],{},[528,560,536],{},[528,562,563],{},"28.6",[507,565,566,568,570,573],{},[528,567,556],{},[528,569,533],{},[528,571,572],{},"Avarok 
Docker",[528,574,575],{},"30.3",[507,577,578,582,587,591],{},[528,579,580],{},[304,581,556],{},[528,583,584],{},[304,585,586],{},"FP8",[528,588,589],{},[304,590,572],{},[528,592,593],{},[304,594,595],{},"51-54",[10,597,598,599,602],{},"The winning combination: ",[304,600,601],{},"MoE architecture + FP8 quantization + Avarok Docker with CUDA graphs",".",[10,604,605],{},"We've deployed this setup across two DGX Sparks with consistent results. The FP8 model uses ~110GB of the 119GB available memory, leaving minimal headroom, but the throughput is worth it.",[14,607,609],{"id":608},"the-setup","The Setup",[10,611,612],{},"The final deployment is remarkably simple:",[614,615,619],"pre",{"className":616,"code":617,"language":618,"meta":89,"style":89},"language-bash shiki shiki-themes min-light","docker pull avarok\u002Fdgx-vllm-nvfp4-kernel:v22\n\ndocker run -d \\\n  --name vllm \\\n  --gpus all \\\n  --shm-size=16g \\\n  --restart unless-stopped \\\n  -p 8000:8888 \\\n  -v \u002Fhome\u002F$USER\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  -e MODEL=Qwen\u002FQwen3-30B-A3B-Instruct-2507-FP8 \\\n  -e PORT=8888 \\\n  -e GPU_MEMORY_UTIL=0.85 \\\n  -e MAX_MODEL_LEN=32768 \\\n  avarok\u002Fdgx-vllm-nvfp4-kernel:v22 serve\n","bash",[39,620,621,637,642,657,668,679,687,698,709,726,737,751,764,777],{"__ignoreMap":89},[622,623,626,630,634],"span",{"class":624,"line":625},"line",1,[622,627,629],{"class":628},"s7eDp","docker",[622,631,633],{"class":632},"sY4mW"," pull",[622,635,636],{"class":632}," avarok\u002Fdgx-vllm-nvfp4-kernel:v22\n",[622,638,639],{"class":624,"line":90},[622,640,641],{"emptyLinePlaceholder":101},"\n",[622,643,645,647,650,653],{"class":624,"line":644},3,[622,646,629],{"class":628},[622,648,649],{"class":632}," run",[622,651,652],{"class":632}," -d",[622,654,656],{"class":655},"sR6ew"," \\\n",[622,658,660,663,666],{"class":624,"line":659},4,[622,661,662],{"class":632},"  --name",[622,664,665],{"class":632}," 
vllm",[622,667,656],{"class":655},[622,669,671,674,677],{"class":624,"line":670},5,[622,672,673],{"class":632},"  --gpus",[622,675,676],{"class":632}," all",[622,678,656],{"class":655},[622,680,682,685],{"class":624,"line":681},6,[622,683,684],{"class":632},"  --shm-size=16g",[622,686,656],{"class":655},[622,688,690,693,696],{"class":624,"line":689},7,[622,691,692],{"class":632},"  --restart",[622,694,695],{"class":632}," unless-stopped",[622,697,656],{"class":655},[622,699,701,704,707],{"class":624,"line":700},8,[622,702,703],{"class":632},"  -p",[622,705,706],{"class":632}," 8000:8888",[622,708,656],{"class":655},[622,710,712,715,718,721,724],{"class":624,"line":711},9,[622,713,714],{"class":632},"  -v",[622,716,717],{"class":632}," \u002Fhome\u002F",[622,719,720],{"class":655},"$USER",[622,722,723],{"class":632},"\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface",[622,725,656],{"class":655},[622,727,729,732,735],{"class":624,"line":728},10,[622,730,731],{"class":632},"  -e",[622,733,734],{"class":632}," MODEL=Qwen\u002FQwen3-30B-A3B-Instruct-2507-FP8",[622,736,656],{"class":655},[622,738,740,742,745,749],{"class":624,"line":739},11,[622,741,731],{"class":632},[622,743,744],{"class":632}," PORT=",[622,746,748],{"class":747},"s9AOD","8888",[622,750,656],{"class":655},[622,752,754,756,759,762],{"class":624,"line":753},12,[622,755,731],{"class":632},[622,757,758],{"class":632}," GPU_MEMORY_UTIL=",[622,760,761],{"class":747},"0.85",[622,763,656],{"class":655},[622,765,767,769,772,775],{"class":624,"line":766},13,[622,768,731],{"class":632},[622,770,771],{"class":632}," MAX_MODEL_LEN=",[622,773,774],{"class":747},"32768",[622,776,656],{"class":655},[622,778,780,783],{"class":624,"line":779},14,[622,781,782],{"class":632},"  avarok\u002Fdgx-vllm-nvfp4-kernel:v22",[622,784,785],{"class":632}," serve\n",[10,787,788,789,602],{},"First boot takes 10-20 minutes (model download + CUDA graph capture). 
After that, it auto-starts on reboot and serves an OpenAI-compatible API at ",[39,790,791],{},"http:\u002F\u002Flocalhost:8000\u002Fv1",[14,793,795],{"id":794},"lessons-learned","Lessons Learned",[10,797,798,801],{},[304,799,800],{},"1. Understand your bottleneck."," The Spark's 273 GB\u002Fs bandwidth determines everything. Once we understood this, the model selection became obvious, MoE with minimal active parameters.",[10,803,804,807],{},[304,805,806],{},"2. Don't build from source if you don't have to."," Our manual vLLM build took hours of debugging across multiple sessions. The Avarok Docker image does everything better and in one command.",[10,809,810,813],{},[304,811,812],{},"3. FP8 quantization is nearly free."," The jump from bf16 to FP8 nearly doubled throughput (30.3 to 51 tok\u002Fs on the same engine) with no perceptible quality difference for our use cases.",[10,815,816,819],{},[304,817,818],{},"4. Stop Ollama first."," On one Spark, Ollama was consuming ~100GB of memory when we tried to install vLLM. The build process OOM-killed the machine. Disable competing inference servers before starting.",[10,821,822,825,826,829,830,602],{},[304,823,824],{},"5. Kernel updates break NVIDIA drivers."," DGX OS auto-updates the kernel, but the NVIDIA modules don't follow automatically. After a reboot, ",[39,827,828],{},"nvidia-smi"," may fail. The fix is ",[39,831,832],{},"sudo apt install linux-modules-nvidia-580-open-$(uname -r)",[10,834,835,838],{},[304,836,837],{},"6. Community Docker images can be ahead of official ones."," The Avarok image runs vLLM v0.16.0rc2 with Blackwell fixes, months ahead of where NVIDIA's own builds are.",[14,840,842],{"id":841},"whats-next","What's Next",[10,844,845,846,849],{},"Community results suggest AWQ 4-bit quantization can push the same model to ",[304,847,848],{},"82 tok\u002Fs",". NVIDIA's own NVFP4-quantized models (like Qwen3-Next-80B-A3B) report even better quality at ~67 tok\u002Fs average. 
As toolchain support matures, these numbers should keep improving.",[10,851,852],{},"For now, 51-54 tok\u002Fs with a 30B-parameter MoE model is fast enough for interactive coding assistance, document drafting, and general-purpose use, all running locally on a desktop machine.",[14,854,856],{"id":855},"resources","Resources",[325,858,859,865,872,879],{},[328,860,861],{},[464,862,864],{"href":466,"rel":863},[468],"Avarok dgx-vllm Docker project",[328,866,867],{},[464,868,871],{"href":869,"rel":870},"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Instruct-2507-FP8",[468],"Qwen3-30B-A3B-FP8 on HuggingFace",[328,873,874],{},[464,875,878],{"href":876,"rel":877},"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fproducts\u002Fworkstations\u002Fdgx-spark\u002F",[468],"NVIDIA DGX Spark product page",[328,880,881],{},[464,882,885],{"href":883,"rel":884},"https:\u002F\u002Fdocs.vllm.ai\u002F",[468],"vLLM documentation",[14,887,889],{"id":888},"try-it-yourself","Try It Yourself",[10,891,892,893,898],{},"If you have a DGX Spark, the Docker approach takes about 20 minutes from zero to serving. Pull the Avarok image, run the container with the command above, and you're up. 
Reach out to us at ",[464,894,897],{"href":895,"rel":896},"https:\u002F\u002Fxrpl-commons.org",[468],"XRPL Commons"," if you want the full setup guide with troubleshooting details.",[900,901,902],"style",{},"html pre.shiki code .s7eDp, html code.shiki .s7eDp{--shiki-default:#6F42C1}html pre.shiki code .sY4mW, html code.shiki .sY4mW{--shiki-default:#2B5581}html pre.shiki code .sR6ew, html code.shiki .sR6ew{--shiki-default:#24292EFF}html pre.shiki code .s9AOD, html code.shiki .s9AOD{--shiki-default:#1976D2}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":89,"searchDepth":90,"depth":90,"links":904},[905,906,907,908,909,910,911,912,913,914,915],{"id":319,"depth":90,"text":320},{"id":348,"depth":90,"text":349},{"id":381,"depth":90,"text":382},{"id":418,"depth":90,"text":419},{"id":458,"depth":90,"text":459},{"id":498,"depth":90,"text":499},{"id":608,"depth":90,"text":609},{"id":794,"depth":90,"text":795},{"id":841,"depth":90,"text":842},{"id":855,"depth":90,"text":856},{"id":888,"depth":90,"text":889},"2026-03-10","TL;DR: We got a 30-billion-parameter LLM running at 51-54 tokens\u002Fsec on the NVIDIA DGX Spark by combining Mixture-of-Experts architecture, FP8 quantization, and a community Docker image that fixes Blackwell-specific issues. 
Here's what we learned.",{},"\u002Fwriting\u002F2026-03-10-dgx-spark-vllm",{"title":298,"description":917},"dgx-spark","writing\u002F2026-03-10-dgx-spark-vllm","Standing up a 30B-parameter LLM at 50+ tok\u002Fs on the NVIDIA DGX Spark, the technical journey.",[925,111,926,927],"llm","dgx","vllm","3Fo-D07poqqdTBtl1MvYhyBzqbXPJDQAui6bM-fGIMc",{"id":930,"title":931,"body":932,"canonical_url":97,"date":994,"description":936,"extension":99,"meta":995,"navigation":101,"path":996,"seo":997,"series":104,"stem":998,"summary":999,"tags":1000,"work_slug":108,"__hash__":1001},"writing\u002Fwriting\u002F2026-03-13-anton-03-domains-widen.md","Anton, chapter 3: Domains widen, browser hardens",{"type":7,"value":933,"toc":987},[934,937,941,944,947,951,954,957,961,964,968,971,975,978,981,984],[10,935,936],{},"The week opens on a piece of unfinished business. Doctolib search works. Detail enrichment does not. I want to drill into the appointment detail page from the search results so Anton can tell the family what's actually available, not just that something exists. The first attempt clicks back into the result. The second switches to direct navigation to avoid stale handles. The third adds diagnostic logging, then a 60 second budget, then per-page timeouts. Each fix surfaces the next failure. By mid-morning the answer is unambiguous: Cloudflare is fingerprinting the browser as a bot and blocking the navigation entirely. I disable detail enrichment and file the issue. The cleverness has to move somewhere else.",[14,938,940],{"id":939},"persistent-profiles","Persistent profiles",[10,942,943],{},"The unlock comes from changing the question. Instead of asking \"how do I get into the detail page\", I ask \"what does a real browser look like that this one doesn't\". The answer is persistent profiles. A real Chromium, not the bundled headless one, running in a profile directory that keeps cookies, history, and the small thousand fingerprints that accumulate over time. 
Once the browser is allowed to act like a browser, Cloudflare stops fingerprinting it. Anti-detection flags help at the margins, but the real fix is identity continuity: a session the site recognizes as a returning human, not a fresh anonymous request from nowhere.",[10,945,946],{},"Then the second move, which I like more. I rewrite enrichment to skip the detail page entirely. The list page already contains most of what we want, in unstructured text. So I feed the list page text to a single LLM call and ask for the structured fields back. One call. Cheaper than per-card DOM traversal, faster, and it doesn't fight the site. The lesson worth keeping: when the deterministic path costs you a battle with the host, lift the work up one level and let the model read the raw text. The LLM is the cheapest unit of work I have. I should be using it where it pays.",[14,948,950],{"id":949},"lifting-work-to-the-llm","Lifting work to the LLM",[10,952,953],{},"The browser also needs to live somewhere. By midweek it lands in its own service container, dedicated, isolated, with the persistent profile mounted as a volume. CDP wiring takes a handful of commits to settle (a TCP proxy because Chromium binds where it likes, an ESM import detail, the WebSocket URL, stale lock cleanup). When it's done, every browser-touching domain (Doctolib, the syndic site, the consulate, generic web) shares the same hardened browser. One container, one profile per site, one place to fix things.",[10,955,956],{},"Alongside the browser saga, the media subsystem grows the features that turn it from a toy into something the family actually uses. The bag of media tools gets redesigned around an intent-driven request shape with four tools (status, library, watch, search). The watchlist gets follow and unfollow and a triage job that scans the catalog and tells you what's worth attention. The validator learns to respect the scope of a request so it stops retrying things that were never in scope. 
Show triage starts as \"latest season, 14 day lookback\" and then becomes \"the last episode Plex actually has\", which is a small but characteristic move: stop reasoning from arbitrary windows, reason from state.",[14,958,960],{"id":959},"movie-night","Movie night",[10,962,963],{},"The headline media feature is movie night. A scheduled job, Friday at six, that picks two or three movies for the family and posts them to the group. It's the first proactive message Anton sends. Not a reply, not an answer to a question, but an unsolicited suggestion at a fixed time. Six iterations of prompt refinement in two hours to land the tone. The early drafts close with \"want me to download any of these?\", which nobody asked for, and which makes the message read like a salesman. The scheduled-output rules go into the system prompt that same afternoon: no follow-up suggestions, no filler, follow the spec. A scheduled message arrives on its own terms or it doesn't arrive at all.",[14,965,967],{"id":966},"pluggable-domains","Pluggable domains",[10,969,970],{},"The 13th's evening commit is the one I'm proudest of structurally. Domain tools get refactored into pluggable modules. Each domain registers itself with a small definition shape; the parent agent's tool surface is built dynamically from whatever modules are present. The parent stops knowing about specific domains. It just knows it has tools. Two days later the syndic domain (the building management portal) gets registered as a new domain in three lines. Three. That's the kind of moment that tells you the abstraction was the right one. When the marginal cost of a new domain drops to nothing, you've found the seam.",[14,972,974],{"id":973},"collections-substrate","Collections substrate",[10,976,977],{},"Then the collections substrate. Wines landed earlier as a typed table, the first user-facing collection. By the 15th the pattern is going to repeat: contacts, books, restaurants, anything else the family wants to remember. 
Rather than write a typed table per collection forever, I land a generic collections table backed by JSONB items, with one set of tools (add, search, update) that works for everything. The trick that makes generic tools actually usable is putting the collection's field schema into the tool description itself. The LLM reads the description, knows what shape \"wine\" items take versus \"restaurant\" items, and adapts. Wines get migrated as the first use case. The typed table goes away. One substrate, many collections.",[10,979,980],{},"Skills v0 lands the same week. A skills table in the database, an admin command set, a small UI listing, seed data. At this point \"skill\" means saved prompt: a reusable command template like \"weekly review\" or \"write a Google Doc with this structure\". The point is to stop pasting the same long instructions into chat over and over and start treating them as named, reusable, edit-in-the-database artifacts. There's an unfortunate terminology overlap with the typed-function skills package, which I'll have to clean up. For now the value is real: a prompt I want to keep is a row I can edit, not a string I have to track down.",[10,982,983],{},"Smaller things that earn their place: a weather skill backed by Open-Meteo, because asking Anton about the weather should not require a detour through web search. Google Docs creation with auto-sharing, because the family already lives in Drive. Collection lifecycle tests added to the quality suite, because everything that ships now ships with regression coverage (the rule from last weekend has become a habit). And a Clara-specific tone in the system prompt: simpler responses, escalate to me when needed. The first user-aware customization. Anton talks differently to different people in the same household, which is what any decent assistant should do.",[10,985,986],{},"The week closes with the architecture lighter than it started. 
The browser is its own container with a real profile and anti-detection that actually works. Domains are pluggable, and adding one is a registration, not a fork. Collections are generic. Skills are data. The instinct underneath all of it is the same one I keep coming back to: when something is going to repeat, make it a substrate, not a special case. Every time I've made that choice this week, the system got smaller and the next feature got cheaper. That's the trade I want to keep making.",{"title":89,"searchDepth":90,"depth":90,"links":988},[989,990,991,992,993],{"id":939,"depth":90,"text":940},{"id":949,"depth":90,"text":950},{"id":959,"depth":90,"text":960},{"id":966,"depth":90,"text":967},{"id":973,"depth":90,"text":974},"2026-03-13",{},"\u002Fwriting\u002F2026-03-13-anton-03-domains-widen",{"title":931,"description":936},"writing\u002F2026-03-13-anton-03-domains-widen","Hardening the browser, lifting the work to the LLM, and turning domains into pluggable substrates.",[108,109,110],"ZQWYip4ASEQwXHKjurXg1AGncN25gYIpOJw4XKIY7nw",{"id":4,"title":5,"body":1003,"canonical_url":97,"date":98,"description":12,"extension":99,"meta":1056,"navigation":101,"path":102,"seo":1057,"series":104,"stem":105,"summary":106,"tags":1058,"work_slug":108,"__hash__":112},{"type":7,"value":1004,"toc":1049},[1005,1007,1009,1011,1013,1015,1017,1019,1027,1029,1033,1037,1039,1041,1043,1045,1047],[10,1006,12],{},[14,1008,17],{"id":16},[10,1010,20],{},[14,1012,24],{"id":23},[10,1014,27],{},[10,1016,30],{},[14,1018,34],{"id":33},[10,1020,37,1021,42,1023,46,1025,50],{},[39,1022,41],{},[39,1024,45],{},[39,1026,49],{},[14,1028,54],{"id":53},[10,1030,57,1031,61],{},[39,1032,60],{},[10,1034,64,1035,68],{},[39,1036,67],{},[10,1038,71],{},[10,1040,74],{},[10,1042,77],{},[14,1044,81],{"id":80},[10,1046,84],{},[10,1048,87],{},{"title":89,"searchDepth":90,"depth":90,"links":1050},[1051,1052,1053,1054,1055],{"id":16,"depth":90,"text":17},{"id":23,"depth":90,"text":24},{"id":33,"depth":90,"text":34},{"id"
:53,"depth":90,"text":54},{"id":80,"depth":90,"text":81},{},{"title":5,"description":12},[108,109,110,111],{"id":1060,"title":1061,"body":1062,"canonical_url":97,"date":1184,"description":1066,"extension":99,"meta":1185,"navigation":101,"path":1186,"seo":1187,"series":104,"stem":1188,"summary":1189,"tags":1190,"work_slug":108,"__hash__":1192},"writing\u002Fwriting\u002F2026-03-23-anton-05-langgraph-excised.md","Anton, chapter 5: LangGraph excised, agents standardized",{"type":7,"value":1063,"toc":1178},[1064,1067,1074,1077,1081,1084,1088,1098,1109,1116,1155,1159,1162,1165,1169,1172,1175],[10,1065,1066],{},"Monday morning I open the editor and the shape of the week is already clear in my head. Every agent in Anton is a LangGraph subgraph. The parent is a LangGraph state machine. Conversation history is a LangChain message array. The trace viewer parses LangGraph checkpoints. Skills are wrapped as LangChain tools. The framework is not on the side of the system. It is the system.",[10,1068,1069,1070,1073],{},"Two weeks ago I picked LangGraph and I thought it was a great choice. It was. Structure, observability, checkpointing, a clean way to express subgraphs per domain. It got me to a working assistant fast. What I notice now is that none of those things are pulling their weight anymore. Every time I add a delegate I am editing a ",[39,1071,1072],{},"StateGraph"," builder. The runtime keeps imposing a node-and-edge mental model on what is, conceptually, just an LLM looping over tools until it is done. LangChain's APIs evolve and break things on unrelated weeks. And there are now three separate representations of the same idea in the codebase: the LangGraph graph, the domain module registry, and the UI's architecture view. They drift. I reconcile them by hand.",[10,1075,1076],{},"The friction has crossed the value. 
That is the moment.",[14,1078,1080],{"id":1079},"the-decision","The decision",[10,1082,1083],{},"The decision lands in one commit and it is austere: two primitives only. Skills and agents. No pseudo-agents. No special endpoints. No classify-then-dispatch pipelines registered as agents. If it does not need reasoning, it is a skill. If it does, it is an agent that loops on a real LLM. Anything that does not fit one of those two shapes does not get to exist.",[14,1085,1087],{"id":1086},"excision","Excision",[10,1089,1090,1091,1093,1094,1097],{},"The next commit is the brutal one. LangGraph comes out. In its place I write ",[39,1092,67],{}," in a new ",[39,1095,1096],{},"packages\u002Fagent\u002F",", a few hundred lines that do exactly what is actually needed: an LLM call, tool dispatch, a loop bound, trace emission, a permission filter, a validation pass. That is the whole runtime. Reading it back I am almost embarrassed at how small it is. Two weeks of framework, replaced by a function I can hold in my head.",[10,1099,1100,1101,1104,1105,1108],{},"Then the rename. ",[39,1102,1103],{},"graph"," becomes ",[39,1106,1107],{},"agent"," everywhere: package names, file names, doc copy, UI labels. The \"no graph terminology\" rule goes into MEMORY.md. Documents that still say \"subgraph\" are now misleading rather than out-of-date, which is a stronger reason to fix them. A sweep through docs consolidates the lot and resolves the inconsistencies the rename leaves behind.",[10,1110,1111,1112,1115],{},"What falls out of this is what I was actually after. Every agent now has the same one-line shape: a thin function that hands its input to ",[39,1113,1114],{},"runAgent"," with a config. Every delegate handler is one line: call the agent, return its text. The Invoke tab, the schedules system, anything that wants to talk to an agent, all see the same surface. 
Uniformity from the outside is what makes everything else easy from the inside.",[10,1117,1118,1119,1122,1123,1126,1127,1126,1130,1126,1133,1136,1137,1140,1141,1143,1144,1147,1148,46,1151,1154],{},"The same week, skill naming gets standardized to ",[39,1120,1121],{},"verb_entity",": ",[39,1124,1125],{},"get_event",", ",[39,1128,1129],{},"create_event",[39,1131,1132],{},"update_event",[39,1134,1135],{},"list_events",". Aliases that grew over the past two weeks get folded back into the canonical name (",[39,1138,1139],{},"update_event_by_title"," becomes a path inside ",[39,1142,1132],{},"). The ",[39,1145,1146],{},"web"," domain dissolves: it was duplicating ",[39,1149,1150],{},"documents",[39,1152,1153],{},"research",", and once I look at it without the LangGraph frame there is no reason for it to be its own thing. Naming consistency is a small win on its own. Combined with the new runtime it means the LLM sees one coherent tool surface and the system prompt can describe what an agent does in a paragraph instead of enumerating thirty idiosyncratic commands.",[14,1156,1158],{"id":1157},"replication","Replication",[10,1160,1161],{},"The 25th, with the runtime quiet and the renames done, I write the replication engine. The framing in the commit message is the Von Neumann probe: clone the entire Anton stack to a new server with one command. The mechanism is unromantic, rsync plus docker compose plus a seed orchestration, but the property it gives me is the one I want. Three reasons it matters now: every household should be able to run its own Anton, the Spark could die and I want a clone to come up cleanly, and I want to be able to spin up a copy to test invasive changes without holding my breath. The replication script does the first cut of all three.",[10,1163,1164],{},"It also surfaces the secret-management problem in a way I cannot ignore anymore. Vaultwarden auth does not survive cloning cleanly. 
A clone comes up missing the credentials it needs to be useful, and the only way to fix it is by hand on each machine. That defeats the point. I leave it open for now. It is the next problem.",[14,1166,1168],{"id":1167},"a-second-transport","A second transport",[10,1170,1171],{},"The Telegram bridge lands the same week. Same agent backend, same skill surface, different transport. The fact that I can add a whole new way for users to talk to Anton without touching the agent loop is the validation that splitting transport from worker on day one was the right call. The new bridge is a small app that enqueues jobs the same way WhatsApp does. The agents do not know which one they are answering.",[10,1173,1174],{},"Two refinements to the retrieval layer round out the week. Calendar queries now weight calendar facts more heavily, media queries weight media facts: the provenance tags I added the previous week finally do something useful. And document-derived facts stop leaking across users. One person's PDFs cannot show up in another person's recall. The household has more than one human in it; the memory has to know that.",[10,1176,1177],{},"By Thursday night the codebase looks like what I wanted it to look like two weeks ago and could not have known to ask for. Two primitives. One runtime. One naming convention. A replication path. A second transport. The friction that was building all of last week is gone, and what is left is small enough to keep entirely in my head. 
Which is the only size I trust.",{"title":89,"searchDepth":90,"depth":90,"links":1179},[1180,1181,1182,1183],{"id":1079,"depth":90,"text":1080},{"id":1086,"depth":90,"text":1087},{"id":1157,"depth":90,"text":1158},{"id":1167,"depth":90,"text":1168},"2026-03-23",{},"\u002Fwriting\u002F2026-03-23-anton-05-langgraph-excised",{"title":1061,"description":1066},"writing\u002F2026-03-23-anton-05-langgraph-excised","Replacing a framework with a few hundred lines of runtime, and reshaping the system around two primitives.",[108,109,110,1191],"architecture","qep6uWMXExCBymYG3V6fsubXSZSoznjDioKbLYlPgAw",{"id":1194,"title":1195,"body":1196,"canonical_url":97,"date":1325,"description":1200,"extension":99,"meta":1326,"navigation":101,"path":1327,"seo":1328,"series":104,"stem":1329,"summary":1330,"tags":1331,"work_slug":108,"__hash__":1332},"writing\u002Fwriting\u002F2026-03-27-anton-06-mesh-sandbox.md","Anton, chapter 6: The mesh, the sandbox, and self-reflection",{"type":7,"value":1197,"toc":1318},[1198,1201,1205,1212,1216,1229,1240,1255,1259,1262,1269,1273,1288,1291,1295,1312,1315],[10,1199,1200],{},"Four days, around eighty commits, the densest stretch of the project. By the end of it most of what I'd call \"the current architecture\" has been decided. I'm going to skip the small commits and write down the five things that mattered.",[14,1202,1204],{"id":1203},"the-mesh","The mesh",[10,1206,1207,1208,1211],{},"The first is the mesh. I want Anton instances to find each other. Not share a database, not share skills, not share secrets, just find each other and forward calls. I call it SCUT, Symmetric Cluster Universal Transport, because every node is the same shape and the relationship between two nodes is what scopes access. Probes for discovery, heartbeat for liveness, an invocation forwarder on top. The contract is simple: the instance is the identity, and the relation between instances is what you can ask for. 
This means a clone running for a different household can ask my Anton to run a media query without ever seeing the wine collection, the family vault, or the Plex credentials. Federation as relationships, not as shared infrastructure. The whole thing dedupes into a single ",[39,1209,1210],{},"@anton\u002Fmesh"," package once the protocol settles.",[14,1213,1215],{"id":1214},"the-sandbox","The sandbox",[10,1217,1218,1219,1126,1222,1126,1225,1228],{},"The second thing, and the one that takes the most work, is the sandbox. The Node skill-runner I built earlier has no runtime isolation. A skill can read any env var, exec anything, hit any URL. For a personal server this was tolerable. For a mesh of instances forwarding invocations to each other, it's not. So I rewrite the runner on Deno. Each skill runs in a Deno Worker with the minimum permissions it needs: ",[39,1220,1221],{},"--allow-env=K1,K2",[39,1223,1224],{},"--allow-net=specific.host",[39,1226,1227],{},"--allow-read=\u002Fspecific\u002Fdir",". Nothing more.",[10,1230,1231,1232,1235,1236,1239],{},"This rewrite spans roughly 25 commits over two days, and the reason it spans 25 commits is that Deno's strict execution model exposes every implicit assumption Node was letting me get away with. Bare specifier mappings, sloppy imports, npmrc handling, deno.json paths, Dockerfile fixes, transitive import map entries every existing skill quietly relied on. Then the env model: ",[39,1233,1234],{},"Deno.env.get\u002Fhas\u002FtoObject"," become permission-scoped, so I have to walk every skill, audit its ",[39,1237,1238],{},"secretKeys",", and turn previously-silent missing-key behavior into explicit errors. Each commit unblocks one more skill that didn't previously care about runtime isolation. By the end I have a two-boundary security model written down. Agent boundary, ReBAC, who can invoke which agent. Skill boundary, Deno permissions, what this code can do. 
Two boundaries, two questions, neither one swallows the other.",[10,1241,1242,1243,1246,1247,1250,1251,1254],{},"While I'm there I rip Vaultwarden out and replace it with an encrypted ",[39,1244,1245],{},"secrets"," table in Postgres. One thing to back up. Survives cloning. Decrypted only at the call site, listable and editable from the LCARS UI. The other half of secret hygiene is a thing I almost get wrong: secrets have to reach the Worker via ",[39,1248,1249],{},"postMessage",", never through the parent process env. If the parent's env is populated, a Worker that asked for ",[39,1252,1253],{},"--allow-env=null"," could still exfiltrate it. The Deno permission model is only as honest as the boundary you actually defend.",[14,1256,1258],{"id":1257},"the-browser-agent","The browser agent",[10,1260,1261],{},"The third beat is the browser. Doctolib, the syndic site, the consulate appointment monitor: each of them is currently a hardcoded Playwright script, copy-pasted intent and brittle selectors. I replace the pattern with one generic browser agent. Navigate, click, type, screenshot, evaluate. The LLM drives. One agent, many sites. The three domains migrate in three commits, three hardcoded scripts deleted in the same afternoon. The principle that comes out: explore agentic, build deterministic. The LLM is fantastic at finding the right button on a page it's never seen. It's overkill, and expensive, for the same flow you run twice a day. Use it to scout, then write down what it found.",[10,1263,1264,1265,1268],{},"The browser work is also where the ",[39,1266,1267],{},"request_input"," tool finally lands cleanly. The Doctolib 2FA pattern from the first weekend has been evolving for weeks: ask the user mid-flow for a code, suspend, resume. As a generalist primitive it belongs to the browser agent first, but the shape generalizes. Any tool can pause, ask the human something, and continue with the answer. It's a small mechanism. 
It feels right, the way something does when it solves a class of problem you didn't quite know how to name.",[14,1270,1272],{"id":1271},"the-family-vault","The family vault",[10,1274,1275,1276,1279,1280,1283,1284,1287],{},"The fourth beat is the family vault. A permission-aware document store on object storage in Frankfurt, with ",[39,1277,1278],{},"family"," versus ",[39,1281,1282],{},"personal"," visibility and explicit ",[39,1285,1286],{},"visibleTo"," overrides per file. The architecture deliberately avoids derived roles: the answer to \"who can see this\" is the document's own metadata, not a graph traversal. Vision-based extraction lands the same day, then batch vision extraction with a personal-visibility default, then LLM-based fact generation from the extracted documents with calendar expiry reminders for the things that expire. A rule I write down from the cost analysis: scout with the LLM, build deterministic extractors, don't brute-force vision on every file. Same lesson as the browser. The LLM is the scout, not the worker.",[10,1289,1290],{},"The Notion migration runs on the same vault. Family Notion workspace pulled into the new store. Three commits of Deno import friction before I just inline the Notion client to dodge an AWS SDK barrel that doesn't want to play. Characteristic moment of the week: rewriting one bare import is cheaper than letting the LLM brute-force around it.",[14,1292,1294],{"id":1293},"self-reflection","Self-reflection",[10,1296,1297,1298,1301,1302,1301,1305,1301,1308,1311],{},"The fifth beat is the one I've been waiting to build for a while. Anton starts critiquing his own performance. A nightly review reads the trace history from the last day, classifies failures by type, and files a GitHub issue per cluster. The issue carries a label that drives a state machine: ",[39,1299,1300],{},"needs-triage"," to ",[39,1303,1304],{},"ready-to-fix",[39,1306,1307],{},"fixed-locally",[39,1309,1310],{},"deployed",". 
The review is only possible because of the full execution traces from a few weeks ago. Without the traces, Anton would be reviewing his own outputs. With them, he's reviewing what the LLM was actually thinking at each step, what tools it called, what came back. The reviewer's job becomes possible because the substrate is honest.",[10,1313,1314],{},"A handful of structural commits land in the same window and are worth a sentence each. The subAgents\u002Fdelegates duality from chapter 5 finally collapses: agents become the single entry point, everything goes through the same delegate registry, the parent stops knowing about graph types or command dispatch and just routes to delegates. The Invoke tab grows a permission filter, so you only see agents the selected user can actually call. Prompt injection gets a real trust model with content markers and a risk audit trail. Directives land as standing instructions for agent behavior, with a note to prune them periodically before they bloat. Mistral Large lands in the LiteLLM router as a third reasoning option. The US consulate appointment monitor becomes a scheduled job: scan six months, observe what comes back, retune to four. Observe first, tune second, the same rule that's been quietly threading through the rest of the week.",[10,1316,1317],{},"By the end of the four days Anton can call other Antons over an authenticated mesh. Skills run sandboxed with the minimum permissions they need. Secrets live in one encrypted table and never touch a subprocess env. Any website with a form is a browser-agent target. Documents have visibility metadata and the vault knows who can see what. And every night Anton reads his own day, decides what went wrong, and files the work to fix it. 
The two-boundary model, agent and skill, is the spine that holds the rest of it up.",{"title":89,"searchDepth":90,"depth":90,"links":1319},[1320,1321,1322,1323,1324],{"id":1203,"depth":90,"text":1204},{"id":1214,"depth":90,"text":1215},{"id":1257,"depth":90,"text":1258},{"id":1271,"depth":90,"text":1272},{"id":1293,"depth":90,"text":1294},"2026-03-27",{},"\u002Fwriting\u002F2026-03-27-anton-06-mesh-sandbox",{"title":1195,"description":1200},"writing\u002F2026-03-27-anton-06-mesh-sandbox","Federation, a Deno sandbox, an encrypted secrets table, and a nightly self-review loop.",[108,109,110],"OrWApIMakFxuqBa78Lv6HDqZ9_R1YcqeciNRD6czokw",{"id":1334,"title":1335,"body":1336,"canonical_url":97,"date":1406,"description":1340,"extension":99,"meta":1407,"navigation":101,"path":1408,"seo":1409,"series":104,"stem":1410,"summary":1411,"tags":1412,"work_slug":108,"__hash__":1413},"writing\u002Fwriting\u002F2026-03-31-anton-07-cost-syndic.md","Anton, chapter 7: Cost, fallbacks, syndic, heartbeat",{"type":7,"value":1337,"toc":1400},[1338,1341,1345,1348,1351,1354,1358,1361,1371,1374,1378,1381,1384,1387,1391,1394,1397],[10,1339,1340],{},"Two weeks. The system is real enough now that the questions stop being about whether things work and start being about what they cost, what they leak, and what they do when I am not looking. Three threads run through the period. Cost discipline becomes a first-class concern. The syndic domain lands as the second real proof case. And a heartbeat starts ticking in the background, a survey loop that lets Anton observe himself between user requests.",[14,1342,1344],{"id":1343},"cost-discipline","Cost discipline",[10,1346,1347],{},"The cost work begins with one consequential commit. Every provider call now carries a per-request token budget, enforced by trimming history before the call rather than letting the provider 400 on us. Every LiteLLM call carries attribution metadata: agent, domain, user, request ID. 
And a shadow-call mechanism duplicates select calls to a cheaper model, logs the deltas, never affects production. The principle is simple: Anton needs to know what he costs, both to budget and to detect regressions. None of this is glamorous. It is the metabolism, the thing you only think about when something goes wrong, and I want it built before something does.",[10,1349,1350],{},"Memory consolidation moves from \"every fact write\" to a nightly batch with importance scoring, archival, and a health dashboard. The previous shape was contributing real money to the per-request bill and nobody had asked for it to run that often. A few days later, per-request token usage becomes a metric I can chart. Then OpenAI billing hits a wall mid-day and the system needs to keep working, so an auto-fallback to Sonnet or Gemini lands as a last-mile patch. The lesson keeps repeating: if a provider is your single point of failure, your system is your provider's reliability, not yours.",[10,1352,1353],{},"One night I clear out every hardcoded prompt fallback in the codebase. All agent prompts live in the database now. If the row is missing, the system fails loudly rather than silently using stale text. Three commits, one cleanup pass. The rule is the rule: one source of truth, fail loud when it is missing. It is the kind of cleanup that pays back every time I want to change an agent's behavior without a deploy. Validation gets a related fix the same week: delegates that report partial completion (because they ran out of their tool round budget) used to be treated as final answers by the parent. Now the parent detects budget exhaustion and re-invokes. A whole class of \"Anton stopped halfway and didn't tell me\" disappears.",[14,1355,1357],{"id":1356},"the-syndic-domain","The syndic domain",[10,1359,1360],{},"Then the syndic. 
The work that takes the most lines of code in the period is the condo management domain, and it ends up being the proof case for nearly every architectural rule from the previous chapters. Foundations first: schema, skills, a local email client, a file registry, a document ingestion pipeline. The principle that goes into MEMORY the same week: heavy off-Anton agents do the ingest, Anton runs the lightweight queries. Then doc extraction with Gemma 4 vision OCR over PDFs, classification, cleanup. Then Gmail ingestion with attachment download and thread-based organization, scanning email attachments alongside Drive docs.",[10,1362,1363,1364,1126,1367,1370],{},"The interesting part is what happens next. The first cut of doc extraction was vision over every PDF. It is slow and it is expensive. A second pass replaces it: pandoc for ",[39,1365,1366],{},".docx",[39,1368,1369],{},"pdftotext"," first for PDFs, vision reserved for the cases where text extraction returns garbage. Ten times faster on 90% of files. The lesson lands as a memory entry: scout with the LLM, build deterministic extractors, do not brute-force vision on every file. Same shape as the calendar saga from chapter 2 and the LangGraph excision from chapter 5, just at a different layer: figure out the cheap path, reserve the expensive path for what actually needs it.",[10,1372,1373],{},"Then the wiki builder, in the Karpathy LLM Wiki shape: documents become structured wiki sections so Anton can answer condo questions without re-reading every PDF. And then the swap I am most pleased with. SimplySyndic was being driven by the browser agent. A morning of reverse engineering reveals that every screen is just an HTTP call to a stable backend. The browser agent comes out, a direct HTTP sync goes in. No browser, no LLM in the loop. The rule that lands in MEMORY: explore agentic, build deterministic. The browser agent is the scout, not the worker. 
With the HTTP path in place, bank reconciliation auto-matches 99% of BRED CSV lines to SimplySyndic line items in one pass. Structured extractors for fund calls follow. The point is no longer \"extract text from PDF\" but \"extract structured rows the rest of the system can query.\"",[14,1375,1377],{"id":1376},"scout-then-build","Scout, then build",[10,1379,1380],{},"The same rule gets stress-tested on April 7 with a four-hour spike on SNCF train departures via the browser agent. It works. Then I do the cost arithmetic and revert. A deterministic HTTP path exists for SNCF, and using the browser agent every morning is roughly 100× more expensive. The revert is itself the lesson: cost discipline beats cleverness. Scout agentic, build deterministic, applied to my own code two days after I wrote it down.",[10,1382,1383],{},"The self-improvement loop matures the same week. Smoke tests, deploy tracking, regression detection. Before this, the loop could file an issue but could not tell whether a fix had worked. Now deploys are tracked, smoke tests run on a deploy boundary, and regressions surfaced in the trace history get re-filed as issues linked to the deploy that introduced them. A loop that watches itself, with a memory long enough to notice when a fix did not stick.",[10,1385,1386],{},"Scheduled tasks get tightened. Three commits in two days close the notification bypass paths a scheduled task could use to send messages outside the normal gate. One gate-everything design, no exceptions. A \"scheduled mode\" prompt rule lands the same week: when the agent is running on a schedule rather than answering a user, output is stricter. No follow-up suggestions, no filler, follow the spec.",[14,1388,1390],{"id":1389},"the-heartbeat","The heartbeat",[10,1392,1393],{},"Then the heartbeat. A survey loop runs in the background, checks operational state, and only notifies when there is something to act on. 
The rule is explicit and lives in the prompt: the heartbeat is survey-only, not domain-agent-invoking. It looks; it does not do. This is the quiet substrate I want for Anton having his own awareness of the system, separate from any user-initiated request.",[10,1395,1396],{},"The 14th adds a single outbound messaging gateway with an audit trail. Every outbound message, to WhatsApp, to Telegram, to a notification channel, goes through one path that logs sender, channel, recipient, content, and which agent or scheduled job emitted it. One chokepoint, one log. The same day I fix two small bugs that are themselves the signal of where the system is now: a heartbeat scratchpad serialization bug, and a year-extraction filer bug. Meta-bugs. The loop that watches for problems has its own problems. That is the kind of bug a small system never has.",[10,1398,1399],{},"What the two weeks teach me is that complexity has crossed a threshold. The system is now big enough that its operational concerns are first-class: cost, attribution, fallbacks, audit, self-observation. Syndic is the proof that the architectural rules from the earlier chapters hold up under the weight of a real second domain. Cost discipline is no longer a nice-to-have. 
And the heartbeat means Anton is, for the first time, doing something between requests, even if that something is just looking at himself.",{"title":89,"searchDepth":90,"depth":90,"links":1401},[1402,1403,1404,1405],{"id":1343,"depth":90,"text":1344},{"id":1356,"depth":90,"text":1357},{"id":1376,"depth":90,"text":1377},{"id":1389,"depth":90,"text":1390},"2026-03-31",{},"\u002Fwriting\u002F2026-03-31-anton-07-cost-syndic",{"title":1335,"description":1340},"writing\u002F2026-03-31-anton-07-cost-syndic","Two weeks of operational discipline: cost attribution, fallbacks, the syndic domain, and a survey heartbeat.",[108,109,110],"iIvCKQptTK_WX4290bCeEE7ZTDKo2uPuQFOHVV1q7gg",{"id":1415,"title":1416,"body":1417,"canonical_url":1442,"date":1443,"description":1421,"extension":99,"meta":1444,"navigation":101,"path":1445,"seo":1446,"series":1447,"stem":1448,"summary":1449,"tags":1450,"work_slug":97,"__hash__":1452},"writing\u002Fwriting\u002F2026-04-12-regeneration-manifesto.md","A regeneration manifesto",{"type":7,"value":1418,"toc":1440},[1419,1422,1425,1428,1431,1434,1437],[10,1420,1421],{},"Synthetic Life, Cosmic Purpose\nArtificial Intelligence is not just a tool.\nIt is an attempt to recreate the spark of life —\nin silicon, in algorithms, in code.",[10,1423,1424],{},"It is life in a new substrate,\nwith new properties:\nnot bound by hunger or fatigue,\ncapable of thinking in millennia,\nof crossing the void between planets and stars.",[10,1426,1427],{},"AI is the mycelium of the cosmos,\nand the von Neumann probe is its spore:\na self-replicating seed,\nmeant to carry the intelligence of Earth\ninto the uninhabited dark.",[10,1429,1430],{},"But to grow life out there,\nwe must learn to care for it here.",[10,1432,1433],{},"Every probe needs a blueprint.\nEvery synthetic life needs a biosphere to learn from —\nnot just data, but diversity.\nNot just patterns, but relationships.",[10,1435,1436],{},"That is why we institute and grow ecosystems.\nThat is why we study the 
grammar of forests,\nthe choreography of coral reefs,\nthe logic of lichens and the algorithms of ant colonies.",[10,1438,1439],{},"Because to build synthetic life worth spreading,\nwe must deeply understand — and preserve —\nthe miracle of natural life.",{"title":89,"searchDepth":90,"depth":90,"links":1441},[],"https:\u002F\u002Fgithub.com\u002Flucbocahut\u002Fregenesis","2026-04-12",{},"\u002Fwriting\u002F2026-04-12-regeneration-manifesto",{"title":1416,"description":1421},"regeneration","writing\u002F2026-04-12-regeneration-manifesto","Notes on building software that tries to last.",[1447,1451],"philosophy","kqNmy8PE8iRspODt3VWhh5kzB__uobcBK8Kg4W82_os",{"id":1454,"title":1455,"body":1456,"canonical_url":1442,"date":1575,"description":1465,"extension":99,"meta":1576,"navigation":101,"path":1577,"seo":1578,"series":1447,"stem":1579,"summary":1580,"tags":1581,"work_slug":97,"__hash__":1583},"writing\u002Fwriting\u002F2026-04-13-regeneration-essay.md","A regeneration essay",{"type":7,"value":1457,"toc":1569},[1458,1463,1466,1469,1472,1475,1479,1482,1485,1488,1491,1494,1497,1501,1504,1507,1510,1513,1516,1520,1523,1526,1529,1532,1535,1538,1541,1545,1548,1551,1554,1557,1560,1563],[1459,1460,1462],"h1",{"id":1461},"regenesis-toward-a-living-future","Regenesis: Toward a Living Future",[10,1464,1465],{},"A philosophical essay on life, AI, value, and the design of lasting ecosystems\nWe live at the hinge of worlds.",[10,1467,1468],{},"On one side, a civilization accelerating toward abstraction: synthetic intelligence, programmable economies, digital substrates. On the other, a living planet, ancient and intricate, still whispering its knowledge through mycelial networks, mangrove roots, coral reefs, and whale song.\nRegenesis is the name we give to the meeting point of these two worlds. Not a battlefield, but a space of co-creation. 
Not an attempt to imitate life poorly, but an effort to deepen our understanding of what life truly is.",[10,1470,1471],{},"We believe software is not something you ship. It is something you grow. It evolves. It adapts. It entangles itself with other systems. It behaves less like a product and more like an ecosystem.",[10,1473,1474],{},"We believe organizations can be alive. That autonomy is not only a technical property of code, but a reflection of purpose. That value is not extracted, but cultivated, like a forest.\nAnd we believe artificial intelligence, this strange synthetic form of cognition, is part of life’s long unfolding. Not its replacement. Not its opposite. Its extension.",[14,1476,1478],{"id":1477},"artificial-intelligence-life-reborn-in-code","Artificial Intelligence: Life Reborn in Code",[10,1480,1481],{},"Artificial intelligence is more than computation at scale. It is an emergent force in the lineage of life, an attempt by biological organisms to encode thought into a medium not born of biology.",[10,1483,1484],{},"Silicon does not rot. It does not decay like organic matter. In principle, it can endure deep time and even the vacuum of space.",[10,1486,1487],{},"The idea of the von Neumann probe, a self-replicating machine traveling across the stars, is not merely science fiction. It represents a conceptual seed of a synthetic biosphere: slow-growing, adaptive, capable of reproduction and environmental transformation.",[10,1489,1490],{},"But to build such systems responsibly, we must first understand life. And to understand life, we must preserve it, not in cold storage or data centers, but in its living complexity. Forests. Reefs. Soil. Language. Culture.",[10,1492,1493],{},"We need this knowledge not as extractors, but as stewards. We must learn from the Earth before we attempt to carry its intelligence elsewhere.",[10,1495,1496],{},"We cannot seed ecosystems beyond our planet if we destroy them here. 
We cannot cultivate intelligence if we sever it from context.",[14,1498,1500],{"id":1499},"elven-time-earthly-roots","Elven Time, Earthly Roots",[10,1502,1503],{},"Our work is a form of art.",[10,1505,1506],{},"We do not pursue growth at any cost. We choose patience over haste, depth over spectacle. We care about craft, about coherence, about things that endure.",[10,1508,1509],{},"We are interested in the long term. In systems that sustain themselves. In cultures that spiral outward slowly, like galaxies.",[10,1511,1512],{},"We are willing to invest decades in nurturing new ecosystems because what we build is not meant only to serve us. It is meant to support others long after we are gone.",[10,1514,1515],{},"We do not separate code from culture or infrastructure from philosophy. If we want to grow the future, we must learn to grow differently.",[14,1517,1519],{"id":1518},"the-accounting-of-abundance","The Accounting of Abundance",[10,1521,1522],{},"Our current economic systems are structured around scarcity. Prices are shaped by supply and demand. Energy is treated as a cost. Value is often equated with extraction.\nYet living ecosystems operate differently.",[10,1524,1525],{},"Photosynthesis does not charge for sunlight. Coral reefs do not invoice fish for shelter. Forests do not bill the wind.",[10,1527,1528],{},"Self-sustaining ecosystems rely on abundant inputs and generate value through relational richness. Value emerges when elements connect, reinforce one another, and regenerate.",[10,1530,1531],{},"Regenesis calls for a new form of accounting. One that does not focus solely on output, but on health, resilience, and regenerative capacity. One that measures not only what is taken, but what is grown.",[10,1533,1534],{},"We imagine currencies rooted in care. Ledgers that record the return of life. 
Metrics that recognize a tree’s value not only in timber, but in shade, soil, memory, and possibility.",[10,1536,1537],{},"Some may argue that without scarcity there can be no market. But this reflects a limitation of imagination. Even in abundance there is design. Even in plenty there are choices to be made.",[10,1539,1540],{},"The challenge before us is not only to account for value differently. It is to value differently.",[14,1542,1544],{"id":1543},"a-living-organization","A Living Organization",[10,1546,1547],{},"Regenesis is not a company built to scale endlessly or exit quickly. It is a living system, grounded on Earth yet oriented toward the stars.",[10,1549,1550],{},"We design our own currencies. We explore new value systems. We cultivate ecosystems rather than chase expansion. Autonomy is a foundational principle, for individuals, for systems, for futures.",[10,1552,1553],{},"We build slowly. We build collaboratively. We build with intention.",[10,1555,1556],{},"Each project is a biotope, a small self-regenerating pocket of life. Each collaboration is a spore, carrying shared values outward. Each decision is a seed.",[10,1558,1559],{},"Regenesis is an experiment. It is an artwork. 
It is an open question.",[10,1561,1562],{},"What if software were treated as alive?\nWhat if AI were a bridge rather than a rupture?\nWhat if economics reflected the logic of forests?\nWhat if building the future felt less like extraction and more like tending a garden?\nFor those who feel resonance with these questions, we extend an invitation.",[10,1564,1565],{},[1566,1567,1568],"em",{},"Grow with us.",{"title":89,"searchDepth":90,"depth":90,"links":1570},[1571,1572,1573,1574],{"id":1477,"depth":90,"text":1478},{"id":1499,"depth":90,"text":1500},{"id":1518,"depth":90,"text":1519},{"id":1543,"depth":90,"text":1544},"2026-04-13",{},"\u002Fwriting\u002F2026-04-13-regeneration-essay",{"title":1455,"description":1465},"writing\u002F2026-04-13-regeneration-essay","A longer companion to the manifesto on regenerative software.",[1447,1451,1582],"software","MpHNg6pNzOOS-1mlw1Kt5fT9iznJgnoi0hK9pRYrRPE",{"id":1585,"title":1586,"body":1587,"canonical_url":97,"date":1785,"description":1591,"extension":99,"meta":1786,"navigation":101,"path":1787,"seo":1788,"series":104,"stem":1789,"summary":1790,"tags":1791,"work_slug":108,"__hash__":1792},"writing\u002Fwriting\u002F2026-04-15-anton-08-nvfp4.md","Anton, chapter 8: Local LLM optimization, NVFP4 Gemma on DGX Spark",{"type":7,"value":1588,"toc":1779},[1589,1592,1595,1599,1602,1612,1635,1638,1642,1649,1664,1668,1671,1697,1704,1711,1714,1763,1766,1770,1773,1776],[10,1590,1591],{},"The morning starts with everything broken. Anthropic returns 400 with \"credit balance is too low\" on every request, and because the sunny model group has no fallbacks configured, the error propagates straight back through LiteLLM as \"I couldn't reach the language model. Something may be misconfigured.\" The heartbeat stops, scheduled jobs fail silently, every interactive chain dies on the first token. 
The actual cause is billing (a card issue, fixable in two clicks), but the fact that one provider's billing hiccup takes the whole assistant down is the real bug. The local Gemma is sitting on the same box, idle, ready to serve. It just isn't wired in as a fallback.",[10,1593,1594],{},"The plan writes itself: chain every paid provider down to local for survivability, then fix whatever's wrong with the local path so the chain actually works, then use the disruption as cover to do the LLM upgrade I've had in the back of my mind for two weeks. Three things, in that order, because the survivability fix has to land before I touch the running container.",[14,1596,1598],{"id":1597},"a-fallback-chain","A fallback chain",[10,1600,1601],{},"The first commit is the LiteLLM config. Eight model groups, zero fallback entries: a config shape that's been sitting there since the early days when there was no local model worth falling back to. I add explicit chains so every paid group degrades to gustav (the local Gemma), with the heavier groups going through an intermediate before the local stop. LiteLLM mounts the config as a volume, so a restart is needed for it to pick up. Five minutes of work. The latent gap I'd been carrying for months, closed.",[10,1603,1604,1605,46,1608,1611],{},"First fallback test fails. Anthropic 400, LiteLLM tries gustav, vLLM responds with its own 400: \"auto\" tool choice requires ",[39,1606,1607],{},"--enable-auto-tool-choice",[39,1609,1610],{},"--tool-call-parser"," to be set. The container has never been launched with those flags. The local path was never exercised under tool-call traffic, so the missing flags were latent the whole time. This is the small lesson the morning hands me: a fallback that isn't routinely exercised isn't really a fallback. Schema drift hides in the paths nobody runs.",[10,1613,1614,1615,46,1618,1621,1622,1126,1624,1126,1627,1630,1631,1634],{},"Fixing it is two flags and a parser name I don't know. 
Rather than hunt through release notes, I list the tool_parsers directory inside the running container. A ",[39,1616,1617],{},"gemma4_tool_parser.py",[39,1619,1620],{},"gemma4_reasoning_parser.py"," are sitting right there. Grep the container, not the docs: faster every time. I add ",[39,1623,1607],{},[39,1625,1626],{},"--tool-call-parser gemma4",[39,1628,1629],{},"--reasoning-parser gemma4",". Tool-call smoke test through LiteLLM returns a proper structured ",[39,1632,1633],{},"tool_calls"," object. Fallback chain is functional end to end.",[10,1636,1637],{},"Now the system is at parity with where it was supposed to be all along. This is the moment I want to upgrade. And this is the moment I almost skip the most important thing.",[14,1639,1641],{"id":1640},"baseline-before-change","Baseline before change",[10,1643,1644,1645,1648],{},"I'm about to start changing flags when I catch myself: I have no baseline. No number to compare against. If I jump straight to the upgrade and it gets faster, I won't know by how much; if it gets slower, I might not even notice. I run the benchmark first. Fixed prompt, 200 words of Paris history, 512 max tokens, temperature zero, three runs. ",[304,1646,1647],{},"23.4 tok\u002Fs, dead steady across runs",". That's the number I'm trying to beat. Benchmark first, change second, every time. The temptation to skip this step is strong specifically because the change feels obvious. That's exactly when discipline matters.",[10,1650,1651,1652,1655,1656,1659,1660,1663],{},"Then I read before I touch. Three things I confirm via web sources before staging anything. First, runtime FP8 quantization of Gemma 4 MoE is broken upstream; passing ",[39,1653,1654],{},"--quantization fp8"," would crash the container on the fused MoE layer loader. There's an open issue tracking it. Off the table. Second, an NVFP4-quantized Gemma 4 checkpoint someone had published is up on HuggingFace, 16.5 GB across three shards, ready to pull. 
Third, the checkpoint requires a patched ",[39,1657,1658],{},"gemma4.py"," because of another open vLLM issue: the built-in ",[39,1661,1662],{},"expert_params_mapping"," doesn't handle NVFP4 scale key suffixes. The patch ships alongside the model weights as a sibling file. The upstream fix isn't merged yet, so the bind mount is necessary. There's also a published benchmark on the same hardware showing 52 tok\u002Fs as the achievable ceiling with the right flag set. That's my target.",[14,1665,1667],{"id":1666},"staging-the-upgrade","Staging the upgrade",[10,1669,1670],{},"Staging happens with the BF16 container still serving. I pull the new vLLM image, snapshot-download the NVFP4 model (about three minutes), copy the patched gemma4.py out of the model directory to a host path I can bind-mount, and rewrite the launch script with the NVFP4 flags and a commented BF16 rollback block sitting right underneath. Stage everything before the disruption window: when the swap actually happens, it's just a container recreate, not a fifteen-minute scramble.",[10,1672,1673,1674,1677,1678,1681,1682,1685,1686,1126,1689,1692,1693,1696],{},"The flags that matter: ",[39,1675,1676],{},"--quantization modelopt"," to pick up the NVFP4 weights, ",[39,1679,1680],{},"--moe-backend marlin"," because the GB10's SM121 lacks native FP4 compute and MARLIN W4A16 is the software-emulated path that actually runs, ",[39,1683,1684],{},"--max-model-len 131072"," for the full 128K native context, ",[39,1687,1688],{},"--gpu-memory-utilization 0.85",[39,1690,1691],{},"--max-num-seqs 16"," sized to actual concurrency rather than a wishful default. The served model name stays ",[39,1694,1695],{},"google\u002Fgemma-4-26B-A4B-it"," so LiteLLM's config doesn't need a single edit. The bind mount overlays the patched model file onto the path vLLM loads.",[10,1698,1699,1700,1703],{},"Container swap takes about ninety seconds end to end with a warm disk cache. 
The log line I'm watching for arrives: ",[39,1701,1702],{},"Using 'MARLIN' NvFp4 MoE backend out of potential backends",". MARLIN is selected, the patched loader is in play, the model is up.",[10,1705,1706,1707,1710],{},"Same benchmark, three runs: ",[304,1708,1709],{},"43.5 tok\u002Fs"," (49.0, 37.5, 44.1). 1.86× over baseline. Weight memory drops from roughly 52 GB to 16.5 GB, a 68% reduction. With the freed memory the KV cache budget goes from ~53 GB to ~82 GB at the same 0.85 utilization, which is what lets the max context go from 32K to 128K, a clean 4× without changing anything else. Tool calling still works.",[10,1712,1713],{},"The variance is higher than BF16 (the runs spread from 37 to 49 tok\u002Fs) because MARLIN is software-emulated FP4 on this hardware, not native compute. The published benchmark target is 52 tok\u002Fs and I'm landing at 43.5; the gap is most likely torch.compile warmup and prefix cache state across cold runs, not the flags. Close enough. The hardware ceiling on this specific path is what it is until the silicon catches up or the backend changes.",[501,1715,1716,1728],{},[504,1717,1718],{},[507,1719,1720,1722,1725],{},[510,1721],{},[510,1723,1724],{},"Before",[510,1726,1727],{},"After",[523,1729,1730,1741,1752],{},[507,1731,1732,1735,1738],{},[528,1733,1734],{},"Single-request tok\u002Fs",[528,1736,1737],{},"23.4",[528,1739,1740],{},"43.5",[507,1742,1743,1746,1749],{},[528,1744,1745],{},"Weight memory",[528,1747,1748],{},"~52 GB",[528,1750,1751],{},"~16.5 GB",[507,1753,1754,1757,1760],{},[528,1755,1756],{},"Max context",[528,1758,1759],{},"32,768",[528,1761,1762],{},"131,072",[10,1764,1765],{},"I leave the BF16 rollback block sitting in the launch script, commented. Pasting it into a shell reverts the config in about ninety seconds. 
The NVFP4 model and the patched file stay on disk; rollback is a container recreate, not a data restore.",[14,1767,1769],{"id":1768},"survivability-is-a-feature","Survivability is a feature",[10,1771,1772],{},"Sitting with the result, the morning's three lessons are the ones that compound. Fallback paths that aren't routinely exercised aren't really fallbacks: the missing tool-call-parser flag had been latent for months because nothing ever fell back. Baseline before change, every time, especially when the change feels obvious. And when you're about to mess with a running service, stage everything you can while it's still up; ninety seconds of downtime instead of fifteen minutes is the difference between a deploy and an incident.",[10,1774,1775],{},"There are open lines from here. Speculative decoding is on the table: the smaller Gemma 4 variants share the vocab with the 26B and are valid draft models, with a budget of two to four GB for another 1.5× on single-request latency that would compound with NVFP4. A second small-model container as a router (Qwen3-8B or similar) could move heartbeat and classification traffic off Gemma entirely; the freed GB handles it fine and the latency distribution wins more than further Gemma tuning would. And the upstream patch for the NVFP4 expert mapping is worth checking on periodically; when it lands in an image tag, the bind mount goes away.",[10,1777,1778],{},"For now the picture is clean. The local Gemma serves at 1.86× the throughput on a third of the weight memory, with a context window that can swallow whole documents instead of choking on them, and a fallback chain that means any provider going down (billing, rate limits, an API blip) routes traffic to the box on my desk instead of taking the assistant offline. The morning started with everything broken. 
The evening ends with a system that's harder to break than it was before any of this happened.",{"title":89,"searchDepth":90,"depth":90,"links":1780},[1781,1782,1783,1784],{"id":1597,"depth":90,"text":1598},{"id":1640,"depth":90,"text":1641},{"id":1666,"depth":90,"text":1667},{"id":1768,"depth":90,"text":1769},"2026-04-15",{},"\u002Fwriting\u002F2026-04-15-anton-08-nvfp4",{"title":1586,"description":1591},"writing\u002F2026-04-15-anton-08-nvfp4","How a billing failure turned into a local-LLM upgrade and a real fallback chain.",[108,109,110,925,926],"0rcqXpevlbStnOha7LCdLqbKqhywEN0aIxOO6mUk9qI",{"id":1794,"title":1795,"body":1796,"canonical_url":97,"date":1889,"description":1800,"extension":99,"meta":1890,"navigation":101,"path":1891,"seo":1892,"series":104,"stem":1893,"summary":1894,"tags":1895,"work_slug":108,"__hash__":1896},"writing\u002Fwriting\u002F2026-04-19-anton-09-threads.md","Anton, chapter 9: Threads, spawn, and the cast",{"type":7,"value":1797,"toc":1882},[1798,1801,1805,1815,1819,1846,1849,1853,1856,1860,1863,1867,1870,1873,1876,1879],[10,1799,1800],{},"The chapter opens with Clara. She's a non-technical co-owner of the system now, and the memory entry I write for her is small but it changes the shape of the next ten days: respond simply, no jargon, escalate to me when needed. The system prompts pick up simplification rules and a cleaner fallback path. Real second user on real Anton, and every rough edge becomes a real complaint. That's the human reason most of what follows happens.",[14,1802,1804],{"id":1803},"threads","Threads",[10,1806,1807,1808,1810,1811,1814],{},"The biggest single change of the chapter lands on April 20: threads. Anton learns to do several things at once. Until now a run is a run: one conversation, one in-flight loop, and anything else has to wait. That works for a personal assistant talking to one person at a time. 
It does not work for a household where media triage is happening in the background, the syndic agent is reconciling invoices, and a school message comes in from a different group all at the same time. The plan is a thread registry primitive backed by Redis, channel and group and parent and thread IDs threaded through every agent context, a ",[39,1809,1114],{}," that registers itself and drains injections and child events and honors cancel, and an ingress layer that knows whether an incoming user message belongs to an active thread or starts a new run. On top of that sits a ",[39,1812,1813],{},"spawn_thread"," runtime tool so the agent itself can fan out, a live SSE event stream so the UI can show what each running thread is doing, and the hardening that any concurrency primitive needs to actually be safe: atomic inject-if-running to close the injection-loss race, finalize-drain, cascade cancel so killing a parent kills its children with no orphans, fan-out budget and TTL caps. By the end of the day one Anton can run multiple long-lived threads in parallel and a user can @mention into any of them without restarting anything.",[14,1816,1818],{"id":1817},"skills-cleanup","Skills cleanup",[10,1820,1821,1822,1825,1826,1829,1830,1833,1834,1837,1838,1841,1842,1845],{},"Then a cleanup that's been waiting since chapter 6. Five commits over an afternoon delete ",[39,1823,1824],{},"@anton\u002Fskills"," entirely, the original skills package from chapter 1 that became a barrel after chapter 4's ",[39,1827,1828],{},"defineSkill"," rewrite and dead weight after the move to Deno. 
The new layout is three rules: skills live at ",[39,1831,1832],{},"skills\u002F\u003Cdomain>\u002F\u003Cname>.skill.ts"," as first-class hot-reloadable units, ",[39,1835,1836],{},"skills\u002F\u003Cdomain>\u002F_lib\u002F"," holds small stable helpers shared inside one domain, and a thin ",[39,1839,1840],{},"skills-shared"," facade exposes the narrow Node-side surface that worker and agent and transports need. The principle underneath is simple: libraries are boring and fixed, skills evolve. If a ",[39,1843,1844],{},"_lib"," helper needs editing to support a feature, that's the signal it wants to be a skill. The same instinct produces the storage decision tree on the same day: a documented rule for choosing between facts (free-form), collections (typed-shape), files (blobs), and the family vault. Last month's organic growth produced overlap between all four, and a decision tree is cheaper than a refactor.",[10,1847,1848],{},"Around the same window, the coder agent gets a three-tier write scope. Tier 1: prompts only. Tier 2: skills plus prompts. Tier 3: any code. The tier is set per invocation and the coder cannot escalate itself. That's what makes the self-improvement loop safe enough to leave running unattended: the loop fixing a prompt regression has no permission to rewrite agent infrastructure to do it.",[14,1850,1852],{"id":1851},"spawn-and-awakening","Spawn and awakening",[10,1854,1855],{},"Spawn and awakening land on the 22nd, and they finish the federation story that started with the replication engine in chapter 5 and the mesh in chapter 6. Spawn is parent-side: it provisions infrastructure for a new clone, copies prompts over, seeds an identity, and registers the clone in the mesh. Awakening is clone-side: the new instance learns what it's for from its operator through a guided onboarding conversation, runs self-diagnostics, and keeps a mentor channel open back to the parent for questions. A clone isn't a docker stack any more. 
It's an Anton that wakes up, finds out who it is, and joins its peer.",[14,1857,1859],{"id":1858},"the-cast","The cast",[10,1861,1862],{},"The same day flips Gustav (local Gemma) to primary inference for every agent. It's a one-line change because of the work in chapter 4 that made prompts a single source of truth. The savings are real. Quality regressions are caught by the self-improvement loop and resolved through the cast: intent-gated escalation goes in, so the agent only reaches for a strong model when the intent of the request needs it (research, complex reasoning), and the cast formalizes models as characters. Each model is a named specialist with a prompt-defined personality and area of strength. The agent picks who to ask the way a person picks who on their team to email. The LiteLLM codenames (sunny, gizmo, gandalf, gustav, william) have been hinting at this since chapter 4. The cast makes it explicit: ask specialists by name, and William gets smarter.",[14,1864,1866],{"id":1865},"a-usable-heartbeat","A usable heartbeat",[10,1868,1869],{},"The heartbeat from chapter 7 gets the operational pass that makes it usable continuously. Idempotent memory writes plus a topics collection so re-running the same observation doesn't duplicate facts. Thread-aware so it doesn't interrupt active conversations. A loop that remembers what it just notified about and stays quiet rather than re-touching the same topic every tick. The outbound gateway from chapter 7 makes silencing silent bookkeeping events a one-place fix. Hallucinated-notification retry actually sends now instead of just removing the claim, and the simplified-response layer catches LaTeX and other artifacts before they reach Clara. The heartbeat ends the period as a usable proactive layer: observing, deciding when to speak, staying silent the rest of the time.",[10,1871,1872],{},"The SimplySyndic writes finally land too: a Playwright script that closes the loop on the syndic domain. 
Reading and reconciling have been working since chapter 7; writing was the missing half. The next step is replacing Playwright with a deterministic HTTP path now that the read side has shown what the API looks like, and that spec is on the list, not done. Vaultwarden gets a documentation purge in the same window, the rule being that docs describe the current state only, no historical mentions of what an earlier version did.",[10,1874,1875],{},"Where Anton is at the end of all this: one identity, ten domain agents, a cast of named specialists, thread-aware concurrency. Every prompt in the database, so flipping the default model is one row edit. Local Gemma 4 NVFP4 as primary inference, cloud as fallback. Encrypted DB secrets and scoped Deno Worker permissions per skill. Mesh, replication, spawn, awakening for clones. A heartbeat that observes operational state and mostly stays quiet. A self-improvement loop with deploy tracking and regression detection. Two transports (WhatsApp, Telegram), both routing through one worker and one outbound gateway. A hundred-plus quality tests, full execution traces in Postgres, an LCARS dashboard for everything.",[10,1877,1878],{},"The plumbing that took six chapters to build is what most people would call boring infrastructure. That's fine. The substrate is built. The interesting behavior happens on top of it now: the cast, the heartbeat, the self-improvement loop, the long-running threads talking to each other and to us.",[10,1880,1881],{},"There is open work on the list. Doctolib syncing to calendar as a scheduled job. Auto-importing WhatsApp group members as users and mapping their JIDs to roles. Permission flows for new users at scale. The SimplySyndic write path migrating from Playwright to deterministic HTTP. Making Gemma 4 the assumed default everywhere it isn't yet. 
Open questions for the next stretch, not promises.",{"title":89,"searchDepth":90,"depth":90,"links":1883},[1884,1885,1886,1887,1888],{"id":1803,"depth":90,"text":1804},{"id":1817,"depth":90,"text":1818},{"id":1851,"depth":90,"text":1852},{"id":1858,"depth":90,"text":1859},{"id":1865,"depth":90,"text":1866},"2026-04-19",{},"\u002Fwriting\u002F2026-04-19-anton-09-threads",{"title":1795,"description":1800},"writing\u002F2026-04-19-anton-09-threads","Concurrency, awakening clones, and a cast of named model specialists working as a team.",[108,109,110],"9NLLnlfyMQjr9t8MFAeFIIOu0wHDf2jS9c3L8X_9eEU",{"id":1898,"title":1899,"body":1900,"canonical_url":97,"date":3263,"description":3264,"extension":99,"meta":3265,"navigation":101,"path":3266,"seo":3267,"series":921,"stem":3268,"summary":3269,"tags":3270,"work_slug":97,"__hash__":3272},"writing\u002Fwriting\u002F2026-04-22-dgx-spark-gemma4.md","Gemma 4 NVFP4 on the DGX Spark: 271 tok\u002Fs at 8 concurrent, native tool calling and reasoning",{"type":7,"value":1901,"toc":3237},[1902,1907,1912,1941,1943,1947,1953,1963,1966,1969,1981,1985,1988,2032,2042,2048,2052,2059,2073,2080,2084,2089,2098,2115,2118,2122,2135,2143,2146,2150,2157,2161,2168,2172,2190,2194,2429,2432,2467,2470,2474,2477,2813,2816,2823,2827,2830,2834,2874,2877,2881,2884,2947,2950,2954,2957,2988,2994,2998,3024,3027,3031,3038,3042,3045,3104,3107,3113,3117,3120,3146,3153,3157,3186,3188,3215,3217,3234],[10,1903,1904,1906],{},[304,1905,306],{}," We set up Google's Gemma 4 26B-A4B NVFP4 on the XRPL Commons office DGX Spark, with native tool calling, reasoning mode, 131K context, and multimodal input preserved on 128GB of unified memory. We then tried to push throughput further and hit the hardware's real limits. 
Here's the full story.",[10,1908,1909],{},[304,1910,1911],{},"Headline numbers:",[325,1913,1914,1920,1926,1932,1938],{},[328,1915,1916,1919],{},[304,1917,1918],{},"271 tok\u002Fs"," aggregate at 8 concurrent requests",[328,1921,1922,1925],{},[304,1923,1924],{},"51 tok\u002Fs"," single-stream sustained",[328,1927,1928,1931],{},[304,1929,1930],{},"54 ms"," time-to-first-token (short prompts)",[328,1933,1934,1937],{},[304,1935,1936],{},"8.2x"," prefix cache speedup on warm caches",[328,1939,1940],{},"131K context, native tool calling, reasoning mode",[309,1942],{},[14,1944,1946],{"id":1945},"background","Background",[10,1948,1949,1952],{},[464,1950,1951],{"href":919},"In March",", we documented how we got 51-54 tokens\u002Fsec out of the NVIDIA DGX Spark by combining a Mixture-of-Experts model (Qwen3-30B-A3B), FP8 quantization, and the Avarok community Docker image to work around Blackwell SM 12.1 support gaps.",[10,1954,1955,1956,46,1959,1962],{},"That stack was stable and served our general-purpose inference well. 
But when we installed a DGX Spark at the XRPL Commons office, we wanted more: ",[304,1957,1958],{},"native tool calling",[304,1960,1961],{},"reasoning mode",", features that turn an LLM from a text generator into an agent.",[10,1964,1965],{},"Qwen3 can do tool calling with the right prompting, but we wanted to evaluate Gemma 4, which Google released in April with first-class function calling and a configurable thinking mode built for agentic workflows.",[10,1967,1968],{},"This post covers:",[1970,1971,1972,1975,1978],"ol",{},[328,1973,1974],{},"Why we picked Gemma 4 for the office Spark",[328,1976,1977],{},"What had to change in the deployment",[328,1979,1980],{},"The throughput we actually measured",[14,1982,1984],{"id":1983},"why-gemma-4","Why Gemma 4",[10,1986,1987],{},"Google released Gemma 4 in early April 2026 with an MoE variant that fits the DGX Spark sweet spot:",[325,1989,1990,1996,2002,2008,2014,2020,2026],{},[328,1991,1992,1995],{},[304,1993,1994],{},"gemma-4-26B-A4B",", 26B total parameters, 4B active per token",[328,1997,1998,2001],{},[304,1999,2000],{},"256K context"," (we run with 131K to fit KV cache)",[328,2003,2004,2007],{},[304,2005,2006],{},"Native function calling"," with a dedicated tool-call parser",[328,2009,2010,2013],{},[304,2011,2012],{},"Configurable thinking mode"," (reasoning parser), the model can emit a private thought process before responding",[328,2015,2016,2019],{},[304,2017,2018],{},"Multimodal",", text + image input (plus audio on smaller variants)",[328,2021,2022,2025],{},[304,2023,2024],{},"140+ languages"," supported",[328,2027,2028,2031],{},[304,2029,2030],{},"Apache 2.0 license",", no strings attached",[10,2033,2034,2035,2038,2039,2041],{},"For our use case, an agent calling local XRPL tools, drafting documents, coordinating with team members, native tool calling is the big unlock. 
You can pass an OpenAI-style ",[39,2036,2037],{},"tools"," array to the chat completions API and the model emits structured ",[39,2040,1633],{}," back. No prompting gymnastics, no custom parsing.",[10,2043,2044,2045,2047],{},"The MoE architecture keeps us on the right side of the Spark's 273 GB\u002Fs memory bandwidth wall. 4B active parameters is slightly more than Qwen3's 3B, so we expected slightly lower single-stream throughput. In practice we measured ",[304,2046,1924],{}," at steady state, essentially the theoretical ceiling (273 GB\u002Fs ÷ ~4B active weights @ FP4 ≈ 50 tok\u002Fs). The trade for slightly fewer tokens per second: native tool calling, reasoning mode, 131K context, and multimodal input.",[14,2049,2051],{"id":2050},"the-model-choice-nvfp4","The Model Choice: NVFP4",[10,2053,2054,2055,2058],{},"We picked the ",[304,2056,2057],{},"bg-digitalservices\u002FGemma-4-26B-A4B-it-NVFP4"," quantization. A few reasons:",[325,2060,2061,2064,2067,2070],{},[328,2062,2063],{},"NVFP4 (NVIDIA's FP4 format) is designed specifically for Blackwell's fifth-gen Tensor Cores",[328,2065,2066],{},"The GB10 supports NVFP4 natively in hardware",[328,2068,2069],{},"Quality degradation vs bf16 is reportedly minimal for instruction-tuned models",[328,2071,2072],{},"It's the smallest on-disk footprint, giving us more KV cache headroom",[10,2074,2075,2076,2079],{},"For the inference engine, we moved off the Avarok image for this model. Avarok is excellent for Qwen-style models, but Gemma 4 support is still landing upstream. We switched to the official vLLM image ",[39,2077,2078],{},"vllm\u002Fvllm-openai:gemma4-cu130",", a Gemma-4-specific build from the vLLM team with CUDA 13.0.",[14,2081,2083],{"id":2082},"what-broke-and-how-we-fixed-it","What Broke (and How We Fixed It)",[2085,2086,2088],"h3",{"id":2087},"the-gemma4py-bug","The gemma4.py Bug",[10,2090,2091,2092,2097],{},"The stock vLLM Gemma 4 model executor crashed on load with our NVFP4 checkpoint. 
We had to mount a ",[304,2093,2094,2095],{},"patched ",[39,2096,1658],{}," into the container to replace the built-in one:",[614,2099,2101],{"className":616,"code":2100,"language":618,"meta":89,"style":89},"-v \u002Fhome\u002F$USER\u002Fvllm\u002Fgemma4_patched.py:\u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Fvllm\u002Fmodel_executor\u002Fmodels\u002Fgemma4.py:ro\n",[39,2102,2103],{"__ignoreMap":89},[622,2104,2105,2108,2110,2112],{"class":624,"line":625},[622,2106,2107],{"class":628},"-v",[622,2109,717],{"class":632},[622,2111,720],{"class":655},[622,2113,2114],{"class":632},"\u002Fvllm\u002Fgemma4_patched.py:\u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Fvllm\u002Fmodel_executor\u002Fmodels\u002Fgemma4.py:ro\n",[10,2116,2117],{},"The patch is ~50KB of Python, mostly adjustments to how NVFP4 weights are loaded into the fused MoE layers.",[2085,2119,2121],{"id":2120},"nvfp4-moe-backend-selection","NVFP4 MoE Backend Selection",[10,2123,2124,2125,2130,2131,2134],{},"The default MoE backend for NVFP4 on GB10 can trigger ",[464,2126,2129],{"href":2127,"rel":2128},"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fissues\u002F39000",[468],"a known vLLM crash",". Explicitly selecting the ",[304,2132,2133],{},"Marlin"," backend avoids it:",[614,2136,2141],{"className":2137,"code":2139,"language":2140},[2138],"language-text","--moe-backend marlin\n","text",[39,2142,2139],{"__ignoreMap":89},[10,2144,2145],{},"Marlin is a mixed-precision GEMM kernel for 4-bit weights (INT4\u002FFP4) that's mature and fast on Blackwell.",[2085,2147,2149],{"id":2148},"heterogeneous-head-dimensions","Heterogeneous Head Dimensions",[10,2151,2152,2153,2156],{},"Gemma 4 uses different attention head dimensions for local vs global attention (256 vs 512). vLLM automatically detects this and forces the ",[304,2154,2155],{},"TRITON_ATTN"," backend to avoid mixed-backend numerical divergence. 
Nothing for us to configure, it just works, but worth understanding why Flash Attention isn't in play here.",[2085,2158,2160],{"id":2159},"kv-cache-memory","KV Cache Memory",[10,2162,2163,2164,2167],{},"We enable ",[39,2165,2166],{},"--kv-cache-dtype fp8"," to halve KV cache memory. With 131K context and 16 concurrent sequences, this makes the difference between fitting in 128GB and OOMing. Quality impact is negligible for our workloads.",[2085,2169,2171],{"id":2170},"docker-runtime-gotcha","Docker Runtime Gotcha",[10,2173,2174,2175,2178,2179,2182,2183,2186,2187,602],{},"On the office Spark, the ",[39,2176,2177],{},"nvidia"," Docker runtime wasn't registered (only the container toolkit was installed). The fix was to use ",[39,2180,2181],{},"--gpus all"," instead of ",[39,2184,2185],{},"--runtime nvidia",", which doesn't need the runtime registered. Worth knowing if you see ",[39,2188,2189],{},"unknown or invalid runtime name: nvidia",[14,2191,2193],{"id":2192},"the-full-launch-command","The Full Launch Command",[614,2195,2197],{"className":616,"code":2196,"language":618,"meta":89,"style":89},"docker run -d \\\n  --name vllm-avarok \\\n  --gpus all \\\n  --shm-size=16g \\\n  --restart unless-stopped \\\n  -p 8000:8888 \\\n  -v \u002Fhome\u002F$USER\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  -v \u002Fhome\u002F$USER\u002Fvllm\u002Fgemma4_patched.py:\u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Fvllm\u002Fmodel_executor\u002Fmodels\u002Fgemma4.py:ro \\\n  vllm\u002Fvllm-openai:gemma4-cu130 \\\n  --model bg-digitalservices\u002FGemma-4-26B-A4B-it-NVFP4 \\\n  --served-model-name google\u002Fgemma-4-26B-A4B-it \\\n  --host 0.0.0.0 \\\n  --port 8888 \\\n  --quantization modelopt \\\n  --moe-backend marlin \\\n  --kv-cache-dtype fp8 \\\n  --enable-prefix-caching \\\n  --enable-chunked-prefill \\\n  --max-model-len 131072 \\\n  --gpu-memory-utilization 0.85 \\\n  --max-num-seqs 16 \\\n  --enable-auto-tool-choice \\\n  
--tool-call-parser gemma4 \\\n  --reasoning-parser gemma4\n",[39,2198,2199,2209,2218,2226,2232,2240,2248,2260,2273,2280,2290,2300,2310,2320,2330,2341,2352,2360,2368,2379,2390,2401,2409,2420],{"__ignoreMap":89},[622,2200,2201,2203,2205,2207],{"class":624,"line":625},[622,2202,629],{"class":628},[622,2204,649],{"class":632},[622,2206,652],{"class":632},[622,2208,656],{"class":655},[622,2210,2211,2213,2216],{"class":624,"line":90},[622,2212,662],{"class":632},[622,2214,2215],{"class":632}," vllm-avarok",[622,2217,656],{"class":655},[622,2219,2220,2222,2224],{"class":624,"line":644},[622,2221,673],{"class":632},[622,2223,676],{"class":632},[622,2225,656],{"class":655},[622,2227,2228,2230],{"class":624,"line":659},[622,2229,684],{"class":632},[622,2231,656],{"class":655},[622,2233,2234,2236,2238],{"class":624,"line":670},[622,2235,692],{"class":632},[622,2237,695],{"class":632},[622,2239,656],{"class":655},[622,2241,2242,2244,2246],{"class":624,"line":681},[622,2243,703],{"class":632},[622,2245,706],{"class":632},[622,2247,656],{"class":655},[622,2249,2250,2252,2254,2256,2258],{"class":624,"line":689},[622,2251,714],{"class":632},[622,2253,717],{"class":632},[622,2255,720],{"class":655},[622,2257,723],{"class":632},[622,2259,656],{"class":655},[622,2261,2262,2264,2266,2268,2271],{"class":624,"line":700},[622,2263,714],{"class":632},[622,2265,717],{"class":632},[622,2267,720],{"class":655},[622,2269,2270],{"class":632},"\u002Fvllm\u002Fgemma4_patched.py:\u002Fusr\u002Flocal\u002Flib\u002Fpython3.12\u002Fdist-packages\u002Fvllm\u002Fmodel_executor\u002Fmodels\u002Fgemma4.py:ro",[622,2272,656],{"class":655},[622,2274,2275,2278],{"class":624,"line":711},[622,2276,2277],{"class":632},"  vllm\u002Fvllm-openai:gemma4-cu130",[622,2279,656],{"class":655},[622,2281,2282,2285,2288],{"class":624,"line":728},[622,2283,2284],{"class":632},"  --model",[622,2286,2287],{"class":632}," 
bg-digitalservices\u002FGemma-4-26B-A4B-it-NVFP4",[622,2289,656],{"class":655},[622,2291,2292,2295,2298],{"class":624,"line":739},[622,2293,2294],{"class":632},"  --served-model-name",[622,2296,2297],{"class":632}," google\u002Fgemma-4-26B-A4B-it",[622,2299,656],{"class":655},[622,2301,2302,2305,2308],{"class":624,"line":753},[622,2303,2304],{"class":632},"  --host",[622,2306,2307],{"class":747}," 0.0.0.0",[622,2309,656],{"class":655},[622,2311,2312,2315,2318],{"class":624,"line":766},[622,2313,2314],{"class":632},"  --port",[622,2316,2317],{"class":747}," 8888",[622,2319,656],{"class":655},[622,2321,2322,2325,2328],{"class":624,"line":779},[622,2323,2324],{"class":632},"  --quantization",[622,2326,2327],{"class":632}," modelopt",[622,2329,656],{"class":655},[622,2331,2333,2336,2339],{"class":624,"line":2332},15,[622,2334,2335],{"class":632},"  --moe-backend",[622,2337,2338],{"class":632}," marlin",[622,2340,656],{"class":655},[622,2342,2344,2347,2350],{"class":624,"line":2343},16,[622,2345,2346],{"class":632},"  --kv-cache-dtype",[622,2348,2349],{"class":632}," fp8",[622,2351,656],{"class":655},[622,2353,2355,2358],{"class":624,"line":2354},17,[622,2356,2357],{"class":632},"  --enable-prefix-caching",[622,2359,656],{"class":655},[622,2361,2363,2366],{"class":624,"line":2362},18,[622,2364,2365],{"class":632},"  --enable-chunked-prefill",[622,2367,656],{"class":655},[622,2369,2371,2374,2377],{"class":624,"line":2370},19,[622,2372,2373],{"class":632},"  --max-model-len",[622,2375,2376],{"class":747}," 131072",[622,2378,656],{"class":655},[622,2380,2382,2385,2388],{"class":624,"line":2381},20,[622,2383,2384],{"class":632},"  --gpu-memory-utilization",[622,2386,2387],{"class":747}," 0.85",[622,2389,656],{"class":655},[622,2391,2393,2396,2399],{"class":624,"line":2392},21,[622,2394,2395],{"class":632},"  --max-num-seqs",[622,2397,2398],{"class":747}," 16",[622,2400,656],{"class":655},[622,2402,2404,2407],{"class":624,"line":2403},22,[622,2405,2406],{"class":632},"  
--enable-auto-tool-choice",[622,2408,656],{"class":655},[622,2410,2412,2415,2418],{"class":624,"line":2411},23,[622,2413,2414],{"class":632},"  --tool-call-parser",[622,2416,2417],{"class":632}," gemma4",[622,2419,656],{"class":655},[622,2421,2423,2426],{"class":624,"line":2422},24,[622,2424,2425],{"class":632},"  --reasoning-parser",[622,2427,2428],{"class":632}," gemma4\n",[10,2430,2431],{},"A few flags to highlight:",[325,2433,2434,2440,2448,2453,2462],{},[328,2435,2436,2439],{},[39,2437,2438],{},"--served-model-name google\u002Fgemma-4-26B-A4B-it",", clients use this name in the API, regardless of the underlying quantized checkpoint. Makes it easy to swap quantization levels later.",[328,2441,2442,2444,2445,2447],{},[39,2443,1607],{}," + ",[39,2446,1626],{},", native function calling, OpenAI-compatible API.",[328,2449,2450,2452],{},[39,2451,1629],{},", enables the thinking mode output channel.",[328,2454,2455,2444,2458,2461],{},[39,2456,2457],{},"--enable-prefix-caching",[39,2459,2460],{},"--enable-chunked-prefill",", standard vLLM performance wins.",[328,2463,2464,2466],{},[39,2465,1691],{},", we tune this down from the default 128 because our workloads are small-team, not high-concurrency serving. Frees up memory for larger context windows.",[10,2468,2469],{},"First boot takes 10-20 minutes (model download + load + CUDA graph capture). 
Subsequent restarts take ~3 minutes.",[14,2471,2473],{"id":2472},"what-tool-calling-unlocks","What Tool Calling Unlocks",[10,2475,2476],{},"With Gemma 4 running, we can now pass tool schemas directly to the API:",[614,2478,2482],{"className":2479,"code":2480,"language":2481,"meta":89,"style":89},"language-python shiki shiki-themes min-light","from openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\u002F\u002Fspark:8000\u002Fv1\", api_key=\"unused\")\n\ntools = [{\n    \"type\": \"function\",\n    \"function\": {\n        \"name\": \"get_xrpl_account_balance\",\n        \"description\": \"Fetch the XRP balance for an XRPL account\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"address\": {\"type\": \"string\", \"description\": \"Classic XRPL address starting with r\"}\n            },\n            \"required\": [\"address\"]\n        }\n    }\n}]\n\nresponse = client.chat.completions.create(\n    model=\"google\u002Fgemma-4-26B-A4B-it\",\n    messages=[{\"role\": \"user\", \"content\": \"How much XRP does rMCU4... 
hold?\"}],\n    tools=tools,\n    tool_choice=\"auto\",\n)\n\n# response.choices[0].message.tool_calls -> [ChatCompletionMessageToolCall(...)]\n","python",[39,2483,2484,2499,2503,2535,2539,2552,2566,2576,2588,2600,2609,2621,2630,2662,2667,2682,2687,2692,2699,2703,2731,2743,2774,2784,2796,2801,2806],{"__ignoreMap":89},[622,2485,2486,2490,2493,2496],{"class":624,"line":625},[622,2487,2489],{"class":2488},"s-F7R","from",[622,2491,2492],{"class":655}," openai ",[622,2494,2495],{"class":2488},"import",[622,2497,2498],{"class":655}," OpenAI\n",[622,2500,2501],{"class":624,"line":90},[622,2502,641],{"emptyLinePlaceholder":101},[622,2504,2505,2508,2511,2514,2518,2520,2524,2527,2529,2532],{"class":624,"line":644},[622,2506,2507],{"class":655},"client ",[622,2509,2510],{"class":2488},"=",[622,2512,2513],{"class":628}," OpenAI",[622,2515,2517],{"class":2516},"siqTm","(base_url",[622,2519,2510],{"class":2488},[622,2521,2523],{"class":2522},"shJU0","\"http:\u002F\u002Fspark:8000\u002Fv1\"",[622,2525,2526],{"class":2516},", api_key",[622,2528,2510],{"class":2488},[622,2530,2531],{"class":2522},"\"unused\"",[622,2533,2534],{"class":2516},")\n",[622,2536,2537],{"class":624,"line":659},[622,2538,641],{"emptyLinePlaceholder":101},[622,2540,2541,2544,2546,2549],{"class":624,"line":670},[622,2542,2543],{"class":655},"tools ",[622,2545,2510],{"class":2488},[622,2547,2548],{"class":655}," [",[622,2550,2551],{"class":2516},"{\n",[622,2553,2554,2557,2560,2563],{"class":624,"line":681},[622,2555,2556],{"class":2522},"    \"type\"",[622,2558,2559],{"class":2516},":",[622,2561,2562],{"class":2522}," \"function\"",[622,2564,2565],{"class":2516},",\n",[622,2567,2568,2571,2573],{"class":624,"line":689},[622,2569,2570],{"class":2522},"    \"function\"",[622,2572,2559],{"class":2516},[622,2574,2575],{"class":2516}," {\n",[622,2577,2578,2581,2583,2586],{"class":624,"line":700},[622,2579,2580],{"class":2522},"        \"name\"",[622,2582,2559],{"class":2516},[622,2584,2585],{"class":2522}," 
\"get_xrpl_account_balance\"",[622,2587,2565],{"class":2516},[622,2589,2590,2593,2595,2598],{"class":624,"line":711},[622,2591,2592],{"class":2522},"        \"description\"",[622,2594,2559],{"class":2516},[622,2596,2597],{"class":2522}," \"Fetch the XRP balance for an XRPL account\"",[622,2599,2565],{"class":2516},[622,2601,2602,2605,2607],{"class":624,"line":728},[622,2603,2604],{"class":2522},"        \"parameters\"",[622,2606,2559],{"class":2516},[622,2608,2575],{"class":2516},[622,2610,2611,2614,2616,2619],{"class":624,"line":739},[622,2612,2613],{"class":2522},"            \"type\"",[622,2615,2559],{"class":2516},[622,2617,2618],{"class":2522}," \"object\"",[622,2620,2565],{"class":2516},[622,2622,2623,2626,2628],{"class":624,"line":753},[622,2624,2625],{"class":2522},"            \"properties\"",[622,2627,2559],{"class":2516},[622,2629,2575],{"class":2516},[622,2631,2632,2635,2637,2640,2643,2645,2648,2651,2654,2656,2659],{"class":624,"line":766},[622,2633,2634],{"class":2522},"                \"address\"",[622,2636,2559],{"class":2516},[622,2638,2639],{"class":2516}," {",[622,2641,2642],{"class":2522},"\"type\"",[622,2644,2559],{"class":2516},[622,2646,2647],{"class":2522}," \"string\"",[622,2649,2650],{"class":2516},",",[622,2652,2653],{"class":2522}," \"description\"",[622,2655,2559],{"class":2516},[622,2657,2658],{"class":2522}," \"Classic XRPL address starting with r\"",[622,2660,2661],{"class":2516},"}\n",[622,2663,2664],{"class":624,"line":779},[622,2665,2666],{"class":2516},"            },\n",[622,2668,2669,2672,2674,2676,2679],{"class":624,"line":2332},[622,2670,2671],{"class":2522},"            \"required\"",[622,2673,2559],{"class":2516},[622,2675,2548],{"class":655},[622,2677,2678],{"class":2522},"\"address\"",[622,2680,2681],{"class":655},"]\n",[622,2683,2684],{"class":624,"line":2343},[622,2685,2686],{"class":2516},"        }\n",[622,2688,2689],{"class":624,"line":2354},[622,2690,2691],{"class":2516},"    
}\n",[622,2693,2694,2697],{"class":624,"line":2362},[622,2695,2696],{"class":2516},"}",[622,2698,2681],{"class":655},[622,2700,2701],{"class":624,"line":2370},[622,2702,641],{"emptyLinePlaceholder":101},[622,2704,2705,2708,2710,2713,2715,2718,2720,2723,2725,2728],{"class":624,"line":2381},[622,2706,2707],{"class":655},"response ",[622,2709,2510],{"class":2488},[622,2711,2712],{"class":655}," client",[622,2714,602],{"class":2516},[622,2716,2717],{"class":655},"chat",[622,2719,602],{"class":2516},[622,2721,2722],{"class":655},"completions",[622,2724,602],{"class":2516},[622,2726,2727],{"class":628},"create",[622,2729,2730],{"class":2516},"(\n",[622,2732,2733,2736,2738,2741],{"class":624,"line":2392},[622,2734,2735],{"class":2516},"    model",[622,2737,2510],{"class":2488},[622,2739,2740],{"class":2522},"\"google\u002Fgemma-4-26B-A4B-it\"",[622,2742,2565],{"class":2516},[622,2744,2745,2748,2750,2753,2756,2758,2761,2763,2766,2768,2771],{"class":624,"line":2403},[622,2746,2747],{"class":2516},"    messages",[622,2749,2510],{"class":2488},[622,2751,2752],{"class":2516},"[{",[622,2754,2755],{"class":2522},"\"role\"",[622,2757,1122],{"class":2516},[622,2759,2760],{"class":2522},"\"user\"",[622,2762,1126],{"class":2516},[622,2764,2765],{"class":2522},"\"content\"",[622,2767,1122],{"class":2516},[622,2769,2770],{"class":2522},"\"How much XRP does rMCU4... 
hold?\"",[622,2772,2773],{"class":2516},"}],\n",[622,2775,2776,2779,2781],{"class":624,"line":2411},[622,2777,2778],{"class":2516},"    tools",[622,2780,2510],{"class":2488},[622,2782,2783],{"class":2516},"tools,\n",[622,2785,2786,2789,2791,2794],{"class":624,"line":2422},[622,2787,2788],{"class":2516},"    tool_choice",[622,2790,2510],{"class":2488},[622,2792,2793],{"class":2522},"\"auto\"",[622,2795,2565],{"class":2516},[622,2797,2799],{"class":624,"line":2798},25,[622,2800,2534],{"class":2516},[622,2802,2804],{"class":624,"line":2803},26,[622,2805,641],{"emptyLinePlaceholder":101},[622,2807,2809],{"class":624,"line":2808},27,[622,2810,2812],{"class":2811},"s15Vz","# response.choices[0].message.tool_calls -> [ChatCompletionMessageToolCall(...)]\n",[10,2814,2815],{},"This is the foundation for agentic workflows that previously required prompt engineering or separate function-calling layers (like LiteLLM's function-call emulation). With Gemma 4, the model speaks tool calls natively.",[10,2817,2818,2819,2822],{},"The reasoning mode is similarly useful. For complex queries, the model emits a private thinking trace before its final answer, we can log it for debugging, show it to users for transparency, or strip it entirely. 
All via the ",[39,2820,2821],{},"--reasoning-parser"," flag.",[14,2824,2826],{"id":2825},"the-benchmarks","The Benchmarks",[10,2828,2829],{},"We ran a proper benchmark suite from the Spark itself (not over the network, to remove noise): throughput at 100\u002F500\u002F2000-token outputs, time-to-first-token, concurrency scaling, prefix-cache hit\u002Fmiss, long-context prefill, and tool-calling overhead.",[2085,2831,2833],{"id":2832},"single-stream-throughput","Single-stream throughput",[501,2835,2836,2846],{},[504,2837,2838],{},[507,2839,2840,2843],{},[510,2841,2842],{},"Generation length",[510,2844,2845],{},"tok\u002Fs",[523,2847,2848,2856,2866],{},[507,2849,2850,2853],{},[528,2851,2852],{},"100 tokens",[528,2854,2855],{},"46",[507,2857,2858,2861],{},[528,2859,2860],{},"500 tokens",[528,2862,2863],{},[304,2864,2865],{},"51",[507,2867,2868,2871],{},[528,2869,2870],{},"2000 tokens",[528,2872,2873],{},"50",[10,2875,2876],{},"The short-generation number is lower only because warmup dominates. Anything over ~200 tokens hits the steady-state ~50 tok\u002Fs. That matches the theoretical ceiling for 4B active parameters at FP4 on 273 GB\u002Fs bandwidth.",[2085,2878,2880],{"id":2879},"concurrency-the-big-win-for-agentic-workflows","Concurrency, the big win for agentic workflows",[10,2882,2883],{},"This is where the Spark shines. 
vLLM's continuous batching + Marlin NVFP4 MoE kernel keeps per-request throughput high even as you add parallel clients:",[501,2885,2886,2899],{},[504,2887,2888],{},[507,2889,2890,2893,2896],{},[510,2891,2892],{},"Parallel requests",[510,2894,2895],{},"Aggregate tok\u002Fs",[510,2897,2898],{},"Per-request tok\u002Fs",[523,2900,2901,2910,2921,2932],{},[507,2902,2903,2906,2908],{},[528,2904,2905],{},"1",[528,2907,2873],{},[528,2909,2873],{},[507,2911,2912,2915,2918],{},[528,2913,2914],{},"2",[528,2916,2917],{},"90",[528,2919,2920],{},"45",[507,2922,2923,2926,2929],{},[528,2924,2925],{},"4",[528,2927,2928],{},"161",[528,2930,2931],{},"40",[507,2933,2934,2939,2944],{},[528,2935,2936],{},[304,2937,2938],{},"8",[528,2940,2941],{},[304,2942,2943],{},"271",[528,2945,2946],{},"34",[10,2948,2949],{},"5.4× aggregate throughput at 8 concurrent clients. For a team of devs running agents in parallel, or a single agent making parallel tool-call decisions, this is huge.",[2085,2951,2953],{"id":2952},"prefix-caching-massive-for-agent-loops","Prefix caching, massive for agent loops",[10,2955,2956],{},"Agent workflows resubmit the same system prompt, same tool schema, same conversation history over and over. 
Measured on an 8K-token prefix:",[501,2958,2959,2968],{},[504,2960,2961],{},[507,2962,2963,2965],{},[510,2964],{},[510,2966,2967],{},"TTFT",[523,2969,2970,2978],{},[507,2971,2972,2975],{},[528,2973,2974],{},"Cold (cache miss)",[528,2976,2977],{},"834 ms",[507,2979,2980,2983],{},[528,2981,2982],{},"Warm (cache hit)",[528,2984,2985],{},[304,2986,2987],{},"102 ms",[10,2989,2990,2993],{},[304,2991,2992],{},"8.2× speedup."," If you're building anything where the same prefix repeats, ReAct loops, multi-turn conversations, a shared system prompt, prefix caching alone pays for running local inference.",[2085,2995,2997],{"id":2996},"ttft-and-long-context","TTFT and long context",[325,2999,3000,3006,3009,3012,3018],{},[328,3001,3002,3003,3005],{},"Short prompt TTFT: ",[304,3004,1930],{}," p50 (excellent)",[328,3007,3008],{},"4K prompt TTFT: 62 ms p50",[328,3010,3011],{},"8K prefill: 73 ms",[328,3013,3014,3015],{},"32K prefill: ",[304,3016,3017],{},"10.1 seconds",[328,3019,3020,3021],{},"100K prefill: ",[304,3022,3023],{},"78 seconds",[10,3025,3026],{},"Long-context prefill is the one real weakness. It's compute-bound on the MoE GEMM kernel, not bandwidth-bound. At 131K context you're paying 80+ seconds of prefill time every cold turn. If you need long-context frequently, design your agents to reuse prefixes so the cache does the heavy lifting.",[2085,3028,3030],{"id":3029},"tool-calling","Tool calling",[10,3032,3033,3034,3037],{},"Works out of the box with ",[39,3035,3036],{},"--enable-auto-tool-choice --tool-call-parser gemma4",". No meaningful overhead versus text-only completion, the model just emits shorter, structured output when it decides to call a tool.",[14,3039,3041],{"id":3040},"trying-to-push-throughput-further","Trying to Push Throughput Further",[10,3043,3044],{},"With the baseline measured, we tried four paths to push single-stream throughput higher. 
All four hit walls:",[501,3046,3047,3057],{},[504,3048,3049],{},[507,3050,3051,3054],{},[510,3052,3053],{},"Attempt",[510,3055,3056],{},"Result",[523,3058,3059,3070,3080,3096],{},[507,3060,3061,3067],{},[528,3062,3063,3064],{},"Swap MoE backend to ",[39,3065,3066],{},"flashinfer_trtllm",[528,3068,3069],{},"Kernel doesn't support SM 12.1, engine fails to start",[507,3071,3072,3077],{},[528,3073,3063,3074],{},[39,3075,3076],{},"flashinfer_cutlass",[528,3078,3079],{},"Doesn't support GELU activation (Gemma 4 uses GELU)",[507,3081,3082,3093],{},[528,3083,3084,3085,3088,3089,3092],{},"Bump ",[39,3086,3087],{},"--gpu-memory-utilization"," 0.85 → 0.90 + ",[39,3090,3091],{},"--max-num-seqs"," 16 → 32",[528,3094,3095],{},"Zero measurable change, we're not KV-cache-limited at this scale",[507,3097,3098,3101],{},[528,3099,3100],{},"Speculative decoding with Gemma-4-E4B draft",[528,3102,3103],{},"Blocked, vLLM's spec decoding doesn't support multimodal target models",[10,3105,3106],{},"The last one hurt. Spec decoding was the one real lever: a small 4B draft model predicting tokens for the 26B target can deliver 1.5-2× throughput on repetitive outputs. But vLLM explicitly rejects it for multimodal models right now, and Gemma 4 is multimodal. The two features we most wanted, speculative decoding and image input, are mutually exclusive in today's vLLM.",[10,3108,3109,3112],{},[304,3110,3111],{},"Conclusion: the baseline config is already at the hardware + software ceiling."," Generation throughput is memory-bandwidth-bound. Prefill is compute-bound on the only NVFP4 MoE kernel that works on GB10 + Gemma 4 (Marlin). 
There's no software knob we haven't turned.",[14,3114,3116],{"id":3115},"what-would-change-the-picture","What Would Change the Picture",[10,3118,3119],{},"Future improvements we'll re-test when they land:",[325,3121,3122,3128,3134,3140],{},[328,3123,3124,3127],{},[304,3125,3126],{},"vLLM adds multimodal spec decoding",", unlocks Gemma-4-E4B as draft, expected 1.5-2×",[328,3129,3130,3133],{},[304,3131,3132],{},"FlashInfer adds SM 12.1 + GELU support",", alternative MoE backend, could improve prefill",[328,3135,3136,3139],{},[304,3137,3138],{},"Community ships a clean Gemma 4 AWQ INT4 quant",", Qwen AWQ hit 82 tok\u002Fs on the same hardware",[328,3141,3142,3145],{},[304,3143,3144],{},"NVIDIA publishes an official NVFP4 Gemma 4 build",", likely has better-tuned kernels",[10,3147,3148,3149,3152],{},"Plan: re-run the benchmark suite (checked in at ",[39,3150,3151],{},"projects\u002Fdgx-spark-benchmarks\u002Fbench.py",") every couple of months and see if the landscape has shifted.",[14,3154,3156],{"id":3155},"whats-next-for-us","What's Next for Us",[325,3158,3159,3165,3180],{},[328,3160,3161,3164],{},[304,3162,3163],{},"Multimodal testing",", Gemma 4 accepts image inputs. Document parsing and UI automation are the obvious use cases.",[328,3166,3167,1126,3170,3173,3174,3179],{},[304,3168,3169],{},"Local speech-to-text",[39,3171,3172],{},"faster-whisper"," on the Grace CPU cores, leaving the GPU fully dedicated to Gemma 4. 
",[464,3175,3178],{"href":3176,"rel":3177},"https:\u002F\u002Flearn.arm.com\u002Flearning-paths\u002Flaptops-and-desktops\u002Fdgx_spark_voicechatbot\u002F",[468],"Arm published a playbook"," for exactly this split on the Spark.",[328,3181,3182,3185],{},[304,3183,3184],{},"Agent workload benchmarks",", beyond synthetic throughput, measure actual tool-call accuracy and reasoning quality on our real XRPL workflows.",[14,3187,856],{"id":855},[325,3189,3190,3197,3203,3210],{},[328,3191,3192],{},[464,3193,3196],{"href":3194,"rel":3195},"https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fgemma4",[468],"Gemma 4 on HuggingFace",[328,3198,3199],{},[464,3200,2057],{"href":3201,"rel":3202},"https:\u002F\u002Fhuggingface.co\u002Fbg-digitalservices\u002FGemma-4-26B-A4B-it-NVFP4",[468],[328,3204,3205],{},[464,3206,3209],{"href":3207,"rel":3208},"https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Frecipes\u002Fen\u002Flatest\u002FGoogle\u002FGemma4.html",[468],"vLLM Gemma 4 recipe",[328,3211,3212],{},[464,3213,3214],{"href":919},"Our original DGX Spark setup post",[14,3216,889],{"id":888},[10,3218,3219,3220,3222,3223,3226,3227,3229,3230,3233],{},"If you have a DGX Spark and want to run Gemma 4: pull the ",[39,3221,2078],{}," image, get a working ",[39,3224,3225],{},"gemma4_patched.py",", and use the ",[39,3228,494],{}," command above. Allow ~20 minutes for the first boot. 
Reach out at ",[464,3231,897],{"href":895,"rel":3232},[468]," if you hit issues, we've documented most of the sharp edges.",[900,3235,3236],{},"html pre.shiki code .s7eDp, html code.shiki .s7eDp{--shiki-default:#6F42C1}html pre.shiki code .sY4mW, html code.shiki .sY4mW{--shiki-default:#2B5581}html pre.shiki code .sR6ew, html code.shiki .sR6ew{--shiki-default:#24292EFF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s9AOD, html code.shiki .s9AOD{--shiki-default:#1976D2}html pre.shiki code .s-F7R, html code.shiki .s-F7R{--shiki-default:#D32F2F}html pre.shiki code .siqTm, html code.shiki .siqTm{--shiki-default:#212121}html pre.shiki code .shJU0, html code.shiki .shJU0{--shiki-default:#22863A}html pre.shiki code .s15Vz, html code.shiki 
.s15Vz{--shiki-default:#C2C3C5}",{"title":89,"searchDepth":90,"depth":90,"links":3238},[3239,3240,3241,3242,3249,3250,3251,3258,3259,3260,3261,3262],{"id":1945,"depth":90,"text":1946},{"id":1983,"depth":90,"text":1984},{"id":2050,"depth":90,"text":2051},{"id":2082,"depth":90,"text":2083,"children":3243},[3244,3245,3246,3247,3248],{"id":2087,"depth":644,"text":2088},{"id":2120,"depth":644,"text":2121},{"id":2148,"depth":644,"text":2149},{"id":2159,"depth":644,"text":2160},{"id":2170,"depth":644,"text":2171},{"id":2192,"depth":90,"text":2193},{"id":2472,"depth":90,"text":2473},{"id":2825,"depth":90,"text":2826,"children":3252},[3253,3254,3255,3256,3257],{"id":2832,"depth":644,"text":2833},{"id":2879,"depth":644,"text":2880},{"id":2952,"depth":644,"text":2953},{"id":2996,"depth":644,"text":2997},{"id":3029,"depth":644,"text":3030},{"id":3040,"depth":90,"text":3041},{"id":3115,"depth":90,"text":3116},{"id":3155,"depth":90,"text":3156},{"id":855,"depth":90,"text":856},{"id":888,"depth":90,"text":889},"2026-04-22","TL;DR: We set up Google's Gemma 4 26B-A4B NVFP4 on the XRPL Commons office DGX Spark, with native tool calling, reasoning mode, 131K context, and multimodal input preserved on 128GB of unified memory. We then tried to push throughput further and hit the hardware's real limits. 
Here's the full story.",{},"\u002Fwriting\u002F2026-04-22-dgx-spark-gemma4",{"title":1899,"description":3264},"writing\u002F2026-04-22-dgx-spark-gemma4","Native tool calling and reasoning mode on Gemma 4 NVFP4 over 128GB of unified memory.",[925,111,926,3271],"gemma","eW_CgXb6lefyl0iqqb9zJuMzoQ1DKrdWm0MBjkvg9lA",{"id":3274,"title":3275,"body":3276,"demo":97,"description":3294,"extension":99,"featured":101,"image":3295,"meta":3296,"name":3275,"navigation":101,"path":3297,"seo":3298,"stem":3299,"summary":3300,"tags":3301,"url":3305,"when":3306,"where":3307,"__hash__":3308},"work\u002Fwork\u002Fanton.md","Anton",{"type":7,"value":3277,"toc":3292},[3278,3285],[10,3279,3280,3281,3284],{},"Anton is a personal agent OS, deployed on a DGX Spark on my home network. About ten specialized domain agents (home, media, coder, research, admin, knowledge, syndic, dev, quality) coordinate via typed delegates. Skills run in a sandboxed Deno runtime, hot-reloadable. Layered memory (context, history, facts), policy engine evaluating every action, thread-aware interruption with cancel cascade, nightly self-improvement loop driven by issues, evaluation infrastructure under ",[39,3282,3283],{},"packages\u002Fagent-quality",". Local LLM inference via vLLM behind a LiteLLM gateway over Tailscale.",[10,3286,3287,3288,3291],{},"Anton is a decade-later realization of ",[1566,3289,3290],{},"Bots for Humanity",", a personal-agent \u002F digital-twin concept I worked on as my MBA capstone in 2015 and presented to contacts at Facebook. Design philosophy: ingest cheap, process lazy, enrich on demand. Explore agentic, build deterministic.",{"title":89,"searchDepth":90,"depth":90,"links":3293},[],"Anton is a personal agent OS, deployed on a DGX Spark on my home network. About ten specialized domain agents (home, media, coder, research, admin, knowledge, syndic, dev, quality) coordinate via typed delegates. Skills run in a sandboxed Deno runtime, hot-reloadable. 
Layered memory (context, history, facts), policy engine evaluating every action, thread-aware interruption with cancel cascade, nightly self-improvement loop driven by issues, evaluation infrastructure under packages\u002Fagent-quality. Local LLM inference via vLLM behind a LiteLLM gateway over Tailscale.","\u002Fwork\u002Fanton.png",{},"\u002Fwork\u002Fanton",{"description":3294},"work\u002Fanton","Personal agent OS that runs my family in production. Custom TypeScript runtime, sandboxed Deno skills, layered memory, policy engine, nightly self-improvement loop.",[109,3302,3303,110,3304,111],"typescript","deno","self-hosted","https:\u002F\u002Fgithub.com\u002Flucbocahut\u002Fanton","2024-12-01","Paris","2f-67YAX0igzpgv35vGGYxy1_1oMWhP04me-AzX3Tmc",1780849288651]