Matthew Cheng

This is part 2 of my writing around building an autonomous MVP builder. See my initial thoughts about agentic product development for context.

Since writing that first doc on August 29th, I've been playing around with building an agentic dev team and a Linear clone to manage it all. One of the key things I've been indexing for is cost (and by extension, token efficiency). I've refused to spend more than $50 a month on this (and even then, $50 seems a bit steep to me). I'll be doing a separate writeup about that whole process, how my mentality has changed, and where I'm taking it moving forward.

As much as being heads down working on infrastructure for agent orchestration has been fun, the best way to test my work is to have my infrastructure go and actually build a real product - and that first iteration is llm-guessr.

What is llm-guessr?

llm-guessr is a word guessing game where a player attempts to guess the word an LLM is "thinking". The game was 100% coded, tested, deployed, and maintained (kind of) by a small squadron of agents running a mixture of Claude Code and Cursor Agents CLI.

The game itself is meant to showcase how LLMs generate the next token to fill in a sentence, and also to demonstrate how my DIY agentic orchestration tool can ideate, develop, and ship products independently without much human intervention.

The process

I started with a super basic idea. My partner had told me about Semantle, and I thought it would be fun to make a similar version of it but based on an LLM's probability distribution.

I spent about 2 minutes kind of just brain dumping into a .md file:

my initial doc

i want to build a simple word guess game, similar to semantle or contexto, where a player types words into a text box and searches. unlike the other two examples that i want to clone, I want to use this as a way to showcase the non determinism of an LLM - since an LLM generates words based on some sort of weighting? Instead of using word2vec to vectorize based on semantic meaning and calculating similarity score on that, i want the game to be about guessing what's in the LLM's "brain".

why:
i find it interesting how LLMs and generally how ML systems "think". I know that most people do not know how they "think", and people treat these models as if they are deterministic rather than the non deterministic black box that they actually are. I want this project to be a showcase of how humans can't deterministically predict what a model outputs, and how human brains will try really hard to come up with some pattern to rationalize non deterministic systems.
i think this is a project that i'd do out of personal interest and with a thesis around human behavior when it comes to non determinism.

Some limitations:
we can also do 1 word per day, or even week if it impacts cost. we want to minimize server side and client side computation. if we can bulk compute daily that's ok. We can start with fewer "games" at lower frequency, or even manually trigger a game. the main limitation is cost (compute, hosting, etc) and latency.

UI should be able to get refined and tuned later.

Do not write any code. Do not focus on UI. Your first step is to call out my assumptions and discuss how we can implement this. discuss what type of architecture is needed. how do we come up with an LLM style similarity score? Do we need to use word2vec? is there open source libraries that can help us there? How can we accomplish my goal

Starting the work

I fired this off on a rainy Monday morning on my subway ride to work. The first stop of this doc was to Claude Code in planning mode, where it took this information and gave me a few options on implementation. It helped me choose what models to use (it gave me a choice between GPT-2 and GPT-3.5, with a heavier weight towards GPT-2 for cost savings, since I could just run it locally for free!).

After barely glancing at its proposal, I approved whatever Claude Code said.

The agentic process

Creating tasks

I have a task-manager agent that just reads the plan from Claude Code and converts it into atomic tasks that can be assigned to an engineering agent (eng-agent) or a human. The task-manager agent created 23 total tasks, 16 of which were assigned to an eng-agent and 7 to myself. The tasks took a test-first approach, sequencing an agent to write tests first before then having another agent write the actual functionality.

The human tasks created were mostly around creating prompt ideas for the game, doing DNS config or SSL config, or testing.

Pulling tasks and executing

At this point in my infrastructure work, my eng-agents would automatically run every few hours and pull a task from the top, execute on a feature branch, deploy its work to a "staging environment", and then request review.

Most actual coding tasks, after being marked as ready for review, would trigger a code-review agent, which would look at the task itself, compare the work created by the eng-agent, and then give a summary for approval or failure. A few times, the code-review agent found that the eng-agent would do the hackiest "solution" to check off the task requirements. For example:

One of the tasks was to write API tests (before developing the actual backend). Part of the final requirements for this task was that all tests should fail, since the backend didn't exist yet. However, instead of writing actual tests that would fail due to the non-existence of any backend, the eng-agent wrote test stubs that were hardcoded to fail. 😂

The code-review agent was able to catch these types of errors before they hit my own desk.

When we moved into actual functionality development, I was a lot more involved in testing. Most of the time this looked like me loading the staging environment on my phone's mobile browser, clicking around and trying things out, finding issues, and then dumping all my comments into my Linear clone. The agent would see the failure reasons and comments, and then iterate again to improve its work.

The first deployment

By Thursday afternoon, I had a mostly working product that was hosted in a production environment. I played around with it and kept running into some frustrations around UX, the prompts, and the words it was generating, and ended up having to go back to plan mode to just dump out some feedback.

One of the biggest notes was to upgrade from GPT-2 to GPT-3.5: GPT-2 was outputting all sorts of weird words. (I had chosen GPT-2/3.5 specifically because they support logprobs, which let me get probability distributions for multiple words instead of just one word). One of the tasks it spun up for itself was to replace GPT-2 with API calls for GPT-3.5, which it ground out overnight as I slept.

I was meant to demo the early form of llm-guessr to a few friends at a mini startup/builder event my friend and I were running, but I ended up forgetting to trigger things on Friday evening...

Finishing things up and shipping products out

On Saturday morning, with about an hour left before my demo, I fired off an agent to handle deployment and game generation. I shut my laptop and got on the train downtown.

By the time I started to demo, the deployment had already completed. I didn't realize this until I was watching a friend play the game, and I was seeing a lot of the UX updates show up.

It turns out that the agent was running on my VPS even while my local device was shut, and it managed to just run without me thinking. A win for running things with full permissions on my VPS instance!

What's next

The game is in a good enough state where I'm not too concerned about improving it. I could have my agents add some basic analytics and monitor for some unpaid organic usage, but llm-guessr is pretty much done, and was done in a week from idea to production!

On the infrastructure side, I'm continuing to refine and automate a lot of the work here. There's a lot of stuff my agents are doing that should not be LLM tasks - these are deterministic, repetitive tasks (for example, handling staging deployments, managing worktrees, etc.), and a lot of the work required is for me to identify which portions should be bash scripts or cron jobs and which portions actually require an AI agent.

On the more business-focused side, my thesis has been getting refined a lot more about the economics of scale for software development, and how having significantly lower costs to build and ship can lead to the ability to do hyper-niche targeting of end audiences (see my initial thoughts about agentic product development). One of the next experiments I have my agents running is to go and build some basic, hyper-niche SaaS tools, for me to go play around with distribution and sales for. So far, at an early intercept point, I've seen it build a pretty robust SaaS tool within a week and a half for a pretty specifically segmented audience. I'll share about this some more as well later on.

An ask

You've made it this far - thank you. I'd love to get some feedback on how I'm doing things, how I can improve this work, and if there's anything else I should be trying out!

llm-guessr: Building a Product with Autonomous Agents in One Week