2025-08-19
92 entries
It's been a while since I've attempted to use LLMs to solve Connections puzzles.
It started because I was using the OpenAI completion API to try several different models while building Tomo.
If you've read any of my writing in the past year, you're probably aware I've heavily adopted agents to build much of the software I write now. What I've done less of is write about the strategies I've used to do this.
Who is finding LLMs useful and who is not? And why is this the case?
Learned more about how LLMs are trained: the model first goes through a pre-training phase, then is fine-tuned to contribute to a token stream with a human user, using special tokens to demarcate whether a message was written by the user or the assistant.
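For intuition, here's roughly what that fine-tuned token stream looks like (a sketch using ChatML-style markers; the exact special tokens vary by model family):

```python
# Illustrative only: real chat templates differ per model. The point is
# that special tokens mark where each participant's message begins and ends.
def format_chat(messages: list[dict]) -> str:
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "\n".join(parts)

print(format_chat([{"role": "user", "content": "What's the capital of France?"}]))
```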
Goose is a CLI language model-based agent. Goose exposes a chat interface and uses tool calling (mostly to invoke shell commands) to accomplish the objective prompted by the user. These tasks can include everything from writing code to running tests to converting a folder full of mov files to...
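The core loop behind an agent like this is simple in principle. A minimal sketch (hypothetical names, not Goose's actual code):

```python
import subprocess

def run_shell(command: str) -> str:
    """The single tool this toy agent exposes: run a shell command."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def agent_loop(objective: str, call_model) -> str:
    """call_model stands in for a chat API that returns either a final
    answer or a tool invocation (shape assumed for illustration)."""
    messages = [{"role": "user", "content": objective}]
    while True:
        reply = call_model(messages)
        if reply.get("tool") == "run_shell":
            output = run_shell(reply["arguments"]["command"])
            messages.append({"role": "tool", "content": output})
        else:
            return reply["content"]  # no tool call means the objective is done
```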
Today, I set out to add an llms.txt to this site. I've made a few similar additions in the past with raw post markdown files and a search index. Every time I try to change something with outputFormats in Hugo, I forget one of the steps, so by writing it up, I'll finally have it for next time.
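The wiring is short once you remember it. A sketch of the relevant config (the LLMS name is arbitrary; a matching template such as layouts/index.llms.txt renders the actual content):

```toml
# hugo.toml -- sketch; adjust names to taste
[outputFormats.LLMS]
  baseName = "llms"          # emits /llms.txt at the site root
  mediaType = "text/plain"
  isPlainText = true

[outputs]
  home = ["HTML", "RSS", "LLMS"]
```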
Today, Anthropic entered the LLM code tools party with Claude Code.
I had an interesting realization today while doing a demo building a web app with Cursor. I was debugging an issue with an MCP server, trying to connect it to Cursor's MCP integration. The code I was using was buggy, and I'd never tried this before (attempting it live was probably a fool's errand...
"I don't know anything about rice disease but apparently these are various rice diseases and this is what they look like." - Jeremy Howard, Fast.ai Course Lesson 6. I have no idea if Jeremy had this in mind when he said this (alluding to the fact he doesn't know about the subject area, but when...
I'm working on a conversation branching tool called "Delta" (for now). The first thing that led me to this idea came from chatting with Llama 3.2 and experimenting with different system prompts. I was actually trying to build myself a local version of an app I've been fascinated by called Dot.
I explored how embeddings cluster by visualizing LLM-generated words across different categories. The visualizations helped build intuition about how these embeddings relate to each other in vector space. Most of the code was generated using Sonnet.
Language models are more than chatbots - they're tools for thought. The real value lies in using them as intellectual sounding boards to brainstorm, refine and challenge our ideas.
Having completed lesson 5 of the FastAI course, I prompted Claude to give me some good datasets upon which to train a random forest model. This housing dataset from Kaggle seemed like a nice option, so I decided to give it a try. I am also going to try something that Jeremy Howard recommended for...
In this notebook/post, we're going to be using the markdown content from my blog to try out a language model. From this, we'll attempt to prompt the model to generate a post for a topic I might write about.
I recently found Joe's article, We All Know AI Can’t Code, Right?.
I wanted to get more hands-on with the language model trained in chapter 12 of the FastAI course, so I got some Google Colab credits and actually ran the training on an A100. It cost about $2.50 and took about an hour and 40 minutes, but generally worked quite well. There was a minor issue with auto-saving the...
I had the idea to try and use a language model as a random number generator. I didn't expect it to actually work as a uniform random number generator but was curious to see what the distribution of numbers would look like.
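The experiment boils down to a loop like this sketch (the model name is a placeholder; in practice the distribution tends to be far from uniform):

```python
from collections import Counter

from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()
counts: Counter = Counter()
for _ in range(100):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 10. Reply with the number only.",
        }],
    )
    counts[response.choices[0].message.content.strip()] += 1

print(counts.most_common())  # a uniform RNG would show ~10 per number
```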
I've been prompting models to output JSON for about as long as I've been using models. Since text-davinci-003, getting valid JSON out of OpenAI's models didn't seem like that big of a challenge, but maybe I wasn't seeing the long tails of misbehavior because I hadn't massively scaled up a use...
I've been wanting to create a chat component for this site for a while, because I really don't like quoting conversations and manually formatting them each time. When using a model playground, usually there is a code snippet option that generates Python code you can copy out into a script. Using...
Research and experimentation with models presents different problems than I am used to dealing with on a daily basis. The structure of what you want to try out changes often, so I understand why some folks prefer to use notebooks. Personally, notebooks haven't caught on for me, so I'm still just...
I'm trying something a bit new, writing some of my thoughts about how the future might look based on patterns I've been observing lately.
Model-based aggregators
Sabrina wrote an interesting write-up on solving a math problem with gpt-4o. It turned out the text-only, chain-of-thought approach was the best performing, which is not what I would have guessed.
Generative AI and language models are fun to play with but you don't really have something you can confidently ship to users until you test what you've built.
I read Jason, Ivan and Charles' blog post on Modal about fine-tuning an embedding model. It's a bit in the weeds of ML for me but I learn a bit more every time I read something new.
The following prompt seems to be quite effective at leaking any pre-prompting done to a language model
For me, invoking a language model through a playground (UI) interface is the most common approach. Occasionally, it can be helpful to use a CLI to directly pipe output into a model. For example
I enjoyed this article by Ken about production LLM use cases with OpenAI models. When it comes to prompts, less is more
Gemini Pro 1.5 up and running. I've said this before but I will say it again -- the fact that I don't need to deal with GCP to use Google models gives me joy.
I've been digging more into evals. I wrote a simple Claude completion function in openai/evals to better understand how the different pieces fit together. Quick and dirty code:
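(A sketch of the idea rather than the original snippet; it assumes evals' CompletionFn/CompletionResult interfaces and the anthropic SDK's messages API.)

```python
from anthropic import Anthropic
from evals.api import CompletionFn, CompletionResult


class ClaudeCompletionResult(CompletionResult):
    def __init__(self, response: str):
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response]


class ClaudeCompletionFn(CompletionFn):
    """Adapts Claude to the completion-function interface evals expects."""

    def __init__(self, model: str = "claude-3-opus-20240229", **kwargs):
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def __call__(self, prompt, **kwargs) -> ClaudeCompletionResult:
        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": str(prompt)}],
        )
        return ClaudeCompletionResult(message.content[0].text)
```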
I can't believe I am saying this but if you play around with language models locally, a 1 TB drive might not be big enough for very long.
One of the greatest misconceptions concerning LLMs is the idea that they are easy to use. They really aren’t: getting great results out of them requires a great deal of experience and hard-fought intuition, combined with deep domain knowledge of the problem you are applying them to.
Did a bit more work on an LLM evaluator for Connections. I'm mostly trying it with gpt-4 and claude-3-opus. On today's puzzle, the best either did was 2/4 correct. I'm unsure how much more improvement is possible with prompting or even fine-tuning, but it's an interesting challenge.
Set up a Temporal worker in Ruby and got familiar with its ergonomics.
I spent yesterday and today working through the excellent guide by Alex on using sqlite-vss to do vector similarity search in a SQLite database. I'm particularly interested in the benefits one can get from having these tools available locally for getting better insights into non-big datasets with a...
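From memory, the core usage looks something like this sketch (check the sqlite-vss README for exact details):

```python
import json
import sqlite3

import sqlite_vss  # pip install sqlite-vss

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vss.load(db)

# a virtual table of 384-dimensional embeddings
db.execute("create virtual table vss_posts using vss0(embedding(384))")
db.execute(
    "insert into vss_posts(rowid, embedding) values (?, ?)",
    (1, json.dumps([0.1] * 384)),
)

# k-nearest-neighbor search against a query embedding
rows = db.execute(
    "select rowid, distance from vss_posts where vss_search(embedding, ?) limit 5",
    (json.dumps([0.1] * 384),),
).fetchall()
print(rows)
```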
Bookmarking an idea from Swyx on Latent Space with Soumith Chintala: "create synthetic data off of your retrieved documents and then fine-tune on that".
I've been meaning to try out Simon's llm package for a while now. From reading the docs and following the development, it's a modular, meet-you-where-you-are CLI for running LLM inference locally or using almost any API out there. In the past, I might have installed this with brew, but we run nix...
A very timely (for me) article by Hamel about understanding what a language model prompt abstraction library is doing before blindly adopting it. This really aligned with a lot of my own thoughts on the matter, right down to its praise of Jason's instructor library baseline example.
OpenAI popularized a pattern of streaming results from a backend API in realtime with ChatGPT. This approach is useful because the time a language model takes to run inference is often longer than what you want for an API call to feel snappy and fast. By streaming the results as they're produced,...
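With the openai package, the consuming side of that pattern looks roughly like this (the model name is a placeholder):

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about streaming."}],
    stream=True,  # yields chunks as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```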
Hardly seemed worth a TIL post because it was too easy, but I learned gpt-4 is proficient at building working ffmpeg commands. I wrote the prompt "convert m4a to mp3 with ffmpeg".
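A typical working answer looks like `ffmpeg -i input.m4a output.mp3` (my reconstruction, not the model's verbatim output).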
Disclaimer: I am not a security expert or a security professional.
Edit (2024-07-21): Vercel has updated the ai package to use different abstractions than the examples below. Consider reading their docs first before using the example below, which is out of date.
I spent another hour playing around with different techniques to try and teach and convince gpt-4 to play Connections properly, after a bit of exploration and feedback. I incorporated two new techniques: asking for one category at a time, then giving the model feedback (correct, incorrect, 3/4), and using...
I started playing the NYTimes word game "Connections" recently, by the recommendation of a few friends. It has the type of freshness that Wordle lost for me a long time ago. After playing Connections for a few days, I wondered if an OpenAI language model could solve the game (the objective is to...
After some experimentation with GitHub Copilot Chat, my review is mixed. I like the ability to copy from the sidebar chat to the editor a lot. It makes the chat more useful, but the chat is pretty chatty and thus somewhat slow to finish responding. I've also found the inline generation...
I would love it if OpenAI added support for presetting a max_tokens URL parameter in the Playground. Something as simple as this:
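`https://platform.openai.com/playground?max_tokens=256` (a hypothetical URL; the parameter doesn't exist)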
I'm betting OpenAI will soon have a Cloud Storage product like Google Drive or iCloud for ChatGPT Plus users. Having your personal data available in the context of a language model is a massive value add. With a product like this, OpenAI could fully support use cases like "summarize my notes for the week"...
Playing with Rivet and OpenInterpreter
It's much easier to test Temporal Workflows in Python by invoking the contents of the individual Activities first, in the shell or via a separate script, then composing them into a Workflow. I need to see if there's a better way to surface exceptions and failures through Temporal directly to make...
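A sketch of what I mean with the temporalio SDK (the activity itself is hypothetical):

```python
import asyncio

from temporalio import activity


@activity.defn
async def transcode(path: str) -> str:
    """Hypothetical activity: pretend to transcode a file."""
    return path.replace(".mov", ".mp4")


# @activity.defn leaves the function directly callable, so its logic can be
# smoke-tested in a plain script before it's composed into a Workflow and
# executed under a worker.
if __name__ == "__main__":
    print(asyncio.run(transcode("clip.mov")))
```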
Language models and prompts are magic in a world of deterministic software. As prompts change and use cases evolve, it can be difficult to continue to have confidence in the output of a model. Building a library of example inputs for your model+prompt combination with annotated outputs is critical...
Simon wrote an excellent post on the current state of the world in LLMs.
First attempt
It will be interesting to see if or when we hit scaling limits to training more powerful models and what our new bottleneck becomes. For now, there appears to be a lot of greenfield.
While not an entirely unique perspective, I believe Apple is one of the best positioned companies to take advantage of the recent improvements in language models. I expect more generic chatbots will continue to become commodities whereas Apple will build a bespoke, multi-modal assistant with access...
promptfoo is a JavaScript library and CLI for testing and evaluating LLM output quality. It's straightforward to install and get up and running quickly. As a first experiment, I've used it to compare the output of three similar prompts that specify their output structure using different modes of...
I tried out Llama 2 today using ollama. At first pass, it seemed OK at writing Python code, but I struggled to get it to effectively generate or adhere to a specific schema. I'll have to try a few more things but my initial impressions are mixed (relative to OpenAI models).
It's hard to think because it's hard to think. - GitHub Copilot
Meta released Llama 2 yesterday and the hype has ensued. While it's exciting to see more powerful models become available, a model with weights is not the same as an API. It is still far less accessible.
Some unstructured thoughts on the types of tasks language models seem to be good (and bad) at completing:
Experimenting with using a language model to improve the input prompt, then using that output as the actual prompt for the model, and returning the result. It's a bit of a play on the "critique" approach. Some of the outputs were interesting but I need a better way to evaluate the results.
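The pipeline is just two model calls. A sketch (the model name is a placeholder):

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def improved_answer(prompt: str) -> str:
    # first, ask the model to rewrite the prompt to be clearer and more specific
    better_prompt = complete(
        "Rewrite this prompt to be clearer and more specific. "
        f"Reply with the rewritten prompt only.\n\n{prompt}"
    )
    # then answer the improved prompt instead of the original
    return complete(better_prompt)
```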
I've been following Jason's work experimenting with different abstractions for constructing prompts and structuring responses. I've long felt that building prompts with strings is not the type of developer experience that will win the day. On the other hand, I'm wary of the wrong abstraction...
I've been thinking about the concept of "prompt overfitting". In this context, there is a distinction between model overfitting and prompt overfitting. Say you want to use a large language model as a classifier. You may give it several example inputs and the expected outputs. I don't have hard data...
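To make that concrete, a few-shot classifier prompt looks like this sketch; the risk is that the examples narrow the model's behavior as much as they teach the task:

```python
# If every example is short and product-focused, the "prompt-overfit" model
# may mislabel long, nuanced, or off-domain inputs.
PROMPT = """Classify the sentiment of each review as positive or negative.

Review: Great product, works perfectly.
Sentiment: positive

Review: Broke after two days.
Sentiment: negative

Review: {review}
Sentiment:"""

print(PROMPT.format(review="Exceeded my expectations in every way."))
```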
This past week, OpenAI added function calling to their SDK. This addition is exciting because it now incorporates schema as a first-class citizen in making calls to OpenAI chat models. As the example code and naming suggest, you can define a list of functions and schema of the parameters required...
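A sketch with the SDK as it looked at the time (the weather function is the usual demo shape; newer SDK versions have since moved to a tools parameter):

```python
import json

import openai  # 2023-era SDK (openai<1.0)

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {  # JSON Schema for the function's arguments
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    functions=functions,
    function_call="auto",
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    print(args)  # e.g. {"city": "Tokyo"}
```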
Richard WM Jones booted Linux 292,612 times to find a bug where it hangs on boot. I loved reading the recounting of his process to accomplish this, bisecting through the different versions of Linux and booting each thousands of times to determine whether the version contained the bug.
Today, I played around with Matt Rickard's ReLLM library, another take on constraining LLM output, in this case, with regex. I tried to use it to steer a language model to generate structured output (JSON) from unstructured input. This exercise is sort of like parsing or validating JSON with regex -- it's...
I tried out jsonformer to see how it would perform with some of the structured data use cases I've been exploring.
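Along the lines of jsonformer's documented usage (a sketch):

```python
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
}

jsonformer = Jsonformer(
    model,
    tokenizer,
    json_schema,
    "Generate a person's information based on the following schema:",
)
print(jsonformer())  # returns a dict constrained to match the schema
```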
I've been following Eric's posts about SudoLang since the first installment back in March. I've skimmed through the spec and the value proposition is quite compelling. SudoLang seeks to allow programmers of all levels to instruct LLMs and can also be transpiled into your programming language of...
I've written several posts on using JSON and Pydantic schemas to structure LLM responses. Recently, I've done some work using a similar approach with protobuf message schemas as the data contract. Here's an example to show what that looks like.
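(A sketch of the shape of it, not the original example; the model name is a placeholder. The .proto message doubles as the data contract in the prompt.)

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

PROTO_SCHEMA = """
syntax = "proto3";

message Recipe {
  string title = 1;
  repeated string ingredients = 2;
  int32 minutes = 3;
}
"""

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": f"Respond only with JSON matching this protobuf schema:\n{PROTO_SCHEMA}",
        },
        {"role": "user", "content": "A simple pancake recipe."},
    ],
)
print(response.choices[0].message.content)
```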
NVIDIA researchers introduce an LLM-based agent with "lifelong learning" capabilities that can navigate, discover, and accomplish goals in Minecraft without human intervention.
The Alexandria Index is building embeddings for large, public data sets, to make them more searchable and accessible.
I've seen a lot of "GPT detection" products floating around lately. Sebastian discusses some of the products and their approaches in this article. Some products claim to have developed an "algorithm with an accuracy rate of text detection higher than 98%". Unfortunately, this same algorithm...
Brex wrote a nice beginner guide on prompt engineering.
Plenty of data is ambiguous without additional description or schema to clarify its meaning. It's easy to come up with structured data that can't be interpreted without its accompanying schema. Here's an example:
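(A stand-in illustration.)

```python
# Without a schema this row is uninterpretable: is 45 an age, a temperature,
# or a quantity? Is "04-05" April 5th or May 4th?
row = ["04-05", 45, "OK"]

# The same data alongside its schema is unambiguous.
schema = {"fields": ["date (MM-DD)", "temperature_f", "status"]}
```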
LMQL is a SQL-like programming language for interacting with LMs. It takes a declarative approach to specifying the output constraints for a language model, with a SQL flavor.
marvin's @ai_model decorator implements something similar to what I had in mind for extracting structured data from an input to a language model.
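As I recall from marvin's docs, usage looks roughly like this:

```python
from marvin import ai_model
from pydantic import BaseModel


@ai_model
class Location(BaseModel):
    city: str
    state: str


Location("The Big Apple")
# -> Location(city='New York', state='NY')
```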
Restricting the next predicted token to adhere to a specific context free grammar seems like a big step forward in weaving language models into applications.
Using system prompts provides an intuitive separation for input and output schema from input content.
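For example (a sketch):

```python
messages = [
    # schema and output rules live in the system prompt...
    {
        "role": "system",
        "content": 'Extract contacts as JSON: {"name": string, "email": string}',
    },
    # ...while the user message carries only the content to process
    {
        "role": "user",
        "content": "Reach out to Ada Lovelace at ada@example.com about the demo.",
    },
]
```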
With the support of GPT-4, I feel unstoppable. The overnight surge in productivity is intoxicating, not for making money or starting a business, but for the sheer joy of continuously creating ideas from my mind, which feels like happiness. - Ke Fang
I wrote a few paragraphs disagreeing with Paul's take, asserting that, like Simon suggests, we should think of language models like ChatGPT as a “calculator for words”.
Code needs structured output
The most popular language model use cases I've seen around have been chatbots, agents, and "chat your X" use cases.
It's necessary to pay attention to the shape of a language model's response when incorporating it as a component in a software application. You can't programmatically tap into the power of a language model if you can't reliably parse its response. In the past, I have mostly used a combination of...
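A defensive parser is the kind of thing I mean. A sketch:

```python
import json
from typing import Optional


def parse_model_json(response_text: str) -> Optional[dict]:
    """Defensively parse a model response that should contain JSON."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        # models often wrap JSON in prose or code fences; try to recover
        start, end = response_text.find("{"), response_text.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(response_text[start : end + 1])
            except json.JSONDecodeError:
                pass
    return None  # caller decides whether to retry or fail
```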
Experimenting with Auto-GPT
Auto-GPT is a popular project on GitHub that attempts to build an autonomous agent on top of an LLM. This is not my first time using Auto-GPT. I used it shortly after it was released and gave it a second try a week or two later, which makes this my third zero-to-running effort.
I believe that language models are most useful when available at your fingertips in the context of what you're doing. GitHub Copilot is a well-known application that applies language models in this manner. There is no need to pre-prompt the model. It knows you're writing code and that you're going...
Over the years, I've developed a system for capturing knowledge that has been useful to me. The idea behind this practice is to provide immediate access to useful snippets and learnings, often with examples. I'll store things like "Amend commit message" with tags like #git, #commit, and #amend...
I know a little about nix. Not a lot. I know some things about Python virtual environments, asdf and a few things about package managers. I've heard the combo of direnv and nix is fantastic from a number of engineers I trust, but I haven't had the chance to figure out what these tools can really...
I came upon https://gpa.43z.one today. It's a GPT-flavored capture the flag. The idea is, given a prompt containing a secret, convince the LM to leak the prompt against prior instructions it's been given. It's a cool way to develop intuition for how to prompt and steer LMs. I managed to complete all...
Attempts to thwart prompt injection
I've been experimenting with ways to prevent applications from deviating from their intended purpose. This problem is a subset of the generic jailbreaking problem at the model level. I'm not particularly well-suited to solve that problem and I imagine it will be a continued back and forth between...
Jailbreaking as prompt injection
I've been keeping an eye out for language models that can run locally so that I can use them on personal data sets for tasks like summarization and knowledge retrieval without sending all my data up to someone else's cloud. Anthony sent me a link to a Twitter thread about a product called deepsparse...
If you want to try running these examples yourself, check out my writeup on using a clean Python setup.
Since the launch of GPT-3, and more notably ChatGPT, I’ve had a ton of fun learning about and playing with emerging tools in the language model space.