Build with language models via llm

llm (previously) is a tool Simon Willison is working on for interacting with large language models, running via API or locally.

I set out to use llm as the glue for prototyping tools to generate embeddings from one of my journals so that I could experiment with search and clustering on my writings. Approximately, what I’m building is an ETL workflow: extract/export data from my journals, transform/index as searchable vectors, load/query for “what docs are similar to or match this query”.

Extract and transform, approximately

Given a JSON export from DayOne, this turned out to be a matter of shell pipelines. After a few iterations of prompting (via Raycast’s GPT-3.5 integration), I came up with a simple script for extracting entries and loading them into a SQLite database of embedding vectors:

#!/bin/sh
# extract-entries.sh
# $ ./extract-entries.sh Journals.json

file=$1
cat "$file" |
  jq '[.entries[] | {id: .uuid, content: .text}]' |
  llm embed-multi journals - \
    --format json \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --database journals.db \
    --store

A couple things to note here:

  1. The placement of the - parameter matters here. I’m used to placing it at the end of the parameter list, but that didn’t work. The llm embed-multi docs suggest that --input is equivalent, but I think that’s a docs bug (the parameter doesn’t seem to exist in the released code).
  2. I’m using a locally-run model to generate the embeddings. This is very cool!
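
Getting that local model wired up was a one-time setup. Here’s a minimal sketch, assuming the llm-sentence-transformers plugin (the register step may be redundant if the plugin ships that model as its default):

# Install the plugin that provides local sentence-transformers embeddings,
# then register the model used in the script above.
llm install llm-sentence-transformers
llm sentence-transformers register all-MiniLM-L6-v2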

In particular, llm embed-multi expects documents with id/content keys (here, a JSON array, per --format json) and “indexes” them into a database of document/embedding rows. (If you’re thinking “hey, it’s SQLite, that has full-text search, why not both?”: yes, me too, that’s what I’m hoping to accomplish next!)
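
For a sense of what that looks like, here’s the shape of what the jq filter hands to llm. The values below are invented, but a Day One export puts a UUID and Markdown text in those fields:

# Peek at the first extracted entry (output shown is illustrative, not real):
jq '[.entries[] | {id: .uuid, content: .text}] | .[0]' Journals.json
# {
#   "id": "4A7C9E02D1B54F6A8C3E5D7F9B1A2C4E",
#   "content": "Went for a long walk and thought about search..."
# }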

I probably could have built this just by iterating on shell commands, but I like editing in a full-blown editor and don’t particularly want to practice using the zsh built-in editor. 🤷🏻‍♂️

Load, of a sort

Once that script finishes (it takes a few moments to generate all the embeddings), querying for documents similar to a query text is also straightforward:

#!/bin/sh
# query.sh
# Query the embeddings and pretty-print the results
# $ ./query.sh "What is good in life?"

query=$1

llm similar journals \
  --database journals.db \
  --number 3 \
  --content "$query" |
  jq -r -c '.content' | # [1]
  mdcat # [2]

Of note, two things that probably should have been more obvious to me:

  1. I don’t need to write a for-loop in shell to handle the output of llm similar; jq already applies its filter to each JSON document in a stream (see the sketch after this list)
  2. Pretty-printing Markdown to a terminal is trivial after brew install mdcat
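
That jq behavior is easy to see in isolation. A minimal demonstration, with stand-in objects rather than real llm similar output:

# jq applies its filter to every JSON document in the stream; no loop required.
printf '%s\n' '{"content":"first entry"}' '{"content":"second entry"}' |
  jq -r '.content'
# first entry
# second entry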

I didn’t go too far into clustering, which also boils down to one command: llm cluster journals 10. I hit a hiccup wherein I couldn’t run a model like Llama 2, or even a smaller one, because of issues with my installation.
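
For reference, here’s roughly what that invocation looks like against the database from the earlier scripts. Treat the --database flag as my assumption about how llm cluster (from the llm-cluster plugin) finds the collection:

# Group the journal embeddings into 10 clusters.
# (Assumes the llm-cluster plugin is installed; --database is an assumption,
#  mirroring the flag the embed-multi/similar commands use.)
llm cluster journals 10 --database journals.db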

Things I learned!

  • jq is very good on its own!
    • and has been for years, probably!
    • using a copilot to get me past that first syntax hurdle with my own data was the epiphany here
  • llm is quite good, doubly so with its growing ecosystem of plugins
    • if I were happier using shells, I could have done all of this in a couple of relatively simple commands
    • it provides an adapter layer that makes it possible to start experimenting/developing against usage-priced APIs and switch to running models/APIs locally when you get serious (sketched after this list)
  • it’s feasible to do some kinds of LLM work on your own computer
    • in particular, if you don’t mind trading your own time getting your installation right to gain independence from API vendors and usage-based pricing
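
To make that adapter-layer point concrete, a sketch with llm embed; both model IDs are assumptions that depend on which plugins and API keys you have configured:

# Same command shape, different backends; only the -m flag changes.
llm embed -m sentence-transformers/all-MiniLM-L6-v2 -c "a test sentence"  # local, free
llm embed -m ada-002 -c "a test sentence"  # hosted API, priced per use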

Mission complete: I have a queryable index of document vectors I can experiment with for searching, clustering, and building applications on top of my journals.

Adam Keys @therealadam