Local VLMs Have Improved
About 6 months ago, I experimented with running a few different multi-modal (vision) language models on my MacBook. At the time, the results weren't so great.
I've done some experimentation with extracting structured data from documents using VLMs. A summary of one approach I've tried can be found in my repo, impulse. I've found using Protobufs to be a relatively effective approach for extracting values from documents. The high-level idea is that you write a...
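A minimal sketch of the schema-in-the-prompt idea, assuming the ollama Python client and a hypothetical Invoice message (this is not the code in impulse; the field names, model name, and file paths are placeholders):

```python
# Sketch only: hand the model a protobuf schema as text and ask for matching JSON.
# The Invoice message, model name, and file paths are hypothetical.
import json

import ollama

INVOICE_PROTO = """
message Invoice {
  string vendor_name = 1;
  string invoice_date = 2;  // ISO 8601
  double total_amount = 3;
}
"""

prompt = (
    "Extract the fields defined by this protobuf message from the attached "
    "document. Respond with only a JSON object whose keys are the field names.\n"
    + INVOICE_PROTO
)

response = ollama.chat(
    model="llama3.2-vision",  # any local vision model pulled into ollama
    messages=[{"role": "user", "content": prompt, "images": ["invoice.png"]}],
)

# Assumes the model cooperates and returns bare JSON.
extracted = json.loads(response["message"]["content"])
print(extracted)
```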
In light of OpenAI releasing Structured Outputs in its model API, let's move output structuring another level up the stack to the microservice/RPC level.
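A rough sketch of what that could look like, treating a hypothetical RPC response message as the schema for Structured Outputs via the openai Python SDK (the ExtractFieldsResponse type and its fields are made up):

```python
# Sketch: a Pydantic model standing in for an RPC response message, used as the
# response_format for OpenAI Structured Outputs. Type and field names are hypothetical.
from openai import OpenAI
from pydantic import BaseModel


class ExtractFieldsResponse(BaseModel):
    # Mirrors what a microservice's response message might look like.
    vendor_name: str
    invoice_date: str
    total_amount: float


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "user", "content": "Extract the vendor, date, and total from: ..."},
    ],
    response_format=ExtractFieldsResponse,
)

# The SDK validates the model output against the schema and returns a typed object.
result = completion.choices[0].message.parsed
print(result)
```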
I tried stacking multiple pages of a PDF vertically into a single image, passing that to a model, and then doing data extraction from it. It didn't work. I imagine this is because models aren't trained on much data like this; the inference seemed to output made-up data. Multiple studies have shown that...
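For reference, the stacking step itself is straightforward. A sketch using pdf2image (which requires poppler) and Pillow, with placeholder file names:

```python
# Sketch of the "stack the pages into one tall image" step.
# Assumes pdf2image (poppler installed) and Pillow; file names are placeholders.
from pdf2image import convert_from_path
from PIL import Image

pages = convert_from_path("document.pdf", dpi=150)  # one PIL image per page

width = max(p.width for p in pages)
height = sum(p.height for p in pages)

# Paste each page below the previous one on a white canvas.
stacked = Image.new("RGB", (width, height), "white")
y = 0
for page in pages:
    stacked.paste(page, (0, y))
    y += page.height

stacked.save("stacked.png")
```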
I attempted to reproduce the results for one task from the "VLMs are Blind" paper. Specifically, Task 1: Counting line intersections. I ran 150 examples of lines generated by the project's code with a line thickness of 4.
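Not the project's generator, but a sketch of the ground-truth side of the task: a standard orientation test for whether two line segments cross (collinear edge cases ignored):

```python
# Sketch of computing the expected answer for a pair of segments; this is a
# generic orientation test, not the code from the VLMs are Blind repo.
from typing import Tuple

Point = Tuple[float, float]


def _orientation(a: Point, b: Point, c: Point) -> int:
    """Sign of the cross product (b - a) x (c - a)."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)


def segments_intersect(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segment p1-p2 crosses segment q1-q2 (general position assumed)."""
    return (
        _orientation(p1, p2, q1) != _orientation(p1, p2, q2)
        and _orientation(q1, q2, p1) != _orientation(q1, q2, p2)
    )


# These two segments cross once, so the expected count is 1.
print(segments_intersect((0, 0), (4, 4), (0, 4), (4, 0)))  # True
```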
I spent some time experimenting with multi-modal models (also called vision models on the ollama site) to see how they perform. You can try these out from the CLI with ollama run <model>, but I opted to use the ollama Python client.
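A minimal example with the ollama Python client; the model name and image path are placeholders for whatever is pulled locally:

```python
# Minimal ollama Python client call with a vision model.
# "llava" and the image path are placeholders for a locally pulled model and file.
import ollama

response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["./photo.jpg"],
        }
    ],
)

print(response["message"]["content"])
```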