Simon Willison on X: "Anyone got a lead on a good embedding model that can embed both images and text into the same space, so you can search for "dog" and get back images most likely to contain a dog? It looks like VisualBERT is one, what are others?"
Here's my thought on Simon Willison's post on X: "Anyone got a lead on a good embedding model that can embed both images and text into the same space, so you can search for "dog" and get back images most likely to contain a dog? It looks like VisualBERT is one, what are others?"
Kinda mindblown that this is even possible. This is so far outside of my current thinking that I hadn't even considered an elegant way to implement semantic search across images and text at the same time. I know it happens at Google, but I envision that as still being text search across tags and metadata about the image.
Based on the number of responses, CLIP is the thing that does this.
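To make the "same space" idea concrete for myself, here's a minimal sketch of what the search step might look like once you already have embeddings. It does not call CLIP or any real model: the 3-dimensional vectors, the file names, and the `search_images` helper are all made up for illustration, standing in for whatever a real image/text embedding model would produce. The only point is that a text query and an image become vectors of the same shape, so ranking images is just a similarity comparison.

```python
import math

# Hypothetical toy embeddings. A real model would produce vectors with
# hundreds of dimensions; these 3-d values are invented for illustration.
IMAGE_EMBEDDINGS = {
    "dog_in_park.jpg": [0.9, 0.1, 0.0],
    "cat_on_sofa.jpg": [0.1, 0.9, 0.0],
    "city_skyline.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical text embeddings living in the same 3-d space as the images.
TEXT_EMBEDDINGS = {
    "dog": [0.8, 0.2, 0.1],
    "skyline": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_images(query, top_k=2):
    """Rank images by similarity to the query's text embedding."""
    query_vec = TEXT_EMBEDDINGS[query]
    scored = [
        (cosine_similarity(query_vec, image_vec), name)
        for name, image_vec in IMAGE_EMBEDDINGS.items()
    ]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    print(search_images("dog"))
    print(search_images("skyline", top_k=1))
```

With these made-up vectors, searching for "dog" ranks `dog_in_park.jpg` first. No tags or metadata are involved, which is the part I hadn't pictured before.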