Kinda mind-blown that this is even possible. This is so far outside my current thinking that I hadn't even considered there could be an elegant way to implement semantic search across images and text at the same time. I know it happens at Google, but I picture that as still being text search across tags and metadata about the image.
Based on the number of responses, CLIP seems to be the thing that does this.
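As a rough sketch of the idea as I understand it: a model like CLIP maps both images and text into the same vector space, so search becomes "find the nearest vectors to the query vector" regardless of what kind of item each vector came from. The snippet below does not call CLIP at all. The three-dimensional vectors, the file names, and the `search` helper are made up for illustration and stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for embeddings that a model such as
# CLIP would produce. Images and text live in the same index.
index = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0],
    "photo_of_beach.jpg": [0.1, 0.9, 0.1],
    "article_about_puppies.txt": [0.8, 0.2, 0.1],
}

def search(query_vec, index, top_k=2):
    """Return the names of the top_k items most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda name: cosine(query_vec, index[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Pretend this is the embedding of the text query "a dog".
query = [0.85, 0.15, 0.05]
results = search(query, index)
print(results)
```

With these toy numbers, the dog photo and the puppy article rank above the beach photo, even though one is an image and the other is text. No tags or metadata are involved.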
Note
This post is a thought. It's a short note that I make about someone else's content online #thoughts