CLIP (Contrastive Language–Image Pre-training) is a technique that efficiently learns visual concepts from natural language supervision. CLIP has found applications in models such as Stable Diffusion.
This repository acts as a proof of concept (POC) exploring the use of CLIP for natural language video search, as outlined in the article found here.
Adapted for a natural language video search engine, found here.
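As a rough illustration of how CLIP scores a text query against video frames, the sketch below uses the Hugging Face `transformers` implementation of CLIP; the model name and the frame files are assumptions for illustration, and the inference code in this repository may differ.

```python
# Minimal sketch: rank sampled video frames against a text query with CLIP.
# Model choice and frame file names are assumptions; this repo's inference
# code may use a different CLIP implementation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Frames sampled from a video (placeholder file names).
frames = [Image.open("frame_000.jpg"), Image.open("frame_001.jpg")]

inputs = processor(
    text=["a man cutting pepper"], images=frames, return_tensors="pt", padding=True
)
outputs = model(**inputs)

# logits_per_image holds the similarity of each frame to the text query;
# the highest-scoring frame best matches the query.
scores = outputs.logits_per_image.softmax(dim=0)
print(scores)
```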
- libzmq
- python >= 3.8
- go >= 1.18
- start up the inference zmq server found in the `./inference` directory: `python3 zmq_server.py` (see the sketch below)
- start up the go server with `go run main.go`
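For context, the inference server communicates over ZeroMQ. The sketch below shows a minimal request–reply loop of the kind such a server might implement; the port, the JSON message shape, and the placeholder `embed` function are assumptions, not the repository's actual protocol — see `./inference/zmq_server.py` for the real implementation.

```python
# Minimal sketch of a ZeroMQ REP inference endpoint (assumed protocol).
import zmq


def embed(text: str) -> list[float]:
    # Placeholder: a real server would run the CLIP text encoder here.
    return [0.0] * 512


ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://*:5555")  # the Go server would connect with a REQ socket

while True:
    request = sock.recv_json()  # e.g. {"text": "a man cutting pepper"}
    sock.send_json({"embedding": embed(request["text"])})
```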
Before running this example, please ensure that your environment is correctly configured and the application is running without errors.
- index the video clips provided in the `examples/videos` directory:

  ```sh
  curl -X POST -d '{"videoURI": "<path_to_dir>/examples/videos/<video_name>.mp4" }' http://localhost:3000/insert
  ```
note: it can take a moment for the video to become searchable.
- then search for a video (see the programmatic sketch below):

  ```sh
  curl -X POST -d '{"input": "a man cutting pepper", "maxResults": 1 }' http://localhost:3000/search
  ```
Future work:
- CLI to remove the manual setup process
- ability to add dedicated inference machines (currently limited to the same host)