I'm sharing tingy, a service that lets you upload a video and query it with a text description. Each upload allows two queries, and they can be any text you wish (in English).
Here's an example: you have 4 hours' worth of security footage, and you know that at some point during those 4 hours someone stole a bike. You could query the video for "person riding bike".
I'm looking for some test users - please reach out here if you would like to trial tingy. I would be happy to set you up with a free account.
Interesting idea nonetheless
It's great in theory, but the real world creates harder challenges. The real answer is to keep track of all the dogs that pass by and then compare the before/after scenes for differences. I'm doing this with my garage door camera to detect water (for a completely different problem).
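The before/after comparison described above can be sketched as simple frame differencing: count how many pixels changed noticeably between two frames. This is a minimal NumPy illustration with toy frames; `frame_diff_score` and the thresholds are made up for the example, not anyone's actual camera pipeline.

```python
import numpy as np

def frame_diff_score(before: np.ndarray, after: np.ndarray, pixel_threshold: int = 30) -> float:
    """Fraction of pixels whose grayscale value changed by more than pixel_threshold."""
    diff = np.abs(before.astype(np.int16) - after.astype(np.int16))
    return float((diff > pixel_threshold).mean())

# Toy frames: 'after' gains a bright 20x20 patch (e.g. a puddle) that 'before' lacks.
before = np.zeros((100, 100), dtype=np.uint8)
after = before.copy()
after[40:60, 40:60] = 200  # hypothetical changed region

score = frame_diff_score(before, after)
print(score)  # 0.04 — 400 changed pixels out of 10,000
```

In practice you'd compare against a rolling background model rather than a single "before" frame, since lighting drifts over the course of a day.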
Maybe one way is to let users upload it for free, interactively search the first 10s for free, and then require payment to look through the rest of the video. If I felt confident that it was going to work, and I had a real reason to search through the video, I would probably pay between $1 and $5, depending on the length of the video. Much more than that and I'd rather just scrub through it myself. I guess the length of the video is roughly proportional to how much I'm willing to pay to not have to scrub through it myself.
The other thing is that it's much easier to scrub through video and find an image than to find something in the audio. If I want to find a bike, I can just scrub through at 10x and look for bikes; a bike isn't going to appear in a single frame and disappear in the blink of an eye. If it also searched audio, it would be worth even more.
A few thoughts/questions here:
1. What markets and use-cases were you thinking of when building this MVP? The applications could be broad, but it seems like you're expecting CLIP to handle bespoke queries and hoping it returns something relevant. It would also be interesting to test what happens when you search for something that doesn't exist in the video: can you handle that well enough (assuming you're just picking a simple threshold to identify relevant search results)?
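The "query doesn't exist in the video" case mentioned above comes down to not blindly returning the top-ranked frame. One way to sketch it: only return the best frame if its similarity score clears a cutoff. `best_match` and the `min_score` value here are invented for illustration; a real cutoff would be calibrated on labeled clips.

```python
import numpy as np

def best_match(similarities, min_score=0.25):
    """Return (frame_index, score) for the best frame, or None if nothing
    clears min_score (i.e. the query is probably absent from the video)."""
    idx = int(np.argmax(similarities))
    score = float(similarities[idx])
    return (idx, score) if score >= min_score else None

print(best_match([0.12, 0.31, 0.18]))  # (1, 0.31)
print(best_match([0.05, 0.08, 0.04]))  # None — no frame is a plausible hit
```

Without a cutoff like this, a search for "elephant" in bike-theft footage would still confidently return *some* frame, which is exactly the failure mode the question is probing.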
2. Licensing is something that has always piqued my curiosity when it comes to ML-based apps. Do you have a sense of the commercial usability of models such as CLIP, especially when the datasets they were probably trained on were not licensed for commercial use? The same applies to the raw video data uploaded by the user.
Markets: the use-cases I had in mind were:
- home security
- searching through long home videos
- production companies with large video archives (this would require more tooling)
I'm unsure whether to focus on one of these groups or to go for a more generic tool. I'll add a video demo to the landing page. So far, in all the tests I've performed, the ML model generalizes well enough to cover this range of uses.
Licensing: I need to research this further. I'm also not sure how the licensing changes given that I've fine-tuned the model on my own data.
FWIW I recall having seen something similar with Google Cloud's Video Intelligence API (https://towardsdatascience.com/building-an-ai-powered-search...). Building something generic would be hard to get right, especially if your users want high precision and recall from their search results.
Re: licensing, the world of startups is somewhat of a wild west these days, with folks offering pre-trained models as-a-service without really thinking about the licensing implications (on both the dataset and model fronts). Huggingface is a classic example: they seem to suggest that it's perfectly OK to fine-tune and use commercially (https://github.com/huggingface/transformers/issues/3357#issu...), but I'm not certain that their lawyers would put it the same way.
I totally understand if you'd like to keep some/all of this secret, but I thought it's worth a shot :)
ML: a fine-tuned CLIP model. Each video frame is embedded using CLIP, and the image embedding is compared against the text query embedding.
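The comparison step above is typically cosine similarity between the frame embeddings and the query embedding. This sketch uses random NumPy vectors as stand-ins for real CLIP outputs (which would come from the image and text encoders), so the dimensions and construction here are illustrative only.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
# Stand-ins for CLIP outputs: one 512-d vector per extracted frame.
frame_embeddings = rng.normal(size=(5, 512))
# Pretend the text query embedding happens to resemble frame 3.
text_embedding = frame_embeddings[3] + 0.1 * rng.normal(size=512)

scores = [cosine_sim(f, text_embedding) for f in frame_embeddings]
print(int(np.argmax(scores)))  # 3 — the resembling frame ranks highest
```

With real CLIP, both embeddings are usually L2-normalized first, so the comparison reduces to a single matrix-vector dot product across all frames.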
Architecture: everything is serverless using AWS Lambda. The basic flow is: the video is uploaded to storage, a Lambda converts the video to still frames, ML inference runs on each frame, and the inference results are aggregated to produce the customer output.
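The flow above can be sketched end-to-end with the storage and inference steps stubbed out. All function names here are illustrative, not the actual Lambda handlers, and the hard-coded scores stand in for real CLIP inference.

```python
# Sketch of the serverless flow; each function corresponds to one stage.

def extract_frames(video):
    """Stub for the frame-extraction Lambda (e.g. ffmpeg sampling ~1 frame/sec)."""
    return [f"{video}:frame{i}" for i in range(3)]

def score_frame(frame, query):
    """Stub for the per-frame ML inference Lambda; returns a similarity score."""
    return 0.9 if "frame1" in frame else 0.1

def search(video, query, top_k=1):
    """Aggregation stage: score every frame and return the best matches."""
    scored = [(score_frame(f, query), f) for f in extract_frames(video)]
    return sorted(scored, reverse=True)[:top_k]

print(search("upload.mp4", "person riding bike"))
# [(0.9, 'upload.mp4:frame1')]
```

In the real deployment the stages would be chained by storage events and a queue rather than direct calls, since a 4-hour video produces far more frames than one invocation should handle.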