We were approached by a startup based in the US with a problem: as smartphone usage increases, so does the number of pictures families accumulate, making it difficult to locate specific images. The startup wanted to create an app that would allow users to search for images using natural language, like “pictures of me and my wife on the beach eating ice cream.” However, they didn’t know how to extract this information from the images. That’s where we came in.
We analyzed the latest artificial intelligence models and determined that image search was not only possible, but that the same techniques could support a range of additional features the startup hadn’t considered. Our goal was to develop a system that could:
- Detect all elements in an image, including people, locations, colors, and activities (using object detection and image captioning)
- Match the user’s query with the elements detected in the image using natural language processing (NLP)
- Return the most relevant images with minimal latency (using clustering)
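To illustrate the third point, here is a minimal sketch of clustering-based retrieval. All names and values are hypothetical: the embeddings are random stand-ins for the vectors a real encoder would produce, and the tiny k-means loop takes the place of a production indexing library. The idea is simply that, at query time, only the nearest cluster is scanned rather than the whole library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for image embeddings; a real system would use vectors
# produced by the image encoder (hypothetical 8-dim vectors here).
embeddings = rng.normal(size=(200, 8))

def kmeans(x, k, iters=10):
    # Minimal k-means: assign each vector to its nearest centroid,
    # then recompute centroids, for a fixed number of iterations.
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(embeddings, k=4)

def search(query_vec, top_n=5):
    # Scan only the cluster nearest to the query instead of every image,
    # trading a little recall for much lower latency.
    nearest = np.linalg.norm(centroids - query_vec, axis=1).argmin()
    idx = np.where(labels == nearest)[0]
    dists = np.linalg.norm(embeddings[idx] - query_vec, axis=1)
    return idx[dists.argsort()[:top_n]]

hits = search(rng.normal(size=8))  # up to top_n candidate image indices
```

In practice an approximate-nearest-neighbor index plays this role, but the latency win comes from the same principle: restricting the distance computations to a small candidate set.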
We took a pragmatic approach, focused on delivering a successful outcome for the client. Our solution was developed iteratively, with each version bringing the client closer to their desired result. We implemented state-of-the-art object detection and image captioning, which allowed us to group individual image elements into descriptive sentences. Then, using contrastive learning, we were able to match these captions with user queries, yielding the desired results.
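The caption-matching step above can be sketched in a few lines. This is a deliberately simplified stand-in: the toy bag-of-words vectors below replace the contrastively trained text embeddings a real system would use, and the captions and query are invented examples.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; the production system would use a
    # contrastively trained text encoder instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

captions = [
    "two people eating ice cream on the beach",
    "a dog running in the park",
    "a family at a birthday party",
]

def best_match(query, captions):
    # Rank generated captions by similarity to the query; the image
    # behind the highest-scoring caption is returned to the user.
    return max(captions, key=lambda c: cosine(embed(query), embed(c)))

print(best_match("me and my wife on the beach eating ice cream", captions))
# → "two people eating ice cream on the beach"
```

With learned embeddings, the same ranking generalizes to paraphrases and synonyms that a word-overlap measure would miss, which is why contrastive learning was the right fit for free-form queries.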
Finding the right solution for this project was an iterative process; with each version, the client got closer to their desired outcome!
95% Image Extraction
The system retrieves the desired images using only natural-language queries
96% Image Captioning
The captions generated successfully described all the traits required by the customer
100% Flexible System
The system has been designed in such a way that new features can be easily integrated