Overview
We were approached by a startup based in the US with a problem: as smartphone usage increases, so does the number of pictures families accumulate, making it difficult to locate specific images. The startup wanted to create an app that would allow users to search for images using natural language, like “pictures of me and my wife on the beach eating ice cream.” However, they didn’t know how to extract this information from the images. That’s where we came in.
CHALLENGES
We analyzed the latest artificial intelligence models and determined not only was image search possible, but also a range of additional features the startup hadn’t considered. Our goal was to develop a system that could:
- Detect all elements in an image, including people, locations, colors, and activities (using object detection and image captioning)
- Match the user’s query with the elements detected in the image using natural language processing (NLP)
- Return the most relevant images with minimal latency (using clustering)




Benefits
We took a pragmatic approach, focused on delivering a successful outcome for the client. Our solution was developed iteratively, with each version bringing the client closer to their desired result. We implemented state-of-the-art object detection and image captioning, which allowed us to group individual image elements into descriptive sentences. Then, using contrastive learning, we were able to match these captions with user queries, yielding the desired results.
The Solution
Finding the perfect solution for this project was an iterative process, with each version the client was getting closer to its desired outcome!
Object Detection
Using the state-of-the-art models for OD wasn't enough, since our client wanted a general description of the image
Image Captioning
Adding IC grouped the individual detections into a descriptive sentence, which matched the client's requirement
Natural Language Processing
Use Contrastive Learning to match user queries with image captions yielded the results that the client was looking for!
The Results
95% Image Extraction
The system is able to extract the desired images using only natural language queries
96% Image Captioning
The captions generated successfully described all the traits required by the customer
100% Flexible System
The system has been designed in such a way that new features can be easily integrated