A warehouse manager in Birmingham gets a call: someone walked off with £10,000 of stock. He has 14 cameras and 30 days of footage. His team spends four hours scrubbing through it just to find 40 seconds of useful video.
That's not a security system. That's an expensive hard drive. The cameras saw everything and told him nothing — which is the gap AI actually closes. Below are the four levels of what's possible with the cameras you already own, from smart recording all the way to a system you can ask questions.
Prefer to watch? The full walkthrough is above. This post goes deeper on the tools and the trade-offs at each level.
For five years I've built computer-vision security systems professionally, most recently a weapon-detection system for a US firm running over a thousand live camera streams at once. The thread running through every level below is one number: how many alerts a day a human actually has to look at. We'll take the Birmingham warehouse from 10,000 a day down to 10, and then past detection entirely.
Level 1 · Smart recording
Core tech: Motion detection
Level 1 is an NVR, a network video recorder: a computer sitting on your network, quietly recording the streams from every camera you care about, so the footage is ready when you need it. The Birmingham warehouse already had this. Most businesses do. The interesting question is how the recorder decides that something worth flagging has happened.
The simplest answer is motion detection. The most basic version compares two consecutive frames and flags any difference. The problem: it flags everything — a curtain in a draught, a tree, a fan spinning in the corner. In the warehouse, that's roughly 10,000 detections a day. Useless.
Basic motion flags everything; a sequence filter keeps only real movement. Look at a sequence of frames instead of two, and constant motion (the curtain, the fan) drops away. Only genuine change survives.
That sequence-based approach is the upgrade. It can't tell a person from a stray cat, but it kills the constant-motion noise — bringing the warehouse from 10,000 down to about 2,000 detections a day. Those are your level-1 alerts.
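To make the difference concrete, here's a minimal sketch in plain Python. It is not how Frigate or any commercial VMS implements motion filtering: frames are tiny grayscale grids, and the function names, the `threshold` of 25 and the `max_activity` cut-off of 0.3 are all illustrative assumptions. Production systems usually reach for a proper background-subtraction model instead.

```python
from typing import List

Frame = List[List[int]]  # grayscale pixels, 0-255 (toy representation)


def changed_mask(a: Frame, b: Frame, threshold: int = 25) -> List[List[bool]]:
    """Per-pixel mask: True where two frames differ by more than threshold."""
    return [
        [abs(pa - pb) > threshold for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def basic_motion(prev: Frame, curr: Frame, threshold: int = 25) -> int:
    """Two-frame differencing: count every pixel that changed."""
    return sum(cell for row in changed_mask(prev, curr, threshold) for cell in row)


def sequence_motion(
    history: List[Frame],
    curr: Frame,
    threshold: int = 25,
    max_activity: float = 0.3,
) -> int:
    """Count changed pixels, skipping pixels that change constantly.

    A pixel's "activity" is the share of consecutive history pairs in which
    it changed. Pixels above max_activity (a fan, a curtain) are ignored.
    """
    if len(history) < 2:
        raise ValueError("need at least two history frames")
    past_masks = [changed_mask(a, b, threshold) for a, b in zip(history, history[1:])]
    latest = changed_mask(history[-1], curr, threshold)
    count = 0
    for y, row in enumerate(latest):
        for x, changed in enumerate(row):
            if not changed:
                continue
            activity = sum(mask[y][x] for mask in past_masks) / len(past_masks)
            if activity <= max_activity:
                count += 1
    return count


# Toy 1x4 scene: pixel 1 is a fan flickering every frame,
# pixel 2 is still background until something arrives in the newest frame.
history = [[[50, 0 if i % 2 == 0 else 200, 50, 50]] for i in range(6)]
curr = [[50, 0, 180, 50]]

print(basic_motion(history[-1], curr))  # fan and new arrival both counted
print(sequence_motion(history, curr))   # only the new arrival counted
```

In the toy scene, two-frame differencing reports both the fan and the new arrival, while the sequence version keeps only the arrival.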
Open source
Frigate
Free, self-hosted. Drop it on any box, point it at your cameras, scrub footage with buttery-smooth timeline scrolling.
Commercial
Milestone XProtect
Enterprise VMS — hardware and software bundled. Less upfront headache.
Avigilon
Same enterprise category. Plug-and-play, but you're inside a locked ecosystem.
Sequence-based motion filtering removes the curtain-and-fan noise that basic two-frame detection drowns in.
Level 2 · Real-time alerts
Core tech: Object detection
Level 2 is real-time alerts, and the technology that unlocks it is object detection. You hand the model an image and it tells you what's in it and where: "three people, here, here and here" — each wrapped in a bounding box. Now an alert can mean something specific instead of "something moved."
A quick myth-buster: people love the phrase "AI security camera," but the AI almost never runs on the camera. It runs on a separate server the camera feeds into.
Detect the person
A box, a label, a confidence score — only when a human is actually in frame.
Then filter the staff
Face recognition quietly drops everyone on the payroll out of your alert stream.
What a model can detect depends entirely on what it was trained on. Train on the COCO dataset and you get 80 everyday classes — person, car, dog, bag. For trespassing in a forbidden zone, that's already enough: detect a person where one shouldn't be, and you're done.
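Here's a rough sketch of that forbidden-zone check. It assumes a detector has already produced labelled boxes; the `Detection` class, the zone coordinates and the `min_confidence` of 0.5 are hypothetical and not taken from any particular YOLO release. The bottom-centre of each box stands in for where the person's feet are.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Detection:
    label: str
    confidence: float
    box: Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: is point p inside the polygon?"""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def trespass_alerts(
    detections: List[Detection],
    zone: List[Point],
    min_confidence: float = 0.5,
) -> List[Detection]:
    """Keep confident 'person' detections whose feet fall inside the zone."""
    alerts = []
    for d in detections:
        if d.label != "person" or d.confidence < min_confidence:
            continue
        x1, _, x2, y2 = d.box
        foot = ((x1 + x2) / 2, y2)  # bottom-centre of the bounding box
        if point_in_polygon(foot, zone):
            alerts.append(d)
    return alerts


forbidden_zone = [(100, 100), (300, 100), (300, 300), (100, 300)]
detections = [
    Detection("person", 0.90, (150, 120, 200, 250)),  # inside the zone
    Detection("person", 0.90, (400, 100, 450, 250)),  # outside the zone
    Detection("dog", 0.80, (150, 150, 200, 200)),     # wrong class
    Detection("person", 0.30, (150, 120, 200, 250)),  # too uncertain
]
alerts = trespass_alerts(detections, forbidden_zone)
print(len(alerts))
```

Using the feet rather than the box centre is a design choice: with an angled camera, a person's head can overlap a zone they are not actually standing in.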
Harder cases need their own data. For Angel Protection, we built firearm detection for US schools under two brutal constraints: a top-down CCTV angle, and a gun that appears only a few dozen pixels wide. Off-the-shelf datasets had neither — so we built one. Here's how that system works across 1,000+ cameras →
Open source
YOLO + COCO
Real-time detector across 80 common classes. Great for trespassing, forbidden objects, and quality-control 'missing part' checks.
Custom-trained YOLO
When your objects or angles aren't in public datasets — the route we took for firearm detection.
Commercial
Verkada
Cloud-first, subscription, plug-and-play. Detects what it's pre-trained on.
Rhombus
Same category — managed cloud analytics, less room for the truly custom.
Person detection alone cuts 2,000 to ~250. Add face recognition to exclude known staff and it lands near 50 — finally a manageable number.
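The staff filter can be sketched as a nearest-match lookup. This assumes some face-recognition model (not shown) has already turned each face into an embedding vector; the three-number vectors, the names and the 0.8 similarity threshold are toy values for illustration, not settings from a real deployment.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_staff(
    embedding: List[float],
    staff: Dict[str, List[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the best-matching staff name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in staff.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def unknown_faces(
    embeddings: List[List[float]],
    staff: Dict[str, List[float]],
) -> List[List[float]]:
    """Keep only the faces that do not match anyone on the payroll."""
    return [e for e in embeddings if match_staff(e, staff) is None]


# Hypothetical enrolled staff and two faces seen on camera.
staff = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
seen = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.7]]

print(match_staff(seen[0], staff))
print(len(unknown_faces(seen, staff)))
```

Only the unmatched faces go on to raise an alert. Where to set the threshold is a trade-off: too low and strangers get waved through as staff, too high and staff keep triggering alerts.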
Level 3 · Behavioural analysis
Core tech: Video understanding
Level 2 judges each frame in isolation, and even Level 1's sequence filter only asks whether pixels changed. Level 3 looks at what happens across a sequence: analysing the video, not the image. That shift unlocks an entirely different class of question: not "is there a person?" but "what is that person doing?"
Tracking a path over time: tracking (DeepSORT), then pose (joints + classifier), then intent (loiter / fall / fight).
Tracking (DeepSORT and friends) follows each person across frames — perfect for loitering and store flow analysis. Pose estimation maps the body's joints, and feeding that to a classifier catches fights, violence, or — more usefully — a person falling. Detect more than one object class at once and "person + bag, then person leaves" becomes unattended-luggage detection. The hardware bump over level 2 is real but modest.
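Once a tracker has given each person a stable ID, loitering reduces to measuring dwell time. The sketch below assumes the tracker's output has been flattened into `(track_id, seconds, x, y)` tuples; that format, the rectangular zone and the five-minute `min_dwell_s` are illustrative assumptions, not DeepSORT's actual output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Observation = Tuple[int, float, float, float]  # track_id, t_seconds, x, y
Zone = Tuple[float, float, float, float]       # x1, y1, x2, y2


def loitering_tracks(
    observations: Iterable[Observation],
    zone: Zone,
    min_dwell_s: float = 300.0,
) -> Dict[int, float]:
    """Map track_id -> longest unbroken stay in the zone, if long enough."""
    zx1, zy1, zx2, zy2 = zone
    by_track = defaultdict(list)
    for track_id, t, x, y in observations:
        by_track[track_id].append((t, x, y))

    result = {}
    for track_id, points in by_track.items():
        points.sort()  # order each track by time
        longest = 0.0
        entered_at = None
        for t, x, y in points:
            if zx1 <= x <= zx2 and zy1 <= y <= zy2:
                if entered_at is None:
                    entered_at = t
                longest = max(longest, t - entered_at)
            else:
                entered_at = None  # left the zone, so the stay is over
        if longest >= min_dwell_s:
            result[track_id] = longest
    return result


side_door = (0.0, 0.0, 100.0, 100.0)
observations = [
    # Track 1 stands by the door for six minutes.
    *[(1, float(t), 50.0, 50.0) for t in range(0, 361, 60)],
    # Track 2 just walks through.
    (2, 0.0, 150.0, 50.0),
    (2, 10.0, 90.0, 50.0),
    (2, 20.0, 40.0, 50.0),
    (2, 30.0, -20.0, 50.0),
]
print(loitering_tracks(observations, side_door))
```

Everything hinges on the tracker keeping the same ID for the same person; if IDs swap when people cross paths, one long stay gets split into several short ones and slips under the threshold.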
Open source
DeepSORT
Multi-object tracking — turn detections into trajectories for loitering and flow analysis.
Pose estimation + classifier
Joint detection feeding a small model — fight, fall, and violence detection.
Commercial
Genetec-style VMS
Subscription analytics: trespass, forklift near-miss, PPE violations, slip-and-fall.
Sport-AI platforms
Hosted behaviour analytics — no on-prem servers, plug-and-play.
Behaviour analysis keeps only the alerts that look like genuine loitering or intent — the warehouse's 50 a day becomes about 10.
Notice what we haven't used yet
Levels 2 and 3 are unmistakably AI — and not a single LLM in sight. Read the news and you'd think AI is large language models. But everything so far — object detection, tracking, pose, behaviour — was solved without them. That matters, because the next level is where LLMs finally earn their (considerable) keep.
Level 4 · Conversational AI
Core tech: Vision-language models
Level 4 is conversational. You query your footage the same way you'd ask ChatGPT or Claude a question — in plain language. The interesting things you want to know are rarely the ones you set up a detector for in advance, and this is what finally lets you go looking for them.
Did anyone hang around the loading bay after 6pm last week?
Two instances. Tuesday 18:42 — a person waited 6 minutes by the side door, no vehicle. Thursday 19:05 — same individual, returned and looked into the bay.
The catch: this takes a lot more horsepower. Even the smaller vision-language models need far beefier hardware than anything at levels 1–3, which means real money. The pay-off isn't a smaller alert count — it's open-ended exploration. Instead of waiting for a theft, you can ask: "is anyone behaving like they're casing the place?" and go looking before it happens.
Open source
VideoLLaMA3
Open vision-language model for video — the kind of backbone we reach for when a client needs natural-language search over their own footage.
Commercial
Coram AI
Natural-language video search as a managed product.
Cortex-style platforms
Ask-your-footage search, hosted — convenience over control.
This isn't about pushing the alert count lower. It's about asking questions you never built a detector for — and catching the thing before it becomes a 4-hour scrub.
So — which level are you at?
The whole journey, in one view. Each step is a real decision, not a forced upgrade.
The Birmingham warehouse went from 10,000 meaningless detections a day to 10 worth looking at — and with level 4, to spotting the planning before the theft ever happened. Once you know which level you're standing on and the problem you're actually trying to solve, the next step gets obvious.
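For anyone who wants the funnel as arithmetic, the short script below recomputes each step's cut from the warehouse figures quoted in this post. The stage labels are mine, and the counts are the post's rounded estimates rather than measurements.

```python
from typing import List, Tuple

# Alerts per day at each stage, as quoted in this post.
FUNNEL: List[Tuple[str, int]] = [
    ("Basic two-frame motion", 10_000),
    ("Level 1: sequence-based motion", 2_000),
    ("Level 2: person detection", 250),
    ("Level 2: plus staff filter", 50),
    ("Level 3: behaviour analysis", 10),
]


def stage_reductions(stages: List[Tuple[str, int]]) -> List[Tuple[str, int, float]]:
    """For each stage after the first: (name, alerts, % cut vs. previous stage)."""
    rows = []
    for (_, prev), (name, curr) in zip(stages, stages[1:]):
        cut = round(100 * (prev - curr) / prev, 1)
        rows.append((name, curr, cut))
    return rows


rows = stage_reductions(FUNNEL)
first, last = FUNNEL[0][1], FUNNEL[-1][1]
overall = round(100 * (first - last) / first, 1)

for name, count, cut in rows:
    print(f"{name}: {count} alerts/day ({cut}% fewer than the step before)")
print(f"Overall: {overall}% fewer alerts than basic motion detection")
```

Each step removes roughly four-fifths or more of what the previous one let through, which is why stacking even simple filters ends up removing 99.9% of the original noise.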
Gradient Insight
Cameras recording, but not telling you anything?
Security, retail, manufacturing, logistics — if your cameras are stuck at level 1, we'll figure out which level actually solves your problem. No pitch, just a conversation to see if it's a fit.