The 4 Levels of AI Security Cameras: From Recording to Conversation

A warehouse manager in Birmingham gets a call: someone walked off with £10,000 of stock. He has 14 cameras and 30 days of footage. His team spends four hours scrubbing through it just to find 40 seconds of useful video.

That's not a security system. That's an expensive hard drive. The cameras saw everything and told him nothing — which is the gap AI actually closes. Below are the four levels of what's possible with the cameras you already own, from smart recording all the way to a system you can ask questions.

Prefer to watch? The full walkthrough is above. This post goes deeper on the tools and the trade-offs at each level.

For five years I've built computer-vision security systems professionally, most recently a weapon-detection system for a US firm running over a thousand live camera streams at once. The thread running through every level below is one number: how many alerts a day a human actually has to look at. We'll take the Birmingham warehouse from 10,000 a day down to 10, and then past detection entirely.

Level 1 · Smart recording

Core tech: Motion detection

Level 1 is an NVR — a network video recorder. A computer sitting on your network, quietly recording the streams from every camera you care about, so the footage is ready when you need it. The Birmingham warehouse already had this. Most businesses do. The interesting question is what you use to detect something.

The simplest answer is motion detection. The most basic version compares two consecutive frames and flags any difference. The problem: it flags everything — a curtain in a draught, a tree, a fan spinning in the corner. In the warehouse, that's roughly 10,000 detections a day. Useless.

How basic motion detection works

Figure: frame t − 1 subtracted from frame t. When the frames match, nothing changed, so ignore.

Subtract one frame from the next. Identical frames cancel to nothing — ignore. When something moves, the changed pixels light up and you have motion. Simple, fast, and exactly why a curtain or a spinning fan sets it off too.
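A minimal sketch of that two-frame subtraction, assuming grayscale frames held as NumPy arrays. The function names and both thresholds (`pixel_threshold`, `min_fraction`) are illustrative choices, not values taken from any particular NVR.

```python
import numpy as np

def motion_fraction(prev_frame, curr_frame, pixel_threshold=25):
    """Return the fraction of pixels that changed between two grayscale frames."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    changed = diff > pixel_threshold
    return float(changed.mean())

def has_motion(prev_frame, curr_frame, pixel_threshold=25, min_fraction=0.01):
    """Flag motion when at least min_fraction of the pixels changed."""
    return motion_fraction(prev_frame, curr_frame, pixel_threshold) >= min_fraction

# Toy frames: an empty scene, then the same scene with a bright 40x40 patch.
still = np.zeros((120, 160), dtype=np.uint8)
moved = still.copy()
moved[40:80, 60:100] = 200

print(has_motion(still, still))  # False: identical frames cancel to nothing
print(has_motion(still, moved))  # True: the changed pixels light up
```

Nothing here knows what moved, which is why a curtain trips it just as easily as a person.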

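The upgrade is sequence-based motion detection: rather than comparing just two consecutive frames, it looks across a longer run of frames, so motion that never stops can be filtered out. It can't tell a person from a stray cat, but it kills the constant-motion noise, bringing the warehouse from 10,000 down to about 2,000 detections a day. Those are your level-1 alerts.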
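One way to picture filtering over a sequence, offered as an illustration rather than what any specific product does: keep a running per-pixel record of how often each pixel changes, and ignore pixels that are busy most of the time. The class name, `learn_rate` and `busy_rate` below are assumptions made for this sketch.

```python
import numpy as np

class ConstantMotionFilter:
    """Suppress pixels that change in most frames (fans, curtains, trees)."""

    def __init__(self, shape, pixel_threshold=25, learn_rate=0.1, busy_rate=0.5):
        self.pixel_threshold = pixel_threshold
        self.learn_rate = learn_rate      # how quickly the activity history adapts
        self.busy_rate = busy_rate        # above this, a pixel counts as constant motion
        self.prev = None
        self.activity = np.zeros(shape, dtype=np.float32)

    def update(self, frame):
        """Return a boolean mask of changed pixels that are not habitually busy."""
        frame = frame.astype(np.int16)
        if self.prev is None:
            self.prev = frame
            return np.zeros(frame.shape, dtype=bool)
        changed = np.abs(frame - self.prev) > self.pixel_threshold
        self.prev = frame
        # Judge "busy" on the history from before this frame, then update it.
        novel = changed & (self.activity < self.busy_rate)
        self.activity += self.learn_rate * (changed.astype(np.float32) - self.activity)
        return novel

filt = ConstantMotionFilter((120, 160))
for i in range(30):
    frame = np.zeros((120, 160), dtype=np.uint8)
    if i % 2:
        frame[0:20, 0:20] = 200           # a "fan" that flickers on every frame
    mask = filt.update(frame)
fan_still_alerting = bool(mask[0:20, 0:20].any())

frame = np.zeros((120, 160), dtype=np.uint8)
frame[60:100, 80:120] = 200               # something new appears elsewhere
mask = filt.update(frame)
person_flagged = bool(mask[60:100, 80:120].all())

print(fan_still_alerting, person_flagged)  # False True
```

The trade-off in this sketch is that anything passing in front of the fan is suppressed too, and it still cannot say whether the new movement is a person or a cat.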

Open source

Frigate

Free, self-hosted. Drop it on any box, point it at your cameras, scrub footage with buttery-smooth timeline scrolling.

Commercial

Milestone XProtect

Enterprise VMS — hardware and software bundled. Less upfront headache.

Avigilon

Same enterprise category. Plug-and-play, but you're inside a locked ecosystem.

10,000 → 2,000 alerts / day

Sequence-based motion filtering removes the curtain-and-fan noise that basic two-frame detection drowns in.

Level 2 · Real-time alerts

Core tech: Object detection

Level 2 is real-time alerts, and the technology that unlocks it is object detection. You hand the model an image and it tells you what's in it and where: "three people, here, here and here" — each wrapped in a bounding box. Now an alert can mean something specific instead of "something moved."

A quick myth-buster: people love the phrase "AI security camera," but the AI almost never runs on the camera. It runs on a separate server the camera feeds into.

Detect the person

Example detection: person 0.98

A box, a label, a confidence score, only when a human is actually in frame.

Then filter the staff

Example results: Unknown visitor → ALERT; Warehouse staff #4 → known; Unknown visitor → ALERT; Manager → known.

Face recognition quietly drops everyone on the payroll out of your alert stream.
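A rough sketch of the staff filter, assuming some face-embedding model has already turned each face into a vector (that model is not shown, and the 3-number vectors below are toy stand-ins). The function names and the 0.8 similarity cut-off are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_staff(embedding, staff_embeddings, min_similarity=0.8):
    """Return the best-matching staff name, or None if nobody is close enough."""
    best_name, best_score = None, min_similarity
    for name, known in staff_embeddings.items():
        score = cosine_similarity(embedding, known)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

def should_alert(embedding, staff_embeddings):
    """Alert only for faces that match nobody on the payroll."""
    return match_staff(embedding, staff_embeddings) is None

# Toy enrolment data; real embeddings have hundreds of dimensions.
staff = {
    "Warehouse staff #4": [0.9, 0.1, 0.4],
    "Manager": [0.1, 0.95, 0.2],
}

print(match_staff([0.88, 0.12, 0.42], staff))  # Warehouse staff #4
print(should_alert([0.3, 0.3, 0.9], staff))    # True: unknown visitor
```

Where the cut-off sits is a judgment call: too low and strangers pass as staff, too high and staff keep triggering alerts.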

What a model can detect depends entirely on what it was trained on. Train on the COCO dataset and you get 80 everyday classes — person, car, dog, bag. For trespassing in a forbidden zone, that's already enough: detect a person where one shouldn't be, and you're done.
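To make the trespassing case concrete, here is a sketch that takes detector output and keeps only people standing inside a forbidden zone. The `Detection` record, the zone polygon and the 0.5 confidence floor are assumptions for illustration; the detector itself is out of scope, so its output is written by hand.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges lie to the right of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def trespassers(detections, zone, min_confidence=0.5):
    """Return confident person detections whose feet are inside the zone."""
    hits = []
    for det in detections:
        if det.label != "person" or det.confidence < min_confidence:
            continue
        x1, y1, x2, y2 = det.box
        # Use the bottom-centre of the box as the point where the person stands.
        if point_in_polygon((x1 + x2) / 2, y2, zone):
            hits.append(det)
    return hits

zone = [(100, 100), (400, 100), (400, 300), (100, 300)]
detections = [
    Detection("person", 0.98, (200, 120, 260, 280)),  # standing inside the zone
    Detection("person", 0.91, (500, 120, 560, 280)),  # outside the zone
    Detection("dog", 0.88, (150, 200, 220, 260)),     # inside, but not a person
]

print(len(trespassers(detections, zone)))  # 1
```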

Harder cases need their own data. For Angel Protection, we built firearm detection for US schools under two brutal constraints: a top-down CCTV angle, and a gun that appears only a few dozen pixels wide. Off-the-shelf datasets had neither — so we built one. Here's how that system works across 1,000+ cameras →

Open source

YOLO + COCO

Real-time detector across 80 common classes. Great for trespassing, forbidden objects, and quality-control 'missing part' checks.

Custom-trained YOLO

When your objects or angles aren't in public datasets — the route we took for firearm detection.

Commercial

Verkada

Cloud-first, subscription, plug-and-play. Detects what it's pre-trained on.

Rhombus

Same category — managed cloud analytics, less room for the truly custom.

2,000 → 50 alerts / day

Person detection alone cuts 2,000 to ~250. Add face recognition to exclude known staff and it lands near 50, finally a manageable number.

Level 3 · Behavioural analysis

Core tech: Video understanding

Level 2 looks at single frames, and level 1 only at raw pixel changes between them. Level 3 follows people and objects through a sequence, analysing the video, not the image. That shift unlocks an entirely different class of question: not "is there a person?" but "what is that person doing?"

Figure: tracking a path over time. Track ID #7 is flagged as loitering after 4m12s. The stages shown are tracking (DeepSORT), pose (joints + classifier) and intent (loiter / fall / fight).

Tracking (DeepSORT and friends) follows each person across frames — perfect for loitering and store flow analysis. Pose estimation maps the body's joints, and feeding that to a classifier catches fights, violence, or — more usefully — a person falling. Detect more than one object class at once and "person + bag, then person leaves" becomes unattended-luggage detection. The hardware bump over level 2 is real but modest.
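Once a tracker hands out stable IDs, loitering comes down to measuring how long each ID stays in a watched zone. The sketch below assumes the tracker's output has already been reduced to `(track_id, timestamp_s)` pairs for detections inside that zone; the function name and the 240-second limit are illustrative assumptions.

```python
def loitering_tracks(observations, min_dwell_s=240.0):
    """Return {track_id: dwell_seconds} for tracks seen in the zone long enough.

    observations: iterable of (track_id, timestamp_s) pairs, in any order.
    """
    first_seen, last_seen = {}, {}
    for track_id, ts in observations:
        first_seen[track_id] = min(ts, first_seen.get(track_id, ts))
        last_seen[track_id] = max(ts, last_seen.get(track_id, ts))
    dwell = {tid: last_seen[tid] - first_seen[tid] for tid in first_seen}
    return {tid: d for tid, d in dwell.items() if d >= min_dwell_s}

observations = [
    (7, 0.0), (3, 10.0), (7, 120.0), (3, 40.0), (7, 252.0),
]

print(loitering_tracks(observations))  # {7: 252.0}, i.e. 4m12s
```

This measures first sighting to last sighting, so someone who leaves and returns counts as one long stay; a stricter version would break the track at long gaps.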

Open source

DeepSORT

Multi-object tracking — turn detections into trajectories for loitering and flow analysis.

Pose estimation + classifier

Joint detection feeding a small model — fight, fall, and violence detection.

Commercial

Genetec

Enterprise VMS with AI analytics built in — loitering, abnormal behaviour, PPE compliance.

Spot AI

Subscription video AI agents — forklift near-miss, falls, PPE violations, unattended workstations.

50 → 10 alerts / day

Behaviour analysis keeps only the alerts that look like genuine loitering or intent, so the warehouse's 50 a day becomes about 10.

Notice what we haven't used yet

Levels 2 and 3 are unmistakably AI — and not a single LLM in sight. Read the news and you'd think AI is large language models. But everything so far — object detection, tracking, pose, behaviour — was solved without them. That matters, because the next level is where LLMs finally earn their (considerable) keep.

Level 4 · Conversational AI

Core tech: Vision-language models

Level 4 is conversational. You query your footage the same way you'd ask ChatGPT or Claude a question — in plain language. The interesting things you want to know are rarely the ones you set up a detector for in advance, and this is what finally lets you go looking for them.

You: Did anyone hang around the loading bay after 6pm last week?

System: Two instances. Tuesday 18:42: a person waited 6 minutes by the side door, no vehicle. Thursday 19:05: same individual, returned and looked into the bay.

The catch: this takes a lot more horsepower. Even the smaller vision-language models need far beefier hardware than anything at levels 1–3, which means real money. The pay-off isn't a smaller alert count — it's open-ended exploration. Instead of waiting for a theft, you can ask: "is anyone behaving like they're casing the place?" and go looking before it happens.
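Because the model is the expensive part, one plausible design is to narrow the footage with cheap metadata first and only then hand the survivors to a vision-language model. The sketch below is an assumption about how such a query could be wired up, not a description of any product: `Clip`, `candidate_clips`, `answer` and the `describe_clip` callable are all hypothetical, and the model call is replaced by a stub.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class Clip:
    camera: str
    start: datetime
    path: str

def candidate_clips(clips, camera, after):
    """Cheap metadata filter: one camera, clips starting at or after a time of day."""
    return [c for c in clips if c.camera == camera and c.start.time() >= after]

def answer(question, clips, describe_clip):
    """Ask the (hypothetical) model about each remaining clip and collect its replies."""
    return [
        f"{clip.start:%A %H:%M}: {describe_clip(clip.path, question)}"
        for clip in clips
    ]

clips = [
    Clip("loading-bay", datetime(2024, 1, 2, 18, 42), "clips/a.mp4"),
    Clip("loading-bay", datetime(2024, 1, 2, 9, 15), "clips/b.mp4"),
    Clip("front-door", datetime(2024, 1, 2, 19, 5), "clips/c.mp4"),
]

# Stub standing in for a real vision-language model call.
def fake_model(path, question):
    return f"(model output for {path})"

question = "Did anyone hang around the loading bay after 6pm?"
shortlist = candidate_clips(clips, "loading-bay", time(18, 0))
replies = answer(question, shortlist, fake_model)
print(len(shortlist), replies)
```

Here three clips become one before the model is ever called, which is the kind of saving that keeps the hardware bill in check.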

Open source

Qwen3-VL

Open vision-language model — the kind of backbone we reach for when a client needs natural-language search over their own footage.

Commercial

Coram AI

ChatGPT-style natural-language video search as a managed product.

Ambient.ai

Conversational forensic search built on reasoning vision-language models.

This isn't about pushing the alert count lower. It's about asking questions you never built a detector for — and catching the thing before it becomes a 4-hour scrub.

So — which level are you at?

The whole journey, in one view. Each step is a real decision, not a forced upgrade.

Level 1 · Basic motion: 10,000 alerts / day

Level 1 · Smart motion: 2,000 alerts / day

Level 2 · Object detection: 250 alerts / day

Level 2 · + Face filtering: 50 alerts / day

Level 3 · Behaviour analysis: 10 alerts / day

Level 4 · Conversational AI: Ask
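For a sense of how much each step removes, the figures above can be turned into percentage cuts. This is plain arithmetic on the article's own numbers; the function name is just for illustration.

```python
funnel = [
    ("Basic motion", 10_000),
    ("Smart motion", 2_000),
    ("Object detection", 250),
    ("+ Face filtering", 50),
    ("Behaviour analysis", 10),
]

def step_reductions(stages):
    """Percentage of alerts removed by each stage relative to the one before it."""
    return [
        (name, round(100 * (1 - after / before), 1))
        for (_, before), (name, after) in zip(stages, stages[1:])
    ]

reductions = step_reductions(funnel)
overall = round(100 * (1 - funnel[-1][1] / funnel[0][1]), 1)

for name, pct in reductions:
    print(f"{name}: {pct}% fewer alerts")
print(f"Overall: {overall}% fewer alerts")  # 99.9%
```

Each step removes 80% or more of what the previous one let through, with object detection the steepest single cut at 87.5%.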

The Birmingham warehouse went from 10,000 meaningless detections a day to 10 worth looking at — and with level 4, to spotting the planning before the theft ever happened. Once you know which level you're standing on and the problem you're actually trying to solve, the next step gets obvious.

Gradient Insight

Cameras recording, but not telling you anything?

Security, retail, manufacturing, logistics — if your cameras are stuck at level 1, we'll figure out which level actually solves your problem. No pitch, just a conversation to see if it's a fit.

