Computer Vision · Security AI · Video Analytics · Guide

The 4 Levels of AI Security Cameras

From smart recording all the way to a system you can talk to. What AI can actually do with the cameras you already have — and how each level turns 30 days of dead footage into something you can act on.

10 min read
By Iu Ayala, Gradient Insight

Alerts per day, from noise to signal:

Level 1 · Basic motion: 10,000 (every curtain, fan and shadow)
Level 1 · Smart motion: 2,000 (constant motion filtered out)
Level 2 · Object detection: 250 (only when a person appears)
Level 2 · Face filter: 50 (known staff excluded)
Level 3 · Behaviour analysis: 10 (only genuine loitering)
Level 4 · Conversational AI: ask (explore before it happens)

At a glance: 4 levels of camera AI, from recording to conversation. Daily alerts filtered from 10,000 down to 10. £10,000 of stock walked out of the Birmingham warehouse. 4 hours to find the 40 seconds of footage that mattered.

A warehouse manager in Birmingham gets a call: someone walked off with £10,000 of stock. He has 14 cameras and 30 days of footage. His team spends four hours scrubbing through it just to find 40 seconds of useful video.

That's not a security system. That's an expensive hard drive. The cameras saw everything and told him nothing — which is the gap AI actually closes. Below are the four levels of what's possible with the cameras you already own, from smart recording all the way to a system you can ask questions.

Prefer to watch? The full walkthrough is above. This post goes deeper on the tools and the trade-offs at each level.

For five years I've built computer-vision security systems professionally — most recently a weapon-detection system for a US firm running over a thousand live camera streams at once. The thread running through every level below is one number: how many alerts a day a human actually has to look at. We'll take the Birmingham theft from 10,000 a day down to 10 — and then past detection entirely.


Level 1 · Smart recording

Core tech: Motion detection

Level 1 is an NVR — a network video recorder. A computer sitting on your network, quietly recording the streams from every camera you care about, so the footage is ready when you need it. The Birmingham warehouse already had this. Most businesses do. The interesting question is what you use to detect something.

The simplest answer is motion detection. The most basic version compares two consecutive frames and flags any difference. The problem: it flags everything — a curtain in a draught, a tree, a fan spinning in the corner. In the warehouse, that's roughly 10,000 detections a day. Useless.
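To make the two-frame comparison concrete, here is a minimal sketch in Python with NumPy. The thresholds (`pixel_delta`, `min_changed_fraction`) and the synthetic frames are illustrative assumptions, not values taken from any particular NVR.

```python
import numpy as np

def motion_between(prev_frame, frame, pixel_delta=25, min_changed_fraction=0.01):
    """Return True if enough pixels changed between two grayscale frames."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    changed = diff > pixel_delta
    return bool(changed.mean() >= min_changed_fraction)

# Synthetic 100x100 scene: a flat background, then a bright 20x20 patch appears.
background = np.full((100, 100), 50, dtype=np.uint8)
with_patch = background.copy()
with_patch[40:60, 40:60] = 200

still = motion_between(background, background)   # nothing changed
moved = motion_between(background, with_patch)   # 4% of pixels changed
print(still, moved)
```

Nothing in this check knows what moved. A fan blade and a burglar look identical to it, which is why the detection count explodes.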

Figure: basic motion flags everything; a sequence filter keeps only real movement. Look at a sequence of frames instead of two, and constant motion (the curtain, the fan) drops away. Only genuine change survives.

That sequence-based approach is the upgrade. It can't tell a person from a stray cat, but it kills the constant-motion noise — bringing the warehouse from 10,000 down to about 2,000 detections a day. Those are your level-1 alerts.
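One way to picture the sequence-based idea: keep a per-pixel record of how often each pixel changes, and ignore the pixels that change almost all the time. This is a simplified sketch, not necessarily how Frigate or any commercial VMS does it. The class name, `alpha`, `busy_rate` and the toy scene are all assumptions made for illustration.

```python
import numpy as np

class SequenceMotionFilter:
    """Flags motion only in pixels that are not changing constantly (hypothetical helper)."""

    def __init__(self, shape, pixel_delta=25, alpha=0.1, busy_rate=0.3,
                 min_changed_fraction=0.01):
        self.prev = None
        self.activity = np.zeros(shape, dtype=np.float64)  # running rate of change per pixel
        self.pixel_delta = pixel_delta
        self.alpha = alpha
        self.busy_rate = busy_rate
        self.min_changed_fraction = min_changed_fraction

    def update(self, frame):
        frame = frame.astype(np.int16)
        if self.prev is None:
            self.prev = frame
            return False
        changed = np.abs(frame - self.prev) > self.pixel_delta
        self.prev = frame
        # Only count changes in pixels that have been mostly quiet so far.
        quiet = self.activity < self.busy_rate
        alert = (changed & quiet).mean() >= self.min_changed_fraction
        # Exponential moving average of how often each pixel changes.
        self.activity = (1 - self.alpha) * self.activity + self.alpha * changed
        return bool(alert)

def scene(t, intruder=False):
    """Toy frame: a 'fan' in one corner flickers on every frame."""
    frame = np.full((100, 100), 50, dtype=np.uint8)
    if t % 2 == 1:
        frame[0:20, 0:20] = 200
    if intruder:
        frame[60:80, 60:80] = 200
    return frame

motion_filter = SequenceMotionFilter((100, 100))
warmup_alerts = [motion_filter.update(scene(t)) for t in range(30)]
intruder_alert = motion_filter.update(scene(30, intruder=True))
print(warmup_alerts[-1], intruder_alert)
```

In this toy run the fan triggers alerts for the first few frames, then its pixels are classed as constantly busy and ignored, while a new patch elsewhere in the frame still raises an alert. The filter still has no idea whether that patch is a person or a cat.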

Open source

Frigate

Free, self-hosted. Drop it on any box, point it at your cameras, scrub footage with buttery-smooth timeline scrolling.

Commercial

Milestone XProtect

Enterprise VMS — hardware and software bundled. Less upfront headache.

Avigilon

Same enterprise category. Plug-and-play, but you're inside a locked ecosystem.

Alerts per day: 10,000 → 2,000. Sequence-based motion filtering removes the curtain-and-fan noise that basic two-frame detection drowns in.

Level 2 · Real-time alerts

Core tech: Object detection

Level 2 is real-time alerts, and the technology that unlocks it is object detection. You hand the model an image and it tells you what's in it and where: "three people, here, here and here" — each wrapped in a bounding box. Now an alert can mean something specific instead of "something moved."

A quick myth-buster: people love the phrase "AI security camera," but the AI almost never runs on the camera. It runs on a separate server the camera feeds into.

Figure: detect the person, then filter the staff. Detection draws a box, a label and a confidence score (for example "person 0.98"), and only when a human is actually in frame. Face recognition then quietly drops everyone on the payroll out of your alert stream: "Warehouse staff #4" and "Manager" are marked as known, while each unknown visitor raises an alert.

What a model can detect depends entirely on what it was trained on. Train on the COCO dataset and you get 80 everyday classes, including person, car, dog and backpack. For trespassing in a forbidden zone, that's already enough: detect a person where one shouldn't be, and you're done.
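The forbidden-zone case can be sketched without any model code at all, assuming a detector has already returned labels, confidences and boxes. The `Detection` type, the zone coordinates and the choice to test the bottom-centre of each box (roughly where the feet are) are illustrative assumptions, not the output format of any specific YOLO release.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges lie to the right of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the point's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def trespassers(detections, zone, min_confidence=0.5):
    """Keep confident 'person' detections whose feet fall inside the zone."""
    hits = []
    for d in detections:
        if d.label != "person" or d.confidence < min_confidence:
            continue
        x1, y1, x2, y2 = d.box
        if point_in_polygon((x1 + x2) / 2, y2, zone):
            hits.append(d)
    return hits

forbidden_zone = [(100, 100), (300, 100), (300, 300), (100, 300)]
detections = [
    Detection("person", 0.98, (150, 120, 200, 280)),  # standing inside the zone
    Detection("person", 0.95, (400, 100, 450, 250)),  # outside the zone
    Detection("dog", 0.90, (150, 150, 200, 250)),     # wrong class
    Detection("person", 0.30, (150, 120, 200, 280)),  # too uncertain
]
hits = trespassers(detections, forbidden_zone)
print(len(hits))
```

Only the first detection survives, so only one alert is raised. Everything else in the frame is ignored.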

Harder cases need their own data. For Angel Protection, we built firearm detection for US schools under two brutal constraints: a top-down CCTV angle, and a gun that appears only a few dozen pixels wide. Off-the-shelf datasets had neither — so we built one. Here's how that system works across 1,000+ cameras →

Open source

YOLO + COCO

Real-time detector across 80 common classes. Great for trespassing, forbidden objects, and quality-control 'missing part' checks.

Custom-trained YOLO

When your objects or angles aren't in public datasets — the route we took for firearm detection.

Commercial

Verkada

Cloud-first, subscription, plug-and-play. Detects what it's pre-trained on.

Rhombus

Same category — managed cloud analytics, less room for the truly custom.

Alerts per day: 2,000 → 50. Person detection alone cuts 2,000 to about 250. Add face recognition to exclude known staff and it lands near 50, finally a manageable number.
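Staff filtering usually comes down to comparing face embeddings. Here is a minimal sketch, assuming some face-recognition model (not shown, and not specified in this post) has already turned each face into a vector. The toy 3-dimensional vectors and the 0.8 threshold are made up for illustration.

```python
import numpy as np

def is_known(embedding, staff_embeddings, threshold=0.8):
    """True if the face is close (by cosine similarity) to any enrolled staff face."""
    e = embedding / np.linalg.norm(embedding)
    s = staff_embeddings / np.linalg.norm(staff_embeddings, axis=1, keepdims=True)
    return bool(np.max(s @ e) >= threshold)

# Two enrolled staff members (toy vectors; real embeddings have hundreds of dimensions).
staff = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

staff_face = np.array([0.9, 0.1, 0.0])  # close to the first enrolled face
stranger = np.array([0.0, 0.0, 1.0])    # unlike anyone enrolled

suppress_alert = is_known(staff_face, staff)
raise_alert = not is_known(stranger, staff)
print(suppress_alert, raise_alert)
```

The threshold is the trade-off to tune: too loose and a stranger is waved through as staff, too strict and the manager keeps setting off alerts.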

Level 3 · Behavioural analysis

Core tech: Video understanding

Levels 1 and 2 ask only whether something changed or what is in the frame. Level 3 follows what happens over time, analysing the video rather than the image. That shift unlocks an entirely different class of question: not "is there a person?" but "what is that person doing?"

Figure: tracking a path over time (for example "ID #7 · loitering 4m12s"). Three stages: tracking (DeepSORT), pose (joints + classifier), intent (loiter / fall / fight).

Tracking (DeepSORT and friends) follows each person across frames — perfect for loitering and store flow analysis. Pose estimation maps the body's joints, and feeding that to a classifier catches fights, violence, or — more usefully — a person falling. Detect more than one object class at once and "person + bag, then person leaves" becomes unattended-luggage detection. The hardware bump over level 2 is real but modest.
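Once a tracker assigns stable IDs, loitering reduces to measuring how long each ID stays in a zone. Here is a minimal sketch. The observation format (track ID, timestamp, in-zone flag) is an assumption: a tracker such as DeepSORT gives you IDs and boxes, and the zone test would be a separate step. The 180-second limit and 2-second gap tolerance are illustrative.

```python
from collections import defaultdict

def loiterers(observations, max_dwell_s=180.0, max_gap_s=2.0):
    """observations: (track_id, timestamp_s, in_zone) tuples, sorted by time.

    Returns {track_id: longest continuous dwell in seconds} for tracks over the limit.
    """
    start = {}      # track_id -> time the current stay in the zone began
    last_seen = {}  # track_id -> last time the track was seen in the zone
    longest = defaultdict(float)
    for track_id, t, in_zone in observations:
        if not in_zone:
            start.pop(track_id, None)
            last_seen.pop(track_id, None)
            continue
        # Start a new stay if the track is new or was lost for too long.
        if track_id not in start or t - last_seen[track_id] > max_gap_s:
            start[track_id] = t
        last_seen[track_id] = t
        longest[track_id] = max(longest[track_id], t - start[track_id])
    return {tid: dwell for tid, dwell in longest.items() if dwell >= max_dwell_s}

# Track 7 stands in the zone for 4m12s; track 3 walks through in 15 seconds.
observations = sorted(
    [(7, float(t), True) for t in range(253)]
    + [(3, float(t), True) for t in range(10, 26)],
    key=lambda obs: obs[1],
)
flagged = loiterers(observations)
print(flagged)
```

Only track 7 is flagged. The person who simply walked through never becomes an alert, which is where the drop from 50 to about 10 comes from.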

Open source

DeepSORT

Multi-object tracking — turn detections into trajectories for loitering and flow analysis.

Pose estimation + classifier

Joint detection feeding a small model — fight, fall, and violence detection.

Commercial

Genetec-style VMS

Subscription analytics: trespass, forklift near-miss, PPE violations, slip-and-fall.

Spot AI-style platforms

Hosted behaviour analytics — no on-prem servers, plug-and-play.

Alerts per day: 50 → 10. Behaviour analysis keeps only the alerts that look like genuine loitering or intent, so the warehouse's 50 a day becomes about 10.

Notice what we haven't used yet

Levels 2 and 3 are unmistakably AI — and not a single LLM in sight. Read the news and you'd think AI is large language models. But everything so far — object detection, tracking, pose, behaviour — was solved without them. That matters, because the next level is where LLMs finally earn their (considerable) keep.


Level 4 · Conversational AI

Core tech: Vision-language models

Level 4 is conversational. You query your footage the same way you'd ask ChatGPT or Claude a question — in plain language. The interesting things you want to know are rarely the ones you set up a detector for in advance, and this is what finally lets you go looking for them.

Q: Did anyone hang around the loading bay after 6pm last week?

A: Two instances. Tuesday 18:42: a person waited 6 minutes by the side door, no vehicle. Thursday 19:05: same individual, returned and looked into the bay.

The catch: this takes a lot more horsepower. Even the smaller vision-language models need far beefier hardware than anything at levels 1–3, which means real money. The pay-off isn't a smaller alert count — it's open-ended exploration. Instead of waiting for a theft, you can ask: "is anyone behaving like they're casing the place?" and go looking before it happens.
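To get a feel for why level 4 costs more, it helps to count frames. In the rough sketch below, the 14 cameras and 30 days come from the Birmingham example. The sampling intervals and the `frames_per_second_per_gpu` throughput are placeholder assumptions, not benchmarks of VideoLLaMA3 or any other model.

```python
SECONDS_PER_DAY = 24 * 3600

def frames_to_process(cameras, days, seconds_between_samples):
    """How many frames a model would see if it sampled every camera at a fixed interval."""
    return cameras * days * SECONDS_PER_DAY // seconds_between_samples

def gpu_hours(frames, frames_per_second_per_gpu):
    """Processing time on one GPU at an assumed (hypothetical) throughput."""
    return frames / frames_per_second_per_gpu / 3600

every_second = frames_to_process(cameras=14, days=30, seconds_between_samples=1)
every_five = frames_to_process(cameras=14, days=30, seconds_between_samples=5)

# Hypothetical throughput: 10 frames per second on a single GPU.
hours_every_second = gpu_hours(every_second, frames_per_second_per_gpu=10)
hours_every_five = gpu_hours(every_five, frames_per_second_per_gpu=10)

print(f"{every_second:,} frames -> {hours_every_second:,.0f} GPU-hours")
print(f"{every_five:,} frames -> {hours_every_five:,.1f} GPU-hours")
```

Under these assumptions, a month of footage is tens of millions of frames even at one frame per second. That is why the sampling rate, and which footage gets sent to the model at all, matter as much as the choice of model.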

Open source

VideoLLaMA3

Open vision-language model for video — the kind of backbone we reach for when a client needs natural-language search over their own footage.

Commercial

Coram AI

Natural-language video search as a managed product.

Cortex-style platforms

Ask-your-footage search, hosted — convenience over control.

This isn't about pushing the alert count lower. It's about asking questions you never built a detector for — and catching the thing before it becomes a 4-hour scrub.

So — which level are you at?

The whole journey, in one view. Each step is a real decision, not a forced upgrade.

Level 1 · Basic motion: 10,000
Level 1 · Smart motion: 2,000
Level 2 · Object detection: 250
Level 2 · Face filtering: 50
Level 3 · Behaviour analysis: 10
Level 4 · Conversational AI: ask

The Birmingham warehouse went from 10,000 meaningless detections a day to 10 worth looking at — and with level 4, to spotting the planning before the theft ever happened. Once you know which level you're standing on and the problem you're actually trying to solve, the next step gets obvious.
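As a quick check on the numbers, the funnel can be worked through step by step. The alert counts are the ones quoted above. The code is just arithmetic on them.

```python
stages = [
    ("Basic motion", 10_000),
    ("Smart motion", 2_000),
    ("Object detection", 250),
    ("Face filtering", 50),
    ("Behaviour analysis", 10),
]

# Percentage of alerts removed by each step, relative to the step before it.
reductions = {
    name: 100 * (before - after) / before
    for (_, before), (name, after) in zip(stages, stages[1:])
}
overall = 100 * (stages[0][1] - stages[-1][1]) / stages[0][1]

for name, pct in reductions.items():
    print(f"{name}: removes {pct:.1f}% of the previous step's alerts")
print(f"Overall: {overall:.1f}% fewer alerts")
```

Each step removes 80% or more of what reaches it, and together they cut the daily count by 99.9%.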
