Computer Vision · Edge AI · Security AI · Case Study

How We Deployed Real-Time Weapon Detection Across 1,000+ CCTV Cameras

Three architecture decisions — edge over cloud, distributed over monolithic, fine-tuned over pre-trained — cut hardware costs by 90% and put a live security system into US schools and hospitals. Here's exactly how we built it.

14 min read
By Iu Ayala, Gradient Insight
[Image: AI weapon detection with bounding boxes on a school security camera feed]

97.3% detection accuracy, with near-zero false positives
90% hardware cost reduction vs. the initial monolithic design
1,000+ live cameras in US schools and hospitals
<1s end-to-end latency, event to alert

In the LLM era, detecting an object in an image isn't impressive. Any capable model does it. But doing that same thing in real-time — across over a thousand cameras simultaneously, running 24/7 in live US high schools and hospitals — that's a different problem entirely.

This is the story of how we built that system for Angel Protection: the three decisions that shaped the architecture, and why the obvious answer was wrong every time.


First: why not just use an LLM?

LLMs handle images now. The question is legitimate. Here's the answer in one table.

| LLM (Claude Opus) | Edge AI hardware |
| --- | --- |
| $0.014 per image | $2,000 one-time hardware cost |
| $2,000 after just 2 hours | Runs for years on the same cameras |
| Cost scales linearly with usage | Fixed cost, zero marginal cost |
| Requires cloud connectivity | Operates fully offline |

Running 10 cameras at 2 frames per second costs $2,000 every two hours with a frontier LLM. That same $2,000 buys the hardware that runs a purpose-built detection model for years across those same ten cameras. LLMs are general-purpose tools. This is not a general-purpose problem.
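The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above (price per image, camera count, frame rate, hardware cost); the function names are ours, chosen for illustration.

```python
PRICE_PER_IMAGE_USD = 0.014   # per-image LLM price quoted in the table above
CAMERAS = 10
FRAMES_PER_SECOND = 2
HARDWARE_COST_USD = 2_000     # one-time edge hardware cost from the table


def llm_cost_usd(hours: float) -> float:
    """Cost of sending every frame from every camera to the LLM for `hours`."""
    images = CAMERAS * FRAMES_PER_SECOND * 3600 * hours
    return images * PRICE_PER_IMAGE_USD


def break_even_hours() -> float:
    """Hours of LLM usage that cost as much as the edge hardware."""
    return HARDWARE_COST_USD / llm_cost_usd(1)


print(f"2 hours of LLM inference: ${llm_cost_usd(2):,.0f}")
print(f"Break-even with edge hardware: {break_even_hours():.2f} hours")
```

Ten cameras at 2 frames per second produce 144,000 images in two hours, which is where the roughly $2,000 figure comes from.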

Decision 1: Edge over Cloud

Angel Protection's instinct was already correct. Three forces confirm why.

When Angel Protection came to us, they had a system in mind: cameras at each location feeding a local server, that server runs the model, detections go up to the cloud for 911 alerts. Their question was whether that held at scale, or whether cloud processing made more sense.

The cloud had a genuine case — elastic scaling, less on-site hardware, push model updates without touching physical machines. For a lot of systems, cloud is the right call. We worked through the numbers to find out if this was one of them.

Bandwidth

Edge: Only send alerts — not footage. Bandwidth requirement disappears.
Cloud: 380 MB/s continuously from 100 cameras at 2fps. Most schools can't sustain that.

Latency

Edge: Inference runs locally. Detection triggers in milliseconds.
Cloud: Full network round-trip on top of inference time. In a school corridor, that costs real seconds.

Privacy

Edge: Raw footage never leaves the building. Only a timestamp and cropped image go anywhere.
Cloud: Raw CCTV footage of people is legally sensitive. Sending it to the cloud creates real compliance exposure.

Cost at scale

Edge: Fixed hardware cost per site. Predictable and owned.
Cloud: Elastic billing is a double-edged sword — unpredictable at 1,000+ cameras.

The number that settled it

One 1080p frame is 1.9 MB. A hundred cameras at 2fps — that's 380 MB/s continuously, from every location. Schools and hospitals don't have that kind of uplink, and where they do, you're competing with everything else on the network. On the edge, the only thing that ever leaves the building is a timestamp and a cropped image.
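As a quick check of that number, here is a minimal sketch of the uplink calculation. It uses the frame size, camera count, and frame rate quoted above, and assumes decimal units (1 MB = 8 Mbit) for the conversion to link speed; the function name is ours.

```python
FRAME_SIZE_MB = 1.9        # size of one 1080p frame, as quoted above
CAMERAS_PER_SITE = 100
FRAMES_PER_SECOND = 2


def uplink_mb_per_s(cameras: int, fps: float, frame_mb: float) -> float:
    """Sustained upload rate needed to ship every frame off-site."""
    return cameras * fps * frame_mb


cloud = uplink_mb_per_s(CAMERAS_PER_SITE, FRAMES_PER_SECOND, FRAME_SIZE_MB)
cloud_gbit = cloud * 8 / 1000  # MB/s to Gbit/s, decimal units assumed

print(f"Cloud uplink: {cloud:.0f} MB/s, about {cloud_gbit:.2f} Gbit/s")
```

Expressed as link speed, 380 MB/s is roughly 3 Gbit/s of sustained upload per site, before any other traffic on the network.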

"Latency and privacy just confirmed what the bandwidth number had already decided. Cloud inference adds a full network round-trip on top of model processing time — when you're detecting a weapon in a school corridor, that round-trip costs real seconds."

Decision 2: Distributed over Monolithic

Edge was settled. The first version of the architecture had a problem we didn't expect.

The first version was one machine per location — CPU and GPU in the same box, handling everything: decoding the RTSP streams, running motion detection as a pre-filter, then running the model on frames that had actual movement.

Motion detection was a smart first optimisation. Most of the time a corridor is empty, so most frames never reach the model. But running that check on every frame from every camera feed is constant CPU work, and decoding the compressed video streams sits on top of that.
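To make the pre-filter idea concrete, here is a minimal sketch using simple frame differencing on grayscale frames. The article does not say which motion-detection method was used, so the technique, the function names, and both thresholds are assumptions for illustration. A production system would use a vectorised library rather than a pure-Python loop.

```python
def motion_fraction(prev: bytes, cur: bytes, pixel_threshold: int = 25) -> float:
    """Fraction of pixels whose brightness changed by more than pixel_threshold."""
    if len(prev) != len(cur):
        raise ValueError("frames must have the same size")
    if not cur:
        return 0.0
    changed = sum(1 for a, b in zip(prev, cur) if abs(a - b) > pixel_threshold)
    return changed / len(cur)


def has_motion(prev: bytes, cur: bytes, area_threshold: float = 0.01) -> bool:
    """True if enough of the frame changed to be worth sending to the model."""
    return motion_fraction(prev, cur) > area_threshold


# Tiny 8x8 grayscale frames, flattened to 64 bytes each (hypothetical data).
empty = bytes([40] * 64)                   # empty corridor
noisy = bytes([42] * 64)                   # same scene with slight sensor noise
person = bytes([40] * 48 + [200] * 16)     # a bright region enters the frame

print(has_motion(empty, noisy))    # False: frame is skipped
print(has_motion(empty, person))   # True: frame goes to the model
```

Even a check this cheap touches every pixel of every frame, which is why it adds up to constant CPU load across many feeds.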

Version 1: resource utilisation

CPU (stream decoding + motion detection): ~95%. The CPU never got a break.
GPU (model inference): <30%. The most expensive component sat idle, waiting for frames.

The CPU was always near 100%. The GPU — the part actually running the model — was sitting below 30%. The most expensive component in the whole setup was basically idle. We were paying GPU-machine prices and the GPU was barely working. At this scale, that makes the whole project much more expensive than it needs to be.

We were already running high-spec CPUs — scaling up wasn't the answer. So we changed the shape of the system entirely.

Architecture: Before → After

v1, monolithic (one machine does everything): Cam 1, Cam 2, ..., Cam N → CPU + GPU box (decode + detect + infer) → cloud alerts

v2, distributed (split the work): Cam 1, Cam 2, ..., Cam N → CPU nodes A, B, ..., N (decode + motion) → motion frames → GPU node (inference only, runs at capacity) → cloud alerts

The result of splitting CPU and GPU work: 1/10th the hardware cost.

Not from a better model or smarter cloud setup — just from deciding that decode and inference shouldn't live on the same machine.
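The shape of the v2 design can be sketched as a producer-consumer pipeline. This is an illustration only, not the production code: threads and an in-process queue stand in for separate machines and the network between them, `run_inference` is a placeholder for the detector, and the batch size is an assumed parameter.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Frame:
    camera_id: int
    index: int
    has_motion: bool  # stand-in for the result of decode + motion detection


def run_inference(batch):
    """Placeholder for the detector: returns (camera_id, index) per frame."""
    return [(f.camera_id, f.index) for f in batch]


def cpu_node(frames, out: queue.Queue) -> None:
    """CPU node: handle its cameras' frames and forward only those with motion."""
    for frame in frames:
        if frame.has_motion:
            out.put(frame)


def gpu_node(inbox: queue.Queue, batch_size: int, results: list) -> None:
    """GPU node: drain the shared queue in batches and do inference only."""
    batch = []
    while True:
        item = inbox.get()
        if item is None:  # sentinel: every CPU node has finished
            break
        batch.append(item)
        if len(batch) == batch_size:
            results.append(run_inference(batch))
            batch = []
    if batch:
        results.append(run_inference(batch))


def run_site(feeds, batch_size: int = 4) -> list:
    """Run one CPU node per feed group and a single shared GPU node."""
    inbox: queue.Queue = queue.Queue()
    results: list = []
    gpu = threading.Thread(target=gpu_node, args=(inbox, batch_size, results))
    cpus = [threading.Thread(target=cpu_node, args=(feed, inbox)) for feed in feeds]
    gpu.start()
    for t in cpus:
        t.start()
    for t in cpus:
        t.join()
    inbox.put(None)
    gpu.join()
    return results


# Three hypothetical feeds of 10 frames each, with motion on every fifth frame.
feeds = [[Frame(cam, i, i % 5 == 0) for i in range(10)] for cam in range(3)]
batches = run_site(feeds)
print(sum(len(b) for b in batches), "of 30 frames reached the GPU node")
```

The point of the split is visible in the structure: CPU nodes can be added independently until the one GPU node is kept busy, instead of buying a GPU with every CPU.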

Decision 3: Fine-tuned over Pre-trained

A brilliant architecture means nothing if the model gets it wrong. This is the part most people underestimate.

The model receives an image and tells you whether any target objects are present. The assumption many projects make: grab something pre-trained on a standard dataset — COCO, Open Images — point it at the feed, and you're done.

We found two specific gaps that ruled that out for this deployment.

| Dataset | Size | Perspective | Resolution | CCTV weapons |
| --- | --- | --- | --- | --- |
| COCO | 330K images | Eye-level | High | No |
| Open Images | 9M images | Eye-level / press | High | No |
| Objects365 | 2M images | Eye-level | High | No |
| Our CCTV dataset | Custom curated | Top-down | Grainy / compressed | Yes |

GAP 1: PERSPECTIVE

CCTV cameras look down. Datasets don't.

Dataset image: eye-level, close-up, clear.
CCTV reality: top-down, distant, grainy.

Models trained on press photos have never seen a weapon from above at corridor distance. The perspective gap is total.

GAP 2: OBJECT SIZE

At camera distance, a weapon is ~30 pixels wide.

Dataset image: the object fills the frame.
CCTV reality: ~30 px in the full frame.

Detecting something 30 pixels wide under real low-light conditions is a task standard models were never optimised for.
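To see how an object ends up that small, here is a rough pinhole-camera estimate. The object size, distance, and lens angle below are assumed values picked for illustration, not measurements from the deployment, and the formula ignores lens distortion and the downward viewing angle.

```python
import math


def object_width_px(object_width_m: float, distance_m: float,
                    image_width_px: int, hfov_deg: float) -> float:
    """Approximate on-screen width of an object facing the camera (pinhole model)."""
    scene_width_m = 2 * distance_m * math.tan(math.radians(hfov_deg) / 2)
    return image_width_px * object_width_m / scene_width_m


# Assumed for illustration: a 0.2 m object, 6.4 m from a 1920-px-wide camera
# with a 90-degree horizontal field of view.
width = object_width_px(0.2, 6.4, 1920, 90)
print(f"{width:.0f} px wide, {width / 1920:.1%} of the frame width")
```

Under those assumptions the object covers about 30 pixels, under 2% of the frame width, which matches the scale described above.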

So we fine-tuned. Fine-tuning means you take a model already trained on a large dataset — which gives it a general understanding of the visual world — and then continue training it on your specific domain data. The base model already knows what a weapon looks like in general. You're teaching it what one looks like specifically in your deployment conditions.

We built our own dataset: CCTV-angle images, sourced from other datasets, search engines, social media, and news footage. Around 1,500 images per class is a good starting point — but quality matters more than quantity.
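A simple audit of per-class coverage helps when curating a dataset like this. The sketch below counts distinct images per class and reports the shortfall against the 1,500-image starting point mentioned above; the class names and the `(image_id, class_name)` record layout are hypothetical.

```python
from collections import Counter

MIN_IMAGES_PER_CLASS = 1_500  # starting point suggested above


def classes_needing_data(image_labels, minimum: int = MIN_IMAGES_PER_CLASS) -> dict:
    """Return {class_name: images_still_needed} for under-represented classes.

    image_labels is an iterable of (image_id, class_name) pairs; duplicate
    pairs are counted once so repeated annotations do not inflate the totals.
    """
    unique_pairs = set(image_labels)
    counts = Counter(class_name for _, class_name in unique_pairs)
    return {cls: minimum - n for cls, n in counts.items() if n < minimum}


# Hypothetical label records.
labels = [(i, "handgun") for i in range(1500)] + [(i, "rifle") for i in range(400)]
print(classes_needing_data(labels))
```

A count like this only covers quantity. It says nothing about whether the images match the deployment's camera angle and lighting, which is the part that matters more.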

Annotation quality is the overlooked multiplier

Tight annotation: the box hugs the object, so the model learns the exact shape.
Loose annotation: the box includes background, so the model learns noise.

Tight, accurate annotations on 1,500 images will outperform sloppy annotations on 5,000. The model learns exactly what you draw. This is where most DIY fine-tuning efforts fail.
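One way to put a number on annotation looseness is intersection over union (IoU) between a tight box and a padded one. The box coordinates below are made up for illustration, sized to match the ~30 px objects discussed earlier.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


tight = (100, 100, 130, 120)   # 30 x 20 px box hugging the object
loose = (90, 90, 140, 130)     # same object with 10 px of padding on every side

overlap = iou(tight, loose)
print(f"IoU between tight and loose box: {overlap:.2f}")
print(f"Background share of the loose box: {1 - overlap:.0%}")
```

On an object this small, ten pixels of padding means most of the labelled area is background, which is why loose boxes hurt far more at CCTV scale than on close-up dataset images.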

"We went deep enough into this that we ended up implementing YOLOv10 from scratch in PyTorch — full control over the architecture, no licensing constraints, optimised for the exact conditions this system runs in."

Results

The system is live. These are the numbers from production.

97.3% firearm detection accuracy. Validated across real-world school environments with diverse camera angles, lighting conditions, and partial occlusions, with near-zero false positives.

90% reduction in hardware costs. Splitting CPU and GPU workloads across commodity nodes eliminated the need for expensive all-in-one GPU servers at every site.

10× more camera feeds processed. The same GPU budget that previously handled a handful of streams now powers 1,000+ simultaneous RTSP feeds.

~1s end-to-end detection latency. From a weapon appearing in frame to an alert reaching operators. Motion-activated scanning eliminates wasted inference cycles.

"The architecture developed delivered outstanding threat detection accuracy while significantly reducing costs. Any company looking to push the boundaries of what's possible with AI would be fortunate to work with them."

Lewis Matthews
CEO, Angel Protection

Open Source: AngelCV

A side effect of going deep enough to build from scratch.

We implemented YOLOv10 in PyTorch from scratch — full architecture, no third-party weight dependencies, no AGPL licensing constraints. The client let us open-source it. It's called AngelCV.

Apache 2.0: free for commercial use.
YOLOv10: built in PyTorch.
Simple interface, sensible defaults: works out of the box.

AngelCV is available on GitHub.

The open source was a side effect. The actual goal was a model that works at 2am in a school corridor, picking up something the size of a postage stamp in a grainy top-down feed. And it does.

The three decisions — summed up

1. Edge over Cloud. Bandwidth alone ruled cloud out. Latency and privacy confirmed it. Only alerts leave the building.

2. Distributed over Monolithic. CPU was the bottleneck. Split decode and inference across cheap commodity nodes → GPU runs at capacity → 1/10th the hardware cost.

3. Fine-tuned over Pre-trained. Public datasets don't have top-down CCTV perspective or sub-50px weapon images. Domain-specific data + tight annotations changed accuracy entirely.

None of those were the default answer. They only make sense when you look closely at the actual constraints of the deployment. That's what bespoke design gives you — not just better numbers, but a system that couldn't have been built any other way.
