In the LLM era, detecting an object in an image isn't impressive. Any capable model does it. But doing that same thing in real time — across over a thousand cameras simultaneously, running 24/7 in live US high schools and hospitals — that's a different problem entirely.
This is the story of how we built that system for Angel Protection: the three decisions that shaped the architecture, and why the obvious answer was wrong every time.
Prefer to watch? The full video walkthrough is above. The post goes deeper on implementation detail.
First: why not just use an LLM?
LLMs handle images now, so the question is legitimate. Here's the answer in two numbers.
Running ten cameras at 2 frames per second costs roughly $2,000 every two hours with a frontier LLM. That same $2,000 buys the hardware that runs a purpose-built detection model on those same ten cameras for years. LLMs are general-purpose tools. This is not a general-purpose problem.
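For a sense of where the $2,000 figure comes from, the arithmetic fits in a few lines of Python. The per-frame price here is an illustrative assumption, not a quoted API rate:

```python
# Back-of-envelope cost of LLM-based detection at the scale described above.
# PRICE_PER_FRAME_USD is an assumed figure, not a quoted API rate.

CAMERAS = 10
FPS = 2
PRICE_PER_FRAME_USD = 0.014  # assumed cost of one vision-LLM call per frame

frames = CAMERAS * FPS * 3600 * 2   # frames generated in a two-hour window
print(f"{frames:,} frames -> ${frames * PRICE_PER_FRAME_USD:,.0f} per two hours")
# 144,000 frames -> $2,016 per two hours
```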
Decision 1: Edge over Cloud
Angel Protection's instinct was already correct. The numbers confirm why.
When Angel Protection came to us, they had a system in mind: cameras at each location feed a local server, the server runs the model, and detections go up to the cloud for 911 alerts. Their question was whether that held at scale, or whether cloud processing made more sense.
The cloud had a genuine case — elastic scaling, less on-site hardware, model updates pushed without touching physical machines. For a lot of systems, cloud is the right call. We worked through the numbers to find out whether this was one of them.
Four things mattered: bandwidth, latency, privacy, and cost at scale.
The number that settled it
One 1080p frame is 1.9 MB. A hundred cameras at 2 fps — that's 380 MB/s continuously, from every location. Schools and hospitals don't have that kind of uplink, and where they do, you're competing with everything else on the network. On the edge, the only thing that ever leaves the building is a timestamp and a cropped image.
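The same arithmetic, in sketch form, using the figures above. The alert-payload size on the edge side is an assumed value, included for contrast:

```python
# Sustained uplink needed if every frame goes to the cloud, per the figures above.
FRAME_MB = 1.9          # one 1080p frame
CAMERAS, FPS = 100, 2

print(f"Cloud: {FRAME_MB * CAMERAS * FPS:.0f} MB/s per location")  # 380 MB/s, nonstop

# Edge: a timestamp and a cropped image leave the building per detection,
# not per frame. The crop size is an assumed illustrative value.
CROP_MB = 0.15
print(f"Edge: ~{CROP_MB} MB per alert")
```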
"Latency and privacy just confirmed what the bandwidth number had already decided. Cloud inference adds a full network round-trip on top of model processing time — when you're detecting a weapon in a school corridor, that round-trip costs real seconds."
Decision 2: Distributed over Monolithic
Edge was settled. The first version of the architecture had a problem we didn't expect.
Version 1 was one machine per location — CPU and GPU in the same box, handling everything: decoding the RTSP streams, running motion detection as a pre-filter, then running the model on frames that had actual movement.
Motion detection was a smart first optimisation. Most of the time a corridor is empty, so most frames never reach the model. But running that check across thousands of camera feeds every second is constant CPU work — and decoding raw compressed video streams sits on top of that.
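A motion pre-filter of this kind can be as simple as frame differencing in OpenCV. This is a minimal sketch, not the production code, and both thresholds are assumed values:

```python
import cv2

PIXEL_DELTA = 25          # per-pixel intensity change that counts as "changed" (assumed)
MIN_CHANGED_PIXELS = 500  # whole-frame sensitivity cutoff (assumed)

def has_motion(prev_gray, frame):
    """Return (grayscale of this frame, whether it differs enough from the last)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)    # suppress sensor noise
    if prev_gray is None:
        return gray, False                         # first frame: nothing to compare
    diff = cv2.absdiff(prev_gray, gray)
    _, mask = cv2.threshold(diff, PIXEL_DELTA, 255, cv2.THRESH_BINARY)
    return gray, cv2.countNonZero(mask) > MIN_CHANGED_PIXELS
```

Only frames where the check fires move on to the model; everything else dies on the CPU — which is exactly why the CPU never gets a break.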
Version 1, resource utilisation: stream decoding plus motion detection meant the CPU never got a break, while the GPU — the most expensive component — sat idle, waiting for frames.
The CPU was always near 100%. The GPU — the part actually running the model — was sitting below 30%. The most expensive component in the whole setup was basically idle. We were paying GPU-machine prices and the GPU was barely working. At this scale, that makes the whole project much more expensive than it needs to be.
We were already running high-spec CPUs — scaling up wasn't the answer. So we changed the shape of the system entirely.
Architecture, before and after: v1 was monolithic (one machine does everything); v2 is distributed (the work is split across machines, with only alerts going up to the cloud).
The result of splitting CPU and GPU work: 1/10th the hardware cost.
Not from a better model or smarter cloud setup — just from deciding that decode and inference shouldn't live on the same machine.
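The shape of the split, compressed into one process for illustration. A real deployment puts a network queue between the decode nodes and the GPU node, and the model here is a stand-in:

```python
import queue
import threading
import time
import torch

frame_q: queue.Queue = queue.Queue(maxsize=256)   # backpressure if the GPU falls behind

def decode_worker(frames) -> None:
    """CPU side: push frames that already passed the motion filter."""
    for frame in frames:
        frame_q.put(frame)

def gpu_worker(model: torch.nn.Module, batch_size: int = 32) -> None:
    """GPU side: wait for one frame, then drain whatever else is queued."""
    while True:
        batch = [frame_q.get()]
        while len(batch) < batch_size:
            try:
                batch.append(frame_q.get_nowait())
            except queue.Empty:
                break
        with torch.no_grad():
            model(torch.stack(batch))              # one forward pass per batch
        print(f"inferred on a batch of {len(batch)}")

# Dummy wiring: random frames and an identity "model" stand in for the real thing.
threading.Thread(target=gpu_worker, args=(torch.nn.Identity(),), daemon=True).start()
decode_worker(torch.rand(3, 640, 640) for _ in range(100))
time.sleep(1)   # let the worker drain before the sketch exits
```

Batching is the point: the GPU does one forward pass per queue drain instead of idling between single frames.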
Decision 3: Fine-tuned over Pre-trained
A brilliant architecture means nothing if the model gets it wrong. This is the part most people underestimate.
The model receives an image and tells you whether any target objects are present. The assumption many projects make: grab something pre-trained on a standard dataset — COCO, Open Images — point it at the feed, and you're done.
We found two specific gaps that ruled that out for this deployment.
Gap 1 — Perspective
CCTV cameras look down. Datasets don't.
Models trained on press photos have never seen a weapon from above at corridor distance. The perspective gap is total.
Gap 2 — Object size
At camera distance, a weapon is ~30 pixels wide.
Detecting something 30 pixels wide under real low-light conditions is a task standard models were never optimised for.
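To see why the number is that small, pinhole-camera arithmetic is enough. The field of view, object size, and distance below are assumed illustrative values, not measurements from the deployment:

```python
import math

WIDTH_PX = 1920     # 1080p frame width
HFOV_DEG = 90       # typical wide CCTV lens (assumed)
OBJECT_M = 0.20     # handgun-scale object (assumed)
DISTANCE_M = 6.0    # corridor-scale viewing distance (assumed)

# Pinhole model: apparent size = focal length (in pixels) * size / distance.
focal_px = (WIDTH_PX / 2) / math.tan(math.radians(HFOV_DEG / 2))
print(f"{focal_px * OBJECT_M / DISTANCE_M:.0f} px wide")   # ~32 px
```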
So we fine-tuned. Fine-tuning means you take a model already trained on a large dataset — which gives it a general understanding of the visual world — and then continue training it on your specific domain data. The base model already knows what a weapon looks like in general. You're teaching it what one looks like specifically in your deployment conditions.
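In PyTorch terms, the idea looks roughly like this. `load_pretrained_detector` and `CCTVDataset` are hypothetical stand-ins, not the production training code:

```python
import torch
from torch.utils.data import DataLoader

model = load_pretrained_detector("base_weights.pt")   # hypothetical: loads pretrained weights
loader = DataLoader(CCTVDataset("data/train"),        # hypothetical: domain-specific images
                    batch_size=16, shuffle=True)

# Low learning rate: adapt the model to the new domain without erasing
# the general visual understanding it arrived with.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(50):
    for images, targets in loader:
        loss = model(images, targets)   # detectors commonly return the loss in train mode
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```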
We built our own dataset: CCTV-angle images, sourced from other datasets, search engines, social media, and news footage. Around 1,500 images per class is a good starting point — but quality matters more than quantity.
Annotation quality is the overlooked multiplier
A box that hugs the object teaches the model the exact shape. A box that includes background teaches it noise.
Tight, accurate annotations on 1,500 images will outperform sloppy annotations on 5,000. The model learns exactly what you draw. This is where most DIY fine-tuning efforts fail.
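Since the model learns exactly what you draw, it pays to audit the labels before training. A small sketch, assuming the common YOLO text format (`class cx cy w h`, normalised to 0–1); the looseness threshold is an arbitrary assumption:

```python
from pathlib import Path

def audit_labels(label_dir: str, max_rel_area: float = 0.5) -> None:
    """Flag out-of-range coordinates and suspiciously loose boxes."""
    for path in sorted(Path(label_dir).glob("*.txt")):
        for n, line in enumerate(path.read_text().splitlines(), start=1):
            _cls, *coords = line.split()
            cx, cy, w, h = map(float, coords)
            if not all(0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
                print(f"{path.name}:{n}: coordinates out of range")
            elif w * h > max_rel_area:
                print(f"{path.name}:{n}: box covers {w * h:.0%} of the frame")
```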
"We went deep enough into this that we ended up implementing YOLOv10 from scratch in PyTorch — full control over the architecture, no licensing constraints, optimised for the exact conditions this system runs in."
Results
The system is live. What follows comes straight from production.
"The architecture developed delivered outstanding threat detection accuracy while significantly reducing costs. Any company looking to push the boundaries of what's possible with AI would be fortunate to work with them."
Open Source: AngelCV
A side effect of going deep enough to build from scratch.
We implemented YOLOv10 in PyTorch from scratch — full architecture, no third-party weight dependencies, no AGPL licensing constraints. The client let us open-source it. It's called AngelCV.
AngelCV in brief: YOLOv10, built in PyTorch. Simple interface, sensible defaults — works out of the box. Apache 2.0 — use it commercially, free.
The code is on GitHub. The open source release was a side effect. The actual goal was a model that works at 2am in a school corridor, picking up something the size of a postage stamp in a grainy top-down feed. And it does.
The three decisions — summed up
Edge over Cloud
Bandwidth alone ruled cloud out. Latency and privacy confirmed it. Only alerts leave the building.
Distributed over Monolithic
CPU was the bottleneck. Split decode and inference across cheap commodity nodes → GPU runs at capacity → 1/10th the hardware cost.
Fine-tuned over Pre-trained
Public datasets don't have top-down CCTV perspective or sub-50px weapon images. Domain-specific data + tight annotations changed accuracy entirely.
None of those were the default answer. They only make sense when you look closely at the actual constraints of the deployment. That's what bespoke design gives you — not just better numbers, but a system that couldn't have been built any other way.
Gradient Insight
Cameras deployed and not doing enough?
Security, manufacturing, logistics: if your cameras are already running and you're not extracting value from what they see, this is the kind of project we take on. Free discovery call, no commitment.
