In this post, I cover:
How key areas in security, especially vulnerabilities and alerts, are incredibly false positive prone
How handling these false positive prone interrupts adds little long-term value and leads to us drowning in noise
How to get beyond this by (mostly) ignoring these interrupts and instead going upstream
False Positives
Security is a false positive problem. We are drowning in them. By false positive, I mean anything that we (or our scanners, tools, intel feeds, senior engineers, interns, LLMs, whatever) identify as a potential threat when it never actually was or was going to be. A true positive would be something we identified as a threat, and it totally was.
There are two huge sources of false positives in the security world: vulnerabilities and alerts.
Vulnerabilities
Vulnerabilities come from your first-party code, its dependencies, the third-party systems you run (your databases, other infra, hopefully not MOVEit), and the software dependencies in the containers and server OS that run your software.
Some vulnerabilities are very bad1. The point of security is to prevent them, or at least mitigate them so they can't be exploited. It's our job, and we, as security practitioners, have devoted a lot of effort (and frantic pings to engineers) to managing them. Many consider vulnerability management the most important job of security teams.
This focus usually comes without looking at the data, though. If you do, it turns out most vulnerabilities are not bad for us: while there were 26,447 identified vulnerabilities in 2023 (a huge number!), only 109 were actually exploited and ended up on CISA's Known Exploited Vulnerabilities (KEV) list.2 That's a much smaller number. Only 0.41% of vulnerabilities are actually being exploited, and this is a consistent trend. That’s a 99.59% false positive rate for your vulnerability management program. It's usually even higher: if you're a cloud-native SaaS company running on AWS, use Macs and Google, and don't run Cisco, Fortinet, or something else silly, only around 8 of those exploited vulns would apply to you. If you primarily run your own software, it’s even less: only 2% of software dependency vulns are even theoretically reachable for attack. Your best possible false positive rate is 98%, a distressingly high number, and the actual exploitable percentage will be far, far lower.
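If you want to sanity-check that arithmetic yourself, it's one division (the figures are the 2023 numbers cited above):

```python
# Rough arithmetic behind the 2023 figures cited above.
total_cves_2023 = 26_447     # vulnerabilities published in 2023
kev_additions_2023 = 109     # of those, added to CISA's KEV list

exploited_rate = kev_additions_2023 / total_cves_2023
print(f"Exploited: {exploited_rate:.2%}")                      # ~0.41%
print(f"Never exploited (so far): {1 - exploited_rate:.2%}")   # ~99.59%
```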
You may argue that many of these don't count as false positives. Perhaps this is a cockpit door problem, where vulnerabilities don't get exploited precisely because we patch them, making them less useful to try exploiting. That's probably not the case; it's safe to say that we as an industry don't patch well, and vulns are often exploited before the patch is widely applied. So we didn’t avoid these vulnerabilities because we were patching so quickly. They were just never going to be exploited.
Alerts
Here, alerts mean anything found by a detection system you may have (log management platform, security information and event management (SIEM), an intrusion detection system (IDS) like AWS GuardDuty, etc) that your team is meant to look into. You have something that is looking at actions performed by your systems or users, and alerting you when those actions match some pattern that's potentially malicious.
Common examples of what you’d alert on include:
If an employee laptop downloads a known malicious file
If an employee account downloads 500GB of customer data
If an employee account has 5 log-in failures in a row
If an employee account makes a significant production infrastructure change
These range from pretty likely to indicate something bad is afoot3, to highly likely that the employee is doing their job or just typoed their password a few times.
If you look at common examples of detection rules you should consider from the OSS project Sigma or from the SIEM vendor Panther, you will see a similar pattern: a few rules that almost certainly indicate malicious activity, and many that could be malicious activity but will almost always turn out to be benign actions by your users or employees as part of their daily work.
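To make that concrete, here is roughly what the "five log-in failures in a row" rule from the list above boils down to, as a naive Python sketch over a stream of auth events. The field names are invented for illustration; real rules in Sigma or Panther are declarative, but the logic is the same.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 5  # consecutive failures before we raise an alert

def detect_login_failures(events):
    """Yield an alert when an account fails to log in FAILURE_THRESHOLD times in a row.

    `events` is any iterable of dicts with hypothetical 'user' and 'outcome'
    fields -- adjust to whatever your log schema actually uses.
    """
    streaks = defaultdict(int)
    for event in events:
        user = event["user"]
        if event["outcome"] == "failure":
            streaks[user] += 1
            if streaks[user] == FAILURE_THRESHOLD:
                yield {"rule": "consecutive_login_failures", "user": user}
        else:
            streaks[user] = 0  # any success resets the streak

# Most of what this fires on will be someone who forgot which password they
# rotated to last week -- which is exactly the false positive problem.
```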
So, if you enable all of those detection rules, something truly horrible will happen: you'll get deluged by alerts, of which a colossal number will be false positives. It will likely make the 99.59% false positive rate for vulnerabilities look pretty damn nice. Organizations often scale up large security operations teams (or outsource them to a cheaper country or vendor) to try to handle the load, and at best they barely get by.
Drowning
In his book Upstream, Dan Heath popularizes a parable (potentially from Irving Zola) about public health:
You and a friend are having a picnic by the side of a river. Suddenly you hear a shout from the direction of the water—a child is drowning. Without thinking, you both dive in, grab the child, and swim to shore. Before you can recover, you hear another child cry for help. You and your friend jump back in the river to rescue her as well. Then another struggling child drifts into sight… and another… and another. The two of you can barely keep up. Suddenly, you see your friend wading out of the water, seeming to leave you alone. “Where are you going?” you demand. Your friend answers, “I’m going upstream to tackle the guy who’s throwing all these kids in the water.”
The lesson is that you sometimes must ignore those in grave peril to save more people in the long run. If you are a doctor working in a malaria-infested, deeply impoverished area, you can treat some number of people a day. And you want to spend all of your time helping patients, saving as many lives as you can.
But so many need treatment, far more than you can manage. To get these excess people treated, you have to spend time not treating people. Instead, you need to spend more time setting up infrastructure, finding staff, and getting money from Bill Gates to fund everything. You do this so you can treat more people over time. When there are thousands of patients, it’s more useful to build a hospital than to take appointments.
You are just as responsible for the death of the extra person you could have treated a year from now, had you built up your capacity, as you are for the death of the person you refused to treat today so you could build that hospital. How's that for an absurd trolley problem?
As in public health, we in security are also drowning in issues we need to solve, and we can't possibly solve them all. And we don't even have Bill Gates’ money to save us.
Go Upstream
For both vulnerabilities and alerts, it’s not that fixing them isn’t important. Obviously it is: the vulnerabilities that do get exploited, and the attacks that follow from them, can be crippling. The point is not that we can ignore our vulnerabilities; it’s that manually patching them is a low-value way to improve the situation.
The only way to deal with false positive problems on this scale is to go upstream. Rather than trying to manage the overwhelming flow of issues, you have to make changes to reduce the flow coming in.
System Improvement Over Burndowns
Many security teams have vulnerability burndown charts. These are like sprint burndown charts but more depressing: they show the list of vulnerabilities, usually ranked by severity, how long they've been in the system without being patched, and how the total number changes over time. Many detection teams have a giant queue of alerts they need to investigate and confirm if something bad happened or not. The goal in both cases is to get to inbox zero: no vulns, no alerts (or at least none breaking your service level objective).
Neither of these activities adds value when you're processing false positives. You are no less likely to be vulnerable, and no more likely to survive a compromise, in three months, and you didn't prevent a compromise today. Since most of the things we deal with are false positives, you're just flailing about as more and more issues pass by you.
You need to get out of the water and work to reduce the rate of issues. You need to spend (almost) all of your time on proactive, project work to slow the flow.
There are a few ways to do this.
Find Leverage Points
Leverage points are places in a system where a single change or initiative can significantly impact its output. These are the projects and initiatives that can dramatically improve the system you're trying to optimize. As Archimedes said:
Give me a lever long enough and a fulcrum on which to place it, and I shall move the world.
Getting a big enough lever is the tricky bit, though. These initiatives will dramatically reduce our risk, or at least the number of (false and very occasionally true) positives we have to look at and worry that we're missing something by ignoring them. They will depend on your particular organization, but to be more concrete, they may look like some of these:
Vulnerabilities:
Minimized containers and automated container version bumping on new releases or deployments
Automated patching tools, and useful test suites to make that tractable
Tooling to give devs information on the riskiness and toil from their dependencies (a sketch follows this list)
Phasing out old, unpatchable systems. Or anything made by Cisco, Fortinet, or Citrix.
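As a sketch of that dependency tooling, assuming Python services with pinned requirements files: pull each pin, ask PyPI's public JSON API for the latest release, and surface how far behind each service is. The report format and the idea of wiring this into CI are assumptions; the PyPI endpoint is real.

```python
import json
import urllib.request

def latest_version(package: str) -> str:
    """Look up the latest release of a package via PyPI's public JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["info"]["version"]

def staleness_report(requirements_path: str) -> list[dict]:
    """Compare pinned versions ('package==1.2.3' lines) against the latest on PyPI."""
    report = []
    for line in open(requirements_path):
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        package, pinned = line.split("==", 1)
        latest = latest_version(package)
        if pinned != latest:
            report.append({"package": package, "pinned": pinned, "latest": latest})
    return report

if __name__ == "__main__":
    # Run this in CI and post the results where the owning team will see them.
    for row in staleness_report("requirements.txt"):
        print(f"{row['package']}: pinned {row['pinned']}, latest is {row['latest']}")
```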
Alerts
Streamlining your process for triaging and handling alerts
Writing useful runbooks to ease your response efforts
Systems for alert enrichment, automatic confirmation, or automatic response (sketched below)
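Here's a minimal sketch of that enrichment and auto-confirmation idea for the "significant production infrastructure change" alert from earlier: before a human looks at it, check whether the actor had an approved change and owns the resource. The lookup tables and helper names are hypothetical stand-ins for whatever your change-management and ownership systems actually expose.

```python
# Hypothetical stand-ins for your change-management and ownership systems.
APPROVED_CHANGES = {("alice", "prod-db")}          # (actor, resource) pairs with a ticket
RESOURCE_OWNERS = {"prod-db": {"alice", "bob"}}    # resource -> people who own it

def has_approved_change(actor: str, resource: str) -> bool:
    return (actor, resource) in APPROVED_CHANGES

def owns_resource(actor: str, resource: str) -> bool:
    return actor in RESOURCE_OWNERS.get(resource, set())

def enrich_and_triage(alert: dict) -> dict:
    """Attach context to an infra-change alert and auto-close the obviously benign ones."""
    actor, resource = alert["actor"], alert["resource"]
    alert["context"] = {
        "approved_change": has_approved_change(actor, resource),
        "resource_owner": owns_resource(actor, resource),
    }
    # Routine, documented work by the owning team never reaches a human.
    alert["status"] = "auto-closed" if all(alert["context"].values()) else "needs-human"
    return alert

print(enrich_and_triage({"actor": "alice", "resource": "prod-db"}))    # auto-closed
print(enrich_and_triage({"actor": "mallory", "resource": "prod-db"}))  # needs-human
```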
With these initiatives, your issue count is (mostly) no longer a metric you actively work on. Rather, it becomes a lagging indicator you use to assess the effectiveness of your proactive initiatives. Good vulnerability management doesn’t mean your devs patch their vulns quickly. It means they never have to.
These leverage points primarily deal with reducing the total number of vulnerabilities or alerts you have to deal with. While that certainly adds value, this can only go so far. Attackers will always find novel tactics. Many vulnerabilities are exploited as zero days, which your vulnerability management can do absolutely nothing about. We can’t just spend time getting to inbox zero here.
Build Robustness
It’s even more valuable to make your system robust by doing proactive work to make exploitation less impactful, so you care less about the vulns that do exist or the alerts you miss.
Let's say you have a server-side request forgery (SSRF) vuln in an application. An SSRF is a vulnerability where an attacker can cause your app to make a request with the permissions and access your app has (usually more than the attacker has), often returning the results of that request to the attacker. This is a great way to extract data about a system; internal services are usually trusted! This is a vulnerability you can patch, and an attack vector you can write detection rules to catch.
What if, instead, you build a little robustness into your system? You disable the legacy AWS Instance Metadata Service (IMDSv1) so that an attacker can no longer extract AWS creds. You add service isolation via security groups, Kubernetes configuration, mTLS, or iptables to restrict access to anything on your network you don't intend to expose. You require auth on all your data stores. Not only are all these things good to do anyway, but if you do them, you can never be hurt by an SSRF vulnerability ever again.
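For the IMDSv1 piece, here's roughly what "disable it everywhere" looks like with boto3: a sketch that finds instances in one region still accepting IMDSv1 and requires IMDSv2 tokens on them. Assume you've tested this on a few instances first; old SDKs and agents that only speak IMDSv1 will break.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # repeat per region you use

# Find instances still allowing IMDSv1 (HttpTokens == "optional") and require IMDSv2.
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            options = instance.get("MetadataOptions", {})
            if options.get("HttpTokens") != "required":
                print(f"Requiring IMDSv2 on {instance['InstanceId']}")
                ec2.modify_instance_metadata_options(
                    InstanceId=instance["InstanceId"],
                    HttpTokens="required",   # IMDSv1 requests will now fail
                    HttpEndpoint="enabled",  # keep the metadata service itself on
                )
```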
This is a much better state than having merely patched all of your SSRF vulns! You're now protected from the vulns you don't know about yet. If you want, it lets you ignore all future SSRF vulns, or at least significantly reduce their priority on your to-do list. That's a much more zen place to be.
In the same way, it's far better to add an SCP (service control policy) to your AWS accounts to prevent some malicious action than to alert on it. Never alert on something you can just as easily block.
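As one concrete example of blocking instead of alerting, here's a minimal SCP, built in Python as the JSON policy document it ultimately is, that denies stopping or deleting CloudTrail trails across the organization. The specific actions are just a common illustration; pick whatever you would otherwise be writing detections for.

```python
import json

# Deny turning off CloudTrail org-wide, rather than alerting when someone does.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDisablingCloudTrail",
            "Effect": "Deny",
            "Action": [
                "cloudtrail:StopLogging",
                "cloudtrail:DeleteTrail",
            ],
            "Resource": "*",
        }
    ],
}

# Attach via the AWS Organizations console, boto3, or your infrastructure-as-code.
print(json.dumps(scp, indent=2))
```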
There are many vulnerability and threat classes like this, a lot of them solved by the same sort of proactive interventions. An ounce of prevention is worth a pound of cure, and that's where you should spend most of your resources.
Ignore the Fires
We all know the meme:
What the meme doesn't show is the only way to deal with this problem: you have to ignore most of the fires and prioritize. There is always going to be more than you can possibly deal with, and you need to focus on the hottest fires. Here are some ways to do this:
Reduce scope. It is common to focus only on the critical and high-severity stuff and ignore all low and medium-severity issues. Having a threshold is a good idea; you should probably just keep a lot of the highs in it too, until you’ve reduced the flow.
Aggressively timebox. Cap the time you spend on false-positive-prone interrupts. Severely. Ignore everything once you hit that timebox for the day or week, and turn off the least important sources of issues until you’re meeting your timebox again.
If you're making a change to focus on your problems upstream, the extent to which you filter your interrupts is going to feel uncomfortable. It should! These are legitimate fires you're ignoring. But it's okay: you're only ignoring them temporarily. You'll come back to them when you can actually put them out.
No, Not Vendors
If you follow the industry, you'll know how many vendors exist to prioritize these issues for you and help sort out the false positives. Shouldn't getting Wiz, Orca, or Dazz be a key step in managing this problem?
I don't find that vendor magic helps much here, for a few reasons:
Liability avoidance. The prioritization that vendors sell isn’t meant to make your security program optimal. It’s meant to let the vendor say they help while avoiding liability if you are breached (or at least make sure you won’t be super pissed at them). So they stick to some external standard of priority or otherwise grossly overestimate severity. Better (for them) to be cautious with their prioritizations, even if it makes you less secure overall by wasting your time on false positives.
Lack of context. Their advice won't account for the details of your environment, so it won't be well prioritized for you anyway. If you want some external prioritization to help you out, "patch everything that ends up on the KEV list" is basically as good as anything a vendor will sell you, and you can script that check yourself (see the sketch after this list).
Cost. You're still spending resources; usually lots of them. These vendors tend to be very expensive, 6 or 7 or 8 figures, depending on your size and your skill at negotiating. The money you spend on them is better spent on reducing flows of issues, not prioritizing them, and there are few situations where vendors are the ideal solution for improving those flows.
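The KEV catalog is just a public JSON feed, so the "patch everything on KEV" check is a few lines of Python. This sketch assumes you can already dump the set of CVE IDs your scanner has found; the feed URL is the one CISA publishes as of this writing and may change.

```python
import json
import urllib.request

# CISA publishes the KEV catalog as a JSON feed (URL current as of writing).
KEV_URL = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"

def kev_cve_ids() -> set[str]:
    """Fetch the set of CVE IDs currently on the KEV list."""
    with urllib.request.urlopen(KEV_URL) as resp:
        catalog = json.load(resp)
    return {entry["cveID"] for entry in catalog["vulnerabilities"]}

def actually_urgent(my_cves: set[str]) -> set[str]:
    """Return only the findings that are known to be exploited in the wild."""
    return my_cves & kev_cve_ids()

# `my_cves` is a stand-in for whatever your scanner exports.
my_cves = {"CVE-2021-44228", "CVE-2023-0000"}  # Log4Shell plus a placeholder
print(actually_urgent(my_cves))  # patch these first; deprioritize the rest
```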
Actionable Advice
Stop drowning. Whatever your awful flow of pain is, hard timebox the time you spend on it, below a level that feels comfortable for you. You have my permission to stop being stressed about what you have to ignore, if that helps at all.
Spend the rest of your time going upstream, doing proactive things to reduce the flow of issues. You have to actually do this! You don’t have my permission to ignore your issues completely! This is meant to better direct your efforts, not to be an excuse to do nothing at all about your vulnerability load.
Spend that proactive time finding leverage points to significantly reduce the flow of issues for you to deal with. Or spend it making your system secure such that you care less about the issues. Something productive!
Citation Needed
There are those who will gripe about using KEV as a ground-truth list of exploited vulnerabilities. The argument goes that KEV only captures the obvious vulnerabilities affecting organizations that are both mature enough and open enough about breaches to report them, so it will miss less obvious vulns and anything where the org avoided disclosing the breach. Those are fair points, but as far as I know, KEV is the best we have, so we should use it. The big security vendors like CrowdStrike, Palo Alto Networks, Cisco Talos, etc. report to CISA, so their data is contained in KEV. There are plenty of other reports that claim more vulns are being exploited, but they usually come from vulnerability management vendors with a vested interest in making you more concerned, and to do that they tend to inflate the numbers by counting vulns that merely had scanning activity detected against honeypots, and that scanning is very often just security researchers finding things to write blogs about.
Though not always; there are whole Justice Department lawsuits over the fact that IT orgs don’t understand that malware researchers need to have malware on their computers.