Alert Tuning Best Practices for Security Operations (SOC)

Grant Oviatt
May 13, 2024

As security practitioners, we find it harder than ever to keep up with threat activity (if you’d like to commiserate, check out our previous blog detailing Security Operations Center (SOC) challenges). In an ideal world, your team would have time to investigate every alert, check the applicability of every vulnerability, and exercise and test your incident response playbooks on a regular basis. In reality, it’s becoming nearly impossible for teams to do even one of these things extremely well.

More often than not, organizations are tuning their alerts to fit the size of their security team, accepting the risk of disabling or ignoring some subset of detection signals. Some level of tuning unquestionably needs to happen, but how do you make those decisions while minimizing “over-tuning”?

In this blog post, we will discuss some of the alert tuning best practices SOC teams should implement. These recommendations are inspired in part by the 2011 film Moneyball, which tells the true story of how Oakland Athletics general manager Billy Beane (played by Brad Pitt) was forced to innovate on roster selection because he lacked the resources to bring on new players. The parallels for global SOC teams today are clear, so we’re covering a simple, repeatable strategy you can employ to make systematic improvements to your detection logic without incurring unnecessary risk.

At a high level:

  1. Take notes from Billy Beane in Moneyball – use historical data to drive decision making around your alert adjustments rather than operating off of “gut feel”.
  2. Focus your tuning efforts on alerts with low (or no) True Positive rates that consume large amounts of team time. These are the alert fatigue factories that create risk, drain analyst energy, and produce little to no results.
  3. Make every alert work for you. Define clear requirements for detections to take the thought exercise out of what alerts should and shouldn’t make it to production.

Step 1: Understand your alert universe

The first step to making better alert tuning decisions is understanding what’s consuming your team’s time and mental capacity, and weighing that against detection accuracy. At a high level, you’re trying to minimize the amount of time spent on detections that aren’t providing direct value to the team.

Collect metadata around your alerts that have been dispositioned over the last 90 days. This may be stored within a SIEM, case management solution, or individual security tools. 

Specifically:

  1. Alert Name - What’s the name of the alert?
  2. Alert Source - What tool or data source was used to generate this activity?
  3. Alert Count - How many times did an alert fire?
  4. Total Time Investigated - What’s the sum of time spent (from when the alert was acknowledged to resolved) on this alert?
  5. Median Investigation Time - What’s the median time to investigate this alert? (Median metrics by nature are more resilient to outliers and will provide a more accurate picture of cognitive load)
  6. Efficacy - What percentage of these alerts were true positives? (True Positive / Total Alert Count)

Understand that these 6 fields are just starting dimensions you might expand upon, and it may not be easy to collect all of these metrics. As an example, gathering “Total Time Investigated” might be challenging for small teams that don’t retain a record of when triage started on an alert. In that case, you might use the time elapsed between “Alert Created” and “Alert Closed”, which most teams should be able to derive, to achieve a similar effect – with the caveat that your data will be skewed by Dwell Time.
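
If your case management tool or SIEM can export dispositions, a few lines of pandas can derive these dimensions. The sketch below assumes a hypothetical CSV export with alert_name, alert_source, disposition, acknowledged_at, and resolved_at columns; adjust the names to match whatever your tooling actually provides.

```python
# Sketch: summarize 90 days of dispositioned alerts from a hypothetical CSV export.
import pandas as pd

alerts = pd.read_csv(
    "alert_dispositions_90d.csv",
    parse_dates=["acknowledged_at", "resolved_at"],
)

# Minutes spent per alert, from acknowledgement to resolution.
alerts["investigation_minutes"] = (
    alerts["resolved_at"] - alerts["acknowledged_at"]
).dt.total_seconds() / 60

summary = alerts.groupby(["alert_name", "alert_source"]).agg(
    alert_count=("alert_name", "size"),
    total_time_investigated=("investigation_minutes", "sum"),
    median_investigation_time=("investigation_minutes", "median"),
    efficacy=("disposition", lambda d: (d == "true_positive").mean()),
).reset_index()

print(summary.sort_values("total_time_investigated", ascending=False))
```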

Step 2: Prioritize tuning actions based on alert analysis

Patterns in this dataset become more apparent when you plot the records on a chart. For this framework, I recommend using “Efficacy” as the Y-axis, “Total Time Investigated” as the X-axis, and scaling the data point size based on “Alert Count”.

In the event you don’t have your own data visualization tool like a SIEM, we’ve put together a simple Python script to produce a chart like the one below.
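
Here’s a rough sketch of what such a script could look like, assuming the summary DataFrame built in Step 1 (same hypothetical column names):

```python
# Sketch: bubble chart of efficacy vs. total investigative time, sized by alert count.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(
    summary["total_time_investigated"],   # X-axis: cumulative analyst time (minutes)
    summary["efficacy"] * 100,            # Y-axis: true positive rate (%)
    s=summary["alert_count"] * 10,        # bubble size scaled by alert count
    alpha=0.6,
)
for _, row in summary.iterrows():
    ax.annotate(
        row["alert_name"],
        (row["total_time_investigated"], row["efficacy"] * 100),
        fontsize=8,
    )
ax.set_xlabel("Total Time Investigated (minutes)")
ax.set_ylabel("Efficacy (% true positive)")
ax.set_title("Alert efficacy vs. investigative time (bubble size = alert count)")
plt.tight_layout()
plt.show()
```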

After reviewing the visualization, you’ll notice a few patterns start to emerge:

  1. The top half of the chart holds your “winners” – detections that successfully identified a threat. Alerts in the upper right quadrant typically represent long-standing incidents that required extensive scoping following a malicious determination, or they might signal training opportunities if the “median investigation time” for a data point is significantly higher than for others. High “median investigation times” are good indicators of cognitive load, and you may be able to improve the outcome of a detection by drafting more complete playbooks or investigative steps to better support other analysts. As a best practice, rule authors should include next investigative steps in any custom detection your team builds to reduce this problem.

  2. The bottom half of the chart holds your “losers” – detections that rarely identified a threat. Alerts in the lower right quadrant are your most impactful false positive tuning opportunities. They combine a dangerous recipe for alert fatigue: high median investigation time (cognitive load), low efficacy, and high total investigative time spent. Prioritize these.

  3. The left side of the chart often highlights areas where simple automation may be able to ease workload. Analysts can already make an accurate decision quickly based on the information at hand, so a few programmatically added steps may significantly reduce your response time (and eliminate manual review).

For tuning purposes, prioritize the top few alerts in the bottom right quadrant – especially those with high median investigation times – and proceed to the next step.
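
Continuing the earlier sketch, one way to operationalize this prioritization is to bucket each alert into a quadrant and sort the lower right bucket by cognitive load. The dataset medians are used as cut lines here purely as an assumption; pick thresholds that fit your data.

```python
# Sketch: assign each alert to a quadrant and surface the highest-impact tuning candidates.
efficacy_cut = summary["efficacy"].median()
time_cut = summary["total_time_investigated"].median()

def quadrant(row):
    top = row["efficacy"] >= efficacy_cut
    right = row["total_time_investigated"] >= time_cut
    if top and right:
        return "upper-right: review playbooks / training"
    if top:
        return "upper-left: healthy"
    if right:
        return "lower-right: tune first"
    return "lower-left: automation candidate"

summary["quadrant"] = summary.apply(quadrant, axis=1)

# Lower right quadrant, heaviest cognitive load first.
priorities = summary[summary["quadrant"].str.startswith("lower-right")]
print(priorities.sort_values("median_investigation_time", ascending=False))
```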

Step 3: Take action with second-order questions

Visualizing your alert data highlights the sore spots that impact daily analysis, but it won’t clearly articulate what action should be taken next.

For next steps, you should investigate the individual alerts composing a data point to deepen your understanding of the issue. Each alert may have its nuances, but ask these questions consistently:

  1. Has there ever been a true positive for this alert?
    This should give you a sense of what yield you’re getting for the time spent investigating.

  2. Can 90% or more of the alert volume be eliminated with a simple adjustment of rule logic?
    This should cover most of your “runaway” alert scenarios, where a new or faulty rule keeps triggering on a small subset of routine actions in your environment.

  3. Is this detection uniquely capable of identifying a threat?
    It’s important to remember that threats don’t happen in isolation. Threat actors must perform a series of actions to infiltrate your environment and achieve their objectives. As an analyst, you’d ideally want to disrupt an attack as early in the kill chain as possible, but that’s often not feasible given how much normal activity resembles the early stages of an attack. Consider whether other detection signals would identify this type of threat during this part of the attack lifecycle or shortly thereafter.

If the answer is “no” to all of these, disable the alert. If the answer is “yes” to #2 or #3, I would opt to tune the alert logic and review whether the change was effective at the next tuning review. Ideally for #2, the goal would be to tune the alert so it shifts into the lower left quadrant, where it becomes an automation candidate.
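
As a minimal sketch, that decision flow might look like the function below; how you answer the three questions (from your data or by hand) is up to you, and the fallback for alerts that only pass question #1 is an assumption.

```python
# Sketch: map the three second-order questions to a tuning action.
def tuning_action(has_true_positive: bool,
                  simple_logic_fix: bool,
                  uniquely_capable: bool) -> str:
    if not (has_true_positive or simple_logic_fix or uniquely_capable):
        return "disable (demote to Informational severity for audit purposes)"
    if simple_logic_fix or uniquely_capable:
        return "tune the rule logic and re-check at the next tuning review"
    # Assumption: alerts that have produced true positives but offer no easy
    # fix and no unique coverage are kept and revisited in the next cycle.
    return "keep as-is and revisit during the next 90-day analysis"

print(tuning_action(has_true_positive=False,
                    simple_logic_fix=True,
                    uniquely_capable=False))
```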

Record your decisions in a standing team document to codify them. When possible, it’s best practice to shift disabled alerts to an “Informational” severity so they remain available for audit purposes.

When auditing your tuning, look at the total alert count for a rule over a 6 month period. A significant drop in alert volume (down to single-digit counts) might indicate vendor logic changes and warrant revisiting whether the rule belongs back in production.
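
A minimal sketch of that audit, assuming another hypothetical CSV export covering the trailing six months:

```python
# Sketch: flag rules whose volume has collapsed to single digits over six months.
import pandas as pd

audit = pd.read_csv("alert_dispositions_6m.csv")   # hypothetical export with an alert_name column
counts = audit.groupby("alert_name").size().sort_values()

# Single-digit volume may reflect an upstream vendor logic change worth revisiting.
print(counts[counts < 10])
```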

Step 4: Decisions at scale

With an understanding of your current alert management position, you can put requirements around what a detection must look like before it reaches the team for review in production. This turns “gut feel” tuning into a system that’s documentable, scalable, and easy to implement across the team. For any new or existing detection, ask:

  1. Are we missing coverage for this MITRE Tactic / Technique?
  2. Does it meet our efficacy threshold?
  3. Does it meet our threshold for maximum cumulative time for false positives?

By answering these questions and regularly performing the tuning process listed above (or your own version of it), you can evaluate historical investigative trends, prioritize the most impactful alerts, and make decisions that ensure holistic threat coverage.
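
A minimal sketch of how those requirements could be codified as a pre-production gate; the thresholds and the covered-technique set are illustrative assumptions, not recommendations.

```python
# Sketch: acceptance gate for proposed detections based on the three questions above.
COVERED_TECHNIQUES = {"T1059", "T1566", "T1078"}  # techniques already covered (illustrative)
MIN_EXPECTED_EFFICACY = 0.10                       # minimum expected true positive rate
MAX_FP_HOURS_PER_QUARTER = 20                      # cumulative false positive time budget

def accept_detection(technique: str,
                     expected_efficacy: float,
                     expected_fp_hours: float) -> bool:
    fills_gap = technique not in COVERED_TECHNIQUES
    meets_efficacy = expected_efficacy >= MIN_EXPECTED_EFFICACY
    within_budget = expected_fp_hours <= MAX_FP_HOURS_PER_QUARTER
    # Assumption: all three criteria must hold; your policy may weight them differently.
    return fills_gap and meets_efficacy and within_budget

print(accept_detection("T1021", expected_efficacy=0.25, expected_fp_hours=8))
```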

Closing Thoughts

Managing upstream detections is one of the core ways security teams can combat alert fatigue and threat actors. Following best practices for alert tuning strikes a natural balance between over-tuning and under-tuning alert signals so that SOCs can operate effectively, audit changes, and maintain their sanity.

While today alert tuning is a necessity to manage risk (and sanity) for security operations teams, we believe in a future for analysts that eliminates the need for tedious and repetitive tasks like alert tuning. That’s why we’re building Prophet AI for Security Operations to triage and investigate every alert on your behalf and avoid managing the upstream alert problem altogether.

If you’re interested in seeing how Prophet Security can help you triage and investigate alerts 10 times faster, request early access to Prophet Security today!
