Should You Own Security Ingestion Pipelines?

Tradeoffs in security data lakes

Using a security data lake to power threat detection workloads is nothing new. Since the launch of Snowflake’s AI Data Cloud for Cybersecurity, many of our customers have chosen a security data lake architecture as the backend for their SIEMs. They adopt a connected application architecture for the cost savings, performance benefits and downstream analytic use cases, while keeping the out-of-the-box functionality that a SIEM vendor provides.

[Figure: Snowflake Connected Application Architecture. A flowchart of data sources feeding into a Snowflake-based security analytics system. Sources include cloud platforms (AWS, Google Cloud, Azure), hosts (Linux, Windows, Red Hat, SentinelOne, CrowdStrike), networks (Cisco, Palo Alto Networks, Cloudflare, Corelight, Fortinet), and applications (Okta, Zscaler, Office 365, Google Suite, Exabeam). A Connected App collects the data, then enriches, normalizes, and detects threats before sending alerts. Data is also stored.]

While all our connected application partners offer out-of-the-box detections and a familiar user interface for investigations and crafting detections, some also let you bring your own pipelines. Customers often ask if there is a “best practice” or Snowflake recommendation. In short, there is not. Which architecture you choose is almost entirely a matter of what best fits your organization. Let’s talk about the tradeoffs and how best to evaluate them.

The primary benefit of owning your own ingestion pipelines is flexibility. As companies shift to viewing security logging as a data problem, integrating into existing organizational processes, or reusing data that other teams have already brought into Snowflake (such as asset management or HR data), allows customers to avoid double ingestion and simplify their architectures. Organizations may also have hard compliance requirements that dictate exactly how data is ingested and processed, in which case managing their own pipelines may be the best way to meet them. Furthermore, as data silos are broken down, threat detection may become only an auxiliary workload within larger security analytics use cases, and so organizations may decide that a “bolt-on” approach is better for them.

The primary tradeoff is complexity. The cost associated with maintaining pipelines can grow exponentially as the number of sources brought in increases. Upstream sources will occasionally change schemas and pipelines sometimes break for “no reason.” If you have a strong centralized data team, they may be able to build and maintain these for you, but consider that security often has different SLA requirements than other business units. If you have a critical investigation happening at 2 AM on New Year’s Day, will that team be around to onboard essential logs for you? Security ETL tooling can help mitigate the complexity and maintenance requirements, in which case the tradeoffs shift to procurement and management of an additional vendor.
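
As an illustration of the maintenance burden described above, here is a minimal sketch of a schema drift check that a self-managed pipeline might run before loading records. The field names, the `EXPECTED_SCHEMA` mapping, and both function names are hypothetical, chosen only for this example; a real pipeline would likely take its expected schema from a catalog or from the source's documentation.

```python
from typing import Any

# Hypothetical expected schema for one log source: field name -> Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "timestamp": str,
    "user": str,
    "source_ip": str,
    "event_type": str,
}


def detect_schema_drift(
    record: dict[str, Any],
    expected: dict[str, type] = EXPECTED_SCHEMA,
) -> dict[str, list[str]]:
    """Compare one parsed log record against the expected schema.

    Returns the fields that are missing, unexpected, or of the wrong type.
    """
    missing = sorted(f for f in expected if f not in record)
    unexpected = sorted(f for f in record if f not in expected)
    wrong_type = sorted(
        f for f, t in expected.items()
        if f in record and not isinstance(record[f], t)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


def has_drift(report: dict[str, list[str]]) -> bool:
    """True if any category of drift was found."""
    return any(report.values())


# Example: the upstream source renamed "source_ip" to "src_ip" and started
# sending "event_type" as an integer code instead of a string.
report = detect_schema_drift({
    "timestamp": "2024-01-01T02:00:00Z",
    "user": "alice",
    "src_ip": "10.0.0.1",
    "event_type": 4,
})
if has_drift(report):
    print(f"Schema drift detected: {report}")
```

A check like this only tells you that something changed. Someone still has to decide how to remap the fields and backfill affected data, and that ongoing work is the cost that grows with each new source.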

One thing intentionally omitted above is vendor lock-in. Everyone’s needs are different, but in general, using vendor-supplied ingestion tooling and migrating to a self-managed model later does not incur significantly more technical debt than building it yourself from the start. The technical debt it does create can be largely mitigated by keeping downstream workloads loosely coupled to the data that connected applications ingest.
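
One way to picture loose coupling is a thin adapter layer between the ingested data and downstream workloads. The sketch below is illustrative only: the field names, both mappings, and the function names are hypothetical and do not reflect any particular vendor's schema. In a warehouse, a view over the ingested tables could play the same role.

```python
from typing import Any, Callable

Event = dict[str, Any]

# Hypothetical mappings from canonical field names (keys) to the field names
# used by each ingestion path (values). Neither reflects a real vendor schema.
VENDOR_FIELD_MAP = {"event_time": "ts", "actor": "user_name", "source_ip": "src_addr"}
SELF_MANAGED_FIELD_MAP = {"event_time": "timestamp", "actor": "user", "source_ip": "source_ip"}


def make_adapter(field_map: dict[str, str]) -> Callable[[Event], Event]:
    """Build a function that renames raw fields to canonical names."""
    def adapt(raw: Event) -> Event:
        # Fields absent from the raw record come through as None.
        return {canonical: raw.get(source) for canonical, source in field_map.items()}
    return adapt


def actors_seen_from_ip(events: list[Event], ip: str) -> set[str]:
    """Downstream workload: depends only on canonical field names."""
    return {
        e["actor"] for e in events
        if e["source_ip"] == ip and e["actor"] is not None
    }


# Today: events arrive through the vendor's ingestion tooling.
from_vendor = make_adapter(VENDOR_FIELD_MAP)
vendor_event = from_vendor(
    {"ts": "2024-01-01T02:00:00Z", "user_name": "alice", "src_addr": "10.0.0.1"}
)

# Later: the same workload runs unchanged against a self-managed pipeline.
from_self_managed = make_adapter(SELF_MANAGED_FIELD_MAP)
self_managed_event = from_self_managed(
    {"timestamp": "2024-01-01T02:00:00Z", "user": "alice", "source_ip": "10.0.0.1"}
)

print(actors_seen_from_ip([vendor_event, self_managed_event], "10.0.0.1"))
```

Because `actors_seen_from_ip` never references vendor field names, a migration in this sketch means replacing one mapping rather than rewriting each downstream query.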

In short, there are technical advantages and tradeoffs to owning your own pipeline. In some cases these tradeoffs are critical, but in many cases organizations find that factors such as a strong detection library, investigation tooling, AI and cost matter more when selecting a connected SIEM.

To learn more about ingestion of security data into Snowflake, see this guide.
To learn more about our connected application ecosystem, see this report.

This article originally appeared on Medium.