Normalization in Security Data Lakes

Early this summer I had the opportunity to present at my favorite conference, FWD:Cloudsec. My topic was data normalization for security data lakes. The audience was full of practitioners, many of whom were in the early stages of building or thinking about a security data lake, so I presented at a technical but general level. The following is a video of my talk. For those who prefer to read, see below for a write-up based on my speaker notes.

Understanding the Context: Why Data Normalization Matters

Historically, security teams have been among the last to embrace cloud technologies. With the industry’s shift to the cloud, however, they had no choice but to adapt. The cloud brought with it a massive increase in data volume, and with that came the need to manage and make sense of this data effectively.

One of the critical tasks in this new landscape is data normalization: getting disparate data sources into a common schema. However, normalization is not necessary or beneficial in every case.

Why Normalize? The Pros and Cons

Normalization is often driven by the need to make data interoperable with other tools or systems. For example, Security Information and Event Management (SIEM) systems often require data to be in a specific format to function correctly. Normalization can also simplify querying by standardizing field names, reducing the cognitive load on analysts who no longer need to remember the specific names used by different data sources.
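
To make the querying benefit concrete, here is a minimal Python sketch. The source names, the original field names, and the common field names (`src_ip`, `user`) are all hypothetical and invented for illustration; real sources and schemas will differ.

```python
# Hypothetical per-source field maps: original field name -> common field name.
FIELD_MAPS = {
    "firewall": {"srcaddr": "src_ip", "username": "user"},
    "vpn": {"client_ip": "src_ip", "login": "user"},
}


def normalize(source, event):
    """Rename known fields to the common schema; leave other fields untouched."""
    mapping = FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in event.items()}


# Two made-up events that describe similar activity using different field names.
events = [
    normalize("firewall", {"srcaddr": "203.0.113.7", "username": "alice", "action": "deny"}),
    normalize("vpn", {"client_ip": "203.0.113.7", "login": "bob", "status": "connected"}),
]

# One filter now covers both sources; the analyst only has to remember "src_ip".
matches = [e for e in events if e["src_ip"] == "203.0.113.7"]
users = sorted(e["user"] for e in matches)
print(users)
```

Without the rename step, the same question would need one filter per source, each using that source's own field name.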

However, normalization is not without its downsides. For some organizations, the time and computational resources required to normalize data are not justified by the benefits. In cases where data sources are not compared directly, or where the existing formats are already sufficient for the organization’s needs, normalization might be an unnecessary step.

The Controversy of Cloud Audit Log Normalization

One of the more controversial areas of normalization is cloud audit logs. Amazon Security Lake supports normalization of AWS audit logs; some practitioners like this approach and some resist it. Different cloud platforms, such as AWS and GCP, have distinct characteristics, and trying to force a one-size-fits-all normalization can strip away important context. For organizations using only one or two cloud platforms, the return on the effort to normalize may not be worth it, especially if the primary use case is analytics or investigation rather than cross-cloud threat detection.
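
The loss of context can be illustrated with a small Python sketch. The field paths follow the documented record layouts of AWS CloudTrail (`eventName`, `userIdentity.arn`, `sourceIPAddress`) and GCP Cloud Audit Logs (`protoPayload.methodName` and so on), but the sample records are invented and heavily trimmed, and the common field names (`action`, `actor`, `src_ip`) are hypothetical rather than taken from any published standard.

```python
def get_path(record, dotted_path):
    """Walk a nested dict using a dotted path; return None if any step is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical common field -> path inside each provider's audit record.
MAPPINGS = {
    "aws": {
        "action": "eventName",
        "actor": "userIdentity.arn",
        "src_ip": "sourceIPAddress",
    },
    "gcp": {
        "action": "protoPayload.methodName",
        "actor": "protoPayload.authenticationInfo.principalEmail",
        "src_ip": "protoPayload.requestMetadata.callerIp",
    },
}


def normalize(provider, record):
    """Project a record onto the common fields, dropping everything else."""
    return {field: get_path(record, path) for field, path in MAPPINGS[provider].items()}


def dropped_top_level_keys(provider, record):
    """Top-level keys that no mapping path reads: context the projection loses."""
    used = {path.split(".")[0] for path in MAPPINGS[provider].values()}
    return sorted(set(record) - used)


# Invented, trimmed sample records.
aws_event = {
    "eventName": "DeleteBucket",
    "sourceIPAddress": "198.51.100.4",
    "awsRegion": "us-east-1",
    "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"},
}
gcp_event = {
    "protoPayload": {
        "methodName": "storage.buckets.delete",
        "authenticationInfo": {"principalEmail": "alice@example.com"},
        "requestMetadata": {"callerIp": "198.51.100.4"},
    },
    "resource": {"type": "gcs_bucket", "labels": {"project_id": "example-project"}},
}

print(normalize("aws", aws_event))
print(dropped_top_level_keys("aws", aws_event))
print(dropped_top_level_keys("gcp", gcp_event))
```

In this sketch the AWS region and the GCP resource block never reach the normalized record, and the helper only reports top-level keys, so nested losses such as `userIdentity.type` go unreported as well. A real mapping would either have to carry those fields along or accept losing them.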

When and Where to Normalize

Normalization across all data sources is a myth — naming conventions are hard enough to enforce, let alone maintaining a consistent schema across evolving datasets. The key is knowing when and where to normalize.

At the Source: Normalizing data at the source can be efficient, especially if vendors provide logs in a standardized format. However, reliance on vendors to maintain these standards can be risky if they do not keep their offerings up to date.
In the Pipeline: Tools like Cribl, Apache NiFi or Fluentd allow for normalization during data transit, which can be more cost-effective. But this approach also has its challenges, especially when dealing with backward compatibility or the need to revert to raw data.
At the Storage Location: Normalizing data after it has been stored offers flexibility, allowing organizations to maintain raw data while creating views for compatibility. This approach, however, adds complexity and potential costs due to the need for maintaining multiple data transformation pipelines.
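
The storage-location option, keeping raw data and exposing a normalized view on top of it, can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names, column names, and sample rows are hypothetical, and a real data lake would use its own query engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw tables keep each source's original column names (hypothetical, trimmed).
conn.executescript(
    """
    CREATE TABLE raw_source_a (eventTime TEXT, eventName TEXT, sourceIPAddress TEXT);
    CREATE TABLE raw_source_b (ts TEXT, method TEXT, caller_ip TEXT);

    -- The view is the only place the common schema exists; raw data is untouched.
    CREATE VIEW events_normalized AS
        SELECT 'source_a' AS source, eventTime AS event_time,
               eventName AS action, sourceIPAddress AS src_ip
        FROM raw_source_a
        UNION ALL
        SELECT 'source_b' AS source, ts AS event_time,
               method AS action, caller_ip AS src_ip
        FROM raw_source_b;
    """
)

# Invented sample rows, loaded in their original shape.
conn.execute(
    "INSERT INTO raw_source_a VALUES (?, ?, ?)",
    ("2024-06-17T10:00:00Z", "DeleteBucket", "198.51.100.4"),
)
conn.execute(
    "INSERT INTO raw_source_b VALUES (?, ?, ?)",
    ("2024-06-17T10:05:00Z", "storage.buckets.delete", "198.51.100.4"),
)

# Analysts query the view; the raw tables stay available for investigations.
rows = conn.execute(
    "SELECT source, action FROM events_normalized WHERE src_ip = ? ORDER BY event_time",
    ("198.51.100.4",),
).fetchall()
print(rows)
```

Because the mapping lives in the view definition, changing the schema means redefining the view rather than reprocessing stored data. The trade-off is one more transformation layer to maintain for each source.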

Choosing a Standard

When it comes to selecting a standard for normalization, don’t overthink it. The choice of standard is less critical than ensuring that the chosen approach fits your specific needs and use cases.

Don’t Do Anything: For some organizations, not normalizing at all might be a viable option, especially if they do not require complex querying or cross-data source analysis.
Build It Yourself: Many organizations opt to create their own standards, normalizing only the fields they need for their specific operations.
Pick a Premade Standard: There are several established standards available, such as the Elastic Common Schema (ECS) and the Open Cybersecurity Schema Framework (OCSF). Each has its strengths, with ECS being widely used and recently donated to the OpenTelemetry project, and OCSF being a newer, community-driven standard with growing support. (Read my thoughts on OCSF here)

Conclusion: There’s No One Answer

In closing, there is no one answer for data normalization. The best approach is to be pragmatic: understand the specific needs of your organization, choose the right points in your data pipeline to apply normalization, and be prepared to make trade-offs.

The cloud has brought new challenges to security professionals, but with thoughtful planning and the right tools, these challenges can be managed effectively, allowing teams to leverage the power of cloud-based data without being overwhelmed by its scale and complexity.

This article originally appeared on Medium.