Skip to content

Handle structured log messages #131027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

eyalkoren
Copy link
Contributor

@eyalkoren eyalkoren commented Jul 10, 2025

Closes #130333

In order to reuse code from JsonProcessor, I needed to make the ingest-otel module dependent on the ingest-common module. For this, I made IngestCommonPlugin an ExtensiblePlugin. Maybe there's a better way to enable such a dependency.

Copy link
Contributor

github-actions bot commented Jul 10, 2025

🔍 Preview links for changed docs

@eyalkoren eyalkoren self-assigned this Jul 13, 2025
@eyalkoren eyalkoren changed the title WIP: handle structured log messages Handle structured log messages Jul 13, 2025
@eyalkoren eyalkoren added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jul 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @eyalkoren, I've created a changelog YAML for you.

@eyalkoren eyalkoren marked this pull request as ready for review July 13, 2025 14:08
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jul 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looked at the documentation, left a comment

@eyalkoren
Copy link
Contributor Author

@flash1293 Thanks for reviewing the docs! I will correct what you've pointed out.
More importantly - if have a chance, take a look at the yaml tests, they are the best indication of whether we cover everything we had in mind.

@flash1293
Copy link
Contributor

I checked those, they look good, maybe we should capture the edge case discussed above via test as well - what happens if message is an object or a number or so.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Jul 14, 2025

what happens if message is an object or a number or so.

I can add tests for that, although this PR doesn't deal with such cases based on our decision

@flash1293
Copy link
Contributor

Sure, it would just make sure that it behaves as we expect

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Jul 15, 2025

@flash1293 I applied your suggestions in 7208fce.

As I promised, the new logic that handles structured message doesn't touch anything that is not String, it just leaves it as is.

However, this highlighted something: the existing normalization logic treats any type of message field value the same and moves it to body.text. I think we can't do anything else if we want to keep this processor backwards compatible because we need to ensure that the indexing of body.text is equivalent to the indexing of message in non-processed documents.
For example, if you send a document with a scalar message - it won't be rejected. Elasticsearch will evaluate its string representation and store it as match_only_text. If you send a message with an object value - it will be rejected (or redirected to the failure store after #131261 is merged). I believe we must do the same with body.text.

So I am not proposing to change anything, but raising this for awareness and additional consideration.

@flash1293
Copy link
Contributor

Cool, thanks!

However, this highlighted something: the existing normalization logic treats any type of message field value the same and moves it to body.text

Yeah, I think that's OK and expected. Elasticsearch will either cast it or if that doesn't work it will set the ignored fields, so we should have a sensible behavior every time.

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eyalkoren eyalkoren requested a review from a team July 16, 2025 11:41
@lukewhiting lukewhiting requested a review from Copilot July 16, 2025 11:50
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for handling structured log messages in the normalize_for_stream processor by detecting JSON format in the message field and processing it based on ECS format conventions. The implementation reuses code from the JsonProcessor by making the ingest-otel module dependent on the ingest-common module.

Key changes include:

  • Enhanced message field processing to detect and parse JSON structures
  • ECS vs non-ECS format detection based on @timestamp field presence
  • Module dependency updates to enable code reuse from ingest-common

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
NormalizeForStreamProcessor.java Added structured message detection and processing logic with ECS format handling
NormalizeForStreamProcessorTests.java Added comprehensive test cases for ECS and non-ECS JSON message normalization
20_normalize_json_message.yml Added integration tests for structured message processing scenarios
IngestCommonPlugin.java Made plugin extensible to allow dependency from other modules
build.gradle Added dependency configuration for ingest-common module
module-info.java Updated module dependencies to include required logging and ingest-common
normalize-for-stream.md Added documentation for structured message field processing
131027.yaml Added changelog entry for the new feature
Comments suppressed due to low confidence (5)

modules/ingest-otel/src/main/java/org/elasticsearch/ingest/otel/NormalizeForStreamProcessor.java:66

  • [nitpick] The logger variable should follow Java naming conventions and be named 'logger' instead of 'log'.
    private static final Logger log = LogManager.getLogger(NormalizeForStreamProcessor.class);

modules/ingest-otel/src/main/java/org/elasticsearch/ingest/otel/NormalizeForStreamProcessor.java:143

  • [nitpick] Use 'logger' instead of 'log' to follow Java naming conventions.
            log.warn("Failed to parse structured message, keeping it as a string in 'body.text' field: {}", e.getMessage());

modules/ingest-otel/src/main/java/org/elasticsearch/ingest/otel/NormalizeForStreamProcessor.java:124

  • [nitpick] Use 'logger' instead of 'log' to follow Java naming conventions.
                            log.debug(

modules/ingest-otel/src/main/java/org/elasticsearch/ingest/otel/NormalizeForStreamProcessor.java:130

  • [nitpick] Use 'logger' instead of 'log' to follow Java naming conventions.
                            log.debug(

modules/ingest-otel/src/main/java/org/elasticsearch/ingest/otel/NormalizeForStreamProcessor.java:138

  • [nitpick] Use 'logger' instead of 'log' to follow Java naming conventions.
                        log.debug("Structured message is not a JSON object, keeping it as a string in 'body.text' field: {}", message);

@dakrone dakrone requested review from lukewhiting and removed request for a team July 22, 2025 15:47
message = message.trim();
if (message.startsWith("{") && message.endsWith("}")) {
// if the message is a JSON object, we assume it is a structured log
Object parsedMessage = JsonProcessor.apply(message, true, true);
Copy link
Contributor

@lukewhiting lukewhiting Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming we already parse JSON in other pipelines but any new security concerns here as we are parsing arbitrary and possibly unsanitized inputs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new thing here is the fact that it will be on by default to logs-*-* data streams.
I am not familiar with the security concerns that may be related to such JSON parsing.
As for data sanitation - I guess that if sanitation is either applied on the message field or on any other fields after the JSON parsing - this is equivalent to existing sanitation of non-structured logs.
But let me ask for input on both issues.

Copy link
Contributor

@flash1293 flash1293 Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small correction - it will be on for logs,logs.*, not logs-*-*.

About security concerns - are you concerned about the case where an attacker could control the message, but not the whole document, with this approach essentially giving them control over the whole document? It's a good thought, I think it's true but I don't see how it would be problematic. The whole idea of the logs stream is that it can handle arbitrary incoming data in a sensible, opinionated way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a couple of things: Is there any way for a malicious payload to "break out" and mess up the document.

I'm also wondering in general how secure the JSON parsing code is given it will receive potentially end user crafted payloads (If the operator is logging user input from their apps) but I think that's a lesser concern as I guess we already use it elsewhere and we have the whole sandboxing / entitlements system protecting ES as a whole.

Copy link
Contributor

@lukewhiting lukewhiting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks sound and tests look good. Just a couple of quick questions around performance and security just to make sure my assumptions match reality (My assumption being that everything looks in line with how this will be used 😄)

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Jul 24, 2025

@felixbarny @flash1293 please see @lukewhiting's comments above and provide your feedback on these issues.

I was mostly concerned about the dependency on another module - not because I think of any specific issue with that, mainly because I am not aware if there are hidden implications that I am not aware of.

@flash1293
Copy link
Contributor

Discussed this again with @LucaWintergerst and we decided to move forward with this approach - let's accept the tradeoffs and document it well.

Reasoning: We feel confident that this is the better default behavior. While there is a small potential for undesired behavior, users can work around that in various ways - the convenience for many weighs stronger than that.

@flash1293
Copy link
Contributor

@lukewhiting could you approve from your side? I think Eyal did a good job documenting the behavior in this PR already.

Copy link
Contributor

@lukewhiting lukewhiting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following in person call with Joe and Luca, I'm happy that we understand the risks of the current approach and they are good to proceed.

@felixbarny
Copy link
Member

The change LGTM. We've shied away from parsing arbitrary JSON before because with dynamic mappings, it can easily lead to a mess. However, streams changes the equation by disabling dynamic mappings and this processor is specifically meant for streams. I also like the different parsing logic for ECS and non-ECS JSON. It's orthogonal to this but I'm thinking it would make sense to have an OTel processor for parsing ECS logs as well, inspired by what's done here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature Team:Data Management Meta label for data/management team v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parse structured messages when normalizing log events
5 participants