
Add XmlProcessor initial implementation #130337


Open

marc-gr wants to merge 15 commits into main from feat/xml-processor

Conversation


@marc-gr marc-gr commented Jun 30, 2025

This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.

⚙️ Configuration Options

processors:
  - xml:
      field: "xml_data"
      target_field: "parsed"
      to_lower: false
      # Logstash-compatible options
      xpath:
        "/root/item/@id": "item_id"
        "//product/name/text()": "product_name"
      namespaces:
        "ns": "http://example.com/namespace"
      force_array: true
      force_content: false
      remove_namespaces: false
      ignore_empty_value: true
      parse_options: "strict"
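
To make the `xpath` and `namespaces` options above more concrete, here is a small standalone JAXP sketch (not the processor's actual code) of how an XPath expression with a bound namespace prefix can be pre-compiled once and then evaluated against a parsed document. The class name, the sample XML, and the `ns:product` element are hypothetical, and the second expression is adapted from the config above to use the `ns` prefix so that the namespace binding actually matters.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input document; the ns prefix is declared with the same URI
        // that the `namespaces` option above would bind.
        String xml = "<root xmlns:ns=\"http://example.com/namespace\">"
            + "<item id=\"42\"><ns:product><name>Widget</name></ns:product></item></root>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefix-aware XPath evaluation
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        // Bind the "ns" prefix exactly as the `namespaces` option would.
        NamespaceContext namespaces = new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? "http://example.com/namespace" : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return null;
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                return Collections.emptyIterator();
            }
        };

        // Compile each expression once up front ("pre-compiled XPath"), then reuse it per document.
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(namespaces);
        XPathExpression itemId = xpath.compile("/root/item/@id");
        XPathExpression productName = xpath.compile("//ns:product/name/text()");

        System.out.println("item_id = " + itemId.evaluate(doc, XPathConstants.STRING));           // 42
        System.out.println("product_name = " + productName.evaluate(doc, XPathConstants.STRING)); // Widget
    }
}
```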

🏗️ Architecture

  • Streaming SAX Parser: Optimal memory usage for large XML documents
  • Selective DOM Building: Only builds DOM when XPath expressions are configured
  • Pre-compiled XPath: XPath expressions compiled at processor creation for performance
  • Security: Enhanced XXE protection with secure parser factory configurations (see the sketch after this list)
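
For reference, this is a minimal sketch of the kind of JAXP hardening the Security bullet refers to: disallowing DOCTYPE declarations and external entity resolution on both the SAX and DOM factories. It is illustrative only, assumes the standard JDK/Xerces feature URIs, and the `SecureXmlFactories` class name is hypothetical rather than the PR's actual code.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper showing typical JAXP hardening against XXE. */
public final class SecureXmlFactories {

    public static SAXParserFactory newSaxParserFactory() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations entirely, which blocks both internal and external entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Belt and braces: also disable external entity resolution explicitly.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setXIncludeAware(false);
        factory.setNamespaceAware(true);
        return factory;
    }

    public static DocumentBuilderFactory newDocumentBuilderFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        factory.setNamespaceAware(true);
        return factory;
    }

    private SecureXmlFactories() {}
}
```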

📚 Documentation

Documentation includes:

  • Complete configuration reference
  • XPath expression examples
  • Namespace configuration guide

Logstash differences

  • ignore_empty_value behaves a bit differently from suppress_empty, but I think it matches the behavior of other processors better. It could be adapted, or we could even add both, but I found that confusing.

Closes #97364

Contributor

github-actions bot commented Jun 30, 2025

🔍 Preview links for changed docs:

🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.

@marc-gr marc-gr force-pushed the feat/xml-processor branch from 95df637 to 67dd264 on June 30, 2025 14:28
@marc-gr marc-gr requested a review from Copilot June 30, 2025 14:29
@marc-gr marc-gr marked this pull request as ready for review June 30, 2025 14:29
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jun 30, 2025
@marc-gr marc-gr added the Team:Security (Meta label for security team) label Jun 30, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Security (Meta label for security team) label Jun 30, 2025

@marc-gr marc-gr added the :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP) label Jul 1, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine removed the needs:triage (Requires assignment of a team area label) label Jul 1, 2025
@elasticsearchmachine
Collaborator

Hi @marc-gr, I've created a changelog YAML for you.

- Replace XMLStreamReader with SAX parser + DOM for XPath support
- Add XPath extraction, namespaces, strict parsing, content filtering
- New options: force_array, force_content, remove_namespaces, store_xml
- Enhanced security with XXE protection and pre-compiled XPath expressions
- Full test coverage and updated documentation
Contributor

github-actions bot commented Jul 4, 2025

@marc-gr marc-gr requested a review from Copilot July 4, 2025 13:58

marc-gr added 2 commits July 4, 2025 16:22
- Fix test assertion for remove_namespaces feature
- Use StandardCharsets.UTF_8 instead of string literal
- Replace string reference comparison with isEmpty()
- Move regex pattern to static final field for performance
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds an initial implementation of a new XmlProcessor to parse XML input into JSON-like structures with feature parity to Logstash’s XML filter, alongside configuration options, factory validation, and documentation.

  • Introduce XmlProcessor with streaming SAX parsing, optional DOM building for XPath, and secure defaults.
  • Add end-to-end tests (XmlProcessorTests) and factory validation tests (XmlProcessorFactoryTests).
  • Register the processor in the plugin, update module-info, documentation, and changelog.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java | New XML parsing processor implementation |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorTests.java | End-to-end tests for XML parsing behavior |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorFactoryTests.java | Tests for factory config and validation |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java | Register XmlProcessor in the plugin registry |
| modules/ingest-common/src/main/java/module-info.java | Add requires java.xml for XML APIs |
| docs/reference/enrich-processor/xml-processor.md | Documentation for XML processor |
| docs/reference/enrich-processor/toc.yml | Add entry for xml-processor.md |
| docs/reference/enrich-processor/index.md | Include xml processor in the index |
| docs/changelog/130337.yaml | Changelog entry for PR |
Comments suppressed due to low confidence (1)

docs/reference/enrich-processor/xml-processor.md:9

  • The implementation actually uses a streaming SAX parser with optional DOM building for XPath. Update this description to reflect the streaming-based approach for accurate documentation.
Parses XML documents and converts them to JSON objects using a DOM parser. This processor efficiently handles XML data with a single-parse architecture that supports both structured output and XPath extraction for optimal performance.

Member

@PeteGillinElastic PeteGillinElastic left a comment

Hi @marc-gr. First of all, thanks again for the contribution.

I figured that a sensible way to approach the review here is to cover the behaviour in a first round, and then cover the implementation later. Since you helpfully included comprehensive documentation, I figured that would be a good way into it. As such, in this review I've looked at those files, but I haven't looked at the actual Java code at all. I'll do that in a later round.

Overall, I think that the behaviour looks sensible, and the docs are nice and clear. I've left a few comments, which are mostly presentational, or asking for clarification.

@@ -159,6 +159,9 @@ Refer to [Enrich your data](docs-content://manage-data/ingest/transform-enrich/d
[`split` processor](/reference/enrich-processor/split-processor.md)
: Splits a field into an array of values.

[`xml` processor](/reference/enrich-processor/xml-processor.md)
: Parses XML documents and converts them to JSON objects.

Member

Nit: It looks like the contents of this section are in alphabetical order. So can you move this after the trim processor below?


### Force content mode

When `force_content` is `true`, all element text content is stored under the special `#text` key:
Member

I'm still slightly unclear what this does. As shown in the previous section, it seems to include some #text keys without this option, but not all. Could you perhaps explain what the behaviour without this option is, and for your example maybe you could choose something which nicely illustrates the two behaviours (with and without this option) and show both for contrast?

"foo": {
"b": "bold",
"i": "italic",
"#text": "This text is and this is !"
Member

For what it's worth, I was somewhat surprised by this behaviour. The text and the tags are all there, but the information about how they are ordered and interleaved has been thrown away.

You're the one who has actual use-cases, and you know how logstash behaves, so if you tell me that this is a sensible thing to do then I'll believe you. I just wanted to check that it's deliberate.

| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document. |
| `ignore_failure` | no | `false` | Ignore failures for the processor. When `true` and XML parsing fails, adds `_xmlparsefailure` tag to the document. See [Handling pipeline failures](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#handling-pipeline-failures). |
| `to_lower` | no | `false` | Convert XML element names and attribute names to lowercase. |
| `ignore_empty_value` | no | `false` | If `true`, the processor will filter out null and empty values from the parsed XML structure, including empty elements, elements with null values, and elements with whitespace-only content. |
Member

Could we consider maybe calling this remove_empty_value? Or even remove_empty_values? It seems to me that this would make it seem logically related to remove_namespaces, rather than to ignore_missing and ignore_failure, which more accurately reflects what it does.

| `remove_namespaces` | no | `false` | If `true`, removes namespace prefixes from element and attribute names. |
| `force_content` | no | `false` | If `true`, forces text content and attributes to always parse to a hash value with `#text` key for content. |
| `force_array` | no | `false` | If `true`, forces all parsed values to be arrays. Single elements are wrapped in arrays. |
| `parse_options` | no | - | Controls XML parsing behavior. Set to `"strict"` for strict XML validation that fails fast on invalid content. |
Member

Is there a reason to make this a string-valued option where "strict" is the only valid value, rather than e.g. a boolean strict_parsing option?

| --- | --- | --- | --- |
| `field` | yes | - | The field containing the XML string to be parsed. |
| `target_field` | no | `field` | The field that the converted structured object will be written into. Any existing content in this field will be overwritten. |
| `store_xml` | no | `true` | If `true`, stores the parsed XML structure in the target field. If `false`, only XPath extraction results are stored. |
Member

Am I correct in figuring that target_field is ignored if this is set to false? I think it might be worth stating that explicitly.

}
]
}
```
Member

Please forgive my ignorance, it's a long time since I've dealt with XML namespaces (or XML at all, to be honest!). Why do we need to include the namespace mappings in the processor definition, when — in this example, at least — the same mappings are provided via the xmlns:* attributes in the source document?

}
```

### Force array behavior
Member

I sort of wonder whether this section adds much value, although I feel less strongly about this than about the lower case one.

"docs": [
{
"_source": {
"xml_content": "<catalog><book><title>Invalid XML with control character</title></book></catalog>"
Member

Perhaps I'm just being blind here, but... I don't see any control character here?

@@ -0,0 +1,6 @@
pr: 130337
summary: Add `XmlProcessor` initial implementation
Member

Nit: This is going into the release notes, so we should phrase it in a more user-facing way. Something like "Add xml ingest processor for parsing XML", maybe.

Labels

  • :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
  • >enhancement
  • external-contributor (Pull request authored by a developer outside the Elasticsearch team)
  • Team:Data Management (Meta label for data/management team)
  • v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ingest Pipeline] XML Processor
3 participants