
Add XmlProcessor initial implementation #130337


Open

marc-gr wants to merge 15 commits into main from feat/xml-processor

Conversation


@marc-gr marc-gr commented Jun 30, 2025

This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.

⚙️ Configuration Options

processors:
  - xml:
      field: "xml_data"
      target_field: "parsed"
      to_lower: false
      # Logstash-compatible options
      xpath:
        "/root/item/@id": "item_id"
        "//product/name/text()": "product_name"
      namespaces:
        "ns": "http://example.com/namespace"
      force_array: true
      force_content: false
      remove_namespaces: false
      ignore_empty_value: true
      parse_options: "strict"
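
To make the `xpath` and `namespaces` options above more concrete, here is a small standalone JAXP sketch (not the processor's actual code) of how an XPath expression with a bound namespace prefix can be pre-compiled once and then evaluated against a parsed document. The class name, the sample XML, and the `ns:product` element are hypothetical, and the second expression is adapted from the config above to use the `ns` prefix so that the namespace binding actually matters.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input document; the ns prefix is declared with the same URI
        // that the `namespaces` option above would bind.
        String xml = "<root xmlns:ns=\"http://example.com/namespace\">"
            + "<item id=\"42\"><ns:product><name>Widget</name></ns:product></item></root>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefix-aware XPath evaluation
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        // Bind the "ns" prefix exactly as the `namespaces` option would.
        NamespaceContext namespaces = new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? "http://example.com/namespace" : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return null;
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                return Collections.emptyIterator();
            }
        };

        // Compile each expression once up front ("pre-compiled XPath"), then reuse it per document.
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(namespaces);
        XPathExpression itemId = xpath.compile("/root/item/@id");
        XPathExpression productName = xpath.compile("//ns:product/name/text()");

        System.out.println("item_id = " + itemId.evaluate(doc, XPathConstants.STRING));           // 42
        System.out.println("product_name = " + productName.evaluate(doc, XPathConstants.STRING)); // Widget
    }
}
```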

🏗️ Architecture

  • Streaming SAX Parser: Optimal memory usage for large XML documents
  • Selective DOM Building: Only builds DOM when XPath expressions are configured
  • Pre-compiled XPath: XPath expressions compiled at processor creation for performance
  • Security: Enhanced XXE protection with secure parser factory configurations (see the sketch after this list)
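
For reference, this is a minimal sketch of the kind of JAXP hardening the Security bullet refers to: disallowing DOCTYPE declarations and external entity resolution on both the SAX and DOM factories. It is illustrative only, assumes the standard JDK/Xerces feature URIs, and the `SecureXmlFactories` class name is hypothetical rather than the PR's actual code.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper showing typical JAXP hardening against XXE. */
public final class SecureXmlFactories {

    public static SAXParserFactory newSaxParserFactory() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations entirely, which blocks both internal and external entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Belt and braces: also disable external entity resolution explicitly.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setXIncludeAware(false);
        factory.setNamespaceAware(true);
        return factory;
    }

    public static DocumentBuilderFactory newDocumentBuilderFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        factory.setNamespaceAware(true);
        return factory;
    }

    private SecureXmlFactories() {}
}
```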

📚 Documentation

Documentation includes:

  • Complete configuration reference
  • XPath expression examples
  • Namespace configuration guide

Logstash differences

  • ignore_empty_value behaves a bit differently from suppress_empty, but I think it matches the behavior of other processors better. It could be adapted, or we could even add both, but I found that confusing.

Closes #97364

Contributor

github-actions bot commented Jun 30, 2025

🔍 Preview links for changed docs:

🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.

@marc-gr marc-gr force-pushed the feat/xml-processor branch from 95df637 to 67dd264 on June 30, 2025 14:28
@marc-gr marc-gr requested a review from Copilot June 30, 2025 14:29
@marc-gr marc-gr marked this pull request as ready for review June 30, 2025 14:29
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jun 30, 2025
@marc-gr marc-gr added the Team:Security (Meta label for security team) label Jun 30, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Security (Meta label for security team) label Jun 30, 2025

@marc-gr marc-gr added the :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP) label Jul 1, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine removed the needs:triage (Requires assignment of a team area label) label Jul 1, 2025
@elasticsearchmachine
Collaborator

Hi @marc-gr, I've created a changelog YAML for you.

- Replace XMLStreamReader with SAX parser + DOM for XPath support
- Add XPath extraction, namespaces, strict parsing, content filtering
- New options: force_array, force_content, remove_namespaces, store_xml
- Enhanced security with XXE protection and pre-compiled XPath expressions
- Full test coverage and updated documentation
Contributor

github-actions bot commented Jul 4, 2025

@marc-gr marc-gr requested a review from Copilot July 4, 2025 13:58

marc-gr added 2 commits July 4, 2025 16:22
- Fix test assertion for remove_namespaces feature
- Use StandardCharsets.UTF_8 instead of string literal
- Replace string reference comparison with isEmpty()
- Move regex pattern to static final field for performance
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds an initial implementation of a new XmlProcessor to parse XML input into JSON-like structures with feature parity to Logstash’s XML filter, alongside configuration options, factory validation, and documentation.

  • Introduce XmlProcessor with streaming SAX parsing, optional DOM building for XPath, and secure defaults.
  • Add end-to-end tests (XmlProcessorTests) and factory validation tests (XmlProcessorFactoryTests).
  • Register the processor in the plugin, update module-info, documentation, and changelog.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java | New XML parsing processor implementation |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorTests.java | End-to-end tests for XML parsing behavior |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorFactoryTests.java | Tests for factory config and validation |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java | Register XmlProcessor in the plugin registry |
| modules/ingest-common/src/main/java/module-info.java | Add requires java.xml for XML APIs |
| docs/reference/enrich-processor/xml-processor.md | Documentation for XML processor |
| docs/reference/enrich-processor/toc.yml | Add entry for xml-processor.md |
| docs/reference/enrich-processor/index.md | Include xml processor in the index |
| docs/changelog/130337.yaml | Changelog entry for PR |
Comments suppressed due to low confidence (1)

docs/reference/enrich-processor/xml-processor.md:9

  • The implementation actually uses a streaming SAX parser with optional DOM building for XPath. Update this description to reflect the streaming-based approach for accurate documentation.
Parses XML documents and converts them to JSON objects using a DOM parser. This processor efficiently handles XML data with a single-parse architecture that supports both structured output and XPath extraction for optimal performance.

Member

@PeteGillinElastic PeteGillinElastic left a comment

Hi @marc-gr. First of all, thanks again for the contribution.

I figured that a sensible way to approach the review here is to cover the behaviour in a first round, and then cover the implementation later. Since you helpfully included comprehensive documentation, I figured that would be a good way into it. As such, in this review I've looked at those files, but I haven't looked at the actual Java code at all. I'll do that in a later round.

Overall, I think that the behaviour looks sensible, and the docs are nice and clear. I've left a few comments, which are mostly presentational, or asking for clarification.

@@ -159,6 +159,9 @@ Refer to [Enrich your data](docs-content://manage-data/ingest/transform-enrich/d
[`split` processor](/reference/enrich-processor/split-processor.md)
: Splits a field into an array of values.

[`xml` processor](/reference/enrich-processor/xml-processor.md)
: Parses XML documents and converts them to JSON objects.

Member

Nit: It looks like the contents of this section are in alphabetical order. So can you move this after the trim processor below?


### Force content mode

When `force_content` is `true`, all element text content is stored under the special `#text` key:
Member

I'm still slightly unclear what this does. As shown in the previous section, it seems to include some #text keys without this option, but not all. Could you perhaps explain what the behaviour without this option is, and for your example maybe you could choose something which nicely illustrates the two behaviours (with and without this option) and show both for contrast?

"foo": {
"b": "bold",
"i": "italic",
"#text": "This text is and this is !"
Member

For what it's worth, I was somewhat surprised by this behaviour. The text and the tags are all there, but the information about how they are ordered and interleaved has been thrown away.

You're the one who has actual use-cases, and you know how logstash behaves, so if you tell me that this is a sensible thing to do then I'll believe you. I just wanted to check that it's deliberate.

| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document. |
| `ignore_failure` | no | `false` | Ignore failures for the processor. When `true` and XML parsing fails, adds `_xmlparsefailure` tag to the document. See [Handling pipeline failures](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#handling-pipeline-failures). |
| `to_lower` | no | `false` | Convert XML element names and attribute names to lowercase. |
| `ignore_empty_value` | no | `false` | If `true`, the processor will filter out null and empty values from the parsed XML structure, including empty elements, elements with null values, and elements with whitespace-only content. |
Member

Could we consider maybe calling this remove_empty_value? Or even remove_empty_values? It seems to me that this would make it seem logically related to remove_namespaces, rather than to ignore_missing and ignore_failure, which more accurately reflects what it does.

| `remove_namespaces` | no | `false` | If `true`, removes namespace prefixes from element and attribute names. |
| `force_content` | no | `false` | If `true`, forces text content and attributes to always parse to a hash value with `#text` key for content. |
| `force_array` | no | `false` | If `true`, forces all parsed values to be arrays. Single elements are wrapped in arrays. |
| `parse_options` | no | - | Controls XML parsing behavior. Set to `"strict"` for strict XML validation that fails fast on invalid content. |
Member

Is there a reason to make this a string-valued option where "strict" is the only valid value, rather than e.g. a boolean strict_parsing option?

| --- | --- | --- | --- |
| `field` | yes | - | The field containing the XML string to be parsed. |
| `target_field` | no | `field` | The field that the converted structured object will be written into. Any existing content in this field will be overwritten. |
| `store_xml` | no | `true` | If `true`, stores the parsed XML structure in the target field. If `false`, only XPath extraction results are stored. |
Member

Am I correct in figuring that target_field is ignored if this is set to false? I think it might be worth stating that explicitly.

}
]
}
```
Member

Please forgive my ignorance, it's a long time since I've dealt with XML namespaces (or XML at all, to be honest!). Why do we need to include the namespace mappings in the processor definition, when — in this example, at least — the same mappings are provided via the xmlns:* attributes in the source document?

}
```

### Force array behavior
Member

I sort of wonder whether this section adds much value, although I feel less strongly about this than about the lower case one.

"docs": [
{
"_source": {
"xml_content": "<catalog><book><title>Invalid XML with control character</title></book></catalog>"
Member

Perhaps I'm just being blind here, but... I don't see any control character here?

@@ -0,0 +1,6 @@
pr: 130337
summary: Add `XmlProcessor` initial implementation
Member

Nit: This is going into the release notes, so we should phrase it in a more user-facing way. Something like "Add xml ingest processor for parsing XML", maybe.

Labels

  • :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
  • >enhancement
  • external-contributor (Pull request authored by a developer outside the Elasticsearch team)
  • Team:Data Management (Meta label for data/management team)
  • v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ingest Pipeline] XML Processor
3 participants