-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Add XmlProcessor initial implementation #130337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔍 Preview links for changed docs: 🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes. |
5bd50d5
to
95df637
Compare
95df637
to
67dd264
Compare
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @marc-gr, I've created a changelog YAML for you. |
- Replace XMLStreamReader with SAX parser + DOM for XPath support - Add XPath extraction, namespaces, strict parsing, content filtering - New options: force_array, force_content, remove_namespaces, store_xml - Enhanced security with XXE protection and pre-compiled XPath expressions - Full test coverage and updated documentation
🔍 Preview links for changed docs |
- Fix test assertion for remove_namespaces feature - Use StandardCharsets.UTF_8 instead of string literal - Replace string reference comparison with isEmpty() - Move regex pattern to static final field for performance
… into feat/xml-processor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds an initial implementation of a new XmlProcessor
to parse XML input into JSON-like structures with feature parity to Logstash’s XML filter, alongside configuration options, factory validation, and documentation.
- Introduce
XmlProcessor
with streaming SAX parsing, optional DOM building for XPath, and secure defaults. - Add end-to-end tests (
XmlProcessorTests
) and factory validation tests (XmlProcessorFactoryTests
). - Register the processor in the plugin, update module-info, documentation, and changelog.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java | New XML parsing processor implementation |
modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorTests.java | End-to-end tests for XML parsing behavior |
modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorFactoryTests.java | Tests for factory config and validation |
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java | Register XmlProcessor in the plugin registry |
modules/ingest-common/src/main/java/module-info.java | Add requires java.xml for XML APIs |
docs/reference/enrich-processor/xml-processor.md | Documentation for XML processor |
docs/reference/enrich-processor/toc.yml | Add entry for xml-processor.md |
docs/reference/enrich-processor/index.md | Include xml processor in the index |
docs/changelog/130337.yaml | Changelog entry for PR |
Comments suppressed due to low confidence (1)
docs/reference/enrich-processor/xml-processor.md:9
- The implementation actually uses a streaming SAX parser with optional DOM building for XPath. Update this description to reflect the streaming-based approach for accurate documentation.
Parses XML documents and converts them to JSON objects using a DOM parser. This processor efficiently handles XML data with a single-parse architecture that supports both structured output and XPath extraction for optimal performance.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @marc-gr. First of all, thanks again for the contribution.
I figured that a sensible way to approach the review here is to cover the behaviour in a first round, and then cover the implementation later. Since you helpfully included comprehensive documentation, I figured that would be a good way into it. As such, in this review I've looked at those files, but I haven't looked at the actual Java code at all. I'll do that in a later round.
Overall, I think that the behaviour looks sensible, and the docs are nice and clear. I've left a few comments, which are mostly presentational, or asking for clarification.
@@ -159,6 +159,9 @@ Refer to [Enrich your data](docs-content://manage-data/ingest/transform-enrich/d | |||
[`split` processor](/reference/enrich-processor/split-processor.md) | |||
: Splits a field into an array of values. | |||
|
|||
[`xml` processor](/reference/enrich-processor/xml-processor.md) | |||
: Parses XML documents and converts them to JSON objects. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It looks like the contents of this section are in alphabetical order. So can you move this after the trim
processor below?
|
||
### Force content mode | ||
|
||
When `force_content` is `true`, all element text content is stored under the special `#text` key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still slightly unclear what this does. As shown in the previous section, it seems to include some #text
keys without this option, but not all. Could you perhaps explain what the behaviour without this option is, and for your example maybe you could choose something which nicely illustrates the two behaviours (with and without this option) and show both for contrast?
"foo": { | ||
"b": "bold", | ||
"i": "italic", | ||
"#text": "This text is and this is !" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, I was somewhat surprised at this is behaviour. The text and the tags are all there, but the information about how they are ordered and interleaved has been thrown away.
You're the one who has actual use-cases, and you know how logstash behaves, so if you tell me that this is a sensible thing to do then I'll believe you. I just wanted to check that it's deliberate.
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document. | | ||
| `ignore_failure` | no | `false` | Ignore failures for the processor. When `true` and XML parsing fails, adds `_xmlparsefailure` tag to the document. See [Handling pipeline failures](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#handling-pipeline-failures). | | ||
| `to_lower` | no | `false` | Convert XML element names and attribute names to lowercase. | | ||
| `ignore_empty_value` | no | `false` | If `true`, the processor will filter out null and empty values from the parsed XML structure, including empty elements, elements with null values, and elements with whitespace-only content. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we consider maybe calling this remove_empty_value
? Or even remove_empty_values
? It seems to me that this would make it seem logically related to remove_namespaces
, rather than to ignore_missing
and ignore_failure
, which more accurately reflects what it does.
| `remove_namespaces` | no | `false` | If `true`, removes namespace prefixes from element and attribute names. | | ||
| `force_content` | no | `false` | If `true`, forces text content and attributes to always parse to a hash value with `#text` key for content. | | ||
| `force_array` | no | `false` | If `true`, forces all parsed values to be arrays. Single elements are wrapped in arrays. | | ||
| `parse_options` | no | - | Controls XML parsing behavior. Set to `"strict"` for strict XML validation that fails fast on invalid content. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to make this a string-valued option where "strict"
is the only valid value, rather than e.g. a boolean strict_parsing
option?
| --- | --- | --- | --- | | ||
| `field` | yes | - | The field containing the XML string to be parsed. | | ||
| `target_field` | no | `field` | The field that the converted structured object will be written into. Any existing content in this field will be overwritten. | | ||
| `store_xml` | no | `true` | If `true`, stores the parsed XML structure in the target field. If `false`, only XPath extraction results are stored. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct in figuring that target_field
is ignored if this is set? I think it might be worth stating that explicitly.
} | ||
] | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please forgive my ignorance, it's a long time since I've dealt with XML namespaces (or XML at all, to be honest!). Why do we need to include the namespace mappings in the processor definition, when — in this example, at least — the same mappings are provided via the xmlns:*
attributes in the source document?
} | ||
``` | ||
|
||
### Force array behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sort of wonder whether this section adds much value, although I feel less strongly about this than about the lower case one.
"docs": [ | ||
{ | ||
"_source": { | ||
"xml_content": "<catalog><book><title>Invalid XML with control character</title></book></catalog>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I'm just being blind here, but... I don't see any control character here?
@@ -0,0 +1,6 @@ | |||
pr: 130337 | |||
summary: Add `XmlProcessor` initial implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is going into the release notes, so we should phrase it in a more user-facing way. Something like "Add xml
ingest processor for parsing XML", maybe.
This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.
⚙️ Configuration Options
🏗️ Architecture
📚 Documentation
Documentation includes:
Logstash differences
ignore_empty_value
behaves a bit different thansuppress_empty
, but I think it matches better with other processors behavior. It could be adapted, or even add both, but I found it confusing.Closes #97364