Description
Is your feature request related to a problem? Please describe.
In unstructured/partition/md.py, the partition_md function converts Markdown to HTML using a fixed extension list: html = markdown.markdown(text, extensions=["tables"]). Because the fenced_code extension is not enabled, fenced code blocks are not recognized as code, so # comment lines inside them are incorrectly parsed as <h1> tags.
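The behavior can be reproduced with Python-Markdown alone, outside of unstructured. This is a minimal sketch, assuming the third-party markdown package is installed; the FENCE helper and the variable names are illustrative only and are not part of partition_md.

```python
import markdown  # third-party Python-Markdown package

# Build the fence marker programmatically so this example contains no nested fences.
FENCE = "`" * 3

# The test input from this issue: a fenced bash block whose first line is a comment.
text = (
    f"{FENCE}bash\n"
    "# create the container\n"
    "docker run -dt --name unstructured "
    "downloads.unstructured.io/unstructured-io/unstructured:latest\n"
    f"{FENCE}"
)

# Current behavior: only "tables" is enabled, so the fence is treated as plain text
# and the comment line is picked up as a level-1 heading.
html_default = markdown.markdown(text, extensions=["tables"])

# Proposed usage: with "fenced_code" enabled, the block is emitted as <pre><code>.
html_fenced = markdown.markdown(text, extensions=["tables", "fenced_code"])

print(html_default)
print(html_fenced)
```

The first result contains an h1 element for the comment line, while the second keeps the comment inside a single code block.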
Describe the solution you'd like
Modify partition_md to accept custom Markdown extensions via kwargs:
- Read the extensions parameter from the function kwargs
- Default to ["tables"] if not specified (backward compatible)
- Pass extensions to markdown.markdown()
- Allow users to handle special cases (e.g. add "fenced_code" for code blocks)
Describe alternatives you've considered
Security Impact: None (parameter addition only)
Backward Compatibility:
- Low risk: Existing calls without the extensions kwarg remain unchanged
- Medium risk: If users already pass extensions in kwargs (unlikely given current usage)
Recommended Implementation Approach:
- Defensive Kwargs Handling:
    def partition_md(
        filename: str | None = None,
        file: IO[bytes] | None = None,
        text: str | None = None,
        url: str | None = None,
        metadata_filename: str | None = None,
        metadata_last_modified: str | None = None,
        **kwargs: Any,
    ) -> list[Element]:
        ....
        # Default to the current extension list so existing callers are unaffected.
        extensions = kwargs.pop("extensions", ["tables"])
        # Fall back to the default if the caller passed anything but a list of strings.
        if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)):
            logging.warning(
                f"Ignoring invalid 'extensions' argument (expected list of strings): {extensions!r}"
            )
            extensions = ["tables"]
        html = markdown.markdown(text, extensions=extensions)
        ....
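To make the fallback behavior easy to unit-test, the validation could be isolated in a small helper. The following is only a sketch: resolve_extensions and DEFAULT_EXTENSIONS are hypothetical names that do not exist in unstructured. Unlike the snippet above, it treats an explicit None as "not specified" without logging a warning. Note also that Python-Markdown accepts Extension instances as well as strings, so a strings-only check like this one would reject them; whether that is desirable is a design decision for the maintainers.

```python
from __future__ import annotations

import logging
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical constant mirroring the current hard-coded behavior.
DEFAULT_EXTENSIONS = ["tables"]


def resolve_extensions(kwargs: dict[str, Any]) -> list[str]:
    """Pop 'extensions' from kwargs and return a validated list of extension names.

    Returns a copy of the default list when the key is missing, None, or invalid.
    """
    extensions = kwargs.pop("extensions", None)
    if extensions is None:
        return list(DEFAULT_EXTENSIONS)
    if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)):
        logger.warning(
            "Ignoring invalid 'extensions' argument (expected list of strings): %r",
            extensions,
        )
        return list(DEFAULT_EXTENSIONS)
    return extensions


# Usage: the key is consumed, so it is not forwarded to downstream helpers.
call_kwargs = {"extensions": ["fenced_code"], "other": 1}
resolved = resolve_extensions(call_kwargs)
print(resolved, call_kwargs)
```

Returning a copy of the default avoids sharing one mutable list across calls.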
Additional context
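The following is the test Markdown text. The two lines are wrapped in a fenced bash code block (the fence markers are visible in the first output below):

    # create the container
    docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
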
The test code before the change:

    from unstructured.partition.md import partition_md

    elements = partition_md(text=text)
    for i, el in enumerate(elements):
        print(f"\n--- Element {i} ---")
        print(f"Type: {type(el).__name__}")
        print(f"Category: {getattr(el, 'category', 'N/A')}")
        print(f"Text: {el.text!r}")

And the outputs:
--- Element 0 ---
Type: Text
Category: UncategorizedText
Text: '```bash'
--- Element 1 ---
Type: Title
Category: Title
Text: 'create the container'
--- Element 2 ---
Type: Text
Category: UncategorizedText
Text: 'docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest ```'
The test code after the change:

    from unstructured.partition.md import partition_md

    elements = partition_md(text=text, **{'extensions': ['fenced_code']})
    for i, el in enumerate(elements):
        print(f"\n--- Element {i} ---")
        print(f"Type: {type(el).__name__}")
        print(f"Category: {getattr(el, 'category', 'N/A')}")
        print(f"Text: {el.text!r}")

And the outputs:
--- Element 0 ---
Type: NarrativeText
Category: NarrativeText
Text: '# create the container\ndocker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest'