feat/Support custom extensions for partition_md #4006

@Chenyl-Sai

Description

Is your feature request related to a problem? Please describe.
In unstructured/partition/md.py, partition_md converts Markdown to HTML with a fixed extension list: html = markdown.markdown(text, extensions=["tables"]). Because the "fenced_code" extension is not enabled, fenced code blocks are not recognized, so # comment lines inside them are incorrectly parsed as <h1> headings.
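The behavior can be reproduced with Python-Markdown alone, without going through partition_md. The snippet below is a minimal sketch that assumes the markdown package is installed; the variable names are illustrative, and the sample text is the fenced block from this issue.

```python
import markdown  # Python-Markdown, the library partition_md calls

# A fenced bash block whose first line is a shell comment.
text = (
    "```bash\n"
    "# create the container\n"
    "docker run -dt --name unstructured "
    "downloads.unstructured.io/unstructured-io/unstructured:latest\n"
    "```"
)

# Current behavior: only "tables" is enabled, so the fence is plain text
# and the comment line is treated as a level-1 heading.
html_default = markdown.markdown(text, extensions=["tables"])

# With "fenced_code" enabled, the block is kept together as code.
html_fenced = markdown.markdown(text, extensions=["tables", "fenced_code"])

print(html_default)
print(html_fenced)
```

The first result contains an <h1> element for the comment line, while the second wraps the whole block in <pre><code>.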

Describe the solution you'd like
Modify partition_md to accept custom Markdown extensions via kwargs:

  1. Read extensions parameter from method kwargs
  2. Default to ["tables"] if not specified (backward compatible)
  3. Pass extensions to markdown.markdown()
  4. Allow users to handle special cases (e.g. add "fenced_code" for code blocks)

Describe alternatives you've considered
Security Impact: None (parameter addition only)
Backward Compatibility:

  • Low risk: Existing calls without extensions kwarg remain unchanged
  • Medium risk: If users already pass extensions in kwargs (unlikely given current usage), the value would now be consumed by partition_md instead of being passed through

Recommended Implementation Approach:

  1. Defensive Kwargs Handling:
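    import logging
    from typing import IO, Any

    import markdown

    from unstructured.documents.elements import Element

    def partition_md(
        filename: str | None = None,
        file: IO[bytes] | None = None,
        text: str | None = None,
        url: str | None = None,
        metadata_filename: str | None = None,
        metadata_last_modified: str | None = None,
        **kwargs: Any,
    ) -> list[Element]:
        # ... existing logic that loads `text` ...
        # Default to the current hard-coded list for backward compatibility.
        extensions = kwargs.pop("extensions", ["tables"])
        # Fall back to the default unless the value is a list of strings.
        if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)):
            logging.warning(
                "Ignoring invalid 'extensions' argument (expected list of strings): %r",
                extensions,
            )
            extensions = ["tables"]
        html = markdown.markdown(text, extensions=extensions)
        # ... existing logic that partitions `html` into elements ...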
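The validation step above can also be pulled into a small function and tested in isolation. This is only a sketch: resolve_extensions and DEFAULT_EXTENSIONS are hypothetical names chosen for illustration and are not part of unstructured.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical constant mirroring the list currently hard-coded in partition_md.
DEFAULT_EXTENSIONS = ["tables"]


def resolve_extensions(kwargs: dict[str, Any]) -> list[str]:
    """Pop 'extensions' from kwargs and return a validated list of names.

    Hypothetical helper: falls back to the default when the key is missing
    or the value is not a list of strings.
    """
    extensions = kwargs.pop("extensions", DEFAULT_EXTENSIONS)
    if isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions):
        # Return a copy so callers cannot mutate the shared default.
        return list(extensions)
    logger.warning(
        "Ignoring invalid 'extensions' argument (expected list of strings): %r",
        extensions,
    )
    return list(DEFAULT_EXTENSIONS)
```

Note that with this approach a caller-supplied list replaces the default rather than extending it, so a caller who wants both behaviors would pass ["tables", "fenced_code"].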

Additional context
The following is the test Markdown text (in the original it is wrapped in a ```bash fenced code block, as the outputs below show):

# create the container
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest

And the test code before the modification:

from unstructured.partition.md import partition_md
elements = partition_md(text=text)
for i, el in enumerate(elements):
    print(f"\n--- Element {i} ---")
    print(f"Type: {type(el).__name__}")
    print(f"Category: {getattr(el, 'category', 'N/A')}")
    print(f"Text: {el.text!r}") 

And the outputs:

--- Element 0 ---
Type: Text
Category: UncategorizedText
Text: '```bash'

--- Element 1 ---
Type: Title
Category: Title
Text: 'create the container'

--- Element 2 ---
Type: Text
Category: UncategorizedText
Text: 'docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest ```'

Here is the test code after the modification:

from unstructured.partition.md import partition_md
elements = partition_md(text=text, extensions=["fenced_code"])
for i, el in enumerate(elements):
    print(f"\n--- Element {i} ---")
    print(f"Type: {type(el).__name__}")
    print(f"Category: {getattr(el, 'category', 'N/A')}")
    print(f"Text: {el.text!r}") 

And the outputs:

--- Element 0 ---
Type: NarrativeText
Category: NarrativeText
Text: '# create the container\ndocker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest'

Labels: enhancement (New feature or request)