Plain text encoded email body is not decoded correctly #98

@arbazkhan002

Describe the bug
A plain-text encoded body is not decoded correctly. In technical terms, a stream whose data encoding is 001E is read as 001F instead. This happens because PROPS_ID_MAP maps stream 1000 (the body tag) to 001F: code here

Expected behavior
A stream with 001E encoding should be read as 001E, not 001F.
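To illustrate the symptom, here is a minimal sketch that does not use msg_parser at all. It assumes the usual meaning of the two type codes in the MSG format: 001E is an 8-bit string and 001F is a UTF-16LE string. The sample bytes and the cp1252 code page are made up for the example.

```python
# Sample body bytes as they might sit in a 001E (8-bit string) stream.
# The content and the cp1252 code page are assumptions for illustration.
raw_body = b"Hello world!"

# Reading the stream as 001E: decode with an 8-bit code page.
as_string8 = raw_body.decode("cp1252")

# Reading the same bytes as 001F: decode as UTF-16LE.
# Each pair of ASCII bytes is fused into one unrelated code point.
as_unicode = raw_body.decode("utf-16-le")

print(repr(as_string8))
print(repr(as_unicode))
```

The second decode succeeds without raising, which is why the problem shows up as garbled text rather than as an error.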

Screenshots
Showing how my stream (directory_name) has both name and data encoding there:
[Screenshot: Screen Shot 2019-12-26 at 10 03 00 PM]

Additional context
Using PROPS_ID_MAP as the reference for property details doesn't work well in practice. It should be used only as a last resort, when directory_entry_name doesn't carry that information (or the encoding is not recognized).
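A rough sketch of what I mean, not the library's actual code. It assumes stream names of the form `__substg1.0_1000001E` (four hex digits of property ID followed by four of type). `FALLBACK_TYPES`, `resolve_type` and `decode_body` are hypothetical names, `FALLBACK_TYPES` only stands in for PROPS_ID_MAP, and cp1252 is an assumed default code page for 001E data.

```python
import re

# Type codes: 001E is an 8-bit string, 001F is a UTF-16LE string.
PT_STRING8 = "001E"
PT_UNICODE = "001F"

# "__substg1.0_" + 4 hex digits (property ID) + 4 hex digits (type).
_STREAM_NAME = re.compile(r"^__substg1\.0_([0-9A-Fa-f]{4})([0-9A-Fa-f]{4})")

# Hypothetical stand-in for PROPS_ID_MAP: property ID -> default type.
FALLBACK_TYPES = {"1000": PT_UNICODE}


def resolve_type(stream_name):
    """Return (prop_id, prop_type), preferring the type in the stream name."""
    match = _STREAM_NAME.match(stream_name)
    if match is None:
        return None, None
    prop_id = match.group(1).upper()
    prop_type = match.group(2).upper()
    if prop_type in (PT_STRING8, PT_UNICODE):
        return prop_id, prop_type
    # Last resort: the type in the name is not recognized, use the map.
    return prop_id, FALLBACK_TYPES.get(prop_id)


def decode_body(stream_name, data, codepage="cp1252"):
    """Decode stream bytes according to the resolved type."""
    _, prop_type = resolve_type(stream_name)
    if prop_type == PT_STRING8:
        return data.decode(codepage, errors="replace")
    if prop_type == PT_UNICODE:
        return data.decode("utf-16-le", errors="replace")
    raise ValueError("cannot determine encoding for " + stream_name)


print(decode_body("__substg1.0_1000001E", b"Hello"))
print(decode_body("__substg1.0_1000001F", "Hello".encode("utf-16-le")))
```

With this ordering the map is consulted only when the stream name gives no usable type, so a 001E body is no longer forced to 001F.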

For more information, this is how I am reading the output:

import email
from io import StringIO
from typing import Optional

from msg_parser import MsOxMessage
from msg_parser.email_builder import EmailFormatter


def get_email_text(msg) -> Optional[str]:
    # Return the decoded payload of the last text/plain part, or None if there is none.
    text = None
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            text = part.get_payload(decode=True).decode('utf-8')
    return text


# Convert the .msg file to EML text, then parse it with the standard email module.
textfile = <path_to_msg_file>
msg_obj = MsOxMessage(textfile)
email_obj = EmailFormatter(msg_obj)
eml_content = email_obj.build_email()
text = get_email_text(email.message_from_file(StringIO(eml_content)))
