e-mail messages with application/rtf body are imported as attachments, not message body #3897

Open
vsessink opened this issue May 2, 2024 · 5 comments
Labels
bug (Things that should work, but don’t), ingest-file, Moderate (Issue that may require attention)

Comments

Contributor

vsessink commented May 2, 2024

While importing an e-mail archive in the (IMHO cursed) .PST format, I came across a mailbox in which every message has application/rtf as its body type.

Content-Type: application/rtf
Content-Transfer-Encoding: base64
Content-Disposition: attachment; 
        filename*=utf-8''rtf-body.rtf;
        filename="rtf-body.rtf"

Yep, that's right: Content-Disposition: attachment, but still this is the actual e-mail body.

In Aleph, these messages show up as empty, with the rtf-body.rtf document as an attachment.

I tried to work around it by unpacking the mail archive manually with readpst, then fixing the messages with a small Python script that essentially replaces the RTF part with an HTML part. I used Python's email.parser and simply checked whether the first content type is application/rtf; if so, I pipe that part through unrtf and repack the message. Filthy, but it works for the mailbox itself.

This workaround does not help in Aleph, because the MIME detection wizardry afterwards recognizes the repacked message as text/html instead of message/rfc822, and the actual attachments of the message are no longer recognized.

The latter may count as a separate bug: a message that starts with the following should IMHO not be detected as text/html?

Status: RO
User-Agent: none
From: "Firstname Lastname" <MAILER-DAEMON>
Subject: FW: Ticket 08-05
To:  Name (Company Name)
Date: Tue, 09 May 2022 14:41:50 +0000
Message-Id: <AM5PR04MB53161D02E214FEEC35C2156FB6466@AM0PR04MB9122.eurprd02.prod.outlook.com>
X-libpst-forensic-sender: /O=EXCHANGELABS/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=09C4BB2213F35544FEBBBBF1FD14B522
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="--boundary-LibPST-iamunique-887075155_-_-"


----boundary-LibPST-iamunique-887075155_-_-
Content-Type: text/html; charset="utf-8"

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
Contributor Author

vsessink commented May 2, 2024

This one (previously WIP, now abandoned) mentions the same problems: alephdata/ingest-file#20

Where is the MIME detection done? I think it could work to fix the output of readpst, by adding transport headers or otherwise, and to fix the RTF parts of the e-mails as well. As I already walk over all e-mails to fix the RTF parts, adding the headers required for MIME detection (message/rfc822 instead of text/html) could be done in the same pass.

Contributor Author

vsessink commented May 2, 2024

Messages can pretty easily be "tricked" into being message/rfc822, by simply adding Received: from localhost (127.0.0.1) at the top of the message. IMHO as a workaround for the current state of things, this could be done right after readpst. I will investigate.
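
A minimal sketch of that trick, assuming the files written by readpst are plain RFC 822 text. The function name `add_received_header` is hypothetical, and the header value is simply the one quoted above; whether a particular MIME detector accepts it has not been verified here.

```python
# Header text taken from the comment above; that it is enough for a given
# MIME detector is an assumption.
RECEIVED = b"Received: from localhost (127.0.0.1)"


def add_received_header(raw: bytes) -> bytes:
    """Return raw with a Received: header prepended, unless one is already first."""
    if raw[:9].lower() == b"received:":
        return raw
    # Reuse the message's own line ending so the header block stays consistent.
    first_eol = raw.find(b"\n")
    crlf = first_eol > 0 and raw[first_eol - 1:first_eol] == b"\r"
    eol = b"\r\n" if crlf else b"\n"
    return RECEIVED + eol + raw
```

This could be applied to every file right after readpst has run, reading and writing the files in binary mode so that the rest of the message stays byte-for-byte unchanged.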

Contributor Author

vsessink commented May 3, 2024

In order to fix messages that have an RTF-only message body, I'm manually starting a Python script:

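#!/usr/bin/python3
# Replace an RTF-only message body with an HTML body produced by unrtf.
import subprocess
import sys
from email.parser import BytesParser
from email.policy import default

# Do not refold the original headers when the message is written back.
plcy = default.clone(refold_source='none')
for fname in sys.argv[1:]:
  try:
    with open(fname, 'rb') as mail:
      msg = BytesParser(policy=plcy).parse(mail)
  except OSError:
    print(fname, "could not be read.")
    continue
  # walk() yields the container first, then its parts in document order.
  totaal = list(msg.walk())
  if len(totaal) < 2:
    continue
  if totaal[1].get_content_type() == 'application/rtf':
    print("Converting", fname)
    # unrtf reads the RTF from stdin and writes HTML to stdout.
    html = subprocess.run(['/usr/bin/unrtf'], input=totaal[1].get_content(), capture_output=True).stdout
    # set_content() drops the old Content-* headers, including the attachment disposition.
    totaal[1].set_content(html, maintype='text', subtype='html')
    try:
      # Write bytes so non-ASCII content does not depend on the locale encoding.
      with open(fname, 'wb') as mail:
        mail.write(msg.as_bytes())
    except OSError:
      print("Error writing", fname)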

It's a hack. But it works and it really helps the search process. This could be run right after readpst but I really don't think this is production quality. Anyway, maybe it helps someone make a proper fix.

@Rosencrantz Rosencrantz added bug Things that should work, but don’t Moderate Issue that may require attention labels Jun 4, 2024
Contributor Author

vsessink commented Jul 2, 2024

OK, here's more analysis and an awful corner case. I'm documenting it here because I don't think there's a better place. I unpacked a PST file with the regular readpst -e -D -8 -cv. One of the messages contains two message/rfc822 attachments that themselves have an rtf-body.rtf part as their body, so the script above should be made recursive.
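
A recursive variant might look roughly like the sketch below. It relies on the fact that `walk()` in Python's email package also descends into parsed message/rfc822 parts. The names `convert_rtf_bodies` and `unrtf_to_html` are hypothetical, and matching on the filename rtf-body.rtf (so that genuine RTF attachments are left alone) is an assumption that goes beyond the original script.

```python
import subprocess


def unrtf_to_html(rtf: bytes) -> str:
    """Convert RTF to HTML by piping it through unrtf (assumed to be on PATH)."""
    result = subprocess.run(['unrtf'], input=rtf, capture_output=True, check=True)
    return result.stdout.decode('utf-8', errors='replace')


def convert_rtf_bodies(msg, rtf_to_html=unrtf_to_html) -> int:
    """Replace every rtf-body.rtf part, at any nesting depth, with text/html.

    msg must be parsed with a modern policy (e.g. email.policy.default) so
    that its parts offer get_content() and set_content().
    Returns the number of parts that were converted.
    """
    converted = 0
    # Materialise the walk first, because parts are modified while iterating.
    for part in list(msg.walk()):
        if (part.get_content_type() == 'application/rtf'
                and part.get_filename() == 'rtf-body.rtf'):
            html = rtf_to_html(part.get_content())
            # set_content() clears the old Content-* headers of this part only.
            part.set_content(html, subtype='html')
            converted += 1
    return converted
```

Passing the converter in as a parameter keeps the traversal testable without unrtf installed. Nested messages are only reached if the parser recognised them as messages in the first place.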

The awful part of the finding is that my attachments begin with

Content-Type: message/rfc822

>From "[email protected]" Tue Oct  4 14:22:48 2023

That shouldn't happen: the readpst man page says of -e, "This format has no from quoting" (where from quoting means prefixing the word From with a > character). However, it apparently does happen anyway. You must remove the >, otherwise the EmailMessage is interpreted wrongly.
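
A minimal sketch of removing that `>` on the raw bytes before parsing. The function name `unquote_embedded_from` is hypothetical, and the pattern assumes the quoted line directly follows the blank line that ends the headers of a message/rfc822 part, as in the excerpt above; other `>From` lines are left untouched.

```python
import re

# Assumption: ">From " appears right after the header block of a part whose
# Content-Type is message/rfc822.
_QUOTED_FROM = re.compile(
    rb'(^Content-Type:[ \t]*message/rfc822[^\r\n]*\r?\n'  # the part's Content-Type
    rb'(?:[^\r\n]+\r?\n)*'                                 # any further part headers
    rb'\r?\n)'                                             # blank line ending them
    rb'>(From )',                                          # the wrongly quoted line
    re.IGNORECASE | re.MULTILINE)


def unquote_embedded_from(raw: bytes) -> bytes:
    """Strip the '>' from a quoted From line at the start of an embedded message."""
    return _QUOTED_FROM.sub(rb'\1\2', raw)
```

With the `>` gone, the embedded message starts with a regular mbox-style From line, which Python's email parser treats as the envelope line instead of as body text.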

Contributor Author

vsessink commented Jul 2, 2024

Looking briefly, the From quoting problem is in readpst, libpst/src/readpst.c, where write_embedded_message writes out messages with the "From" quoting parameter ("embedding") being 1.

write_normal_email(f_output, "", item, MODE_NORMAL, 0, pf, save_rtf,
 1, extra_mime_headers);

@stchris stchris transferred this issue from alephdata/ingest-file Oct 21, 2024