E-mail messages sometimes detected as text/html or text/plain (instead of message/rfc822) #3895

Open
vsessink opened this issue May 3, 2024 · 5 comments
Labels: bug (Things that should work, but don’t), ingest-file, Moderate (Issue that may require attention)

Comments

@vsessink
Contributor

vsessink commented May 3, 2024

As reported in #3897: mail files sometimes end up being recognized as either text/html or text/plain. This happens for example when ingesting .pst files: their outgoing mail messages don't have Received: headers but instead seem to start with a header Status: RO.

@vsessink
Contributor Author

vsessink commented May 3, 2024

Analysis

Please note that the root cause of this problem is libmagic, which is a we-don't-know-how-it-works-but-it-seems-to-work kind of file type / MIME type detection. It can do wonders, but it can also get things horribly wrong.

A proper fix would make use of the fact that readpst writes its e-mails with a clear .eml file name extension, so we already know that they're message/rfc822. The step that ingests the resulting files should be told the MIME type instead of re-detecting it (and getting it wrong). But that's beyond scope here.
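To illustrate the idea only: a minimal sketch of letting a trusted extension override whatever was detected. The names `TRUSTED_EXTENSIONS` and `resolve_mime_type` are made up for this example and are not part of ingest-file.

```python
from pathlib import Path

# Hypothetical mapping: extensions whose MIME type is already known from the
# tool that produced the file (readpst names every extracted message *.eml).
TRUSTED_EXTENSIONS = {".eml": "message/rfc822"}


def resolve_mime_type(path, detected_mime_type):
    """Prefer the MIME type implied by a trusted extension over the detected one."""
    suffix = Path(path).suffix.lower()
    # Fall back to the detected type for every extension we do not trust.
    return TRUSTED_EXTENSIONS.get(suffix, detected_mime_type)
```

With this, `resolve_mime_type("outbox/12.eml", "text/plain")` yields `message/rfc822`, while files with other extensions keep the detected type.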

Workaround

Importing PST archives by hand works best as follows:

  • use readpst as used in ingestors/email/outlookpst.py, i.e. readpst -e -D -8 -cv
  • libmagic will detect message/rfc822 if a message begins with a Received header. This apparently doesn't need to be a proper RFC 2822-compliant header; just adding Received: from localhost (127.0.0.1) on top of the message will do.
  • Thus, a simple script to only fix the problematic messages could be:
find . -type f -name '*.eml' -print0 | xargs -0 file --mime-type | grep -v message/rfc822 | cut -f1 -d: | while IFS= read -r f; do sed -i '1iReceived: from localhost (127.0.0.1)' "$f"; done
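For anyone who prefers to do this from Python, here is a rough equivalent. It does not call libmagic at all; instead it assumes, based on the observation above, that a message is only problematic when it does not begin with a Received header. The function name `fix_eml_files` is hypothetical, and each file is read into memory in full, so this is a sketch for modest mailboxes rather than a drop-in tool.

```python
from pathlib import Path

# Placeholder header taken from the workaround above.
HEADER = b"Received: from localhost (127.0.0.1)\n"


def fix_eml_files(root):
    """Prepend HEADER to every *.eml under root that lacks a leading Received header.

    Returns the list of paths that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.eml")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Assumption: instead of asking libmagic, treat a message as fine
        # when it already starts with "Received:" (case-insensitive).
        if data[:9].lower() == b"received:":
            continue
        path.write_bytes(HEADER + data)
        changed.append(path)
    return changed
```

Running it twice is harmless: after the first pass every message starts with a Received header, so the second pass changes nothing.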

Fix (Dirty)

  • A simpler fix might be to add Received: from localhost (127.0.0.1) to every message, right after calling readpst: find . -type f -name '*.eml' -print0 | xargs -0 sed -i '1iReceived: from localhost (127.0.0.1)'

Please note that I do not know what happens if an Outlook / Exchange mailbox would contain an actual attachment with the name 123.eml. Does readpst work around this? Does it overwrite the 123.eml mail message? The above script would surely "enhance" this e-mail-attachment, too - even if it weren't an actual .eml file. But that's for another time.

@Rosencrantz added the bug and Moderate labels on Jun 4, 2024
@tillprochaska
Contributor

Hi @vsessink, thanks for the detailed analysis, this is really, really helpful! I agree that the proper solution here would be either

  • to pass the correct mime type to the DirectoryIngestor as the preferred MIME type or
  • to select the ingestor based on detected MIME type and the extension (where we currently select the ingestor based on detected MIME type and only fall back to the extension if no MIME type could be detected).
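The second option could look roughly like the sketch below. Everything in it is hypothetical: `IngestorSpec`, `pick_ingestor`, the example specs and the weights are stand-ins to show the scoring idea, not the actual ingest-file classes or their selection logic.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class IngestorSpec:
    # Hypothetical stand-in for an ingestor class; not the ingest-file API.
    name: str
    mime_types: frozenset
    extensions: frozenset


def pick_ingestor(specs, path, detected_mime_type):
    """Return the name of the best-scoring spec, or None if nothing matches."""
    suffix = Path(path).suffix.lower().lstrip(".")
    best_name, best_score = None, 0
    for spec in specs:
        score = 0
        if detected_mime_type in spec.mime_types:
            score += 1
        if suffix in spec.extensions:
            # Assumed weighting: the extension counts for more, because
            # detection can mislabel mail files as text/plain or text/html.
            score += 2
        if score > best_score:
            best_name, best_score = spec.name, score
    return best_name


SPECS = [
    IngestorSpec("plaintext", frozenset({"text/plain"}), frozenset({"txt"})),
    IngestorSpec("html", frozenset({"text/html"}), frozenset({"html", "htm"})),
    IngestorSpec("email", frozenset({"message/rfc822"}), frozenset({"eml"})),
]
```

With these weights, an .eml file that was detected as text/plain still goes to the email spec, while a .txt file detected as text/plain stays with the plain text spec. Whether the extension should outweigh the detected type in general is a design choice that would need discussion.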

@vsessink
Contributor Author

vsessink commented Jun 4, 2024

Please note that the detected MIME type is the very problem, as "detected" means using libmagic.

@vsessink
Contributor Author

BTW, @tillprochaska would it be possible to get a Slack account? I don't really qualify :-( as I'm an open source guy without investigative journalists as customers; I do have a small law firm as a customer and they have a couple of cases emerging from the Luanda Leaks, but still, this isn't a non-profit use. But I'm willing to help the project. (And ATM, I'm having problems setting up Aleph in the non-developer version; I don't think my question about that qualifies as a bug, it's more of a mailing list question, but you don't seem to have one, or do you?)

@Rosencrantz
Contributor

Hi @vsessink

Let's start by getting you an account on our Discourse server. This is where we like to handle support requests, as it serves well as a repository of information for all. Slack has an immediacy which is nice, but it struggles with longevity.

You can create an account here https://aleph.discourse.group


4 participants