Got "malloc(): unaligned tcache chunk detected Aborted (core dumped)" while using add_redact_annot/apply_redactions #3758

Open
JiahuanChen opened this issue Aug 8, 2024 · 6 comments
Labels: fix developed (release schedule to be determined), upstream bug (bug outside this package)

Comments

@JiahuanChen
JiahuanChen commented Aug 8, 2024

Description of the bug

I was trying to remove all text from PDF files. My Python script looks like the following:

for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                # print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
    page.apply_redactions()

This code works well, but note the commented-out "# print(span)" line. If I uncomment it and print the spans, I get "malloc(): unaligned tcache chunk detected" followed by "Aborted (core dumped)".

This is really strange to me.

Do I need to upload the PDF file or any other information? The files contain personal information, so to be honest I would rather not upload them.

How to reproduce the bug

Simply uncomment the print line to reproduce the bug; with the line commented out, the script runs fine.

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.10

@JorjMcKie
Collaborator

You can send me the file by email, so it won't be exposed here.
Is this the only file showing the problem?
I am also a little confused:
Why extract the text at all if you want to remove it anyway? You can simply add one redaction annotation covering the full page.
But you should pass options to apply_redactions that prevent removal of images and graphics.
You currently don't do that, although your text might overlap such objects...
Anyway, we cannot follow up on the problem without a file at hand.
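
For reference, the full-page approach suggested above might look like the following minimal sketch. The helper name redact_text_only is made up for illustration. It assumes a PyMuPDF version whose apply_redactions accepts the images and graphics keywords, where a value of 0 means that images or vector graphics are left untouched.

```python
def redact_text_only(page):
    """Remove all text on a PyMuPDF page with one page-sized redaction.

    Hypothetical helper, not part of PyMuPDF. `page` is assumed to be a
    pymupdf.Page (or any object offering the same two methods).
    """
    # A single annotation covering the whole page replaces one
    # annotation per text span.
    page.add_redact_annot(page.rect)
    # images=0 and graphics=0 ask apply_redactions to ignore images and
    # vector graphics, so only the text under the annotation is removed.
    page.apply_redactions(images=0, graphics=0)
```

With a real document this would be called once per page, for example "for page in doc: redact_text_only(page)", followed by saving the document.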

@JiahuanChen
Author

Hello, I just sent you an email with the problem file. It is the only file showing the problem.

By the way, if I call apply_redactions right after each add_redact_annot, the code works well: no error, and the result is correct.

for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
                page.apply_redactions()

@JorjMcKie
Collaborator

Thanks for the file.
I was able to reproduce the problem, but only under Linux: it runs fine under Windows. By the way, I used the following simplified script. There is no need to produce a JSON string that you immediately convert back to a Python dictionary.
Also note that there is no need to convert 4-tuples to rectangles: all PyMuPDF methods detect Python sequences where points, rectangles or matrices are expected and do the necessary conversions.

import pymupdf


doc = pymupdf.open("test.pdf")
page = doc[0]
blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
spans = [s for b in blocks for l in b["lines"] for s in l["spans"]]
for s in spans:
    page.add_redact_annot(s["bbox"])
page.apply_redactions()
print(f"{len(spans)} annots created")
doc.ez_save("redacted.pdf")

This script runs under Windows, but gets the malloc error under Linux.

So how do you want to proceed? We will need to get the MuPDF team involved for a solution, so they would also need the reproducing file, for which I need your OK.
Of course, PyMuPDF and MuPDF are both maintained by the same company, Artifex, so confidentiality is ensured in any case.

@JiahuanChen
Author

Yes sure, you could share the file with your team.

Thank you for the improved code.

@wapiflapi

I'm seeing the same issue on some documents. Unfortunately I'm not able to share them.

Is there a place where we can follow the progress of this issue on the MuPDF side?

In the meantime, has anyone found a workaround for this issue when it happens?

@julian-smith-artifex-com
Collaborator

We have a fix for the problem in MuPDF.

I don't yet know when this will be available for use in a PyMuPDF release.

@julian-smith-artifex-com added the labels "fix developed" (release schedule to be determined) and "upstream bug" (bug outside this package) on Oct 28, 2024
julian-smith-artifex-com added a commit that referenced this issue Oct 28, 2024
For #3758. Note that the input file is not public so this test does nothing if
it is not present.
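
The "does nothing if the file is not present" behaviour described in the commit message could be sketched roughly as follows. This is not the code from the referenced commit: the helper name run_if_present and its signature are assumptions made purely for illustration.

```python
import os


def run_if_present(path, check):
    """Run check(path) only if the input file exists.

    Hypothetical helper. Returns True if the check ran and False if it
    was skipped because the (non-public) input file is missing.
    """
    if not os.path.exists(path):
        # The file is not distributed publicly, so silently do nothing.
        print(f"Not running because file not found: {path}")
        return False
    check(path)
    return True
```

A test built this way passes trivially on machines without the private file, while still exercising the redaction code wherever the file is available.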
chris-liddell pushed a commit to ArtifexSoftware/mupdf that referenced this issue Oct 29, 2024