Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: NameObject doesn't handle all values correctly #2601

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

Rak424
Copy link
Contributor

@Rak424 Rak424 commented Apr 16, 2024

Name object encodes all characters from 1 to 255, so it's not correct to use utf-8.

7.3.5 Name objects
Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0).

@pubpub-zz
Copy link
Collaborator

@Rak424
on page 18 of PDF32000-1:2008 you will see
image

The 8bits only encoding will be done with Latin1

@Rak424 Rak424 changed the title NameObject wrong encoding BUG: NameObject doesn't handle all values correctly Apr 17, 2024
@pubpub-zz
Copy link
Collaborator

@Rak424
please provide a pdf with the issue you are trying to fix. without I can not consider your PR as acceptable

@art049
Copy link

art049 commented Apr 17, 2024

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

image

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

Copy link

codecov bot commented Apr 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.92%. Comparing base (6d9a7ec) to head (508c82c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2601      +/-   ##
==========================================
- Coverage   94.92%   94.92%   -0.01%     
==========================================
  Files          50       50              
  Lines        8316     8309       -7     
  Branches     1667     1665       -2     
==========================================
- Hits         7894     7887       -7     
  Misses        261      261              
  Partials      161      161              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Collaborator

We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance.

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Apr 18, 2024

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

image

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions.
Are you dealing with decoding? Encoding?

@art049
Copy link

art049 commented Apr 18, 2024

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

It may seem small, but with over 4.5M downloads per month, I think the cumulated impact of such a change is significant.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions.
Are you dealing with decoding? Encoding?

Yes maybe this speedup means there is a regression somewhere. Overall the bench includes encoding afaik:
image

@Rak424
Copy link
Contributor Author

Rak424 commented Apr 18, 2024

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:

NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')

utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

@pubpub-zz
Copy link
Collaborator

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

b'/\xc3\xbf'.decode("utf-8") returns properly ÿ which is equivalent to b'/\xff'.decode("latin-1")

for me there is no error.

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:

NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')

utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

remember thatthe current code provides with NameObject.CHARSETS a way to add / remove charsets from the possible solutions and the default, based on other PRs uses the utf-8, gbk and then latin-1. all these cases come from actual PDFs. have you experienced any actual issues with some PDF?

@Rak424
Copy link
Contributor Author

Rak424 commented Apr 19, 2024

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value, you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255, and current implementation of name object only allows str and not bytes in constructor, so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless. I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf. Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

@pubpub-zz
Copy link
Collaborator

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value,

you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255and current implementation of name object only allows str and not bytes in constructor.

NameObjects are actually string like : the proof is about using the encoding with the '#' character. The byte encoding is just a matter of file encoding.

so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless.

This fix was introduced years ago. I' trying some archeology to find it

I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf.

Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

I agree that the current implementation does not match the specification:
image

The historical approach in pypdf is more permissive and as we are rewriting all the names with PdfWriter, there is no issue with readers.
Moving to a full compliant solution may be introduce lots of regression that will be very tricky to detect (NameObjects are a sort of lazy linking)
As long as this PR does not fix an existing issue I'm not fond of it.

@MartinThoma / @MasterOdin / @stefan6419846
can you give your opinions?

@stefan6419846
Copy link
Collaborator

I have to admit that it is hard to follow this for me. If I understand it correctly, there are some different interpretations of the PDF specs regarding name objects.

In my understanding, names are just an arbitrary list of bytes. An application has to ensure a consistent handling of them for matching purposes, but as they are mostly irrelevant to the user and primarily about the internal structure, there are no further rules except for predefined sequences, for example as part of dictionary keys.

The newly introduced test sequences from ISO 32000-2:2020, Table 4 pass with the old implementation as well, thus I do not see any real need for changing anything about this unless there is a real example where the current implementation fails and the new one does not.

@Rak424
Copy link
Contributor Author

Rak424 commented Apr 29, 2024

to be more clear, here is two examples that can lead to problems hard to detect:

  1. name value is modified after reading/writing operations:
bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/DocuSign#AE"), None).write_to_stream(bio)
print(bio.getvalue())  # > b'/DocuSign#C2#AE' != b"/DocuSign#AE"
  1. all delimiter characters should be escaped, but it's not the case:
bio = BytesIO()
NameObject("/#()<>[]{}/%").write_to_stream(bio)
print(bio.getvalue())  # > b'/#23#28#29<>[]{}#2F#25'

@pubpub-zz pubpub-zz added the needs-discussion The PR/issue needs more discussion before we can continue label Apr 30, 2024
@pubpub-zz
Copy link
Collaborator

@Rak424 can you provide actual PDF and code that shows some issues and not unitary.
About the two examples you are providing,

  1. even if the binary encoding are different, all names will be encoded in the same way and should match
  2. the PDF specification states:

"Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation."
the current implementation is valid

@Rak424
Copy link
Contributor Author

Rak424 commented May 5, 2024

hi

about example 2, from spec:

c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
All characters except the white-space characters and delimiters are referred to as regular characters.

It's not a huge problem since it's easy to any parser to handle it, but could be a problem with an extremely strict parser.

The main problem is on example 1:
The original value is b'DocuSign\xae' who has no UTF8 representation and corresponds to 'DocuSign®' in latin1
The final value is b'DocuSign\xc2x\ae' who corresponds to 'DocuSign®' in utf8 and 'DocuSign®' in latin1

So, it means that if you append a pdf, it will creates a modification on these kinds of values. On a signed pdf where you add revocation infos, it will create a modification even if you don't modify the document.

@pubpub-zz
Copy link
Collaborator

@Rak424
I see your example. However both b'DocuSign\xc2\xae' and b'DocuSign\xae' are translated into 'DocuSign®' NameObject.
Again I agree that the 2 forms are binary different so could be considered as different whereas pypdf will consider them as identical, but I do not see actual use cases where both versions will be used in the same PDF.
If a document uses version 2, all instances will be translated for version 1 and the document will be properly read.

If we do a change to keep the bytes internally, b'DocuSign\xae' will be stored, but without changing the arguments, NameObject('DocuSign®') will get internally b'DocuSign\xc2\xae' and matching will not work anymore.
This will require lots of change and will break the API and induce lots of side effects in all programs. This is not acceptable

up to now you have not reported high level PDF that are failing. without I would consider your report as a acceptable deviation of pypdf to the PDF specification

@Rak424
Copy link
Contributor Author

Rak424 commented May 5, 2024

do you understand that b'DocuSign\xc2\xae' is not b'DocuSignx\ae' ?

@pubpub-zz
Copy link
Collaborator

I do if you consider a full compliance to the spec. however as said I do not consider that both can be met at the same time. Also we need a way to build a NameObject from an str that can match one or the other, or both.

Please provide a PDF file showing some issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion The PR/issue needs more discussion before we can continue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants