Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handling encodings #2

Open
karenetheridge opened this issue Mar 16, 2014 · 5 comments
Open

handling encodings #2

karenetheridge opened this issue Mar 16, 2014 · 5 comments
Assignees

Comments

@karenetheridge
Copy link
Member

Here is my (very early in the morning) assessment about what we should do regarding scanning the file and encodings (related issue: https://rt.cpan.org/Ticket/Display.html?id=78434)

  • we need different interfaces for new-from-string, new-from-file, new-from-handle. new-from-string will assume decoded characters. new-from-file takes an encoding parameter, and uses binmode(":encoding($encoding)") on the handle. new-from-handle assumes that the caller has already set encoding layers appropriately.
  • new-from-file may not have been provided an encoding parameter. in this case we start reading assuming ascii. If we encounter a 'use utf8;' directive, we back up and start over with the correct encoding applied. We will also flag a warning (error?) if we encounter a 'use utf8' directive when we were explicitly passed a different encoding parameter than 'utf8' (UTF-8?).
  • pod can have a different encoding than the code. when starting to read pod, and we don't already have an encoding supplied (can we assume that new-from-string is decoded and therefore exempt ourself from this step?) and we encounter an '=encoding ...' directive, we decode the extracted pod lines as they are read. Test cases need to consider corpus examples where the code is in a different encoding than the (interleaved) pod (can this happen in any situation where the code is not ascii, which is utf8-compatible?)

Am I missing anything?

@karenetheridge
Copy link
Member Author

Note - that the moment, the current interfaces are new_from_file, new_from_handle, and new_from_module (which uses $PERL5LIB and assumes the usual module-to-filename semantics), and totally ignorant of encodings. Tools such as Dist::Zilla that are encoding-aware have difficulty working with these interfaces - the only one it can really use is new_from_handle and set the encoding layer itself, but this is not convenient if the file is in memory already (as scalarrefs can't be opened on wide characters) -- so in this case, ->new_from_string($file->decoded_content) would be nicest.

Also - we probably don't need the back-up-and-start-reading-again logic, as (needs to be verified?) Perl doesn't even do this, and we should be able to assume that anything before use utf8 or =encoding utf8 is valid ascii. (Does this also hold for UTF-16?)

@dagolden
Copy link

On Sun, Mar 16, 2014 at 8:45 AM, Karen Etheridge
[email protected]:

we need different interfaces for new-from-string, new-from-file,
new-from-handle. new-from-string will assume decoded characters.
new-from-file takes an encoding parameter, and uses
binmode(":encoding($encoding)") on the handle. new-from-handle assumes that
the caller has already set encoding layers appropriately.

I agree except for from-file, which I comment on later.

new-from-file may not have been provided an encoding parameter. in
this case we start reading assuming ascii. If we encounter a 'use utf8;'
directive, we back up and start over with the correct encoding applied. We
will also flag a warning (error?) if we encounter a 'use utf8' directive
when we were explicitly passed a different encoding parameter than
'utf8' (UTF-8?).

How about slurping it raw fully and then working with the resulting data
rather than "backing up"? I also wonder whether we can reliably detect
"use utf8;" without implement a subset of perl's parsing rules. E.g.:

# don't use utf8 because blah blah blah
$foo = <<HERE;
always use utf8 when opening handles
HERE

We also have to consider a pod encoding directive in absence of "use utf8"
as indicative.

pod can have a different encoding than the code. when starting to read
pod, and we don't already have an encoding supplied (can we assume that
new-from-string is decoded and therefore exempt ourself from this step?)
and we encounter an '=encoding ...' directive, we decode the extracted pod
lines as they are read. Test cases need to consider corpus examples where
the code is in a different encoding than the (interleaved) pod (can this
happen in any situation where the code is not ascii, which is
utf8-compatible?)

I suspect that in practice having intentionally different encoding in
code and pod portions is unlikely. I would go further and say that p5p
should make it an error.

@rjbs
Copy link
Member

rjbs commented Mar 18, 2014

I agree with the last bit (which is not to say I disagree with anything else): we should document that =encoding is provided so that a Pod parser can be told the file's encoding, and that it should be the file's encoding, not an encoding applied only to the Pod sections. That'd just be crazy.

@SineSwiper
Copy link

Various meta pragmas "use utf8" indirectly, so trying to parse through Perl code to see if it's actually using it seems plagued with pitfalls. Using something like Encode::Guess might be a better bet, but user-specified encoding trumps all of that. If a user says it's Latin1, then we should consider it to be Latin1, and not question that decision unless Perl just plain can't read it.

@karenetheridge
Copy link
Member Author

On Thu, Mar 27, 2014 at 08:45:18PM -0700, Brendan Byrd wrote:

Various meta pragmas "use utf8" indirectly, so trying to parse through Perl code to see if it's actually using it seems plagued with pitfalls. Using something like Encode::Guess might be a better bet, but user-specified encoding trumps all of that. If a user says it's Latin1, then we should consider it to be Latin1, and not question that decision unless Perl just plain can't read it.

Yes, I think this comes back to "simplicity trumps all", especially since
this module needs to be 1. core only, and 2. fast in the most common cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants