
Introduce Encoding parametric singleton type #9

Open · nalimilan wants to merge 2 commits into master from nl/types
Conversation

@nalimilan (Member)

First step towards efficient encoders for common encodings,
as well as towards providing information about encodings.

This also allows adding convenience methods to Base I/O functions taking
an additional encoding parameter, without risking ambiguities.

See the new tests for an illustration of the API.

@ScottPJones What do you think of this PR? I've tried implementing most of the features from quinnj/Strings.jl#3, but with a parametric singleton type Encoding. This allows supporting arbitrary encodings and generating methods on the fly, without polluting the method table with support for all possible encodings.

But I must say I don't know why you need these functions (like codeunit or native_endian), so I cannot tell whether this will work for you.
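For readers following along, here is a minimal sketch of what a parametric singleton encoding type can look like (illustrative names; the PR's actual definitions may differ):

```julia
# One parameterized, field-less type: each encoding name becomes its own
# concrete singleton, so methods can dispatch on the encoding itself and
# be generated on demand rather than defined up front for every encoding.
struct Encoding{enc} end
Encoding(s::AbstractString) = Encoding{Symbol(s)}()

name(::Encoding{enc}) where {enc} = String(enc)

name(Encoding("UTF-16LE"))  # "UTF-16LE"
```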

TODO:

  • classify the encodings currently in encodings_other. Can all non-UTF/UCS encodings be considered 8-bit?
  • handle aliases like UTF16LE
  • test the AbstractString convenience methods

@nalimilan force-pushed the nl/types branch 2 times, most recently from 3801bec to 18fd160 on February 13, 2016
@ScottPJones (Contributor)

I'll start reviewing this over the weekend.
Great to see more being done to handle strings properly!

Did you look at the discussions about making the encodings use traits?
Will this be able to handle some sort of hierarchy of encodings (i.e. UTF-16LE and UTF-16BE both being UTF-16, the only difference being the endianness)?
That is why I wanted native_endian: so that the code can be made more generic, with a simple call to a function that swaps the bytes when the encoding is not native-endian. Likewise for codeunit, which would be UInt8 for all byte-oriented encodings, but UInt16 for the UTF-16* variants and UInt32 for the UTF-32* ones.
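A sketch of what those two functions might look like on top of the Encoding type sketched above (these are the names from the discussion, not necessarily this PR's code):

```julia
# Code unit type per encoding family.
# (This sketch shadows Base.codeunit; a real implementation would extend it.)
codeunit(::Encoding{:UTF16LE}) = UInt16
codeunit(::Encoding{:UTF16BE}) = UInt16
codeunit(::Encoding{:UTF32LE}) = UInt32

# Base.ENDIAN_BOM identifies the host byte order (0x04030201 = little-endian).
native_endian(::Encoding{:UTF16LE}) = ENDIAN_BOM == 0x04030201
native_endian(::Encoding{:UTF16BE}) = ENDIAN_BOM == 0x01020304

# Generic code then normalizes byte order with a single helper:
fixendian(enc, u) = native_endian(enc) ? u : bswap(u)
```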

I think the encodings can be classified by their code unit; whether they are native- or opposite-endian (for code units of 2 or 4 bytes); whether they take 1, 2, or more code units to represent each code point; and whether the code points are Unicode (UTF-8, UTF-16, UTF-32 and variants), a subset of Unicode (such as ASCII, ANSI Latin 1, UCS-2), ASCII-compatible (such as CP1252, where the first 128 code points are ASCII), or not even ASCII-compatible (such as EBCDIC, and a few others).
The distinction between 16-bit UCS-2 (which can be directly indexed) and UTF-16 (which could be called a DWCS, and can't be directly indexed) can be very important for performance.
8-bit character sets are much easier to handle efficiently and can be done with simple tables, whereas multibyte encodings (except UTF-8) usually need special code plus large tables for both directions.
I've had to deal with Shift-JIS, EUC, GB, and Big5 a lot in the past. Note that EUC-JP is not a DBCS but an MBCS (characters added by the later standard take 3 bytes).

"1026", "1046", "1047", "10646-1:1993", "10646-1:1993/UCS4",
"437", "500", "500V1", "850", "851", "852", "855", "856", "857",
"860", "861", "862", "863", "864", "865", "866", "866NAV", "869",
"874", "8859_1", "8859_2", "8859_3", "8859_4", "8859_5", "8859_6",
@ScottPJones (Contributor), commenting on the excerpt above:

8859_1 is a synonym for ANSI Latin 1, which I think should be classified separately, as it is purely an 8-bit subset of Unicode.

@nalimilan (Member, Author)

Yes, as I noted, there's a lot of classification work to do here. I've just started moving a few of these to encodings8 to test how it works.

Anyway, if we want to store more properties about each encoding, we should create an immutable with a few fields, and make an array of that, instead of storing only the name.
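One possible shape for such an immutable, with assumed field names (purely illustrative; the PR's actual EncodingInfo stub may differ):

```julia
struct EncodingInfo
    name::String            # canonical name, e.g. "ISO-8859-1"
    codeunit::DataType      # UInt8, UInt16 or UInt32
    native_endian::Bool     # only meaningful for 16/32-bit code units
    unicode::Bool           # code points are (a subset of) Unicode
    ascii_compatible::Bool  # first 128 code points identical to ASCII
end

const encodings_list = [
    EncodingInfo("ISO-8859-1", UInt8,  true,                     true, true),
    EncodingInfo("UTF-16LE",   UInt16, ENDIAN_BOM == 0x04030201, true, false),
]
```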

@ScottPJones (Contributor)

Yes. Maybe the table could hold just the properties of the canonical encodings, with another table mapping all of the string names to entries in it?
What ideas do you have for those sorts of structures?

@nalimilan (Member, Author)

I'll push a proposal shortly. Indeed, it sounds like keeping a separate list of aliases will make everything shorter and easier to maintain.
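Sketched on top of the encodings_list above, the two-table layout could look like this (names and indices are illustrative):

```julia
# Aliases map every accepted spelling to the index of the canonical entry:
const encoding_aliases = Dict{String,Int}(
    "ISO-8859-1" => 1,
    "8859_1"     => 1,
    "latin1"     => 1,
    "UTF-16LE"   => 2,
    "UTF16LE"    => 2,
)

encoding_info(name::AbstractString) = encodings_list[encoding_aliases[name]]

encoding_info("8859_1").ascii_compatible  # true
```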

@ScottPJones (Contributor)

This definitely looks like a good start! I hope you don't mind all the comments!

@ScottPJones (Contributor)

I was also just thinking that a lot of the classification I'd like to see can be done programmatically: for example, checking whether an 8-bit character set is ASCII-compatible, and whether it is single-, double-, or multi-byte, by running through all of the characters in the Unicode character set and checking the results.
For example, CP864 (an Arabic character set) looks compatible, but it is not (the % character is replaced by \u066a).
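A rough version of that check is possible with this package's decode as the oracle, assuming the underlying iconv recognizes the encoding name:

```julia
using StringEncodings

# An 8-bit charset is ASCII-compatible if each byte 0x00-0x7f decodes to
# the character with that same code point.
function is_ascii_compatible(enc::AbstractString)
    all(0x00:0x7f) do b
        try
            decode([b], enc) == string(Char(b))
        catch
            false  # the byte is not valid on its own in this encoding
        end
    end
end

is_ascii_compatible("CP864")  # false: 0x25 ('%') decodes to '\u066a'
```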

@nalimilan (Member, Author)

> I was also just thinking that a lot of the classification I'd like to see can be done programmatically: for example, checking whether an 8-bit character set is ASCII-compatible, and whether it is single-, double-, or multi-byte, by running through all of the characters in the Unicode character set and checking the results. For example, CP864 (an Arabic character set) looks compatible, but it is not (the % character is replaced by \u066a).

Actually, I've just bumped into this: http://demo.icu-project.org/icu-bin/convexp?conv=hp-roman8
It seems that ICU provides information about all encodings, and in particular whether each one is ASCII-compatible.

@ScottPJones (Contributor)

Ah, that's great. I see it also has the information needed to decide whether an encoding is single, double, or multi code unit.

@nalimilan (Member, Author)

@ScottPJones Please have a look at the stub EncodingInfo type and at the partial list of encodings. Do you think this provides all the information we need?

@ScottPJones (Contributor)

The new EncodingInfo stuff looks much better, yes.
It would be nice if we could come up with a way to automatically generate the tables, from either iconv or ICU.
Another thing might be to have the encoding machinery carry the table information for encodings we directly support, while automatically falling back to iconv for encodings we simply don't care about that much (like UTF-7 and most of the obsolete EUC, GB, Big5, Mac, etc. ones).

@nalimilan (Member, Author)

> The new EncodingInfo stuff looks much better, yes. It would be nice if we could come up with a way to automatically generate the tables, from either iconv or ICU.

I think it would be easier to take the code that does the same thing in iconv-lite:
https://github.com/ashtuchkin/iconv-lite/blob/master/generation/gen-sbcs.js
(generated file: https://github.com/ashtuchkin/iconv-lite/blob/master/encodings/sbcs-data-generated.js)
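A hypothetical Julia counterpart of gen-sbcs.js, again using decode against iconv as the reference (the function name and approach are illustrative):

```julia
using StringEncodings

# Build the 256-entry decoding table of a single-byte charset.
function sbcs_table(enc::AbstractString)
    map(0x00:0xff) do b
        try
            only(decode([b], enc))  # code point this single byte maps to
        catch
            '\ufffd'                # byte is unmapped in this encoding
        end
    end
end

table = sbcs_table("ISO-8859-15")
table[0xa4 + 1]  # '€', one of the few bytes that differ from Latin-1
```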

> Another thing might be to have the encoding machinery carry the table information for encodings we directly support, while automatically falling back to iconv for encodings we simply don't care about that much (like UTF-7 and most of the obsolete EUC, GB, Big5, Mac, etc. ones).

Yes, that was the idea. Using the Tim Holy traits trick based on the encoding info, it should be easy to override the current StringEncoder and StringDecoder where we have a specialized version.
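A rough sketch of how the Holy traits trick applies here, building on the Encoding sketch above (all names are illustrative, not the package's API):

```julia
abstract type DecoderKind end
struct TableDecoder <: DecoderKind end  # encodings we handle natively
struct IconvDecoder <: DecoderKind end  # everything else

decoderkind(::Encoding) = IconvDecoder()           # default: fall back
decoderkind(::Encoding{:latin1}) = TableDecoder()  # specialized override

decode_all(enc::Encoding, bytes) = decode_all(decoderkind(enc), enc, bytes)
# Latin-1 bytes map 1:1 to the first 256 code points, so no table is needed:
decode_all(::TableDecoder, enc, bytes) = join(Char(b) for b in bytes)
decode_all(::IconvDecoder, enc, bytes) = error("would call into iconv here")
```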

@ScottPJones (Contributor)

Those generators from iconv-lite look nice (even if they are in JS instead of Julia! ;-) ). I see he does what I'd been talking about, and checks whether the first half of the table is the same as ASCII.
For a lot of the newer 8-bit ISO character sets, it's also useful to check whether the range 0x00:0x9f is identical to ANSI Latin 1/Unicode. Another property that is very useful to keep track of for efficient converters, and which the generator can figure out, is whether all of the characters map to the BMP (because then smaller tables can be used), and in fact whether all characters map to a particular section of the BMP, which is frequently the case.
What I wonder, maybe you have some ideas here, is how best to set things up so that we can directly use iconv or ICU or whatever when we don't support an encoding, while for the ones we directly support we have different methods for different classes of encodings: cases where no tables are needed, or where the only differences are the tables, possibly loaded from a binary file at run-time. Python 3 seems to have a framework that allows all of that.
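The BMP property is equally easy to derive from a generated table; an illustrative check on top of the sbcs_table sketch above:

```julia
# UInt16 table entries suffice when every mapped character is in the BMP:
all_bmp(table) = all(c -> codepoint(c) <= 0xffff, table)

all_bmp(sbcs_table("ISO-8859-15"))  # true
```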

It would really be nice for Julia to have best-in-class support for character sets, encodings & strings, even compared to Python 3 and Swift 2.0!

@nalimilan (Member, Author)

> What I wonder, maybe you have some ideas here, is how best to set things up so that we can directly use iconv or ICU or whatever when we don't support an encoding, while for the ones we directly support we have different methods for different classes of encodings: cases where no tables are needed, or where the only differences are the tables, possibly loaded from a binary file at run-time. Python 3 seems to have a framework that allows all of that.

I think traits allow for exactly this kind of thing. You just need to add methods for StringEncoder and StringDecoder based on the information we have about encodings.

@ScottPJones (Contributor)

bump (even though it's your own PR ;-) )
This has fallen behind the main branch, but it still seems like a very nice improvement, if you plan to move forward with StringEncodings.jl.

@nalimilan (Member, Author)

That's not at the top of my priorities right now, though I'd be happy to review a PR if you want to update it. Do you need a particular feature?

@ScottPJones (Contributor)

OK, I'm not sure how I'd make a PR on this PR, though.

@nalimilan (Member, Author)

Just open a new PR. Anyway, only the second commit is useful here IIRC.
