WIP: store a list of encodings and their properties
nalimilan committed Feb 14, 2016
1 parent 4c83568 commit 1671897
Showing 1 changed file with 93 additions and 0 deletions.
93 changes: 93 additions & 0 deletions src/encodings.jl
@@ -33,6 +33,23 @@ end

## Functions giving information about a particular encoding

# NO_ENDIAN: insensitive to endianness
# BIG_ENDIAN: default to big-endian
# LOW_ENDIAN: default to big-endian

ScottPJones Feb 14, 2016

Contributor

typo here?

nalimilan Feb 14, 2016

Author Member

Right. Don't worry, though, this was just to illustrate my suggestion; it's not yet functional.

# BIG_ENDIAN_AUTO: endianness detection using BOM on input, defaults to big-endian on output
# LOW_ENDIAN_AUTO: endianness detection using BOM on input, defaults to low-endian on output
# NATIVE_ENDIAN_AUTO: endianness detection using BOM on input, defaults to native-endian on output
@enum Endianness NO_ENDIAN BIG_ENDIAN LOW_ENDIAN BIG_ENDIAN_AUTO LOW_ENDIAN_AUTO NATIVE_ENDIAN_AUTO
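The `*_AUTO` values above rely on reading a byte order mark on input. As a rough illustration only, BOM detection for UTF-16 input could look like the sketch below. The function name `detect_utf16_endianness`, the symbols it returns, and the `default` argument are hypothetical and not part of this commit. It also ignores the fact that `FF FE` can be the start of a UTF-32LE BOM.

```julia
# Hypothetical helper, not part of this commit: guess UTF-16 byte order from a BOM.
# Returns :big or :little, or `default` when the input does not start with a BOM.
function detect_utf16_endianness(bytes::AbstractVector{UInt8}, default::Symbol = :big)
    if length(bytes) >= 2
        if bytes[1] == 0xfe && bytes[2] == 0xff
            return :big      # U+FEFF serialized as FE FF
        elseif bytes[1] == 0xff && bytes[2] == 0xfe
            return :little   # U+FEFF serialized as FF FE
        end
    end
    return default           # no BOM found: the caller's default applies (an assumption of this sketch)
end
```

How input without a BOM should be treated is not specified in the comments above, so the `default` fallback here is only one possible choice.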

immutable EncodingInfo
name::ASCIIString
codeunit::Int8 # Number of bytes per codeunit
codepoint::Int8 # Number of bytes per codepoint; for MBCS, negative values give the maximum number of bytes

ScottPJones Feb 14, 2016

Contributor

codeunit/codepoint would be: UTF-8: 1, -4; UTF-16: 2, -4; UCS-2: 2, 2; UTF-32: 4, 4?

ScottPJones Feb 14, 2016

Contributor

Maybe codepoint should be defined in terms of code units? That would make it easier to check for linear indexing (i.e. codepoint == 1).

ScottPJones Feb 14, 2016

Contributor

Never mind. Looking further at the table below, it's just the comment that is wrong; you really did define it in terms of code units.

nalimilan Feb 14, 2016

Author Member

Makes sense.
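To make the reading agreed on in this thread concrete, here is a minimal sketch in which `codepoint` counts code units rather than bytes. It uses current Julia syntax (`struct` instead of the `immutable` keyword in the diff) and a cut-down type. The names `EncodingInfoSketch`, `islinear` and `maxbytes` are hypothetical and do not appear in the commit.

```julia
# Cut-down, hypothetical version of EncodingInfo, written for current Julia.
struct EncodingInfoSketch
    name::String
    codeunit::Int8    # number of bytes per code unit
    codepoint::Int8   # code units per code point; negative = variable width, absolute value is the maximum
end

# Linear indexing is possible when every code point takes exactly one code unit.
islinear(e::EncodingInfoSketch) = e.codepoint == 1

# Maximum number of bytes a single code point can occupy.
maxbytes(e::EncodingInfoSketch) = Int(e.codeunit) * abs(Int(e.codepoint))

utf8  = EncodingInfoSketch("UTF-8", 1, -4)
utf16 = EncodingInfoSketch("UTF-16", 2, -2)
utf32 = EncodingInfoSketch("UTF-32", 4, 1)
```

With this convention the byte-based maximum is still recoverable: `maxbytes` gives 4 for each of the three entries above, while `islinear` is true only for `utf32`.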

lowendian::Endianness # Endianness, if applicable
ascii::Bool # Is the encoding a superset of ASCII?
unicode::Bool # Is the encoding Unicode-compatible?

ScottPJones Feb 14, 2016

Contributor

Should also indicate whether the character set encoded is a subset of Unicode (ASCII, ANSI Latin1, UCS-2*), or can represent full Unicode (UTF-8*, UTF-16*, UTF-32*). Maybe subunicode::Bool, fullunicode::Bool?

nalimilan Feb 14, 2016

Author Member

See next comment about subunicode. Will add fullunicode.

end

"""
native_endian(enc)
@@ -87,10 +104,86 @@ end

codeunit(enc::AbstractString) = codeunit(Encoding(enc))

const encodings_list2 = EncodingInfo[
EncodingInfo("ASCII", 1, 1, NO_ENDIAN, true, true),

# Unicode encodings
EncodingInfo("UTF-8", 1, -4, NO_ENDIAN, true, true),
EncodingInfo("UTF-16", 2, -2, BIG_ENDIAN_AUTO, false, true), # FIXME: iconv implementations vary regarding endianness

ScottPJones Feb 14, 2016

Contributor

These should be -4 for codepoint, since you said those are in bytes.

EncodingInfo("UTF-16LE", 2, -2, LOW_ENDIAN, false, true),
EncodingInfo("UTF-16BE", 2, -2, BIG_ENDIAN, false, true),
EncodingInfo("UTF-32", 4, 1, BIG_ENDIAN_AUTO, false, true), # FIXME: iconv implementations vary regarding endianness
EncodingInfo("UTF-32LE", 4, 1, LOW_ENDIAN, false, true),
EncodingInfo("UTF-32BE", 4, 1, BIG_ENDIAN, false, true),

EncodingInfo("UCS-2", 2, 1, BIG_ENDIAN_AUTO, false, true), # FIXME: iconv implementations vary regarding endianness
EncodingInfo("UCS-2LE", 2, 1, LOW_ENDIAN, false, true),
EncodingInfo("UCS-2BE", 2, 1, BIG_ENDIAN, false, true),

# ISO-8859
EncodingInfo("ISO-8869-1", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-2", 1, 1, NO_ENDIAN, true, true),

ScottPJones Feb 14, 2016

Contributor

Except for ISO-8869-1, none of these encodings are Unicode compatible: the character set values from 0xa0-0xff don't match Unicode.
Basically, any character set that iconv can handle can be converted to/from Unicode, AFAIK. What is interesting in the case of ASCII, Latin1, and UCS-2 is that no conversion is needed.

nalimilan Feb 14, 2016

Author Member

Yeah, I defined it as "can be converted to Unicode", but that's not very useful as it's always true. That said, is it very interesting to know that for ASCII, Latin1, and UCS-2, one directly gets the Unicode value? Only these three encodings have that property AFAIK, and we're not going to write a decoder common to them, are we? Anyway, we can always add this information later if it turns out to be useful.

ScottPJones Feb 14, 2016

Contributor

Actually, in my experience those properties have been the most important for writing optimally performing conversion code. Validated ASCII can simply be treated as UTF-8 or, like Latin1, widened directly from 8-bit to 16-bit UCS-2 or UTF-16, or to 32-bit UTF-32, with operations that work on whole chunks of data at once using SIMD instructions (I've hand-coded those in the past for different instruction sets).

Counting encodings, these might seem to be just 3 out of many, but if you look at people's data, I'd say most of it can be represented by those 3.
If you want to leave this out of the table, I think we should still have methods that return that state, set up for those 3 (more, really, because with UCS-2 you have all the different encodings of that subset of Unicode, i.e. little/big endian, with or without BOM, etc.), with a fallback to false for all other encodings.

nalimilan Feb 14, 2016

Author Member

What I mean is that you're likely to write one specific encoder for each of these encodings, not a generic one for all three of these. So in the end we don't need a trait to identify them.

I may be wrong, though. Anyway, it doesn't hurt to provide this information, so I'll add it.
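For illustration, the "methods with a fallback to false" idea from this thread might be sketched as below, together with the direct Latin1 widening it enables. Everything here is an assumption made for the example: `Enc` is a stand-in for the package's encoding type, and `isdirectunicode` and `widen_latin1` are hypothetical names. The endianness and BOM variants of UCS-2 are left out.

```julia
# Stand-in encoding type parameterized by name (hypothetical, for this sketch only).
struct Enc{name} end
Enc(name::AbstractString) = Enc{Symbol(name)}()

# Trait: does each code unit already equal the Unicode code point it encodes?
isdirectunicode(::Enc) = false                          # fallback for all other encodings
isdirectunicode(::Enc{:ASCII}) = true
isdirectunicode(::Enc{Symbol("ISO-8859-1")}) = true
isdirectunicode(::Enc{Symbol("UCS-2")}) = true

# Latin1 bytes become UTF-16 code units by zero-extension, with no lookup table.
widen_latin1(bytes::AbstractVector{UInt8}) = UInt16.(bytes)
```

A caller could branch on `isdirectunicode(Enc("ISO-8859-1"))` to pick the widening path and fall back to a table-driven conversion otherwise. Whether that justifies a trait rather than one decoder per encoding is the open question in this thread.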

EncodingInfo("ISO-8869-3", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-4", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-5", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-6", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-7", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-8", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-9", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-10", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-11", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-12", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-13", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-14", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-15", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("ISO-8869-16", 1, 1, NO_ENDIAN, true, true),

# KOI8 codepages
EncodingInfo("KOI8-R", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("KOI8-U", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("KOI8-RU", 1, 1, NO_ENDIAN, true, true),

# 8-bit Windows codepages
EncodingInfo("CP1250", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1251", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1252", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1253", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1254", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1255", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1256", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1257", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP1258", 1, 1, NO_ENDIAN, true, true),

# DOS 8-bit codepages
EncodingInfo("CP850", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("CP866", 1, 1, NO_ENDIAN, true, true),

# Mac 8-bit codepages
EncodingInfo("MacRoman", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacCentralEurope", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacIceland", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacCroatian", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacRomania", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacCyrillic", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacUkraine", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacGreek", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacTurkish", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacHebrew", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacArabic", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("MacThai", 1, 1, NO_ENDIAN, true, true),

# Other 8-bit codepages
EncodingInfo("HP-ROMAN8", 1, 1, NO_ENDIAN, true, true),
EncodingInfo("NEXTSTEP", 1, 1, NO_ENDIAN, true, true)

# TODO: other encodings (8-bit and others)
]


## Lists of all known encodings taken from various iconv implementations,
## including different aliases for the same encoding


# 8-bit codeunit encodings
const encodings8 = [
"ASCII", "US-ASCII", "us-ascii", "CSASCII",