"iso-8859-1" (latin1) not "windows1252" #5

Open
username1565 opened this issue May 23, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@username1565

username1565 commented May 23, 2020

See c1193bd for
the changes and tests for the latin-1 encoding.

Here https://github.com/username1565/text-encoding/blob/dc7b6481e47e731d3ddae0fb0f4cffe876b1efa9/src/encoding/encodings.ts#L313
latin-1 (and synonyms) is switched to windows-1252.

Test:

<script src="https://unpkg.com/@zxing/[email protected]/umd/encoding-indexes.js"></script>
<script src="https://unpkg.com/@zxing/[email protected]/umd/encoding.js"></script>

<script>

var s = ''; for(var i = 0; i<256; i++){s+= String.fromCharCode(i);} console.log('s: \n'+s);	//generate string with all latin-1 characters

var latin1_bytes = new TextEncoding.TextEncoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: false }).encode(s);	//try to encode this
console.log('latin1_bytes', latin1_bytes);	//but receive 384 bytes, not 256 bytes.

var allBytes = new Uint8Array(256); for(var i = 0; i<256; i++){allBytes[i] = i;} console.log('allBytes', allBytes);		//generate all consecutive bytes

var latin1 = new TextEncoding.TextDecoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);	//try to decode this as latin-1 string
console.log('latin1: ', latin1, '(latin1 === s)', (latin1 === s));	//show the string and compare it with previous string. ---> received windows-1252 string.

</script>
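
A note on the 384-byte result: that length is exactly what UTF-8 produces for this string, which suggests (as an assumption, not confirmed in this thread) that the encoder fell back to UTF-8 instead of using a single-byte encoding. A minimal sketch with the standard `TextEncoder`, which always emits UTF-8:

```javascript
// Build the same 256-character string as in the test above (U+0000..U+00FF).
let allChars = '';
for (let i = 0; i < 256; i++) {
  allChars += String.fromCharCode(i);
}

// The standard TextEncoder takes no encoding label and always produces UTF-8.
const utf8Bytes = new TextEncoder().encode(allChars);

// In UTF-8, U+0000..U+007F take 1 byte each and U+0080..U+00FF take 2 bytes each.
const expectedLength = 128 * 1 + 128 * 2;

console.log(utf8Bytes.length, expectedLength); // both are 384
```

A true single-byte latin-1 encoder would return 256 bytes for the same input.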

UPD:
It seems I have already fixed this in these commits:
username1565@61d721d
username1565@d438e53
username1565@f2d46a7
username1565@82e412c
username1565@728f622
username1565@30e36c0
username1565@dbfbc88

You can see the full diff by comparing across forks: master...username1565:master

@username1565
Author

Also, you can enable GitHub Pages for your master branch in the settings of your repository,
to see the results of the browser tests in the browser.

@username1565
Author

username1565 commented May 23, 2020

Also, could you remove the TextEncoding namespace,
so that there is no need to call new TextEncoding.TextEncoder() and new TextEncoding.TextDecoder(),
and keep just the plain new TextEncoder() and new TextDecoder()?
Please do not override them when they are already defined in the browser,
and override them only as described here:
https://github.com/zxing-js/text-encoding/blob/master/README.md#non-standard-behavior

@odahcam
Member

odahcam commented May 30, 2020

latin-1 (and synonyms) is switched to windows-1252.

Awesome! In fact, I created this repo based on these changes, but I had to step back to the original implementation to be able to fix all the tests here. I do plan to update the Latin1 and ISO-8859-1 indexes, and I'm sure this will help. Thanks.

Also, you can enable GitHub Pages for your master branch in the settings of your repository, to see the results of the browser tests in the browser.

I can already see them by running the project locally, so I'm not very interested right now. I will create some pages in the future.

And can you remove TextEncoding, and do not override this when this already defined in browser,

Yeah, definitely. I thought I had already published a version where this was done. Here's a script that checks for the polyfill in the latest version: https://codepen.io/odahcam/pen/abvepmQ?editors=1010 Edit: fixed.


It took me a little while to answer because I'm a little busy right now, but I'll get back to this work soon.

@odahcam odahcam added the enhancement New feature or request label May 30, 2020
@username1565
Author

username1565 commented May 30, 2020

#5 (comment)

Also, you can enable GitHub Pages for your master branch in the settings of your repository,
to see the results of the browser tests in the browser.

You can see the changes and the comment for this commit: username1565@5eb0906
I compiled the TypeScript to JavaScript
and uploaded the result into the ./lib folder,
to make it compatible with @sinonjs's repository,
then enabled GitHub Pages for the master branch,
and created another page, ./test/browser/libTEST.html, which uses the paths to the already compiled and uploaded ./lib/*.js files for testing.
After all this, the tests are available online here, and as you can see, all tests pass.

After all this, I think we can open a pull request for @sinonjs,
whose JS files are kept in the ./lib folder.

I also added some more tests there,
and you can see all the commits here: https://github.com/username1565/text-encoding/network

It took me a little while to answer because I'm a little busy right now, but I'll get back to this work soon.

No problem. I think we should not rush.
But we should think about the quality of the code,
because we leave this code for posterity, for centuries,
and maybe, as the standardized polyfill of a reference library, forever!
It is already included in many browsers. Heh heh.

So, as I said here: #1 (comment),
that code contains many other encodings
that can be encoded, decoded, and tested
using TextEncoder/TextDecoder, to make this code full, complete, and reversible.

And of course, you can fix and extend it only in your free time,
at your own pace, and just for fun.


P.S.: I added your changes to my fork, fixed the "CRLF" line endings, drafted a new release, and then published the NPM package.
Everything works fine!
After this, I created a new branch, changed @username1565 to @zxing-js, and opened
this pull request for you with minimal changes. All conflicts are resolved, and you can merge these changes after reviewing the diff there.

username1565 referenced this issue in username1565/text-encoding May 30, 2020
1. Remove "new TextEncoding.TextEncoder()" and "new TextEncoding.TextDecoder()" from browser tests,
and leave just old "new TextEncoder()" and "new TextDecoder()".

2. Define "TextEncoder" and "TextDecoder" only if they are undefined.

3. To override "TextEncoder/TextDecoder", use:
<script>
  window.TextEncoder = undefined;
  window.TextDecoder = undefined;
//or
//window.TextEncoder = window.TextDecoder = null;
</script>

4.  "new TextEncoding.TextEncoder()" and "new TextEncoding.TextDecoder()" also work
when "TextEncoder/TextDecoder" is already defined.
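
The "define only if undefined" behavior from points 2 and 3 can be sketched as follows. This is a minimal illustration, not the library's actual code: `installIfMissing`, the stand-in polyfill classes, and the `fakeWindow` object are all hypothetical names introduced here.

```javascript
// Hypothetical helper: install an implementation on a global-like object
// only when nothing is defined there yet (undefined or null).
function installIfMissing(globalObject, name, implementation) {
  if (globalObject[name] === undefined || globalObject[name] === null) {
    globalObject[name] = implementation;
    return true; // the polyfill was installed
  }
  return false; // a native (or earlier) implementation is kept
}

// Stand-in classes for illustration only; the real polyfill classes differ.
class PolyfillTextEncoder {}
class PolyfillTextDecoder {}

// Simulated browser global: TextEncoder exists natively, TextDecoder was cleared
// by the page (as in point 3) so that the polyfill takes over.
const fakeWindow = {
  TextEncoder: function NativeTextEncoder() {},
  TextDecoder: undefined,
};

const encoderInstalled = installIfMissing(fakeWindow, 'TextEncoder', PolyfillTextEncoder);
const decoderInstalled = installIfMissing(fakeWindow, 'TextDecoder', PolyfillTextDecoder);

console.log(encoderInstalled, decoderInstalled); // false true
```

With this shape, a page opts in to the polyfill by clearing the global before the library script loads, and the native classes are left alone otherwise.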
@odahcam
Member

odahcam commented Jun 24, 2020

FYI: I'm away for a while right now; I intend to come back next semester.

@odahcam
Member

odahcam commented Jun 28, 2020

I'm trying to better understand the key differences between Windows-1252 and ISO-8859-1. I ran into this answer, which is pretty straightforward and interesting: https://stackoverflow.com/a/31800761/4367683

Also, I found this very elegant table which compares the character differences between both encodings: https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

image

I'd like to leave it here for documentation purposes.

Is there something else you'd like to add?

@username1565
Author

username1565 commented Jun 28, 2020

Both of these encodings are extended ASCII.
So the first 128 characters (0x00-0x7F, i.e. 0-127) are the ASCII characters
in both iso-8859-1 (latin1) and windows-1252.
The second half is different, and you can see the differences in those charset tables.
Also, as you can see, latin-1 is the older encoding,
and some characters were not included in the first version of windows-1252.

In your picture, I see windows-1252 characters represented as char codes
in iso-8859-15, Unicode, UTF-8 bytes, and NCR.
But iso-8859-15 is not iso-8859-1;
moreover, some characters from the iso-8859-1 charset table are not contained in windows-1252.

You can also compare the differences this way, with compare_latin1_and_cp1252.html:

<script src="https://unpkg.com/@username1565/[email protected]/umd/encoding-indexes.js"></script>
<script src="https://unpkg.com/@username1565/[email protected]/umd/encoding.js"></script>

<script>
//	generate latin-1 string
var s = ''; for(var i = 0; i<256; i++){s+= String.fromCharCode(i);} console.log('s: \n'+s);	//generate string with all latin-1 characters
//	get consecutive bytes by encoding this
var latin1_bytes = new TextEncoding.TextEncoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).encode(s);	//try to encode this
console.log('latin1_bytes', latin1_bytes);	//show this
//	generate consecutive bytes as Uint8Array
var allBytes = new Uint8Array(256); for(var i = 0; i<256; i++){allBytes[i] = i;} console.log('allBytes', allBytes);		//generate all consecutive bytes
//	Decode this as latin-1 encoded string
var latin1 = new TextEncoding.TextDecoder('iso-8859-1', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);	//try to decode this as latin-1 string
console.log('latin1: ', latin1, '(latin1 === s)', (latin1 === s));	//show the string and compare it with previous string. 			//true

//	decode bytes as windows-1252 chars
var windows1252 = new TextEncoding.TextDecoder('windows-1252', { NONSTANDARD_allowLegacyEncoding: true }).decode(allBytes);	//try to decode bytes as a windows-1252 encoded string
console.log('\n\n'+		'windows1252: ', windows1252);	//show the string
//	get consecutive bytes by encoding this
var bytes = new TextEncoding.TextEncoder('windows-1252', { NONSTANDARD_allowLegacyEncoding: true }).encode(windows1252);	//try to encode this to bytes
console.log('bytes', bytes, '(windows1252 === decoded): ', (windows1252 === new TextEncoding.TextDecoder('windows-1252').decode(bytes)));	//show these bytes, decode them back and compare with the string

//compare strings, encoded as iso-8859-1 (latin-1) and windows-1252, and write diff
var diff = [];								//in empty array with diff
for(var i=0; i<allBytes.length; i++){		//for each byte
	if(latin1[i] !== windows1252[i]){		//if symbol is not equal
		diff.push({ 'i': i, 'latin-1 char': latin1[i], 'windows1252 char': windows1252[i]});	//write charcode, latin1-char and cp1252-char, as one JSON-object, as item of array.
	}
}
console.log("diff: ", JSON.stringify(diff, null, 1));	//show array with differences, as formatted-indented JSON.
</script>

As a result, there are 27 differing characters, the ones from your image. Here are the char codes and the characters in both encodings:

diff:  [
 {
  "i": 128,
  "latin-1 char": "�",
  "windows1252 char": "€"
 },
 {
  "i": 130,
  "latin-1 char": "�",
  "windows1252 char": "‚"
 },
 {
  "i": 131,
  "latin-1 char": "�",
  "windows1252 char": "ƒ"
 },
 {
  "i": 132,
  "latin-1 char": "�",
  "windows1252 char": "„"
 },
 {
  "i": 133,
  "latin-1 char": "�",
  "windows1252 char": "…"
 },
 {
  "i": 134,
  "latin-1 char": "�",
  "windows1252 char": "†"
 },
 {
  "i": 135,
  "latin-1 char": "�",
  "windows1252 char": "‡"
 },
 {
  "i": 136,
  "latin-1 char": "�",
  "windows1252 char": "ˆ"
 },
 {
  "i": 137,
  "latin-1 char": "�",
  "windows1252 char": "‰"
 },
 {
  "i": 138,
  "latin-1 char": "�",
  "windows1252 char": "Š"
 },
 {
  "i": 139,
  "latin-1 char": "�",
  "windows1252 char": "‹"
 },
 {
  "i": 140,
  "latin-1 char": "�",
  "windows1252 char": "Œ"
 },
 {
  "i": 142,
  "latin-1 char": "�",
  "windows1252 char": "Ž"
 },
 {
  "i": 145,
  "latin-1 char": "�",
  "windows1252 char": "‘"
 },
 {
  "i": 146,
  "latin-1 char": "�",
  "windows1252 char": "’"
 },
 {
  "i": 147,
  "latin-1 char": "�",
  "windows1252 char": "“"
 },
 {
  "i": 148,
  "latin-1 char": "�",
  "windows1252 char": "”"
 },
 {
  "i": 149,
  "latin-1 char": "�",
  "windows1252 char": "•"
 },
 {
  "i": 150,
  "latin-1 char": "�",
  "windows1252 char": "–"
 },
 {
  "i": 151,
  "latin-1 char": "�",
  "windows1252 char": "—"
 },
 {
  "i": 152,
  "latin-1 char": "�",
  "windows1252 char": "˜"
 },
 {
  "i": 153,
  "latin-1 char": "�",
  "windows1252 char": "™"
 },
 {
  "i": 154,
  "latin-1 char": "�",
  "windows1252 char": "š"
 },
 {
  "i": 155,
  "latin-1 char": "�",
  "windows1252 char": "›"
 },
 {
  "i": 156,
  "latin-1 char": "�",
  "windows1252 char": "œ"
 },
 {
  "i": 158,
  "latin-1 char": "�",
  "windows1252 char": "ž"
 },
 {
  "i": 159,
  "latin-1 char": "�",
  "windows1252 char": "Ÿ"
 }
]
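
The count of 27 can be cross-checked without the library. The sketch below hard-codes the windows-1252 mapping for bytes 0x80-0x9F, with the five positions that have no printable assignment (0x81, 0x8D, 0x8F, 0x90, 0x9D) mapped to the same-numbered control code points, which is an assumption that matches the diff above, where those five indexes do not appear. The name `CP1252_HIGH` is introduced here for illustration.

```javascript
// Unicode code points for windows-1252 bytes 0x80..0x9F (32 entries).
// Unassigned positions are mapped to U+0081, U+008D, U+008F, U+0090, U+009D.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// In iso-8859-1, byte n always maps to code point n, so a byte differs
// between the two encodings exactly when the table entry is not the byte value.
const differingBytes = [];
for (let offset = 0; offset < CP1252_HIGH.length; offset++) {
  const byteValue = 0x80 + offset;
  if (CP1252_HIGH[offset] !== byteValue) {
    differingBytes.push(byteValue);
  }
}

console.log(differingBytes.length); // 27
console.log(differingBytes[0], String.fromCharCode(CP1252_HIGH[0])); // 128 '€'
```

All other byte values (0x00-0x7F and 0xA0-0xFF) map to the same code point in both encodings, so the 32 positions minus the 5 shared ones give the 27 entries listed.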

Also, as you can see, some characters of the iso-8859-1 encoding cannot be copied as text,
and they were replaced with the replacement character.
All windows-1252 characters in this diff can be copied.
But windows-1252 contains characters that cannot be copied, too:

(null-byte skipped)��������	
�
������������������ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~�€�‚ƒ„…†‡ˆ‰Š‹Œ�Ž��‘’“”•–—˜™š›œ�žŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Anyway, both encodings are reversible,
and (latin1 === s) and (windows1252 === decoded) both return true
after encoding and decoding between strings and bytes.
Also, as you can see in your image, some windows-1252 characters map to Unicode code points above U+00FF, which do not fit in one byte,
while every iso-8859-1 character fits in one byte,
because Unicode is ASCII-compatible and, moreover, iso-8859-1-compatible (its first 128 code points, 00-7F, are the ASCII characters, and its first 256 code points, 00-FF, are the latin-1 characters).
So it is better to use iso-8859-1 instead of windows-1252
to encode bytes as a string and decode a string into a byte array.
In this case, n bytes can be converted to n characters and back,
because one byte converts to exactly 1 character, and 1 character converts back to 1 byte,
for each byte value from 0 to 255 (256 characters in total).
In this case, the encoded strings have the same length as the byte arrays,
and no character needs more than 1 byte for its char code, unlike some windows-1252 characters.

This makes it possible to work with byte arrays as strings
without exceeding the byte lengths in the encoded strings,
and then to pass these byte arrays as strings to methods and functions
that accept only strings as arguments and return only strings.
Inside those methods and functions, it is possible to convert the strings into byte arrays,
process the bytes, encode the result into a string,
and return it as a string, without exceeding the byte length of the encoded string.
In this case, there is no need to write other methods or add optional arguments to accept bytes directly,
and no need to work with binary data.
Plain input and output of iso-8859-1-encoded strings, as text, is enough.
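
The one-byte-to-one-character round trip described above can be sketched without any library, using only `String.fromCharCode` and `charCodeAt`. The helper names `bytesToLatin1` and `latin1ToBytes` are hypothetical and are not part of the text-encoding package.

```javascript
// Hypothetical helper: map each byte n to the character with code point n.
function bytesToLatin1(bytes) {
  let text = '';
  for (let i = 0; i < bytes.length; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text;
}

// Hypothetical helper: map each character back to one byte.
// Characters above U+00FF have no latin-1 byte, so they are rejected.
function latin1ToBytes(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError('Character at index ' + i + ' is outside latin-1');
    }
    bytes[i] = code;
  }
  return bytes;
}

// Round trip over every possible byte value.
const everyByte = new Uint8Array(256);
for (let i = 0; i < 256; i++) {
  everyByte[i] = i;
}
const asText = bytesToLatin1(everyByte);
const backToBytes = latin1ToBytes(asText);

console.log(asText.length, backToBytes.length); // 256 256
```

The string length always equals the byte length here, which is the property that lets string-only functions carry binary data. A windows-1252 character such as "€" (U+20AC) would be rejected by `latin1ToBytes`, since its code does not fit in one byte.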
