Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erroneous rogue element detection? #84

Open
jul059 opened this issue Nov 3, 2024 · 13 comments
Open

Erroneous rogue element detection? #84

jul059 opened this issue Nov 3, 2024 · 13 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@jul059
Copy link

jul059 commented Nov 3, 2024

Adblock-lean reports a rogue element in the following blocklist:

Warning: Rogue element: '711線上認證.online' identified originating in blocklist file from: https://raw.githubusercontent.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist/main/global-anti-scam-org-scam-urls-pihole.txt.

It looks to me like an erroneous detection.

@Wizballs
Copy link
Contributor

Wizballs commented Nov 3, 2024

It does appear that asian characters are allowable in URL addresses.
https://stackoverflow.com/questions/7184802/are-chinese-characters-allowed-entered-in-urls

We have not so far whitelisted these types of characters in adblock-lean and is why this rogue element detection occurs. Alphaumeric characters are allowed via [::alnum::]
I'll have a look for an equivalent for asian characters.

@friendly-bits
Copy link
Collaborator

friendly-bits commented Nov 3, 2024

I thought that international characters, while showing up as such in the browser, in fact are transparently encoded into ascii characters when dns resolution is performed. This Wikipedia article seems to confirm that:
https://en.m.wikipedia.org/wiki/Internationalized_domain_name

Which means that internationalized domain names can be adblocked as-is, in their ascii transcription. Technically we could implement support for additional alphabets, but they are many, so this may get complicated and have a performance impact. For this reason, I think that the correct solution is to use lists that have the internationalized domain names encoded with Punycode.

I'd suggest @jul059 to file a bug report to whoever maintains the list which has this entry.

Edit: if using a different list or fixing this list is not possible, I suppose we could make rogue element detection optional. Since rogue element check safeguards against both corrupted and malicious lists, the user would then need to make sure that the lists they are using are safe and sound.

@friendly-bits
Copy link
Collaborator

friendly-bits commented Nov 4, 2024

Upon further research, perhaps we could consider changing the regex to support all unicode characters. Apparently this can be done by replacing [[:alnum:]_-] with (\w|[0‐9]|-).

(edit: it's [0-9], not [0_9] as incorrectly represented by Github)

\w is supposed to match "word" character, which supposedly means any letter in either ascii or unicode, and underscore. So this should cover all alphabets (unless I am misunderstanding something). The caveats:

  • \w is non-POSIX. So whether it does or does not work in sed depends on the specific sed implementation. Busybox seems to support this, but we would need to test against Busybox sed in OoenWrt to make sure.
  • This may have performance impact

If this does work but has performance impact, we could make this an optional feature.

The question of whether this is actually needed still stands.

@lynxthecat lynxthecat added help wanted Extra attention is needed question Further information is requested labels Nov 4, 2024
@Wizballs
Copy link
Contributor

Wizballs commented Nov 4, 2024

international characters, while showing up as such in the browser, in fact are transparently encoded into ascii characters when dns resolution is performed.

I read this same thing also. The Blocklist maintainer could possibly switch to using ascii character.

replacing [[:alnum:]_-] with (\w|[0‐9]|-)

I'll give this a performance/sed test later on tonight. Let's see how that goes and decide from there.

@Wizballs
Copy link
Contributor

Wizballs commented Nov 4, 2024

So I have some different script behavior than OP. Running on the latest adblock-lean main (no modification), my test file with entry 華信金融.tw passes adblock-lean rogue check with no issues
https://raw.githubusercontent.com/Wizballs/dnsmasqaddresstest/refs/heads/main/test

However, dnsmasq does not like these characters anyway. So that should pretty much be decision made as asian characters cause this error:

Error: The dnsmasq test on the final blocklist failed.
dnsmasq --test errors:
dnsmasq: error at line 15843 of stdin

Error: New blocklist file check failed.

@friendly-bits
Copy link
Collaborator

friendly-bits commented Nov 4, 2024

One other thing we could do for this use case is add an option enabling pre-processing of certain lists with the idn2 utility. idn2 can be used to translate unicode to punycode. This requires packages idn2 libidn2 (seems to be available in the repo).

So something like

unicode_blocklist_urls="https://raw.githubusercontent.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist/main/global-anti-scam-org-scam-urls-pihole.txt"

@Wizballs
Copy link
Contributor

Wizballs commented Nov 4, 2024

I think converting urls to punycode should be up to the list maintainer and not built into adblock-lean. The OP could request that particular blocklist maintainer to do so.

Something else in that list is
d+D446own.chaojis0.com
I haven't seen urls with a + symbol before. And while a Google search says these are accepted, I can't get this address to resolve. It also has a capital letter which is unusual.

@friendly-bits
Copy link
Collaborator

The question is whether including internationalized domain names (IDN's) in blocklists is a normal practice. From a little research, it looks like at least pihole does accept IDN's. I can't tell how widespread the use of these lists is but if it doesn't cost us much effort, we could as well support them in adblock-lean (with help of idn2).

@Wizballs
Copy link
Contributor

Wizballs commented Nov 5, 2024

I don't think it's widespread use, first I've seen after all these years. Can you make any sense of this address in that same list? I'm trying to understand if the maintainer needs to make some updates since the really only a few idns in that list , and a couple that look like this:

d+D446own.chaojis0.com

@lynxthecat
Copy link
Owner

Does this issue still need to be worked on?

@friendly-bits
Copy link
Collaborator

@lynxthecat I think we haven't reached a conclusion on this one. I think it's a matter of policy. In other words, you are the boss, you decide :)

@lynxthecat
Copy link
Owner

So I have some different script behavior than OP. Running on the latest adblock-lean main (no modification), my test file with entry 華信金融.tw passes adblock-lean rogue check with no issues https://raw.githubusercontent.com/Wizballs/dnsmasqaddresstest/refs/heads/main/test

However, dnsmasq does not like these characters anyway. So that should pretty much be decision made as asian characters cause this error:

Error: The dnsmasq test on the final blocklist failed.
dnsmasq --test errors:
dnsmasq: error at line 15843 of stdin

Error: New blocklist file check failed.

Doesn't this answer it for us? Ultimately if dnsmasq is going to reject we're limited by that?

@friendly-bits
Copy link
Collaborator

Doesn't this answer it for us? Ultimately if dnsmasq is going to reject we're limited by that?

This definitely does mean that those lists can not be used as-is. See further discussion of implementing a conversion mechanism, which is what probably some other adblockers which do accept these lists do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants