Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open() utf-8 standardization #75

Open
munyanjacob opened this issue Dec 3, 2021 · 9 comments
Open

open() utf-8 standardization #75

munyanjacob opened this issue Dec 3, 2021 · 9 comments

Comments

@munyanjacob
Copy link
Contributor

Extending the discussion from #74 to a new issue:

I do think that testing to make sure people are using utf-8 makes sense, because I also think it's likely for locale encoder differences to be a future problem if locale encoders aren't ensured to be utf-8 or every open() is not specified to use utf-8.

Because the page linked said it would be best practice to specify, is this something we should update other instances of open() to have & require in the future? It sounds simpler to test for utf-8 once at the start, but best practices are best practices so idk which way to proceed

Originally posted by @munyanjacob in #74 (comment)

@jcushman
Copy link
Contributor

jcushman commented Dec 3, 2021

Yeah, it seems to be technically correct for every open in a cross-platform library to have an encoding, so changing other open calls seems fine.

I don't think (unless maybe PEP 597 lands) there's any way to automatically enforce that practice, so it's likely to stay more a fix-it-if-someone-complains thing in this and other projects. Which in turn means, I'm guessing, Windows python users must be mostly setting export PYTHONUTF8=1 to solve the problem, or using linux in WSL, because who wants to chase down a bunch of libraries to fix their open commands after every package update.

I don't think we (or any given library) should be testing the locale and complaining at the start. Maybe Python itself should change the default encoding and warn people about it, but users don't need random libraries complaining about their locale setting.

Oh, actually a larger fix might be to get the CI tests running on Windows -- I think Github Actions supports that.

@mlissner
Copy link
Member

mlissner commented Dec 3, 2021

I agree with everything Jack says, per usual.

The only other thought I have is that semgrep helps a lot with these kinds of things. They don't currently have a rule requiring an encoding parameter to open(), but I bet they'd be happy to have one. They have a bunch of rules categorized as "python/lang/best-practice" (including one to make sure open() commands are closed):

https://github.com/returntocorp/semgrep-rules/tree/develop/python/lang/best-practice

If you got such a rule put together in semgrep, we could turn it on for this repo (we use it elsewhere already), and it would throw an error in PR's that don't include the encoding parameter. That'd fix it forever. Including Windows in our CI would help too, like Jack says.

There's a guide here for creating new rules: https://semgrep.dev/docs/writing-rules/overview/

I contributed one a bit back, it was pretty painless. Yaml and Python.

@munyanjacob
Copy link
Contributor Author

Awesome, adding it to semgrep and including Windows in the CI both sound good.

Although the uncommon few who experience it can just do what Jack mentioned, for now I'll still take a look at adding the rule to semgrep.

I have been learning a lot just with this one issue - thank you both!

@mlissner
Copy link
Member

How did this go, @munyanjacob ? Any progress?

@munyanjacob
Copy link
Contributor Author

I finished writing the rule but have not been able to commit it (appears to also be because of a Windows issue, this time with Windows marking files as executable and pre-commit requiring they have shebangs).

Currently downloading WSL to see if using that can help. Windows does not feel good.

@mlissner
Copy link
Member

Windows, as you're learning, may be an uphill fight. :)

Usually, pre-commit will fix most errors for you. I usually run it once (it fails and fixes things), and then I run it again (and it passes).

@munyanjacob
Copy link
Contributor Author

munyanjacob commented Dec 17, 2021

The rule just merged but with severity of warning instead of error. Is a warning going to be enough for requiring an encoding in new PRs?

@munyanjacob
Copy link
Contributor Author

Windows, as you're learning, may be an uphill fight. :)

Do you guys generally run Linux, Mac, or WSL? I knew it was common for people to use linux/mac for dev, but I thought there would be more Windows users running into problems like mine lol

@mlissner
Copy link
Member

I think the severity should be enough. I just enabled semgrep on this repo, so I think we'll see as soon as somebody does a PR with a bad open in it. Good stuff. I also enabled it for our other repos that use semgrep.

I run Linux. I don't know what Jack's using. Nobody around here uses Windows that I have noticed in awhile. I usually only find out they do because something is broken for them (like this). I like Ubuntu, FWIW!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants