Additional lints for the Python code base #4044

stinodego · 2022-07-17T06:25:40Z

As the project becomes more popular, we can expect more people to start contributing to the code base. Having a good linting setup will make sure our code quality remains consistently high, while aiding in the code review process. I outlined a number of tools/settings that I think will help. Suggestions are more than welcome.

flake8

Set max-line-length = 88
- We have a lot of unnecessary inconsistency regarding line lengths in the code. ~~Enable Black preview #4041 helps address this for the code itself.~~ There are many strings/comments/docstrings that could easily fit within 88 characters, but are now going over this limit for no reason. These should be fixed. Exceptions like these can be ignored on a per-case basis using # noqa: E501.

flake8 plugins

flake8 has a rich plugin ecosystem with additional lints that can help keep your code clean. They can be enabled simply by adding them to our build requirements. Using these, flake8 becomes more like the programming buddy that cargo is for Rust. Below is a list that I recommend (loosely in order of importance):

All of the following find legitimate issues in the existing code base:

flake8-docstrings - Enforce docstring uniformity.
flake8-bugbear - Finds possible bugs.
flake8-comprehensions - Helps optimize list/set/dict comprehensions.
flake8-simplify - Helps simplify some code patterns.
~~pep8-naming - Helps enforce PEP8 naming conventions.~~ Not worth it, finds a single error that is questionable.
flake8-tidy-imports - Helps keep imports clean
~~flake8-type-checking - Enforces the use of if TYPE_CHECKING: blocks in order to minimize import overhead.~~ Cannot use this right now due to requirement of Python 3.8 and up.

We will skip the flake8 lints below for now. They have minimal impact.

The following should be nice to enforce, but we are currently compliant:

~~flake8-typing-imports - Makes sure your typing imports (i.e. Literal) are valid in your supported Python versions.~~
~~flake8-broken-line - Avoid backslash line breaks.~~

The following I am not sure about, but might be useful:

~~flake8-use-fstring - Force the use of f-strings for formatted strings. Should be nice, but seems to have some false positives for our code base.~~
~~flake8-annotations - Helps set certain requirements for type hints. Not sure how well this complements mypy.~~
~~flake8-eradicate - Helps identify and remove commented out code. The idea is nice, but we have some code commented out on purpose, so there might be false positives.~~

mypy

I would like to set strict = True for mypy in order to improve reliability and quality of our type hints. This currently produces 1157 errors in 38 files. The strict flag is a combination of multiple strictness-related flags. I recommend we enable these one-by-one and fix the related errors.

Other helpful CLI tools

These can be added as additional commands in the CI pipeline.

pyupgrade - Makes sure to use that we're using the latest language features.
~~pycln - Automatically detect and remove unused imports. Has functionality for identifying side effects (like the pyarrow imports).~~ Not worth it for the small benefit it brings. flake8 will catch unused imports; fix them manually.
~~yesqa - Functions like mypy's warn-unused-ignores. It makes sure all the # noqa comments are actually necessary.~~ Not worth incorporating in the CI right now.

The text was updated successfully, but these errors were encountered:

zundertj · 2022-07-17T11:00:20Z

This is a good plan, and happy to help out on the mypy side.

I am not too familiar with most of the flake8 plugins. My only concern with adding these would be that if some plugins add a number of false positives, there is quite some additional work to opt-out particular lines of code again, increasing the developer burden. Plugins should also be actively maintained of course (for when we are upgrading to newer Python versions, etc).

yesqa also seems useful, note that there is an open feature request in flake8 for this functionality: PyCQA/flake8#603 But given that has been open for some time, using yesqa seems sensible.

stinodego · 2022-07-17T11:22:32Z

This is a good plan, and happy to help out on the mypy side.

Great! I opened a PR to prepare the incremental adoption of mypy strictness features.

I am not too familiar with most of the flake8 plugins. My only concern with adding these would be that if some plugins add a number of false positives, there is quite some additional work to opt-out particular lines of code again, increasing the developer burden. Plugins should also be actively maintained of course (for when we are upgrading to newer Python versions, etc).

Having a more opinionated formatter (in the form of these plugins) can be a great help, but we should indeed be mindful of selecting plugins that work for this code base (i.e. do not throw a lot of false positives). I think we should just try out a few of these, see what the effect would be on the code base to satisfy them, and then decide. I'll gladly create a PR for each of these.

With regards to being actively maintained, I believe these plugins are mostly modern and wisely used. Let's indeed make sure we're happy about the state before adopting each plugin.

alexander-beedie · 2022-07-18T05:48:04Z

@stinodego; good stuff - the only thing I'd really disagree with here is Set max-line-length = 88.
This is just waaaay too short for the modern era of widescreen high-resolution displays.

Recent projects I've worked on that enforce line-length (typically also via black if in python) have set it to between 100-120, which has proved to be a decent compromise between the 1980s expectation of working in a green-screen 320×200 CGA terminal and the 2020s reality of retina screens in your pocket and 4K, 5K, 6K (!) monitors. And if 120 is a bridge too far for some, let's at least go for 100, heh.

FYI: 100 is what the linux kernel set as their new max back in 2020, up from 80. You will probably not be surprised that this update came along with some fruity commentary from Linus, heh: "I do not care about somebody with an 80x25 terminal window getting line wrapping ... For exactly the same reason I find it completely irrelevant if somebody says that their kernel compile takes 10 hours because they are doing kernel development on a Raspberry PI with 4GB of RAM."

For the curious - the typical/historic 80 char line width originated from IBM punch-cards - not a joke - from which early terminals/teletypes/etc took their cues ;)

(132 was also a popular line length at one point, due to it being the max width for later terminals...)

stinodego · 2022-07-18T08:23:37Z

the only thing I'd really disagree with here is Set max-line-length = 88.

I definitely see your point. Most work projects I do are set to 100 characters, although personally I prefer to stick to black's default 88. What I don't like about longer lines is that they tend to wrap when I have two files open side by side, like when reviewing code or writing tests.

What mostly triggered me here is the fact that black and flake8 are misaligned: we get 88 characters of code through black, and then some docstrings/comments are nicely wrapped at 88 characters, others are a bit more inconsistent, and then others are waaaay longer. Plus it looks weird when the code is wrapped at 88 but the docstrings are much longer.

I am fine with setting line length to 100 for black, and then matching it with flake8.

@ritchie46, do you have a preference here?

PS: Awesome picture, thanks for the nice historical reference 😄

ritchie46 · 2022-07-18T11:38:06Z

I like the current line lengths. In my setup I have half my screen minus file tree for code and I almost never have to scroll horizontally this way.

stinodego · 2022-07-18T20:23:29Z

I like the current line lengths.

In that case, I propose we stick to 88 characters. We can follow the advice outlined here though, where black suggests to set flake8 to 88 characters, but allow some slack on the enforcement of the 88 characters. This is done by replacing flake8's E501 with flake8-bugbear's B950.

I think that could be a nice compromise! Then the guideline is clear, but we have some wiggle room.

zundertj · 2022-07-20T20:48:25Z

Has anyone here used pylint before, and a good idea of added value over flake8? I have tried it on our code base for a couple files, see #4100. There were definitely some useful suggestions, as it seems to be much stricter than flake8, which has its pro's and cons.

cnpryer · 2022-07-21T18:19:52Z

I get that shorter line lengths mean less than optimal vertical organization, and maybe I missed it in reading through here, but why not follow PEP with 79?

stinodego · 2022-07-23T21:08:57Z

I get that shorter line lengths mean less than optimal vertical organization, and maybe I missed it in reading through here, but why not follow PEP with 79?

Reading this explanation by black on why they default to 88 characters should tell you all you need to know. The video they link to is very inspirational, as well.

When we use try-except in our code base, often we raise a new exception to replace the original exception. It is good practice to either wrap around the original exception (`from exc`), or bin the original exception (`from None`) if the new exception is a self-contained description of what happened. The latter is for instance the case when we wrap around a dict, catch the KeyError, and provide more meaningful Exception with message, as the `KeyError` is not what the user is after, it wants to know that dtype1 cannot be converted to dtype2, etc. Without re-raise, you will see `During handling of the above exception, another exception occurred`, which means the two exceptions, original and new, are treated unrelated. With `from exc`, you will see `The above exception was the direct cause of the following exception:`. With `from None`, you will get a clean traceback without the original exception. Found all cases using pylint (`pylint polars --disable=all --enable=raise-missing-from`). Discussion on the rollout of pylint in CI takes place in pola-rs#4044. Background reading: https://stefan.sofa-rockers.org/2020/10/28/raise-from/

stinodego · 2022-07-24T16:48:14Z

Has anyone here used pylint before, and a good idea of added value over flake8? I have tried it on our code base for a couple files, see #4100. There were definitely some useful suggestions, as it seems to be much stricter than flake8, which has its pro's and cons.

I admit that I have limited experience with pylint, but here's my hot take:

I believe most open source projects use flake8 over pylint (and I have absolutely 0 numbers to back up that statement 😄). The reason is that pylint has a lot of lints that are very subjective and debatable. You can already see it in your PR #4100 and the reaction by @ghuls (which I agree with). So it is not necessarily more strict, it's just more opinionated.

And it is opinionated by default, so you will have to spend time configuring it to your liking. Which is fine, but the result is that every contributor has to get used to your subjective way of doing things. Which might be different from the other projects they are working on.

flake8 is super basic by default. It just makes sure you're mostly following pep8. And then you can extend it from there by making it more strict. That makes it more suitable for open source repos, in my opinion.

I would be in favor of sticking to flake8 and the corresponding ecosystem, for now at least.

* Improve re-raises of Exceptions When we use try-except in our code base, often we raise a new exception to replace the original exception. It is good practice to either wrap around the original exception (`from exc`), or bin the original exception (`from None`) if the new exception is a self-contained description of what happened. The latter is for instance the case when we wrap around a dict, catch the KeyError, and provide more meaningful Exception with message, as the `KeyError` is not what the user is after, it wants to know that dtype1 cannot be converted to dtype2, etc. Without re-raise, you will see `During handling of the above exception, another exception occurred`, which means the two exceptions, original and new, are treated unrelated. With `from exc`, you will see `The above exception was the direct cause of the following exception:`. With `from None`, you will get a clean traceback without the original exception. Found all cases using pylint (`pylint polars --disable=all --enable=raise-missing-from`). Discussion on the rollout of pylint in CI takes place in #4044. Background reading: https://stefan.sofa-rockers.org/2020/10/28/raise-from/ * Remove test_example

cjermain · 2022-07-28T11:27:30Z

I agree with @stinodego. In my experience flake8 is faster to run, which helps with iteration and build times.

ryanrussell · 2022-07-30T20:36:15Z

A little late to this conversation...

I've had great success with pre-commit / autoflake/ isort / black. Here's a sample YAML...

Somewhat related, but maybe out of scope, I think the guys over at Vector have a beautiful commit history.

I added a conventional commits enforcement, which you can see a sample of at https://github.com/dragonflydb/dragonfly/blob/main/contrib/scripts/conventional-commits

Happy to help if it's wanted.

stinodego · 2022-07-31T04:35:14Z

Hi @ryanrussell ! I have talked to @ritchie46 before about using pre-commit, and he's not a fan. So, long story short, this will not happen in the near future.

I definitely see the value of pre-commit: it has some nice lints, and in the YAML it's easy to quickly see the lints you need to conform to. There is a reason many open source projects use pre-commit.

Then again, there are some lints that need to run outside of pre-commit. mypy is one of them, due to some specifics on how the pre-commit environment works (it won't find all errors). I am not sure of the Rust side of things (does pre-commit have miri?). So if we need a make command anyway that runs various linting tools, then what is really the added value of pre-commit?

The conventional commits enforcement looks cool, but this project always does squash & merge in the GitHub UI, so I'm not sure how that would work.

So for now, we're sticking with the workflow where you just run a make command to make sure your stuff lints before you open a PR. And if you forget, it'll get caught by the CI, and you'll have to fix it afterwards.

zundertj · 2022-07-31T18:32:19Z

I believe most open source projects use flake8 over pylint (and I have absolutely 0 numbers to back up that statement 😄). The reason is that pylint has a lot of lints that are very subjective and debatable. You can already see it in your PR #4100 and the reaction by @ghuls (which I agree with). So it is not necessarily more strict, it's just more opinionated.

And it is opinionated by default, so you will have to spend time configuring it to your liking. Which is fine, but the result is that every contributor has to get used to your subjective way of doing things. Which might be different from the other projects they are working on.

Agreed on being less used and very opinionated (and with @ghuls on the if-else), and it is more a system of opt-out, where you turn off stuff you don't like. Rather than the other way around.

flake8 is super basic by default. It just makes sure you're mostly following pep8. And then you can extend it from there by making it more strict. That makes it more suitable for open source repos, in my opinion.

I would be in favor of sticking to flake8 and the corresponding ecosystem, for now at least.

Agreed, but I do wonder whether there is a middle ground, as there is plenty of stuff that flake8 misses out on by default (see some of the PR's you are raising currently), which I think most would agree is sensible. But then again, that may be highly subjective, and I am only hoping there to be something that doesn't exist.

I am a little wary of us ending up having to fight multiple extensions breaking on each new Python release. But maybe that is just me being conservative with adding dependencies, and we should just add them liberally and remove if the extension becomes outdated and/or a burden to fix.

stinodego · 2022-08-02T19:19:27Z

[...] we should just add them liberally and remove if the extension becomes outdated and/or a burden to fix.

That's the way I look at it personally! We can add a lint if it helps us fix some issues in the code base. And if it causes any friction, we'll just get rid of it. No harm done.

You did get me interested in diving more into pylint, so I might backtrack on this. But as I said, let's stick with flake8 for now!

stinodego · 2022-08-14T15:30:42Z

I have looked into setting mypy's warn-return-any. This one throws a lot of errors, because all the Rust bindings we use return the Any type.

As an example, a simple method like DataFrame.estimated_size gives problems. Consider the following example:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

reveal_type(df._df.estimated_size())  # Any
reveal_type(df.estimated_size())  # builtins.int

The backend call returns an Any type, as it is not typed to Python standards. We return the result as an int, hence the mypy error.

This happens many, many times in the code base. I don't feel it really improves the code quality if we litter our codebase with # type: ignore statements.

Curious to hear your thoughts on this, @matteosantama and @zundertj !

matteosantama · 2022-08-14T15:47:32Z

Yes, I noticed the same thing. I still think it is a worthwhile endeavor, but we should use cast instead of type: ignore.

There seems to be some movement on the PyO3 side to natively support the generation of stub files (see PyO3/pyo3#510 and PyO3/pyo3#2454). If we use cast on our side, then when this feature becomes available mypy will either warn us of a redundant cast OR an invalid cast (attempting to cast an int to a str, for example). If we instead use type: ignore instead, mypy will warn us about unnecessary ignore comments, but any mismatches will be silenced.

matteosantama · 2022-08-14T15:50:32Z

The PyO3 documentation suggests manually creating .pyi files, but I don't think this is necessary for our use-case. All our Python-to-Rust calls are hidden internally, so maintaining a .pyi file doesn't have much benefit over individual casts (besides maybe centralizing the annotations).

stinodego · 2022-08-14T16:08:26Z

Most of our code is just an adapter on the underlying PyObject.

def estimated_size(self) -> int:
    return self._df.estimated_size()

Do you want to rewrite all such functions to the version below?

def estimated_size(self) -> int:
    return cast(int, self._df.estimated_size())

That just makes things more unclear, in my opinion. In which way would this be worthwhile?

ritchie46 · 2022-08-14T16:18:13Z

Most of our code is just an adapter on the underlying PyObject.
def estimated_size(self) -> int:
    return self._df.estimated_size()
Do you want to rewrite all such functions to the version below?
def estimated_size(self) -> int:
    return cast(int, self._df.estimated_size())
That just makes things more unclear, in my opinion. In which way would this be worthwhile?

I agree.. this seems like a lot of clutter to me.

How much lints have we that are not inner binary polars related?

zundertj · 2022-08-14T16:22:10Z

@stinodego : didn't even see this until now, guess we figured out the problem with turning on this warning at the same time :)

zundertj · 2022-08-14T16:24:14Z

The PyO3 documentation suggests manually creating .pyi files, but I don't think this is necessary for our use-case. All our Python-to-Rust calls are hidden internally, so maintaining a .pyi file doesn't have much benefit over individual casts (besides maybe centralizing the annotations).

As I also commented in PR #4415, this amounts to duplicating almost all type annotations, as our Python api is a thin wrapper around PyO3 to start with. I.e, our api is currently already mostly type annotations, and some minimal conversion code in places. I don't think we should add another layer that is just the type annotations.

matteosantama · 2022-08-17T23:59:59Z

Perhaps this is tangential to the Python lints, but within py-polars we house several Rust files. As part of the CI we run cargo clippy, which is a CLI tool similar to flake8, and there are currently 110 warnings.

We could clean up the warnings and instead run cargo clippy -- -D warnings, which fails when a warning is encountered. Do people think this is a worthwhile endeavor?

stinodego · 2022-08-18T14:58:26Z

Perhaps this is tangential to the Python lints, but within py-polars we house several Rust files. As part of the CI we run cargo clippy, which is a CLI tool similar to flake8, and there are currently 110 warnings.

We could clean up the warnings and instead run cargo clippy -- -D warnings, which fails when a warning is encountered. Do people think this is a worthwhile endeavor?

I'd be very much in favor. Cleaning up the Rust-Python bindings was on my to do (dispatching stuff to expressions was part of this effort). Conforming to the clippy linter would be a good step to take!

ritchie46 · 2022-08-18T15:01:11Z

Yes, this is something we must certainly do. It has been on my todo list very long. 😅

ghuls · 2022-08-18T15:30:32Z

Looking at it now.

ghuls · 2022-08-18T16:25:09Z

Fixes some: #4486

stinodego · 2022-08-18T22:06:09Z

Fixes some: #4486

Those "borrow_deref_ref" warnings are weird - I don't really see how we're even doing that? They're just a simple &str a lot of the time 😕

ritchie46 · 2022-08-19T07:41:31Z

Could be a bug/false positive in clippy. Had that already quite a few times. I shall take a look first.

Edit: Might also be the cause because we have wrapped in #[pymethods] which is a proc macro, so the actual code will be different. We need to expand the macros for that to check.

matteosantama · 2022-08-19T16:25:34Z

re: borrow_deref_ref seems to be a known issue

EDIT: might be fixed in upcoming PyO3 release

stinodego · 2022-09-18T22:58:24Z

I think we're good on Python lints for now. Closing this issue. Feel free to add new lint ideas here or open a new issue.

stinodego added the feature label Jul 17, 2022

zundertj added the python Related to Python Polars label Jul 17, 2022

stinodego mentioned this issue Jul 17, 2022

Update mypy settings #4049

Merged

This was referenced Jul 17, 2022

Disallow generics in mypy typing check #4052

Closed

Disallow untyped calls in mypy check #4053

Merged

Mypy warn redundant casts #4055

Merged

chore[python]: Enforce strict type equality in mypy check #4058

Merged

stinodego mentioned this issue Jul 18, 2022

[flake8] Enable flake8-bugbear extension #4073

Merged

zundertj mentioned this issue Jul 24, 2022

Improve re-raises of Exceptions #4142

Merged

stinodego mentioned this issue Jul 24, 2022

Add pylint #4100

Closed

This was referenced Jul 24, 2022

Wrap long strings #4144

Merged

Fix Python line lengths to 88 characters #4152

Merged

stinodego mentioned this issue Jul 31, 2022

Add flake8 extension: comprehensions #4200

Merged

matteosantama mentioned this issue Aug 1, 2022

Set no_implicit_reexport = true in pyproject.toml #4211

Merged

This was referenced Aug 3, 2022

Add flake8-tidy-imports extension #4229

Merged

Add flake8-type-checking extension #4230

Closed

This was referenced Aug 4, 2022

LazyFrame should not be generic #4256

Closed

Turn on disallow-any-generics #4320

Merged

stinodego mentioned this issue Aug 11, 2022

Add pyupgrade #4369

Merged

zundertj mentioned this issue Aug 14, 2022

chore[python]: Turn on mypy strict mode #4415

Merged

stinodego closed this as completed Sep 18, 2022

stinodego mentioned this issue Sep 18, 2022

chore[python]: Apply yesqa #4899

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Additional lints for the Python code base #4044

Additional lints for the Python code base #4044

stinodego commented Jul 17, 2022 •

edited

Loading

zundertj commented Jul 17, 2022

stinodego commented Jul 17, 2022

alexander-beedie commented Jul 18, 2022 •

edited

Loading

stinodego commented Jul 18, 2022 •

edited

Loading

ritchie46 commented Jul 18, 2022

stinodego commented Jul 18, 2022 •

edited

Loading

zundertj commented Jul 20, 2022 •

edited

Loading

cnpryer commented Jul 21, 2022

stinodego commented Jul 23, 2022 •

edited

Loading

stinodego commented Jul 24, 2022

cjermain commented Jul 28, 2022

ryanrussell commented Jul 30, 2022 •

edited

Loading

stinodego commented Jul 31, 2022 •

edited

Loading

zundertj commented Jul 31, 2022

stinodego commented Aug 2, 2022

stinodego commented Aug 14, 2022

matteosantama commented Aug 14, 2022

matteosantama commented Aug 14, 2022

stinodego commented Aug 14, 2022

ritchie46 commented Aug 14, 2022

zundertj commented Aug 14, 2022

zundertj commented Aug 14, 2022 •

edited

Loading

matteosantama commented Aug 17, 2022

stinodego commented Aug 18, 2022

ritchie46 commented Aug 18, 2022

ghuls commented Aug 18, 2022

ghuls commented Aug 18, 2022

stinodego commented Aug 18, 2022

ritchie46 commented Aug 19, 2022 •

edited

Loading

matteosantama commented Aug 19, 2022 •

edited

Loading

stinodego commented Sep 18, 2022

Additional lints for the Python code base #4044

Additional lints for the Python code base #4044

Comments

stinodego commented Jul 17, 2022 • edited Loading

flake8

flake8 plugins

mypy

Other helpful CLI tools

zundertj commented Jul 17, 2022

stinodego commented Jul 17, 2022

alexander-beedie commented Jul 18, 2022 • edited Loading

stinodego commented Jul 18, 2022 • edited Loading

ritchie46 commented Jul 18, 2022

stinodego commented Jul 18, 2022 • edited Loading

zundertj commented Jul 20, 2022 • edited Loading

cnpryer commented Jul 21, 2022

stinodego commented Jul 23, 2022 • edited Loading

stinodego commented Jul 24, 2022

cjermain commented Jul 28, 2022

ryanrussell commented Jul 30, 2022 • edited Loading

stinodego commented Jul 31, 2022 • edited Loading

zundertj commented Jul 31, 2022

stinodego commented Aug 2, 2022

stinodego commented Aug 14, 2022

matteosantama commented Aug 14, 2022

matteosantama commented Aug 14, 2022

stinodego commented Aug 14, 2022

ritchie46 commented Aug 14, 2022

zundertj commented Aug 14, 2022

zundertj commented Aug 14, 2022 • edited Loading

matteosantama commented Aug 17, 2022

stinodego commented Aug 18, 2022

ritchie46 commented Aug 18, 2022

ghuls commented Aug 18, 2022

ghuls commented Aug 18, 2022

stinodego commented Aug 18, 2022

ritchie46 commented Aug 19, 2022 • edited Loading

matteosantama commented Aug 19, 2022 • edited Loading

stinodego commented Sep 18, 2022

stinodego commented Jul 17, 2022 •

edited

Loading

alexander-beedie commented Jul 18, 2022 •

edited

Loading

stinodego commented Jul 18, 2022 •

edited

Loading

stinodego commented Jul 18, 2022 •

edited

Loading

zundertj commented Jul 20, 2022 •

edited

Loading

stinodego commented Jul 23, 2022 •

edited

Loading

ryanrussell commented Jul 30, 2022 •

edited

Loading

stinodego commented Jul 31, 2022 •

edited

Loading

zundertj commented Aug 14, 2022 •

edited

Loading

ritchie46 commented Aug 19, 2022 •

edited

Loading

matteosantama commented Aug 19, 2022 •

edited

Loading