-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible to distinguish between Sequence[str]/Iterable[str] and str? #256
Comments
Since |
I think type should never lie, even if it is a white lie. Either we should remove Or we should have a special type name for "iterable of strings that is not a string". Strings are already special, as AnyStr shows. But although AnyStr is able to be represented using more primitive operations, I think it's too early to introduce a "type difference" operation in general. E.g. co(ntra)variance seems weird in that case. |
The problem I have with allowing People can over-specify their APIs by requiring Random thought: Would it be possible for our "magic" Text type to lose it's |
It requires more work on the part of API authors, but one option that might be less of a lie is to be able to delete an overload. C++ has a similar problem, where a type being passed in might "work" but you want to forbid it. So they implemented a special overload that, if matched, causes an error. See e.g. this SO thread. Then one could define the API for |
Mypy doesn't currently have a way to remove methods in a subclass, because it would fail Liskov. But there's a hack possible. You can change the signature of a method override in a way that violates Liskov, and then add a class Text(Sequence[str]):
def __iter__(self) -> None: ... # type: ignore |
It actually doesn't work. Because currently there is a rule in mypy: "nominal first" (for various important reasons), if something works using nominal subtyping, then mypy just uses it. In this case |
Hm... Maybe Text could be a Protocol that has the same methods as Sequence except for one? |
Maybe, TBH I am still not sure what are the costs/benefits here. I am afraid making such big changes in typeshed can break many existing code. But on the other hand if someone wants to do this "locally" it should be a fine solution. |
A relatively simple approach would be to special case |
If we assume the type checker has reasonable good dead code analysis capabilities, we could get a solution that's pretty similar to the one C++ has for free by combining
That said, idk if any type checkers actually do handle this case gracefully. Mypy, for example, will just silently ignore the last Maybe to help this analysis, we could add some sort of It's not a perfect solution since there's still no definitive way of telling if an |
Given the norm for most APIs is to accept the iterable and never want plain |
I think we're trying to expand type hints beyond their original purpose, and it shows. |
Unfortunately this would make |
I like the idea of special-casing strings in the tool rather than in the type system, since as @gvanrossum notes, str is an iterable of str (turtles all the way!). Also this sort of type-aware linting is a neat idea, and could be done relatively easily within the typechecker because we have all the information at hand. |
Do we? When I see a function that takes an |
I was thinking always excluded; I've run into problems in both python and other languages where a function expecting an iterable was passed a string, and never (that I can think of) actually wanted a generic iterable to treat a string as an iterable of chars. That should hold even more strongly if the function specifies Iterable[str]; it is a good hint that str is being viewed as an atomic type there. |
Hm, I guess you could add it back explicitly by saying Would this extend to e.g. |
I think so, yes; I want to say that str|bytes|unicode should not satisfy Iterable[anything] if the flag is passed in. |
Yes. We have seen this specific bug multiple independent times at work. Unfortunately more than once after deployment in production. I consider it a motivating anti-pattern for a type checker to help avoid. No other tool can validate this, it requires type information. Lets not be purists here. It improves developer productivity and code maintainability to flag this and we have a way to explicitly annotate the less common APIs that want to accept both. :) Requiring such APIs to specify |
I recall about how Rob Pike (who famously has just 'r' as his username) once got spammed when some script that sent email invoked an email-sending API with a single email address instead of a list. |
If we're going to go EIBTI route, why not be explicit where it counts?
It would also help in distinguishing iterating through combined characters (graphemes), and be almost analogous to iterating through words with .split() and lines with .splitlines(). While we're at it, I would be very happy with Yes, I know what the response is going to be. But if we really don't want to change the language, maybe it really is not the problem of the language as a whole, but of a specific API. And there I don't see any problem with writing
Again, it's not the type that's wrong (although you can raise TypeError above if you want:). Analogy: there are many functions that declaratively accept int, but in fact work only with nonnegative numbers. In fact, I think there are more such functions than the ones that work out of the box with negative integers. Are we going to redefine that an annotation |
I found this thread because I am looking for a way to annotate some code like below:
Currently, mypy (v0.730) gives Not sure if anyone suggested this before, perhaps we can add a "negative" or "difference" type. Similar to Having the
|
I ended up here looking for a way to handle a case almost identical to the above, trying to specify different overloads for def foo(value):
if isinstance(value, str):
return len(value)
return [len(x) for x in value] How can I annotate such a function such that foo("abc") + 1
max(foo(["a", "bcdefg"])) are both valid? As far as I can tell, I have to give up and say It's worth noting explicitly that this is distinct from the case in which we want to write def bar(xs: Sequence[str]):
return max([len(x) for x in xs] or [0]) and forbid I'm not trying to use type checking to forbid using a string -- I'm trying to correctly describe how the types of arguments map to the types of potential return values. Generalizing beyond strings, it seems like what's wanted is a way of excluding a type from an annotation which would otherwise cover it. I'd love to see def foo(value: str) -> int: ...
def foo(value: Excluding[str, Sequence[str]]) -> List[int]: ... or even for this to be deduced from overloads based on their ordering: def foo(value: str) -> int: ...
def foo(value: Sequence[str]) -> List[int]: ... with the meaning that the first annotation takes precedence. |
I've also run into this issue. In my case, I have functions that iterate over the given parameter assuming said parameter is a List or Tuple of strings. If a string were passed in by accident, things will break. Ex: def check_users_exist(users: Union[List[str], Tuple[str]]) -> Tuple[bool, List[Tuple[str, int]]]:
results = []
for user in users:
results.append((user, request.urlopen(f'{addr}/users/{user}').status)) # This breaks if `users` is eg. "username", as we ask for `u`, `s`, etc.
return all([199 < x[1] < 300 for x in results]), results
# Granted, an `isinstance` check would save us here, but it would be great if the type checker could alert us, too. The annotation I'm using, It's possible to hide the problem with a type alias, I suppose, eg. Maybe, since it doesn't seem possible to differentiate Unfortunately, I'm not knowledgeable enough to know if this isn't a dumb idea. I just know it would make my life a little easier if it existed 😃 |
I'm yet another user overspecializing my type annotations to avoid this bug because I was using the more generic |
Why wouldn't using the new iterable class be an opt in? I am not proposing changing Iterable in general. |
No, you'd have to convince all the people who currently diehard insist on annotating their own API as ... switch from |
Sorry, but what does this have to do with my comment? In this thread, my suggestion was changing
Then the benefit would be really small since almost no one is going to opt in. In particular, generic parameters |
Your suggestion would seem to be different from your comment, then. (In fact I don't even think I've seen you make a suggestion here but maybe it was from back in the 2+ years range which I didn't fully read/remember). The comment I replied to was your response to @spacether. The suggestion of @spacether which you were providing commentary on, was NOT about changing what str inherits from. I believe that @spacether's suggestion would be very suitable for my use cases. In fact, it's even been suggested before somewhere in one of the various GitHub issues on this topic spread across multiple repos. I can't remember which issue though. :( The core issue here is that semantically it is "correct" for str to point out that it is, in fact, capable of iteration, and for functions that accept iterables and can validly do something with them, to accept a str. It's valid to do But for third-party APIs belonging to a personal project, there are often functions that aren't generic language primitives like list() and actually expect to receive an iterable of, say, filenames. There's no type for a string whose contents are a valid filename. What you get is an iterable of strings. This API needs a way to type itself such that it accepts the things it accepts and doesn't accept the things it doesn't accept. We don't need to change the definition of a string, but we do need to have a type annotation that says "accept any iterable of strings, as long as it isn't an instance of str". The suggestion by @spacether (and others in the past) could be one way to achieve that goal. Changing what a str is doesn't really help, because it's been repeatedly shot down as "not technically correct" (and the maintainers are correct in that analysis, it is not technically correct, so it's difficult to argue that they should do the wrong thing just because it's convenient for me). Changing what a str is, just leads to people repeatedly arguing back and forth about whether it is technically correct, whether it's practically correct, whether people who need actual generic iterable protocols need to change their correct annotations, etc. In practice, people who don't want to annotate sequences of str and get bugs when they get a str, are doing one of two things:
Both of these groups are concerned exclusively with their own code. Both of these groups would be happy to change their own code which is non-portable (fails mypy) / buggy (variance problems hacked around with casting or whatnot), if it means an end to the problem. Offering brand new and never before seen types that are a figment of the type checker's imagination, like NonTextSequence or choice of bikeshedded names (ContainerizedHolderOfDiscreteCovariantOriginalAndUnmodifiedElements?) seems like a win all around. Iterables/sequences are still technically correct, and my code is also technically correct because it annotates that it accepts sequences as long as they are the particular subcategory of sequences that don't generate entirely new elements by running a chunking algorithm with length 1 against the element I did give. No one has to be convinced to do anything they don't want to, because nothing changed. Only new stuff got added. All code in the world, anywhere and everywhere, still means exactly what it did beforehand... but now people can describe new concepts they couldn't describe before, if they want to. |
Is it possible to bind a typevar to subset classes of Sequence?, like str, and typing.List[str].
|
My suggestion is linked in my comment. The discussion is extensive in case you're curious.
Right, I was providing an alternative solution that has a different set of costs and benefits.
I think the core problem that we're solving is strings being improperly used as sequences. Any solution has to balance:
There is also a question of code elegance:
I think you should read the entire linked thread. Nothing is "out of the question" because it's "not technically correct". Type checkers today flag a huge amount of "technically correct" code.
Even if you're concerned with only your code, you can still accidentally pass strings to interfaces that expect sequences. The And since this whole process is opt-in, it means that you have to convince all of the maintainers to update their code. This is trickier than it seems because of generic functions that accept things like
I think the best way to make a convincing case of one solution over another would be to use mypy-primer with different solutions and estimate A and B. For the For the non-sequence-str, count up all the errors it catches (B), and all of the places that you need to access a string as a sequence (using a property like |
Then the suggestion wasn't made here, no wonder I was confused. :/ (Also sorry but I find the Discourse software exceedingly unpleasant to try to read or interact with. There are many forum software packages that are far nicer, and none that are worse, from a user interaction perspective. I'm disinclined to subject myself to experiences I find painful. Maybe you can summarize that thread here?)
Again, I don't know what happens on some linked thread elsewhere, but in the family of GitHub issues in which this Sequence[str] topic was discussed there has been considerable resistance to making str falsely claim that it cannot be iterated over, or annotating the list function as not accepting What code is being flagged as wrong despite being technically correct at the type theory level? Certainly, lots of code is being flagged as wrong because its types don't match up, even though the code works as expected... that's not what "technically correct typing" means.
I don't know what this means. If interfaces expect sequences and can do something meaningful with a sequence of type str that gets unpacked to a progression of single characters, then it's the callee that should be deciding this, whether that means annotating as What does keeping A small mean? The callee wants to keep bad code that performs incorrectly, mypy-passing, and therefore chooses to exclude useful information from its API contract because ? This doesn't sound like a convincing point of discussion...
In general, asking people whose code is currently incorrect (read: failing to correctly communicate what typed inputs are valid) to change, results in fewer people with ruffled feathers than asking people whose code is currently correct and are expecting the documented behavior to work the way it does, to change. Also since you mention below "using a property like s.chars" I assume that this opt out solution also depends on runtime changes, which is another hard sell, because it won't work until projects can start requiring python 3.13 as a minimum version, you can't import new runtime properties of the str type from typing_extensions after all. That means I, personally, cannot benefit from said change until 2028, when the unreleased python 3.12 is both released and EOL. As for generic functions, those don't need changing at all -- e.g. list() wants to communicate to its callers that they can do There's no churn anyway -- for an opt-in change, the size of A is 0.
I'm extremely aware that in FOSS the best way to convince maintainers is to write the patch yourself and demo it in action. Thanks for pointing that out. If I had any knowledge of type checkers other than running them, I'd certainly be inclined to try just that! But I don't, and it's an intimidating topic to me. Although since this logic works equally well to make a convincing case for one solution without looking at other solutions, I'm curious -- have you followed that advice and implemented the solution you suggested? If so, I'd be very curious to see the patches and mypy_primer results. ;) Since it appears to require runtime modifications to builtin types in python itself, I assume your patches include patches for mypy, typeshed, and also CPython. Very impressive. :) |
I realise that the status quo here is frustrating for many, and that a lot of people care deeply about improving this situation. Let's just all remember to give each other the benefit of the doubt in this discussion and avoid sarcasm wherever possible. If the tone of this discussion continues to deteriorate, I'll have to lock this thread. |
Off the top of my head, decorators that implement the descriptor protocol and have a different behavior than the functions they decorate—are currently expected to produce a callable with the same behavior.
Yes, but the problem is that it catches few errors that way. (B is too small.)
It means keeping the amount of code changes that you need to make small.
That's the point: just because the code is valid Python, it doesn't mean that it's correct. That's what the Pytype people found: That when they narrowed the base classes, it caught coding errors.
Another option would be to allow a cast to get the old behavior.
The size of A varies with the number of times you have to change your code to opt in.
I did spend a couple hours playing with the mypy-primer, but it was harder than I expected. I don't remember what wasn't working.
I think you just need to change the typeshed, but it's possible that I would have needed to change MyPy as well. It's been a few months, and I don't remember.
You might try convincing the author of BasedMyPy to implement set difference. They have already done intersection (A & B), and logically set difference (A \ B) may be similar to implement. Then, you can try to sprinkle in some |
@twoertwein your solution works. Thanks for posting it. The working code is here: import typing
_T_co = typing.TypeVar("_T_co", covariant=True)
class Sequence(typing.Protocol[_T_co]):
"""
if a Protocol would define the interface of Sequence, this protocol
would NOT allow str/bytes as their __contains__ is incompatible with the definition in Sequence.
methods from: https://docs.python.org/3/library/collections.abc.html#collections.abc.Collection
"""
def __contains__(self, value: object, /) -> bool:
raise NotImplementedError
def __getitem__(self, index, /):
raise NotImplementedError
def __len__(self) -> int:
raise NotImplementedError
def __iter__(self) -> typing.Iterator[_T_co]:
raise NotImplementedError
def __reversed__(self, /) -> typing.Iterator[_T_co]:
raise NotImplementedError And it looks like bytes is also incompatible with it per: And the python interpreter imposes this behavior too. x in y where y is a str uses
and for bytes:
|
Based on the suggestion in this thread, I've added a |
We have experimented with the Overall, I really wish this could be reconsidered. In my opinion it feels like a mistake to typecheck The problem is that Technically these statements are true, but semantically most often not. The perhaps ~99% use case of a Regarding the rare use cases that actually want a sequence of characters: Since this can be easily solved explicitly via |
Hi.
|
This is a real pain and has been open since 2016. thanks |
(Async)Neo4jBookmarkManager's `initial_bookmarks` parameter, as well as `Bookmarks.from_raw_values` `values` parameter accept anything of type `Iterable[str]`. Unfortunately, `str` itself implements that type and yields the characters of the string. That most certainly not what the user intended. In an ideal world, we could tell the type checker to only accept `Iterable[str]` when *not* of type `str`. See also python/typing#256. To help users out, we now explicitly check for `isinstance(input, str)` and turn such inputs into an iterable with the input string as the only element.
@dmoisset hi, have you tried to push your suggestion? |
I think what users want is what was elaborated above of just codifying a special case for I realize this may be considered annoying from a type checker and theory purist implementation standpoint, but it is how most users should think of code marked IMNSHO, waiting for a general purpose generic expressible within typing syntax solution rather than this one practical special case does not feel like it is helping users. |
(Async)Neo4jBookmarkManager's `initial_bookmarks` parameter, as well as `Bookmarks.from_raw_values` `values` parameter accept anything of type `Iterable[str]`. Unfortunately, `str` itself implements that type and yields the characters of the string. That most certainly not what the user intended. In an ideal world, we could tell the type checker to only accept `Iterable[str]` when *not* of type `str`. See also python/typing#256. To help users out, we now explicitly check for `isinstance(input, str)` and turn such inputs into an iterable with the input string as the only element.
If a function expects an iterable of strings, is it possible to forbid passing in a string, since strings are iterable? Seems like there are many cases where this would be an error, but I don't see an obvious way to check 't','h','i','s'.
The text was updated successfully, but these errors were encountered: