Syntax for typing multi-dimensional arrays #516

Open
shoyer opened this issue Dec 10, 2017 · 8 comments
Labels
topic: feature Discussions about new features for Python's type annotations

Comments

@shoyer

shoyer commented Dec 10, 2017

As part of the larger project for multi-dimensional arrays (#513), one of the first questions I would like to settle is what syntax for typing data-types and shapes should look like.

Both dtype and shape should be optional, and it should be possible to define multi-dimensional arrays for which either or both of these are generic:

  • dtype: indicates the data type for array elements, e.g., np.float64
  • shape: indicates the shape of the multi-dimensional array, a tuple of zero or more integers. We would like to support integer and variable sized dimensions, and variable numbers of dimensions. These are most naturally represented with indexing by a variadic number of integer, variable, colon : and/or ellipsis ... arguments, e.g., NDArray[1, N, :, ...] for an array with dimensions of size 1, size N, and arbitrary size, followed by 0 or more arbitrary sized dimensions.

For NumPy, ideally we would like to add basic typing support for dtype (using Generic) even before typing for shape is possible. But we'd like to know what the ultimate syntax should look like, so we don't paint ourselves into a corner.

One key question: can we safely rely on using a single generic argument for dtypes (e.g., np.ndarray[np.float64]) as indicating an array without any shape constraints?

My doc (same as in the master issue) considers a number of options under the "Possible syntax" section.

So far, I think the best option is some variation of "two generic arguments", for dtype and shape. But this could quickly get annoyingly verbose when sprinkled all over a code-base, e.g., np.ndarray[np.float32, Shaped[..., N, M]]:

  • It would be nice to support syntax like np.ndarray[np.float32] (the multi-dimensional equivalent of List[float]) as an alias for np.ndarray[np.float32, Any], but we don't yet have optional arguments for generics (variadic arguments are a somewhat awkward fit for a single argument).
  • It would also be nice to allow omitting Shaped[], e.g., by writing dimensions as variadic generics to the array type like np.ndarray[np.float32, ..., N, M]. One possible ambiguity is how to specify scalar arrays: np.ndarray[np.float32,] looks very similar to np.ndarray[np.float32]. But scalar arrays are rare enough that these could potentially be resolved by disallowing np.ndarray[np.float32,] in favor of requiring np.ndarray[np.float32, Shape[()]].
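For context on how such subscripts behave at runtime today, here is a minimal sketch. `RawIndex` is a hypothetical helper class (it is not part of NumPy or `typing`) that simply returns whatever Python passes to `__class_getitem__`. It suggests that integers, a bare colon, and an ellipsis are all accepted inside a subscript, and that a trailing comma is the only thing separating the one-argument form from a one-element tuple.

```python
class RawIndex:
    """Hypothetical helper: echoes back the subscript it receives."""

    def __class_getitem__(cls, item):
        return item


N = "N"  # placeholder standing in for a dimension variable (an assumption)

# Integers, a name, a bare colon, and an ellipsis are all valid subscript items.
mixed = RawIndex[1, N, :, ...]
print(mixed)

# A single argument arrives as-is; a trailing comma wraps it in a 1-tuple.
single = RawIndex[float]
one_tuple = RawIndex[float,]
print(single, one_tuple)

# An empty tuple is one possible spelling for "zero dimensions".
scalar = RawIndex[()]
print(scalar)
```

Because the two spellings differ only by a comma, a type checker (or a human reader) could easily confuse them, which is the ambiguity described above.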
@ilevkivskyi
Member

This is a hard question. I would prefer to have np.ndarray generic in two type variables: the first one for dtype, the second one for shape. Something like this (in stub file):

T = TypeVar('T')
S = TypeVar('S', bound=Shape)
class ndarray(Generic[T, S]): ...

where Shape would be a special variadic type very similar to Tuple but it will accept integers (both literals and constants) and "integer variables". So that it will look like:

a: ndarray  # just an array, shape and type are arbitrary
b: ndarray[float32, Any]  # array of floats with an unknown (dynamic) shape
c: ndarray[Any, Shape[100, 100]]  # array of dynamic types with fixed shape (100, 100)
c: ndarray[float32, Shape[100, 100]]
N = IntVar('N')
M = IntVar('M')
d: ndarray[float32, Shape[N, M]]

I understand that typing second Any in situations where one doesn't care about shape (or dtype) might be annoying, but I don't want to introduce additional exceptions for omitted type parameters. Second, I want to factor out the Shape type, so that it can be easily used by other libraries that use alternative array types (and maybe even built-in array).
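The two-parameter layout can be approximated with what `typing` offers today. The sketch below is an assumption-laden stand-in: `ndarray` here is a toy class rather than NumPy's, `float` replaces `float32`, and `Tuple[Literal[100], Literal[100]]` substitutes for the proposed `Shape[100, 100]`, which does not exist.

```python
from typing import Any, Generic, Literal, Tuple, TypeVar, get_args

T = TypeVar("T")  # dtype parameter
S = TypeVar("S")  # shape parameter (the proposal would bound this by Shape)


class ndarray(Generic[T, S]):
    """Toy stand-in for the stub class; not NumPy's ndarray."""


# Stand-in for the proposed Shape[100, 100]; Shape itself is hypothetical.
Shape100x100 = Tuple[Literal[100], Literal[100]]

dynamic_shape = ndarray[float, Any]          # dtype known, shape unknown
fixed_shape = ndarray[float, Shape100x100]   # dtype and shape both known

print(get_args(dynamic_shape))
print(get_args(fixed_shape))
```

Keeping the shape in its own type argument means the same shape object could be reused with other array classes, which matches the second motivation given above.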

@shoyer
Author

shoyer commented Dec 11, 2017

> I would prefer to have np.ndarray generic in two type variables: the first one for dtype, the second one for shape.

Yes, this seems like the right way to do things.

> I understand that typing second Any in situations where one doesn't care about shape (or dtype) might be annoying, but I don't want to introduce additional exceptions for omitted type parameters.

I will raise the issue of optional/default type variables separately. I agree that we shouldn't have a special case just for arrays.

On a related note: is there a good way to write "partially defined" generic type aliases? This would potentially alleviate the usability issue. For example, in user code:

  • FloatArray[...] as an alias for ndarray[float64, Shape[...]]
  • Matrix[...] as an alias for ndarray[..., Shape[N, M]]

I know I can specialize type variables in subclasses (e.g., class Matrix(ndarray[T, Shape[N, M]])), but that implies the argument is actually a member of the Matrix subclass. Likewise, I can write an alias Matrix = ndarray[T, Shape[N, M]], but that implies using the particular type variable T rather than producing a generic type with only one type variable.

> Second, I want to factor out the Shape type, so that it can be easily used by other libraries that use alternative array types (and maybe even built-in array).

Yes, definitely!

@ilevkivskyi
Member

> Likewise, I can write an alias Matrix = ndarray[T, Shape[N, M]], but that implies using the particular type variable T rather than producing a generic type with only one type variable.

If I understand you correctly, then I should say that the situation with generic aliases is exactly the opposite. For example:

from typing import Dict, Tuple, TypeVar

T = TypeVar('T')
SDict = Dict[str, Tuple[T, T]]

d: SDict[int]  # same as Dict[str, Tuple[int, int]]

U = TypeVar('U')
def func(x: U, y: U) -> SDict[U]: ...  # same as Dict[str, Tuple[U, U]]

careful: SDict = {}  # same as Dict[str, Tuple[Any, Any]]

(The last example is a typical pitfall, so we have a special flag in mypy to catch it.) There are some more examples in the mypy docs (note they are still incomplete).
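The substitution behaviour of generic aliases can also be observed at runtime, not only in a type checker. A minimal sketch using only the standard `typing` module (the variable names `substituted` and `expected` are just for illustration):

```python
from typing import Dict, Tuple, TypeVar

T = TypeVar("T")

# A generic alias: it has exactly one free type variable, T.
SDict = Dict[str, Tuple[T, T]]
print(SDict.__parameters__)

# Subscripting the alias substitutes T everywhere it appears.
substituted = SDict[int]
expected = Dict[str, Tuple[int, int]]
print(substituted == expected)
```

The alias is therefore generic in its own right: the name `T` used to define it does not leak into code that subscripts it.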

Taking this into account, I would expect at least the following alias defined in numpy (verbatim, but name is random) describing array with a given type but dynamic dimensions:

dynarray = ndarray[T, Any]

x: dynarray[float32]  # same as ndarray[float32, Any]
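The same mechanism covers the "partially defined" aliases asked about earlier. The sketch below is illustrative only: `ndarray` is a toy generic class, `float` and `int` replace NumPy dtypes, and a plain `Tuple[int, int]` substitutes for a real shape type, which does not exist yet.

```python
from typing import Any, Generic, Tuple, TypeVar, get_args

T = TypeVar("T")  # dtype parameter
S = TypeVar("S")  # shape parameter


class ndarray(Generic[T, S]):
    """Toy stand-in for the stub class; not NumPy's ndarray."""


# Fix the shape to Any, leave the dtype generic.
dynarray = ndarray[T, Any]

# Fix the dtype, leave the shape generic.
FloatArray = ndarray[float, S]

x = dynarray[int]                # same as ndarray[int, Any]
y = FloatArray[Tuple[int, int]]  # same as ndarray[float, Tuple[int, int]]

print(get_args(x))
print(get_args(y))
```

Each alias exposes only the type variable that was left open, so users write one argument instead of two.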

> I will raise the issue of optional/default type variables separately. I agree that we shouldn't have a special case just for arrays.

FWIW this proposal (defaults for type variables) appeared some time ago, but it didn't get enough support.

@ilevkivskyi
Member

(Also for the numpy stubs I know there was some prior proof-of-concept attempt, see https://github.com/machinalis/mypy-data)

@shoyer
Author

shoyer commented Dec 11, 2017

@ilevkivskyi Thanks for correcting my misconception about generics! This does make a significant difference for usability (probably good enough for me).

I suppose that whatever syntax is chosen for variadic type variables in #193 should also allow for aliases, so we can write something like:

S = TypeVar('S', variadic=True)
FloatArray = ndarray[float64, S]
x: FloatArray[N, M]

> (Also for the numpy stubs I know there was some prior proof-of-concept attempt, see https://github.com/machinalis/mypy-data)

Yes, we know about this one. We definitely plan to recycle work if possible.

@mitar
Contributor

mitar commented Dec 12, 2017

If keyword arguments to indexing would be allowed, we could have things like ndarray[int, shape=(1,2,3)]. :-)

@junjihashimoto

Hi, @ilevkivskyi san and @shoyer san.
Thank you for the great idea of a generic ndarray.

To support generic tensor and matrix operations such as multiplication, flatten, and reshape,
I think it is necessary to do arithmetic on IntVar.

I am experimenting with IntVar and Shape with Generic in this code.
Current Python accepts the following syntax with a modified typing.py.

def matmul(a: ndarray[T,Shape[N0,M0]],b: ndarray[T,Shape[N1,M1]]) -> ndarray[T,Shape[N0,M1]]:
    pass

def flatten(a: ndarray[T,Shape[N0,M0]]) -> ndarray[T,Shape[N0*M0]]:
    pass

def reshape(a: ndarray[T,Shape[N0*M0]]) -> ndarray[T,Shape[N0,M0]]:
    pass
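The shape relations these signatures describe can be written out as ordinary runtime functions on shape tuples. This is only an illustrative sketch: the function names are hypothetical and no static checking happens here. Two points the annotations above leave implicit are made explicit as assumptions: matrix multiplication also needs the inner dimensions to agree (M0 == N1), and reshape needs the target shape as an input, since N0 and M0 cannot be recovered from their product alone.

```python
from math import prod
from typing import Tuple


def matmul_shape(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Result shape of a @ b for 2-D operands: (N0, M0) x (N1, M1) -> (N0, M1)."""
    n0, m0 = a
    n1, m1 = b
    if m0 != n1:
        raise ValueError(f"inner dimensions differ: {m0} != {n1}")
    return (n0, m1)


def flatten_shape(a: Tuple[int, int]) -> Tuple[int]:
    """Result shape of flattening: (N0, M0) -> (N0 * M0,)."""
    return (prod(a),)


def reshape_shape(a: Tuple[int], target: Tuple[int, int]) -> Tuple[int, int]:
    """Result shape of reshaping a 1-D array, checking the element count."""
    if prod(target) != a[0]:
        raise ValueError(f"cannot reshape {a} into {target}")
    return target


print(matmul_shape((2, 3), (3, 4)))
print(flatten_shape((2, 3)))
print(reshape_shape((6,), (2, 3)))
```

A static version of these checks would need the type system to evaluate expressions such as `N0*M0`, which is the arithmetic on IntVar mentioned above.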

@junjihashimoto

I made a mistake.
There was no problem with multiplication.

5 participants