
ENH: Remove deepcopies when slicing cubes and copying coords #2261

Closed
wants to merge 4 commits into from

Conversation

cpelley

@cpelley cpelley commented Dec 8, 2016

Summary:

  • When indexing a cube whose data is realised (not lazy), a view of the original array is returned where possible (subject to numpy's slicing rules).
  • When indexing a cube whose data is not realised (lazy), realising the data on one cube will still not realise the data on the other.
  • Existing behaviour is that slicing coordinates returns views of the original points and bounds (where possible). This behaviour was likely chosen on the basis that DimCoord arrays, at least, are not writeable. The same is not true of auxiliary coordinates, however, which suggests this is really a bug (i.e. one can modify an AuxCoord's points and bounds through the view).
  • DimCoord slicing will now return views of its data, like AuxCoord. DimCoord will continue to realise data, unlike AuxCoord, due to the validation necessary to ensure monotonicity.

Replaces #1992
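For reference, the numpy slicing rules mentioned above can be demonstrated in plain numpy (no iris involved): basic slices give views, while "fancy" indexing copies.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

# Basic slicing returns a view onto the original array where possible...
view = data[1:, :2]
assert np.shares_memory(data, view)

# ...whereas "fancy" (integer-array) indexing always returns a copy.
fancy = data[[0, 2], :]
assert not np.shares_memory(data, fancy)

# Writing through a view modifies the parent array.
view[0, 0] = -1
assert data[1, 0] == -1
```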

@cpelley
Author

cpelley commented Dec 8, 2016

Happy to squash when you're happy, @marqh

Member

@pp-mo pp-mo left a comment


Sorry to come in on this party so late..

I've just reviewed our deprecation "rules", as set out in the change management whitepaper.

Key bits I think are relevant :

  • "A deprecation is issued when we decide that an existing feature needs to be removed or modified :
    We add notices to the documentation, and issue a runtime "Deprecation Warning" whenever the feature is used."
  • "sometimes we really need to change the way an API works, without modifying or extending (i.e. complicating) the existing user interface" ...
    • ... "For changes of this sort, the release will define a new boolean property of the iris.FUTURE object"
  • "the new option defaults to iris.FUTURE.<new_enable>=False, meaning the 'old' behaviour is the default" ...
    • ... "when any relevant API call is made that invokes the old behaviour, a deprecation warning is emitted."

So, according to those definitions, I think ...

  • this is a deprecation of a feature, and a FUTURE flag is the right way to approach it
  • it ought to issue a DeprecationWarning when the old behaviour is used.

I also think a couple of the additions to existing docstrings are a bit confusing:

  • they must not suggest that we are deprecating the actual routines.
  • they should explain what the old and new behaviours look like (or refer to somewhere that does)
  • they must make it clear that the current default behaviour is unchanged, but users should now be enabling the new one
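A minimal sketch of the pattern described above; note the class and attribute handling here are illustrative stand-ins, not the real iris code:

```python
import warnings

import numpy as np


class Future:
    """Hypothetical stand-in for iris.FUTURE."""
    def __init__(self):
        # The 'old' behaviour remains the default.
        self.share_data = False


FUTURE = Future()


def index_data(data, keys):
    """Index `data`, deep-copying the result unless data sharing is enabled."""
    result = data[keys]
    if not FUTURE.share_data:
        # Old behaviour still in use: emit a deprecation warning, then
        # copy the result as before.
        warnings.warn(
            "The data-copying behaviour of indexing is deprecated; set "
            "iris.FUTURE.share_data=True to adopt the new behaviour.",
            DeprecationWarning)
        result = result.copy()
    return result


a = np.arange(10)
old = index_data(a, slice(2, 5))   # independent copy, plus a warning
FUTURE.share_data = True
new = index_data(a, slice(2, 5))   # shared view, no warning
```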

controls whether `Coord.copy()` defaults to creating coordinates
whose `points` and `bounds` attributes are views onto the
original coordinate's attributes.

Member


I think it's worth adding a sentence explaining the alternative. Something like ...
"Conversely, when share_data=False (currently the default), an indexed partial cube or copied coord always contains fully independent, copied data arrays."

@@ -516,13 +516,26 @@ def copy(self, points=None, bounds=None):
.. note:: If the points argument is specified and bounds are not, the
resulting coordinate will have no bounds.

.. deprecated:: 1.12
Member


I think this needs to be clearer: it reads like you are proposing to remove the coord.copy function altogether (or some of its keywords)
-- I assume not?!

lib/iris/cube.py Outdated
@@ -2153,6 +2153,14 @@ def __getitem__(self, keys):
requested must be applicable directly to the cube.data attribute. All
metadata will be subsequently indexed appropriately.

.. deprecated:: 1.12
Member


Again, what we are actually deprecating is only an aspect of behaviour, so this still doesn't read quite right.
I'd suggest something like :

    .. deprecated:: 1.12
        In future, the `data` attribute of the indexing result may be a view onto the original data array,
        to avoid unnecessary copying.  For the present, however, indexing always produces a full
        independent copy.
        The `share_data` attribute of `iris.FUTURE` should be set to True to enable this new
        data-sharing behaviour (if not, a deprecation warning will be issued).

Also, awkwardly, at present the Sphinx docs pass all this one by anyway, as we aren't processing docstrings for special functions.
I think you can fix that by explicitly generating docs for specific routines, and maybe we should do that with this one, but I'm not immediately sure how.

@@ -0,0 +1,3 @@
* Deprecated the data-copying behaviour of Cube indexing and `Coord.copy()`.
Member


Needs a bit more explanation I think, as everyone is going to be affected.
For the whatsnew, I think that could be just a link to a better explanation.

@@ -516,13 +516,26 @@ def copy(self, points=None, bounds=None):
.. note:: If the points argument is specified and bounds are not, the
resulting coordinate will have no bounds.

.. deprecated:: 1.12

By default the new coordinate's `points` and `bounds` will
Member


I think this explains it from slightly the wrong angle : see the Cube.__getitem__ comment...

new_coord.attributes = copy.deepcopy(self.attributes)
new_coord.coord_system = copy.deepcopy(self.coord_system)
else:
new_coord = copy.deepcopy(self)
Member


This branch should issue a deprecation warning.

if isinstance(data, biggus.Array) or not data.flags['OWNDATA']:
data = copy.deepcopy(data)
if not iris.FUTURE.share_data:
# We don't want a view of the data, so take a copy of it if it's
Member

@pp-mo pp-mo Dec 8, 2016


This codepath should issue a deprecation warning.

@@ -516,13 +516,26 @@ def copy(self, points=None, bounds=None):
.. note:: If the points argument is specified and bounds are not, the
resulting coordinate will have no bounds.

.. deprecated:: 1.12
Member

@pp-mo pp-mo Dec 8, 2016


On a completely separate issue from deprecations etc...
I'm a bit bothered about the scope of the new behaviour in coordinates.

Firstly, I'm not convinced that it is expected for a method called "copy" to return a view at all.
I'm thinking that it would make more sense to have .copy() return a full copy, and use [:] for a top-level copy.
Isn't that what numpy does, and so what we should emulate?
E.G.

>>> a=np.arange(10)
>>> b=a[2:4]
>>> b[:] = 0
>>> a
array([0, 1, 0, 0, 4, 5, 6, 7, 8, 9])
>>> b=a.copy()
>>> b[:] = -1
>>> a
array([0, 1, 0, 0, 4, 5, 6, 7, 8, 9])
>>> 

Secondly, as I've just mentioned it, shouldn't Coord.__getitem__ be returning copies ??
(that is, instead, if you accept the previous...)

@cpelley
Author

cpelley commented Dec 9, 2016

Thanks @pp-mo, ah, you're onto something here, I think I agree. The subtlety is that Coord.__getitem__ currently calls self.copy.
I'll have a think about the logic of __getitem__ and the copy method and get back to you.

@cpelley
Author

cpelley commented Dec 9, 2016

Hi @pp-mo, I think I have found a more reasonable behaviour for this views/copy business for the coords object (see the latest commit). Summary:

  1. Coord.copy now always returns a deepcopy, but no longer unnecessarily copies the original points before replacing them when new points and bounds are supplied (this provides a significant memory optimisation, especially when multidimensional coordinates are present).
  2. Coord.__getitem__ returns a view when possible, much like the behaviour of a numpy array.

I haven't updated the deprecation issues you mentioned yet (but I will do). Let me know if you're happy with the underlying chosen behaviour.

Cheers
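The two behaviours above can be sketched with a toy stand-in for the coord class (`SimpleCoord` here is hypothetical; the real iris Coord is far more involved):

```python
import copy

import numpy as np


class SimpleCoord:
    """Hypothetical, much-simplified stand-in for an iris coordinate."""

    def __init__(self, points):
        self.points = np.asarray(points)

    def copy(self, points=None):
        if points is not None:
            # New points supplied: build the result around them directly,
            # rather than deep-copying points that are about to be discarded.
            new_coord = SimpleCoord(points)
        else:
            new_coord = copy.deepcopy(self)
        return new_coord

    def __getitem__(self, keys):
        # Like numpy, indexing returns a view where possible.
        return SimpleCoord(self.points[keys])


coord = SimpleCoord(np.arange(5))
assert not np.shares_memory(coord.points, coord.copy().points)  # full copy
assert np.shares_memory(coord.points, coord[1:3].points)        # view
```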

@pp-mo
Member

pp-mo commented Dec 9, 2016

I think I have found a more reasonable behaviour for this views/copy business for the coords object
Thanks @cpelley

I'm a bit confused now as to what we are (and are not) doing with coords...

  • Of course, unlike data, the coords data arrays are not modifiable, so we don't have to make copies just to avoid unintentional change to a 'parent' coord.
  • You are now addressing a gross issue with copying the coord points only to throw them away (!)
  • However, I think there is also another potentially large benefit from being able to index lazy coords without realising them (consider making cube slices, for example), but that is simply not in scope at this point, is that right ??

I'm also wondering whether you need to apply these tests to a DimCoord as well ?

@pp-mo
Member

pp-mo commented Dec 9, 2016

Meanwhile...
the change in Cube looks good to me, but I think the code can be neater.
After some effort, I convinced myself that this should be equivalent to your lines 2185-2220 ...

        try:
            first_slice = next(slice_gen)
        except StopIteration:
            first_slice = Ellipsis if iris.FUTURE.share_data else None

        if first_slice is not None:
            data = self._my_data[first_slice]
        else:
            data = copy.deepcopy(self._my_data)

        for other_slice in slice_gen:
            data = data[other_slice]

I think it passes the tests, anyway.
Do you think that is right?

@pp-mo
Member

pp-mo commented Dec 12, 2016

@cpelley uncovered some unexpected behaviour (as discussed offline):

>>> co = iris.coords.AuxCoord(np.arange(5))
>>> co.points
array([0, 1, 2, 3, 4])
>>> co.points[2] = 77
>>> co.points
array([ 0,  1, 77,  3,  4])
>>> 
>>> co2 = co[2:4]
>>> co2.points
array([77,  3])
>>> co2.points[1] = 9999
>>> co.points
array([   0,    1,   77, 9999,    4])
>>> 

This only applies to an AuxCoord, as the DimCoord value arrays are not writeable.
We think this is a bug really : it is not intended behaviour that changing a derived (sliced) coord can affect the original.
Could address this in a separate PR.

For the present case, we might consider introducing this type of behaviour, i.e. a sliced coord is linked to the original. We might also consider extending that to DimCoords.
However, from a practical, performance point of view, such behaviour won't achieve anything much in most cases. That is, only if coords (a) are quite large (meaning, usually, multidimensional) and (b) have deferred data (which at present generally means loaded from netcdf).

@pp-mo
Member

pp-mo commented Dec 12, 2016

NOTE:
The change proposed here is practical, but doesn't go as far as making cubes behave "just like arrays" :
The existing design completely forbids that by ensuring that views on cube data and coord values are never returned.
The proposed change, however, only affects the data at this point: even if we (later) add more changes to let coords share points arrays, cubes will not share coordinate objects, so, for instance, changes made via cube.slices() will still not work the way you might expect (i.e. like numpy).

@pp-mo
Member

pp-mo commented Dec 12, 2016

Another NOTE:
Another thing that this won't deliver is to have copies of a cube with deferred data "share" data when the original is realised (i.e. loaded): because we replace cube._my_data, the copied cubes still have deferred arrays, and when requested the data will be fetched again for each one.

@cpelley
Author

cpelley commented Dec 14, 2016

Thanks @pp-mo, I'll update the ticket description. There are so many subtleties that I think it is worth going into some detail in the description, for the record.

...After some effort, I convinced myself that this should be equivalent to your lines 2185-2220 ...

Thanks @pp-mo, yes, that looks neater and simpler.

@cpelley
Author

cpelley commented Dec 14, 2016

You will be pleased to know that I have found some further subtleties. Some I have fixed and others I'm working on.

  • DimCoord was returning copies of points and bounds rather than views (I have fixed this now).
  • DimCoord realises its data while AuxCoord does not. I don't think I will do anything about this other than capturing it in a suitable unit test, since we need to evaluate whether the data is monotonic in order to allow/disallow setting it.

Gosh!
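The writeability point above mirrors plain numpy behaviour: a read-only flag on a DimCoord-style array is inherited by its views, so handing out views of such data stays safe. A plain-numpy illustration (not iris code):

```python
import numpy as np

points = np.arange(5)
points.flags.writeable = False   # DimCoord-style protection

view = points[1:3]               # a slice is still a view...
assert not view.flags.writeable  # ...but it inherits the read-only flag

try:
    view[0] = 99
except ValueError as err:
    print("write blocked:", err)
```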

@cpelley cpelley self-assigned this Dec 15, 2016
@cpelley
Author

cpelley commented Dec 15, 2016

@pp-mo I think I'm ready. The travis failure is unrelated (I have restarted the tests but to no avail).

======================================================================

FAIL: test_resolve (iris.tests.test_image_json.TestImageFile)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/miniconda/envs/test-environment/lib/python2.7/site-packages/Iris-1.12.0.dev0-py2.7-linux-x86_64.egg/iris/tests/test_image_json.py", line 95, in test_resolve

    self.assertEqual(deque(), exceptions)

AssertionError: deque([]) != deque([ConnectionError(MaxRetryError("HTTPSConnectionPool(host='scitools.github.io', port=443): Max retries exceeded with url: /test-iris-imagehash/images/155fb4e2e9a1a9a16da9b9e006fa9176078a965e0b869b5f0b8659751b807b75.png (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f8260507050>: Failed to establish a new connection: [Errno 110] Connection timed out',))",),)])

Cheers

@marqh
Member

marqh commented Dec 22, 2016

Hi @cpelley

your patience is appreciated, I hope you are still with us.

We have been having trouble with the above test failure in processing the preparation for the 1.12 release.

There is now a fix on the 1.12.x branch, which hasn't been merged back into master yet.

I'd like this functionality in 1.12 and to see the tests passing; so please rebase and retarget (a new PR is fine if that is easier) and we'll do our best to get this in

thank you
mark

@cpelley cpelley changed the base branch from master to v1.12.x January 9, 2017 13:00
@cpelley cpelley force-pushed the AVOID_DATA_COPIES branch 2 times, most recently from 9abd578 to e1add28 Compare January 9, 2017 13:23
@cpelley
Author

cpelley commented Jan 9, 2017

Thanks @marqh, I have changed base and squashed.

@cpelley cpelley force-pushed the AVOID_DATA_COPIES branch from e1add28 to bcc0100 Compare January 9, 2017 15:06
@cpelley
Author

cpelley commented Jan 9, 2017

Test failures are due to unrelated (files untouched) license header failures (problems with 1.12.x).

@cpelley
Author

cpelley commented Jan 10, 2017

Rebased but I'm still getting what looks like unrelated problems...

======================================================================

ERROR: test_resolve (iris.tests.test_image_json.TestImageFile)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/miniconda/envs/test-environment/lib/python2.7/site-packages/Iris-1.12.0rc0-py2.7-linux-x86_64.egg/iris/tests/test_image_json.py", line 46, in test_resolve

    raise ValueError('Github API get failed: {}'.format(iuri))

ValueError: Github API get failed: https://api.github.com/repos/scitools/test-iris-imagehash/contents/images

-------------------- >> begin captured logging << --------------------

requests.packages.urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): api.github.com

requests.packages.urllib3.connectionpool: DEBUG: https://api.github.com:443 "GET /repos/scitools/test-iris-imagehash/contents/images HTTP/1.1" 403 None

--------------------- >> end captured logging << ---------------------

@cpelley
Author

cpelley commented Jan 16, 2017

ping

@cpelley
Author

cpelley commented Jan 19, 2017

Hi @marqh, I have rebased again and have the same error (looks unrelated to me??)

@cpelley cpelley changed the base branch from v1.12.x to master February 17, 2017 11:00
Carwyn Pelley added 2 commits February 17, 2017 11:14
- When indexing a cube whose data is realised (not lazy), a view
  of the original array is returned where possible (subject to
  numpy's slicing rules).
- When indexing a cube whose data is not realised (lazy),
  realising the data on one cube will still not realise the data
  on the other.
- Existing behaviour is that slicing coordinates returns views of
  the original points and bounds (where possible).  This behaviour
  was likely chosen on the basis that DimCoord arrays, at least,
  are not writeable.  The same is not true of auxiliary
  coordinates, however, which suggests this is really a bug
  (i.e. one can modify an AuxCoord's points and bounds).
- DimCoord slicing will now return views of its data, like
  AuxCoord.  DimCoord will continue to realise data, unlike
  AuxCoord, due to the validation necessary to ensure monotonicity.
@cpelley
Author

cpelley commented Feb 17, 2017

Urgh no end of trouble with travis on this one...

TimedOutException on the Python3.4 TARGET=DEFAULT tests :(

There are genuine test failures though:

The example tests (i.e. TARGET=example) are failing because the warning about this deprecation is issued. Before I look at this, do we really want the example tests failing just because deprecation warnings are issued??

@marqh
Member

marqh commented Mar 7, 2017

The example tests (i.e. TARGET=example) are failing because the warning about this deprecation is issued. Before I look at this, do we really want the example tests failing just because deprecation warnings are issued??

this is an explicit choice, and one that I support

I do not want 'example' code using deprecated features

@cpelley
Author

cpelley commented Mar 24, 2017

Finally got the time to get back to this.
All tests passing and no timeouts!

ping @marqh, @pp-mo

Cheers

@cpelley
Author

cpelley commented May 17, 2017

Closed in favour of #2549
