
Tiling/AWS Session problems in veda-dev #192

Open
anayeaye opened this issue Jul 4, 2023 · 17 comments
Labels
bug Something isn't working


@anayeaye
Collaborator

anayeaye commented Jul 4, 2023

What

We are attempting to promote a large diff from our dev backend to staging but have encountered problems with the maps in the discovery and explore views of a dashboard preview running against the development backend. This is a hard error to document because a request to /cog/info that fails on the first attempt succeeds on the second (I've seen this before but don't recall our answer at the moment). I suspect at least part of the solution lies in our raster-api GDAL environment; perhaps the configuration has drifted?

Dashboard preview:

https://deploy-preview-281--visex.netlify.app/

Mosaic examples

Failing mosaic in dev

https://dev-raster.delta-backend.com/mosaic/tiles/795277e64375a264bf3f73506a6cd2d0/WebMercatorQuad/2/0/1@1x?assets=cog_default&resampling=bilinear&bidx=1&colormap_name=rdylbu_r&rescale=0,1

First try:
'/vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif' does not exist in the file system, and is not recognized as a supported dataset name.

Second attempt after executing /cog/info:
Read or write failed. IReadBlock failed at X offset 0, Y offset 0: /vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif, band 1: IReadBlock failed at X offset 0, Y offset 0: TIFFReadEncodedTile() failed.

Mosaic works in staging

https://staging-raster.delta-backend.com/mosaic/tiles/795277e64375a264bf3f73506a6cd2d0/WebMercatorQuad/2/0/1@1x?assets=cog_default&resampling=bilinear&bidx=1&colormap_name=rdylbu_r&rescale=0,1

COG info examples

Note: we are unable to read COG info for the file used in the mosaic, but we can access other files in the same collection, so this is not purely a permissions issue.

https://dev-raster.delta-backend.com/cog/info?url=s3://veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif

On first attempt: `'/vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif' does not exist in the file system, and is not recognized as a supported dataset name.`

On second attempt: endpoint returns cog/info

These are yearly COGs, so the error should be reproducible by incrementing the year in the tif name.

COG Tiles example

We already know that /cog is handling the env correctly; this tiles example works as expected.
https://dev-raster.delta-backend.com/cog/tiles/WebMercatorQuad/0/0/0@1x?url=s3://veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif&bidx=1&rescale=0,1

Stack Notes

We cannot make a one-to-one comparison of the dev and staging veda-backend stacks because we have upgraded the version of pgstac for the dev database but not for staging.

Similarities

  • The dev database was created from a fairly recent snapshot of the staging database, however, so the contents are similar.
  • Both stacks are assuming the same data access role

Differences

@anayeaye anayeaye added the bug Something isn't working label Jul 4, 2023
@smohiudd
Contributor

smohiudd commented Jul 4, 2023

@anayeaye this is the issue we encountered before where the endpoint was failing intermittently. The problem that time was the creds weren't being passed to gdal. This is what that fix looked like: https://github.com/NASA-IMPACT/veda-backend/pull/144/files

The error that we're seeing now:

"detail": "'/vsis3/veda-data-store-staging/EIS/COG/coastal-flooding-and-slr/MODIS_LC_2001_BD_v2.cog.tif' does not exist in the file system, and is not recognized as a supported dataset name."

looks very similar to what we saw in the previous issue.

@smohiudd
Contributor

smohiudd commented Jul 4, 2023

In the PR we had some changes to how GDAL envs are passed through titiler based on the 0.7.0 breaking changes: https://github.com/developmentseed/titiler/blob/main/CHANGES.md#070-2022-06-08

Do we know if these gdal config changes were tested in dev?

@anayeaye
Collaborator Author

anayeaye commented Jul 6, 2023

Just for the record (no new insights): I tried some pinning in the raster-api. These changes did not solve our problem and the dev deployment is reverted to the current develop branch.

To be extra sure we weren't getting the breaking version of starlette (this looked promising because there is a subtle difference between the cold-start true/false conditions that causes slightly different results across multiple tries of the same request; examples in the issue description):

"fastapi>=0.87,<0.92",
"starlette>=0.21.0,<0.25",

And on a whim, to see if the recent release of rasterio was related to our woes

"rasterio<1.3.8",

So our current condition remains: /cog routes are happily using the sts assume role session credentials, while /mosaic and /stac endpoints are not. I don't see where the divergence happens; I'm pretty sure they all have titiler core's BaseTilerFactory underneath.

@vincentsarago
Contributor

mosaic and stac may use another level of threading, which might explain why the environment is not the same. I've been trying to track down this issue for a while without success.

Can you test by setting RIO_TILER_MAX_THREADS=1 and MOSAIC_CONCURRENCY=1 (this in theory will remove any multi-threading)

ref: developmentseed/titiler#186

@anayeaye
Collaborator Author

anayeaye commented Jul 6, 2023

With RIO_TILER_MAX_THREADS=1 (our current deployment already sets this) and MOSAIC_CONCURRENCY=1 (just set now), I'm still seeing an Access Denied 403 on the first hit, followed by a does not exist in the file system error on retries.

EDIT/note: I've now reverted the lambda environment to match the env variables stored for github actions: RIO_TILER_MAX_THREADS=1; MOSAIC_CONCURRENCY is unset.

@ranchodeluxe
Contributor

ranchodeluxe commented Jul 8, 2023

@vincentsarago: My hacky fix for this issue created a success case for veda-backend and shows where I believe the issue resides. I wish I would've seen this convo yesterday b/c it would've saved hours 😆

The issue explained:

  • rasterio.Env uses thread-local storage, which means it initializes a new, empty context per thread

  • So by the time CustomSTACReader is retrieving the image in this line, it's running inside multiple threads. Each thread's CustomSTACReader triggers self.ctx (which is just rasterio.Env) and has to start all over again with an empty context when accessing the image

  • My fix created a success case b/c I was forcing the env vars back into the rasterio.Env per-thread context via AssetInfo
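The per-thread behavior described above can be sketched without rasterio at all. In this toy example, `set_env`/`get_env` are hypothetical stand-ins (not rasterio APIs) that mimic how options stored in `threading.local` vanish inside worker threads:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Mimics rasterio's pattern: context options live in thread-local storage,
# so a context populated on the main thread is invisible to worker threads.
_local = threading.local()

def set_env(options):
    _local.options = options

def get_env():
    # A fresh thread has no 'options' attribute -> it sees an empty context.
    return getattr(_local, "options", {})

# Main thread: credentials are set, lookups succeed.
set_env({"AWS_ACCESS_KEY_ID": "example-id"})

# Worker threads share the module globals, but each has its own empty
# thread-local state -- just like a per-thread rasterio.Env starting over.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda _: get_env(), range(2)))

print(results)  # [{}, {}] -- the credentials did not follow us into the workers
```

This is why forcing the env vars back into each thread's context (as the fix above does) restores the session.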

@vincentsarago
Contributor

Thanks so much @ranchodeluxe for this deep dive. This is definitely a bug that we should fix at the rio-tiler level.

I wonder if using https://github.com/rasterio/rasterio/blob/main/rasterio/env.py#L328C1-L339C1 to get the options from the current environment and forward them to a new Env would work 🤷
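A dict-based sketch of that forwarding idea (the `getenv`/`env` names here are hypothetical stand-ins modeled on rasterio's, not its actual implementation): capture the active context's options and layer them into each newly entered context, so a nested, empty Env still inherits the session:

```python
import threading
from contextlib import contextmanager

_local = threading.local()

def getenv():
    # Mirrors the role of rasterio.env.getenv(): options of the active context.
    return dict(getattr(_local, "options", {}))

@contextmanager
def env(**options):
    previous = getattr(_local, "options", None)
    # Forward whatever is already in the environment, then layer new options.
    _local.options = {**getenv(), **options}
    try:
        yield _local.options
    finally:
        # Restore (or clear) the previous context on exit.
        if previous is None:
            del _local.options
        else:
            _local.options = previous

with env(AWS_ACCESS_KEY_ID="example-id"):
    with env() as inner:
        # The nested, option-less context still sees the credentials.
        print(inner)
```

Whether this translates cleanly into rio-tiler's readers is the open question, but the capture-and-forward pattern itself is straightforward.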

@vincentsarago
Contributor

FYI this can be simply demoed with

import rasterio
from rasterio.session import AWSSession

with rasterio.Env(
    session=AWSSession(
        aws_access_key_id="MyDevseedId",
        aws_secret_access_key="MyDevseedKey",
    )
):
    with rasterio.open("s3://ds-satellite/cogs/NaturalEarth/world_grey.tif") as src:
        print(src.profile)

    # Nested, empty Env: the session above is not forwarded, so this read fails
    with rasterio.Env():
        with rasterio.open("s3://ds-satellite/cogs/NaturalEarth/world_grey_1024_512.tif") as src:
            print(src.profile)

{'driver': 'GTiff', 'dtype': 'uint8', 'nodata': None, 'width': 21580, 'height': 10780, 'count': 3, 'crs': CRS.from_epsg(4326), 'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331), 'blockxsize': 128, 'blockysize': 128, 'tiled': True, 'compress': 'jpeg', 'interleave': 'pixel', 'photometric': 'ycbcr'}

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: Access Denied

@vincentsarago
Contributor

ok, I may have a fix for this but it will require a full rio-tiler/titiler/titiler-pgstac update

I see veda raster-api is a bit behind the current releases (titiler-pgstac==0.2.3 / titiler==0.10.2); ideally I'll release titiler-pgstac 0.5 and titiler 0.12 with a new rio-tiler 4.2

The move from titiler-pgstac 0.2.3 to 0.5 will have a couple of breaking changes:

# Changes in Item and Collection endpoint URL
# Before
{endpoint}/stac/info?collection=collection1&item=item1

# Now
{endpoint}/collections/collection1/items/item1/info


# Before
{endpoint}/mosaic/tiles/20200307aC0853900w361030/0/0/0

# Now
{endpoint}/mosaic/20200307aC0853900w361030/tiles/0/0/0

# Before
/{searchid}/{z}/{x}/{y}/assets

# Now
/{searchid}/tiles/{z}/{x}/{y}/assets

Also, the add_map_viewer option in MosaicTilerFactory is renamed to add_viewer, for consistency with titiler's options.
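For anyone updating client code, the mosaic tile URL change above is mechanical enough to script. A hypothetical helper (`migrate_mosaic_tile_url` is my name for illustration, not part of titiler-pgstac) sketching the path rewrite:

```python
import re

def migrate_mosaic_tile_url(url: str) -> str:
    """Rewrite a pre-0.5 titiler-pgstac mosaic tile URL to the new layout:
    /mosaic/tiles/{searchid}/... -> /mosaic/{searchid}/tiles/...
    Already-migrated URLs pass through unchanged.
    """
    return re.sub(
        r"/mosaic/tiles/(?P<searchid>[^/]+)/",
        r"/mosaic/\g<searchid>/tiles/",
        url,
    )

old = "{endpoint}/mosaic/tiles/20200307aC0853900w361030/0/0/0"
print(migrate_mosaic_tile_url(old))
# {endpoint}/mosaic/20200307aC0853900w361030/tiles/0/0/0
```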

@ranchodeluxe
Contributor

ranchodeluxe commented Jul 10, 2023

> FYI this can be simply demo with … (code example and output quoted from the comment above)

I'm confused as to why rasterio is operating this way in the same thread. Based on the source code, it should be picking these things up: https://github.com/rasterio/rasterio/blob/main/rasterio/env.py#L272-L291

@vincentsarago
Contributor

Even in the same thread it seems the session is not forwarded. I'm opening an issue in rasterio because to me it seems to be a Bug

@ranchodeluxe
Contributor

Even in the same thread it seems the session is not forwarded. I'm opening an issue in rasterio because to me it seems to be a Bug

Yeah, based on the code I'm reading it is a bug

@ranchodeluxe
Contributor

ranchodeluxe commented Jul 10, 2023

@vincentsarago : For a single thread, a nested rasterio.Env DOES find the previous environ. The exact same thing works fine for me below (not against the same s3 endpoint; I don't have a DS AWS account). Can you double-check that you don't have any existing AWS_* environment variables exported, and remove them if so?

import rasterio
import pprint

session = {
    "session": rasterio.session.AWSSession(
        aws_access_key_id="<blah>",
        aws_secret_access_key="<blah>",
        aws_session_token="<blah>",
    )
}

with rasterio.Env(**session) as rioenv1:
    print('########### rioenv1 ###########')
    pprint.pprint(rioenv1.options, indent=4)
    with rasterio.open("s3://veda-data-store-staging/geoglam/CropMonitor_202001.tif") as src:
        pprint.pprint(src.profile, indent=4)
    with rasterio.Env() as rioenv2:
        print('########### rioenv2 ###########')
        pprint.pprint(rioenv2.options, indent=4)
        with rasterio.open("s3://veda-data-store-staging/geoglam/CropMonitor_202001.tif") as src:
            pprint.pprint(src.profile, indent=4)

@vincentsarago
Contributor

vincentsarago commented Jul 10, 2023

########### rioenv1 ###########
{   'AWS_ACCESS_KEY_ID': '<blah>',
    'AWS_REGION': 'us-east-1',
    'AWS_SECRET_ACCESS_KEY': '<blah>'}

{   'blockxsize': 128,
    'blockysize': 128,
    'compress': 'jpeg',
    'count': 3,
    'crs': CRS.from_epsg(4326),
    'driver': 'GTiff',
    'dtype': 'uint8',
    'height': 10780,
    'interleave': 'pixel',
    'nodata': None,
    'photometric': 'ycbcr',
    'tiled': True,
    'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331),
    'width': 21580}

########### rioenv2 ###########
{}

{   'blockxsize': 512,
    'blockysize': 512,
    'compress': 'jpeg',
    'count': 3,
    'crs': CRS.from_epsg(4326),
    'driver': 'GTiff',
    'dtype': 'uint8',
    'height': 10780,
    'interleave': 'pixel',
    'nodata': None,
    'photometric': 'ycbcr',
    'tiled': True,
    'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331),
    'width': 21580}

Note: the second call should fail, but because I've got my default AWS profile set to devseed it works 😅

@vincentsarago
Contributor

@ranchodeluxe feel free to add more comments in the rasterio ticket 🙏

@ranchodeluxe
Contributor

ranchodeluxe commented Jul 10, 2023

@ranchodeluxe feel free to add more comments in the rasterio ticket 🙏

will do, but I have to build my rasterio image and want to do it as a test case for them

@anayeaye anayeaye changed the title WIP notes on tiling problems in veda-dev Tiling/AWS Session problems in veda-dev Jul 10, 2023
@moradology
Contributor

After asking around, it appears this has been resolved for the time being. The ultimate fix is in rasterio, so the next step is bumping rasterio versions once the next release is cut (>1.3.9).
