Accessing CEFI Cloud data with Python#

Packages used#

To use python to access the ERDDAP server directly from your python script or jupyter-notebook, you will need

  • Xarray

  • cf_xarray

  • netcdf4

  • matplotlib

  • zarr

  • fsspec

  • s3sf

  • cartopy

Packages

The package netcdf4 developed by UNIDATA and the zarr engine are used at runtime. It is the essential package that support netCDF format output from Xarray and OPeNDAP access be available in the environment when using xarray. The package cartopy is also not needed to read the data, but is used later in the script for making the map plots.


On this page, we will explore the process of extracting a subset of model simualtion data produce from the regional MOM6 model. The model output is to support the Climate Ecosystem and Fishery Initiative. We will showcase how to utilize xarray and fsspec for accessing the data and visualize it on an detailed map. The currently available data can be viewed on

The contents of this folder encompass hindcast simulations derived from the regional MOM6 model spanning the years 1993 to 2019.

Import python packages#

import xarray as xr
import cf_xarray # This is a wrapper for xarray that allows access to the data through standard coordinate names
import fsspec

Info

Thanks to the xarray, fsspec, zarr and kerchunk, accessing data from the cloud directly from Xarray is seamless. For detailed usage guidance, the Xarray documentation offers excellent examples and explanations.

Access data (regular grid product)#

From the previous page explaining OPeNDAP interface, we know that the OPeNDAP URL can be obtained from the form. In our case, we will use the browser to find the name of the object we want from the links to the cloud storage. The object names follow the same directory pattern as the OPeNDAP data on the portal.

fs = fsspec.filesystem(
    "reference", 
    fo='gcs://noaa-oar-cefi-regional-mom6-nwa/northwest_atlantic/full_domain/hindcast/monthly/regrid/r20230520/tob.nwa.full.hcast.monthly.regrid.r20230520.199301-201912.json', 
    remote_protocol="gcs", 
    remote_options={"anon":True},
    skip_instance_cache=True,
    target_options={"anon": True}
)
m = fs.get_mapper()
ds_tob = xr.open_dataset(m, engine='zarr', consolidated=False)

Info

fsspec is a library that provides a unified interface for accessing various file systems, including local, cloud, and remote file systems. filesystem(“reference”, …) creates a reference file system using fsspec. This is used to access a dataset described by a reference JSON file.

get_mapper method returns a mapper object that can be used to access the data described by the reference JSON file.

Finally, we use the xr.open_dataset to open the remote dataset using the zarr engine.

ds_tob.cf
Coordinates:
             CF Axes: * X: ['lon']
                      * Y: ['lat']
                      * T: ['time']
                        Z: n/a

      CF Coordinates: * longitude: ['lon']
                      * latitude: ['lat']
                      * time: ['time']
                        vertical: n/a

       Cell Measures:   area, volume: n/a

      Standard Names: * latitude: ['lat']
                      * longitude: ['lon']

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area, volume: n/a

      Standard Names:   sea_water_potential_temperature_at_sea_floor: ['tob']

              Bounds:   n/a

       Grid Mappings:   n/a
ds_tob
<xarray.Dataset> Size: 847MB
Dimensions:  (lat: 844, lon: 774, time: 324)
Coordinates:
  * lat      (lat) float64 7kB 5.273 5.335 5.398 5.461 ... 58.04 58.1 58.16
  * lon      (lon) float64 6kB 261.6 261.6 261.7 261.8 ... 323.8 323.8 323.9
  * time     (time) datetime64[ns] 3kB 1993-01-16T12:00:00 ... 2019-12-16T12:...
Data variables:
    tob      (time, lat, lon) float32 847MB ...
Attributes: (12/29)
    NCO:                    netCDF Operators version 5.0.1 (Homepage = http:/...
    NumFilesInSet:          1
    associated_files:       areacello: 19930101.ocean_static.nc
    cefi_archive_version:   /archive/acr/fre/NWA/2023_04/NWA12_COBALT_2023_04...
    cefi_aux:               N/A
    cefi_data_doi:          10.5281/zenodo.7893386
    ...                     ...
    contact:                chia-wei.hsu@noaa.gov
    dataset:                regional mom6 regrid
    external_variables:     areacello
    grid_tile:              N/A
    grid_type:              regular
    title:                  NWA12_COBALT_2023_04_kpo4-coastatten-physics

Tip

When we use xr.open_dataset() to load the data, we actually only load the metadata and coordinates information. This provide a great way to peek at the data’s dimension and availability on our local machine in the xarray object format. The actual gridded data values at each grid point will only be downloaded from the cloud server when we add .load() to the dataset or when we access the data with a visualization or analysis operation.

xarray will lazy load the data from the selected subset when it’s needed, in this case when the plot is made below.

ds_tob_subset = ds_tob.sel(time='2012-12')
ds_tob_subset
<xarray.Dataset> Size: 3MB
Dimensions:  (lat: 844, lon: 774, time: 1)
Coordinates:
  * lat      (lat) float64 7kB 5.273 5.335 5.398 5.461 ... 58.04 58.1 58.16
  * lon      (lon) float64 6kB 261.6 261.6 261.7 261.8 ... 323.8 323.8 323.9
  * time     (time) datetime64[ns] 8B 2012-12-16T12:00:00
Data variables:
    tob      (time, lat, lon) float32 3MB ...
Attributes: (12/29)
    NCO:                    netCDF Operators version 5.0.1 (Homepage = http:/...
    NumFilesInSet:          1
    associated_files:       areacello: 19930101.ocean_static.nc
    cefi_archive_version:   /archive/acr/fre/NWA/2023_04/NWA12_COBALT_2023_04...
    cefi_aux:               N/A
    cefi_data_doi:          10.5281/zenodo.7893386
    ...                     ...
    contact:                chia-wei.hsu@noaa.gov
    dataset:                regional mom6 regrid
    external_variables:     areacello
    grid_tile:              N/A
    grid_type:              regular
    title:                  NWA12_COBALT_2023_04_kpo4-coastatten-physics

Quick view of the variable#

To check if the data makes sense and fits the chosen time and region, you can use Xarray’s .plot() method. This way, you see a graph, making it easier to understand.

ds_tob_subset.tob.plot(cmap='inferno')
<matplotlib.collections.QuadMesh at 0x7eff29d2b100>
../../../_images/f0678e0e067458d317e66f5878418c53eea762405f98c295176b4607b195ea48.png

Plotting the data on a detailed map#

In this section, we are going to use cartopy to plot the data with a detailed map backgournd.

import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs

Specify figure setup#

  • Use the cf wrapper to reference the latitude, longitude and time via their CF standard names.

  • Calculate the aspect ratio of the data to set the aspect ratio of the figure

  • Shows how to center the plot at 180, not applicable to this data set, but critical for others like the NE Pacific.

xmin = ds_tob_subset.cf['longitude'].min()
xmax = ds_tob_subset.cf['longitude'].max()
ymin = ds_tob_subset.cf['latitude'].min()
ymax = ds_tob_subset.cf['latitude'].max()
aspect = (xmax-xmin)/(ymax-ymin)

Add details to the map#

  • Use the default land feature to fill in the land mass

  • Draw the coasts with the 50m data (will trigger a download of the data the first time you use it).

  • Draw some graticules

  • Put the colorbar on top and size it to fit the plot

  • Create a title from the metadata in the file and the variable name

  • Draw the plot

plt.figure(figsize=(8*aspect,8))
proj = ccrs.PlateCarree(central_longitude=180)
proj180 = ccrs.PlateCarree()
ax = plt.axes(projection=proj)
ax.set_extent([xmin, xmax, ymin, ymax], crs=proj180)
# add some features to make the map a little more polished
ax.add_feature(cartopy.feature.LAND)
ax.coastlines('50m')
gl = ax.gridlines(draw_labels=True)
ct = ax.contourf(ds_tob_subset.cf['longitude'], ds_tob_subset.cf['latitude'], ds_tob_subset['tob'][0], levels=255, transform=proj180, cmap="inferno")
plt.colorbar(ct, orientation='horizontal',pad=0.08, aspect=35, fraction=.06, location='top')
plt.title(str(ds_tob_subset.attrs['title']) + '\ntob\n at t=' + str(ds_tob_subset['time'].values), y=1.25)
plt.show()
../../../_images/13f67d1e84f6a1dda55960e2347164c43d61ad82e461fa014f99e2161a07e62c.png

Access data (raw grid product)#

The raw grid product differs from the regular grid in that it represents the original output of the model. The model grid uses the Arakawa C grid. The Regional MOM6 Grid provides an excellent visualization of the Arakawa C grid. We can concentrate on the h point, which is where scalar values like sea level height are stored. The grid does not have uniform spacing and is, in fact, a curvilinear grid.

Info

The raw model output is beneficial when calculating the energy budget, which includes factors like heat and momentum. This output preserves each term’s original values, avoiding any potential distortions caused by interpolation. By using the raw grid product, we can ensure the energy budget balances within a closed system.

To use the raw grid data, one need to get two files.

  1. the variable file

  2. the static file

For this demonstration we will use the AWS S3 to access the data. You can use either the Google Cloud Storage copy (gcs://noaa-oar-cefi-regional-mom6-nwa) or the AWS copy (s3://noaa-oar-cefi-regional-mom6-nwa-pds) by changing the prefix and name of the bucket.

fs = fsspec.filesystem(
    "reference", 
    fo='s3://noaa-oar-cefi-regional-mom6-nwa-pds/northwest_atlantic/full_domain/hindcast/monthly/raw/r20230520/tob.nwa.full.hcast.monthly.raw.r20230520.199301-201912.json',
    remote_protocol="s3", 
    remote_options={"anon":True},
    consolicated=False,
    skip_instance_cache=True,
    target_options={"anon": True}
)
map_s = fs.get_mapper()
ds_tob = xr.open_dataset(map_s, engine='zarr', consolidated=False)

fs_os = fsspec.filesystem(
    "reference", 
    fo='s3://noaa-oar-cefi-regional-mom6-nwa-pds/northwest_atlantic/full_domain/hindcast/monthly/raw/r20230520/ocean_static.json',
    remote_protocol="s3", 
    remote_options={"anon":True},
    skip_instance_cache=True,
    target_options={"anon": True}
)
map_os = fs_os.get_mapper() 
ds_static = xr.open_dataset(map_os, engine='zarr', consolidated=False)
ds_tob
<xarray.Dataset> Size: 849MB
Dimensions:     (time: 324, nv: 2, yh: 845, xh: 775)
Coordinates:
  * nv          (nv) float64 16B 1.0 2.0
  * time        (time) datetime64[ns] 3kB 1993-01-16T12:00:00 ... 2019-12-16T...
  * xh          (xh) float64 6kB -98.0 -97.92 -97.84 ... -36.24 -36.16 -36.08
  * yh          (yh) float64 7kB 5.273 5.352 5.432 5.511 ... 51.9 51.91 51.93
Data variables:
    average_DT  (time) timedelta64[ns] 3kB ...
    average_T1  (time) datetime64[ns] 3kB ...
    average_T2  (time) datetime64[ns] 3kB ...
    time_bnds   (time, nv) datetime64[ns] 5kB ...
    tob         (time, yh, xh) float32 849MB ...
Attributes: (12/27)
    NCO:                    netCDF Operators version 5.0.1 (Homepage = http:/...
    NumFilesInSet:          1
    associated_files:       areacello: 19930101.ocean_static.nc
    cefi_archive_version:   /archive/acr/fre/NWA/2023_04/NWA12_COBALT_2023_04...
    cefi_aux:               N/A
    cefi_data_doi:          10.5281/zenodo.7893386
    ...                     ...
    cefi_subdomain:         full
    cefi_variable:          tob
    external_variables:     areacello
    grid_tile:              N/A
    grid_type:              regular
    title:                  NWA12_COBALT_2023_04_kpo4-coastatten-physics
ds_static
<xarray.Dataset> Size: 66MB
Dimensions:       (yq: 846, xq: 776, yh: 845, xh: 775, time: 1)
Coordinates:
  * time          (time) datetime64[ns] 8B 1980-01-01
  * xh            (xh) float64 6kB -98.0 -97.92 -97.84 ... -36.24 -36.16 -36.08
  * xq            (xq) float64 6kB -98.04 -97.96 -97.88 ... -36.2 -36.12 -36.04
  * yh            (yh) float64 7kB 5.273 5.352 5.432 5.511 ... 51.9 51.91 51.93
  * yq            (yq) float64 7kB 5.233 5.312 5.392 5.472 ... 51.9 51.92 51.94
Data variables: (12/25)
    Coriolis      (yq, xq) float32 3MB ...
    areacello     (yh, xh) float32 3MB ...
    areacello_bu  (yq, xq) float32 3MB ...
    areacello_cu  (yh, xq) float32 3MB ...
    areacello_cv  (yq, xh) float32 3MB ...
    deptho        (yh, xh) float32 3MB ...
    ...            ...
    geolon_v      (yq, xh) float32 3MB ...
    sftof         (yh, xh) float32 3MB ...
    wet           (yh, xh) float32 3MB ...
    wet_c         (yq, xq) float32 3MB ...
    wet_u         (yh, xq) float32 3MB ...
    wet_v         (yq, xh) float32 3MB ...
Attributes:
    NCO:            netCDF Operators version 5.0.1 (Homepage = http://nco.sf....
    NumFilesInSet:  1
    grid_tile:      N/A
    grid_type:      regular
    history:        Fri May 12 10:50:21 2023: ncks -4 -L 3 ocean_static.nc co...
    title:          NWA12_MOM6_v1.0

Tip

Again, only the metadata and coordinate information are loaded into local memory, not the actual data. The gridded data values at each grid point are only downloaded from the cloud when we append .load() to the dataset.

Warning

In the ds_static dataset, you can observe the one-dimensional xh, yh grid. The numbers on this grid may resemble longitude and latitude values, but they do not represent actual geographical locations. For accurate longitude and latitude data, you should refer to the geolon, geolat variables in the list. These variables, which are dimensioned by xh, yh, provide the correct geographical coordinates.

Now let’s take a look at the grid area distribution which give a great sense of how it is different from regular spacing grid

ds_static = ds_static.set_coords(['geolon','geolat'])
ds_static.areacello.plot(x='geolon',y='geolat')
<matplotlib.collections.QuadMesh at 0x7eff16553d00>
../../../_images/85d33c6ae63f3339b462235311cca96389a3aeeef80cd22afd40488d04d026c0.png
ds_static.areacello.plot(x='xh',y='yh')
<matplotlib.collections.QuadMesh at 0x7eff1644baf0>
../../../_images/21f7d0d7ddf7b12ac194992575ca7c1c706387ed58798fe9faba8224c85794b4.png

We can see that the most significant differences between the two maps are located in the higher latitudes. Since the grid information is important in the raw grid, the easiest way to merge the two dataset so we can have the accurate lon lat that follows the data that we are going to work on.

ds_tob_subset = ds_tob.sel(time='2012-12').load()
ds = xr.merge([ds_tob_subset,ds_static])
ds = ds.isel(time=slice(1,None))  # exclude the 1980 empty field due to merge

From the xarray dataset object output above, we will notice the tob value in ds is now avaible and the coordinate information in the coordinate is also included into a single Dataset object. In the line below, we’ll assign the latitude and longitude names to geolat and geolon to help the CF library identify them since the data lacks the proper attributes for this to happen automatically.

ds = ds.assign_coords(latitude=ds.geolat, longitude=ds.geolon)
ds.cf
Coordinates:
             CF Axes: * X: ['xh', 'xq']
                      * Y: ['yh', 'yq']
                      * T: ['time']
                        Z: n/a

      CF Coordinates:   longitude: ['geolon', 'longitude', 'xh', 'xq']
                        latitude: ['geolat', 'latitude', 'yh', 'yq']
                      * time: ['time']
                        vertical: n/a

       Cell Measures:   area, volume: n/a

      Standard Names:   n/a

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area: ['areacello']
                        volume: n/a

      Standard Names:   SeaAreaFraction: ['sftof']
                        cell_area: ['areacello', 'areacello_bu', 'areacello_cu', 'areacello_cv']
                        sea_floor_depth_below_geoid: ['deptho']
                        sea_water_potential_temperature_at_sea_floor: ['tob']

              Bounds:   T: ['time_bnds']
                        time: ['time_bnds']

       Grid Mappings:   n/a

Quick view of the variable#

To check if the data makes sense and fits the chosen time and region, you can use Xarray’s .plot() method. This way, you see a graph instead of just numbers, making it easier to understand.

ds.tob.plot(x='longitude',y='latitude', cmap='inferno')
<matplotlib.collections.QuadMesh at 0x7eff1631e470>
../../../_images/f45182d71419fab1702b511e5b8b450ec59701ab72be8f2278ff3f981a5e08bd.png

Make a detailed plot showing the underlying grid#

long_name = ds.tob.attrs['long_name']
xmin = ds['longitude'].min()
xmax = ds['longitude'].max()
ymin = ds['latitude'].min()
ymax = ds['latitude'].max()
aspect = (xmax-xmin)/(ymax-ymin)
plt.figure(figsize=(8*aspect,8))
proj = ccrs.PlateCarree(central_longitude=180)
proj180 = ccrs.PlateCarree()
ax = plt.axes(projection=proj)
ax.set_extent([xmin, xmax, ymin, ymax], crs=proj180)
# add some features to make the map a little more polished
ax.add_feature(cartopy.feature.LAND)
ax.coastlines('50m')
gl = ax.gridlines(draw_labels=True)
ct = ax.contourf(ds['longitude'], ds['latitude'], ds['tob'][0], levels=255, transform=proj180, cmap="inferno")
plt.colorbar(ct, orientation='horizontal',pad=0.08, aspect=35, fraction=.06, location='top', cmap="inferno")
ax.autoscale(False) # keep the scalling info the same for the grid dots
ct = ax.scatter(ds['longitude'][::15], ds['latitude'][::15], .05, zorder=1, transform=proj180, color='grey')
plt.title(str(ds.attrs['title']) + '\n+long_name+\n at t=' + str(ds['time'].values)+'\nwith every 15th grid point overlaid', y=1.25)
plt.show()
../../../_images/34cfe6930d0d0d08057573d1c46b66db6ea82850fec7195994a810ec52e092f8.png