Accessing CEFI Cloud data with R#

This example requires you to have used conda or mamba to install the cefi-cookbook-dev environment as explained here.

The following notebook is a quick demonstration of how to use R to access the CEFI Cloud data. It is designed for users who prefer R as their programming interface. There are already a couple of great resources on preprocessing and visualizing data in R, so this notebook does not repeat that code and instead focuses on how to access the data from an R interface. The resources are listed below.

Launch R in JupyterLab#

In addition to the standard R code demonstrated below, our current setup uses JupyterLab to execute R code through the R kernel. To enable this, we use the environment.yml file to create a Conda/Mamba environment; the primary objective is to install the r-irkernel package within this environment.

Installing r-irkernel through Conda/Mamba makes the R kernel available as an option when launching JupyterLab. Selecting the R kernel lets you run R code directly from the JupyterLab interface.
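For reference, here is a minimal sketch of what such an environment.yml might contain. This is illustrative only: the channel and the packages other than r-irkernel are assumptions, and the actual file shipped with the cookbook is the authority.

# environment.yml (abridged, illustrative sketch, not the cookbook's actual file)
name: cefi-cookbook-dev
channels:
  - conda-forge
dependencies:
  - r-irkernel     # registers the R kernel with Jupyter
  - r-reticulate   # lets R call into the Python environment
  - jupyterlab
  - xarray
  - fsspec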

Packages used#

In this example, we are going to use the reticulate library with Python to read the CEFI data into an R data structure.

From there you can operate on the data in R (or R Studio) as you would any other data.

N.B. The only library you need to read the data is the reticulate library, which gives R access to the Python environment that reads the data.

The other libraries are used to filter the data and make a plot on a detailed base map.

The last (commented-out) line shows the Python configuration that should have been set up when you loaded the conda environment to run this notebook.

library(reticulate)
library(dplyr)
library(maps)
library(ggplot2)
# reticulate::py_config()  # check python version
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:

    filter, lag
The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
use_condaenv('cefi-cookbook-dev')

Get some Python tools to read the data from the cloud storage#

fsspec and xarray are Python packages included in the conda environment we built to run this notebook. We’re going to import them into our R session and use them to read the data, which we then reshape into an R data.frame with lon, lat, tos as the columns.

xr <- import('xarray')
fsspec <- import('fsspec')

Get some data from the cloud#

  • Read the data

  • Select a time slice

fs <- fsspec$filesystem('reference', 
    remote_protocol='gcs',
    fo='gcs://noaa-oar-cefi-regional-mom6-nwa/northwest_atlantic/full_domain/hindcast/monthly/regrid/r20230520/tos.nwa.full.hcast.monthly.regrid.r20230520.199301-201912.json'
)
m <- fs$get_mapper()
dataset <- xr$open_dataset(m, engine='zarr', consolidated=FALSE)
subset <- dataset$sel(time='2012-12')
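If you want to confirm what was loaded before slicing, printing the object displays xarray’s usual summary of dimensions, coordinates, and variables (reticulate renders the Python repr):

# Optional: show the xarray.Dataset summary
print(dataset)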

Get the variable long_name, the date of the slice, and the dataset title for the plot below.

datetime <- subset$time$values
title <- subset$attrs$title
long_name <- subset$tos$attrs$long_name

Reorganize that data into a data.frame#

We transform the gridded data into column data with lon, lat, tos as columns. Take care: expand.grid expands along lon first and then along lat, so we transpose the xarray Dataset so that lon varies fastest to match.

subset <- subset$transpose('lon','lat','time')  # make lon the fastest-varying dimension
df <- expand.grid(x = subset$lon$values, y = subset$lat$values)
data <- as.vector(t(matrix(subset$tos$values)))  # flatten to match the expand.grid order
df$Data <- data
names(df) <- c("lon", "lat", "tos")
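As a quick, optional check that the flattening matched the expand.grid order, the data.frame should contain exactly one row per (lon, lat) grid point:

# Optional: one row per grid point; a mismatch means the flatten order is wrong
stopifnot(nrow(df) == length(subset$lon$values) * length(subset$lat$values))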

Subset the dataset#

We use filter to extract the points over the Gulf region.

df_gulf <- filter(df, lat > 17, lat < 31, lon > 260, lon < 290)
df_gulf
A data.frame: 79072 × 3 (first and last rows shown)

      lon       lat       tos
<dbl[1d]> <dbl[1d]>     <dbl>
 261.5577  17.00457       NaN
 261.6384  17.00457       NaN
 261.7191  17.00457       NaN
      ...       ...       ...
 289.7942  30.99516  22.32747
 289.8749  30.99516  22.31860
 289.9556  30.99516  22.30894

Longitude Manipulation#

The model lons run from 0 to 360, so we’re going to use this function to normalize them to -180 to 180. The map polygons are defined on -180 to 180, so we need to adjust our values to see them on the map.

normalize_lon <- function(x) {
  # Remainder of x / 360 using round-to-nearest, which lands in [-180, 180]
  quotient <- round(x / 360.)
  remainder <- x - (360. * quotient)
  # Adjust sign if necessary to match IEEE behavior; this branch only applies
  # to negative inputs, which do not occur for the 0-360 longitudes used here
  if (sign(x) != sign(360.) && remainder != 0) {
    remainder <- remainder + sign(360.) * 360.
  }
  return(remainder)
}
df_gulf$lon <- unlist(lapply(df_gulf$lon, normalize_lon))
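A vectorized alternative is sketched below, using modular arithmetic rather than the round()-based function above. Note the one edge-case difference: a longitude of exactly 180 maps to -180 here, whereas the function above returns 180.

# Vectorized sketch: wrap longitudes into [-180, 180) with modular arithmetic
normalize_lon_vec <- function(x) ((x + 180) %% 360) - 180
# df_gulf$lon <- normalize_lon_vec(df_gulf$lon)  # equivalent to the lapply call above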

Plot setup#

Set up a reasonable plot size and limits for the plot area of the world base map.

options(repr.plot.width = 10, repr.plot.height = 10)
ylim <- c( min(df_gulf$lat), max(df_gulf$lat) )
xlim <- c( min(df_gulf$lon), max(df_gulf$lon) )

Mesh/Dot Map#

Plot a mesh of dots, colored according to the value of tos at each lat/lon location in the Gulf.

w <- map_data( 'world', ylim = ylim, xlim = xlim )
p = ggplot() + 
    geom_polygon(data = w, aes(x=long, y = lat, group = group), fill = "grey", color = "black") +
    coord_fixed(1.3, xlim = xlim, ylim = ylim) + 
    geom_point(data = df_gulf, aes(color = tos, y=lat, x=lon), size=.5) +
    labs(title = paste(long_name, 'at', datetime, 'of', title))
p
[Figure: dot map of sea surface temperature (tos) over the Gulf region on a world base map]

Accessing the data on the original model grid#

The original model grid is explained in detail at the beginning of the cookbook. Understanding where the data are located on the earth requires you to combine the raw-grid data file with the grid description. We’ll get both the data and the grid from the cloud storage just as we did in the example above, pair the grid coordinates with the data values, and then convert them to a data.frame for plotting (or processing) in R.

N.B. We use AWS cloud storage here, whereas we used Google Cloud Storage above. You can use either, but if you are computing in a cloud environment you’ll get better performance if you use the same cloud.

N.B. The “bucket” names are slightly different.

N.B. Additional parameters (remote_options and target_options) are needed to tell AWS that the access does not require any credentials and can be made anonymously.

params = list(anon = TRUE)
fs <- fsspec$filesystem('reference', 
    remote_protocol='s3',
    fo='s3://noaa-oar-cefi-regional-mom6-nwa-pds/northwest_atlantic/full_domain/hindcast/monthly/raw/r20230520/sos.nwa.full.hcast.monthly.raw.r20230520.199301-201912.json',
    remote_options=params,
    target_options=params
)
m <- fs$get_mapper()
dataset <- xr$open_dataset(m, engine='zarr', consolidated=FALSE)
fsg <- fsspec$filesystem('reference', 
    remote_protocol='s3',
    fo='s3://noaa-oar-cefi-regional-mom6-nwa-pds/northwest_atlantic/full_domain/hindcast/monthly/raw/r20230520/ocean_static.json',
    remote_options=params,
    target_options=params
)
gm <- fsg$get_mapper()
grid <- xr$open_dataset(gm, engine='zarr', consolidated=FALSE)

Slicing the data#

Here, we get a one-month slice from the dataset; we took June 2003 in this case.

slice = dataset$sel(time='2003-06')
title = slice$attrs$title
long_name = slice$sos$long_name
datetime = slice$time$values
slice
<xarray.Dataset> Size: 3MB
Dimensions:     (time: 1, nv: 2, yh: 845, xh: 775)
Coordinates:
  * nv          (nv) float64 16B 1.0 2.0
  * time        (time) datetime64[ns] 8B 2003-06-16
  * xh          (xh) float64 6kB -98.0 -97.92 -97.84 ... -36.24 -36.16 -36.08
  * yh          (yh) float64 7kB 5.273 5.352 5.432 5.511 ... 51.9 51.91 51.93
Data variables:
    average_DT  (time) timedelta64[ns] 8B ...
    average_T1  (time) datetime64[ns] 8B ...
    average_T2  (time) datetime64[ns] 8B ...
    sos         (time, yh, xh) float32 3MB ...
    time_bnds   (time, nv) datetime64[ns] 16B ...
Attributes: (12/27)
    NCO:                    netCDF Operators version 5.0.1 (Homepage = http:/...
    NumFilesInSet:          1
    associated_files:       areacello: 19930101.ocean_static.nc
    cefi_archive_version:   /archive/acr/fre/NWA/2023_04/NWA12_COBALT_2023_04...
    cefi_aux:               N/A
    cefi_data_doi:          10.5281/zenodo.7893386
    ...                     ...
    cefi_subdomain:         full
    cefi_variable:          sos
    external_variables:     areacello
    grid_tile:              N/A
    grid_type:              regular
    title:                  NWA12_COBALT_2023_04_kpo4-coastatten-physics
lon <- grid$geolon$values
lat <- grid$geolat$values

We’re using the lat and lon values from the static grid file to pair with each data value we read from the raw-grid file, creating a data.frame with lon, lat, sos as the columns.

slice <- slice$transpose('xh', 'yh', 'time', 'nv')  # make xh the fastest-varying dimension
X <- as.vector(t(lon))  # flatten 2-D geolon to match the data order
Y <- as.vector(t(lat))  # flatten 2-D geolat to match the data order
data <- as.vector(t(matrix(slice$sos$values)))

df <- data.frame(
  "lon" = X,
  "lat" = Y,
  "sos" = data
)
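# Optional sanity check (an added sketch, base R only): the flattened grid
# coordinates and the data values must line up one-to-one; a length mismatch
# would signal a transpose/orientation mistake
stopifnot(length(X) == length(Y), length(Y) == length(data))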
sub <- filter(df, lat > 35, lat < 45, lon >= -80, lon <= -60)
sub
A data.frame: 41164 × 3 (first and last rows shown)

      lon      lat      sos
    <dbl>    <dbl>    <dbl>
-80.00000 35.02885      NaN
-79.92001 35.02885      NaN
-79.84000 35.02885      NaN
      ...      ...      ...
-60.16000 44.98657 31.42496
-60.07999 44.98657 31.41590
-60.00000 44.98657 31.42198
options(repr.plot.width = 10, repr.plot.height = 10)
ylim <- c( min(sub$lat), max(sub$lat) )
xlim <- c( min(sub$lon), max(sub$lon) )
w <- map_data( 'world', ylim = ylim, xlim = xlim )
p = ggplot() + 
    geom_polygon(data = w, aes(x=long, y = lat, group = group), fill = "grey", color = "black") +
    coord_fixed(1.3, xlim = xlim, ylim = ylim) + 
    geom_point(data = sub, aes(color = sos, y=lat, x=lon), size=1) +
    labs(title = paste(long_name, 'at', datetime, 'of', title, '\nRaw Grid'))
p
[Figure: dot map of sea surface salinity (sos) on the raw model grid over the subset region]