The “Surface UV” data are on a 0.5° × 0.5° grid of longitude E and latitude N; that is, latitudes south of the equator and longitudes west of Greenwich are expressed as negative numbers. The data consist of several different variables, both daily doses and daily maximum irradiances, biologically weighted and unweighted.
Please see the online-only article for step-by-step instructions.
The files are text files with a header protected with # as the comment marker; the data are in aligned columns separated by a single space character. The column names are not stored as column headings, but instead in the header of the file, one variable per row. Thus, decoding the file header is key to the interpretation of the data, while reading the data themselves is simple, although setting the correct R classes for the different variables is also important.
We fetch the path to an example file included in the package, originally downloaded from the FMI server. The request was for the grid point closest to Viikki, Helsinki, Finland, and for the unweighted UV-B and UV-A daily doses and daily maximum irradiances.
one.file.name <-
system.file("extdata", "AC_SAF_ts_Viikki.txt",
package = "reumetsat", mustWork = TRUE)
Two query functions make it possible to find out the names of the variables contained in a file and the coordinates of the location corresponding to the time series data.
vars_AC_SAF_txt(one.file.name)
#> [1] "Date" "DailyDoseUva" "DailyDoseUvb"
#> [4] "DailyMaxDoseRateUva" "DailyMaxDoseRateUvb" "QC_MISSING"
#> [7] "QC_LOW_QUALITY" "QC_MEDIUM_QUALITY" "QC_INHOMOG_SURFACE"
#> [10] "QC_POLAR_NIGHT" "QC_LOW_SUN" "QC_OUTOFRANGE_INPUT"
#> [13] "QC_NO_CLOUD_DATA" "QC_POOR_DIURNAL_CLOUDS" "QC_THICK_CLOUDS"
#> [16] "QC_ALB_CLIM_IN_DYN_REG" "QC_LUT_OVERFLOW" "QC_OZONE_SOURCE"
#> [19] "QC_NUM_AM_COT" "QC_NUM_PM_COT" "QC_NOON_TO_COT"
#> [22] "Algorithm version"
The geographic coordinates of the location are returned.
The variables included in downloaded files can be chosen when the request is submitted online. The default is to read all the variables present in the file.
summer_viikki.tb <- read_AC_SAF_txt(one.file.name)
dim(summer_viikki.tb)
#> [1] 153 22
colnames(summer_viikki.tb)
#> [1] "Date" "DailyDoseUva" "DailyDoseUvb"
#> [4] "DailyMaxDoseRateUva" "DailyMaxDoseRateUvb" "QC_MISSING"
#> [7] "QC_LOW_QUALITY" "QC_MEDIUM_QUALITY" "QC_INHOMOG_SURFACE"
#> [10] "QC_POLAR_NIGHT" "QC_LOW_SUN" "QC_OUTOFRANGE_INPUT"
#> [13] "QC_NO_CLOUD_DATA" "QC_POOR_DIURNAL_CLOUDS" "QC_THICK_CLOUDS"
#> [16] "QC_ALB_CLIM_IN_DYN_REG" "QC_LUT_OVERFLOW" "QC_OZONE_SOURCE"
#> [19] "QC_NUM_AM_COT" "QC_NUM_PM_COT" "QC_NOON_TO_COT"
#> [22] "Algorithm version"
The returned data frame has 153 rows (= days) and 22 columns (variables). We can see above that several of the variables have names starting with “QC” for quality control.
The class of the different columns varies; in particular, the “QC” variables are stored in the data frame as integer values.
str(lapply(summer_viikki.tb, class))
#> List of 22
#> $ Date : chr "Date"
#> $ DailyDoseUva : chr "numeric"
#> $ DailyDoseUvb : chr "numeric"
#> $ DailyMaxDoseRateUva : chr "numeric"
#> $ DailyMaxDoseRateUvb : chr "numeric"
#> $ QC_MISSING : chr "integer"
#> $ QC_LOW_QUALITY : chr "integer"
#> $ QC_MEDIUM_QUALITY : chr "integer"
#> $ QC_INHOMOG_SURFACE : chr "integer"
#> $ QC_POLAR_NIGHT : chr "integer"
#> $ QC_LOW_SUN : chr "integer"
#> $ QC_OUTOFRANGE_INPUT : chr "integer"
#> $ QC_NO_CLOUD_DATA : chr "integer"
#> $ QC_POOR_DIURNAL_CLOUDS: chr "integer"
#> $ QC_THICK_CLOUDS : chr "integer"
#> $ QC_ALB_CLIM_IN_DYN_REG: chr "integer"
#> $ QC_LUT_OVERFLOW : chr "integer"
#> $ QC_OZONE_SOURCE : chr "integer"
#> $ QC_NUM_AM_COT : chr "integer"
#> $ QC_NUM_PM_COT : chr "integer"
#> $ QC_NOON_TO_COT : chr "integer"
#> $ Algorithm version : chr "character"
As bad data values are filled with NA in the measured/derived variables, a smaller data frame can be obtained by not reading the QC (quality control) variables.
summer_viikki_QCf.tb <-
read_AC_SAF_txt(one.file.name, keep.QC = FALSE)
dim(summer_viikki_QCf.tb)
#> [1] 153 6
colnames(summer_viikki_QCf.tb)
#> [1] "Date" "DailyDoseUva" "DailyDoseUvb"
#> [4] "DailyMaxDoseRateUva" "DailyMaxDoseRateUvb" "Algorithm version"
When reading multiple time series for different locations it is important to include the geographic coordinates in the returned data frame. The default is to include these coordinates when more than one file is read in a single call to read_AC_SAF_txt(). However, here we override the default and add coordinates when reading only one file.
summer_viikki_geo.tb <-
read_AC_SAF_txt(one.file.name, keep.QC = FALSE, add.geo = TRUE)
dim(summer_viikki_geo.tb)
#> [1] 153 8
colnames(summer_viikki_geo.tb)
#> [1] "Date" "DailyDoseUva" "DailyDoseUvb"
#> [4] "DailyMaxDoseRateUva" "DailyMaxDoseRateUvb" "Algorithm version"
#> [7] "Longitude" "Latitude"
In some cases we may want to read only specific variables out of the file. This is possible by passing the names of the variables as an argument through parameter vars.to.read.
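For example, a minimal sketch of such a call, assuming the same example file as above and requesting only the date and the unweighted UV-B daily dose:

```r
# Read only two of the variables present in the file
uvb_only.tb <-
  read_AC_SAF_txt(one.file.name,
                  vars.to.read = c("Date", "DailyDoseUvb"))
colnames(uvb_only.tb)
```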
As the returned object is an R data frame, plotting and other computations do not differ from the usual ones. One example follows showing subsetting based on dates. In the time series there are occasionally days with missing data (NA), which may need to be addressed.
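A base R sketch of such subsetting, assuming the Date column has class Date as shown above; the date range used here is hypothetical:

```r
# Keep only the June days that have a valid UV-B daily dose
june_viikki.tb <-
  subset(summer_viikki_QCf.tb,
         Date >= as.Date("2024-06-01") &
           Date <= as.Date("2024-06-30") &
           !is.na(DailyDoseUvb))
```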
We may be interested in computing the total UV-B dose accumulated over the duration of an experiment. There are different ways of doing this computation; here I use base R functions.
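One possible sketch using base R: summing the daily doses while skipping days with missing data. Whether silently dropping NAs is acceptable depends on the analysis; the alternative is to impute or interpolate the missing days first.

```r
# Total UV-B dose over the period covered by the data frame;
# na.rm = TRUE skips days with missing data
total_uvb <- sum(summer_viikki_QCf.tb[["DailyDoseUvb"]], na.rm = TRUE)
```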
Worldwide coverage consists of 720 × 360 = 2.592 × 10⁵ grid points. As for the time series, the number of data columns varies. However, one difference is that QC information is collected into a single variable. The format of the files is HDF5, a binary format that allows selective reading. Additional optimizations are used to reduce the size of the files; the main one is that the geographic coordinates of the grid points are not saved explicitly, but instead the information needed to compute them is included as metadata. The data are provided as one file per day, with the size of the files depending on the number of grid points included as well as on the number of variables. As these are off-line data made available with a delay, in most cases we are interested in data for a certain period of time.
Please see the online-only article for step-by-step instructions.
The HDF5 files have a specific format and content organization; the functions in package ‘reumetsat’ use functions from package ‘rhdf5’ to access them. The column names are stored as metadata and can be queried without reading the whole file. Thus, decoding is simpler than for the time series files in text format. Reading the data is simple as they are stored as numeric values requiring no interpretation. The dates, in contrast, need to be decoded from the file names, making it crucial that users do not rename the files contained in the .zip archive.
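For illustration, a base R sketch of how a date could be extracted from one of these file names; the regular expression assumes the naming convention of the example files (e.g., O3MOUV_L3_20240621_v02p02.HDF5):

```r
file.name <- "O3MOUV_L3_20240621_v02p02.HDF5"
# Extract the eight-digit date field and convert it to class Date
date.string <- sub("^O3MOUV_L3_([0-9]{8})_.*$", "\\1", basename(file.name))
as.Date(date.string, format = "%Y%m%d")
#> [1] "2024-06-21"
```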
As above for the time series file, we fetch the path to an example file included in the package, originally downloaded from the FMI server. It covers the whole of the Iberian Peninsula and the Balearic Islands. Only the unweighted UV-B and UV-A daily doses and daily maximum irradiances were requested from the server.
one.file.name <-
system.file("extdata", "O3MOUV_L3_20240621_v02p02.HDF5",
package = "reumetsat", mustWork = TRUE)
Two query functions make it possible to find out the names of the variables contained in a file and the coordinates of the grid.
vars_AC_SAF_hdf5(one.file.name)
#> [1] "Date" "Longitude" "Latitude"
#> [4] "DailyDoseUva" "DailyDoseUvb" "DailyMaxDoseRateUva"
#> [7] "DailyMaxDoseRateUvb" "QualityFlags"
By default only the boundaries of the grid are returned.
With the defaults all variables are read, and because the data can include multiple geographic grid points, Longitude and Latitude are always returned in the data frame.
midsummer_spain.tb <- read_AC_SAF_hdf5(one.file.name)
dim(midsummer_spain.tb)
#> [1] 221 8
colnames(midsummer_spain.tb)
#> [1] "Date" "Longitude" "Latitude"
#> [4] "DailyDoseUva" "DailyDoseUvb" "DailyMaxDoseRateUva"
#> [7] "DailyMaxDoseRateUvb" "QualityFlags"
Variable names are consistent between the data frames returned by read_AC_SAF_hdf5() and read_AC_SAF_txt(), but the position of the columns can vary. Use names rather than numeric positional indices to extract columns!
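For example, any of the column names listed above can be used for extraction by name:

```r
# Extract a column by name instead of by position
uvb_doses <- midsummer_spain.tb[["DailyDoseUvb"]]
summary(uvb_doses)
```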
str(lapply(midsummer_spain.tb, class))
#> List of 8
#> $ Date : chr "Date"
#> $ Longitude : chr "numeric"
#> $ Latitude : chr "numeric"
#> $ DailyDoseUva : chr "numeric"
#> $ DailyDoseUvb : chr "numeric"
#> $ DailyMaxDoseRateUva: chr "numeric"
#> $ DailyMaxDoseRateUvb: chr "numeric"
#> $ QualityFlags : chr "numeric"
Quality control information is encoded differently in the two types of downloaded files. As seen above, the .txt files contain individual QC variables taking single-digit integer values. In the .HDF5 files the flags are collapsed into a single variable that needs decoding.
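As an illustration only, individual flags packed into a single value can be extracted with base R bitwise operations. The actual layout of QualityFlags is defined in the AC SAF product documentation; the bit positions and flag meanings below are hypothetical:

```r
flag.value <- 5L              # a hypothetical packed QC value
bitwAnd(flag.value, 1L) != 0  # hypothetical bit 0, e.g. "missing"
#> [1] TRUE
bitwAnd(flag.value, 2L) != 0  # hypothetical bit 1, e.g. "low sun"
#> [1] FALSE
```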
As before, we can read only specific variables if needed by passing their names as the argument to vars.to.read.
midsummer_spain_daily.tb <-
read_AC_SAF_hdf5(one.file.name,
vars.to.read = c("DailyDoseUva", "DailyDoseUvb"))
dim(midsummer_spain_daily.tb)
#> [1] 221 5
colnames(midsummer_spain_daily.tb)
#> [1] "Date" "Longitude" "Latitude" "DailyDoseUva" "DailyDoseUvb"
We can read multiple files, with a limit on their maximum number imposed by the available computer RAM, as the data frames reside in RAM during computations. The amount of RAM required varies with the geographic area covered and the number of variables read. In practice, this limit is unlikely to be a problem except with data with worldwide coverage.
We fetch the paths to the example files included in the package. In normal use this step is not needed, as users will know the paths to the files to read, or will use function list.files() with a search pattern if they know the folder where the files to be read reside.
five.file.names <-
system.file("extdata",
c("O3MOUV_L3_20240620_v02p02.HDF5",
"O3MOUV_L3_20240621_v02p02.HDF5",
"O3MOUV_L3_20240622_v02p02.HDF5",
"O3MOUV_L3_20240623_v02p02.HDF5",
"O3MOUV_L3_20240624_v02p02.HDF5"),
package = "reumetsat", mustWork = TRUE)
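As mentioned above, when the files reside in a known folder their paths can instead be listed with list.files(); "path/to/folder" below is a placeholder:

```r
# List all daily OUV HDF5 files in a folder, returning full paths
file.names <- list.files("path/to/folder",
                         pattern = "^O3MOUV_L3_.*\\.HDF5$",
                         full.names = TRUE)
```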
The only difference from the case of reading a single file is in the length of the character vector containing file names. The files read in the same call to read_AC_SAF_hdf5() should share identical grids and contain the variables to be read. If this is not the case, read_AC_SAF_hdf5() should be used to read them individually, and the results combined later, which is a slower approach.
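A sketch of reading the five example files in a single call; the returned data frame presumably contains one set of rows per day, with the same column names as when reading a single file:

```r
# Read five consecutive days of data in one call
five_days_spain.tb <- read_AC_SAF_hdf5(five.file.names)
```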