---
title: "User Guide"
subtitle: "Package 'surfaceuv' `r packageVersion('surfaceuv')` "
author: "Pedro J. Aphalo"
date: "`r packageDate('surfaceuv')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{User Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  fig.width = 6,
  fig.height = 4,
  out.width = "95%",
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(surfaceuv)
```

## Introduction

The AC SAF project of EUMETSAT on atmospheric composition provides several different data products, including "Surface UV", for ultraviolet radiation doses and irradiances. The "Surface UV" data are on a $0.5^\circ \times 0.5^\circ$ longitude E and latitude N grid. That is to say that latitudes south of the equator and longitudes west of Greenwich are expressed as negative numbers. The data consist in several different variables, both daily doses and daily maximum irradiances, biologically weighted and not weighted, estimate errors for them as well as quality flags.

Data can be downloaded in two different formats, suitable for different uses. The most efficient way of downloading data for a single location is to download then as a _time series_ in a text file. When we intend to download data for a region or the whole Earth gridded data are the right approach. These are available as binary files in HDF5 format, that are smaller and allow selective reading of variables and grid regions, allowing fast "ingestion" of the data. This package provides functions that make it easy to selectively important data in either format into R data frames. Please, see the [online-only article for step-by-step instructions on how to obtain the data files from the server](https://docs.r4photobiology.info/surfaceuv/articles/surface-uv.html). This must be done at a web page as there is no API available and in addition the files can be downloaded only after a delay, as preparation involved processing in the server.

## Time series

### File format

The files are text files with a header protected with `#` as comment marker and the data are in aligned columns separated by one space character. The column names are not stored as column headings, but instead in the header of the file, one variable per row. Thus, decoding the file header is key to the interpretation of the data, while reading the data is simple, although setting the correct R classes to the different variables is also important. The top of the file we will use in the examples is shown below.

```
#AC SAF offline surface UV, time-series
#OUV EXTRACTOR VERSION: 1.20
#LONGITUDE: 25.000 (0-based index 410)
#LATITUDE: 60.000 (0-based index 300)
#COLUMN DEFINITIONS
#0: Date [YYYYMMDD]
#1: DailyDoseUva [kJ/m2]
#2: DailyDoseUvb [kJ/m2]
#3: DailyMaxDoseRateUva [mW/m2]
#4: DailyMaxDoseRateUvb [mW/m2]
#5: QC_MISSING
#6: QC_LOW_QUALITY
#7: QC_MEDIUM_QUALITY
#8: QC_INHOMOG_SURFACE
#9: QC_POLAR_NIGHT
#10: QC_LOW_SUN
#11: QC_OUTOFRANGE_INPUT
#12: QC_NO_CLOUD_DATA
#13: QC_POOR_DIURNAL_CLOUDS
#14: QC_THICK_CLOUDS
#15: QC_ALB_CLIM_IN_DYN_REG
#16: QC_LUT_OVERFLOW
#17: QC_OZONE_SOURCE
#18: QC_NUM_AM_COT
#19: QC_NUM_PM_COT
#20: QC_NOON_TO_COT
#21: Algorithm version
#DATA
20240501  1.224e+03  1.558e+01  3.932e+04  6.628e+02 0 0 0 0 0 0 0 0 0 0 0 0  1  2  0  0 2.2
20240502  1.235e+03  1.648e+01  3.951e+04  6.974e+02 0 0 0 0 0 0 0 0 0 0 0 0  1  2  0  0 2.2
20240503  9.368e+02  1.345e+01  3.224e+04  5.871e+02 0 0 0 0 0 0 0 0 0 0 0 0  1  2  0  0 2.2
... etc.
```

### Reading one file

All functions in the package are vectorised for parameter `files` and can read one or more files in a single call to any of them. However, in our first example we read a single.

We fetch the path to an example file included in the package, originally downloaded from the FMI server. The grid point closest to Viikki, Helsinki, Finland. _In normal use this step is unnecessary as the user will already know the folder where the files to be read are located._

```{r}
one.file.name <-
   system.file("extdata", "AC_SAF-Viikki-FI-6masl.txt",
               package = "surfaceuv", mustWork = TRUE)
```

Query functions make it possible to find out the names of the variables contained in a file and the coordinates of location corresponding to the time series data. They are very useful as these depend on what is requested at the time the file was downloaded.

```{r}
vars_AC_SAF_UV_txt(one.file.name)
```

To skip the quality control (`QC`) flags from the listed variables we can add `keep.QC = FALSE` to the call above.

```{r}
vars_AC_SAF_UV_txt(one.file.name, keep.QC = FALSE)
```

The geographic coordinates of the location are returned based on the file header.

```{r}
grid_AC_SAF_UV_txt(one.file.name)
```

The variables included in downloaded files can be chosen when the request is submitted on-line. The default in function `read_AC_SAF_UV_txt()` is to read all the variables present in the file.

```{r}
summer_viikki.tb <- read_AC_SAF_UV_txt(one.file.name)
dim(summer_viikki.tb)
colnames(summer_viikki.tb)
```

The returned data frame has 153 rows (= days) and 22 columns (variables). We can see above that several of the variables have names starting with "QC" for quality control. It is good to check them, as even though bad data are set as `NA`, some of these flags also report weaknesses of the estimates that even if not fatal can be important.

```{r}
range(summer_viikki.tb[["Date"]])
```

The class of the different columns varies, while dose and irradiance data are stored as `numeric` values, the "QC" variables are stored in the data frame as `integer` values. The dates are stored in a variable of class `Date` and the `Algorithm version` as `character`.

```{r}
str(lapply(summer_viikki.tb, class))
```

As bad data values are filled with `NA` in the measured/derived variables, a smaller data frame can be obtained by not reading the `QC` (quality control) variables.

```{r}
summer_viikki_QCf.tb <-
  read_AC_SAF_UV_txt(one.file.name, keep.QC = FALSE)
dim(summer_viikki_QCf.tb)
colnames(summer_viikki_QCf.tb)
```

In some cases we may want to read only specific variables out of the file. This is possible by passing the names of the variables as an argument through parameter `vars.to.read`. In the example below we read only two data variables plus `Date` that is always included.

```{r}
# read two variables
summer_viikki_2.tb <-
  read_AC_SAF_UV_txt(one.file.name,
                     vars.to.read = c("DailyDoseUva", "DailyDoseUvb"))
dim(summer_viikki_2.tb)
colnames(summer_viikki_2.tb)
```
To read variables based on name matching we can first retrieve the names of all variables, and select some of them with `grep()` before passing them to parameter `vars.to.read` as argument.

```{r}
# read UV-A and UV-B variables and QC flags
uvb.vars <- grep("uvb$|uva$|^QC_", 
                 vars_AC_SAF_UV_txt(one.file.name), value = TRUE, ignore.case = TRUE)
summer_viikki_3.tb <-
  read_AC_SAF_UV_txt(one.file.name,
                     vars.to.read = uvb.vars)
dim(summer_viikki_3.tb)
colnames(summer_viikki_3.tb)
```

### Reading multiple files

In the case of time-series data one may want to read several files for the same location, for example for different time periods at the same location or for a few different locations for the same time period. This can be achieved by passing to parameter `files` a vector of file names.

One consideration is that when reading multiple time-series files, we must ensure that the variables we intend to read are present in all the files. The presence of additional variables in only some files is not a problem as the files are read selectively, and the name and position of columns found from the header of each file. We can find out which variables are present in all the files we intend to read as shown below, relying on the default behavior of function `vars_AC_SAF_UV_txt()`. We first locate the folder where the files are stored and then search for files with names matching a pattern.

```{r}
path.to.files <-
    system.file("extdata",
                package = "surfaceuv", mustWork = TRUE)
two.file.names <-
    list.files(path.to.files, pattern = "*masl\\.txt$", full.names = TRUE)
```

We can find out if the two files contain data for the same location.

```{r}
grid_AC_SAF_UV_txt(two.file.names)
```

We can ignore the quality control variables to keep the output simpler and easier to understand. A warning is issued because the files contain different sets of variables. By default the returned vector contains the names of the variables that are present in all files. 
 
```{r}
shared.variables <- vars_AC_SAF_UV_txt(two.file.names, keep.QC = FALSE)
```

In the case of reading multiple time series for different locations it is important to include the geographic coordinates in the returned data frame. The default is to include these coordinates when more than one file is read in a single call to `read_AC_SAF_UV_txt()`. 
```{r}
two_locations.tb <-
  read_AC_SAF_UV_txt(two.file.names, vars.to.read = shared.variables, keep.QC = FALSE)
dim(two_locations.tb)
colnames(two_locations.tb)
```
The returned value is a single data frame with the concatenated contents of the files.


We can check the variables present in each file.

```{r}
  vars.ls <- lapply(two.file.names, FUN = vars_AC_SAF_UV_txt, keep.QC = FALSE)
  names(vars.ls) <- basename(two.file.names)
  vars.ls
```

### Using the data

Being the returned object an R data frame plotting and other computations do not differ from the usual ones. One example follows showing subsetting based on dates. In the time series there are occasionally days missing data (`NA`), and this may need to be addressed.

We may be interested in computing the total UV-B dose accumulated over the duration of an experiment. There are different ways of doing this computation, here I use base R functions.

```{r}
subset(summer_viikki_2.tb, 
       Date >= as.Date("2024-07-15") & Date < as.Date("2024-08-15")) |>
with(sum(DailyDoseUvb))
```

The file headers are copied to attribute "file.header", in the form of a list with one character vector per file read.

```{r}
attr(two_locations.tb, "file.header")
```

## Data on a geographic grid

Worldwide coverage consists in $720 \times 360 = `r 720 * 360`$ grid points. As for time series, the number of data columns varies. However, one difference is that QC information is collected into a single variable. The format of the files is HDF5, which are binary files that allow selective reading. There are additional optimizations used to reduce the size, the main one is that the geographic coordinates of the grid points are not saved explicitly but instead the information needed to compute them is included as metadata. The data are provided as one file per day, with the size of the files depending on the number of grid points included as well as the number of variables. As these are off-line data available with a delay, in most cases we are interested in data for a certain period of time.

### Format of gridded data

The HDF5 files have a specific format and content organization that makes it possible to efficiently read subsets of the data they contain. The functions in package 'surfaceuv' call functions from package 'rhdf5' to access these files. The column names are stored as metadata and can be queried without reading the whole file. Thus, decoding is simpler than for the time series files in text format. Reading the data is also simple as it is stored as numeric values no requiring interpretation. The dates, in contrast, need to be decoded from the file names, making it crucial that if users rename the files they preserve the date string and the `_` at its boundary. The HDF5 files are contained in a `.zip` archive. The `.zip` archives can be freely renamed if desired.

### Reading one gridded-data file

As HDF5 gridded data files contain data for a single day, frequently we need to read several of them concatenating the data they contain. Anyway, in the first example we read a single file for simplicity.

As above for the time series file, we fetch the path to an example file included in the package, originally downloaded from the FMI server. It covers the whole of the Iberian peninsula and the Balearic islands. Only variables for UV-B and UV-A daily dose and daily maximum irradiances, not biologically weighted, were requested from the server. _In normal use this step is unnecessary as the user will already know the folder where the file to be read is located._

```{r}
one.file.name <-
   system.file("extdata", "O3MOUV_L3_20240621_v02p02.HDF5",
               package = "surfaceuv", mustWork = TRUE)
```

Two query functions make it possible to find out the names of the variables contained in a file and the coordinates of the grid.

```{r}
vars_AC_SAF_UV_hdf5(one.file.name)
```

By default only the boundaries of the grid are returned.

```{r}
grid_AC_SAF_UV_hdf5(one.file.name)
```

With defaults all variables are read, and because the data can include multiple geographic grid points, `Longitude` and `Latitude` are always returned in the data frame.

```{r}
midsummer_spain.tb <- read_AC_SAF_UV_hdf5(one.file.name)
dim(midsummer_spain.tb)
colnames(midsummer_spain.tb)
```

Variable names are consistent between the data frames returned by `read_AC_SAF_UV_hdf5()` and `read_AC_SAF_UV_txt()`, but the position of the columns, can vary. *Use names rather than numeric positional indices to extract columns!*

```{r}
str(lapply(midsummer_spain.tb, class))
```

Quality control information is encoded differently in the two types of downloaded files. As seen above, in `.txt` individual QC variables, taking as values single-digit `integer` values are present. In the `.HDF5` files the flags are collapsed into a single 64 bit variable, that needs decoding. As R does not support 64 bit integers, the package returns the `QualityFlags` variable as an columns of class `integre64`, a class defined in package 'bit64'.

```{r}
head(midsummer_spain.tb$QualityFlags)
```

We can as before read only specific variables if needed by passing their names as argument to `vars.to.read`.

```{r}
midsummer_spain_daily.tb <-
  read_AC_SAF_UV_hdf5(one.file.name,
                    vars.to.read = c("DailyDoseUva", "DailyDoseUvb"))
dim(midsummer_spain_daily.tb)
colnames(midsummer_spain_daily.tb)
```

### Reading multiple gridded-data files

We can read multiple files, with a limit to their maximum number imposed by the available computer RAM as data frames as used reside in RAM during computations. The amount of RAM required varies with the geographic area covered and number of variables read. *In practice, this limit is unlikely to be a problem only with data with world-wide or continental coverage.* (The size of the data frame affects performance when the total number of cells or length of the underlying vector(s) is so large that cannot be represented as an R (32 bit) `integer`, $n > `r  .Machine$integer.max`$).

We fetch the paths to the example files included in the package. In normal use, this step is not needed as the user will know the paths to the files to read, or will use function `list.files()` with a search pattern if he/she knows the folder where the files to be read reside. _In normal use this step is unnecessary as the user will already know the folder where the file to be read is located._

```{r}
five.file.names <-
   system.file("extdata",
               c("O3MOUV_L3_20240620_v02p02.HDF5",
                 "O3MOUV_L3_20240621_v02p02.HDF5",
                 "O3MOUV_L3_20240622_v02p02.HDF5",
                 "O3MOUV_L3_20240623_v02p02.HDF5",
                 "O3MOUV_L3_20240624_v02p02.HDF5"),
               package = "surfaceuv", mustWork = TRUE)
```

The only difference to the case of reading a single file is in the length of the character vector containing file names. *The different files read in the same call to `read_AC_SAF_UV_hdf5()` should share identical grids and contain all the variables to be read (by default all those in the first file read).* If this is not the case, currently `read_AC_SAF_UV_hdf5()` should be used to read them individually and later combined, which is a slower approach.

```{r}
summer_5days_spain.tb <- read_AC_SAF_UV_hdf5(five.file.names)
dim(summer_5days_spain.tb)
colnames(summer_5days_spain.tb)
```
Why can silence progress reporting.

```{r}
summer_5days_spain.tb <- 
  read_AC_SAF_UV_hdf5(five.file.names, verbose = FALSE)
```

As shown above for text files, query functions can be used to extract information about the files, in this case without reading them in whole.

```{r}
vars_AC_SAF_UV_hdf5(five.file.names)
```

```{r}
vars_AC_SAF_UV_hdf5(five.file.names, keep.QC = FALSE)
```

```{r}
grid_AC_SAF_UV_hdf5(five.file.names)
```

```{r}
grid_AC_SAF_UV_hdf5(five.file.names, expand = TRUE) |>
  head(10)
```

```{r}
date_AC_SAF_UV_hdf5(five.file.names)
```

```{r}
date_AC_SAF_UV_hdf5(five.file.names, use.names = FALSE)
```


If the files differ in the variables they contain, or we want to only extract some, as shown above, we can select them with `grep()` or some other function once we have a list of their names. (Of course, we can also simply pass the names of the variables as an inline character vector in the call, as shown above for time-series data files.)

```{r}
all.vars <- vars_AC_SAF_UV_hdf5(five.file.names)
daily.dose.vars <- grep("^DailyDose", all.vars, value = TRUE)

doses_5days_spain.tb <- 
  read_AC_SAF_UV_hdf5(five.file.names, vars.to.read = daily.dose.vars)
dim(doses_5days_spain.tb)
colnames(doses_5days_spain.tb)

```

### Reading gridded files differing in the variables

As memory is pre-allocated the set of variables to be read is decided when or before the first file is read. If no argument is passed to `vars.to.read`, the default becomes the set of variables found in the first file read. In contrast, if an argument is passed, it determines the set of variables to read. If any variable in this set is not found in a file, currently, an error is triggered.

We can use `r `vars__AC_SAF_UV_hdf5()` to obtain a vector with the names of all the variables that are present consistently across all the files. This is the case with the default argument `set.oper = "intersect"`. Passing `"union"` lists all the variables found in at least one of the files.

```{r}
path.to.files <-
   system.file("extdata",
               package = "surfaceuv", mustWork = TRUE)
```

```{r}
gridded.files <-
  list.files(path.to.files, pattern = "*\\.HDF5$", full.names = TRUE)
```


```{r}
grid_AC_SAF_UV_hdf5(gridded.files)
```

```{r}
shared.variables <- vars_AC_SAF_UV_hdf5(gridded.files, keep.QC = FALSE)
shared.variables
```
We can check the variables present in each file.

```{r}
  vars.ls <- lapply(gridded.files, FUN = vars_AC_SAF_UV_hdf5, keep.QC = FALSE)
  names(vars.ls) <- basename(gridded.files)
  vars.ls
```

By default progress is reported when the function is called interactively. Thus, within the vignette they are not shown by default. We can force its display.

```{r}
six.days.tb <- 
  read_AC_SAF_UV_hdf5(gridded.files, 
                      vars.to.read = shared.variables, 
                      keep.QC = FALSE,
                      verbose = TRUE)
```

As you surely expect by now, the returned object is a data frame.

```{r}
dim(six.days.tb)
colnames(six.days.tb)
```

```{r}
head(six.days.tb)
```

### Using the data

As shown above, being the returned object an R data frame plotting and other computations do not differ from the usual ones.

Currently only a trace  of the origin of the data is preserved in the data frame object's `comment` attribute.

```{r}
comment(six.days.tb)
```
The individual variables also have metadata, including units of expression, as attributes. These are copied from the attributes in the first file read, which is safe, as the downloaded data are always consistent in this respect.

```{r}
attributes(six.days.tb$DailyDoseUvb)
```

## References

Kujanpää, J. (2019) _PRODUCT USER MANUAL Offline UV Products v2
   (IDs: O3M-450 - O3M-464) and Data Record R1 (IDs: O3M-138 - O3M-152)_. Ref.
   SAF/AC/FMI/PUM/001. 18 pp. EUMETSAT AC SAF.