Draft CF data model
This document outlines an abstract model for data and metadata corresponding
to the CF
metadata standard (version 1.5).
CF is primarily a convention for storing data in netCDF,
and does not explicitly present a data model,
but its design implies one to some extent.
This document tries to avoid prescribing more than is needed
for interpreting CF as it stands,
in order to avoid inconsistency with future developments of CF.
Some parts of the CF standard arise from the requirements or restrictions
of the netCDF
file format, or are concerned with efficient ways of storing data on disk;
these parts are not logically part of the data model.
Data space
The central concept of the data model is a space construct.
(We use the word "construct" because it is a
language-neutral term, unlike "object" or "structure".)
In a dataset contained in a single netCDF file, each data variable usually
corresponds to a space construct, but a space construct
might be a combination of several data variables.
In a dataset comprising several netCDF files, a space construct may span data
variables in more than one file.
Rules for aggregating data variables from one or several files
into a single space construct are needed
but have yet to be defined in CF; at the moment these rules
are regarded as the concern of data processing software.
Each space construct may have
- An ordered list of one or more dimension constructs
(or "dimensions" for short).
- A data array with dimensions in the order listed and with
the appropriate sizes.
If there are no dimensions, the data array is a scalar.
The elements of the data array must all be of the same data type,
which may be numeric, character or string.
- An unordered collection of auxiliary coordinate constructs.
- An unordered collection of cell measure constructs.
- A cell methods property, which refers to the dimensions
(but not their sizes).
- An unordered collection of grid mappings, which describe functions
and supply parameters to define relationships among
the coordinates of the space or to calculate new coordinates from them.
- Other properties, which are metadata
that do not refer to the dimensions,
and serve to describe the data the space contains.
Properties may be of any data type (numeric, character or string)
and can be scalars or arrays.
They are attributes in the netCDF file, but we use the term "property" instead
because not all CF attributes are properties in this sense.
All the components of the space construct are optional. The data array
would be missing if the space construct serves only to define a
coordinate system for the space.
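As a minimal sketch of the components listed above, a space construct might be held in memory as follows. The class name SpaceConstruct and its field names are illustrative assumptions, not names defined by CF or by this data model, and nested lists stand in for the data array.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SpaceConstruct:
    """Hypothetical in-memory container for one space construct."""
    # Ordered list of dimensions, reduced here to (name, size) pairs.
    dimensions: List[tuple] = field(default_factory=list)
    # Data array as nested lists; None if the construct only defines
    # a coordinate system.
    data: Optional[Any] = None
    auxiliary_coordinates: List[Any] = field(default_factory=list)  # unordered
    cell_measures: List[Any] = field(default_factory=list)          # unordered
    cell_methods: List[tuple] = field(default_factory=list)         # ordered
    grid_mappings: List[Any] = field(default_factory=list)          # unordered
    properties: Dict[str, Any] = field(default_factory=dict)

    def shape(self):
        """Sizes of the dimensions, in the order listed."""
        return tuple(size for _name, size in self.dimensions)


# Every component is optional, so an empty construct is valid.
empty = SpaceConstruct()

# A 2 x 3 field of air temperature (made-up values).
tas = SpaceConstruct(
    dimensions=[("lat", 2), ("lon", 3)],
    data=[[280.0, 281.5, 279.0], [285.0, 284.2, 286.1]],
    properties={"standard_name": "air_temperature", "units": "K"},
)
```

Because each field defaults to a fresh empty container, two constructs never share a property dictionary or a coordinate list by accident.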
Dimension construct
A dimension construct must contain
- A size (an integer greater than zero),
which can be equal to one. In CF, there is a formal distinction between
scalar coordinate variables and size-one coordinate variables,
but they are logically the same; CF supports scalar coordinate variables
for simplicity and convenience in the netCDF file.
An example of a size-one dimension is a vertical dimension for 1.5 m height.
It may also contain
- A one-dimensional numerical coordinate array of the size specified
for the dimension.
Dimension constructs cannot have string-valued coordinates.
If the size of the dimension is greater than one,
the elements of the coordinate array must all be of the same numeric
data type, they must all have different non-missing values,
and they must be monotonically increasing or decreasing.
In this data model, a CF string-valued coordinate variable
or string-valued scalar coordinate variable
corresponds to an auxiliary coordinate construct (not a dimension construct),
with a dimension
whose own construct has no coordinate array.
- A two-dimensional boundary coordinate array, whose slow-varying
(second in Fortran) dimension equals the size specified by the
dimension construct,
and whose fast-varying dimension is two, indicating the extent of the cell.
For climatological time dimensions,
the bounds are interpreted in a special way
indicated by the cell methods.
- Properties (in the same sense as for the space construct)
serving to describe the coordinates.
In the data model, these latter components are all optional.
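The constraints on a dimension construct can be sketched as a small validity check. This is an illustration under stated assumptions: the class DimensionConstruct and the method problems() are hypothetical names, plain lists stand in for arrays, and the checks for missing values and for a uniform numeric data type are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DimensionConstruct:
    """Hypothetical dimension construct; only `size` is mandatory."""
    size: int
    coordinates: Optional[List[float]] = None
    bounds: Optional[List[List[float]]] = None   # shape (size, 2)
    properties: Dict[str, object] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return a list of violated constraints (empty if none)."""
        found = []
        if self.size < 1:
            found.append("size must be greater than zero")
        c = self.coordinates
        if c is not None:
            if len(c) != self.size:
                found.append("coordinate array does not match size")
            steps = [b - a for a, b in zip(c, c[1:])]
            # Strictly increasing or strictly decreasing implies distinct
            # values; a size-one array has no steps and always passes.
            if not (all(s > 0 for s in steps) or all(s < 0 for s in steps)):
                found.append("coordinates must be strictly monotonic")
        if self.bounds is not None:
            if len(self.bounds) != self.size or any(len(b) != 2 for b in self.bounds):
                found.append("bounds must have shape (size, 2)")
        return found


# Size-one vertical dimension for 1.5 m height.
height = DimensionConstruct(size=1, coordinates=[1.5], properties={"units": "m"})

# Dimension with no coordinate array, e.g. one that runs over ocean basins.
basin = DimensionConstruct(size=3)

# Latitude with cell bounds.
lat = DimensionConstruct(
    size=3,
    coordinates=[-45.0, 0.0, 45.0],
    bounds=[[-67.5, -22.5], [-22.5, 22.5], [22.5, 67.5]],
)
```

All three examples return an empty list from problems(), whereas a coordinate array such as [0.0, 1.0, 1.0] is reported as not strictly monotonic.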
Dimension constructs can be regarded as the "axes" of the data
space. In CF-netCDF files, a dimension
construct corresponds to a netCDF dimension and its associated
coordinate variable
(in the Unidata sense, with the name of the variable and the name of its
dimension being equal), and
in CF 1.5 only coordinate variables can have an axis attribute.
However, we avoid the word "axis" here because there is an ongoing
discussion about also
allowing CF-netCDF auxiliary coordinate variables to have
the CF axis attribute,
although they are not axes of the data space
(the reason being that they may be regarded as axes of the physical space).
In this CF data model we permit a dimension not to have a coordinate array
if there is no appropriate numeric monotonic coordinate.
That is the case for a dimension that runs over ocean basins or area
types, for example, or for a dimension that indexes timeseries at
scattered points. Such dimensions do not correspond to a continuous
physical quantity.
In a netCDF file, data variables can share coordinate variables.
In the data model, the dimension constructs of one space construct are
logically independent of those of all other space constructs;
modifying the coordinates of one space construct does not affect any
other space construct.
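One simple way to obtain this independence, sketched below, is to give each space construct its own deep copy of any coordinate variable shared in the file. The dictionaries and their keys are hypothetical stand-ins for real constructs, and copying is only one possible strategy; an implementation could equally defer the copy until a modification is made.

```python
import copy

# One coordinate variable as read from a file (made-up values).
file_lat = {"name": "lat", "values": [-45.0, 0.0, 45.0]}

# Two data variables in the file both refer to it; in the model each
# space construct receives its own copy.
temperature = {"dimensions": [copy.deepcopy(file_lat)]}
pressure = {"dimensions": [copy.deepcopy(file_lat)]}

# Reversing the latitude coordinates of one construct leaves the other,
# and the values read from the file, untouched.
temperature["dimensions"][0]["values"].reverse()
```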
Auxiliary coordinate construct
An auxiliary coordinate construct must contain
- A list of some (at least one)
of the dimensions of the space construct in any order.
It may also contain
- A coordinate array with dimension sizes corresponding
to the list of dimensions of the auxiliary coordinate construct.
If there is a dimension with size greater than one,
the elements of the coordinate array must all be of the same data type
(numeric, character or string),
but they do not have to be distinct or monotonic.
Whether they can have missing values is not clear at present.
- A boundary coordinate array with all the dimensions, in the same order,
as the coordinate array, and a fastest-varying dimension (first dimension
in Fortran) equal to the number of vertices of each cell.
CF doesn't exclude character or string coordinates having boundaries,
but is there a need for it?
- Properties serving to describe the coordinates.
Auxiliary coordinate constructs correspond to two kinds of variables
in CF-netCDF files.
- Auxiliary
coordinate variables named by the coordinates attribute
of a data variable.
CF recommends providing auxiliary coordinate constructs of latitude and
longitude if there is two-dimensional horizontal variation but the horizontal
coordinates are not latitude and longitude.
- Variables which depend on one or more dimensions and are
named by a formula_terms attribute of a vertical coordinate variable.
As for dimensions,
auxiliary coordinate constructs of different space constructs
are independent in the data model.
Cell measure construct
A cell measure construct may contain
- A list of some of the dimensions of the space construct in any order.
- Properties to describe itself.
It must contain
- A measure property, which indicates which metric of the grid
it supplies, e.g. cell areas.
- A units property consistent with the measure property, e.g. m2.
- A numeric array
of metric values having the dimensions listed, or a scalar metric
value if no dimensions are given.
If there is a dimension with size greater than one,
the elements of the array must all be of the same data type.
It is assumed that the metric does not depend on any of the dimensions
of the space which are not specified, and the values are implicitly propagated
along these dimensions.
In CF-netCDF files, cell measure constructs correspond to variables
named by the cell_measures attribute of the data variable.
As for dimensions, cell measure constructs of different space constructs
are independent in the data model.
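The implicit propagation of a metric along unlisted dimensions can be sketched as a lookup that ignores every dimension the cell measure does not name. The function measure_value and its arguments are hypothetical, and nested lists stand in for the numeric array.

```python
def measure_value(space_dims, measure_dims, measure_array, index):
    """Return the metric for one element of the space construct.

    space_dims    -- ordered dimension names of the space construct
    measure_dims  -- dimension names the cell measure depends on (any order)
    measure_array -- nested lists indexed in measure_dims order, or a scalar
    index         -- full index into the space, one integer per dimension
    """
    position = dict(zip(space_dims, index))
    value = measure_array
    for name in measure_dims:
        value = value[position[name]]   # dimensions not listed are ignored
    return value


space_dims = ["time", "lat", "lon"]

# Made-up cell areas in m2 that vary with latitude only.
area = [1.0e10, 2.0e10]

# The same area applies at every time and longitude for a given latitude.
a = measure_value(space_dims, ["lat"], area, (5, 1, 3))   # 2.0e10
b = measure_value(space_dims, ["lat"], area, (0, 1, 0))   # 2.0e10

# With no dimensions listed, the measure is a single scalar value.
c = measure_value(space_dims, [], 7.0, (0, 0, 0))         # 7.0
```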
Cell methods property
The cell methods property describes how the data values represent variation of
the quantity within cells. It corresponds to the cell_methods
attribute of the data variable in CF-netCDF files. It is an ordered list,
because the methods specified are not necessarily commutative.
Each entry of the list
specifies one or more dimensions and a method, e.g. mean
(CF Appendix E). Special methods
indicate climatological time processing.
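A small worked example shows why the order of the list matters. In CF the cell methods record processing that has already been applied to the data; the sketch below applies them itself only to demonstrate that swapping two entries changes the result. The helper names, the restriction to two dimensions and one dimension per entry, and the data values are all assumptions made for illustration.

```python
from statistics import mean

# Made-up data with dimensions (time, x): two times, two points.
data = [[1.0, 5.0],
        [4.0, 2.0]]


def reduce_axis(array, axis, func):
    """Collapse one axis of a 2-D nested list with func, giving a 1-D list."""
    if axis == 0:
        return [func(column) for column in zip(*array)]
    return [func(row) for row in array]


def apply_cell_methods(array, dims, cell_methods):
    """Apply an ordered list of (dimension, method) entries to a 2-D array."""
    funcs = {"mean": mean, "maximum": max}
    dims = list(dims)
    for dim, method in cell_methods:
        if len(dims) == 2:
            array = reduce_axis(array, dims.index(dim), funcs[method])
        else:
            array = funcs[method](array)   # last remaining dimension
        dims.remove(dim)
    return array


# Time mean at each point, then the maximum over points: 3.5
mean_then_max = apply_cell_methods(
    data, ("time", "x"), [("time", "mean"), ("x", "maximum")])

# Maximum over points at each time, then the time mean: 4.5
max_then_mean = apply_cell_methods(
    data, ("time", "x"), [("x", "maximum"), ("time", "mean")])
```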
Grid mappings
In this category we cover the functions of the CF attributes
formula_terms, which describes how to compute a vertical coordinate
variable from components (CF Appendix D),
and grid_mapping, which describes how to transform between
longitude-latitude space and the horizontal coordinates of the space construct
(CF Appendix F).
These two functions are rather similar, and it is possible to imagine
generalisations of both of them being needed.
In fact, although formula_terms was introduced for dimensionless
coordinate variables, it already supports the dimensional
atmosphere_hybrid_height_coordinate.
A grid mapping contains
- A mapping property which indicates the nature of the transformation
and implies the formulae to be used. A CF-netCDF file does not explicitly
record the formulae; it relies on the application software knowing them.
The mapping property is the standard_name of a vertical coordinate
variable with formula_terms, and the grid_mapping_name
of a grid_mapping variable.
- An unordered collection of scalar parameters, pointers to
dimension or auxiliary coordinate constructs,
and pointers to other space constructs.
The scalar parameters are scalar data variables (which should
have units if dimensional) named by formula_terms,
and attributes of grid_mapping variables
(in specified units).
Each member of the collection has a particular role in the formulae,
as identified by its keyword in a formula_terms attribute,
or its attribute name in a grid_mapping variable.
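The division of labour described above can be sketched for atmosphere_hybrid_height_coordinate, whose formula in CF Appendix D is z(n,k,j,i) = a(k) + b(k)*orog(n,j,i). The dictionary layout, the FORMULAE registry, and the function name are hypothetical, the time index n is dropped for brevity, and the numbers are made up.

```python
# Formulae known to the application, keyed by the mapping property.
# The file records only the name and the terms, never the formula itself.
def hybrid_height(terms, k, j, i):
    """z(k,j,i) = a(k) + b(k) * orog(j,i), following CF Appendix D."""
    return terms["a"][k] + terms["b"][k] * terms["orog"][j][i]


FORMULAE = {"atmosphere_hybrid_height_coordinate": hybrid_height}

# Hypothetical grid mapping: a mapping property plus an unordered collection
# of members, each identified by its formula_terms keyword.
grid_mapping = {
    "mapping": "atmosphere_hybrid_height_coordinate",
    "terms": {
        "a": [10.0, 100.0],        # one value per level, in m
        "b": [1.0, 0.5],           # dimensionless
        "orog": [[0.0, 200.0]],    # surface altitude in m, dimensions (j, i)
    },
}

# The application looks up the formula by name and supplies the terms.
formula = FORMULAE[grid_mapping["mapping"]]
z = formula(grid_mapping["terms"], 1, 0, 1)   # 100.0 + 0.5 * 200.0 = 200.0
```

Software that meets a mapping property it has no formula for can still carry the grid mapping around as metadata; it simply cannot evaluate it.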
Other properties
The other properties recognised by this CF data model correspond to attributes
listed in CF Appendix A.
For space constructs, the allowed properties are
comment,
history,
institution,
long_name,
references,
source,
standard_error_multiplier,
standard_name,
title,
units.
Some of these can be global attributes in a CF-netCDF file.
In this data model, it is assumed that any relevant global attribute
is also an
attribute of every data variable, although it is superseded if the data
variable has its own attribute.
Each space construct in the model has its own independent set of properties.
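The treatment of global attributes can be sketched as a dictionary merge in which the data variable's own attribute wins. The function name and the attribute values are illustrative assumptions, and restricting the result to the properties listed above is a simplification.

```python
# Properties allowed for space constructs, as listed above.
ALLOWED = {
    "comment", "history", "institution", "long_name", "references",
    "source", "standard_error_multiplier", "standard_name", "title", "units",
}


def space_properties(global_attrs, variable_attrs, allowed=ALLOWED):
    """Merge global and per-variable attributes into one property set."""
    merged = {k: v for k, v in global_attrs.items() if k in allowed}
    # The variable's own attribute supersedes a global one of the same name.
    merged.update({k: v for k, v in variable_attrs.items() if k in allowed})
    return merged


# Made-up attributes as they might appear in a CF-netCDF file.
global_attrs = {
    "institution": "Example Centre",
    "source": "model A",
    "Conventions": "CF-1.5",
}
tas_attrs = {
    "standard_name": "air_temperature",
    "units": "K",
    "source": "model A, bias corrected",
}

tas_props = space_properties(global_attrs, tas_attrs)
# tas_props["institution"] is inherited from the global attributes, while
# tas_props["source"] is the variable's own value.
```

Each call returns a new dictionary, so the properties of one space construct can be changed without touching those of another.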
For dimensions and auxiliary coordinate constructs, the allowed properties are
axis,
calendar,
leap_month,
leap_year,
long_name,
month_lengths,
positive,
standard_name,
units.
For space, dimension and auxiliary coordinate constructs,
other properties not defined
by CF could be included, since CF permits any attributes to be included
which do not conflict with the convention.
The attributes
valid_max,
valid_min and
valid_range
of data variables and coordinate variables are checks on the validity of
the values, which could be verified on input and written on output.
In this CF data model we assume they do not constrain any manipulations
which might be done on the data in memory.
The attributes
_FillValue and
missing_value
of data variables specify how missing data is indicated in the data array.
The CF data model supports the idea of missing data, but does not depend on
any particular method of indicating it.
The attributes
add_offset,
compress,
flag_masks,
flag_meanings,
flag_values and
scale_factor
are all used in methods of compressing the data to save space,
with or without loss of information.
They are not part of this data model because these operations do not
logically alter the data.
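For the packing attributes, the point can be illustrated by unpacking on input, so that the data model only ever sees the logical values. The netCDF convention is to multiply by scale_factor and then add add_offset; the function name and the numbers below are made up for illustration.

```python
def unpack(packed, attrs):
    """Return logical data values from packed values and their attributes."""
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    # Scale first, then add the offset.
    return [p * scale + offset for p in packed]


# Packed integers as stored on disk, with their packing attributes.
packed = [0, 100, 250]
attrs = {"scale_factor": 0.5, "add_offset": 273.0}

values = unpack(packed, attrs)   # [273.0, 323.0, 398.0]
```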
The "feature type" attribute and associated new conventions,
currently under discussion, will provide a way of packing multiple data
spaces of the same kind of discrete sampling geometry
(timeseries, trajectories, etc.) into a single CF-netCDF data variable,
in order to save space, since a multidimensional representation with
common coordinate variables is typically very wasteful in such cases.
This is a kind of compression. The data model would regard each instance
of the feature type as an independent space construct.
However, the "feature type" attribute itself is also a metadata property
that would be a property of the space construct and part of the data model.
The attributes
bounds,
cell_measures,
cell_methods,
climatology,
Conventions,
coordinates,
formula_terms and
grid_mapping
have various special or structural functions in the CF-netCDF file format.
Their functions and
the relationships they indicate are reflected in this data model.
Interdependence of space constructs
This data model makes a central assumption that each space construct is
independent. It assumes that software will be able to alter any space construct in
memory without affecting other space constructs. This assumption of independence is
violated in CF by the formula_terms and ancillary_variables
attributes, which explicitly make links between data variables. This means the
data model needs to support such links, but it is proposed that they should be
fragile. If an operation alters one space construct in a way which could invalidate
a relationship with another space construct, the link should be broken. The user
of software will have to be aware of these relationships and remake them if
applicable and useful.
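A fragile link might behave as in the following sketch, where any operation that changes a space construct drops its links rather than trying to keep them valid. The class Space, its methods, and the role name "ancillary" are hypothetical; a real implementation could be more selective about which operations break which links.

```python
class Space:
    """Hypothetical space construct that holds fragile links to others."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = tuple(shape)
        self.links = {}            # role -> linked Space

    def link(self, role, other):
        """Record a relationship with another space construct."""
        self.links[role] = other

    def subset(self, new_shape):
        """Alter this construct in place and break links that may be invalid.

        The linked constructs keep their old shape, so the relationships
        may no longer hold. Returns the roles whose links were broken so
        that the user can remake them if applicable.
        """
        self.shape = tuple(new_shape)
        broken = list(self.links)
        self.links.clear()
        return broken


tas = Space("air_temperature", (12, 90, 180))
tas_error = Space("air_temperature standard_error", (12, 90, 180))
tas.link("ancillary", tas_error)

# Keeping one month of tas breaks the link; tas_error itself is unchanged.
broken = tas.subset((1, 90, 180))   # ["ancillary"]
```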
10th January 2011
Jonathan Gregory