Draft CF data model

This document outlines an abstract model for data and metadata corresponding to the CF metadata standard (version 1.5). CF is primarily a convention for storing data in netCDF, and does not explicitly present a data model, but its design implies one to some extent. This document tries to avoid prescribing more than is needed for interpreting CF as it stands, in order to avoid inconsistency with future developments of CF. Some parts of the CF standard arise from the requirements or restrictions of the netCDF file format, or are concerned with efficient ways of storing data on disk; these parts are not logically part of the data model.

Data space

The central concept of the data model is a space construct. (We use the word "construct" because it is a language-neutral term, unlike "object" or "structure".) In a dataset contained in a single netCDF file, each data variable usually corresponds to a space construct, but a space construct might be a combination of several data variables. In a dataset comprising several netCDF files, a space construct may span data variables in more than one file. Rules for aggregating data variables from one or several files into a single space construct are needed but have yet to be defined in CF; at the moment these rules are regarded as the concern of data processing software.

Each space construct may have

All the components of the space construct are optional. The data array would be missing if the space construct serves only to define a coordinate system for the space.

Dimension construct

A dimension construct must contain It may also contain In the data model, these latter components are all optional.

Dimension constructs can be regarded as the "axes" of the data space. In CF-netCDF files, a dimension construct corresponds to a netCDF dimension and its associated coordinate variable (in the Unidata sense, with the name of the variable and the name of its dimension being equal), and in CF 1.5 only coordinate variables can have an axis attribute. However, we avoid the word "axis" here because there is an ongoing discussion about also allowing CF-netCDF auxiliary coordinate variables to have the CF axis attribute, although they are not axes of the data space (the reason being that they may be regarded as axes of the physical space).

In this CF data model we permit a dimension not to have a coordinate array if there is no appropriate numeric monotonic coordinate. That is the case for a dimension that runs over ocean basins or area types, for example, or for a dimension that indexes timeseries at scattered points. Such dimensions do not correspond to a continuous physical quantity.

In a netCDF file, data variables can share coordinate variables. In the data model, the dimension constructs of one space construct are logically independent of those of all other space constructs; if the coordinates of one space construct are modified, it does not affect any other space construct.
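The independence of dimension constructs can be illustrated with a minimal Python sketch. The class names DimensionConstruct and SpaceConstruct and their fields (including size) are hypothetical, chosen only for illustration; they are not defined by CF. The sketch assumes that a coordinate variable shared in the file is copied when each space construct is built.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DimensionConstruct:
    """Hypothetical dimension construct (illustrative, not defined by CF)."""
    size: int
    # None when there is no appropriate numeric monotonic coordinate.
    coordinates: Optional[List[float]] = None


@dataclass
class SpaceConstruct:
    """Hypothetical space construct that owns its dimension constructs."""
    dimensions: Dict[str, DimensionConstruct] = field(default_factory=dict)


# One coordinate variable, shared by two data variables in the file.
shared_lat = DimensionConstruct(size=3, coordinates=[-45.0, 0.0, 45.0])

# Each space construct receives its own deep copy, so no state is shared.
temperature = SpaceConstruct({"lat": copy.deepcopy(shared_lat)})
pressure = SpaceConstruct({"lat": copy.deepcopy(shared_lat)})

# Modifying the coordinates of one space construct leaves the other unchanged.
temperature.dimensions["lat"].coordinates[0] = -50.0

# A dimension running over area types has a size but no coordinate array.
area_type = DimensionConstruct(size=4)
```

Copying is only one way to obtain independence; an implementation could equally use copy-on-write, provided that a change to one space construct is never visible through another.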

Auxiliary coordinate construct

An auxiliary coordinate construct must contain and may also contain Auxiliary coordinate constructs correspond to two kinds of variables in CF-netCDF files.
  1. Auxiliary coordinate variables named by the coordinates attribute of a data variable. CF recommends providing auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude.
  2. Variables which depend on one or more dimensions and are named by a formula_terms attribute of a vertical coordinate variable.
As for dimensions, auxiliary coordinate constructs of different space constructs are independent in the data model.

Cell measure construct

A cell measure construct may contain and must contain In CF-netCDF files, cell measure constructs correspond to variables named by the cell_measures attribute of the data variable. As for dimensions, cell measure constructs of different space constructs are independent in the data model.

Cell methods property

The cell methods property describes how the data values represent variation of the quantity within cells. It corresponds to the cell_methods attribute of the data variable in CF-netCDF files. It is an ordered list, because the methods specified are not necessarily commutative. Each entry of the list specifies one or more dimensions and a method, e.g. mean (CF Appendix E). Special methods indicate climatological time processing.
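The following Python sketch illustrates why the list must be ordered. The function parse_cell_methods is a hypothetical helper, not part of CF or of any library, and it handles only the basic "dim: method" form of the cell_methods string, ignoring extensions such as intervals, comments and climatological keywords. The small data array is invented for the example.

```python
from statistics import mean
from typing import List, Tuple


def parse_cell_methods(text: str) -> List[Tuple[List[str], str]]:
    """Parse a simplified cell_methods string into an ordered list of
    (dimension names, method) entries. Hypothetical helper for illustration."""
    entries: List[Tuple[List[str], str]] = []
    dims: List[str] = []
    for token in text.split():
        if token.endswith(":"):
            dims.append(token[:-1])  # a dimension name
        else:
            if not dims:
                raise ValueError(f"method {token!r} has no dimension")
            entries.append((dims, token))  # the method closes the entry
            dims = []
    if dims:
        raise ValueError("dimension names without a method")
    return entries


# Rows are time steps, columns are points along x (made-up values).
data = [[1.0, 5.0],
        [3.0, 3.0]]

# "time: maximum x: mean": maximum over time first, then mean over x.
max_then_mean = mean(max(column) for column in zip(*data))

# "x: mean time: maximum": mean over x first, then maximum over time.
mean_then_max = max(mean(row) for row in data)

print(parse_cell_methods("time: maximum x: mean"))
print(max_then_mean, mean_then_max)  # the two orders give different results
```

The two orders of operation yield 4.0 and 3.0 respectively for this array, so the order of the entries carries information that an unordered collection would lose.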

Grid mappings

In this category we cover the functions of the CF attributes formula_terms, which describes how to compute a vertical coordinate variable from components (CF Appendix D), and grid_mapping, which describes how to transform between longitude-latitude space and the horizontal coordinates of the space construct (CF Appendix F). These two functions are rather similar, and it is possible to imagine generalisations of both of them being needed. In fact, although formula_terms was introduced for dimensionless coordinate variables, it already supports the dimensional atmosphere_hybrid_height_coordinate. A grid mapping contains

Other properties

The other properties recognised by this CF data model correspond to attributes listed in CF Appendix A. For space constructs, the allowed properties are comment, history, institution, long_name, references, source, standard_error_multiplier, standard_name, title, units. Some of these can be global attributes in a CF-netCDF file. In this data model, it is assumed that any relevant global attribute is also an attribute of every data variable, although it is superseded if the data variable has its own attribute. Each space construct in the model has its own independent set of properties. For dimensions and auxiliary coordinate constructs, the allowed properties are axis, calendar, leap_month, leap_year, long_name, month_lengths, positive, standard_name, units. For space, dimension and auxiliary coordinate constructs, other properties not defined by CF could be included, since CF permits any attributes to be included which do not conflict with the convention.
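The treatment of global attributes described above can be sketched in Python as follows. The function resolve_properties and the attribute values are hypothetical and serve only to illustrate the rule; the sketch assumes that every global attribute passed in is relevant to the data variable.

```python
from typing import Dict


def resolve_properties(global_attrs: Dict[str, str],
                       variable_attrs: Dict[str, str]) -> Dict[str, str]:
    """Return the properties of one space construct (hypothetical helper).

    Global attributes act as defaults and are superseded by attributes
    of the data variable itself. A new dictionary is returned, so each
    space construct has its own independent set of properties.
    """
    properties = dict(global_attrs)    # start from the file-level defaults
    properties.update(variable_attrs)  # variable attributes take precedence
    return properties


global_attrs = {"institution": "Example Centre", "history": "created"}
variable_attrs = {"standard_name": "air_temperature", "units": "K",
                  "history": "regridded"}

properties = resolve_properties(global_attrs, variable_attrs)
print(properties["institution"])  # inherited from the global attribute
print(properties["history"])      # superseded by the variable's own attribute
```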

The attributes valid_max, valid_min and valid_range of data variables and coordinate variables are checks on the validity of the values, which could be verified on input and written on output. In this CF data model we assume they do not constrain any manipulations which might be done on the data in memory.

The attributes _FillValue and missing_value of data variables specify how missing data is indicated in the data array. The CF data model supports the idea of missing data, but does not depend on any particular method of indicating it.
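As a minimal sketch of this separation, the Python functions below (hypothetical names, not part of any library) convert between a file representation that uses a fill value and an in-memory representation that uses None. The choice of None is an assumption made for the example; a mask array would serve equally well.

```python
from typing import List, Optional


def mask_missing(values: List[float], fill_value: float) -> List[Optional[float]]:
    """On input, replace the file's fill value by None (hypothetical helper)."""
    return [None if v == fill_value else v for v in values]


def fill_missing(values: List[Optional[float]], fill_value: float) -> List[float]:
    """On output, replace None by whichever fill value the file will declare."""
    return [fill_value if v is None else v for v in values]


on_disk = [1.0, -999.0, 3.0]              # file written with _FillValue = -999.0
in_memory = mask_missing(on_disk, -999.0)  # [1.0, None, 3.0]

# The data can be written back with a different fill value without changing
# which elements are missing.
rewritten = fill_missing(in_memory, 1.0e20)
```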

The attributes add_offset, compress, flag_masks, flag_meanings, flag_values and scale_factor are all used in methods of compressing the data to save space, with or without loss of information. They are not part of this data model because these operations do not logically alter the data. The "feature type" attribute and associated new conventions, currently under discussion, will provide a way of packing multiple data spaces of the same kind of discrete sampling geometry (timeseries, trajectories, etc.) into a single CF-netCDF data variable, in order to save space, since a multidimensional representation with common coordinate variables is typically very wasteful in such cases. This is a kind of compression. The data model would regard each instance of the feature type as an independent space construct. However, the "feature type" attribute itself is also a metadata property that would be a property of the space construct and part of the data model.
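For the packing attributes mentioned above, the following Python sketch shows why they can be left out of the data model: scale_factor and add_offset only change how values are stored, and the unpacked values are what the data model sees. The function names and the numbers are illustrative; the relation unpacked = packed * scale_factor + add_offset is the usual netCDF packing convention.

```python
from typing import List


def unpack(packed: List[int], scale_factor: float, add_offset: float) -> List[float]:
    """Recover data values from packed integers (illustrative helper)."""
    return [p * scale_factor + add_offset for p in packed]


def pack(values: List[float], scale_factor: float, add_offset: float) -> List[int]:
    """Pack data values into integers; rounding may lose information."""
    return [round((v - add_offset) / scale_factor) for v in values]


stored = [0, 2, 5]                                        # integers on disk
values = unpack(stored, scale_factor=0.5, add_offset=10.0)  # values in the data model
repacked = pack(values, scale_factor=0.5, add_offset=10.0)
```

In this example the round trip is exact because the values are multiples of the scale factor; in general packing is lossy, which is a property of the storage and not of the space construct.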

The attributes bounds, cell_measures, cell_methods, climatology, Conventions, coordinates, formula_terms and grid_mapping have various special or structural functions in the CF-netCDF file format. Their functions and the relationships they indicate are reflected in this data model.

Interdependence of space constructs

This data model makes a central assumption that each space construct is independent. It assumes that software will be able to alter any space construct in memory without affecting other space constructs. This assumption of independence is violated in CF by the formula_terms and ancillary_variables attributes, which explicitly make links between data variables. This means the data model needs to support such links, but it is proposed that they should be fragile. If an operation alters one space construct in a way which could invalidate a relationship with another space construct, the link should be broken. The user of software will have to be aware of these relationships and remake them if applicable and useful.
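One possible reading of fragile links is sketched below in Python. The class, its methods and the choice of subsetting as the invalidating operation are all assumptions made for illustration; the data model does not prescribe which operations break a link or how links are stored.

```python
from typing import Dict, List


class SpaceConstruct:
    """Hypothetical space construct with fragile links to other constructs."""

    def __init__(self, name: str, data: List[float]) -> None:
        self.name = name
        self.data = list(data)
        self.links: Dict[str, "SpaceConstruct"] = {}   # role -> linked construct
        self._referrers: List["SpaceConstruct"] = []   # constructs linking to this one

    def link(self, role: str, other: "SpaceConstruct") -> None:
        """Record a link, e.g. role 'ancillary_variables'."""
        self.links[role] = other
        other._referrers.append(self)

    def _break_links(self) -> None:
        """Remove every link to or from this construct."""
        for other in self.links.values():
            if self in other._referrers:
                other._referrers.remove(self)
        self.links.clear()
        for referrer in self._referrers:
            stale = [r for r, target in referrer.links.items() if target is self]
            for role in stale:
                del referrer.links[role]
        self._referrers.clear()

    def subset(self, start: int, stop: int) -> None:
        """Alter the data in a way that could invalidate a relationship."""
        self.data = self.data[start:stop]
        self._break_links()


temperature = SpaceConstruct("air_temperature", [280.0, 281.0, 282.0, 283.0])
error = SpaceConstruct("air_temperature standard_error", [0.1, 0.2, 0.1, 0.3])
temperature.link("ancillary_variables", error)

# Subsetting the ancillary construct changes its shape, so the link is broken;
# the user must remake it if it is still applicable.
error.subset(0, 2)
```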

10th January 2011

Jonathan Gregory