User Interface for Binary Files
===============================

In terms of the scipy environment, qnd addresses the storage and
retrieval of numpy ndarrays, excepting arrays with the general python
object data type (dtype.kind 'O').  In scipy programs, these arrays
may be attributes of objects surrounded by methods and other non-data,
but we assume that the programmer provides some means of initializing
all such objects given just the (numerical) ndarrays at their heart.

The python language provides two kinds of collections which qnd
directly supports: the dict (a collection of named objects) and the
list (a sequence of heterogeneous objects).  We presume that a
programmer will provide a way to map a nested tree (no loops) of dicts
and lists with ndarray leaves (and dict keys which are strings) to and
from whatever objects their program requires.  The dict and list are
precisely the collections provided by the simple and popular JSON data
interchange format.  Thus, in order to use the qnd storage interface,
we are essentially asking the programmer to support a portable
organization of the program data.

Note the contrast to the goal of the python pickle module; we
acknowledge that some extra design and maintenance work may be
required to support such a mapping.  Often the additional effort pays
off in a simplified overall design.
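
As a minimal sketch of such a mapping, consider a hypothetical ``Mesh``
class (not part of qnd) which can reduce itself to a tree of dicts and
lists and rebuild itself from one.  Plain lists of numbers stand in for
the ndarray leaves so that the sketch needs only the standard library;
the JSON round trip merely illustrates that the tree is portable::

  import json


  class Mesh:
      """Hypothetical program object whose state is purely numerical."""

      def __init__(self, x, y, zones):
          self.x, self.y, self.zones = x, y, zones

      def to_tree(self):
          # Only str keys, dicts, lists, and numeric leaves appear here.
          return {"coords": {"x": self.x, "y": self.y}, "zones": self.zones}

      @classmethod
      def from_tree(cls, tree):
          coords = tree["coords"]
          return cls(coords["x"], coords["y"], tree["zones"])

      def extent(self):  # a method: non-data, never stored
          return (max(self.x) - min(self.x)) * (max(self.y) - min(self.y))


  mesh = Mesh([0.0, 1.0, 2.0], [0.0, 0.5], [[0, 1], [1, 2]])
  tree = mesh.to_tree()
  # The tree survives a round trip through a portable text format.
  restored = Mesh.from_tree(json.loads(json.dumps(tree)))

The methods travel with the class, not with the file: only the tree
returned by ``to_tree`` would need to be stored.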

Basic Usage
-----------

The first step is to obtain a file handle, say `f`, by opening a file.
Each backend provides its own open function, named ``openXXX`` and
defined in a backend module ``XXXf``.  For example::

  from qnd.h5f import openh5
  f = openh5('filename.h5', 'r+')

The qnd mode choices are the same for all backends (copied from the
excellent h5py module mode semantics):

  * 'r' opens the file read-only
  * 'w' creates a new file, clobbering any existing file
  * 'a' opens the file read-write, creating it if it did not exist
  * 'r+' opens the file read-write, raising an error if it did not exist
  * 'w-' creates a new file, raising an error if it exists beforehand

These mode flags are not semantically identical to the python ``open``
function: The 'w-' is not recognized by ``open`` at all, and the
``open`` 'a' guarantees that any existing file bytes will not be
modified, while the qnd 'a' merely means read-write (but has the same
semantics as the ``open`` 'a' in terms of file existence and creation).
Furthermore, qnd files are always readable, even if opened in one of
the 'w' modes.
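
The existence rules in the mode list can be summarized as a small
decision function.  This is only a sketch of the semantics described
above: the helper ``resolve_mode`` is hypothetical and not part of qnd,
and both the choice of ``OSError`` and the error for a missing file in
'r' mode are assumptions::

  def resolve_mode(mode, exists):
      """Return (action, writable) for a qnd-style mode flag.

      action is 'open' or 'create'.  The exception types are assumed.
      """
      if mode == "r":
          if not exists:  # assumption: a missing file is an error
              raise OSError("cannot open a missing file read-only")
          return "open", False
      if mode == "r+":
          if not exists:
              raise OSError("'r+' requires an existing file")
          return "open", True
      if mode == "a":
          return ("open" if exists else "create"), True
      if mode == "w":
          return "create", True  # clobbers any existing file
      if mode == "w-":
          if exists:
              raise OSError("'w-' refuses to clobber an existing file")
          return "create", True
      raise ValueError("unknown mode: %r" % (mode,))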

Python syntax has two operators for extracting named members from a
compound object: The dot operator extracts an `attribute` from an
object, and square brackets extract an `item` from a dict object (or
other mapping).  Qnd file handles support both.  The dot syntax
``f.var`` is best when you know the name 'var' at the time you write
the expression, while the square bracket syntax ``f[expression]`` is
best when the name is the result of an expression or value of a
variable.

The dot operator is overloaded, since it is also used for method
attributes like ``f.close()``.  Thus the qnd file handle `f` really
behaves like a dict for the most part, with its support for the dot
syntax mere sugar to improve code legibility and, at least as
importantly, ease of typing in interactive usage.  If there were a
variable named 'close' in `f` (who knows where `f` came from), you
could always access it as ``f['close']``.  However, qnd provides a
quick and dirty option for using the dot operator even in these cases:
it will remove a single trailing underscore, so that ``f.close_``
refers to the variable ``'close'``, not ``'close_'``.  (``f.close__``
would refer to ``'close_'``.)  This idiom is suggested by the PEP8
python style guide, and you would also need it to escape python
keywords, like ``f.yield_`` to refer to ``f['yield']``.
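
The trailing underscore rule amounts to a one-line name translation.
The function ``member_name`` below is a hypothetical illustration of
that rule, not the qnd implementation::

  def member_name(attr):
      """Map a dot-syntax attribute name to the stored variable name."""
      # Remove exactly one trailing underscore, if there is one.
      return attr[:-1] if attr.endswith("_") else attr


  names = [member_name(a) for a in ("x", "close_", "close__", "yield_")]
  # names is now ['x', 'close', 'close_', 'yield']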

The bottom line is, you use a qnd file handle `f` as if it were a
python dict, but you are also free to treat items in `f` as if they
were attributes of this dict object::

  x = f.x           # read variable "x" from f, same as f['x']
  f.x = expression  # declare and write "x" to f
  f.update(x=expr1, y=expr2, ...)  # declare and write several variables
  # update also accepts non-keyword dicts and lists of (name, value)
  x = f.get('x', xdefault)  # same as get from dict
  varnames = list(f)        # preferred over f.keys(), as for any dict
  nvars = len(f)
  if name in f: do_something
  for name in f: do_something
  for name, value in f.items(): do_something

In addition to the dict-like `update`, `get`, `keys`, and `items` methods,
qnd files also have a number of non-dict methods and behaviors::

  f.close()
  f.flush()  # like close then reopen
  with openh5('myfile.h5', 'a') as f:
      write_something(f)  # closing f upon exit from with suite
  f.auto(0)        # turn off (or on) auto-read mode
  f.recording(1)   # turn on (or off) recording mode
  f.goto(time=t)   # set to previously recorded record
  with f.push():
      do_something(f)  # temporarily change auto, recording, goto state

The `recording` and `goto` modes are the subject of the next section;
we conclude this section by discussing `auto` mode.  You may have
noticed that ``f.x`` or ``f['x']`` immediately read the variable from
the file, giving you no opportunity to query its data type or shape,
which you might well want to do without incurring the overhead of the
actual read, especially if you know it is a very large array.  We can
get the names of all stored variables with ``list(f)``, but how do we
find out what each one looks like without reading it?

The answer is that a qnd file `f` can be placed into a mode in which
variable references do not trigger an automatic read operation, by
invoking ``f.auto(0)``.  You can also request this mode using the
``auto=0`` keyword when you open the file.  (The default is
``auto=1``.)  With autoread mode off, getting an item returns a qnd
leaf object, which is like a mini-file handle you can use to query,
read, or write only that specific variable.  It has properties similar
to an ndarray::

  f.auto(0)
  xhandle = f.x  # or f['x']
  dtype, shape = xhandle.dtype, xhandle.shape  # also size and ndim
  xhandle = f(0, 'x')  # return handle to x independent of auto mode
  x = xhandle[:]  # read x if x is not scalar
  x = xhandle[()]  # read x no matter what
  xhandle[()] = expression  # write x no matter what
  x = xhandle()  # shorthand for xhandle[()]
  xhandle(expression)  # shorthand for xhandle[()] = expression
  xpart = xhandle[index_expressions]  # read part of x
  xhandle[index_expressions] = xpart  # write part of x

Notice that `xhandle` inherits the obscure indexing behavior of
ndarray scalars, for which ``x[:]`` raises an error.  However,
`xhandle` provides a non-ndarray operation to compensate -- calling a
qnd handle as a function always reads the whole thing, whether or not
it has any dimensions.

Although the qnd leaf handles can be used for partial read and write
operations, if that is all you want to do, you can simply combine the
partial index expressions into a single square bracket::

  xpart = f['x', index_expressions]
  f['x', index_expressions] = xpart

These work no matter how the autoread mode is set, but there is no
equivalent using the dot syntax: Although ``f.x[index_expressions]``
produces the same final result, it reads all of `x` before applying
`index_expressions` to the resulting large ndarray.

(Note that qnd only reads or writes the largest contiguous block of
leading indices specified by `index_expressions`; it only reduces the
intermediate memory footprint when the leading indices are scalar or
small slices of `x`.)

Finally, sometimes you need to declare a variable without writing it.
To do this in qnd, make its value a dtype or a (dtype, shape) tuple::

  f.x = float  # declare x to be a scalar dtype(float), that is f8
  f.y = yy.dtype, yy.shape  # declare y with type and shape of yy
  f.z = bool, yy.shape  # declare z to be boolean with same shape as yy

Such a declaration reserves space for the array in the file, but it is
your responsibility to fill it with sensible values with one later
write or several partial writes.

Recording History
-----------------

Setting an item with ``f.x = value`` or ``f['x'] = value`` both
declares the variable and writes its value.  If you later write it a
second time with ``f.x = value2``, by default this overwrites the
original value you wrote.  Sometimes, however, you need to record the
history of a variable which is changing as a simulation progresses.
The idea behind recording mode is to make the second assignment store
the new `value2` in addition to the original `value`, so by repeatedly
assigning values to `x` you can store as many versions of its changing
values as you like.

The HDF5, netCDF, and PDB file formats all support this capability by
allowing the leading dimension of a variable to be "unlimited".  But
in qnd, you can suppress this fictitious leading dimension by using
the `recording` mode to write such variables, and the `goto` mode to
read them::

  f = openh5('myfile.h5', 'w')
  f.x = xa  # x is not a record variable.
  f.recording(1)  # Put f in recording mode; new variables are recorded.
  f.time = t0  # Time is a record variable with t0 for its first record.
  f.y = y0  # y is a record variable with y0 for its first value.
  f.x = xb  # x remains a non-record variable, xb overwrites xa
  f.time = t1  # Write a second record of time with value t1.
  f.y = y1  # Write a second record of y with value y1.
  f.close()

  f = openh5('myfile.h5', 'r')
  # Initially, goto mode is off (None), and reading a record variable...
  times = f.time[:]  # ...returns a list (not array) of all of its records.
  # Use goto to set a "current record" index for all record variables:
  f.goto(0)  # first record
  t0 = f.time
  y0 = f.y
  xb = f.x  # non-record variables ignore current record
  with f.push():  # current record restored on exit from with suite
      f.goto(-1)  # go to last record, record<0 acts like any other index
      yN = f.y
  # You may use any scalar record variable as a keyword to jump to the
  # record nearest the specified value of that variable (assuming it is
  # monotonic):
  f.goto(time=1.2)  # set to record where f.time nearest 1.2
  y12 = f.y
  for record in f.gotoit():  # iterate over all records
      # gotoit() causes implicit f.goto(record) before each pass
      do_something(f)
  f.goto(None)  # Turn off goto mode.
  ylist = f.y  # list of y arrays at every record

The qnd interface, unlike the existing backend file formats, also supports
the case of record variables whose shape changes from one record to the
next.  To use this feature, set the recording mode to 2 instead of to 1::

  f.recording(2)
  f.x = zeros((nx, ny))  # First x record has shape (nx, ny).
  f.x = zeros((nx+5, ny-2))  # Second x record has shape (nx+5, ny-2).
  f.goto(None)
  xlist = f.x  # list of x arrays at every record

This possibility explains why ``f.recordvar`` returns a list of values at
every record, rather than an array with an extra leading dimension (as in
the fiction employed for the existing file formats).
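
The interplay of the recording and goto modes can be modeled in a few
lines of plain python.  The ``ToyRecorder`` class below is a
hypothetical in-memory model written for this illustration; it is not
how qnd stores records, and it supports only item syntax.  Because each
record variable is kept as a list, records of differing shape pose no
difficulty::

  class ToyRecorder:
      """In-memory toy model of recording and goto modes (not qnd code)."""

      def __init__(self):
          self._plain = {}    # non-record variables: name -> value
          self._records = {}  # record variables: name -> list of values
          self._recording = False
          self._current = None  # current record index, None if goto is off

      def recording(self, flag):
          self._recording = bool(flag)

      def goto(self, record=None, **nearest):
          if nearest:
              # Jump to the record where the keyword variable is closest.
              (name, target), = nearest.items()
              values = self._records[name]
              record = min(range(len(values)),
                           key=lambda i: abs(values[i] - target))
          self._current = record

      def __setitem__(self, name, value):
          if name in self._records:
              self._records[name].append(value)  # write the next record
          elif name in self._plain or not self._recording:
              self._plain[name] = value  # overwrite, or declare non-record
          else:
              self._records[name] = [value]  # first record of a new variable

      def __getitem__(self, name):
          if name in self._plain:
              return self._plain[name]  # ignores the current record
          values = self._records[name]
          if self._current is None:
              return list(values)  # goto mode off: every record
          return values[self._current]


  f = ToyRecorder()
  f["x"] = "xa"                    # declared before recording: non-record
  f.recording(1)
  f["time"], f["y"] = 0.0, "y0"    # first record
  f["x"] = "xb"                    # still overwrites
  f["time"], f["y"] = 1.5, "y1"    # second record
  f.goto(time=1.2)                 # nearest time is 1.5, the second record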

Groups and lists of variables
-----------------------------

The qnd file handle class is `QGroup`; specifically it is the "root
group" of the file.  But a QGroup may contain subgroups, just as a
python dict may contain other dicts.  To define a subgroup, simply
assign a dict instead of an array-like value to an item::

  f.g = {}  # declare an empty subgroup g
  f.g.update(x=expr1, y=expr2)  # all the methods of f work with g
  g = f.g  # g is a QGroup, a subgroup of f
  y = g.y  # or g['y']
  g.auto(0)  # initially g inherits autoread and other modes from f
  root = g.root()  # returns root QGroup, root is f here
  if f is f.root(): task_if_f_is_root_group()
  f['g/x']  # same as f.g.x
  f['/g/x']  # same as f.root().g.x

Although a subgroup initially inherits its autoread, recording, and goto
modes from its parent, thereafter the modes of `g` are independent of the
modes of `f`.  In a `gotoit` loop, you must use the record number from
the iterator to keep subgroups explicitly synchronized::

  g = f.g
  for record in f.gotoit():
      g.goto(record)
      do_something(f, g)

Because a `QGroup` looks like a dict, ``dict(f)``
will read every variable in `f`.  By analogy with the qnd leaf
handles, ``f()`` also reads every item in `f` into a dict, with one
twist: Instead of an ordinary dict, ``f()`` results in a dict subclass
called an `ADict`, which permits access to the dict items as
attributes according to the same rules as for a `QGroup`.  If you want
to convert your own `dict` objects into `ADict` objects, you can use
the `redict` function in the ``qnd.adict`` module.  That module also
contains a generic mix-in class `ItemsAreAttrs` which you can use as a
base class for your own mapping classes.  (Although be sure you read
the comment in the `__getattr__` method before you attempt this, as it
can make your code difficult to debug.)
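
To see what items-as-attributes involves, here is a minimal sketch of a
dict subclass with that behavior.  The class ``AttrDict`` is
hypothetical and is not the `ADict` implementation; it only
demonstrates the general python mechanism, in which `__getattr__` is
consulted only after ordinary attribute lookup fails::

  class AttrDict(dict):
      """Hypothetical dict subclass exposing its items as attributes."""

      def __getattr__(self, name):
          # Reached only when normal lookup fails, so real dict methods
          # such as "keys" or "update" always win over items.
          key = name[:-1] if name.endswith("_") else name
          try:
              return self[key]
          except KeyError:
              raise AttributeError(name) from None

      def __setattr__(self, name, value):
          key = name[:-1] if name.endswith("_") else name
          self[key] = value


  d = AttrDict(x=1, keys="shadowed")
  d.y = 2              # same as d['y'] = 2
  hidden = d.keys_     # the item 'keys'; d.keys is still the dict method

A misspelled attribute name surfaces as an ``AttributeError`` from
inside `__getattr__`, which is one reason such classes can be awkward
to debug.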

Note that ``f()`` respects the autoread and goto modes.  Thus if
``auto=0``, nothing will be read from the file and the returned
dict will contain qnd leaf handles (`QLeaf` objects) rather than
variable values.  When ``auto=1``, the dict item corresponding to any
subgroup will be a `QGroup` object.  If you want to recursively read
all subgroups, set ``auto=2``, which causes subgroups to be read
automatically.  (Note that since ``g = f.g`` produces an `ADict` in
that case rather than a `QGroup`, ``auto=2`` can never be inherited.)

In addition to `QGroup` (a dict with str keys) and `QLeaf` (an
ndarray), the qnd interface provides a third item type, `QList`, which
stores a python heterogeneous list.  A `QList` is a way to store a
sequence of objects anonymously, so that you can reference them simply
by a sequence number instead of by a name.  If you find yourself
inventing sequences of names like 'var00', 'var01', 'var02', and so on,
to store in a `QGroup`, you want to use a `QList` instead::

  f.var = list  # (the builtin list type) declares empty list var
  var = f.var  # the QList object, assuming f.goto(None)
  var.append(value0)  # QList has list-like append and extend methods
  var.append(value1)
  var.extend([value0, value1, ...])
  value1 = var[1]  # second item of var; negative indices and slices work
  var[1] = newvalue1  # overwrite value1
  nitems = len(var)
  var.auto(0)  # QList initially inherits its parent's autoread mode

Although `QList` has an autoread mode like a `QGroup`, it does not have
either a recording mode or a goto mode.  In fact, a record variable is
implemented as a `QList`, so the recording and goto modes in the parent
group will influence how the list presents itself::

  f.goto(1)
  value1 = f.var  # In goto mode, f.var means f.var[current_record].

The ability to store arbitrary str-keyed dict and list trees whose
leaves are ndarrays (or None) gives qnd the ability to support pretty
much arbitrary python objects.  In particular, anything which can be
reduced to JSON format can be stored.
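
Before handing a tree to any storage interface, it can be worth
checking that it really has this shape.  The function ``check_tree``
below is a hypothetical helper, not part of qnd, which verifies that
every dict key is a str and that no dict or list contains itself;
anything that is neither a dict nor a list is treated as a leaf::

  def check_tree(node, _path=()):
      """Raise ValueError unless node is a loop-free tree of str-keyed
      dicts and lists; any other object counts as a leaf."""
      if isinstance(node, dict):
          for key in node:
              if not isinstance(key, str):
                  raise ValueError("non-str key: %r" % (key,))
          children = list(node.values())
      elif isinstance(node, list):
          children = node
      else:
          return  # a leaf: an ndarray or None in the qnd setting
      if any(node is parent for parent in _path):
          raise ValueError("loop detected")
      for child in children:
          check_tree(child, _path + (node,))


  check_tree({"a": [1.0, {"b": None}], "c": []})  # passes silently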

Other attributes
----------------

The HDF5 and netCDF file formats support variable attributes beyond
name, type, and shape.  These attribute metadata are generally not
useful outside a very narrow software suite for which they were
designed, but may provide helpful documentation when first opening a
category of file.  Therefore, qnd supports variable attributes for
backend formats which support them.  In qnd, all attributes belong to
the `QGroup` of the parent.  Thus, `QList` elements may not have
attributes (which is irrelevant since neither HDF5 nor netCDF has
native support for list objects)::

  fattrs = f.attrs()
  attrs = fattrs.x  # or fattrs['x'], attributes of f.x
  attrs = fattrs._  # or fattrs[''], attributes of f itself
  value = attrs.aname  # or attrs['aname'] value of attribute or None
  attrs.aname = value  # declare and set attribute
  attrs.aname = dtype, shape, value  # convert value to dtype and shape
  anames = list(fattrs.x)  # names of attributes of f.x
  if aname in fattrs.x: do_something
  for aname in fattrs.x: do_something
  for aname, avalue in fattrs.x.items(): do_something

Attribute values may not be dicts or non-array-like lists.  Also, the
attribute names 'dtype', 'shape', 'size', 'ndim', and 'sshape' will
always return the corresponding properties of the item, even though
they are not stored as variable attributes and are not actually present
in the `attrs` mapping objects.