User Interface for Binary Files

In terms of the scipy environment, qnd addresses the storage and retrieval of numpy ndarrays, excepting arrays with the general python object data type (dtype.kind ‘O’). In scipy programs, these arrays may be attributes of objects surrounded by methods and other non-data, but we assume that the programmer provides some means of initializing all such objects given just the (numerical) ndarrays at their heart.

The python language provides two kinds of collections which qnd directly supports: the dict (a collection of named objects) and the list (a sequence of heterogeneous objects). We presume that a programmer will provide a way to map a nested tree (no loops) of dicts and lists with ndarray leaves (and dict keys which are strings) to and from whatever objects their program requires. The dict and list are precisely the collections provided by the simple and popular JSON data interchange format. Thus, in order to use the qnd storage interface, we are essentially asking the programmer to support a portable organization of the program data.

Note the contrast to the goal of the python pickle module; we acknowledge that some extra design and maintenance work may be required to support such a mapping. Often the additional effort pays off in a simplified overall design.
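As a rough illustration of the mapping described above, the following sketch converts a small object hierarchy to and from a tree of dicts and lists. The Zone and Simulation classes and their to_tree and from_tree methods are hypothetical names invented for this example, and array.array from the standard library stands in for numpy ndarray leaves so that the sketch runs without numpy.

```python
from array import array


class Zone:
    """Hypothetical program object wrapping two numeric arrays."""

    def __init__(self, density, pressure):
        self.density = density
        self.pressure = pressure

    def mean_density(self):
        # A method is "non-data": it never needs to be stored.
        return sum(self.density) / len(self.density)

    def to_tree(self):
        # Only the numerical arrays go into the tree, under str keys.
        return {"density": self.density, "pressure": self.pressure}

    @classmethod
    def from_tree(cls, tree):
        # Rebuild the object given just the arrays at its heart.
        return cls(tree["density"], tree["pressure"])


class Simulation:
    """Hypothetical container: a scalar plus a list of Zone objects."""

    def __init__(self, time, zones):
        self.time = time
        self.zones = zones

    def to_tree(self):
        # A dict at the top, holding a scalar leaf and a list of sub-dicts.
        return {"time": self.time, "zones": [z.to_tree() for z in self.zones]}

    @classmethod
    def from_tree(cls, tree):
        return cls(tree["time"], [Zone.from_tree(t) for t in tree["zones"]])


# array.array is only a stand-in for ndarray in this sketch.
sim = Simulation(0.5, [Zone(array("d", [1.0, 3.0]), array("d", [0.1, 0.2]))])
tree = sim.to_tree()
restored = Simulation.from_tree(tree)
```

The tree contains nothing but dicts with str keys, lists, and numeric leaves, which is the organization a qnd file is meant to hold.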

Basic Usage

The first step is to obtain a file handle, say f, by opening a file. Each backend provides its own open function, named openXXX and defined in the backend module XXXf. For example:

from qnd.h5f import openh5
f = openh5('filename.h5', 'r+')

The qnd mode choices are the same for all backends (copied from the excellent h5py module mode semantics):

  • ‘r’ opens the file read-only
  • ‘w’ creates a new file, clobbering any existing file
  • ‘a’ opens the file read-write, creating it if it did not exist
  • ‘r+’ opens the file read-write, raising an error if it did not exist
  • ‘w-’ creates a new file, raising an error if it exists beforehand

These mode flags are not semantically identical to the python open function: The ‘w-’ is not recognized by open at all, and the open ‘a’ guarantees that any existing file bytes will not be modified, while the qnd ‘a’ merely means read-write (but has the same semantics as ‘a’ in terms of file existence and creation). Furthermore, qnd files are always readable, even if opened in one of the ‘w’ modes.
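The mode rules can be summarized as a small decision function. This is only a sketch of the semantics listed above, not qnd code: the name resolve_mode, the returned tuple, and the exception types are assumptions made for illustration, as is the choice to raise an error when ‘r’ is given a missing file.

```python
def resolve_mode(mode, exists):
    """Return (action, writable) for a qnd-style mode string.

    action is "open" (use the existing file) or "create" (start a new,
    empty file).  Hypothetical helper; exception types are assumptions.
    """
    if mode == "r":
        if not exists:
            raise FileNotFoundError("'r' requires an existing file")
        return "open", False
    if mode == "r+":
        if not exists:
            raise FileNotFoundError("'r+' requires an existing file")
        return "open", True
    if mode == "a":
        # Read-write; existing contents may be modified, unlike open 'a'.
        return ("open" if exists else "create"), True
    if mode == "w":
        # Any existing file is clobbered.
        return "create", True
    if mode == "w-":
        if exists:
            raise FileExistsError("'w-' refuses to overwrite a file")
        return "create", True
    raise ValueError("unrecognized mode: %r" % (mode,))


# 'a' creates a missing file but opens an existing one without clobbering.
print(resolve_mode("a", exists=False))
print(resolve_mode("a", exists=True))
```

The function reports only whether the file is writable, since a qnd file is readable in every mode.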

Python syntax has two operators for extracting named members from a compound object: The dot operator extracts an attribute from an object, and square brackets extract an item from a dict object (or other mapping). Qnd file handles support both. The dot syntax f.var is best when you know the name ‘var’ at the time you write the expression, while the square bracket syntax f[expression] is best when the name is the result of an expression or value of a variable.

The dot operator is overloaded, since it is also used for method attributes like f.close(). Thus the qnd file handle f really behaves like a dict for the most part, with its support for the dot syntax mere sugar to improve code legibility and, at least as importantly, ease of typing in interactive usage. If there were a variable named ‘close’ in f (who knows where f came from), you could always access it as f['close']. However, qnd provides a quick and dirty option for using the dot operator even in these cases: it will remove a single trailing underscore, so that f.close_ refers to the variable 'close', not 'close_'. (f.close__ would refer to 'close_'.) This idiom is suggested by the PEP8 python style guide, and you would also need it to escape python keywords, like f.yield_ to refer to f['yield'].
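The trailing underscore rule is easy to model with an ordinary dict subclass. The sketch below is a toy written for this explanation, not the qnd implementation; the names attr_to_key and ToyHandle are hypothetical.

```python
def attr_to_key(name):
    """Strip exactly one trailing underscore, if there is one."""
    return name[:-1] if name.endswith("_") else name


class ToyHandle(dict):
    """Toy dict whose items can also be read as attributes."""

    def close(self):
        # A real method, which shadows any item named 'close'.
        return "closed"

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal lookup fails, so
        # methods such as close() always win over items.
        if name.startswith("__"):
            raise AttributeError(name)  # leave special lookups alone
        try:
            return self[attr_to_key(name)]
        except KeyError:
            raise AttributeError(name) from None


f = ToyHandle(x=1.5, close=2.5)
f["yield"] = 3.5  # a python keyword, not usable directly after the dot

print(f.x)        # item 'x'
print(f.close())  # the method, not the item
print(f.close_)   # item 'close'
print(f.yield_)   # item 'yield'
```

Because only a single underscore is removed, an item whose name really ends in an underscore is reached by doubling it.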

The bottom line is, you use a qnd file handle f as if it were a python dict, but you are also free to treat items in f as if they were attributes of this dict object:

x = f.x           # read variable "x" from f, same as f['x']
f.x = expression  # declare and write "x" to f
f.update(x=expr1, y=expr2, ...)  # declare and write several variables
# update also accepts non-keyword dicts and lists of (name, value)
x = f.get('x', xdefault)  # same as get from dict
varnames = list(f)        # preferred over f.keys(), as for any dict
nvars = len(f)
if name in f: do_something
for name in f: do_something
for name, value in f.items(): do_something

In addition to the dict-like update, get, keys, and items methods, qnd files also have a number of non-dict methods and behaviors:

f.close()
f.flush()  # like close then reopen
with openh5('myfile.h5', 'a') as f:
    write_something(f)  # closing f upon exit from with suite
f.auto(0)        # turn off (or on) auto-read mode
f.recording(1)   # turn on (or off) recording mode
f.goto(time=t)   # set to previously recorded record
with f.push():
    do_something(f)  # temporarily change auto, recording, goto state

The recording and goto modes are the subject of the next section; we conclude this section by discussing auto mode. You may have noticed that f.x or f['x'] immediately read the variable from the file, giving you no opportunity to query its data type or shape, which you might well want to do without incurring the overhead of the actual read, especially if you know it is a very large array. We can get the names of all stored variables with list(f), but how do we find out what each one looks like without reading it?

The answer is that a qnd file f can be placed into a mode in which variable references do not trigger an automatic read operation, by invoking f.auto(0). You can also request this mode using the auto=0 keyword when you open the file. (The default is auto=1.) With autoread mode off, getting an item returns a qnd leaf object, which is like a mini-file handle you can use to query, read, or write only that specific variable. It has properties similar to an ndarray:

f.auto(0)
xhandle = f.x  # or f['x']
dtype, shape = xhandle.dtype, xhandle.shape  # also size and ndim
xhandle = f(0, 'x')  # return handle to x independent of auto mode
x = xhandle[:]  # read x if x is not scalar
x = xhandle[()]  # read x no matter what
xhandle[()] = expression  # write x no matter what
x = xhandle()  # shorthand for xhandle[()]
xhandle(expression)  # shorthand for xhandle[()] = expression
xpart = xhandle[index_expressions]  # read part of x
xhandle[index_expressions] = xpart  # write part of x

Notice that xhandle inherits the obscure indexing behavior of ndarray scalars, for which x[:] raises an error. However, xhandle provides a non-ndarray operation to compensate – calling a qnd handle as a function always reads the whole thing, whether or not it has any dimensions.

Although the qnd leaf handles can be used for partial read and write operations, if that is all you want to do, you can simply combine the partial index expressions into a single square bracket:

xpart = f['x', index_expressions]
f['x', index_expressions] = xpart

These work no matter how the autoread mode is set, but there is no equivalent using the dot syntax: Although f.x[index_expressions] produces the same final result, it reads all of x before applying index_expressions to the resulting large ndarray.

(Note that qnd only reads or writes the largest contiguous block of leading indices specified by index_expressions; it only reduces the intermediate memory footprint when the leading indices are scalar or small slices of x.)

Finally, sometimes you need to declare a variable without writing it. To do this in qnd, make its value a dtype or a (dtype, shape) tuple:

f.x = float  # declare x to be a scalar dtype(float), that is f8
f.y = yy.dtype, yy.shape  # declare y with type and shape of yy
f.z = bool, yy.shape  # declare z to be boolean with same shape as yy

Such a declaration reserves space for the array in the file, but it is your responsibility to fill it with sensible values with one later write or several partial writes.

Recording History

Setting an item with f.x = value or f['x'] = value both declares the variable and writes its value. If you later write it a second time with f.x = value2, by default this overwrites the original value you wrote. Sometimes, however, you need to record the history of a variable which is changing as a simulation progresses. The idea behind recording mode is to make the second assignment store the new value2 in addition to the original value, so by repeatedly assigning values to x you can store as many versions of its changing values as you like.

The HDF5, netCDF, and PDB file formats all support this capability by allowing the leading dimension of a variable to be “unlimited”. But in qnd, you can suppress this fictitious leading dimension by using the recording mode to write such variables, and the goto mode to read them:

f = openh5('myfile.h5', 'w')
f.x = xa  # x is not a record variable.
f.recording(1)  # Put f in recording mode; new variables are recorded.
f.time = t0  # Time is a record variable with t0 for its first record.
f.y = y0  # y is a record variable with y0 for its first value.
f.x = xb  # x remains a non-record variable, xb overwrites xa
f.time = t1  # Write a second record of time with value t1.
f.y = y1  # Write a second record of y with value y1.
f.close()

f = openh5('myfile.h5', 'r')
# Initially, goto mode is off (None), and reading a record variable...
times = f.time[:]  # ...returns a list (not array) of all of its records.
# Use goto to set a "current record" index for all record variables:
f.goto(0)  # first record
t0 = f.time
y0 = f.y
xb = f.x  # non-record variables ignore current record
with f.push():  # current record restored on exit from with suite
    f.goto(-1)  # go to last record, record<0 acts like any other index
    yN = f.y
# You may use any scalar record variable as a keyword to jump to the
# record nearest the specified value of that variable (assuming it is
# monotonic):
f.goto(time=1.2)  # set to record where f.time nearest 1.2
y12 = f.y
for record in f.gotoit():  # iterate over all records
    # gotoit() causes implicit f.goto(record) before each pass
    do_something(f)
f.goto(None)  # Turn off goto mode.
ylist = f.y  # list of y arrays at every record

The qnd interface, unlike the existing backend file formats, also supports the case of record variables whose shape changes from one record to the next. To use this feature, set the recording mode to 2 instead of to 1:

f.recording(2)
f.x = zeros((nx, ny))  # First x record has shape (nx, ny).
f.x = zeros((nx+5, ny-2))  # Second x record has shape (nx+5, ny-2).
f.goto(None)
xlist = f.x  # list of x arrays at every record

This possibility explains why f.recordvar returns a list of values at every record, rather than an array with an extra leading dimension (as in the fiction employed for the existing file formats).
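The recording and goto rules can be mimicked by a small in-memory model. The ToyRecords class below is a hypothetical stand-in written only to illustrate the semantics described in this section; it is not how qnd stores records, and plain python lists stand in for arrays. How an existing record variable behaves when recording is turned off again is not stated above, so the toy simply keeps appending.

```python
class ToyRecords:
    """Toy in-memory model of recording and goto modes (not qnd code)."""

    def __init__(self):
        self._plain = {}       # non-record variables: name -> value
        self._records = {}     # record variables: name -> list of values
        self._recording = False
        self._current = None   # current record index, None = goto mode off

    def recording(self, flag):
        self._recording = bool(flag)

    def goto(self, index):
        self._current = index

    def write(self, name, value):
        if name in self._records:
            # Assumption of this toy: record variables always append.
            self._records[name].append(value)
        elif name in self._plain or not self._recording:
            # Variables declared outside recording mode are overwritten.
            self._plain[name] = value
        else:
            # New variable in recording mode: this is its first record.
            self._records[name] = [value]

    def read(self, name):
        if name in self._plain:
            return self._plain[name]  # ignores the current record
        history = self._records[name]
        if self._current is None:
            return list(history)      # every record, as a list
        return history[self._current]


f = ToyRecords()
f.write("x", "xa")           # not a record variable
f.recording(1)
f.write("time", 0.0)
f.write("y", [1.0, 2.0])
f.write("x", "xb")           # overwrites xa
f.write("time", 0.5)
f.write("y", [3.0, 4.0, 5.0])  # a record of a different length

f.goto(-1)
print(f.read("y"))     # last record of y
f.goto(None)
print(f.read("time"))  # all records of time
```

The second y record has a different length from the first, so the full history cannot be stacked into one rectangular array; returning a list of records avoids that problem.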

Groups and lists of variables

The qnd file handle class is QGroup; specifically it is the “root group” of the file. But a QGroup may contain subgroups, just as a python dict may contain other dicts. To define a subgroup, simply assign a dict instead of an array-like value to an item:

f.g = {}  # declare an empty subgroup g
f.g.update(x=expr1, y=expr2)  # all the methods of f work with g
g = f.g  # g is a QGroup, a subgroup of f
y = g.y  # or g['y']
g.auto(0)  # initially g inherits autoread and other modes from f
root = g.root()  # returns root QGroup, root is f here
if f is f.root(): task_if_f_is_root_group()
f['g/x']  # same as f.g.x
f['/g/x']  # same as f.root().g.x

Although a subgroup initially inherits its autoread, recording, and goto modes from its parent, thereafter the modes of g are independent of the modes of f. In a gotoit loop, you must therefore use the record number returned by the iterator to keep subgroups synchronized explicitly:

g = f.g
for record in f.gotoit():
    g.goto(record)
    do_something(f, g)

Because a QGroup looks like a dict, dict(f) will read every variable in f. By analogy with the qnd leaf handles, f() also reads every item in f into a dict, with one twist: Instead of an ordinary dict, f() results in a dict subclass called an ADict, which permits access to the dict items as attributes according to the same rules as for a QGroup. If you want to convert your own dict objects into ADict objects, you can use the redict function in the qnd.adict module. That module also contains a generic mix-in class ItemsAreAttrs which you can use as a base class for your own mapping classes. (Be sure to read the comment in the __getattr__ method before you attempt this, as it can make your code difficult to debug.)

Note that f() respects the autoread and goto modes. Thus if auto=0, nothing will be read from the file and the returned dict will contain qnd leaf handles (QLeaf objects) rather than variable values. When auto=1, the dict item corresponding to any subgroup will be a QGroup object. If you want to recursively read all subgroups, set auto=2, which causes subgroups to be read automatically. (Note that since g = f.g produces an ADict in that case rather than a QGroup, auto=2 can never be inherited.)

In addition to QGroup (a dict with str keys) and QLeaf (an ndarray), the qnd interface provides a third item type, QList, which stores a heterogeneous python list. A QList is a way to store a sequence of objects anonymously, so that you can reference them simply by a sequence number instead of by a name. If you find yourself inventing sequences of names like ‘var00’, ‘var01’, ‘var02’, and so on, to store in a QGroup, you want to use a QList instead:

f.var = list  # (the builtin list type) declares empty list var
var = f.var  # the QList object, assuming f.goto(None)
var.append(value0)  # QList has list-like append and extend methods
var.append(value1)
var.extend([value0, value1, ...])
value1 = var[1]  # second item of var, negative index, slices work
var[1] = newvalue1  # overwrite value1
nitems = len(var)
var.auto(0)  # QList initially inherits its parent's autoread mode

Although QList has an autoread mode like a QGroup, it does not have either a recording mode or a goto mode. In fact, a record variable is implemented as a QList, so the recording and goto modes in the parent group will influence how the list presents itself:

f.goto(1)
value1 = f.var  # In goto mode, f.var means f.var[current_record].

The ability to store arbitrary str-keyed dict and list trees whose leaves are ndarrays (or None) gives qnd the ability to support pretty much arbitrary python objects. In particular, anything which can be reduced to JSON format can be stored.
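A quick way to see which structures fit this description is a recursive check like the sketch below. The function is_storable is a hypothetical helper, not part of qnd. It treats numbers, booleans, and strings as array-like leaves (an assumption made here so that decoded JSON passes), uses array.array as a stand-in for ndarray, and does not guard against trees that contain loops.

```python
import json
from array import array
from numbers import Number


def is_storable(tree):
    """Return True if tree is built only from str-keyed dicts, lists,
    and leaves that are None or array-like (hypothetical helper)."""
    if tree is None or isinstance(tree, (Number, str, array)):
        return True  # leaf; bool counts as a Number in python
    if isinstance(tree, dict):
        return all(isinstance(key, str) and is_storable(value)
                   for key, value in tree.items())
    if isinstance(tree, list):
        return all(is_storable(item) for item in tree)
    return False


# Anything decoded from JSON consists of exactly these building blocks.
decoded = json.loads('{"a": [1, 2.5, null], "b": {"c": "text", "d": true}}')
print(is_storable(decoded))

# A dict with a non-str key or an opaque object leaf does not qualify.
print(is_storable({1: 2.0}))
print(is_storable({"a": object()}))
```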

Other attributes

The HDF5 and netCDF file formats support variable attributes beyond name, type, and shape. These attribute metadata are generally not useful outside a very narrow software suite for which they were designed, but may provide helpful documentation when first opening a category of file. Therefore, qnd supports variable attributes for backend formats which support them. In qnd, all attributes belong to the QGroup of the parent. Thus, QList elements may not have attributes (which is irrelevant since neither HDF5 nor netCDF has native support for list objects):

fattrs = f.attrs()
attrs = fattrs.x  # or fattrs['x'], attributes of f.x
attrs = fattrs._  # or fattrs[''], attributes of f itself
value = attrs.aname  # or attrs['aname'] value of attribute or None
attrs.aname = value  # declare and set attribute
attrs.aname = dtype, shape, value  # convert value to dtype and shape
anames = list(fattrs.x)  # names of attributes of f.x
if aname in fattrs.x: do_something
for aname in fattrs.x: do_something
for aname, avalue in fattrs.x.items(): do_something

Attribute values may not be dict or non-array-like lists. Also, the attribute names ‘dtype’, ‘shape’, ‘size’, ‘ndim’, and ‘sshape’ will always return the corresponding properties of the item, even though they are not stored as variable attributes and are not actually present in the attrs mapping objects.