User Interface for Binary Files
===============================

In terms of the scipy environment, qnd addresses the storage and
retrieval of numpy ndarrays, excepting arrays with the general python
object data type (dtype.kind 'O').  In scipy programs, these arrays may
be attributes of objects surrounded by methods and other non-data, but
we assume that the programmer provides some means of initializing all
such objects given just the (numerical) ndarrays at their heart.

The python language provides two kinds of collections which qnd directly
supports: the dict (a collection of named objects) and the list (a
sequence of heterogeneous objects).  We presume that a programmer will
provide a way to map a nested tree (no loops) of dicts and lists with
ndarray leaves (and dict keys which are strings) to and from whatever
objects their program requires.

The dict and list are precisely the collections provided by the simple
and popular JSON data interchange format.  Thus, in order to use the qnd
storage interface, we are essentially asking the programmer to support a
portable organization of the program data.  Note the contrast to the
goal of the python pickle module, which aims to serialize arbitrary
python objects as they are; we acknowledge that some extra design and
maintenance work may be required to support such a mapping.  Often the
additional effort pays off in a simplified overall design.

Basic Usage
-----------

The first step is to obtain a file handle, say `f`, by opening a file.
The open function belongs to the particular backend, called
``openXXX``, and defined in a backend module ``XXXf``.
For example::

    from qnd.h5f import openh5
    f = openh5('filename.h5', 'r+')

The qnd mode choices are the same for all backends (copied from the
excellent h5py module mode semantics):

* 'r' opens the file read-only
* 'w' creates a new file, clobbering any existing file
* 'a' opens the file read-write, creating it if it did not exist
* 'r+' opens the file read-write, raising an error if it did not exist
* 'w-' creates a new file, raising an error if it exists beforehand

These mode flags are not semantically identical to the python ``open``
function: The 'w-' is not recognized by ``open`` at all, and the
``open`` 'a' guarantees that any existing file bytes will not be
modified, while the qnd 'a' merely means read-write (but has the same
semantics as 'a' in terms of file existence and creation).  Furthermore,
qnd files are always readable, even if opened in one of the 'w' modes.

Python syntax has two operators for extracting named members from a
compound object: The dot operator extracts an `attribute` from an
object, and square brackets extract an `item` from a dict object (or
other mapping).  Qnd file handles support both.  The dot syntax
``f.var`` is best when you know the name 'var' at the time you write the
expression, while the square bracket syntax ``f[expression]`` is best
when the name is the result of an expression or value of a variable.

The dot operator is overloaded, since it is also used for method
attributes like ``f.close()``.  Thus the qnd file handle `f` really
behaves like a dict for the most part, with its support for the dot
syntax mere sugar to improve code legibility and, at least as
importantly, ease of typing in interactive usage.  If there were a
variable named 'close' in `f` (who knows where `f` came from), you could
always access it as ``f['close']``.
However, qnd provides a quick and dirty option for using the dot
operator even in these cases: it will remove a single trailing
underscore, so that ``f.close_`` refers to the variable ``'close'``, not
``'close_'``.  (``f.close__`` would refer to ``'close_'``.)  This idiom
is suggested by the PEP8 python style guide, and you would also need it
to escape python keywords, like ``f.yield_`` to refer to ``f['yield']``.

The bottom line is, you use a qnd file handle `f` as if it were a python
dict, but you are also free to treat items in `f` as if they were
attributes of this dict object::

    x = f.x  # read variable "x" from f, same as f['x']
    f.x = expression  # declare and write "x" to f
    f.update(x=expr1, y=expr2, ...)  # declare and write several variables
    # update also accepts non-keyword dicts and lists of (name, value)
    x = f.get('x', xdefault)  # same as get from dict
    varnames = list(f)  # preferred over f.keys(), as for any dict
    nvars = len(f)
    if name in f: do_something
    for name in f: do_something
    for name, value in f.items(): do_something

In addition to the dict-like `update`, `get`, `keys`, and `items`
methods, qnd files also have a number of non-dict methods and
behaviors::

    f.close()
    f.flush()  # like close then reopen
    with openh5('myfile.h5', 'a') as f:
        write_something(f)
    # closing f upon exit from with suite
    f.auto(0)  # turn off (or on) auto-read mode
    f.recording(1)  # turn on (or off) recording mode
    f.goto(time=t)  # set to previously recorded record
    with f.push():
        do_something(f)
    # temporarily change auto, recording, goto state

The `recording` and `goto` modes are the subject of the next section; we
conclude this section by discussing `auto` mode.

You may have noticed that ``f.x`` or ``f['x']`` immediately read the
variable from the file, giving you no opportunity to query its data type
or shape, which you might well want to do without incurring the overhead
of the actual read, especially if you know it is a very large array.
We can get the names of all stored variables with ``list(f)``, but how
do we find out what each one looks like without reading it?

The answer is that a qnd file `f` can be placed into a mode in which
variable references do not trigger an automatic read operation, by
invoking ``f.auto(0)``.  You can also request this mode using the
``auto=0`` keyword when you open the file.  (The default is ``auto=1``.)
With autoread mode off, getting an item returns a qnd leaf object, which
is like a mini-file handle you can use to query, read, or write only
that specific variable.  It has properties similar to an ndarray::

    f.auto(0)
    xhandle = f.x  # or f['x']
    dtype, shape = xhandle.dtype, xhandle.shape  # also size and ndim
    xhandle = f(0, 'x')  # return handle to x independent of auto mode
    x = xhandle[:]  # read x if x is not scalar
    x = xhandle[()]  # read x no matter what
    xhandle[()] = expression  # write x no matter what
    x = xhandle()  # shorthand for xhandle[()]
    xhandle(expression)  # shorthand for xhandle[()] = expression
    xpart = xhandle[index_expressions]  # read part of x
    xhandle[index_expressions] = xpart  # write part of x

Notice that `xhandle` inherits the obscure indexing behavior of ndarray
scalars, for which ``x[:]`` raises an error.  However, `xhandle`
provides a non-ndarray operation to compensate -- calling a qnd handle
as a function always reads the whole thing, whether or not it has any
dimensions.

Although the qnd leaf handles can be used for partial read and write
operations, if that is all you want to do, you can simply combine the
partial index expressions into a single square bracket::

    xpart = f['x', index_expressions]
    f['x', index_expressions] = xpart

These work no matter how the autoread mode is set, but there is no
equivalent using the dot syntax: Although ``f.x[index_expressions]``
produces the same final result, it reads all of `x` before applying
`index_expressions` to the resulting large ndarray.
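
The difference between the combined bracket form and the dot form comes
from an ordinary python rule: a comma inside square brackets packs
everything into a single tuple, so the handle's ``__getitem__`` receives
the name and the index expressions together, before any data has been
returned.  The toy class below illustrates only that mechanism.
`TinyStore` and its `log` attribute are hypothetical, not part of qnd,
and the class applies indices one axis at a time to nested lists, which
matches numpy only when the leading indices are scalars.

```python
class TinyStore:
    """Hypothetical stand-in for a file handle (not part of qnd)."""

    def __init__(self, **variables):
        self._vars = variables
        self.log = []  # records what each access asked the store for

    def __getitem__(self, key):
        # f['x', 1, 0:2] arrives here as the tuple ('x', 1, slice(0, 2)).
        name, *indices = key if isinstance(key, tuple) else (key,)
        self.log.append((name, tuple(indices)))
        value = self._vars[name]
        for index in indices:  # one axis at a time, leading axis first
            value = value[index]
        return value


f = TinyStore(x=[[1, 2, 3], [4, 5, 6]])
row = f['x', 1]        # the store sees the index along with the name
same = f['x'][1]       # the store sees only 'x'; [1] is applied afterwards
part = f['x', 1, 0:2]  # scalar leading index followed by a slice
```

In the first and third accesses the store knows which part is wanted and
could, in principle, limit what it fetches; in the second it has already
handed back the whole value before the index is applied.
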
(Note that qnd only reads or writes the largest contiguous block of
leading indices specified by `index_expressions`; it only reduces the
intermediate memory footprint when the leading indices are scalar or
small slices of `x`.)

Finally, sometimes you need to declare a variable without writing it.
To do this in qnd, make its value a dtype or a (dtype, shape) tuple::

    f.x = float  # declare x to be a scalar dtype(float), that is f8
    f.y = yy.dtype, yy.shape  # declare y with type and shape of yy
    f.z = bool, yy.shape  # declare z to be boolean with same shape as yy

Such a declaration reserves space for the array in the file, but it is
your responsibility to fill it with sensible values with one later write
or several partial writes.

Recording History
-----------------

Setting an item with ``f.x = value`` or ``f['x'] = value`` both declares
the variable and writes its value.  If you later write it a second time
with ``f.x = value2``, by default this overwrites the original value you
wrote.  Sometimes, however, you need to record the history of a variable
which is changing as a simulation progresses.  The idea behind recording
mode is to make the second assignment store the new `value2` in addition
to the original `value`, so by repeatedly assigning values to `x` you
can store as many versions of its changing values as you like.

The HDF5, netCDF, and PDB file formats all support this capability by
allowing the leading dimension of a variable to be "unlimited".  But in
qnd, you can suppress this fictitious leading dimension by using the
`recording` mode to write such variables, and the `goto` mode to read
them::

    f = openh5('myfile.h5', 'w')
    f.x = xa  # x is not a record variable.
    f.recording(1)  # Put f in recording mode; new variables are recorded.
    f.time = t0  # Time is a record variable with t0 for its first record.
    f.y = y0  # y is a record variable with y0 for its first value.
    f.x = xb  # x remains a non-record variable, xb overwrites xa
    f.time = t1  # Write a second record of time with value t1.
    f.y = y1  # Write a second record of y with value y1.
    f.close()
    f = openh5('myfile.h5', 'r')
    # Initially, goto mode is off (None), and reading a record variable...
    times = f.time[:]  # ...returns a list (not array) of all of its records.
    # Use goto to set a "current record" index for all record variables:
    f.goto(0)  # first record
    t0 = f.time
    y0 = f.y
    xb = f.x  # non-record variables ignore current record
    with f.push():  # current record restored on exit from with suite
        f.goto(-1)  # go to last record, record<0 acts like any other index
        yN = f.y
    # You may use any scalar record variable as a keyword to jump to the
    # record nearest the specified value of that variable (assuming it is
    # monotonic):
    f.goto(time=1.2)  # set to record where f.time nearest 1.2
    y12 = f.y
    for record in f.gotoit():  # iterate over all records
        # gotoit() causes implicit f.goto(record) before each pass
        do_something(f)
    f.goto(None)  # Turn off goto mode.
    ylist = f.y  # list of y arrays at every record

The qnd interface, unlike the existing backend file formats, also
supports the case of record variables whose shape changes from one
record to the next.  To use this feature, set the recording mode to 2
instead of to 1::

    from numpy import zeros
    f.recording(2)
    f.x = zeros((nx, ny))  # First x record has shape (nx, ny).
    f.x = zeros((nx+5, ny-2))  # Second x record has shape (nx+5, ny-2).
    f.goto(None)
    xlist = f.x  # list of x arrays at every record

This possibility explains why ``f.recordvar`` returns a list of values
at every record, rather than an array with an extra leading dimension
(as in the fiction employed for the existing file formats).

Groups and lists of variables
-----------------------------

The qnd file handle class is `QGroup`; specifically it is the "root
group" of the file.  But a QGroup may contain subgroups, just as a
python dict may contain other dicts.
To define a subgroup, simply assign a dict instead of an array-like
value to an item::

    f.g = {}  # declare an empty subgroup g
    f.g.update(x=expr1, y=expr2)  # all the methods of f work with g
    g = f.g  # g is a QGroup, a subgroup of f
    y = g.y  # or g['y']
    g.auto(0)  # initially g inherits autoread and other modes from f
    root = g.root()  # returns root QGroup, root is f here
    if f is f.root(): task_if_f_is_root_group()
    f['g/x']  # same as f.g.x
    f['/g/x']  # same as f.root().g.x

Although a subgroup initially inherits its autoread, recording, and goto
modes from its parent, thereafter the modes of `g` are independent of
the modes of `f`.  In a `gotoit` loop, you therefore need the record
number produced by the iterator to keep subgroups explicitly
synchronized::

    g = f.g
    for record in f.gotoit():
        g.goto(record)
        do_something(f, g)

Because a `QGroup` looks like a dict, ``dict(f)`` will read every
variable in `f`.  By analogy with the qnd leaf handles, ``f()`` also
reads every item in `f` into a dict, with one twist: Instead of an
ordinary dict, ``f()`` results in a dict subclass called an `ADict`,
which permits access to the dict items as attributes according to the
same rules as for a `QGroup`.  If you want to convert your own `dict`
objects into `ADict` objects, you can use the `redict` function in the
``qnd.adict`` module.  That module also contains a generic mix-in class
`ItemsAreAttrs` which you can use as a base class for your own mapping
classes.  (Be sure to read the comment in the `__getattr__` method
before you attempt this, though, as it can make your code difficult to
debug.)

Note that ``f()`` respects the autoread and goto modes.  Thus if
``auto=0``, nothing will be read from the file and the returned dict
will contain qnd leaf handles (`QLeaf` objects) rather than variable
values.  When ``auto=1``, the dict item corresponding to any subgroup
will be a `QGroup` object.
If you want to recursively read all subgroups, set ``auto=2``, which
causes subgroups to be read automatically.  (Note that since
``g = f.g`` produces an `ADict` in that case rather than a `QGroup`,
``auto=2`` can never be inherited.)

In addition to `QGroup` (a dict with str keys) and `QLeaf` (an ndarray),
the qnd interface provides a third item type, `QList`, which stores a
python heterogeneous list.  A `QList` is a way to store a sequence of
objects anonymously, so that you can reference them simply by a sequence
number instead of by a name.  If you find yourself inventing sequences
of names like 'var00', 'var01', 'var02', and so on, to store in a
`QGroup`, you want to use a `QList` instead::

    f.var = list  # (the builtin list type) declares empty list var
    var = f.var  # the QList object, assuming f.goto(None)
    var.append(value0)  # QList has list-like append and extend methods
    var.append(value1)
    var.extend([value0, value1, ...])
    value1 = var[1]  # second item of var, negative index, slices work
    var[1] = newvalue1  # overwrite value1
    nitems = len(var)
    var.auto(0)  # QList initially inherits its parent's autoread mode

Although `QList` has an autoread mode like a `QGroup`, it does not have
either a recording mode or a goto mode.  In fact, a record variable is
implemented as a `QList`, so the recording and goto modes in the parent
group will influence how the list presents itself::

    f.goto(1)
    value1 = f.var  # In goto mode, f.var means f.var[current_record].

The ability to store arbitrary str-keyed dict and list trees whose
leaves are ndarrays (or None) gives qnd the ability to support pretty
much arbitrary python objects.  In particular, anything which can be
reduced to JSON format can be stored.

Other attributes
----------------

The HDF5 and netCDF file formats support variable attributes beyond
name, type, and shape.
These attribute metadata are generally not useful outside a very narrow
software suite for which they were designed, but may provide helpful
documentation when first opening a category of file.  Therefore, qnd
supports variable attributes for backend formats which support them.

In qnd, all attributes belong to the `QGroup` of the parent.  Thus,
`QList` elements may not have attributes (which is irrelevant since
neither HDF5 nor netCDF has native support for list objects)::

    fattrs = f.attrs()
    attrs = fattrs.x  # or fattrs['x'], attributes of f.x
    attrs = fattrs._  # or fattrs[''], attributes of f itself
    value = attrs.aname  # or attrs['aname'] value of attribute or None
    attrs.aname = value  # declare and set attribute
    attrs.aname = dtype, shape, value  # convert value to dtype and shape
    anames = list(fattrs.x)  # names of attributes of f.x
    if aname in fattrs.x: do_something
    for aname in fattrs.x: do_something
    for aname, avalue in fattrs.x.items(): do_something

Attribute values may not be dict or non-array-like lists.  Also, the
attribute names 'dtype', 'shape', 'size', 'ndim', and 'sshape' will
always return the corresponding properties of the item, even though they
are not stored as variable attributes and are not actually present in
the `attrs` mapping objects.
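
The spelling ``fattrs._`` for ``fattrs['']`` follows the same
trailing-underscore rule as ``f.close_`` and ``f.yield_``: one trailing
underscore is removed from the attribute name to obtain the item name.
The rule can be sketched in plain python as follows.  This is only an
illustration under stated assumptions: `UnderscoreDict` is a
hypothetical name, the sketch handles reads only, and it raises
``AttributeError`` for a missing name; the real mix-in in ``qnd.adict``
may behave differently.

```python
class UnderscoreDict(dict):
    """Hypothetical dict whose items can be read as attributes."""

    @staticmethod
    def _key(name):
        # Remove exactly one trailing underscore, if there is one.
        return name[:-1] if name.endswith("_") else name

    def __getattr__(self, name):
        # Python calls this only after normal attribute lookup fails,
        # so real dict methods such as .get or .keys are never shadowed.
        try:
            return self[self._key(name)]
        except KeyError:
            raise AttributeError(name) from None


d = UnderscoreDict({"x": 1, "close": 2, "close_": 3,
                    "yield": 4, "": 5, "get": 6})
plain = d.x            # item 'x'
escaped = d.close_     # item 'close'
doubled = d.close__    # item 'close_'
keyword = d.yield_     # item 'yield', which d.yield could not spell
empty = d._            # item '' (the empty name)
method = d.get         # still the ordinary dict.get method
shadowed = d.get_      # item 'get'
```

Because ``__getattr__`` is only a fallback, an item that shares its name
with a method stays reachable through the underscore form or through
square brackets, which is the situation described for ``f.close``.
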