Defining your own data objects

Background

By default, data objects in glue are instances of the Data class, and this class assumes that the data are stored in one or more local n-dimensional Numpy arrays. However, glue now includes a way of defining a wider variety of data objects, which may rely, for example, on large remote datasets or on datasets that are not inherently stored as regular n-dimensional arrays.

The base class for all datasets is BaseData, which is intended to represent completely arbitrary datasets. However, glue does not yet recognize classes that inherit directly from BaseData. Instead, for now, the base class that can be used to define custom data objects is BaseCartesianData, which inherits from BaseData and requires data objects to present an interface that looks like n-dimensional arrays (although the storage of the data could still be unstructured). In future, we will also make it possible to support a more generic interface for data access based on the BaseData class.

Main data interface

Before we dive in, we recommend that you take a look at the Working with Data objects tutorial to understand how the default Data objects work. The most relevant point here is that glue data objects are collections of attributes (components in glue-speak) that are assumed to be aligned (for regular cartesian datasets, this means they are on the same grid). For example, a table consists of a collection of 1-d attributes that all have the same length, and a traditional image consists of a single 2-d attribute. Attributes/components are identified in data objects by ComponentID objects (though these objects don’t contain the actual values; they are just references to the attributes).

To define your own data object, you should write a class that inherits from BaseCartesianData. You will then need to define a few properties and methods for the data object to be usable in glue. The properties you need to define are:

  • label: the name of the dataset, as a string

  • shape: the shape of a single component, given as a tuple

  • main_components: a list of all ComponentID objects that your data object recognizes, excluding coordinate components (more on this later).

The methods you need to define are:

  • get_kind(): given a ComponentID, return a string that should be either 'numerical' (for e.g. floating-point and integer attributes), 'categorical' (for e.g. string attributes), or 'datetime' (for attributes that use the np.datetime64 type).

  • get_data(): given a ComponentID and optionally a view, return a Numpy array. A view can be anything that can be used to index a Numpy array: for example a single integer (view=4), a tuple of slices (view=(slice(1, 14, 2), slice(4, 50, 3))), a tuple of index sequences (view=((1, 2, 3), (4, 3, 4))), and so on. If a view is specified, only that subset of values should be returned. For example, if the data has an overall shape of (10,) and view=slice(1, 6, 2), get_data should return an array with shape (3,).

  • get_mask(): given a SubsetState object (described in Subset states) and optionally a view, return a boolean array describing which values are in the specified subset (where True indicates values inside the subset).

  • compute_statistic(): given a statistic name (e.g. 'mean') and a ComponentID, as well as optional keyword arguments (see compute_statistic()), return a statistic for the required component. In particular one of the keyword arguments is subset_state, which can be used to indicate that the statistic should only be computed for a subset of values.

  • compute_histogram(): given a list of ComponentID objects, as well as optional keyword arguments (see compute_histogram()), compute an n-dimensional histogram of the required attributes. At the moment glue only makes use of this for one or two dimensions, though we may also use it for three dimensions in future.

  • compute_fixed_resolution_buffer(): given a list of bounds of the form (min, max, n) along each dimension of the data, return a fixed resolution buffer that goes from the respective min to max in n steps. Bounds can also be single scalars in cases where the fixed resolution buffer is a lower-dimensional slice. This method can optionally take a target dataset in the case where the fixed resolution buffer should be computed in the frame of reference of a different dataset, in which case the bounds should be interpreted in the frame of reference of the target dataset (but this is only needed if data linking is used). See compute_fixed_resolution_buffer() for a full list of arguments.
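To make the view argument of get_data() concrete, the following minimal sketch applies a few views to in-memory Numpy arrays. The helper apply_view and the two arrays are made up for illustration and are not part of glue; a real data object would instead fetch only the requested values from wherever the data live. Multi-dimensional views are written as tuples here, since that is the form Numpy accepts for indexing several axes at once.

```python
import numpy as np


def apply_view(values, view=None):
    """Return the part of `values` selected by `view` (everything if None)."""
    if view is None:
        return values
    return values[view]


# A 1-d attribute of a table with 10 rows, and a 2-d image attribute
column = np.arange(10)
image = np.arange(20 * 60).reshape(20, 60)

# A slice over a (10,) attribute selects rows 1, 3 and 5
sliced = apply_view(column, slice(1, 6, 2))

# A tuple of slices selects a regularly strided sub-image
sub_image = apply_view(image, (slice(1, 14, 2), slice(4, 50, 3)))

# A tuple of index sequences picks the pixels (1, 4), (2, 3) and (3, 4)
pixels = apply_view(image, ((1, 2, 3), (4, 3, 4)))

print(sliced.shape, sub_image.shape, pixels.shape)
```

The first result has shape (3,), matching the get_data() example given in the list above.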

Subset states

In the above section, we mentioned the concept of a SubsetState. In glue parlance, a subset state is an abstract representation of a subset in the data – for instance the subset of data where a > 3 or the subset of data inside a polygon with vertices vx and vy. Several of the methods in Main data interface can take subset states, and so in your data object, you should decide how to most efficiently implement each kind of subset state.

You can find a full list of the subset states defined in glue in the API documentation, and the documentation of each one explains how to access the relevant information. For example, if you have an InequalitySubsetState, you can access the relevant information as follows:

>>> subset_state
<InequalitySubsetState: (x > 1.2)>
>>> subset_state.left
x
>>> subset_state.right
1.2
>>> subset_state.operator
<built-in function gt>

SubsetState objects have a to_mask() method that can take a data object and a view and return a mask:

>>> subset_state.to_mask(d)
array([False,  True,  True])

In this case, the subset state essentially accesses the data using get_data(), so this may be very inefficient for large datasets. Therefore, you may choose to re-interpret the subset states and compute a mask yourself.
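As a rough sketch of what re-interpreting a subset state could look like, the code below evaluates an inequality chunk by chunk rather than loading every value at once. It relies only on the left, right, and operator attributes shown above. Everything else is an assumption made for illustration: FakeInequality is a stand-in and not the glue class, the attribute is identified by a plain string rather than a ComponentID, and read_chunk is a hypothetical function representing whatever partial access your storage backend offers.

```python
import operator
from collections import namedtuple

import numpy as np

# Hypothetical stand-in exposing the same three attributes as the
# InequalitySubsetState shown above; this is NOT the glue class.
FakeInequality = namedtuple("FakeInequality", ["left", "right", "operator"])


def mask_in_chunks(state, read_chunk, n_rows, chunk_size=1000):
    """Evaluate an inequality on one chunk of rows at a time."""
    mask = np.zeros(n_rows, dtype=bool)
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        values = read_chunk(state.left, start, stop)
        # state.operator is e.g. operator.gt, which works element-wise
        mask[start:stop] = state.operator(values, state.right)
    return mask


# In-memory columns standing in for a remote or on-disk dataset
columns = {"x": np.array([0.5, 1.5, 2.0, 1.2, 3.1])}


def read_chunk(name, start, stop):
    """Hypothetical partial read: return rows start..stop of one column."""
    return columns[name][start:stop]


state = FakeInequality(left="x", right=1.2, operator=operator.gt)
mask = mask_in_chunks(state, read_chunk, n_rows=5, chunk_size=2)
print(mask)
```

A backend that can evaluate conditions itself (for instance a database) could go further and translate the inequality into a query, so that only the resulting mask is transferred.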

While developing your data class, one way to make sure that glue doesn’t crash if you haven’t yet implemented support for a specific subset state is to interpret any unimplemented subset state as simply indicating an empty subset.
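The fallback described above might be sketched as follows. ThresholdState and PolygonState are hypothetical classes defined only for this illustration (they are not glue subset states), and get_mask_or_empty is a free function rather than the get_mask() method, so that the snippet runs without glue installed.

```python
import numpy as np


class ThresholdState:
    """Hypothetical subset state: values strictly above `limit`."""

    def __init__(self, limit):
        self.limit = limit


class PolygonState:
    """Hypothetical subset state that this sketch does not support yet."""


def get_mask_or_empty(values, subset_state):
    """Return a boolean mask, treating unsupported states as empty subsets."""
    if isinstance(subset_state, ThresholdState):
        return values > subset_state.limit
    # Not implemented yet: report that no values are in the subset
    return np.zeros(values.shape, dtype=bool)


values = np.array([0.5, 1.5, 2.0])
supported = get_mask_or_empty(values, ThresholdState(1.2))
unsupported = get_mask_or_empty(values, PolygonState())
print(supported, unsupported)
```

An all-False mask keeps the mask shape consistent with the data, so code that consumes the mask keeps working while support for further subset states is added.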

Using your data object

Assuming you have written your own data class, there are several ways that you can start using it in glue:

  • You can define your own data loader, which is a function that takes a filename and should return an instance of a subclass of BaseCartesianData.

  • You can define your own data importer, which is a function that can do anything you need, for example showing a dialog, and should return a list of instances of BaseCartesianData. This is more general than a data loader since a data importer doesn’t need to rely on a filename. It might include for example opening a dialog in which you can log in to a remote service and browse available datasets.

  • You can start up glue programmatically, including constructing your data object. This is particularly useful when initially developing your custom data object:

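    from glue.core import DataCollection
    from glue.app.qt.application import GlueApplication

    # Construct your data object (MyCustomData stands for your own class)
    d = MyCustomData(...)

    # Create glue application and start
    dc = DataCollection([d])
    ga = GlueApplication(dc)
    ga.start()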
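The first option above, a data loader, is at its core just a function from a filename to a data object. The sketch below shows that shape only. ColumnStore is a hypothetical stand-in used so that the snippet runs without glue; in practice it would be your own BaseCartesianData subclass. The CSV layout and the function name load_columns are likewise assumptions, and registering the loader with glue is not shown here.

```python
import os
import tempfile

import numpy as np


class ColumnStore:
    """Hypothetical stand-in for a BaseCartesianData subclass."""

    def __init__(self, label, columns):
        self.label = label
        self.columns = columns

    @property
    def shape(self):
        # All columns are aligned, so any one of them gives the shape
        first = next(iter(self.columns.values()))
        return first.shape


def load_columns(filename):
    """Data loader: take a filename and return a data object."""
    table = np.genfromtxt(filename, delimiter=",", names=True)
    columns = {name: table[name] for name in table.dtype.names}
    label = os.path.splitext(os.path.basename(filename))[0]
    return ColumnStore(label, columns)


# Usage: write a small CSV file to a temporary directory and load it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.csv")
    with open(path, "w") as f:
        f.write("x,y\n1,2\n3,4\n5,6\n")
    data = load_columns(path)

print(data.label, data.shape)
```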
    

Example

As an example of a minimal custom data class, the following implements a (very uninteresting) dataset that simply generates values randomly in the range [0, 1) on the fly, and does not take subset states into account. A glue session is then created with one of these data objects:

import numpy as np

from glue.core.component_id import ComponentID
from glue.core.data import BaseCartesianData
from glue.utils import view_shape


class RandomData(BaseCartesianData):

    def __init__(self):
        super().__init__()
        self.data_cid = ComponentID(label='data', parent=self)

    @property
    def label(self):
        return "Random Data"

    @property
    def shape(self):
        return (512, 512, 512)

    @property
    def main_components(self):
        return [self.data_cid]

    def get_kind(self, cid):
        return 'numerical'

    def get_data(self, cid, view=None):
        if cid in self.pixel_component_ids:
            return super().get_data(cid, view=view)
        else:
            return np.random.random(view_shape(self.shape, view))

    def get_mask(self, subset_state, view=None):
        return subset_state.to_mask(self, view=view)

    def compute_statistic(self, statistic, cid,
                          axis=None, finite=True,
                          positive=False, subset_state=None,
                          percentile=None, random_subset=None):
        if axis is None:
            if statistic == 'minimum':
                return 0
            elif statistic == 'maximum':
                if cid in self.pixel_component_ids:
                    # Pixel coordinates run from 0 to n - 1 along each axis
                    return self.shape[cid.axis] - 1
                else:
                    return 1
            elif statistic == 'mean' or statistic == 'median':
                return 0.5
            elif statistic == 'percentile':
                return percentile / 100
            elif statistic == 'sum':
                return self.size / 2
        else:
            final_shape = tuple(self.shape[i] for i in range(self.ndim)
                                if i not in axis)
            return np.random.random(final_shape)

    def compute_histogram(self, cid,
                          range=None, bins=None, log=False,
                          subset_state=None, subset_group=None):
        return np.random.random(bins) * 100


# We now create a data object using the above class,
# and launch a glue session

from glue.core import DataCollection
from glue.app.qt.application import GlueApplication

d = RandomData()
dc = DataCollection([d])
ga = GlueApplication(dc)
ga.start()
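The compute_histogram() method in the example above returns random counts and ignores range, log, and subset_state. For data that do have real values, a one-dimensional version might look like the sketch below. This is not glue's implementation: histogram_1d is a standalone, hypothetical helper with its own simplified arguments (a single (min, max) range, a single bin count, and a precomputed boolean mask in place of a subset state), and it assumes that the range is strictly positive when log=True.

```python
import numpy as np


def histogram_1d(values, value_range, n_bins, log=False, mask=None):
    """Count `values` in `n_bins` equal-width bins spanning `value_range`.

    With log=True the bins have equal width in log10 space. `mask`, if
    given, restricts the histogram to a subset of the values.
    """
    values = np.asarray(values, dtype=float)
    if mask is not None:
        values = values[mask]
    vmin, vmax = value_range
    if log:
        # Non-positive values cannot be placed on a logarithmic axis
        values = np.log10(values[values > 0])
        vmin, vmax = np.log10(vmin), np.log10(vmax)
    counts, _ = np.histogram(values, bins=n_bins, range=(vmin, vmax))
    return counts


values = np.array([0.1, 0.4, 0.6, 0.9, 2.0])
in_subset = np.array([True, False, True, True, True])

# Values outside the requested range (here 2.0) are not counted
linear = histogram_1d(values, (0, 1), 2)
subset = histogram_1d(values, (0, 1), 2, mask=in_subset)
logarithmic = histogram_1d([2, 20, 200, 500], (1, 1000), 3, log=True)
print(linear, subset, logarithmic)
```

For very large or remote datasets the same counts could be accumulated chunk by chunk, since histogram counts over disjoint chunks simply add up.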