Defining your own data objects¶
By default, data objects in glue are instances of the
Data class, and this class assumes that the data are
stored in one or more local n-dimensional Numpy arrays. However, glue now
includes a way of defining a wider variety of data objects, which may rely for
example on large remote datasets, or datasets that are not inherently stored as
regular n-dimensional arrays.
The base class for all datasets is
BaseData, which is
intended to represent completely arbitrary datasets. However, glue does not yet
recognize classes that inherit directly from BaseData.
Instead, for now, the base class that can be used to define custom data objects is
BaseCartesianData, which inherits from
BaseData and requires data objects to present an
interface that looks like n-dimensional arrays (although the storage of the data
could still be unstructured). In future, we will also make it possible to
support a more generic interface for data access based on the BaseData class.
Main data interface¶
Before we dive in, we recommend that you take a look at the Working with Data objects
tutorial to understand how the default Data objects
work. The most important takeaway for our purposes is that glue
data objects are collections of attributes (components in glue-speak) that are
assumed to be aligned (for regular cartesian datasets, this means they are on
the same grid). For example a table consists of a collection of 1-d attributes
that are all the same length. A traditional image consists of a single 2-d
attribute. Attributes/components are identified in data objects by
ComponentID objects (though these objects don’t
contain the actual values – they are just a reference to that attribute).
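To make the idea of aligned attributes concrete, here is a minimal sketch that uses plain dictionaries of Numpy arrays as stand-ins for data objects. The `common_shape` helper and the attribute names are hypothetical and are not part of glue; they only illustrate that every attribute in one data object shares a single shape.

```python
import numpy as np

# A table: several 1-d attributes that all have the same length.
table = {
    "x": np.array([1.0, 2.5, 4.0]),
    "y": np.array([0.1, 0.2, 0.3]),
}

# A traditional image: a single 2-d attribute.
image = {"flux": np.zeros((4, 5))}


def common_shape(attributes):
    """Return the shape shared by all attributes, or raise if they differ.

    Hypothetical helper for illustration only; not a glue function.
    """
    shapes = {values.shape for values in attributes.values()}
    if len(shapes) != 1:
        raise ValueError("attributes are not aligned: %s" % sorted(shapes))
    return shapes.pop()


table_shape = common_shape(table)
image_shape = common_shape(image)
```

The shape returned here plays the role of the `shape` property described below: it is the shape of a single component, not of the whole collection.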
To define your own data object, you should write a class that inherits from
BaseCartesianData. You will then need to define a few
properties and methods for the data object to be usable in glue. The properties
you need to define are:
label: the name of the dataset, as a string
shape: the shape of a single component, given as a tuple
main_components: a list of all ComponentID objects that your data object recognizes, excluding coordinate components (more on this later).
The methods you need to define are:
get_kind(): given a ComponentID, return a string that should be either 'numerical' (for e.g. floating-point and integer attributes), 'categorical' (for e.g. string attributes), or 'datetime' (for attributes that use the np.datetime64 type).
get_data(): given a ComponentID and optionally a view, return a Numpy array. A view can be anything that can be used to slice a Numpy array, for example a single integer (view=4), a tuple of slices (view=(slice(1, 14, 2), slice(4, 50, 3))), a list of tuples of indices (view=[(1, 2, 3), (4, 3, 4)]), and so on. If a view is specified, only that subset of values should be returned. For example, if the data has an overall shape of (10,) and view=slice(1, 6, 2), get_data should return an array with shape (3,).
get_mask(): given a SubsetState object (described in Subset states) and optionally a view, return a boolean array describing which values are in the specified subset (where True indicates values inside the subset).
compute_statistic(): given a statistic name (e.g. 'mean') and a ComponentID, as well as optional keyword arguments (see compute_statistic()), return a statistic for the required component. In particular, one of the keyword arguments is subset_state, which can be used to indicate that the statistic should only be computed for a subset of values.
compute_histogram(): given a list of ComponentID objects, as well as optional keyword arguments (see compute_histogram()), compute an n-dimensional histogram of the required attributes. At the moment glue only makes use of this for one or two dimensions, though we may also use it for three dimensions in future.
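The view argument of get_data() can be illustrated with a small stand-alone sketch. The `ArrayBackedSketch` class below is hypothetical: it does not inherit from BaseCartesianData and holds a single in-memory array, so it runs without glue and shows only how a view maps onto Numpy indexing.

```python
import numpy as np


class ArrayBackedSketch:
    """Hypothetical stand-in for a data object holding one attribute."""

    def __init__(self, values):
        self._values = np.asarray(values)

    @property
    def shape(self):
        # Shape of a single component.
        return self._values.shape

    def get_data(self, cid, view=None):
        # cid is ignored because this sketch only has one attribute.
        if view is None:
            return self._values
        # Anything that can index a Numpy array is accepted as a view.
        return self._values[view]


d1 = ArrayBackedSketch(np.arange(10))
d2 = ArrayBackedSketch(np.arange(20 * 60).reshape(20, 60))

# A strided slice over a 1-d attribute selects elements 1, 3 and 5.
strided = d1.get_data("data", view=slice(1, 6, 2))

# A single integer selects one row of a 2-d attribute.
row = d2.get_data("data", view=4)

# A tuple of slices selects a regularly spaced block.
block = d2.get_data("data", view=(slice(1, 14, 2), slice(4, 50, 3)))

# A tuple of index sequences (one per axis) selects individual elements,
# here the elements at positions (1, 4), (2, 3) and (3, 4).
points = d2.get_data("data", view=((1, 2, 3), (4, 3, 4)))
```

A data object backed by a remote service would typically translate the view into a request for just those values rather than loading the whole array first; that translation depends on the service and is not shown here.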
Subset states¶
In the above section, we mentioned the concept of a
SubsetState. In glue parlance, a subset state is an
abstract representation of a subset in the data – for instance the subset of
data where a > 3 or the subset of data inside a polygon with vertices vx and
vy. Several of the methods in Main data interface can
take subset states, and so in your data object, you should decide how to
most efficiently implement each kind of subset state.
You can find a full list of the subset states defined in glue
in the API documentation, and in particular you can look at the documentation of each
one to understand how to access the relevant information. For example, if you
have an InequalitySubsetState, you can access the
relevant information as follows:
    >>> subset_state
    <InequalitySubsetState: (x > 1.2)>
    >>> subset_state.left
    x
    >>> subset_state.right
    1.2
    >>> subset_state.operator
    <built-in function gt>

Subset states also provide a to_mask method, which computes the mask for a given data object d:

    >>> subset_state.to_mask(d)
    array([False, True, True])
In this case, the subset state essentially accesses the data using
get_data(), so this may be very
inefficient for large datasets. Therefore, you may choose to re-interpret the
subset states and compute a mask yourself.
While developing your data class, one way to make sure that glue doesn’t crash if you haven’t yet implemented support for a specific subset state is to interpret any unimplemented subset state as simply indicating an empty subset.
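The two ideas above, re-interpreting a subset state yourself and falling back to an empty subset for anything not yet supported, can be sketched as follows. The `FakeInequalityState` and `UnsupportedState` classes and the `mask_for` function are hypothetical stand-ins so that the example runs without glue. They only mimic the left, right and operator attributes shown earlier; in glue, left is a ComponentID rather than a string, and right may itself refer to an attribute.

```python
import operator

import numpy as np


class FakeInequalityState:
    """Hypothetical stand-in exposing left, operator and right."""

    def __init__(self, left, op, right):
        self.left = left
        self.operator = op
        self.right = right


class UnsupportedState:
    """Hypothetical stand-in for a subset state not handled yet."""


def mask_for(state, columns, shape):
    """Return a boolean mask for ``state`` over attributes of ``shape``."""
    if isinstance(state, FakeInequalityState):
        # Re-interpret the state directly instead of calling to_mask,
        # so that the comparison could be pushed down to the data store.
        values = columns[state.left]
        return state.operator(values, state.right)
    # Fallback while developing: treat unimplemented states as empty.
    return np.zeros(shape, dtype=bool)


columns = {"x": np.array([1.0, 1.5, 2.0])}

mask = mask_for(FakeInequalityState("x", operator.gt, 1.2), columns, (3,))
empty = mask_for(UnsupportedState(), columns, (3,))
```

In a real data class this logic would live in get_mask(), and the inequality branch would be replaced by whatever query mechanism the underlying storage offers.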
Using your data object¶
Assuming you have written your own data class, there are several ways that you can start using it in glue:
You can define your own data importer, which is a function that can do anything you need and should return a list of instances of BaseCartesianData. This is more general than a data loader since a data importer doesn’t need to rely on a filename. For example, it might open a dialog in which you can log in to a remote service and browse available datasets.
You can start up glue programmatically, including constructing your data object. This is particularly useful when initially developing your custom data object:
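    from glue.core import DataCollection
    from glue.app.qt.application import GlueApplication

    # Construct your data object (MyCustomData is your own class)
    d = MyCustomData(...)

    # Create glue application and start
    dc = DataCollection([d])
    ga = GlueApplication(dc)
    ga.start()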
As an example of a minimal custom data class, the following implements a (very uninteresting) dataset that simply generates values randomly in the range [0, 1) on-the-fly, and does not take subset states into account. A glue session is then created with one of these data objects:
    import numpy as np

    from glue.core.component_id import ComponentID
    from glue.core.data import BaseCartesianData
    from glue.utils import view_shape


    class RandomData(BaseCartesianData):

        def __init__(self):
            super(RandomData, self).__init__()
            self.data_cid = ComponentID(label='data', parent=self)

        @property
        def label(self):
            return "Random Data"

        @property
        def shape(self):
            return (512, 512, 512)

        @property
        def main_components(self):
            return [self.data_cid]

        def get_kind(self, cid):
            return 'numerical'

        def get_data(self, cid, view=None):
            # Pixel coordinates are handled by the base class; everything
            # else is generated randomly with the shape implied by the view.
            if cid in self.pixel_component_ids:
                return super(RandomData, self).get_data(cid, view=view)
            else:
                return np.random.random(view_shape(self.shape, view))

        def get_mask(self, subset_state, view=None):
            return subset_state.to_mask(self, view=view)

        def compute_statistic(self, statistic, cid,
                              axis=None, finite=True,
                              positive=False, subset_state=None,
                              percentile=None, random_subset=None):
            if axis is None:
                if statistic == 'minimum':
                    return 0
                elif statistic == 'maximum':
                    if cid in self.pixel_component_ids:
                        return self.shape[cid.axis]
                    else:
                        return 1
                elif statistic == 'mean' or statistic == 'median':
                    return 0.5
                elif statistic == 'percentile':
                    return percentile / 100
                elif statistic == 'sum':
                    return self.size / 2
            else:
                # Accept a single axis as well as a tuple of axes
                if isinstance(axis, int):
                    axis = (axis,)
                final_shape = tuple(self.shape[i] for i in range(self.ndim)
                                    if i not in axis)
                return np.random.random(final_shape)

        def compute_histogram(self, cid,
                              range=None, bins=None, log=False,
                              subset_state=None, subset_group=None):
            return np.random.random(bins) * 100


    # We now create a data object using the above class,
    # and launch a glue session

    from glue.core import DataCollection
    from glue.app.qt.application import GlueApplication

    d = RandomData()
    dc = DataCollection([d])
    ga = GlueApplication(dc)
    ga.start()
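The compute_histogram method in RandomData returns random counts. For data that fits in memory, an array-backed class might instead delegate to Numpy, as in the following minimal sketch. The `histogram_sketch` helper and its `ranges` argument are hypothetical and not part of glue; the sketch assumes that the ranges are given as a list of (min, max) tuples and the bins as a list of integers, one entry per attribute, and it ignores the log and subset_state options.

```python
import numpy as np


def histogram_sketch(arrays, ranges, bins):
    """Histogram of one or more aligned arrays (hypothetical helper).

    arrays : list of Numpy arrays that all have the same shape
    ranges : list of (min, max) tuples, one per array
    bins : list of integers, one per array
    """
    # Flatten each attribute so that every row of `sample` is one point.
    sample = np.column_stack([np.ravel(a) for a in arrays])
    counts, _edges = np.histogramdd(sample, bins=bins, range=ranges)
    return counts


x = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([0.5, 0.5, 3.5, 3.5])

counts_1d = histogram_sketch([x], ranges=[(0, 4)], bins=[2])
counts_2d = histogram_sketch([x, y], ranges=[(0, 4), (0, 4)], bins=[2, 2])
```

The returned array has one axis per attribute, with lengths given by bins, which matches the shape of the array produced by the random placeholder in RandomData.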