# Defining your own data objects

## Background

By default, data objects in glue are instances of the `Data` class, and this
class assumes that the data are
stored in one or more local n-dimensional Numpy arrays. However, glue now
includes a way of defining a wider variety of data objects, which may rely for
example on large remote datasets, or datasets that are not inherently stored as
regular n-dimensional arrays.

The base class for all datasets is `BaseData`, which is
intended to represent completely arbitrary datasets. However, glue does not yet
recognize classes that inherit directly from `BaseData`.
Instead, for now, the base class that can be used to define custom data objects
is `BaseCartesianData`, which inherits from `BaseData` and requires data
objects to present an interface that looks like n-dimensional arrays (although
the storage of the data could still be unstructured). In future, we will also
make it possible to support a more generic interface for data access based on
the `BaseData` class.

## Main data interface

Before we dive in, we recommend that you take a look at the Working with Data objects
tutorial to understand how the default `Data` objects work. The most relevant
takeaway here is that glue data objects are collections of attributes
(*components* in glue-speak) that are assumed to be aligned (for regular
cartesian datasets, this means they are on the same grid). For example, a table
consists of a collection of 1-d attributes that are all the same length, and a
traditional image consists of a single 2-d attribute. Attributes/components are
identified in data objects by `ComponentID` objects (though these objects don’t
contain the actual values – they are just a reference to that attribute).

To define your own data object, you should write a class that inherits from
`BaseCartesianData`. You will then need to define a few
properties and methods for the data object to be usable in glue. The properties
you need to define are:

- `label`: the name of the dataset, as a string
- `shape`: the shape of a single component, given as a tuple
- `main_components`: a list of all `ComponentID` objects that your data object recognizes, excluding coordinate components (more on this later).

The methods you need to define are:

- `get_kind()`: given a `ComponentID`, return a string that should be either `'numerical'` (for e.g. floating-point and integer attributes), `'categorical'` (for e.g. string attributes), or `'datetime'` (for attributes that use the `np.datetime64` type).
- `get_data()`: given a `ComponentID` and optionally a *view*, return a Numpy array. A view can be anything that can be used to slice a Numpy array, for example a single integer (`view=4`), a tuple of slices (`view=(slice(1, 14, 2), slice(4, 50, 3))`), a list of tuples of indices (`view=[(1, 2, 3), (4, 3, 4)]`), and so on. If a view is specified, only that subset of values should be returned. For example, if the data has an overall shape of `(10,)` and `view=slice(1, 6, 2)`, `get_data` should return an array with shape `(3,)`. By default, `BaseCartesianData.get_data` will return values for pixel and world `ComponentID` objects as well as any linked `ComponentID`, so we recommend that your implementation calls `BaseCartesianData.get_data` for any `ComponentID` you do not expose yourself.
- `get_mask()`: given a `SubsetState` object (described in Subset states) and optionally a `view`, return a boolean array describing which values are in the specified subset (where True indicates values inside the subset).
- `compute_statistic()`: given a statistic name (e.g. `'mean'`) and a `ComponentID`, as well as optional keyword arguments (see `compute_statistic()`), return a statistic for the required component. In particular, one of the keyword arguments is `subset_state`, which can be used to indicate that the statistic should only be computed for a subset of values.
- `compute_histogram()`: given a list of `ComponentID` objects, as well as optional keyword arguments (see `compute_histogram()`), compute an n-dimensional histogram of the required attributes. At the moment glue only makes use of this for one or two dimensions, though we may also use it for three dimensions in future.
- `compute_fixed_resolution_buffer()`: given a list of bounds of the form `(min, max, n)` along each dimension of the data, return a fixed resolution buffer that goes from the respective `min` to `max` in `n` steps. Bounds can also be single scalars in cases where the fixed resolution buffer is a lower-dimensional slice. This method can optionally take a target dataset in the case where the fixed resolution buffer should be computed in the frame of reference of a different dataset, in which case the bounds should be interpreted in the frame of reference of the target dataset (but this is only needed if data linking is used). See `compute_fixed_resolution_buffer()` for a full list of arguments.
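As a rough illustration of how views map onto array shapes, the sketch below applies the kinds of views described for `get_data()` to plain Numpy arrays. The `values` and `image` arrays and the `get_values` helper are hypothetical stand-ins for the data of a single component and are not part of glue. Multi-axis views are written as tuples here, since recent Numpy versions expect multi-dimensional indices to be tuples.

```python
import numpy as np

# Hypothetical stand-in for one component with overall shape (10,)
values = np.arange(10)


def get_values(view=None):
    """Return all values, or only those selected by ``view``."""
    if view is None:
        return values
    return values[view]


# A single integer selects one element
print(get_values(view=4))

# A slice selects elements 1, 3 and 5, so the result has shape (3,)
print(get_values(view=slice(1, 6, 2)).shape)

# Hypothetical 2-d component with shape (20, 60)
image = np.arange(20 * 60).reshape(20, 60)

# A tuple of slices selects a regularly strided sub-grid
print(image[(slice(1, 14, 2), slice(4, 50, 3))].shape)

# A tuple of index sequences selects the individual points
# (1, 4), (2, 3) and (3, 4)
print(image[((1, 2, 3), (4, 3, 4))].shape)
```

A data object that generates or fetches values lazily only needs to produce an array of the shape the view selects, rather than materializing the full component first.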

## Subset states

In the above section, we mentioned the concept of a `SubsetState`. In glue
parlance, a *subset state* is an abstract representation of a subset in the
data – for instance *the subset of data where a > 3* or *the subset of data
inside a polygon with vertices vx and vy*. Several of the methods in Main data
interface can take subset states, and so in your data object, you should decide
how to most efficiently implement each kind of subset state.

You can find a full list of subset states defined in glue `here`, and in
particular you can look at the documentation of each one to understand how to
access the relevant information. For example, if you have an
`InequalitySubsetState`, you can access the relevant information as follows:

```
>>> subset_state
<InequalitySubsetState: (x > 1.2)>
>>> subset_state.left
x
>>> subset_state.right
1.2
>>> subset_state.operator
<built-in function gt>
```

`SubsetState` objects have a `to_mask()` method that can take a data
object and a view and return a mask:

```
>>> subset_state.to_mask(d)
array([False, True, True])
```

In this case, the subset state essentially accesses the data using
`get_data()`, so this may be very inefficient for large datasets. Therefore,
you may choose to re-interpret the subset states and compute a mask yourself.
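The `left`, `right`, and `operator` attributes shown earlier are enough to rebuild the comparison without going through `to_mask()`. The following is a minimal sketch under stated assumptions: `FakeInequality` is a hypothetical stand-in that only mimics those three attributes, `columns` is a hypothetical in-memory store keyed by attribute name, and `inequality_to_mask` is an illustrative helper, not a glue function. Real subset states refer to `ComponentID` objects rather than strings.

```python
import operator
from collections import namedtuple

import numpy as np

# Hypothetical stand-in exposing the same attributes as an inequality
# subset state (left operand, right operand, comparison operator)
FakeInequality = namedtuple("FakeInequality", ["left", "right", "operator"])

# Hypothetical storage: attribute name -> values
columns = {"x": np.array([1.0, 1.5, 2.0])}


def inequality_to_mask(subset_state, view=None):
    """Evaluate ``left <op> right`` directly against the stored values."""
    values = columns[subset_state.left]
    if view is not None:
        # Only evaluate the comparison for the requested values
        values = values[view]
    return subset_state.operator(values, subset_state.right)


state = FakeInequality(left="x", right=1.2, operator=operator.gt)
print(inequality_to_mask(state).tolist())
print(inequality_to_mask(state, view=slice(0, 2)).tolist())
```

In a real data class, the same idea could be taken further, for example by evaluating the comparison wherever the data actually live instead of loading the values first.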

While developing your data class, one way to make sure that glue doesn’t crash if you haven’t yet implemented support for a specific subset state is to interpret any unimplemented subset state as simply indicating an empty subset.
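One possible way to sketch this fallback is shown below. `RangeState` and `UnknownState` are hypothetical classes invented for the illustration, and `empty_mask` and `compute_mask` are illustrative helpers rather than glue functions; only the Numpy calls are real.

```python
import numpy as np


def empty_mask(shape, view=None):
    """Return an all-False mask with the shape that ``view`` selects."""
    # broadcast_to gives a zero-stride view, so no full-size array is
    # allocated before the view is applied
    mask = np.broadcast_to(False, shape)
    if view is not None:
        mask = mask[view]
    # Copy into a regular, writable array
    return np.array(mask)


class RangeState:
    """Hypothetical subset state that this sketch knows how to handle."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi


class UnknownState:
    """Hypothetical subset state with no dedicated implementation yet."""


def compute_mask(subset_state, values, view=None):
    if isinstance(subset_state, RangeState):
        selected = values if view is None else values[view]
        return (selected >= subset_state.lo) & (selected <= subset_state.hi)
    # Anything not handled above is treated as an empty subset
    return empty_mask(values.shape, view)


values = np.arange(10)
print(compute_mask(RangeState(2, 4), values).tolist())
print(compute_mask(UnknownState(), values, view=slice(1, 6, 2)).tolist())
```

The fallback mask has the shape that the view selects, so callers receive a correctly shaped result even for subset states that are not implemented yet.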

## Linking

You should be able to link data objects that inherit from
`BaseCartesianData` with other datasets. However, for this to work properly,
you should make sure that your implementation of `get_data()` calls
`BaseCartesianData.get_data` for any unrecognized `ComponentID`, as the base
implementation will handle returning linked values.

## Using your data object

Assuming you have written your own data class, there are several ways that you can start using it in glue:

- You can define your own data loader, which is a function that takes a filename and should return an instance of a subclass of `BaseCartesianData`.
- You can define your own data importer, which is a function that can do anything you need, for example showing a dialog, and should return a list of instances of `BaseCartesianData`. This is more general than a data loader since a data importer doesn’t need to rely on a filename. It might include for example opening a dialog in which you can log in to a remote service and browse available datasets.
- You can start up glue programmatically, including constructing your data object. This is particularly useful when initially developing your custom data object:

  ```
  from glue.core import DataCollection
  from glue.app.application import GlueApplication

  # Construct your data object
  d = MyCustomData(...)

  # Create glue application and start
  dc = DataCollection([d])
  ga = GlueApplication(dc)
  ga.start()
  ```

## Example

As an example of a minimal custom data class, the following implements a (very uninteresting) dataset that simply generates random values in the range [0, 1) on the fly, and does not take subset states into account. A glue session is then created with one of these data objects:

```
import numpy as np

from glue.core.component_id import ComponentID
from glue.core.data import BaseCartesianData
from glue.utils import view_shape


class RandomData(BaseCartesianData):

    def __init__(self):
        super(RandomData, self).__init__()
        # The single attribute exposed by this dataset
        self.data_cid = ComponentID(label='data', parent=self)

    @property
    def label(self):
        return "Random Data"

    @property
    def shape(self):
        return (512, 512, 512)

    @property
    def main_components(self):
        return [self.data_cid]

    def get_kind(self, cid):
        return 'numerical'

    def get_data(self, cid, view=None):
        if cid is self.data_cid:
            # Only generate as many values as the view selects
            return np.random.random(view_shape(self.shape, view))
        else:
            # Let the base class handle pixel, world and linked components
            return super(RandomData, self).get_data(cid, view=view)

    def get_mask(self, subset_state, view=None):
        return subset_state.to_mask(self, view=view)

    def compute_statistic(self, statistic, cid,
                          axis=None, finite=True,
                          positive=False, subset_state=None,
                          percentile=None, random_subset=None):
        if axis is None:
            if statistic == 'minimum':
                return 0
            elif statistic == 'maximum':
                if cid in self.pixel_component_ids:
                    return self.shape[cid.axis]
                else:
                    return 1
            elif statistic == 'mean' or statistic == 'median':
                return 0.5
            elif statistic == 'percentile':
                return percentile / 100
            elif statistic == 'sum':
                return self.size / 2
        else:
            # Keep only the dimensions that are not collapsed
            final_shape = tuple(self.shape[i] for i in range(self.ndim)
                                if i not in axis)
            return np.random.random(final_shape)

    def compute_histogram(self, cid,
                          range=None, bins=None, log=False,
                          subset_state=None, subset_group=None):
        return np.random.random(bins) * 100


# We now create a data object using the above class,
# and launch a glue session

from glue.core import DataCollection
from glue.app.application import GlueApplication

d = RandomData()
dc = DataCollection([d])
ga = GlueApplication(dc)
ga.start()
```