Data

class glue.core.data.Data(label='', coords=None, **kwargs)[source]

Bases: glue.core.data.BaseCartesianData

The basic data container in Glue.

The data object stores data as a collection of Component objects. Each component stored in a dataset must have the same shape.

Catalog data sets are stored such that each column is a distinct 1-dimensional Component.

There are several ways to extract the actual numerical data stored in a Data object:

data = Data(x=[1, 2, 3], label='data')
xid = data.id['x']

data[xid]
data.get_component(xid).data
data['x']  # if 'x' is a unique component name

Likewise, datasets support fancy indexing:

data[xid, 0:2]
data[xid, [True, False, True]]

See also: Working with Data objects

Parameters
label : str

The name of the dataset

coords : Coordinates

The coordinates object to use to define world coordinates

Attributes Summary

components

All ComponentIDs in the Data

coordinate_components

The ComponentIDs associated with a CoordinateComponent

coordinate_links

A list of the ComponentLinks that connect pixel and world.

coords

The coordinates object for the data.

derived_components

The ComponentIDs for each DerivedComponent

derived_links

A list of the links present inside all of the DerivedComponent objects in this dataset.

externally_derivable_components

label

The name of the dataset

links

A list of all the links internal to the dataset.

main_components

ndim

The number of dimensions of the data, as an integer.

pixel_aligned_data

Information about other datasets in the same data collection that have matching or a subset of pixel component IDs.

pixel_component_ids

The ComponentIDs for each pixel coordinate.

primary_components

The ComponentIDs not associated with a DerivedComponent

shape

The n-dimensional shape of the dataset, as a tuple.

size

The size of the data (the product of the shape dimensions), as an integer.

visible_components

All ComponentIDs in the Data that aren’t coordinates.

world_component_ids

The ComponentIDs for each world coordinate.

Methods Summary

add_component(self, component, label)

Add a new component to this data set.

add_component_link(self, link[, label])

Shortcut method for generating a new DerivedComponent from a ComponentLink object, and adding it to a data set.

component_ids(self)

Equivalent to Data.components

compute_histogram(self, cids[, weights, …])

Compute an n-dimensional histogram with regularly spaced bins.

compute_statistic(self, statistic, cid[, …])

Compute a statistic for the data.

dtype(self, cid)

Look up the dtype for the data associated with a ComponentID

find_component_id(self, label)

Retrieve component_ids associated by label name.

get_component(self, component_id)

Fetch the component corresponding to component_id.

get_data(self, cid[, view])

Get the data values for a given component

get_kind(self, cid)

Get the kind of data for a given component.

get_mask(self, subset_state[, view])

Get a boolean mask for a given subset state.

join_on_key(self, other, cid, cid_other)

Create an element mapping to another dataset, by joining on values of ComponentIDs in both datasets.

remove_component(self, component_id)

Remove a component from a data set

reorder_components(self, component_ids)

Reorder the components using a list of component IDs.

to_dataframe(self[, index])

Convert the Data object into a pandas.DataFrame object

update_components(self, mapping)

Change the numerical data associated with some of the Components in this Data object.

update_id(self, old, new)

Reassign a component to a different glue.core.component_id.ComponentID

update_values_from_data(self, data)

Replace numerical values in data to match values from another dataset.

Attributes Documentation

components

All ComponentIDs in the Data

Return type

list

coordinate_components

The ComponentIDs associated with a CoordinateComponent

Return type

list

coordinate_links

A list of the ComponentLinks that connect pixel and world. If no coordinate transformation object is present, return an empty list.

coords

The coordinates object for the data.

derived_components

The ComponentIDs for each DerivedComponent

Return type

list

derived_links

A list of the links present inside all of the DerivedComponent objects in this dataset.

externally_derivable_components

label

The name of the dataset

links

A list of all the links internal to the dataset.

main_components

ndim

The number of dimensions of the data, as an integer.

pixel_aligned_data

Information about other datasets in the same data collection that have matching or a subset of pixel component IDs.

This is returned as a dictionary where each key is a dataset with matching pixel component IDs, and the value is the order in which the pixel component IDs of the other dataset can be found in the current one.

pixel_component_ids

The ComponentIDs for each pixel coordinate.

primary_components

The ComponentIDs not associated with a DerivedComponent

This property is deprecated.

shape

The n-dimensional shape of the dataset, as a tuple.

size

The size of the data (the product of the shape dimensions), as an integer.

visible_components

All ComponentIDs in the Data that aren’t coordinates.

This property is deprecated.

world_component_ids

The ComponentIDs for each world coordinate.

Methods Documentation

add_component(self, component, label)[source]

Add a new component to this data set.

Parameters
Raises

TypeError if label is invalid
ValueError if the component has an incompatible shape

Returns

The ComponentID associated with the newly-added component

add_component_link(self, link, label=None)[source]

Shortcut method for generating a new DerivedComponent from a ComponentLink object, and adding it to a data set.

Parameters
link : ComponentLink

The link to use to generate a new component

label : ComponentID or str

The ComponentID or label to attach to.

Returns
component : DerivedComponent

The component that was added

component_ids(self)[source]

Equivalent to Data.components

compute_histogram(self, cids, weights=None, range=None, bins=None, log=None, subset_state=None)[source]

Compute an n-dimensional histogram with regularly spaced bins.

Currently this only implements 1-D histograms.

Parameters
cids : list of str or ComponentID

Component IDs to compute the histogram over

weights : str or ComponentID

Component ID to use for the histogram weights

range : list of tuple

The (min, max) of the histogram range

bins : list of int

The number of bins

log : list of bool

Whether to compute the histogram in log space

subset_state : SubsetState, optional

If specified, the histogram will only take into account values in the subset state.
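To make "regularly spaced bins" concrete for the 1-D case, here is a minimal pure-Python sketch. It does not call glue and is not glue's implementation. The function name `histogram_1d` is hypothetical, and the edge handling is assumed for the example: out-of-range values are ignored, the upper edge falls in the last bin, and log spacing means equal-width bins in log10 space.

```python
import math

def histogram_1d(values, value_range, bins, log=False, weights=None):
    """Count values into `bins` regularly spaced bins over `value_range`."""
    lo, hi = value_range
    if weights is None:
        weights = [1.0] * len(values)
    if log:
        # Assumption: log spacing means equal-width bins in log10 space.
        lo, hi = math.log10(lo), math.log10(hi)
    width = (hi - lo) / bins
    counts = [0.0] * bins
    for value, weight in zip(values, weights):
        if log:
            if value <= 0:
                continue  # non-positive values have no logarithm
            value = math.log10(value)
        if not (lo <= value <= hi):
            continue  # assumption: out-of-range values are ignored
        # Assumption: the upper edge belongs to the last bin.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += weight
    return counts

linear = histogram_1d([1, 2, 2, 3, 10], value_range=(0, 4), bins=4)
logarithmic = histogram_1d([1, 10, 100, 1000], value_range=(1, 1000),
                           bins=3, log=True)
```

In the linear case the value 10 lies outside the range (0, 4) and is not counted. In the log case the bin edges sit at 1, 10, 100 and 1000.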

compute_statistic(self, statistic, cid, subset_state=None, axis=None, finite=True, positive=False, percentile=None, view=None, random_subset=None, n_chunk_max=40000000)[source]

Compute a statistic for the data.

Parameters
statistic : {'minimum', 'maximum', 'mean', 'median', 'sum', 'percentile'}

The statistic to compute

cid : ComponentID or str

The component ID to compute the statistic on. If given as a string, this is assumed to refer to a component belonging to the dataset (not external links).

subset_state : SubsetState

If specified, the statistic will only include the values that are in the subset specified by this subset state.

axis : None or int or tuple of int

If specified, the axis/axes to compute the statistic over.

finite : bool, optional

Whether to include only finite values in the statistic. This should be True to ignore NaN/Inf values

positive : bool, optional

Whether to include only (strictly) positive values in the statistic. This is used for example when computing statistics of data shown in log space.

percentile : float, optional

If statistic is 'percentile', the percentile argument should be given and specify the percentile to calculate, in the range 0 to 100

random_subset : int, optional

If specified, this should be an integer giving the number of values to use for the statistic. This can only be used if axis is None

n_chunk_max : int, optional

If there are more elements in the array than this value, operate in chunks with at most this size.
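The effect of the finite and positive flags can be pictured as a filter applied before the statistic is computed. The sketch below uses only the standard library. The helper `filtered_statistic` is hypothetical, it is not how glue computes statistics, and it leaves out 'percentile', subsets, axes and chunking.

```python
import math
import statistics

def filtered_statistic(values, statistic, finite=True, positive=False):
    """Apply the finite/positive filters, then compute a simple statistic."""
    kept = [
        v for v in values
        if (not finite or math.isfinite(v)) and (not positive or v > 0)
    ]
    functions = {
        'minimum': min,
        'maximum': max,
        'mean': statistics.fmean,
        'median': statistics.median,
        'sum': math.fsum,
    }
    return functions[statistic](kept)

values = [-1.0, 2.0, float('nan'), 4.0, float('inf')]

# finite=True drops NaN and Inf, leaving -1, 2 and 4.
mean_finite = filtered_statistic(values, 'mean')

# positive=True additionally drops -1, leaving 2 and 4.
min_positive = filtered_statistic(values, 'minimum', positive=True)
```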

dtype(self, cid)[source]

Look up the dtype for the data associated with a ComponentID

find_component_id(self, label)[source]

Retrieve component_ids associated by label name.

Parameters

label – ComponentID or string to search for

Returns

The associated ComponentID if label is found and unique, else None. First, this checks whether the component ID is present and unique in the primary (non-derived) components of the data, and if not then the derived components are checked. If there is one instance of the label in the primary and one in the derived components, the primary one takes precedence.
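The precedence rule can be sketched with plain lists of labels. This is an illustration rather than glue's code: `find_unique_label` is a hypothetical helper that returns the group name and position instead of a ComponentID. The description does not say what happens when a label occurs more than once among the primary components, so the sketch reads it literally and moves on to the derived components in that case.

```python
def find_unique_label(label, primary_labels, derived_labels):
    """Return (group, index) for a unique match, else None.

    Primary components are searched first; derived components are only
    consulted when the primary search does not give exactly one match.
    """
    for group_name, labels in (('primary', primary_labels),
                               ('derived', derived_labels)):
        matches = [i for i, name in enumerate(labels) if name == label]
        if len(matches) == 1:
            return group_name, matches[0]
    return None

primary = ['x', 'y']
derived = ['y', 'z', 'z']

y_match = find_unique_label('y', primary, derived)  # in both groups: primary wins
z_match = find_unique_label('z', primary, derived)  # duplicated in derived
w_match = find_unique_label('w', primary, derived)  # not present at all
```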

get_component(self, component_id)[source]

Fetch the component corresponding to component_id.

Parameters

component_id – the component_id to retrieve

get_data(self, cid, view=None)[source]

Get the data values for a given component

Parameters
cid : ComponentID

The component ID to get the data for

view

The ‘view’ on the data - anything that is considered a valid Numpy slice/index.

get_kind(self, cid)[source]

Get the kind of data for a given component.

Parameters
cid : ComponentID

The component ID to get the data kind for

Returns
kind : {'numerical', 'categorical', 'datetime'}

The kind of data for the given component ID.

get_mask(self, subset_state, view=None)[source]

Get a boolean mask for a given subset state.

Parameters
subset_state : SubsetState

The subset state to use to compute the mask

view

The ‘view’ on the mask - anything that is considered a valid Numpy slice/index.

join_on_key(self, other, cid, cid_other)[source]

Create an element mapping to another dataset, by joining on values of ComponentIDs in both datasets.

This join allows any subsets defined on other to be propagated to self. The different ways to call this method are described in the Examples section below.

Parameters
other : Data

Data object to join with

cid : str or ComponentID or iterable

Component(s) in this dataset to use as a key

cid_other : str or ComponentID or iterable

Component(s) in the other dataset to use as a key

Examples

There are several ways to use this function, depending on how many components are passed to cid and cid_other.

Joining on single components

First, one can specify a single component ID for both cid and cid_other: this is the standard mode, and joins one component from one dataset to the other:

>>> d1 = Data(x=[1, 2, 3, 4, 5], k1=[0, 0, 1, 1, 2], label='d1')
>>> d2 = Data(y=[2, 4, 5, 8, 4], k2=[1, 3, 1, 2, 3], label='d2')
>>> d2.join_on_key(d1, 'k2', 'k1')

Selecting all values in d1 where x is greater than 2 returns the last three items as expected:

>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] > 2
>>> s.to_mask()
array([False, False,  True,  True,  True], dtype=bool)

The linking was done between k1 and k2, and the values of k1 for the last three items are 1 and 2 - this means that the first, third, and fourth item in d2 will then get selected, since k2 has a value of either 1 or 2 for these items.

>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] > 2
>>> s.to_mask()
array([ True, False,  True,  True, False], dtype=bool)
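The propagation rule in this example can be reproduced with plain Python lists. The sketch below illustrates the semantics described above and is not glue's implementation; `propagate_selection` is a hypothetical helper.

```python
def propagate_selection(mask_other, keys_other, keys_self):
    """Select items in `self` whose key matches a selected item in `other`."""
    selected_keys = {k for k, m in zip(keys_other, mask_other) if m}
    return [k in selected_keys for k in keys_self]

x = [1, 2, 3, 4, 5]
k1 = [0, 0, 1, 1, 2]   # key column of d1
k2 = [1, 3, 1, 2, 3]   # key column of d2

mask_d1 = [value > 2 for value in x]              # selection defined on d1
mask_d2 = propagate_selection(mask_d1, k1, k2)    # same selection seen from d2
```

The selected keys are 1 and 2, so `mask_d2` marks the first, third and fourth items of d2, matching the output shown above.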

Joining on multiple components

Note

This mode is currently slow, and will be optimized significantly in future.

Next, one can specify several components for each dataset: in this case, the number of components given should match for both datasets. This causes items in both datasets to be linked when (and only when) the set of keys match between the two datasets:

>>> d1 = Data(x=[1, 2, 3, 5, 5],
...           y=[0, 0, 1, 1, 2], label='d1')
>>> d2 = Data(a=[2, 5, 5, 8, 4],
...           b=[1, 3, 2, 2, 3], label='d2')
>>> d2.join_on_key(d1, ('a', 'b'), ('x', 'y'))

Selecting all items in d1 where x is 5 works as expected and selects the last two items:

>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] == 5
>>> s.to_mask()
array([False, False, False,  True,  True], dtype=bool)

If we apply this selection to d2, only items where a is 5 and b is 2 will be selected:

>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] == 5
>>> s.to_mask()
array([False, False,  True, False, False], dtype=bool)

and in particular, the second item (where a is 5 and b is 3) is not selected.
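With several key components on each side, the match is on the whole tuple of key values. A minimal sketch of that rule follows, again in plain Python rather than glue's own code; `propagate_selection_multi` is a hypothetical name.

```python
def propagate_selection_multi(mask_other, key_columns_other, key_columns_self):
    """Select items in `self` whose key tuple matches a selected item in `other`."""
    rows_other = zip(*key_columns_other)          # one tuple of keys per item
    selected = {row for row, m in zip(rows_other, mask_other) if m}
    return [row in selected for row in zip(*key_columns_self)]

x = [1, 2, 3, 5, 5]
y = [0, 0, 1, 1, 2]
a = [2, 5, 5, 8, 4]
b = [1, 3, 2, 2, 3]

mask_d1 = [value == 5 for value in x]
mask_d2 = propagate_selection_multi(mask_d1, (x, y), (a, b))
```

The selected key tuples are (5, 1) and (5, 2). Only the third item of d2, with (a, b) equal to (5, 2), matches one of them.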

One-to-many and many-to-one joining

Finally, you can specify one component in one dataset and multiple ones in the other. In the case where one component is specified for this dataset and multiple ones for the other dataset, then when an item is selected in the other dataset, it will cause any item in the present dataset which matches any of the keys in the other data to be selected:

>>> d1 = Data(x=[1, 2, 3], label='d1')
>>> d2 = Data(a=[1, 1, 2],
...           b=[2, 3, 3], label='d2')
>>> d1.join_on_key(d2, 'x', ('a', 'b'))

In this case, if we select all items in d2 where a is 2, this will select the third item:

>>> s = d2.new_subset()
>>> s.subset_state = d2.id['a'] == 2
>>> s.to_mask()
array([False, False,  True], dtype=bool)

Since we have joined the datasets using both a and b, we select all items in d1 where x equals the value of either a or b for the selected item (2 or 3), which means we select the second and third items:

>>> s = d1.new_subset()
>>> s.subset_state = d2.id['a'] == 2
>>> s.to_mask()
array([False,  True,  True], dtype=bool)

We can also join the datasets the other way around:

>>> d1 = Data(x=[1, 2, 3], label='d1')
>>> d2 = Data(a=[1, 1, 2],
...           b=[2, 3, 3], label='d2')
>>> d2.join_on_key(d1, ('a', 'b'), 'x')

In this case, selecting items in d1 where x is 1 selects the first item, as expected:

>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] == 1
>>> s.to_mask()
array([ True, False, False], dtype=bool)

This then causes any item in d2 where either a or b is 1 to be selected, i.e. the first two items:

>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] == 1
>>> s.to_mask()
array([ True,  True, False], dtype=bool)
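Both directions of the mixed case reduce to an "any key matches" test. The two helpers below are hypothetical and only mirror the behaviour described in these examples; they are not taken from glue.

```python
def one_to_many(mask_other, key_columns_other, keys_self):
    """`self` has one key column, `other` has several."""
    selected = set()
    for row, m in zip(zip(*key_columns_other), mask_other):
        if m:
            selected.update(row)   # every key of a selected item counts
    return [k in selected for k in keys_self]

def many_to_one(mask_other, keys_other, key_columns_self):
    """`self` has several key columns, `other` has one."""
    selected = {k for k, m in zip(keys_other, mask_other) if m}
    return [any(k in selected for k in row) for row in zip(*key_columns_self)]

x = [1, 2, 3]
a = [1, 1, 2]
b = [2, 3, 3]

# d1.join_on_key(d2, 'x', ('a', 'b')): select a == 2 in d2, view it from d1.
mask_d1 = one_to_many([value == 2 for value in a], (a, b), x)

# d2.join_on_key(d1, ('a', 'b'), 'x'): select x == 1 in d1, view it from d2.
mask_d2 = many_to_one([value == 1 for value in x], x, (a, b))
```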

remove_component(self, component_id)[source]

Remove a component from a data set

Parameters

component_id (ComponentID) – the component to remove

reorder_components(self, component_ids)[source]

Reorder the components using a list of component IDs. The new set of component IDs has to match the existing set (though order may differ).
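The matching requirement amounts to a set comparison between the old and new lists. A small sketch of such a check is shown here. The function name `reorder_ids` and the use of ValueError are assumptions for illustration, not details taken from glue.

```python
def reorder_ids(existing, new_order):
    """Return `new_order` if it is a permutation of `existing`."""
    if len(new_order) != len(existing) or set(new_order) != set(existing):
        # Assumption: a mismatch is reported as a ValueError.
        raise ValueError("new order must contain exactly the existing IDs")
    return list(new_order)

existing = ['x', 'y', 'z']
reordered = reorder_ids(existing, ['z', 'x', 'y'])   # same set, new order
```

Passing a list that drops or adds an ID, such as `['x', 'y']`, fails the check.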

to_dataframe(self, index=None)[source]

Convert the Data object into a pandas.DataFrame object

Parameters

index – Any ‘index-like’ object that can be passed to the pandas.Series constructor

Returns

pandas.DataFrame

update_components(self, mapping)[source]

Change the numerical data associated with some of the Components in this Data object.

All changes to component numerical data should use this method, which broadcasts the state change to the appropriate places.

Parameters

mapping – A dict mapping Components or ComponentIDs to arrays.

This method has the following restrictions:
  • New components must have the same shape as old components

  • Component subclasses cannot be updated.

update_id(self, old, new)[source]

Reassign a component to a different glue.core.component_id.ComponentID

Parameters
old : glue.core.component_id.ComponentID

The old component ID

new : glue.core.component_id.ComponentID

The new component ID

update_values_from_data(self, data)[source]

Replace numerical values in data to match values from another dataset.

Notes

This method drops components that aren’t present in the new data, and adds components that are in the new data but were not in the original data. The matching is done by component label, and components are resized if needed. This means that for components with matching labels in the original and new data, the ComponentIDs are preserved, and existing plots and selections will be updated to reflect the new values. Note that the coordinates are also copied, but the style is not copied.
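The label-matching behaviour in these notes can be modelled with dictionaries. This is a simplified sketch, not glue's implementation: components are represented as `{label: (component_id, values)}`, the IDs are plain strings, and `update_values` is a hypothetical helper.

```python
import itertools

_new_ids = itertools.count(1)  # hypothetical source of fresh IDs

def update_values(current, incoming):
    """Return the components after an update, matching by label.

    current:  {label: (component_id, values)}
    incoming: {label: values}
    """
    updated = {}
    for label, values in incoming.items():
        if label in current:
            component_id = current[label][0]          # same label: keep the ID
        else:
            component_id = f"new-{next(_new_ids)}"    # new label: new ID
        updated[label] = (component_id, list(values))
    # Labels present in `current` but absent from `incoming` are dropped.
    return updated

current = {'x': ('id-x', [1, 2, 3]), 'y': ('id-y', [4, 5, 6])}
result = update_values(current, {'x': [7, 8], 'z': [9, 10]})
```

Here x keeps its ID while taking the new (shorter) values, y is dropped, and z is added with a new ID.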