cryosparc.dataset#

Classes and utilities for loading, saving and working with .cs Dataset files. A pure-C interface to dataset handles is also available.

A Dataset can represent any kind of data: particles, volumes, micrographs, etc.

A Result is a dataset plus field names and other metadata.

Datasets are lightweight: many may be in use at any time, e.g., one per micrograph during picking.

The only required field is uid. This field is automatically added to every new dataset.

Datasets are created in one of the following ways:

  • allocated empty with a specific size and field definitions

  • from a previous dataset source that already has uids (file, record array)

  • by appending datasets to each other or joining on uid

Dataset supports:

  • adding new rows (via appending)

  • adding new fields

  • joining fields from another dataset on UID
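
A minimal sketch of these operations; the non-uid field names and values below are made up for illustration:

>>> from cryosparc.dataset import Dataset
>>> a = Dataset.allocate(size=2, fields=[("x", "f4")])  # empty dataset with 2 rows
>>> b = Dataset({"uid": a["uid"], "y": [1.0, 2.0]})     # built from existing uids
>>> joined = a.innerjoin(b)                             # join fields on uid
>>> sorted(joined.fields())
['uid', 'x', 'y']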

Data:

CSDAT_FORMAT

Compressed stream .cs file format.

DEFAULT_FORMAT

Default save .cs file format.

NEWEST_FORMAT

Newest save .cs file format.

NUMPY_FORMAT

Numpy-array .cs file format.

Classes:

Dataset(allocate, ...[, row_class])

Accessor class for working with CryoSPARC .cs files.

Functions:

generate_uids([num])

Generate the given number of random 64-bit unsigned integer uids.

cryosparc.dataset.CSDAT_FORMAT = 2#

Compressed stream .cs file format. Same as NEWEST_FORMAT.

cryosparc.dataset.DEFAULT_FORMAT = 1#

Default save .cs file format. Same as NUMPY_FORMAT.

class cryosparc.dataset.Dataset(allocate: int | Dataset[Any] | NDArray | cryosparc.core.Data | Mapping[str, ArrayLike] | List[Tuple[str, ArrayLike]] | None = 0, row_class=<class 'cryosparc.row.Row'>)#

Accessor class for working with CryoSPARC .cs files.

A dataset may be initialized with Dataset(data) where data is one of the following:

  • A number of items to allocate (e.g., 42)

  • A mapping from column names to their contents (dict or list of tuples)

  • A numpy record array

Parameters:
  • allocate (int | Dataset | NDArray | Mapping[str, ArrayLike], optional) – Allocation data, as described above. Defaults to 0.

  • row_class (Type[Row], optional) – Class to use for row instances produced by this dataset. Defaults to Row.

Examples

Initialize a dataset

>>> dset = Dataset([
...     ("uid", [1, 2, 3]),
...     ("dat1", ["Hello", "World", "!"]),
...     ("dat2", [3.14, 2.71, 1.61])
... ])
>>> dset.descr()
[('uid', '<u8'), ('dat1', '|O'), ('dat2', '<f8')]

Load a dataset from disk

>>> from cryosparc.dataset import Dataset
>>> dset = Dataset.load('/path/to/particles.cs')
>>> for particle in dset.rows():
...     print(
...         f"Particle located in file {particle['blob/path']} "
...         f"at index {particle['blob/idx']}")

Methods:

add_fields()

Adds the given fields to the dataset.

allocate([size, fields])

Allocate a dataset with the given number of rows and specified fields.

append(*others[, assert_same_fields, ...])

Concatenate many datasets together into one new one.

append_many(*datasets[, assert_same_fields, ...])

Similar to Dataset.append.

cols()

Get current dataset columns, organized by field.

common_fields(*datasets[, assert_same_fields])

Get a list of fields common to all given datasets.

copy()

Create a deep copy of the current dataset.

copy_fields(old_fields, new_fields)

Copy the values at the given old fields into the new fields, allocating them if necessary.

descr([exclude_uid])

Get numpy-compatible description for dataset fields.

drop_fields(names, *[, copy])

Remove the given field names from the dataset.

extend(*others[, repeat_allowed])

Add the given dataset(s) to the end of the current dataset.

fields([exclude_uid])

Get a list of field names available in this dataset.

filter_fields(names, *[, copy])

Keep only the given fields from the dataset.

filter_prefix(keep_prefix, *[, rename, copy])

Similar to filter_prefixes but for a single prefix.

filter_prefixes(prefixes, *[, copy])

Similar to filter_fields, except takes list of prefixes.

from_async_stream(stream)

Asynchronously load from the given binary stream.

handle()

Numeric dataset handle for working with the dataset via C APIs (documentation is not yet available).

innerjoin(*others[, assert_no_drop])

Create a new dataset with fields from all provided datasets and only including rows common to all provided datasets (based on UID)

innerjoin_many(*datasets)

Similar to Dataset.innerjoin.

inspect(file)

Given a path to a dataset file, get information included in its header.

interlace(*datasets[, assert_same_fields])

Combine the current dataset with one or more datasets of the same length by alternating rows from each dataset.

is_equivalent(other)

Check whether two datasets contain the same data, regardless of field order.

load(file, *[, prefixes, fields, cstrs])

Read a dataset from path or file handle.

mask(mask)

Get a subset of the dataset that matches the given boolean mask of rows.

prefixes()

List of field prefixes available in this dataset, assuming fields have the format {prefix}/{field}.

query(query)

Get a subset of data based on whether the fields match the values in the given query.

query_mask(query[, invert])

Get a boolean array representing the items to keep in the dataset that match the given query filter.

reassign_uids()

Reset all values of the uid column to new unique random values.

rename_field(current_name, new_name, *[, copy])

Change name of a dataset field based on the given mapping.

rename_fields(field_map, *[, copy])

Change the name of dataset fields based on the given mapping.

rename_prefix(old_prefix, new_prefix, *[, copy])

Similar to rename_fields, except changes the prefix of all fields with the given old_prefix to new_prefix.

replace(query, *others[, assume_disjoint, ...])

Replaces values matching the given query with others.

rows()

A row-by-row accessor list for items in this dataset.

save(file[, format])

Save a dataset to the given path or I/O buffer.

slice([start, stop, step])

Get subset of the dataset with rows in the given range.

split_by(field)

Create a mapping from each possible value of the given field to a dataset containing only the rows with that value.

stream([compression])

Generate a binary representation for this dataset.

subset(rows)

Get a subset of dataset that only includes the given list of rows (from this dataset).

take(indices)

Get a subset of data with only the matching list of row indices.

to_cstrs(*[, copy])

Convert all Python string columns to C strings.

to_list([exclude_uid])

Convert to a list of lists, each value of the outer list representing one dataset row.

to_pystrs(*[, copy])

Convert all C string columns to Python strings.

to_records([fixed])

Convert to a numpy record array.

union(*others[, assert_same_fields, ...])

Take the row union of all the given datasets, based on their uid fields.

union_many(*datasets[, assert_same_fields, ...])

Similar to Dataset.union.

add_fields(fields: Sequence[Tuple[str, str] | Tuple[str, str, Tuple[int, ...]]]) Dataset[R]#
add_fields(fields: Sequence[str], dtypes: str | Sequence['DTypeLike']) Dataset[R]

Adds the given fields to the dataset. If a field with the same name already exists, that field will not be added (even if types don’t match). Fields are initialized with zeros (or “” for object fields).

Parameters:
  • fields (list[str] | list[Field]) – Field names or field descriptions to add. If a list of names is specified, the second dtypes argument must also be specified.

  • dtypes (str | list[DTypeLike], optional) – String with comma-separated data type names or list of data types. Must be specified if the fields argument is a list of strings. Defaults to None.

Returns:

self with added fields

Return type:

Dataset

Examples

>>> dset = Dataset(3)
>>> dset.add_fields(
...     ['foo', 'bar'],
...     ['u8', ('f4', (2,))]
... )
Dataset([
    ('uid', [14727850622008419978 309606388100339041 15935837513913527085]),
    ('foo', [0 0 0]),
    ('bar', [[0. 0.] [0. 0.] [0. 0.]]),
])
>>> dset.add_fields([('baz', "O")])
Dataset([
    ('uid', [14727850622008419978 309606388100339041 15935837513913527085]),
    ('foo', [0 0 0]),
    ('bar', [[0. 0.] [0. 0.] [0. 0.]]),
    ('baz', ["" "" ""]),
])
classmethod allocate(size: int = 0, fields: Sequence[Tuple[str, str] | Tuple[str, str, Tuple[int, ...]]] = [])#

Allocate a dataset with the given number of rows and specified fields.

Parameters:
  • size (int, optional) – Number of rows to allocate. Defaults to 0.

  • fields (list[Field], optional) – Initial fields, excluding uid. Defaults to [].

Returns:

Empty dataset

Return type:

Dataset
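
Examples

A hypothetical allocation (field names are illustrative only):

>>> dset = Dataset.allocate(size=3, fields=[("score", "f4"), ("path", "O")])
>>> len(dset)
3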

append(*others: Dataset, assert_same_fields=False, repeat_allowed=False)#

Concatenate many datasets together into one new one.

May be called either as an instance method or an initializer to create a new dataset from one or more datasets.

To initialize from zero or more datasets, use Dataset.append_many.

Parameters:
  • assert_same_fields (bool, optional) – If not set or False, appends only common dataset fields. If True, fails when the inputs do not all share the same fields. Defaults to False.

  • repeat_allowed (bool, optional) – If True, does not fail when there are duplicate UIDs. Defaults to False.

Returns:

appended dataset

Return type:

Dataset

Examples

As an instance method

>>> dset = d1.append(d2, d3)

As a class method

>>> dset = Dataset.append(d1, d2, d3)
classmethod append_many(*datasets: Dataset, assert_same_fields=False, repeat_allowed=False)#

Similar to Dataset.append. If no datasets are provided, returns an empty Dataset with just the uid field.

Parameters:
  • assert_same_fields (bool, optional) – Same as for append method. Defaults to False.

  • repeat_allowed (bool, optional) – Same as for append method. Defaults to False.

Returns:

Appended dataset

Return type:

Dataset

cols() Dict[str, Column]#

Get current dataset columns, organized by field.

Returns:

Columns

Return type:

dict[str, Column]

classmethod common_fields(*datasets: Dataset, assert_same_fields=False) List[Tuple[str, str] | Tuple[str, str, Tuple[int, ...]]]#

Get a list of fields common to all given datasets.

Parameters:

assert_same_fields (bool, optional) – If True, fails if datasets don’t all share the same fields. Defaults to False.

Returns:

List of dataset fields and their data types.

Return type:

list[Field]
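
Examples

A small sketch with made-up fields (uid is assumed to be included among the common fields):

>>> d1 = Dataset({"uid": [1, 2], "a": [0.0, 1.0]})
>>> d2 = Dataset({"uid": [3], "a": [2.0], "b": [5.0]})
>>> sorted(name for name, *_ in Dataset.common_fields(d1, d2))
['a', 'uid']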

copy()#

Create a deep copy of the current dataset.

Returns:

copy

Return type:

Dataset

copy_fields(old_fields: List[str], new_fields: List[str])#

Copy the values at the given old fields into the new fields, allocating them if necessary.

Parameters:
  • old_fields (List[str]) – Name of old fields to copy from

  • new_fields (List[str]) – Names of new fields to copy to

Returns:

current dataset with modified fields

Return type:

Dataset
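
Examples

A minimal sketch (field names and values are illustrative):

>>> dset = Dataset({"uid": [1, 2], "score": [0.5, 0.9]})
>>> dset.copy_fields(["score"], ["score_backup"])
Dataset(...)
>>> dset["score_backup"].tolist()
[0.5, 0.9]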

descr(exclude_uid=False) List[Tuple[str, str] | Tuple[str, str, Tuple[int, ...]]]#

Get numpy-compatible description for dataset fields.

Parameters:

exclude_uid (bool, optional) – If True, uid field will not be included. Defaults to False.

Returns:

Fields

Return type:

list[Field]

drop_fields(names: Collection[str] | Callable[[str], bool], *, copy: bool = False)#

Remove the given field names from the dataset. Provide a list of fields or a function that takes a field name and returns True if that field should be removed.

Parameters:
  • names (list[str] | (str) -> bool) – Collection of fields to remove, or a function that takes a field name and returns True if that field should be removed.

  • copy (bool, optional) – If True, return a copy of dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with fields removed

Return type:

Dataset
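
Examples

A sketch using a predicate function (field names are made up):

>>> dset = Dataset({"uid": [1, 2], "tmp/a": [0, 1], "keep": [3, 4]})
>>> dset.drop_fields(lambda name: name.startswith("tmp/"))
Dataset(...)
>>> "tmp/a" in dset.fields()
False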

extend(*others: Dataset, repeat_allowed=False)#

Add the given dataset(s) to the end of the current dataset. Other datasets must have at least the same fields of the current dataset.

Parameters:

repeat_allowed (bool, optional) – If True, does not fail when there are duplicate UIDs. Defaults to False.

Returns:

current dataset with others appended

Return type:

Dataset

Examples

>>> len(d1), len(d2), len(d3)
(42, 3, 5)
>>> d1.extend(d2, d3)
Dataset(...)
>>> len(d1)
50
fields(exclude_uid=False) List[str]#

Get a list of field names available in this dataset.

Parameters:

exclude_uid (bool, optional) – If True, uid field will not be included. Defaults to False.

Returns:

List of field names

Return type:

list[str]

filter_fields(names: Collection[str] | Callable[[str], bool], *, copy: bool = False)#

Keep only the given fields from the dataset. Provide a list of fields or function that returns True if a given field name should be kept.

Parameters:
  • names (list[str] | (str) -> bool) – Collection of fields to keep or function that takes a field name and returns True if that field should be kept

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with filtered fields

Return type:

Dataset

filter_prefix(keep_prefix: str, *, rename: str | None = None, copy: bool = False)#

Similar to filter_prefixes but for a single prefix.

Parameters:
  • keep_prefix (str) – Prefix to keep.

  • rename (str, optional) – If specified, rename prefix to this prefix. Defaults to None.

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with filtered prefix

Return type:

Dataset

filter_prefixes(prefixes: Collection[str], *, copy: bool = False)#

Similar to filter_fields, except takes list of prefixes.

Parameters:
  • prefixes (list[str]) – Prefixes to keep

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with filtered prefixes

Return type:

Dataset

Examples

>>> dset = Dataset([
...     ('uid', [123, 456, 789]),
...     ('field', [0, 0, 0]),
...     ('foo/one', [1, 2, 3]),
...     ('foo/two', [4, 5, 6]),
...     ('bar/one', ['Hello', 'World', '!']),
... ])
>>> dset.filter_prefixes(['foo'])
Dataset([
    ('uid', [123 456 789]),
    ('foo/one', [1 2 3]),
    ('foo/two', [4 5 6]),
])
async classmethod from_async_stream(stream: AsyncBinaryIO)#

Asynchronously load from the given binary stream. The given stream parameter must at least have an async read(n: int | None) -> bytes method.

handle() int#

Numeric dataset handle for working with the dataset via C APIs (documentation is not yet available).

Returns:

Dataset handle that may be used with C API defined in

<cryosparc-tools/dataset.h>

Return type:

int

innerjoin(*others: Dataset, assert_no_drop=False)#

Create a new dataset with fields from all provided datasets and only including rows common to all provided datasets (based on UID)

May be called either as an instance method or an initializer to create a new dataset from one or more datasets.

To initialize from zero or more datasets, use Dataset.innerjoin_many.

Parameters:

assert_no_drop (bool, optional) – Set to True to ensure the provided datasets include at least all UIDs from the first dataset. Defaults to False.

Returns:

combined dataset.

Return type:

Dataset

Examples

As instance method

>>> dset = d1.innerjoin(d2, d3)

As class method

>>> dset = Dataset.innerjoin(d1, d2, d3)
classmethod innerjoin_many(*datasets: Dataset)#

Similar to Dataset.innerjoin. If no datasets are provided, returns an empty Dataset with just the uid field.

Returns:

combined dataset

Return type:

Dataset

classmethod inspect(file: str | PurePath) DatasetHeader#

Given a path to a dataset file, get information included in its header.

Parameters:

file (str | Path) – Readable file path.

Returns:

Dictionary with dataset header information

Return type:

DatasetHeader

interlace(*datasets: Dataset, assert_same_fields=False)#

Combine the current dataset with one or more datasets of the same length by alternating rows from each dataset.

Parameters:

assert_same_fields (bool, optional) – If True, fails if not all given datasets have the same fields. Otherwise result only includes common fields. Defaults to False.

Returns:

combined dataset

Return type:

Dataset
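
Examples

A sketch with two equal-length datasets (values are illustrative):

>>> d1 = Dataset({"uid": [1, 3], "v": [10, 30]})
>>> d2 = Dataset({"uid": [2, 4], "v": [20, 40]})
>>> d1.interlace(d2)["uid"].tolist()
[1, 2, 3, 4]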

is_equivalent(other: object)#

Check whether two datasets contain the same data, regardless of field order.

Parameters:

other (object) – dataset to compare

Returns:

True or False

Return type:

bool

classmethod load(file: str | PurePath | IO[bytes], *, prefixes: Sequence[str] | None = None, fields: Sequence[str] | None = None, cstrs: bool = False)#

Read a dataset from path or file handle.

If given a file handle pointing to data in the usual numpy array format (i.e., created by numpy.save()), then the handle must be seekable. This restriction does not apply when loading the newer CSDAT_FORMAT.

Parameters:
  • file (str | Path | IO) – Readable file path or handle. Must be seekable if loading a dataset saved in the default NUMPY_FORMAT

  • prefixes (list[str], optional) – Which field prefixes to load. If not specified, loads all fields (or only those named in fields).

  • fields (list[str], optional) – Which fields to load. If not specified, loads all fields (or only those under the given prefixes).

  • cstrs (bool) – If True, load internal string columns as C strings instead of Python strings. Defaults to False.

Raises:

DatasetLoadError – If cannot load dataset file.

Returns:

loaded dataset.

Return type:

Dataset
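
Examples

A sketch of loading only specific fields (the path is hypothetical; the field names follow the particle example above):

>>> dset = Dataset.load('/path/to/particles.cs', fields=['blob/path', 'blob/idx'])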

mask(mask: List[bool] | NDArray)#

Get a subset of the dataset that matches the given boolean mask of rows.

Parameters:

mask (list[bool] | NDArray[bool]) – mask to keep. Must match length of current dataset.

Returns:

subset with only matching rows

Return type:

Dataset
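
Examples

A minimal sketch (the score field and values are made up):

>>> dset = Dataset({"uid": [1, 2, 3], "score": [0.1, 0.8, 0.9]})
>>> len(dset.mask(dset["score"] > 0.5))
2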

prefixes() List[str]#

List of field prefixes available in this dataset, assuming fields have the format {prefix}/{field}.

Returns:

List of prefixes

Return type:

list[str]

Examples

>>> dset = Dataset({
...     'uid': [123, 456, 789],
...     'field': [0, 0, 0],
...     'foo/one': [1, 2, 3],
...     'foo/two': [4, 5, 6],
...     'bar/one': ["Hello", "World", "!"]
... })
>>> dset.prefixes()
["field", "foo", "bar"]
query(query: Dict[str, ArrayLike] | Callable[[R], bool])#

Get a subset of data based on whether the fields match the values in the given query. The query is either a test function that is called on each row or a key/value map of allowed field values.

Each value of a query dictionary may either be a single scalar value or a collection of matching values.

If any field is not in the dataset, it is ignored and all data is kept.

Note

Specifying a query function is very slow for large datasets.

Parameters:

query (dict[str, ArrayLike] | (Row) -> bool) – Query description or row test function.

Returns:

Subset matching the given query

Return type:

Dataset

Examples

With a query dictionary

>>> dset.query({
...     'uid': [123456789, 987654321],
...     'micrograph_blob/path': '/path/to/exposure.mrc'
... })
Dataset(...)

With a function (not recommended)

>>> dset.query(
...     lambda row:
...         row['uid'] in [123456789, 987654321] and
...         row['micrograph_blob/path'] == '/path/to/exposure.mrc'
... )
Dataset(...)
query_mask(query: Dict[str, ArrayLike], invert=False) NDArray[n.bool_]#

Get a boolean array representing the items to keep in the dataset that match the given query filter. See query method for example query format.

Parameters:
  • query (dict[str, ArrayLike]) – Query description

  • invert (bool, optional) – If True, returns mask with all items negated. Defaults to False.

Returns:

Query mask, may be used with the mask() method.

Return type:

NDArray[bool]
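
Examples

A sketch reusing the query format shown above (field value is hypothetical):

>>> keep = dset.query_mask({'micrograph_blob/path': '/path/to/exposure.mrc'})
>>> subset = dset.mask(keep)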

reassign_uids()#

Reset all values of the uid column to new unique random values.

Returns:

current dataset with modified UIDs

Return type:

Dataset

rename_field(current_name: str, new_name: str, *, copy: bool = False)#

Change name of a dataset field based on the given mapping.

Parameters:
  • current_name (str) – Old field name.

  • new_name (str) – New field name.

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with fields renamed

Return type:

Dataset

rename_fields(field_map: Dict[str, str] | Callable[[str], str], *, copy: bool = False)#

Change the name of dataset fields based on the given mapping.

Parameters:
  • field_map (dict[str, str] | (str) -> str) – Field mapping function or dictionary

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with fields renamed

Return type:

Dataset

rename_prefix(old_prefix: str, new_prefix: str, *, copy: bool = False)#

Similar to rename_fields, except changes the prefix of all fields with the given old_prefix to new_prefix.

Parameters:
  • old_prefix (str) – old prefix to rename

  • new_prefix (str) – new prefix

  • copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.

Returns:

current dataset or copy with renamed prefix.

Return type:

Dataset

replace(query: Dict[str, ArrayLike], *others: Dataset, assume_disjoint=False, assume_unique=False)#

Replaces values matching the given query with others. The query is a key/value map of allowed field values. The values may be either a single scalar value or a set of possible values. If nothing matches the query (e.g., {} specified), works the same way as append.

All given datasets must have the same fields.

Parameters:
  • query (dict[str, ArrayLike]) – Query description.

  • assume_disjoint (bool, optional) – If True, assumes given datasets do not share any uid values. Defaults to False.

  • assume_unique (bool, optional) – If True, assumes each given dataset has no duplicate uid values. Defaults to False.

Returns:

dataset with rows matching the query removed and the other datasets appended at the end

Return type:

Dataset
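
Examples

A sketch assuming corrected is a dataset with the same fields as dset (paths are hypothetical):

>>> corrected = Dataset.load('/path/to/corrected_particles.cs')
>>> dset.replace({'micrograph_blob/path': '/path/to/exposure.mrc'}, corrected)
Dataset(...)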

rows() Spool[R]#

A row-by-row accessor list for items in this dataset.

Returns:

List-like row accessor

Return type:

Spool

Examples

>>> dset = Dataset.load('/path/to/dataset.cs')
>>> for row in dset.rows():
...    print(row.to_dict())
save(file: str | PurePath | IO[bytes], format: int = 1)#

Save a dataset to the given path or I/O buffer.

By default, saves as a numpy record array in the .npy format. Specify format=CSDAT_FORMAT to save in the latest .cs file format which is faster and results in a smaller file size but is not numpy-compatible.

Parameters:
  • file (str | Path | IO) – Writeable file path or handle

  • format (int, optional) – Must be of the constants DEFAULT_FORMAT, NUMPY_FORMAT (same as DEFAULT_FORMAT), or CSDAT_FORMAT. Defaults to DEFAULT_FORMAT.

Raises:

TypeError – If invalid format specified
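
Examples

A sketch of saving in each format (paths are hypothetical):

>>> from cryosparc.dataset import CSDAT_FORMAT
>>> dset.save('/path/to/particles.cs')                        # default numpy-compatible format
>>> dset.save('/path/to/particles.cs', format=CSDAT_FORMAT)   # compressed-stream format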

slice(start: int = 0, stop: int | None = None, step: int = 1)#

Get subset of the dataset with rows in the given range.

Parameters:
  • start (int, optional) – Start index to slice from (inclusive). Defaults to 0.

  • stop (int, optional) – End index to slice until (exclusive). Defaults to length of dataset.

  • step (int, optional) – How many entries to step over in resulting slice. Defaults to 1.

Returns:

subset with slice of matching rows

Return type:

Dataset
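
Examples

A minimal sketch (uid values are illustrative):

>>> dset = Dataset({"uid": [1, 2, 3, 4, 5]})
>>> dset.slice(1, 4)["uid"].tolist()
[2, 3, 4]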

split_by(field: str)#

Create a mapping from each possible value of the given field to a dataset containing only the rows with that value.

Examples

>>> dset = Dataset([
...     ('uid', [1, 2, 3, 4]),
...     ('foo', ['hello', 'world', 'hello', 'world'])
... ])
>>> dset.split_by('foo')
{
    'hello': Dataset([('uid', [1, 3]), ('foo', ['hello', 'hello'])]),
    'world': Dataset([('uid', [2, 4]), ('foo', ['world', 'world'])])
}
stream(compression: Literal['lz4', None] | None = None) Generator[bytes, None, None]#

Generate a binary representation for this dataset. Results may be written to a file or buffer to be sent over the network.

Buffer will have the same format as Dataset files saved with format=CSDAT_FORMAT. Call Dataset.load on the resulting file/buffer to retrieve the original data.

Parameters:

compression (Literal["lz4", None], optional) – Compression to apply to the generated stream, if any. Defaults to None (no compression).

Yields:

bytes – Dataset file chunks
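
Examples

A sketch that round-trips a small dataset through an in-memory buffer (field name and values are illustrative):

>>> import io
>>> dset = Dataset({"uid": [1, 2], "x": [0.5, 1.5]})
>>> data = b"".join(dset.stream(compression='lz4'))
>>> restored = Dataset.load(io.BytesIO(data))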

subset(rows: Collection[Row])#

Get a subset of dataset that only includes the given list of rows (from this dataset).

Parameters:

rows (list[Row]) – Target list of rows from this dataset.

Returns:

subset with only matching rows

Return type:

Dataset

take(indices: List[int] | NDArray)#

Get a subset of data with only the matching list of row indices.

Parameters:

indices (list[int] | NDArray[int]) – collection of indices to keep.

Returns:

subset with matching row indices

Return type:

Dataset
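
Examples

A minimal sketch (uid values are illustrative):

>>> dset = Dataset({"uid": [1, 2, 3, 4]})
>>> dset.take([0, 2])["uid"].tolist()
[1, 3]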

to_cstrs(*, copy: bool = False)#

Convert all Python string columns to C strings. Resulting dataset fields that previously had dtype np.object_ (or T_OBJ internally) will get type np.uint64 and may be accessed via the dataset C API.

Note: This operation takes a long time for large datasets.

Parameters:

copy (bool, optional) – If True, returns a modified copy of the dataset instead of mutation. Defaults to False.

Returns:

same dataset or copy if specified.

Return type:

Dataset

to_list(exclude_uid=False) List[list]#

Convert to a list of lists, each value of the outer list representing one dataset row. Every value in the resulting list is guaranteed to be a Python type (no numpy numeric types).

Parameters:

exclude_uid (bool, optional) – If True, uid column will not be included in output list. Defaults to False.

Returns:

list of row lists

Return type:

list

Examples

>>> dset = Dataset([
...     ('uid', [123, 456, 789]),
...     ('foo/one', [1, 2, 3]),
...     ('foo/two', [4, 5, 6]),
... ])
>>> dset.to_list()
[[123, 1, 4], [456, 2, 5], [789, 3, 6]]
to_pystrs(*, copy: bool = False)#

Convert all C string columns to Python strings. Resulting dataset fields that previously had dtype np.uint64 (and T_STR internally) will get type np.object_.

Parameters:

copy (bool, optional) – If True, returns a modified copy of the dataset instead of mutation. Defaults to False.

Returns:

same dataset or copy if specified.

Return type:

Dataset

to_records(fixed=False)#

Convert to a numpy record array.

Parameters:

fixed (bool, optional) – If True, converts string columns (dtype("O")) to fixed-length strings (dtype("S")). Defaults to False.

Returns:

Numpy record array

Return type:

NDArray
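
Examples

A minimal sketch (the x field is made up):

>>> dset = Dataset({"uid": [1, 2], "x": [0.5, 1.5]})
>>> recs = dset.to_records()
>>> recs.dtype.names
('uid', 'x')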

union(*others: Dataset, assert_same_fields=False, assume_unique=False)#

Take the row union of all the given datasets, based on their uid fields.

May be called either as an instance method or an initializer to create a new dataset from one or more datasets.

To initialize from zero or more datasets, use Dataset.union_many.

Parameters:
  • assert_same_fields (bool, optional) – Set to True to enforce that datasets have identical fields. Otherwise, result only includes fields common to all datasets. Defaults to False.

  • assume_unique (bool, optional) – Set to True to assume that each input dataset’s UIDs are unique (though there may be common UIDs between datasets). Defaults to False.

Returns:

Combined dataset

Return type:

Dataset

Examples

As instance method

>>> dset = d1.union(d2, d3)

As class method

>>> dset = Dataset.union(d1, d2, d3)
classmethod union_many(*datasets: Dataset, assert_same_fields=False, assume_unique=False)#

Similar to Dataset.union. If no datasets are provided, returns an empty Dataset with just the uid field.

Parameters:
  • assert_same_fields (bool, optional) – Same as for union. Defaults to False.

  • assume_unique (bool, optional) – Same as for union. Defaults to False.

Returns:

combined dataset, or empty dataset if none are provided.

Return type:

Dataset

cryosparc.dataset.NEWEST_FORMAT = 2#

Newest save .cs file format. Same as CSDAT_FORMAT.

cryosparc.dataset.NUMPY_FORMAT = 1#

Numpy-array .cs file format. Same as DEFAULT_FORMAT.

cryosparc.dataset.generate_uids(num: int = 0)#

Generate the given number of random 64-bit unsigned integer uids.

Parameters:

num (int, optional) – Number of UIDs to generate. Defaults to 0.

Returns:

Numpy array of random unsigned 64-bit integers

Return type:

NDArray
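
Examples

A minimal sketch:

>>> from cryosparc.dataset import generate_uids
>>> uids = generate_uids(3)
>>> uids.shape, uids.dtype
((3,), dtype('uint64'))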