cryosparc.dataset#
Classes and utilities for loading, saving and working with .cs Dataset files. A pure-C interface to dataset handles is also available.
A Dataset can represent anything: particles, volumes, micrographs, etc.
A Result is a dataset + field names + other info.
Datasets are lightweight: many may be in use at once, e.g., one per micrograph during picking.
The only required field is uid. This field is automatically added to every new dataset.
Datasets are created in one of the following ways:
- allocated empty with a specific size and field definitions
- from a previous dataset source that already has uids (file, record array)
- by appending datasets to each other or joining on uid
Dataset supports:
- adding new rows (via appending)
- adding new fields
- joining fields from another dataset on UID
Data:
- CSDAT_FORMAT – Compressed stream .cs file format.
- DEFAULT_FORMAT – Default save .cs file format.
- NEWEST_FORMAT – Newest save .cs file format.
- NUMPY_FORMAT – Numpy-array .cs file format.
Classes:
- Dataset – Accessor class for working with CryoSPARC .cs files.
Functions:
- generate_uids – Generate the given number of random 64-bit unsigned integer uids.
- cryosparc.dataset.CSDAT_FORMAT = 2#
Compressed stream .cs file format. Same as NEWEST_FORMAT.
- cryosparc.dataset.DEFAULT_FORMAT = 1#
Default save .cs file format. Same as NUMPY_FORMAT.
- class cryosparc.dataset.Dataset(allocate: ~typing.Union[int, Dataset[Any], NDArray, ~cryosparc.core.Data, ~typing.Mapping[str, ArrayLike], ~typing.List[~typing.Tuple[str, ArrayLike]], ~typing.Literal[None]] = 0, row_class: ~typing.Type[~cryosparc.row.R] = <class 'cryosparc.row.Row'>)#
Accessor class for working with CryoSPARC .cs files.
A dataset may be initialized with Dataset(data) where data is one of the following:
- A size of items to allocate (e.g., 42)
- A mapping from column names to their contents (dict or tuple list)
- A numpy record array
Examples
Initialize a dataset
>>> dset = Dataset([
...     ("uid", [1, 2, 3]),
...     ("dat1", ["Hello", "World", "!"]),
...     ("dat2", [3.14, 2.71, 1.61])
... ])
>>> dset.descr()
[('uid', '<u8'), ('dat1', '|O'), ('dat2', '<f8')]
Load a dataset from disk
>>> from cryosparc.dataset import Dataset
>>> dset = Dataset.load('/path/to/particles.cs')
>>> for particle in dset.rows():
...     print(
...         f"Particle located in file {particle['blob/path']} "
...         f"at index {particle['blob/idx']}")
Methods:
- add_fields(fields[, dtypes]) – Adds the given fields to the dataset.
- allocate([size, fields]) – Allocate a dataset with the given number of rows and specified fields.
- append(*others[, assert_same_fields, ...]) – Concatenate many datasets together into one new one.
- append_many(*datasets[, assert_same_fields, ...]) – Similar to Dataset.append.
- cols() – Get current dataset columns, organized by field.
- common_fields(*datasets[, assert_same_fields]) – Get a list of fields common to all given datasets.
- copy() – Create a deep copy of the current dataset.
- copy_fields(old_fields, new_fields) – Copy the values at the given old fields into the new fields, allocating them if necessary.
- descr([exclude_uid]) – Get numpy-compatible description for dataset fields.
- drop_fields(names[, copy]) – Remove the given field names from the dataset.
- extend(*others[, repeat_allowed]) – Add the given dataset(s) to the end of the current dataset.
- fields([exclude_uid]) – Get a list of field names available in this dataset.
- filter_fields(names[, copy]) – Keep only the given fields from the dataset.
- filter_prefix(keep_prefix[, copy]) – Similar to filter_prefixes but for a single prefix.
- filter_prefixes(prefixes[, copy]) – Similar to filter_fields, except takes a list of prefixes.
- from_async_stream(stream) – Asynchronously load from the given binary stream.
- handle() – Numeric dataset handle for working with the dataset via C APIs.
- innerjoin(*others[, assert_no_drop]) – Create a new dataset with fields from all provided datasets, including only rows common to all (based on UID).
- innerjoin_many(*datasets) – Similar to Dataset.innerjoin.
- interlace(*datasets[, assert_same_fields]) – Combine the current dataset with one or more datasets of the same length by alternating rows from each.
- load(file[, cstrs]) – Read a dataset from a path or file handle.
- mask(mask) – Get a subset of the dataset that matches the given boolean mask of rows.
- prefixes() – List of field prefixes available in this dataset, assuming fields have the format {prefix}/{field}.
- query(query) – Get a subset of data based on whether the fields match the values in the given query.
- query_mask(query[, invert]) – Get a boolean array representing the items to keep in the dataset that match the given query filter.
- reassign_uids() – Reset all values of the uid column to new unique random values.
- rename_field(current_name, new_name[, copy]) – Change the name of a dataset field.
- rename_fields(field_map[, copy]) – Change the names of dataset fields based on the given mapping.
- rename_prefix(old_prefix, new_prefix[, copy]) – Change the prefix of all fields with the given old_prefix to new_prefix.
- replace(query, *others[, assume_disjoint, ...]) – Replace values matching the given query with others.
- rows() – A row-by-row accessor list for items in this dataset.
- save(file[, format]) – Save a dataset to the given path or I/O buffer.
- slice([start, stop, step]) – Get a subset of the dataset with rows in the given range.
- split_by(field) – Create a mapping from each value of the given field to a dataset filtered to rows with that value.
- stream([compression]) – Generate a binary representation of this dataset.
- subset(rows) – Get a subset of the dataset that only includes the given list of rows.
- take(indices) – Get a subset of data with only the matching list of row indices.
- to_cstrs([copy]) – Convert all Python string columns to C strings.
- to_list([exclude_uid]) – Convert to a list of lists, each value of the outer list representing one dataset row.
- to_pystrs([copy]) – Convert all C string columns to Python strings.
- to_records([fixed]) – Convert to a numpy record array.
- union(*others[, assert_same_fields, ...]) – Take the row union of all the given datasets, based on their uid fields.
- union_many(*datasets[, assert_same_fields, ...]) – Similar to Dataset.union.
- add_fields(fields: List[Union[Tuple[str, str], Tuple[str, str, Tuple[int, ...]]]]) Dataset[R] #
- add_fields(fields: List[str], dtypes: Union[str, List['DTypeLike']]) Dataset[R]
Adds the given fields to the dataset. If a field with the same name already exists, that field will not be added (even if types don’t match). Fields are initialized with zeros (or “” for object fields).
- Parameters
fields (list[str] | list[Field]) – Field names or descriptions to add. If a list of names is specified, the second dtypes argument must also be specified.
dtypes (str | list[DTypeLike], optional) – String with comma-separated data type names or list of data types. Must be specified if the fields argument is a list of strings. Defaults to None.
- Returns
self with added fields
- Return type
Examples
>>> dset = Dataset(3)
>>> dset.add_fields(
...     ['foo', 'bar'],
...     ['u8', ('f4', (2,))]
... )
Dataset([
    ('uid', [14727850622008419978 309606388100339041 15935837513913527085]),
    ('foo', [0 0 0]),
    ('bar', [[0. 0.] [0. 0.] [0. 0.]]),
])
>>> dset.add_fields([('baz', "O")])
Dataset([
    ('uid', [14727850622008419978 309606388100339041 15935837513913527085]),
    ('foo', [0 0 0]),
    ('bar', [[0. 0.] [0. 0.] [0. 0.]]),
    ('baz', ["" "" ""]),
])
- classmethod allocate(size: int = 0, fields: List[Union[Tuple[str, str], Tuple[str, str, Tuple[int, ...]]]] = [])#
Allocate a dataset with the given number of rows and specified fields.
- Parameters
size (int, optional) – Number of rows to allocate. Defaults to 0.
fields (list[Field], optional) – Initial fields, excluding uid. Defaults to [].
- Returns
Empty dataset
- Return type
- append(*others: Dataset, assert_same_fields=False, repeat_allowed=False)#
Concatenate many datasets together into one new one.
May be called either as an instance method or an initializer to create a new dataset from one or more datasets.
To initialize from zero or more datasets, use Dataset.append_many.
- Parameters
assert_same_fields (bool, optional) – If not set or False, appends only common dataset fields. If True, fails when inputs don’t have all fields in common. Defaults to False.
repeat_allowed (bool, optional) – If True, does not fail when there are duplicate UIDs. Defaults to False.
- Returns
appended dataset
- Return type
Examples
As an instance method
>>> dset = d1.append(d2, d3)
As a class method
>>> dset = Dataset.append(d1, d2, d3)
- classmethod append_many(*datasets: Dataset, assert_same_fields=False, repeat_allowed=False)#
Similar to Dataset.append. If no datasets are provided, returns an empty Dataset with just the uid field.
- Parameters
assert_same_fields (bool, optional) – Same as for the append method. Defaults to False.
repeat_allowed (bool, optional) – Same as for the append method. Defaults to False.
- Returns
Appended dataset
- Return type
- cols() Dict[str, Column] #
Get current dataset columns, organized by field.
- Returns
Columns
- Return type
dict[str, Column]
- classmethod common_fields(*datasets: Dataset, assert_same_fields=False) List[Union[Tuple[str, str], Tuple[str, str, Tuple[int, ...]]]] #
Get a list of fields common to all given datasets.
- Parameters
assert_same_fields (bool, optional) – If True, fails if datasets don’t all share the same fields. Defaults to False.
- Returns
List of dataset fields and their data types.
- Return type
list[Field]
- copy_fields(old_fields: List[str], new_fields: List[str])#
Copy the values at the given old fields into the new fields, allocating them if necessary.
- Parameters
old_fields (List[str]) – Names of old fields to copy from
new_fields (List[str]) – Names of new fields to copy to
- Returns
current dataset with modified fields
- Return type
- descr(exclude_uid=False) List[Union[Tuple[str, str], Tuple[str, str, Tuple[int, ...]]]] #
Get numpy-compatible description for dataset fields.
- Parameters
exclude_uid (bool, optional) – If True, uid field will not be included. Defaults to False.
- Returns
Fields
- Return type
list[Field]
- drop_fields(names: Union[Collection[str], Callable[[str], bool]], copy: bool = False)#
Remove the given field names from the dataset. Provide a list of fields or a function that takes a field name and returns True if that field should be removed.
- Parameters
names (list[str] | (str) -> bool) – Collection of fields to remove or function that takes a field name and returns True if that field should be removed
copy (bool, optional) – If True, return a copy of dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with fields removed
- Return type
- extend(*others: Dataset, repeat_allowed=False)#
Add the given dataset(s) to the end of the current dataset. Other datasets must have at least the same fields as the current dataset.
- Parameters
repeat_allowed (bool, optional) – If True, does not fail when there are duplicate UIDs. Defaults to False.
- Returns
current dataset with others appended
- Return type
Examples
>>> len(d1), len(d2), len(d3)
(42, 3, 5)
>>> d1.extend(d2, d3)
Dataset(...)
>>> len(d1)
50
- fields(exclude_uid=False) List[str] #
Get a list of field names available in this dataset.
- Parameters
exclude_uid (bool, optional) – If True, uid field will not be included. Defaults to False.
- Returns
List of field names
- Return type
list[str]
- filter_fields(names: Union[Collection[str], Callable[[str], bool]], copy: bool = False)#
Keep only the given fields from the dataset. Provide a list of fields or a function that returns True if a given field name should be kept.
- Parameters
names (list[str] | (str) -> bool) – Collection of fields to keep or function that takes a field name and returns True if that field should be kept
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with filtered fields
- Return type
- filter_prefix(keep_prefix: str, copy: bool = False)#
Similar to filter_prefixes but for a single prefix.
- Parameters
keep_prefix (str) – Prefix to keep
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with filtered prefix
- Return type
- filter_prefixes(prefixes: Collection[str], copy: bool = False)#
Similar to filter_fields, except takes a list of prefixes.
- Parameters
prefixes (list[str]) – Prefixes to keep
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with filtered prefixes
- Return type
Examples
>>> dset = Dataset([
...     ('uid', [123, 456, 789]),
...     ('field', [0, 0, 0]),
...     ('foo/one', [1, 2, 3]),
...     ('foo/two', [4, 5, 6]),
...     ('bar/one', ['Hello', 'World', '!']),
... ])
>>> dset.filter_prefixes(['foo'])
Dataset([
    ('uid', [123 456 789]),
    ('foo/one', [1 2 3]),
    ('foo/two', [4 5 6]),
])
- async classmethod from_async_stream(stream: AsyncBinaryIO)#
Asynchronously load from the given binary stream. The given stream parameter must at least have an async read(n: int | None) -> bytes method.
- handle() int #
Numeric dataset handle for working with the dataset via C APIs (documentation is not yet available).
- Returns
- Dataset handle that may be used with the C API defined in <cryosparc-tools/dataset.h>
- Return type
int
- innerjoin(*others: Dataset, assert_no_drop=False)#
Create a new dataset with fields from all provided datasets, including only rows common to all provided datasets (based on UID).
May be called either as an instance method or an initializer to create a new dataset from one or more datasets.
To initialize from zero or more datasets, use Dataset.innerjoin_many.
- Parameters
assert_no_drop (bool, optional) – Set to True to ensure the provided datasets include at least all UIDs from the first dataset. Defaults to False.
- Returns
combined dataset.
- Return type
Examples
As instance method
>>> dset = d1.innerjoin(d2, d3)
As class method
>>> dset = Dataset.innerjoin(d1, d2, d3)
- classmethod innerjoin_many(*datasets: Dataset)#
Similar to Dataset.innerjoin. If no datasets are provided, returns an empty Dataset with just the uid field.
- Returns
combined dataset
- Return type
- interlace(*datasets: Dataset, assert_same_fields=False)#
Combine the current dataset with one or more datasets of the same length by alternating rows from each dataset.
- Parameters
assert_same_fields (bool, optional) – If True, fails if not all given datasets have the same fields. Otherwise result only includes common fields. Defaults to False.
- Returns
combined dataset
- Return type
- classmethod load(file: Union[str, PurePath, IO[bytes]], cstrs: bool = False)#
Read a dataset from path or file handle.
If given a file handle pointing to data in the usual numpy array format (i.e., created by numpy.save()), then the handle must be seekable. This restriction does not apply when loading the newer CSDAT_FORMAT.
- Parameters
file (str | Path | IO) – Readable file path or handle. Must be seekable if loading a dataset saved in the default NUMPY_FORMAT.
cstrs (bool) – If True, load internal string columns as C strings instead of Python strings. Defaults to False.
- Raises
DatasetLoadError – If cannot load dataset file.
- Returns
loaded dataset.
- Return type
- mask(mask: Union[List[bool], NDArray])#
Get a subset of the dataset that matches the given boolean mask of rows.
- Parameters
mask (list[bool] | NDArray[bool]) – mask to keep. Must match length of current dataset.
- Returns
subset with only matching rows
- Return type
- prefixes() List[str] #
List of field prefixes available in this dataset, assuming fields have the format {prefix}/{field}.
- Returns
List of prefixes
- Return type
list[str]
Examples
>>> dset = Dataset({
...     'uid': [123, 456, 789],
...     'field': [0, 0, 0],
...     'foo/one': [1, 2, 3],
...     'foo/two': [4, 5, 6],
...     'bar/one': ["Hello", "World", "!"]
... })
>>> dset.prefixes()
["field", "foo", "bar"]
- query(query: Union[Dict[str, ArrayLike], Callable[[R], bool]])#
Get a subset of data based on whether the fields match the values in the given query. The query is either a test function that is called on each row or a key/value map of allowed field values.
Each value of a query dictionary may either be a single scalar value or a collection of matching values.
If any field is not in the dataset, it is ignored and all data is kept.
Note
Specifying a query function is very slow for large datasets.
- Parameters
query (dict[str, ArrayLike] | (Row) -> bool) – Query description or row test function.
- Returns
Subset matching the given query
- Return type
Examples
With a query dictionary
>>> dset.query({
...     'uid': [123456789, 987654321],
...     'micrograph_blob/path': '/path/to/exposure.mrc'
... })
Dataset(...)
With a function (not recommended)
>>> dset.query(
...     lambda row:
...         row['uid'] in [123456789, 987654321] and
...         row['micrograph_blob/path'] == '/path/to/exposure.mrc'
... )
Dataset(...)
- query_mask(query: Dict[str, ArrayLike], invert=False) NDArray[n.bool_] #
Get a boolean array representing the items to keep in the dataset that match the given query filter. See the query method for example query format.
- Parameters
query (dict[str, ArrayLike]) – Query description
invert (bool, optional) – If True, returns mask with all items negated. Defaults to False.
- Returns
Query mask, may be used with the mask() method.
- Return type
NDArray[bool]
- reassign_uids()#
Reset all values of the uid column to new unique random values.
- Returns
current dataset with modified UIDs
- Return type
- rename_field(current_name: str, new_name: str, copy: bool = False)#
Change the name of a dataset field.
- Parameters
current_name (str) – Old field name.
new_name (str) – New field name.
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with fields renamed
- Return type
- rename_fields(field_map: Union[Dict[str, str], Callable[[str], str]], copy: bool = False)#
Change the name of dataset fields based on the given mapping.
- Parameters
field_map (dict[str, str] | (str) -> str) – Field mapping function or dictionary
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with fields renamed
- Return type
- rename_prefix(old_prefix: str, new_prefix: str, copy: bool = False)#
Similar to rename_fields, except changes the prefix of all fields with the given old_prefix to new_prefix.
- Parameters
old_prefix (str) – old prefix to rename
new_prefix (str) – new prefix
copy (bool, optional) – If True, return a copy of the dataset rather than mutate. Defaults to False.
- Returns
current dataset or copy with renamed prefix.
- Return type
- replace(query: Dict[str, ArrayLike], *others: Dataset, assume_disjoint=False, assume_unique=False)#
Replaces values matching the given query with others. The query is a key/value map of allowed field values. The values may be either a single scalar value or a set of possible values. If nothing matches the query (e.g., {} specified), works the same way as append.
All given datasets must have the same fields.
- Parameters
query (dict[str, ArrayLike]) – Query description.
assume_disjoint (bool, optional) – If True, assumes given datasets do not share any uid values. Defaults to False.
assume_unique (bool, optional) – If True, assumes each given dataset has no duplicate uid values. Defaults to False.
- Returns
- subset with rows matching the query removed and other datasets appended at the end
- Return type
- rows() Spool[R] #
A row-by-row accessor list for items in this dataset.
- Returns
List-like row accessor
- Return type
Examples
>>> dset = Dataset.load('/path/to/dataset.cs')
>>> for row in dset.rows():
...     print(row.to_dict())
- save(file: Union[str, PurePath, IO[bytes]], format: int = 1)#
Save a dataset to the given path or I/O buffer.
By default, saves as a numpy record array in the .npy format. Specify format=CSDAT_FORMAT to save in the latest .cs file format, which is faster and results in a smaller file size but is not numpy-compatible.
- Parameters
file (str | Path | IO) – Writeable file path or handle
format (int, optional) – Must be one of the constants DEFAULT_FORMAT, NUMPY_FORMAT (same as DEFAULT_FORMAT), or CSDAT_FORMAT. Defaults to DEFAULT_FORMAT.
- Raises
TypeError – If invalid format specified
- slice(start: int = 0, stop: Optional[int] = None, step: int = 1)#
Get subset of the dataset with rows in the given range.
- Parameters
start (int, optional) – Start index to slice from (inclusive). Defaults to 0.
stop (int, optional) – End index to slice until (exclusive). Defaults to length of dataset.
step (int, optional) – How many entries to step over in resulting slice. Defaults to 1.
- Returns
subset with slice of matching rows
- Return type
- split_by(field: str)#
Create a mapping from each possible value of the given field to a dataset filtered to rows with that value.
Examples
>>> dset = Dataset([
...     ('uid', [1, 2, 3, 4]),
...     ('foo', ['hello', 'world', 'hello', 'world'])
... ])
>>> dset.split_by('foo')
{
    'hello': Dataset([('uid', [1, 3]), ('foo', ['hello', 'hello'])]),
    'world': Dataset([('uid', [2, 4]), ('foo', ['world', 'world'])])
}
- stream(compression: Optional[Literal['lz4', None]] = None) Generator[bytes, None, None] #
Generate a binary representation for this dataset. Results may be written to a file or buffer to be sent over the network.
The buffer will have the same format as Dataset files saved with format=CSDAT_FORMAT. Call Dataset.load on the resulting file/buffer to retrieve the original data.
- Yields
bytes – Dataset file chunks
- subset(rows: Collection[Row])#
Get a subset of dataset that only includes the given list of rows (from this dataset).
- take(indices: Union[List[int], NDArray])#
Get a subset of data with only the matching list of row indices.
- Parameters
indices (list[int] | NDArray[int]) – collection of indices to keep.
- Returns
subset with matching row indices
- Return type
- to_cstrs(copy: bool = False)#
Convert all Python string columns to C strings. Resulting dataset fields that previously had dtype np.object_ (or T_OBJ internally) will get type np.uint64 and may be accessed via the dataset C API.
Note: This operation takes a long time for large datasets.
- Parameters
copy (bool, optional) – If True, returns a modified copy of the dataset instead of mutation. Defaults to False.
- Returns
same dataset or copy if specified.
- Return type
- to_list(exclude_uid=False) List[list] #
Convert to a list of lists, each value of the outer list representing one dataset row. Every value in the resulting list is guaranteed to be a python type (no numpy numeric types).
- Parameters
exclude_uid (bool, optional) – If True, uid column will not be included in output list. Defaults to False.
- Returns
list of row lists
- Return type
list
Examples
>>> dset = Dataset([
...     ('uid', [123, 456, 789]),
...     ('foo/one', [1, 2, 3]),
...     ('foo/two', [4, 5, 6]),
... ])
>>> dset.to_list()
[[123, 1, 4], [456, 2, 5], [789, 3, 6]]
- to_pystrs(copy: bool = False)#
Convert all C string columns to Python strings. Resulting dataset fields that previously had dtype np.uint64 (and T_STR internally) will get type np.object_.
.- Parameters
copy (bool, optional) – If True, returns a modified copy of the dataset instead of mutation. Defaults to False.
- Returns
same dataset or copy if specified.
- Return type
- to_records(fixed=False)#
Convert to a numpy record array.
- Parameters
fixed (bool, optional) – If True, converts string columns (dtype("O")) to fixed-length strings (dtype("S")). Defaults to False.
- Returns
Numpy record array
- Return type
NDArray
- union(*others: Dataset, assert_same_fields=False, assume_unique=False)#
Take the row union of all the given datasets, based on their uid fields.
May be called either as an instance method or an initializer to create a new dataset from one or more datasets:
To initialize from zero or more datasets, use Dataset.union_many.
- Parameters
assert_same_fields (bool, optional) – Set to True to enforce that datasets have identical fields. Otherwise, result only includes fields common to all datasets. Defaults to False.
assume_unique (bool, optional) – Set to True to assume that each input dataset’s UIDs are unique (though there may be common UIDs between datasets). Defaults to False.
- Returns
Combined dataset
- Return type
Examples
As instance method
>>> dset = d1.union(d2, d3)
As class method
>>> dset = Dataset.union(d1, d2, d3)
- classmethod union_many(*datasets: Dataset, assert_same_fields=False, assume_unique=False)#
Similar to Dataset.union. If no datasets are provided, returns an empty Dataset with just the uid field.
- Parameters
assert_same_fields (bool, optional) – Same as for union. Defaults to False.
assume_unique (bool, optional) – Same as for union. Defaults to False.
- Returns
combined dataset, or empty dataset if none are provided.
- Return type
- cryosparc.dataset.NEWEST_FORMAT = 2#
Newest save .cs file format. Same as CSDAT_FORMAT.
- cryosparc.dataset.NUMPY_FORMAT = 1#
Numpy-array .cs file format. Same as DEFAULT_FORMAT.
- cryosparc.dataset.generate_uids(num: int = 0)#
Generate the given number of random 64-bit unsigned integer uids.
- Parameters
num (int, optional) – Number of UIDs to generate. Defaults to 0.
- Returns
Numpy array of random unsigned 64-bit integers
- Return type
NDArray