6. Import from EPU XML File
This example uses the EMPIAR-10409 dataset to demonstrate how to import multiple movie or micrograph datasets from an EPU-generated XML file.
First initialize a connection to CryoSPARC and find the project.
from cryosparc.tools import CryoSPARC
cs = CryoSPARC(host="cryoem5", base_port=40000)
assert cs.test_connection()
project = cs.find_project("P251")
Connection succeeded to CryoSPARC command_core at http://cryoem5:40002
Connection succeeded to CryoSPARC command_vis at http://cryoem5:40003
Create an external job that will hold each set of images in the XML file as a separate output.
job = project.create_external_job("W7", title="Import Image Sets")
Load the EPU-generated XML file from disk. Also define some helper functions to access the contents of an XML tree.
from pathlib import Path
from xml.dom import minidom
root_dir = Path("/bulk6/data/EMPIAR2/10409/10409")
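with open(root_dir / "10409.xml", "r") as f:
    doc = minidom.parse(f)


def get_child(node, child_tag):
    # First element with the given tag name found under node
    return node.getElementsByTagName(child_tag)[0]


def get_child_value(node, child_tag):
    # Text content of that element, with surrounding whitespace removed
    return get_child(node, child_tag).firstChild.nodeValue.strip()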
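As a quick check of how these helpers behave, the following minimal sketch runs them against a small inline XML string instead of the real file. The sample XML is made up for illustration and only borrows tag names from this dataset. One thing it shows: getElementsByTagName searches all descendants, not just direct children, so a nested tag such as pixelWidth can be read straight from the enclosing imageSet node, and the first match in document order is the one returned.

```python
from xml.dom import minidom

# Hypothetical sample for illustration only; not the real 10409.xml.
SAMPLE = """
<entry>
  <imageSet>
    <directory> /data/set_a </directory>
    <dimensions>
      <pixelWidth>0.834</pixelWidth>
    </dimensions>
  </imageSet>
</entry>
"""


def get_child(node, child_tag):
    # First descendant with this tag name, at any depth
    return node.getElementsByTagName(child_tag)[0]


def get_child_value(node, child_tag):
    # Text content of that descendant, with surrounding whitespace removed
    return get_child(node, child_tag).firstChild.nodeValue.strip()


sample_doc = minidom.parseString(SAMPLE)
image_set = get_child(sample_doc, "imageSet")

directory = get_child_value(image_set, "directory")
# pixelWidth is nested inside <dimensions> but is still found from <imageSet>
pixel_width = float(get_child_value(image_set, "pixelWidth"))
print(directory, pixel_width)
```

If a tag name could appear at several depths, narrowing the search to the intended parent first (as the import code does with the dimensions node) avoids picking up the wrong element.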
The XML file has the following structure (parts truncated for brevity):
<entry xmlns="http://pdbe.org/empiar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ftp://ftp.ebi.ac.uk/pub/databases/emtest/empiar/schema/empiar.xsd" accessionCode="EMPIAR-10409" public="true">
  <admin>
    ...
  </admin>
  ...
  <imageSet>
    <name>Unaligned TIF movies of SARS-CoV2 RdRp in complex with nsp7, nsp8 and RNA (part 1)</name>
    <directory>/data/data_tilt30_round1</directory>
    <category>micrographs - multiframe</category>
    <headerFormat>TIFF</headerFormat>
    <dataFormat>TIFF</dataFormat>
    <numImagesOrTiltSeries>3092</numImagesOrTiltSeries>
    <framesPerImage>80</framesPerImage>
    <voxelType>UNSIGNED BYTE</voxelType>
    <dimensions>
      <imageWidth>5760</imageWidth>
      <pixelWidth>0.834</pixelWidth>
      <imageHeight>4092</imageHeight>
      <pixelHeight>0.834</pixelHeight>
    </dimensions>
    <details>...</details>
    <segmentationList/>
    <micrographsFilePattern>data/data_tilt30_round1/HH691_funky_RNA_tilt30_*.tif</micrographsFilePattern>
    <pickedParticlesFilePattern>data/data_tilt30_round1/matching/HH691_funky_RNA_tilt30_*_SARSCoV2_nsp12_net_4.star</pickedParticlesFilePattern>
    <pickedParticlesDirectory>data/data_tilt30_round1/matching/</pickedParticlesDirectory>
  </imageSet>
  <imageSet>
    ...
  </imageSet>
  ...
</entry>
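Before deciding which image sets to import, it can help to list what the file contains. The sketch below is one way to do that, assuming each imageSet carries name, category and numImagesOrTiltSeries tags as in the structure above. The sample XML, the second image set, and the helper names text_of and summarize_image_sets are hypothetical and exist only for this illustration.

```python
from xml.dom import minidom

# Hypothetical, shortened stand-in for the XML file described above.
SAMPLE = """
<entry>
  <imageSet>
    <name>Movies (part 1)</name>
    <category>micrographs - multiframe</category>
    <numImagesOrTiltSeries>3092</numImagesOrTiltSeries>
  </imageSet>
  <imageSet>
    <name>Movies (part 2)</name>
    <category>micrographs - multiframe</category>
    <numImagesOrTiltSeries>12</numImagesOrTiltSeries>
  </imageSet>
</entry>
"""


def text_of(node, tag):
    """Return the stripped text of the first descendant <tag> of node."""
    return node.getElementsByTagName(tag)[0].firstChild.nodeValue.strip()


def summarize_image_sets(doc):
    """List (index, name, category, image count) for every <imageSet>."""
    summary = []
    for i, node in enumerate(doc.getElementsByTagName("imageSet")):
        count = int(text_of(node, "numImagesOrTiltSeries"))
        summary.append((i, text_of(node, "name"), text_of(node, "category"), count))
    return summary


summary = summarize_image_sets(minidom.parseString(SAMPLE))
for row in summary:
    print(row)
```

The index in each row matches the position used when slicing the list of imageSet nodes, so it can be used to pick the sets of interest.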
Find all <imageSet> tags and take only the first two, which correspond to two sets of unaligned TIFF movie files. For each image set:
Use the helper functions to get the values of various tags available in this dataset
Use the glob module to retrieve the relevant list of movie files
Add an exposure output to the job and allocate a dataset with the relevant fields
Populate the required fields
Save the output to the job
from glob import glob
from cryosparc.tools import get_exposure_format, get_import_signatures
job.start()

for i, node in enumerate(doc.getElementsByTagName("imageSet")[:2]):
    # Read this image set's metadata from the XML tree
    directory = get_child_value(node, "directory")
    file_pattern = get_child_value(node, "micrographsFilePattern")
    data_format = get_child_value(node, "dataFormat")
    voxel_type = get_child_value(node, "voxelType")
    frames_per_image = int(get_child_value(node, "framesPerImage"))
    dimensions_node = get_child(node, "dimensions")
    pixel_width = float(get_child_value(dimensions_node, "pixelWidth"))
    image_width = int(get_child_value(dimensions_node, "imageWidth"))
    image_height = int(get_child_value(dimensions_node, "imageHeight"))

    # Expand the file pattern relative to the dataset root
    paths = glob(str(root_dir / file_pattern))

    # Add an exposure output and allocate one dataset row per movie
    output_name = f"images_{i}"
    dset = job.add_output(
        type="exposure",
        name=output_name,
        slots=["movie_blob", "mscope_params", "gain_ref_blob"],
        alloc=len(paths),
    )

    dset["movie_blob/path"] = paths
    dset["movie_blob/shape"] = (frames_per_image, image_height, image_width)
    dset["movie_blob/psize_A"] = pixel_width
    dset["movie_blob/format"] = get_exposure_format(data_format, voxel_type)
    dset["movie_blob/import_sig"] = get_import_signatures(paths)

    # Note: Some of these may also be read from included per-micrograph XML files
    dset["mscope_params/accel_kv"] = 300
    dset["mscope_params/cs_mm"] = 2.7
    dset["mscope_params/total_dose_e_per_A2"] = 60
    dset["mscope_params/exp_group_id"] = i
    dset["mscope_params/defect_path"] = ""

    # <directory> starts with "/"; drop it so the path stays under root_dir
    gain_path = str(root_dir / directory[1:] / "gain" / "CountRef.mrc")
    dset["gain_ref_blob/path"] = gain_path
    dset["gain_ref_blob/shape"] = (image_height, image_width)

    job.save_output(output_name, dset)

job.stop()
The above will result in an External job with two outputs, images_0 and images_1. Use these for further processing.
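One detail of the file-matching step is worth a closer look: glob returns matches in arbitrary order and returns an empty list when nothing matches, which is easy to miss. A minimal sketch of a stricter wrapper follows. The helper name resolve_pattern and the file names are hypothetical, and the demonstration runs on a temporary directory instead of the real dataset.

```python
import tempfile
from glob import glob
from pathlib import Path


def resolve_pattern(root_dir, file_pattern):
    """Expand a pattern relative to root_dir into a sorted list of paths."""
    paths = sorted(glob(str(Path(root_dir) / file_pattern)))
    if not paths:
        raise FileNotFoundError(f"No files match {file_pattern!r} under {root_dir}")
    return paths


# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    movie_dir = Path(tmp) / "data" / "set_a"
    movie_dir.mkdir(parents=True)
    for n in (3, 1, 2):
        (movie_dir / f"movie_{n:04d}.tif").touch()

    found = [Path(p).name for p in resolve_pattern(tmp, "data/set_a/movie_*.tif")]
    print(found)

    # A pattern with no matches raises instead of silently returning []
    try:
        resolve_pattern(tmp, "data/set_a/*.mrc")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

Sorting keeps the row order of the resulting dataset reproducible between runs, and failing early on an empty match points directly at a wrong root directory or pattern.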
Note
When importing single-frame micrographs, use slot micrograph_blob instead of movie_blob.
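When a file mixes movies and single-frame micrographs, the slot could be chosen per image set. The sketch below assumes that framesPerImage greater than 1 means a movie and that the same field suffixes (such as path) apply under both slots; both are assumptions to check against the CryoSPARC documentation, and the helper names blob_slot_for and blob_field are hypothetical.

```python
def blob_slot_for(frames_per_image):
    """Pick the blob slot name from the number of frames per image.

    Assumption: image sets with more than one frame per image are movies;
    everything else is treated as single-frame micrographs.
    """
    if frames_per_image < 1:
        raise ValueError("frames_per_image must be at least 1")
    return "movie_blob" if frames_per_image > 1 else "micrograph_blob"


def blob_field(slot, field):
    """Build a dataset field name such as "movie_blob/path"."""
    return f"{slot}/{field}"


# 80 frames per image, as in the first image set of this example
movie_field = blob_field(blob_slot_for(80), "path")
# A single frame per image selects the micrograph slot instead
micrograph_field = blob_field(blob_slot_for(1), "path")
print(movie_field, micrograph_field)
```

The returned slot name would then replace the hard-coded "movie_blob" both in the slots list passed to add_output and in the field names assigned on the dataset.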