Import from EPU XML File

6. Import from EPU XML File

This example uses the EMPIAR-10409 dataset to demonstrate how to import multiple movie or micrograph datasets from an EPU-generated XML file.

First, initialize a connection to CryoSPARC and find the project.

from cryosparc.tools import CryoSPARC

cs = CryoSPARC(host="cryoem5", base_port=40000)
assert cs.test_connection()

project = cs.find_project("P251")
Connection succeeded to CryoSPARC command_core at http://cryoem5:40002
Connection succeeded to CryoSPARC command_vis at http://cryoem5:40003

Create an external job that will receive each set of images in the XML file as a separate output.

job = project.create_external_job("W7", title="Import Image Sets")

Load the EPU-generated XML file from disk. Also define some helper functions to access the contents of an XML tree.

from pathlib import Path
from xml.dom import minidom

root_dir = Path("/bulk6/data/EMPIAR2/10409/10409")
with open(root_dir / "10409.xml", "r") as f:
    doc = minidom.parse(f)


def get_child(node, child_tag):
    return node.getElementsByTagName(child_tag)[0]


def get_child_value(node, child_tag):
    return get_child(node, child_tag).firstChild.nodeValue.strip()
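As a quick check of how these helpers behave, the sketch below runs them on a small inline XML string instead of the real file. The `sample_xml` content is made up for illustration and only mirrors tag names used in this example. Note that `getElementsByTagName` searches all descendants, not just direct children, so `get_child` returns the first match anywhere beneath the node.

```python
from xml.dom import minidom

# Hypothetical miniature document; only the tag names mirror the real file
sample_xml = """<entry>
    <imageSet>
        <name> Example movies </name>
        <dimensions>
            <pixelWidth>0.834</pixelWidth>
        </dimensions>
    </imageSet>
</entry>"""


def get_child(node, child_tag):
    # First descendant element with the given tag (IndexError if absent)
    return node.getElementsByTagName(child_tag)[0]


def get_child_value(node, child_tag):
    # Text content of that element with surrounding whitespace removed
    return get_child(node, child_tag).firstChild.nodeValue.strip()


sample_doc = minidom.parseString(sample_xml)
image_set = get_child(sample_doc, "imageSet")

# <pixelWidth> is nested inside <dimensions> but is still found from <imageSet>
name = get_child_value(image_set, "name")
pixel_width = float(get_child_value(image_set, "pixelWidth"))
print(name, pixel_width)
```

Because the lookup is by descendant, passing the `<dimensions>` node explicitly (as the import loop in this example does) is not strictly required, but it avoids picking up a same-named tag elsewhere in the image set.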

The XML file has the following structure (parts truncated for brevity):

 <entry xmlns="http://pdbe.org/empiar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ftp://ftp.ebi.ac.uk/pub/databases/emtest/empiar/schema/empiar.xsd" accessionCode="EMPIAR-10409" public="true">
    <admin>
        ...
    </admin>
    ...
    <imageSet>
        <name>Unaligned TIF movies of SARS-CoV2 RdRp in complex with nsp7, nsp8 and RNA (part 1)</name>
        <directory>/data/data_tilt30_round1</directory>
        <category>micrographs - multiframe</category>
        <headerFormat>TIFF</headerFormat>
        <dataFormat>TIFF</dataFormat>
        <numImagesOrTiltSeries>3092</numImagesOrTiltSeries>
        <framesPerImage>80</framesPerImage>
        <voxelType>UNSIGNED BYTE</voxelType>
        <dimensions>
            <imageWidth>5760</imageWidth>
            <pixelWidth>0.834</pixelWidth>
            <imageHeight>4092</imageHeight>
            <pixelHeight>0.834</pixelHeight>
        </dimensions>
        <details>...</details>
        <segmentationList/>
        <micrographsFilePattern>data/data_tilt30_round1/HH691_funky_RNA_tilt30_*.tif</micrographsFilePattern>
        <pickedParticlesFilePattern>data/data_tilt30_round1/matching/HH691_funky_RNA_tilt30_*_SARSCoV2_nsp12_net_4.star</pickedParticlesFilePattern>
        <pickedParticlesDirectory>data/data_tilt30_round1/matching/</pickedParticlesDirectory>
    </imageSet>
    <imageSet>
        ...
    </imageSet>
    ...
</entry>
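Before deciding which image sets to import, it can help to list what the file contains. The following sketch collects the name, category and expected image count of each `<imageSet>`. It uses a cut-down inline document rather than the real 10409.xml: the first entry is abbreviated from the listing above, the second entry is an invented placeholder, and `text_of` and `summarize_image_sets` are hypothetical helper names.

```python
from xml.dom import minidom

# Cut-down stand-in for the real file; the second entry is invented
sample_xml = """<entry accessionCode="EMPIAR-10409">
    <imageSet>
        <name>Unaligned TIF movies (part 1)</name>
        <category>micrographs - multiframe</category>
        <numImagesOrTiltSeries>3092</numImagesOrTiltSeries>
    </imageSet>
    <imageSet>
        <name>Placeholder image set</name>
        <category>placeholder category</category>
        <numImagesOrTiltSeries>10</numImagesOrTiltSeries>
    </imageSet>
</entry>"""


def text_of(node, tag):
    # Hypothetical helper: stripped text of the first descendant with this tag
    return node.getElementsByTagName(tag)[0].firstChild.nodeValue.strip()


def summarize_image_sets(doc):
    # One (name, category, count) tuple per <imageSet>, in document order
    return [
        (
            text_of(node, "name"),
            text_of(node, "category"),
            int(text_of(node, "numImagesOrTiltSeries")),
        )
        for node in doc.getElementsByTagName("imageSet")
    ]


summary = summarize_image_sets(minidom.parseString(sample_xml))
for index, (name, category, count) in enumerate(summary):
    print(index, category, count, name)
```

The printed index is the position of each image set in the document, which is what a positional slice over the `<imageSet>` tags selects on.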

Find all <imageSet> tags and take only the first two, which correspond to two sets of unaligned TIFF movie files. For each image set:

  1. Use the helper functions to get the values of various tags available in this dataset

  2. Use the glob module to retrieve the relevant list of movie files

  3. Add an exposure output to the job and allocate a dataset with the relevant fields

  4. Populate the required fields

  5. Save the output to the job

from glob import glob

from cryosparc.tools import get_exposure_format, get_import_signatures

job.start()

for i, node in enumerate(doc.getElementsByTagName("imageSet")[:2]):
    directory = get_child_value(node, "directory")
    file_pattern = get_child_value(node, "micrographsFilePattern")
    data_format = get_child_value(node, "dataFormat")
    voxel_type = get_child_value(node, "voxelType")
    frames_per_image = int(get_child_value(node, "framesPerImage"))

    dimensions_node = get_child(node, "dimensions")
    pixel_width = float(get_child_value(dimensions_node, "pixelWidth"))
    image_width = int(get_child_value(dimensions_node, "imageWidth"))
    image_height = int(get_child_value(dimensions_node, "imageHeight"))

    paths = glob(str(root_dir / file_pattern))
    output_name = f"images_{i}"
    dset = job.add_output(
        type="exposure",
        name=output_name,
        slots=["movie_blob", "mscope_params", "gain_ref_blob"],
        alloc=len(paths),
    )

    dset["movie_blob/path"] = paths
    dset["movie_blob/shape"] = (frames_per_image, image_height, image_width)
    dset["movie_blob/psize_A"] = pixel_width
    dset["movie_blob/format"] = get_exposure_format(data_format, voxel_type)
    dset["movie_blob/import_sig"] = get_import_signatures(paths)

    # Note: Some of these may also be read from included per-micrograph XML files
    dset["mscope_params/accel_kv"] = 300
    dset["mscope_params/cs_mm"] = 2.7
    dset["mscope_params/total_dose_e_per_A2"] = 60
    dset["mscope_params/exp_group_id"] = i
    dset["mscope_params/defect_path"] = ""

    gain_path = str(root_dir / directory[1:] / "gain" / "CountRef.mrc")
    dset["gain_ref_blob/path"] = gain_path
    dset["gain_ref_blob/shape"] = (image_height, image_width)

    job.save_output(output_name, dset)

job.stop()

The above results in an External job with two outputs, images_0 and images_1. Use these for further processing.
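One detail in the loop above is the `directory[1:]` slice. The `<directory>` value begins with a slash, and pathlib treats an absolute right-hand operand as a replacement for everything to its left, so the leading slash has to be removed before joining. A minimal sketch with the paths from this example, using `PurePosixPath` so that it behaves the same on any operating system:

```python
from pathlib import PurePosixPath

root_dir = PurePosixPath("/bulk6/data/EMPIAR2/10409/10409")
directory = "/data/data_tilt30_round1"  # value of <directory> in the XML

# Joining an absolute path discards root_dir entirely
naive = root_dir / directory

# Dropping the leading slash keeps the result under root_dir
joined = root_dir / directory[1:] / "gain" / "CountRef.mrc"

# lstrip("/") is a slightly more defensive alternative that also
# works when the value has no leading slash
defensive = root_dir / directory.lstrip("/") / "gain" / "CountRef.mrc"

print(naive)
print(joined)
```

The `lstrip("/")` variant is only a suggestion for files where the `<directory>` value may or may not start with a slash; for this dataset both forms give the same path.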

Note

When importing single-frame micrographs, use the micrograph_blob slot instead of movie_blob.
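If an XML file mixes movies and single-frame micrographs, the slot name could be chosen per image set. The sketch below shows one possible way to do that from the `<framesPerImage>` value. `blob_slot_for` is a hypothetical helper, not part of cryosparc-tools, and it assumes that a frame count of 1 means a single-frame micrograph.

```python
def blob_slot_for(frames_per_image):
    # Hypothetical helper: assumes frames_per_image == 1 marks a
    # single-frame micrograph and anything larger marks a movie
    if frames_per_image < 1:
        raise ValueError(f"invalid frame count: {frames_per_image}")
    return "movie_blob" if frames_per_image > 1 else "micrograph_blob"


# 80 frames per image, as in the first image set of this example
slot = blob_slot_for(80)
path_field = f"{slot}/path"
print(slot, path_field)
```

The resulting slot name would then replace the hard-coded "movie_blob" both in the `slots` list passed to `add_output` and in the field names assigned on the dataset.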