Data Curation Exercise#

DMSC Summer SChool#

Today is teh 5th day of the DMSC Summer School. You should have created a new simulated data, reduced them and analized them.
In each one of those steps, you have save on eor multiple files and took notes about the work that you have performed.
Now it is time to curate your data and make them FAIR.

In this notebook, we ask you to perform the following tasks:

  1. Decide which data files you would like to upload in SciCat and make them available to your class mates

  2. Organize them in datasets

  3. For each datasets that you have decided to create, create a set of metadata that you think is relevant to find the dataset and decide if it is relevant for your needs

  4. Create the datasets locally, including metadata and files

  5. Upload the datasets to SciCat

  6. Save the pids of the datasets created

  7. Validate the new datasets through SciCat Frontend or a Jupyter Notebook leveraging Scitacean

You can work on the tasks alone, although we suggest to work on them in a group of two/three people. It helps with deciding in defining your datasets and brainstorming which metadata are important to make your data FAIR.

If you have any questions about python, Scitacean and SciCat, do not hesitate to ask.

This notebook has been prepopulated with the necessary code to connect to SciCat and instatiate the Scitacean client.
We highlighted the cells where you need to fill in the blanks in order to perform the tasks listed above.
They are titled Section n.
We assumed that only one dataset will be created. If you need to create more than one, please duplicate cells as needed.

Good Luck#

Load standard libraries

import sys
import os
import uuid

URL of the scicat instance containing the data

scicat_instance = "https://staging.scicat.ess.eu/api/v3"

Valid Authentication token
(Also called access token or SciCat token)
Follow the steps listed below to obtain the token,

  • visit (ESS SciCat staging environment)[https://staging.scicat.ess.eu]

  • log in using the credentials provided

  • go to User->settings page,

  • and click on the copy to clipboard icon added at the end of the SciCat Token .

SciCat User Settings

Access token example:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJfaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJyZWFsbSI6ImxvY2FsaG9zdCIsInVzZXJuYW1lIjoiaW5nZXN0b3IiLCJlbWFpbCI6InNjaWNhdGluZ2VzdG9yQHlvdXIuc2l0ZSIsImVtYWlsVmVyaWZpZWQiOnRydWUsImF1dGhTdHJhdGVneSI6ImxvY2FsIiwiaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJpYXQiOjE2OTIwODc0ODUsImV4cCI6MTY5MjA5MTA4NX0.Phca4UF7WKY367-10Whgwd5jaFjiPku6WsgiPeDh_-o

IMPORTANT: make sure to replace <YOUR_SCICAT_TOKEN> with the current token retrieved from SciCat

token="<YOUR_SCICAT_TOKEN>"

Import Scitacean For more information please check the official repository and documentation

from scitacean import Client, Dataset
from scitacean.transfer.sftp import SFTPFileTransfer
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 1
----> 1 from scitacean import Client, Dataset
      2 from scitacean.transfer.sftp import SFTPFileTransfer

ModuleNotFoundError: No module named 'scitacean'

User name and access key used to access files.
The ssh key file is provided at the beginning of the session.

sftp_username = "dss2023"
sftp_key_filename = "<PATH_OF_THE_SSH_KEY_FILE>"

Function to perform some magic and establish connection to the data repository

def connect(host, port):
    from paramiko import SSHClient, AutoAddPolicy

    client = SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(AutoAddPolicy())
    client.connect(
        hostname=host, 
        username=sftp_username,
        key_filename=sftp_key_filename,
        timeout=1)
    return client.open_sftp()

Instantiate scitacean client

client = Client.from_token(
    url=scicat_instance,
    token=token,
    file_transfer=SFTPFileTransfer(
        host="sftpserver2.esss.dk",
        connect=connect
    ))

We need a unique name for the folder where to upload the data to.
We use UUID to achieve that.

run_uuid = str(uuid.uuid4())

Section 1.#

Create a local Scitacean dataset of type raw.
We already included the main template, please populate all the needed fields.

Here is a list of some of the available dataset fields:

  • name

  • description

  • type

  • contact_email

  • principal_investigator

  • owner

  • owner_email

  • data_format

  • is_published

  • keywords

Some of this fields are required.

Please check Scitacean documentation for more information

dataset = Dataset(
    creation_location='/ESS/DMSC/Summer_School',
    owner_group='dmsc_summer_school',
    access_groups=['ess','dram','swap'],
    instrument_id=None,
    techniques=[],
    keywords=[
        'DMSC Summer School', 
        '2023', 
        'DMSC Summer School 2023'
    ],
    license='unknown',
    proposal_id=None,    
    source_folder='/ess/data/dmsc_summer_school/2023/upload/' + run_uuid,
    source_folder_host="SpectrumScale.esss.dk",
    <ADD_MISSING_FIELDS>
)

Section 2.#

Add your scientific metadata to the new dataset.

Remember the dataset scientific metadata is a dictionary where the key is the name of the metadata and the value is a nested dictionary containing the value and the unit.
If unit is not used, leave it empty.

"your-metadata-key" : {
    "value" : "your-metadata-value",
    "unit" : "metadata-SI-unit"
}
dataset._meta = {
    <INSERT_YOUR_METADATA>
}

Section 3#

add the files that you want to link to this dataset.
You need to repeat this step for each file that you would like to include in the dataset

dataset.add_local_files(
    <LOCAL_PATH_TO_THE_FILE>, 
    base_path=<PORTION_OF_THE_PATH_THAT_NEEDS_TO_BE_REMOVE>
)

Section 4#

visually inspect your newly created dataset

Section 5#

We are now ready to upload the dataset to SciCat and save the new dataset to variable uploaded_dataset

uploaded_dataset = ...

Section 6#

Extract the SciCat-assigned pid from the new dataset and print it

dataset_pid = ...

Section 7#

Verify the uploded dataset shows the correct pid with a visual inspection

Section 8#

Reload the newly created dataset from SciCat and validate that matches the information that you specified when you cretaed it.

verification_dataset = client.get_dataset(
    "<DATASET_PID_FROM_SECTION_6>"
)

Section 9#

Check directly on Scicat how your new dataset looks like in the webUI