Software containers & data

Learning objectives#

get local data into containers
get online data into containers
bind mount input/output directories to a container during runtime

Motivation#

From time to time, it might be advisable or even necessary to ship some data with your Docker container, e.g. for reproducibility, tutorials, etc. But how do we get data into our Docker containers? There are two ways of equipping your container with additional data:

First, we can copy the data into the container during its build. This way the respective data is permanently stored inside our image. This can come in handy if you want to provide tutorials or user manuals to the end user of your container, or some test data to make sure the container's behaviour matches the expected behaviour.
Second, it is sometimes required to get data into our container at runtime, e.g. if we have set up an automated analysis workflow within our container and want users to provide their own data, such that the analysis can be run on that data inside the container. To achieve this, we can mount a directory from our local computer to a directory within the container. This gives the container access to specific directories on our local computer, e.g. to load data from our machine into the container. Since a mount works in both directions, we can also allow the container to write outputs to specific directories on our local machine.

In the following sections, we will go over both ways of getting data inside your container and provide some practical examples.

Getting Data into a container permanently#

Let’s say we want to add a picture of a whale to our Docker container, because we’re such Docker fans and whales are nothing but awesome. However, we’ve learned in the mounting part of this workshop that the state of a given container cannot be permanently changed.

So, as we’ve already seen, we can use the COPY instruction to add the data (i.e. our .jpg file) to our Docker container during its build.

So let’s first create a new build context and Dockerfile:

 mkdir Desktop/docker_data_container
 touch Desktop/docker_data_container/Dockerfile

Open the file with VS Code and add the following line:

    FROM ubuntu:latest

Now we add the relevant file to our build context, i.e. we move the data we want to include into the same directory as the Dockerfile, e.g.

mv Desktop/happy_whale.jpg Desktop/docker_data_container

Now, we add a line to our Dockerfile that indicates that this image should be copied to a specific location inside our Docker container, e.g. /home/images

COPY ./happy_whale.jpg /home/images/happy_whale.jpg
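The two Dockerfile instructions so far can also be written from the terminal in one step. The following is a minimal sketch, assuming the build context lives in Desktop/docker_data_container as above; the CONTEXT_DIR variable is a hypothetical convenience and not part of the workshop material.

```shell
#!/bin/sh
# CONTEXT_DIR is a hypothetical override; by default we use the build context from above.
context_dir="${CONTEXT_DIR:-Desktop/docker_data_container}"
mkdir -p "$context_dir"

# Write the Dockerfile: base image plus the COPY instruction for our picture.
cat > "$context_dir/Dockerfile" <<'EOF'
FROM ubuntu:latest
COPY ./happy_whale.jpg /home/images/happy_whale.jpg
EOF

# Show the result for a quick visual check.
cat "$context_dir/Dockerfile"
```

COPY creates the destination directory /home/images inside the image if it does not exist yet, so no separate mkdir step is needed in the Dockerfile.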

And you guessed it: time to build!

docker build -t whale_container Desktop/docker_data_container

If we now run our freshly built Docker container and check the contents of /home, we find the folder images and in it our happy_whale.jpg:

(base) Michaels-MacBook-Pro:Desktop me$ docker run -it --rm whale_container bash
root@8e2a056bed3a:/# ls
bin  boot  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
root@8e2a056bed3a:/# cd home/
root@8e2a056bed3a:/home# ls
images
root@8e2a056bed3a:/home# cd images/
root@8e2a056bed3a:/home/images# ls
'happy_whale.jpg'

If this seems tedious or you have to copy a lot of files, you can also copy the entire build context at once, i.e. everything in the directory that is passed to the docker build command. To do so, replace the COPY line above with:

    COPY . /home/images
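Note that COPY . copies everything in the build context, including the Dockerfile itself. If that is not wanted, Docker reads a .dockerignore file from the root of the build context and leaves out the paths listed there. A minimal sketch, again assuming the build context from above (CONTEXT_DIR is a hypothetical override, not part of the workshop material):

```shell
#!/bin/sh
# CONTEXT_DIR is a hypothetical override; by default we use the build context from above.
context_dir="${CONTEXT_DIR:-Desktop/docker_data_container}"
mkdir -p "$context_dir"

# Keep the build files out of /home/images when using "COPY . /home/images".
cat > "$context_dir/.dockerignore" <<'EOF'
Dockerfile
.dockerignore
EOF

# Show the result for a quick visual check.
cat "$context_dir/.dockerignore"
```

Both files are still used for the build itself; they are only skipped by COPY.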

Incorporating online data#

In case you don’t have or don’t want everything that should go into the Docker container stored locally, you can also download data during the build, e.g. using the command-line tool curl. This can be very helpful when pulling data from an online repository.

Simply add the respective lines to the Dockerfile:

    RUN apt-get update && apt-get install curl -y
    RUN curl --output /home/images/happy_whale_2.jpg  https://cdn.pixabay.com/photo/2023/09/25/06/48/whale-8274342_1280.jpg

After rebuilding the image and checking the outcome, we see that everything worked like a charm!

    (base) Michaels-MacBook-Pro:Desktop me$ docker run -it --rm whale_container bash
    root@b41d20e42fbb:/# ls
    bin  boot  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
    root@b41d20e42fbb:/# cd home/
    root@b41d20e42fbb:/home# cd images/
    root@b41d20e42fbb:/home/images# ls
    happy_whale.jpg  happy_whale_2.jpg

Mounting data inside and outside of your container#

Well, all of you should have heard plenty about mounting in our quickstart section. Once again, mounting describes a mapping from paths outside the container (e.g. your local machine or online data repositories) to paths inside the container.

Now, who remembers the flag we have to use within the `docker run` command to enable **mounting**? ...

Solution

…correct, you have to use the -v flag within the docker run command to specify the mounted directories. The path outside the container has to be an absolute path, since Docker would otherwise interpret it as the name of a volume. The flag can be used as follows:

docker run -v /path/outside/container:/path/inside/container name_of_image

You can also restrict the rights of mounted paths, e.g. to read-only in case any modification on your local system should be prevented. This can be done by appending :ro. It should look something like this:

docker run -v /path/outside/container:/path/inside/container:ro name_of_image

Note on server system usage: If you use a mounted directory to store output produced inside your container on your local filesystem, make sure that you have administrator rights on your machine to access/modify the output. By default, the container writes the output as root (the administrator), so you will not be able to access/modify your data if you don’t have superuser rights, e.g. when working on a compute server. One way to avoid this issue is the user flag -u, which can be used within the docker run command as follows:

docker run -u <userid> name_of_image

How do I get my userid? Well, that’s easy! Simply type

id -u <yourusername>

into your terminal and your userid should appear.
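The flags from this section can be combined in a single call. The sketch below only prints the resulting command (a dry run) so that it can be inspected first. The directories, the INPUT_DIR and OUTPUT_DIR variables and the container paths /input and /output are illustrative assumptions; name_of_image is the same placeholder as above.

```shell
#!/bin/sh
# Hypothetical host directories; replace them with absolute paths on your machine.
input_dir="${INPUT_DIR:-$PWD/input}"
output_dir="${OUTPUT_DIR:-$PWD/output}"
image="name_of_image"  # placeholder, as in the commands above

# uid:gid of the calling user, so that files written by the container belong to you.
user_spec="$(id -u):$(id -g)"

# The input is mounted read-only (:ro), the output stays writable.
run_cmd="docker run --rm -u $user_spec -v $input_dir:/input:ro -v $output_dir:/output $image"

# Dry run: print the command instead of executing it.
echo "$run_cmd"
```

Passing uid:gid rather than the uid alone also sets the group of the files that are written. Paths that contain spaces would need extra quoting before the printed command can be pasted into a terminal.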

Practical application#

Probably the most common use case for mounts in neuroscientific research is the transfer of research data into and out of the container.

a better option than permanently copying data into the container, since neuroimaging data is often large
additionally helpful for creating permanent outputs written to your local system, e.g. results of an analysis (remember that every file created inside a container is lost once the container is removed, e.g. right after it stops when it was started with --rm)

For demonstration purposes we have created an image that can be used to convert the MNE sample dataset, a test dataset provided by the MNE community, into the BIDS format, a standardized way of organizing your neuroimaging data. MNE is an open-source tool for M/EEG data processing and visualisation.

You can find the dataset in the ~/docker_workshop_oldenburg directory on your local machine. Let’s check if it’s stored where it’s supposed to be:

aaronreer@FK6P-1158240:~/docker_workshop_oldenburg$ ls
MNE-sample-data

For our container to run the conversion properly, we need to mount the directory on our local system that contains our data to the /input directory of the container. Further, we need to mount the directory to which we want the container to write our BIDS-converted data to the /output directory of the container:

docker run \
-v /home/aaronreer/docker_workshop_oldenburg/MNE-sample-data/:/input/ \
-v /home/aaronreer/docker_workshop_oldenburg/:/output \
aaronreer1/mne_conversion:firsttry
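Before starting the container it can help to check both host directories. With -v, Docker creates a missing host directory on its own, and that directory is then typically owned by root, so a typo in the input path would go unnoticed and the container would see an empty /input. The helper below is a hypothetical sketch and not part of the workshop image:

```shell
#!/bin/sh
# prepare_mounts INPUT_DIR OUTPUT_DIR
# Fails if the input directory is missing and creates the output directory
# as the current user, before docker run is called.
prepare_mounts() {
    if [ ! -d "$1" ]; then
        echo "input directory not found: $1" >&2
        return 1
    fi
    mkdir -p "$2"
}

# Example call with the paths from this section (commented out on purpose):
# prepare_mounts /home/aaronreer/docker_workshop_oldenburg/MNE-sample-data \
#                /home/aaronreer/docker_workshop_oldenburg
```

If the function returns a non-zero exit status, the docker run call can be skipped, e.g. by chaining both with &&.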

Output

    Opening raw data file /input/MEG/sample/sample_audvis_raw.fif...
        Read a total of 3 projection items:
            PCA-v1 (1 x 102)  idle
            PCA-v2 (1 x 102)  idle
            PCA-v3 (1 x 102)  idle
        Range : 25800 ... 192599 =     42.956 ...   320.670 secs
    Ready.
    Opening raw data file /input/MEG/sample/ernoise_raw.fif...
    Isotrak not found
        Read a total of 3 projection items:
            PCA-v1 (1 x 102)  idle
            PCA-v2 (1 x 102)  idle
            PCA-v3 (1 x 102)  idle
        Range : 19800 ... 85867 =     32.966 ...   142.965 secs
    Ready.
    Opening raw data file /input/MEG/sample/sample_audvis_raw.fif...
        Read a total of 3 projection items:
            PCA-v1 (1 x 102)  idle
            PCA-v2 (1 x 102)  idle
            PCA-v3 (1 x 102)  idle
        Range : 25800 ... 192599 =     42.956 ...   320.670 secs
    Ready.
    Opening raw data file /input/MEG/sample/ernoise_raw.fif...
    Isotrak not found
        Read a total of 3 projection items:
            PCA-v1 (1 x 102)  idle
            PCA-v2 (1 x 102)  idle
            PCA-v3 (1 x 102)  idle
        Range : 19800 ... 85867 =     32.966 ...   142.965 secs
    Ready.
    Writing '/output/MNE-sample-data-bids1/README'...
    Writing '/output/MNE-sample-data-bids1/participants.tsv'...
    Writing '/output/MNE-sample-data-bids1/participants.json'...
    Writing of electrodes.tsv is not supported for data type "meg". Skipping ...
    Writing '/output/MNE-sample-data-bids1/dataset_description.json'...
    Writing '/output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/meg/sub-emptyroom_ses-20021206_task-noise_meg.json'...
    Writing '/output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/meg/sub-emptyroom_ses-20021206_task-noise_channels.tsv'...
    Copying data files to sub-emptyroom_ses-20021206_task-noise_meg.fif
    Reserving possible split file sub-emptyroom_ses-20021206_task-noise_split-01_meg.fif
    Writing /output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/meg/sub-emptyroom_ses-20021206_task-noise_meg.fif
    Closing /output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/meg/sub-emptyroom_ses-20021206_task-noise_meg.fif
    [done]
    Writing '/output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/sub-emptyroom_ses-20021206_scans.tsv'...
    Wrote /output/MNE-sample-data-bids1/sub-emptyroom/ses-20021206/sub-emptyroom_ses-20021206_scans.tsv entry with meg/sub-emptyroom_ses-20021206_task-noise_meg.fif.
    Writing '/output/MNE-sample-data-bids1/participants.tsv'...
    Writing '/output/MNE-sample-data-bids1/participants.json'...
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_coordsystem.json'...
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_coordsystem.json'...
    Used Annotations descriptions: ['Auditory/Left', 'Auditory/Right', 'Button', 'Smiley', 'Visual/Left', 'Visual/Right']
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_events.tsv'...
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_events.json'...
    Writing '/output/MNE-sample-data-bids1/dataset_description.json'...
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_meg.json'...
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_channels.tsv'...
    Copying data files to sub-01_ses-01_task-audiovisual_run-1_meg.fif
    Reserving possible split file sub-01_ses-01_task-audiovisual_run-1_split-01_meg.fif
    Writing /output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_meg.fif
    Closing /output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_task-audiovisual_run-1_meg.fif
    [done]
    Writing '/output/MNE-sample-data-bids1/sub-01/ses-01/sub-01_ses-01_scans.tsv'...
    Wrote /output/MNE-sample-data-bids1/sub-01/ses-01/sub-01_ses-01_scans.tsv entry with meg/sub-01_ses-01_task-audiovisual_run-1_meg.fif.
    Writing fine-calibration file to /output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_acq-calibration_meg.dat
    Writing crosstalk file to /output/MNE-sample-data-bids1/sub-01/ses-01/meg/sub-01_ses-01_acq-crosstalk_meg.fif
    |MNE-sample-data-bids1/
    |--- README
    |--- dataset_description.json
    |--- participants.json
    |--- participants.tsv
    |--- sub-01/
    |------ ses-01/
    |--------- sub-01_ses-01_scans.tsv
    |--------- meg/
    |------------ sub-01_ses-01_acq-calibration_meg.dat
    |------------ sub-01_ses-01_acq-crosstalk_meg.fif
    |------------ sub-01_ses-01_coordsystem.json
    |------------ sub-01_ses-01_task-audiovisual_run-1_channels.tsv
    |------------ sub-01_ses-01_task-audiovisual_run-1_events.json
    |------------ sub-01_ses-01_task-audiovisual_run-1_events.tsv
    |------------ sub-01_ses-01_task-audiovisual_run-1_meg.fif
    |------------ sub-01_ses-01_task-audiovisual_run-1_meg.json
    |--- sub-emptyroom/
    |------ ses-20021206/
    |--------- sub-emptyroom_ses-20021206_scans.tsv
    |--------- meg/
    |------------ sub-emptyroom_ses-20021206_task-noise_channels.tsv
    |------------ sub-emptyroom_ses-20021206_task-noise_meg.fif
    |------------ sub-emptyroom_ses-20021206_task-noise_meg.json
    aaronreer@FK6P-1158240:~/data$

When checking our filesystem using the ls command we can observe that a new directory called MNE-sample-data-bids1 has appeared.

aaronreer@FK6P-1158240:~/docker_workshop_oldenburg$ ls
MNE-sample-data  MNE-sample-data-bids1
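Beyond a plain ls, the result can be checked a little more strictly. The following sketch tests for the four top-level files that appear in the directory tree printed by the container; the function name check_bids_output is a hypothetical helper, and this is only a presence check, not a BIDS validation.

```shell
#!/bin/sh
# check_bids_output DIR
# Succeeds only if the top-level files listed in the directory tree above
# are present in DIR.
check_bids_output() {
    bids_dir="$1"
    for name in README dataset_description.json participants.tsv participants.json; do
        if [ ! -f "$bids_dir/$name" ]; then
            echo "missing: $bids_dir/$name" >&2
            return 1
        fi
    done
    echo "found all expected top-level files in $bids_dir"
}

# Example call with the output path from this walkthrough (commented out on purpose):
# check_bids_output "$HOME/docker_workshop_oldenburg/MNE-sample-data-bids1"
```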

Docker & data - discussion#

What would you like to have in your Docker containers?
What type of data are you planning on working with?
Let us know and we’ll go through the respective steps!