Data Management¶
General Introduction¶
The topic of data management can seem rather trivial for beginners, but starting out with best practices will make your future work significantly easier and help ensure your data is well-organized and secure.
Where are we now?¶
Before we start a lesson it is usually helpful to reflect on what we already know about a topic and on what the lesson might try to teach us.
So please take a few minutes to reflect on the concept of data management with the following questions.
1.1 What is your understanding of data management?
1.2 What could you possibly learn?
1.3 How do you usually store/manage data?
Note
Feel free to do this in your head or in a separate document. Remember, to interactively engage with the material, either open it in MyBinder (the small rocket button at the top of the website) or [download the course material](link to course zip), go through the [setup process](link to setup) and open this file (i.e. digital_literacy.ipynb in the introduction folder) using Jupyter Notebooks or VS Code.
Roadmap¶
Goals
Data management
Data management plan
Setup local folder structure (BIDS)
Data storage
Open Brain Consent Form/GDPR & you
Open data
Goals¶
Specific goals of this first session are to:
gain a general understanding of data management
get familiar with the process of creating a data management plan
provide a checklist to follow
understand why project design is an essential step
Data Management¶
Data Management Plan¶
An initial step when starting any research project should be to set up a data management plan (DMP). This helps you flesh out, describe and document exactly what data you want to collect, what you’ll be doing with it, and where and how it will be stored and eventually shared.
A DMP helps you stay organized and reduces the potential for surprises in the future (e.g. due to limited data storage capacity or unexpected costs). It is also at times required, e.g. by your university or by agencies funding research endeavours.
Motivation?¶
For the public good
if widely adopted, it becomes inherently easier to reproduce code and analysis pipelines built by others, thereby reducing scientific waste and improving efficiency
For yourself
You are likely the future user of the data and data analysis pipelines you’ve developed, so keeping your file structure standardized removes the need to remember where you’ve stored specific pieces of data.
Enables and simplifies collaboration: allows readers/collaborators to gain a quick understanding of what data you’ll be collecting, where this data can be found and what exactly you’re planning on doing with it
Reviewers and funding agencies like to see clear, reproducible results
Open-science based funding opportunities and awards available (for instance: OHBM Replication Award, Mozilla Open Science Fellowship, Google Summer of Code, and so on.)
FAIR principles¶
FAIR principles: The FAIR principles stand for Findable, Accessible, Interoperable, and Reusable. They aim to make research data more accessible and reusable by promoting the use of standardized metadata, persistent identifiers, and open data formats, allowing research not only to be shared, but also to actually be found.
What to consider in your data management plan¶
Most universities provide templates, tools or guidance on how to create a DMP, so it is a good idea to check your university’s website or get in contact with your local library.
For the Goethe University Frankfurt, researchers can use the following tool: (German) Datenmanagementpläne mit dem Goethe-RDMO
There are also public tools to collect and share DMPs, such as DMPonline for the UK.
There you can also find publicly published plans that you can use to check what your DMP could/should contain.
The Turing Way project lists the following considerations when creating a DMP. Many of the specific points of this checklist have already been discussed in the previous steps.
Turing way DMP checklist¶
1. Roles and Responsibilities of project team members
- discuss who is responsible for different tasks related to project/data management
- e.g. who is responsible for maintaining the dataset, who takes care of the research ethics review
2. Type and size of data collected and documentation/metadata generated
- i.e. raw, preprocessed or finalised data (these lead to different considerations, as e.g. raw data can generally not be openly shared)
- expected size of dataset
- how well the dataset is described in additional (metadata) files
- what abbreviations are used, how e.g. experimental conditions are coded
- where, when, how was data collected
- description of sample population
3. Type of data storage used and back-up procedures that are in place
- where is data stored
- data protection procedures
- how are backups handled, i.e. location and frequency of backups
- will a version control system be used?
- directory structure, file naming conventions
4. Preservation of the research outputs after the project
- public repositories or local storage
- e.g. OSF
5. Reuse of your research outputs by others
- code and coding environment shared? (e.g. GitHub)
- conditions for reuse of collected dataset (licensing etc.)
6. Costs
- potential costs of equipment and personnel for data collection
- costs for data storage
To create your DMP you can either use the tools discussed above or create a first draft by noting your thoughts/expectations regarding the above checklist in a document.
Additional material¶
Find out more about how to organize and store data in the chapter on Data management
(Youtube) University of Wisconsin Data Services - Data services playlist
2. Setup local folder structure¶
It is recommended to adopt a standardized approach to structuring your data, as this not only helps you stay consistent, but also allows you and potential collaborators to easily identify where specific data is located.
General File Naming Conventions¶
To make sure that it is easy to understand what a file contains and to make files easier for computers to process, you should follow certain naming conventions (a short sketch follows the list below):
- be consistent
- use the date in the format YYYYMMDD
- use underscores (`_`) instead of spaces, or
- use camelCase (lowercase first word, capitalized first letter of each following word) instead of spaces
- avoid spaces, special characters `(+-"'|?!~@*%{[<>)`, punctuation `(.,;:)`, slashes and backslashes `(/\)`
- avoid "version" names, e.g. v1, vers1, final, final_really etc. (instead use a version control system like github)
Establish a folder hierarchy¶
Before you begin working on your project you should already set up the local folder structure on your system. This helps you stay organized and saves you a lot of work in the long run.
Your folder hierarchy of course depends on your project’s specific needs (e.g. folders for data, documents, images etc.) and should be as clear and consistent as possible. The easiest way to achieve this is to copy and adapt an existing folder hierarchy template for research projects.
One example (including a template) is the Transparent project management template for the OSF platform by C.H.J. Hartgerink
The contained folder structure would then look like this:
project_name/
├── archive/
├── analyses/
├── bibliography/
├── data/
├── figure/
├── functions/
├── materials/
├── preregister/
├── submission/
└── supplement/
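If you prefer to set such a template up programmatically rather than by hand, a minimal sketch using only the Python standard library could look like this (the folder names are taken from the template above; the script itself is just an illustration):

```python
from pathlib import Path

folders = [
    "archive", "analyses", "bibliography", "data", "figure",
    "functions", "materials", "preregister", "submission", "supplement",
]

project_root = Path("project_name")  # adapt to your actual project name
for folder in folders:
    # parents=True also creates the project root; exist_ok=True makes re-runs safe
    (project_root / folder).mkdir(parents=True, exist_ok=True)
```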
Another example would be the “research project structure” by Nikola Vukovic
Where the folder hierarchy would look like this:
project_name/
├── projectManagement/
│   ├── proposals/
│   ├── finance/
│   └── reports/
├── EthicsGovernance/
│   ├── ethicsApproval/
│   └── consentForms/
├── ExperimentOne/
│   ├── inputs/
│   ├── data/
│   ├── analysis/
│   └── outputs/
└── Dissemination/
    ├── presentations/
    ├── publications/
    └── publicity/
Incorporating experimental data/BIDS standard¶
Both of these examples provide an “experiment folder”, but tend to establish their own conventions. Since we aim to make our folder structure as easily understandable, interoperable (e.g. between systems and programs) and reproducible as possible, it is best to adapt our “experiment folder” to a community standard.
Working with neuroimaging data makes this setup a little more involved. For most experimental data the most promising approach is BIDS (Brain Imaging Data Structure). BIDS is a community-driven specification that facilitates the organization and sharing of neuroimaging data. Originally conceptualized as a standardized format for the organization and description of fMRI data, it has since been extended to other kinds of neuroimaging data (e.g. EEG, MEG, and iEEG) and can additionally be used to store behavioral data. Using the BIDS standard also simplifies the integration of your data into most neuroscience analysis pipelines.
The BIDS standard defines a specific folder hierarchy for organizing neuroimaging data, with several separate folders, each serving a specific purpose. As BIDS is mostly concerned with our data, it provides a standardized way to organize the `data` folder in the diagram above, which would then be structured in the following way:
data/
├── derivatives/
└── subject/
    └── session/
        └── datatype/
`/derivatives`: contains processed data, such as the results of statistical analyses.

`/sub-` folders: each contains the data from one subject. Each subject is identified by a unique code that starts with “sub-”. This folder contains subfolders for each imaging session, which in turn contain separate folders for each data type (`datatype` in the diagram above) recorded for this specific subject.

Neuroimaging datasets mostly contain data from more than one subject; the data folder will therefore typically contain multiple subject folders, named sub-01, sub-02 ... sub-0n. This could look something like this:
project_data/
├── dataset_description.json
├── participants.tsv
├── derivatives/
├── sub-01/
│   ├── anat/
│   │   ├── sub-01_inplaneT2.nii.gz
│   │   └── sub-01_T1w.nii.gz
│   └── func/
│       ├── sub-01_task-X_run-01_bold.nii.gz
│       ├── sub-01_task-X_run-01_events.tsv
│       ├── sub-01_task-X_run-02_bold.nii.gz
│       ├── sub-01_task-X_run-02_events.tsv
│       ├── sub-01_task-X_run-03_bold.nii.gz
│       └── sub-01_task-X_run-03_events.tsv
├── sub-02/
│   ├── anat/
│   │   ├── sub-02_inplaneT2.nii.gz
│   │   └── sub-02_T1w.nii.gz
│   └── func/
│       ├── sub-02_task-X_run-01_bold.nii.gz
│       ├── sub-02_task-X_run-01_events.tsv
│       ├── sub-02_task-X_run-02_bold.nii.gz
│       └── sub-02_task-X_run-02_events.tsv
...
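As a rough illustration (not a replacement for dedicated BIDS conversion tools), the skeleton of such a subject/datatype hierarchy could be generated with a few lines of Python; the folder names and counts below are placeholders:

```python
from pathlib import Path

data_root = Path("project_data")  # illustrative dataset root
n_subjects = 2
datatypes = ("anat", "func")

for i in range(1, n_subjects + 1):
    for datatype in datatypes:
        # creates e.g. project_data/sub-01/anat, project_data/sub-01/func, ...
        (data_root / f"sub-{i:02d}" / datatype).mkdir(parents=True, exist_ok=True)
```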
We’ll not go into detail about the different neuroimaging files here (the `.nii.gz` files), but there is another thing we can learn from this standard: the inclusion of metadata.
In the above diagram you find 2 metadata files:
The `participants.tsv` file contains information about the participants in the study, such as demographics, behavioral data, and other relevant information. It typically contains a row for each participant and columns that describe various aspects of each participant, such as their age, sex, handedness, cognitive scores, and any clinical diagnoses.
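Because it is a plain tab-separated text file, `participants.tsv` can be inspected directly, for example with pandas (assuming pandas is installed and the file is in your working directory):

```python
import pandas as pd

# BIDS .tsv files are tab-separated; each row describes one participant
participants = pd.read_csv("participants.tsv", sep="\t")
print(participants.head())  # e.g. columns such as participant_id, age, sex
```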
The `dataset_description.json` file contains important metadata about the entire dataset, such as:
- Name: A brief and informative name for the dataset.
- License: The license under which the data are released (e.g., CC-BY-SA).
- Authors: A list of individuals who contributed to the dataset.
- Funding: Information about the funding sources that supported the creation of the dataset.
- Description: A detailed description of the dataset, including information about the data collection methods, study participants, and any relevant processing or analysis steps that were taken.
- Subjects: A list of the subjects (i.e., study participants) included in the dataset, including information about their demographics and any relevant clinical information.
- Sessions: A list of the scanning sessions or experimental sessions that were conducted for each subject, including information about the acquisition parameters and any relevant task or stimulus information.
- Task: Information about the task or stimulus used in the experiment, if applicable.
- Modality: Information about the imaging modality used to acquire the data (e.g., fMRI, MRI, MEG, etc.).
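Since `dataset_description.json` is a plain JSON file, a minimal version can be written with Python's built-in `json` module; the field values below are placeholders you would replace with your own details:

```python
import json

description = {
    "Name": "Example task dataset",   # placeholder values
    "BIDSVersion": "1.8.0",
    "License": "CC-BY-SA",
    "Authors": ["Firstname Lastname"],
    "Funding": ["Grant XYZ"],
}

with open("dataset_description.json", "w") as f:
    json.dump(description, f, indent=2)
```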
BIDS File naming conventions¶
The BIDS format also specifies how we should name our files, to make sure that others understand what content to expect in a specific file and to make it easier to use automated tools that expect certain file names, e.g. for data analyses.
The general file naming convention looks like this:
`key1-value1_key2-value2_suffix.extension`
Where `key-value` pairs are separated by underscores (e.g. `sub-01_task-01`), followed by an underscore and a suffix describing the datatype (e.g. `_events`), which is followed by the file extension (e.g. `.tsv`). Resulting in:
`sub-01_task-01_events.tsv`
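This key-value pattern is also easy to generate programmatically; here is a small sketch (the function `bids_name` is ours for illustration, not part of any BIDS tooling):

```python
def bids_name(entities, suffix, extension):
    """Join key-value pairs with underscores, then append suffix and extension,
    e.g. {"sub": "01", "task": "01"} -> 'sub-01_task-01_events.tsv'."""
    key_values = "_".join(f"{key}-{value}" for key, value in entities.items())
    return f"{key_values}_{suffix}{extension}"

print(bids_name({"sub": "01", "task": "01"}, "events", ".tsv"))
# -> sub-01_task-01_events.tsv
```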
It’s recommended that you adopt this file naming system and apply it to all of your files, e.g. your project report could be called:
firstname-lastname_project-report.txt
You may also want to add a date to non-data files (ideally in the year-month-day format (YYYYMMDD)), e.g.
YYYYMMDD_firstname-lastname_project-report.txt
Avoid adding descriptions such as `version_01` or `final_version` etc.; instead you should rely on digital tools with version history functionality such as Google Docs. In the next section we’ll further introduce the concept of a version control system to avoid this issue altogether.
To learn more¶
Check out the BIDS starter-kit
Data storage¶
Now that we have a standardized data format on our system, let’s turn towards the topic of data storage. While for small projects or purely behavioral research your local file system may initially seem sufficient, for larger datasets and file sizes, such as in neuroimaging, you’ll quickly run out of storage. The same of course applies when you accumulate a larger number of smaller projects throughout your career. Further, do we really want to store data simply on some laptop until it is forgotten, deleted, and generally inaccessible to others who might make use of it?
(Un)fortunately, sharing data falls under the jurisdiction of local laws, e.g. the European Union’s General Data Protection Regulation (GDPR). It is therefore essential to make sure that where and how you store or share the data you will be collecting, or have already collected, complies with the law.
1. Open Brain Consent Form/GDPR & you¶
Research, especially in psychology or neuroscience, often depends on the collection of new data from human subjects. Given the complexities of the (European) General Data Protection Regulation (GDPR), and to avoid hassles with local ethics committees, the “Open Brain Consent Form” was created to be ethically and legally watertight while allowing researchers to openly share their collected datasets. While the document was developed for neuroimaging studies, it can be adapted to most forms of data with little effort.
As we are committed to research transparency and open science, we advise you to make use of this document for your own research.
Make open data sharing a no-brainer for ethics committees.
GDPR conform:
Outside of Germany:
When using this template please acknowledge the creators accordingly:
Open Brain Consent working group (2021). The Open Brain Consent: Informing research participants and obtaining consent to share brain imaging data. Human Brain Mapping, 1-7 https://doi.org/10.1002/hbm.25351.
3. Open data¶
Nowadays there are a lot of cloud storage options, which additionally allow researchers to share their data under certain rules and restrictions. The following is an incomplete but representative list of the storage options you may face.
Open Science Repositories: Repositories such as OSF, OpenNeuro, the OpenfMRI database, and the IEEE DataPort provide open access to MRI and EEG datasets, as well as other neuroimaging data.
U.S. National Institutes of Health: The National Institutes of Health (NIH) of the U.S. provides open access to many MRI and EEG datasets through their National Library of Medicine’s National Institutes of Health Data Sharing Repository.
Research Data Repositories: Zenodo, Figshare, and other research data repositories allow scientists to store, share, and publish their data in an open and transparent manner. These repositories are often committed to open access principles and provide a centralized location for data and metadata, as well as version control and preservation features.
Research Collaborations: Collaborative projects, such as the Human Connectome Project, the International Neuroimaging Data-Sharing Initiative (INDI), and the ABIDE (Autism Brain Imaging Data Exchange) project, provide open access to large datasets collected from multiple sites.
Domain-Specific Repositories: There are also domain-specific repositories for various scientific fields, such as the NCBI Sequence Read Archive for genomics data, the European Geosciences Union Data Centre for Earth and Environmental Science data, and the International Astronomical Union’s SIMBAD database for astronomical data. These repositories often have specific requirements for data deposition and sharing, but provide a dedicated space for researchers to share their data in an open and transparent manner.
Acknowledgments:¶
The go-to resource for creating and maintaining scientific projects was created by the Turing Way project. We’ve adapted some of their material for the Data Management Plan section of the lesson above.
The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2). Zenodo. https://doi.org/10.5281/zenodo.7625728
Transparent project management template for the OSF platform by C.H.J. Hartgerink
We’ve adapted some of the directory structure diagrams from the BIDS Standard Guide.
Where are we now?¶
Please take a moment to reflect on if and how your understanding of data management has possibly changed following this lecture.
2.1 How well informed were you with regard to the current topic? (i.e., did you know what to expect?)
2.2 What was the most surprising part of the training/lecture for you? What information in particular stood out to you?