
Metadata

last updated at 2022-05-20

What is metadata?

Metadata is "data that provides information about other data" (Merriam-Webster). To put some plant life into this dictionary definition, let us explore metadata with a plant biology example:

Viola investigates the effect of the plant circadian clock on sugar metabolism in W. mirabilis. For her PhD project, which is part of an EU-funded consortium in Prof. Beetroot's lab, she acquires seeds from a South African botanical society. Viola grows the plants under different light regimes, harvests leaves from a two-day time series experiment, extracts polar metabolites as well as RNA, and submits the samples to nearby core facilities for metabolomics and transcriptomics measurements, respectively. After a few weeks of iterative consultation with the facilities' heads as well as the technicians and computational biologists involved, Viola receives a wealth of raw and processed data. From the data she produces figures and wraps everything up to publish the results in the Journal of Wonderful Plant Sciences.

Although oversimplified, every sentence in this example is packed with metadata. All data that describes the acquisition, transformation, analysis or handling of data "from greenhouse to publication", as well as before (Where did the plants originate from?) and beyond (What else was deduced from the same dataset?), is considered metadata. This simple example already illustrates the many types and levels of granularity of metadata, as well as the common sources and the stakeholders in charge of the different tasks acting on it.

Where does metadata come from?

Metadata arises along common research tasks, from project planning and wet-lab work to data acquisition and analysis. However, every project typically also produces metadata beyond the experimental data itself, such as administrative and bibliographic information.

The different types of metadata are collected from various analog and digital sources, including project grants, electronic lab notebooks, machines used for data acquisition, and software used for data analysis. They are provided by different contributors (stakeholders), including wet-lab biologists, core facilities, computational biologists, librarians and infrastructure providers.

Why do I benefit from metadata?

Before going any deeper into the what and how, let us understand why we benefit from annotating our data with rich metadata. For this, it helps to recap, along the FAIR principles, how metadata makes your data comprehensible and usable for yourself and your peers.

What tasks are important for rich metadata?

The diversity of metadata types, sources and stakeholders highlights that collecting metadata is rarely a linear one-person or one-time event; rather, it occurs continuously and requires constant updates that parallel a project's development. Here we address a few tasks that typically act on metadata.

Collection

Metadata stakeholders from different environments have different understandings of what metadata is required for comprehension of the annotated data. As plant biologists we probably agree that, when retrieving data from a public repository or publication, it is beneficial to know what type of measurement was performed on what species of plants. By contrast, a computational biologist and a librarian might emphasize the importance of the programming environment required to interpret a script or the contributing authors and licenses, respectively.

As plant biologists, we frequently experience how hard it is (if at all possible) to reproduce an experiment described with too little information in a publication. So, the more metadata the merrier – wouldn't it be great to capture all metadata about a project? Realistically, we can only collect a portion of it. To guide users on what metadata they are encouraged to collect, different domains of data experts have formulated their requirements into what is often referred to as "metadata standards" or "minimum information standards". Examples of bibliographic and administrative metadata standards include DublinCore and DataCite. Prominent standards for annotating data relevant to different plant science domains are grouped under the "Minimum Information for Biological and Biomedical Investigations" (MIBBI) and define, e.g., the minimum information about a high-throughput SEQuencing Experiment (MINSEQE), a Proteomics Experiment (MIAPE) or a Plant Phenotyping Experiment (MIAPPE). Many more metadata standards are available and can be explored at fairsharing.org. You can also use DataPLANT's Metadata recommendation quiz to find suitable metadata standards, metadata checklists for repositories, and corresponding Swate templates for your data. Metadata standards can be regarded as "checklists" which, when followed, ensure that the data is annotated with the required metadata attributes to make it comprehensible, at least in its current context.
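The idea of a minimum information "checklist" can be sketched in a few lines of code: a record is complete once every required attribute is filled. The field names below are illustrative only, loosely inspired by MIAPPE-style requirements; they are not an official checklist.

```python
# Sketch: checking a metadata record against a minimal-information checklist.
# The required fields are hypothetical examples, not an official standard.
REQUIRED_FIELDS = {"investigation_title", "contact_person", "organism",
                   "growth_conditions", "measurement_type"}

def missing_fields(record: dict) -> set:
    """Return the required fields that the record does not yet provide."""
    return REQUIRED_FIELDS - {key for key, value in record.items() if value}

record = {
    "investigation_title": "Circadian clock and sugar metabolism",
    "contact_person": "Viola",
    "organism": "Welwitschia mirabilis",
    "growth_conditions": "",          # still to be filled in
    "measurement_type": "metabolomics",
}

print(missing_fields(record))  # → {'growth_conditions'}
```

Real validators (e.g. for repository submission) work the same way in principle, but additionally check value formats and controlled vocabularies.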

Structuring

Comprehensibility also strongly depends on the use of language. And not just the language itself (English, German, Swahili, etc.), but also the different meanings of chosen words. While for a librarian the "title" usually refers to a publication item (e.g. a manuscript or book) with "contributors" being "authors", "title" in bio-laboratory routines may refer to the name of a funded project with multiple publications (and titles). In such a project, "contributors" could include authors, but also other experimenters, data analysts and more. Likewise, the processes coming to a plant biologist's mind when thinking about "protocols" (growing plants, collecting material, extracting RNA) will be of little help to a computer scientist trying to retrieve data (via HTTP or FTP) from a database. Furthermore, the librarian, the computer scientist and the plant biologist would probably take very different approaches to collecting and representing the required metadata, according to the intuition or conventions of their respective domains.

Two technical solutions, schema and ontologies, help to overcome these ambiguities. Briefly, schema are structured documents that formulate a clear representation of the (metadata) information, making it both human-readable and machine-processable. Ontologies are controlled vocabularies or thesauri with defined relations between the terms, which can be used to fill the schema. In this way, schema define where to find the information, and ontologies define how to interpret it. The use of schema combined with ontologies is well established for bibliographic and administrative metadata, where high-level and often generic metadata relevant to most scientific domains is described. In fact, the two examples of DublinCore and DataCite are not just checklists, but well-defined metadata schema. In more specialized scientific domains, the use of schema is much less established.
However, the ISA metadata schema (for Investigation – Study – Assay) can be employed for most data types relevant in plant sciences. ISA allows intuitive, flexible, and yet structured and conclusive data annotation with biological as well as bibliographic and administrative metadata. For more details, see ISA.
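How schema and ontologies work together can be illustrated with a toy ISA-like structure. The dictionaries below are a deliberate simplification (real ISA metadata is serialized as ISA-Tab or ISA-JSON, and the ontology accession shown is illustrative): the schema fixes where each piece of information lives, while ontology references (term, source ontology, accession) pin down what the free-text values mean.

```python
# Simplified sketch of the ISA hierarchy: investigation > study > assay.
# Field names and the taxon accession are illustrative, not the official
# ISA-JSON serialization.
investigation = {
    "title": "Effect of the circadian clock on sugar metabolism",
    "contacts": [{"name": "Viola", "role": "PhD student"}],
    "studies": [{
        "title": "Two-day light-regime time series",
        "organism": {
            "term": "Welwitschia mirabilis",
            "ontology": "NCBITaxon",     # controlled vocabulary source
            "accession": "NCBITaxon:0000000",  # placeholder accession
        },
        "assays": [
            {"measurement_type": {"term": "metabolite profiling", "ontology": "OBI"}},
            {"measurement_type": {"term": "transcription profiling", "ontology": "OBI"}},
        ],
    }],
}

# Because the schema defines where information lives, every value can be
# reached along a predictable path:
print(investigation["studies"][0]["assays"][0]["measurement_type"]["term"])
```

This predictability is what makes schema-based metadata machine-processable: a program does not need to guess where the organism or measurement type is recorded.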

Sharing and curation

In order to make your data FAIR, it needs to be packaged with metadata and shared via a findable and accessible repository. There are cases where metadata needs to be made accessible independently of the annotated data, e.g. for legal reasons such as dual use or intellectual property rights, or simply because the data is not yet published. In any case, there are at least as many routines for sharing or publishing metadata as there are for data. For more information, please see Data Sharing and Repositories. During a project's lifetime, data and metadata continuously develop and require adaptations and updates. Proper versioning helps to keep an overview of the different stakeholders and tasks acting on metadata. Easy and traceable documentation of these processes can be achieved with general-purpose version control through Git. As data and metadata develop, the need to harmonize between metadata outputs may arise. If metadata is properly collected in a schema, converters can help to migrate between different metadata schema or reliably export information from one schema to another.
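The converter idea mentioned above can be sketched as a simple field mapping between two schema. The field names here are illustrative simplifications of DublinCore- and DataCite-style elements, not the official property names of either standard.

```python
# Sketch of a metadata converter between two simple schema.
# Field names are illustrative, not official DublinCore/DataCite properties.
FIELD_MAP = {                 # source field -> target field
    "title": "titles",
    "creator": "creators",
    "date": "publicationYear",
    "identifier": "identifier",
}

def convert(record: dict) -> dict:
    """Rename known fields; pass unmapped fields through so nothing is lost."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}

dublin_core_like = {
    "title": "Circadian clock and sugar metabolism dataset",
    "creator": "Viola",
    "date": "2022",
    "subject": "plant biology",   # no direct counterpart: passed through
}

print(convert(dublin_core_like)["publicationYear"])  # → 2022
```

Passing unmapped fields through unchanged is the "reliable export" part: a conversion should never silently drop information that the target schema happens not to model.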

How does DataPLANT support me in metadata annotation?

The following table gives an overview of DataPLANT tools and services related to metadata. Follow the link in the first column for details.

Name | Type | Tasks on metadata

ARC (Annotated Research Context) | Standard | Structure:
  • Package data with metadata

Swate (Swate Workflow Annotation Tool for Excel) | Tool | Collect and structure:
  • Annotate experimental and computational workflows with the ISA metadata schema
  • Easy use of ontologies and controlled vocabularies
  • Metadata templates for versatile data types

Metadata quiz | Tool | Identify:
  • Identify relevant metadata standards, data repositories and Swate templates for your data

ARC Commander | Tool | Collect, structure and share:
  • Add bibliographic metadata to your ARC
  • ARC version control and sharing via DataPLANT's DataHUB
  • Automated metadata referencing and version control as your ARC grows

DataHUB | Service | Share:
  • Federated system to share ARCs
  • Manage who can view or access your ARC

DataPLANT Support

Besides these technical solutions, DataPLANT supports you with community-engaged data stewardship. For further assistance, feel free to reach out via our helpdesk or by contacting us directly.