
Version control & Git

last updated at 2022-05-09

Scientific iteration and versioning

Science is highly iterative. Most outcomes along the data life cycle (between an initial idea and the final publication, see also RDM) pass through multiple cycles of design-test-repeat (e.g. laboratory experiments), draft-review-publish (e.g. manuscripts), or a mix of both. During these iterations, multiple versions of each outcome are produced.
There are different options to keep track of these versions. The seemingly simplest option is to duplicate a file and rename it by attaching a version, e.g. manuscript.txt, manuscript_v2.txt, manuscript_final.txt. Although this may work acceptably for individual use, it quickly becomes confusing when sharing with other researchers. Cloud services offer options to keep track of changes (what was changed and by whom) within collaborative, multi-party projects (see also Data Sharing). Here, versioning is usually handled automatically by the cloud service, with little to no control by the user. However, these services are helpful only for version histories of typical office data (documents, presentations) or small datasets, and within low-complexity projects.

[Figure: Cloud Services]

Git

A more sophisticated approach to the versioning needs of more complex projects originates from software engineering. Software development builds on iterative design-test-repeat cycles, in which multiple versions of files (code, inputs and outputs) and directory structures emerge, along with changing dependencies within the project (e.g. between files) and outside it (e.g. on other software). Version control systems (sometimes termed "source control" or "revision control") help software developers keep track of project changes and safeguard the integrity of the software, ideally before it is rolled out to the public. The most prominent and widely used distributed version control system is Git.

By taking chronological snapshots of a complete project (termed a "git repository") rather than of single files, Git allows the user to "go back in time" to an earlier version of that project, e.g. one in which the software was functioning properly. This is further supported by the option to change multiple files at once in parallel, safe copies of the project (termed "branch" or "fork") without breaking the original version. In contrast to the versioning of cloud services, active control over these snapshots lies in the user's hands, which makes it possible to evolve a project with a well-documented version history that parallels the iterative steps.
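
As a rough illustration of the snapshot and branch ideas described above, the following Python sketch models a project whose complete state is recorded with every snapshot. It is a toy model under stated assumptions: the names ToyRepo and Snapshot and all of their methods are hypothetical, and the code neither reflects how Git stores data internally nor replaces Git.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """One snapshot of the whole project (all files at once)."""
    files: Dict[str, str]      # path -> file content
    message: str               # short description of the change
    parent: Optional[str]      # id of the previous snapshot, None for the first


class ToyRepo:
    """Hypothetical, heavily simplified model; not Git's real data format."""

    def __init__(self) -> None:
        self.snapshots: Dict[str, Snapshot] = {}
        self.branches: Dict[str, Optional[str]] = {"main": None}
        self.current = "main"
        self.workdir: Dict[str, str] = {}   # the files being edited right now

    def commit(self, message: str) -> str:
        """Record the complete working copy as a new snapshot on the current branch."""
        parent = self.branches[self.current]
        snap = Snapshot(dict(self.workdir), message, parent)
        raw = repr((sorted(snap.files.items()), message, parent))
        snap_id = hashlib.sha1(raw.encode()).hexdigest()[:8]
        self.snapshots[snap_id] = snap
        self.branches[self.current] = snap_id
        return snap_id

    def branch(self, name: str) -> None:
        """Create a parallel line of work starting at the current snapshot and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name: str) -> None:
        """Switch to a branch and load its latest snapshot into the working copy."""
        self.current = name
        tip = self.branches[name]
        self.workdir = dict(self.snapshots[tip].files) if tip else {}

    def restore(self, snap_id: str) -> None:
        """'Go back in time': load an earlier snapshot into the working copy."""
        self.workdir = dict(self.snapshots[snap_id].files)

    def log(self) -> List[str]:
        """Messages of the current branch's history, newest first."""
        out, cur = [], self.branches[self.current]
        while cur is not None:
            out.append(self.snapshots[cur].message)
            cur = self.snapshots[cur].parent
        return out


if __name__ == "__main__":
    repo = ToyRepo()
    repo.workdir = {"analysis.py": "v1", "data.csv": "1,2,3"}
    repo.commit("initial analysis")

    repo.branch("experiment")                 # safe parallel copy
    repo.workdir["analysis.py"] = "v2"        # change several files at once
    repo.workdir["notes.md"] = "try new normalization"
    repo.commit("try new normalization")

    repo.checkout("main")                     # the original version is untouched
    print(repo.workdir)
    print(repo.log())
```

In this sketch, the experimental changes live only on the "experiment" branch, so switching back to "main" returns the project to its earlier, unchanged state. In real Git, the corresponding operations are provided by commands such as git commit, git branch and git checkout.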

[Figure: Git and Git Platforms]

Git platforms: GitHub and GitLab

Although Git can be used locally as a standalone tool, its full potential unfolds on Git platforms such as GitHub and GitLab. Similar to typical cloud services for file sharing and collaboration, these platforms function as remote share points for git repositories. They offer access management (permission control) to share data privately with selected collaborators or with the public. Individual contributions and changes by multiple collaborators can be tracked. On top of versioned data sharing, additional features such as discussing and tracking project tasks and contributions, and wiki-based documentation, make these platforms very valuable for project and research (data) management. Consequently, they now enjoy great popularity outside of software development as well.

Software developers collaborate via Git to develop a software project over time, add new features, improve parts of the software or embed them into other software projects, and keep it up to date. Likewise, Git is well suited to tracking the evolution of your plant science project over time, where analyses become more complex, build on top of each other or are embedded from other projects as more data from your own experiments or external resources is added. This is particularly the case if experimental data is packaged in one git repository together with descriptive metadata, computations, analyses, and their outcomes, as well as licenses for reuse.

Git? - it's not for me...

Yes, although we spare the technical details here, Git is complex at first glance, and there is quite a learning curve for those who really want to understand its inner workings. However, this complexity is also part of its strength in capturing the parallel, multi-party, multifaceted strands of iterative scientific projects. More importantly, there is a growing set of helper tools, GUI solutions and integrations into other tools that ease working with Git.

How does DataPLANT support me to version control my data?

The following table gives an overview of DataPLANT tools and services related to version control and data sharing. Follow the link in the first column for details.

Name | Type | Tasks on data sharing
ARC Commander | Tool | Collect, structure and share:
  • Add bibliographical metadata to your ARC
  • ARC version control and sharing via DataPLANT's DataHUB
  • Automated metadata referencing and version control as your ARC grows
ARC (Annotated Research Context) | Standard | Structure:
  • ARCs are git repositories
  • Package data with metadata in a defined format
DataHUB | Service | Share:
  • DataPLANT-customized GitLab instance
  • Infrastructure-as-code: on-premise solution
  • Federated system to share ARCs
  • Manage who can view or access your ARC

Register with DataPLANT

To use the DataHUB and other DataPLANT infrastructure and services, please sign up with DataPLANT.

DataPLANT Support

Besides these technical solutions, DataPLANT supports you with community-engaged data stewardship. For further assistance, feel free to reach out via our helpdesk or by contacting us directly.