
Working with large data files

last updated at 2023-12-05

About this guide

In this guide we show you how you can actively handle large data files in your ARC using ARC Commander.

💡 If you use ARCitect to manage your ARCs, select or deselect the LFS box (in the "Download ARC" panel) or the Download LFS Files box (in the "Versions" panel) to allow or prevent syncing of large files (LFS = large file storage).

Before we can start

☑️ You have created an ARC before using the ARCitect or ARC Commander
☑️ You have a DataPLANT account
☑️ Your computer is linked to the DataHUB via personal access token

Large File Storage (LFS)

ARCs and the DataHUB come with a mechanism to sync and store large files called Large File Storage (LFS). LFS is an efficient way to store your large data files, which are then called "LFS objects". Rather than checking every file during every arc sync (ARC Commander) or DataHUB Sync (ARCitect), the tools first check whether anything changed at all and, only if so, scan what changed. This saves time and computing power compared to always scanning all large files for possible changes.

ARCitect

The ARCitect lets you activate or deactivate the use of LFS.

In addition, you can set a threshold (2) in megabytes (MB) for what you consider a large file in the "Commit" menu (1).

Finally, you can individually download large files via right-click → "Download LFS File" (1).

ARC Commander

By default, the ARC Commander tracks the following files via LFS:

  1. All files stored in an assay's dataset folder, and
  2. All files with a size larger than 150 MB.

The threshold of 150 MB can easily be adjusted using the ARC Commander. For instance, if you want to decrease it to 5 MB (i.e. 5000000 bytes), run

arc config set -g -n "general.gitlfsbytethreshold" -v "5000000"
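
The byte value is easy to mistype, so a small helper can do the conversion. The following is a minimal sketch, assuming decimal megabytes (1 MB = 1000000 bytes) as in the 5 MB example above. The function names `mb_to_bytes` and `lfs_threshold_command` are hypothetical helpers, not part of the ARC Commander. The command is only printed, not executed, so you can review it first.

```shell
# Convert a size in megabytes to bytes, using decimal units
# (1 MB = 1000000 bytes), as in the 5 MB example above.
mb_to_bytes() {
  echo $(( $1 * 1000000 ))
}

# Hypothetical helper: print (but do not run) the matching
# ARC Commander command so it can be reviewed before use.
lfs_threshold_command() {
  printf 'arc config set -g -n "general.gitlfsbytethreshold" -v "%s"\n' "$(mb_to_bytes "$1")"
}

# Print the command for a 5 MB threshold.
lfs_threshold_command 5
```

Copy the printed line into your terminal once the value looks right.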

💡 The LFS system is also the reason why git LFS needs to be installed prior to using the ARC Commander.

Track files via LFS

In addition to the defaults, you can also actively choose which files to track via LFS.

  1. Update your local ARC via arc sync
  2. Add large files or folders by copying or moving them to your ARC
  3. Track files via
git lfs track "<path/to/FolderWithLargeFiles/**>"
git add .gitattributes
  4. Sync your ARC to the DataHUB via arc sync
  5. Open your ARC in the DataHUB and navigate to the folder with LFS objects and see them flagged as "LFS" (1).
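
To confirm locally that a tracking rule was recorded before you sync, you can look into the .gitattributes file, where git lfs track stores its rules. The following is a minimal sketch, assuming the rule was written in the default form `<pattern> filter=lfs diff=lfs merge=lfs -text` and that the pattern contains no spaces. The function name `has_lfs_rule` and the path `assays/MyAssay/dataset/**` are hypothetical.

```shell
# Hypothetical helper: succeed if the attributes file (default: .gitattributes
# in the current folder) contains an LFS rule for exactly this pattern.
# Assumes the default line format written by "git lfs track".
has_lfs_rule() {
  pattern=$1
  attributes_file=${2:-.gitattributes}
  [ -f "$attributes_file" ] || return 1
  grep -F -x -q -- "$pattern filter=lfs diff=lfs merge=lfs -text" "$attributes_file"
}

# Example with a hypothetical assay folder; run it from the root of your ARC.
if has_lfs_rule "assays/MyAssay/dataset/**"; then
  echo "LFS rule found"
else
  echo "no LFS rule found for this pattern"
fi
```

Running git lfs track without arguments lists the currently tracked patterns as well, provided git LFS is installed.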

⚠️ Please avoid uploading large files without git LFS (i.e. accidentally with pure git, when git-lfs is not available).
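
Before syncing, it can help to list the files in your ARC that exceed the size threshold, so you can check that they are covered by LFS. The following is a minimal sketch using only the standard find tool. The function name `list_large_files` is hypothetical, and the value 150000000 assumes the 150 MB default threshold expressed in decimal bytes.

```shell
# Hypothetical helper: print all regular files below a directory that are
# larger than a byte threshold, skipping Git's internal .git folder.
list_large_files() {
  dir=$1
  threshold_bytes=$2
  find "$dir" -name .git -prune -o -type f -size +"${threshold_bytes}c" -print
}

# Example: files in the current folder above the assumed 150 MB threshold.
list_large_files . 150000000
```

Any file in this list that is not flagged as "LFS" in the DataHUB after syncing is a candidate for explicit tracking.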

Downloading an ARC without large data files

Sometimes you may want to download your ARC to a smaller computer, where you do not need a full copy of your ARC including all its large data files. For instance, you just want to work with smaller derived data sets or want to update ISA metadata. In this case, you can add the -n or --nolfs flag to your arc get command:

arc get --nolfs -r https://gitlab.nfdi4plants.de/<YourUser>/<YourARC>
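
After a download without LFS objects, each LFS-tracked file is still present in the ARC folder, but only as a small text file (a "pointer") instead of the actual data. The following is a minimal sketch for checking whether a given file is such a pointer. It assumes the standard Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`. The function name `is_lfs_pointer` is hypothetical.

```shell
# Hypothetical helper: succeed if the file looks like a Git LFS pointer,
# i.e. its first line is the version line of the LFS pointer format.
# Only the first 100 bytes are read, so large data files are cheap to test.
is_lfs_pointer() {
  [ -f "$1" ] || return 1
  first_line=$(head -c 100 "$1" | tr -d '\0' | head -n 1)
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Usage (run inside your ARC, with a path of your choice):
# is_lfs_pointer "<path/to/file>" && echo "pointer only, data not downloaded"
```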

For example, have a look at the ARC https://gitlab.nfdi4plants.de/shiltemann/physcomitrium-patens-light-signaling-2022/. In the DataHUB this ARC has a storage volume of ~84GB (December 2023), most of which comes from the large RNASeq data files flagged as "LFS".

You can download this ARC without the LFS objects via

arc get --nolfs -r https://gitlab.nfdi4plants.de/shiltemann/physcomitrium-patens-light-signaling-2022/

Selectively download large files

If at some point you wish to selectively download one or more of the LFS objects of your ARC to that machine, you can do so via

git lfs pull --include "<path/to/fileOrFolder>"
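
To fetch several files or folders in one go, the paths can be combined into a single --include value. The following is a minimal sketch, assuming that git lfs accepts a comma-separated list of paths or patterns for --include, as described in the Git LFS documentation. The function names `join_include_paths` and `lfs_pull_command` and the example paths are hypothetical. The command is only printed, not executed.

```shell
# Hypothetical helper: join all arguments with commas, the separator
# assumed here for multiple --include patterns.
join_include_paths() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical helper: print (but do not run) the resulting command for review.
lfs_pull_command() {
  printf 'git lfs pull --include "%s"\n' "$(join_include_paths "$@")"
}

# Example with two placeholder paths.
lfs_pull_command "path/to/fileA" "path/to/folderB/**"
```

Paths that contain commas would not work with this approach and need to be pulled one at a time.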

For example, the following command will download one of the large RNASeq data files.

git lfs pull --include "assays/RNASeq/dataset/R19/R19_1.fq.gz"

Download all large files in the ARC

If at some point you wish to download all LFS files of your ARC, you can use the following command

git lfs pull --include "*"

Checking usage quota of LFS

If at some point you would like to check how much free storage you have for your ARC, you can easily do so by navigating to your ARC in the DataHUB and clicking on "Project Storage" in the right sidebar (1).

DataPLANT Support

Besides these technical solutions, DataPLANT supports you with community-engaged data stewardship. For further assistance, feel free to reach out via our helpdesk or by contacting us directly.