last updated at 2023-12-05
About this guide
In this guide we show you how you can actively handle large data files in your ARC using ARC Commander.
💡 If you use ARCitect to manage your ARCs, make sure to select or deselect the LFS checkbox (in the "Download ARC" panel) or the Download LFS Files checkbox (in the "Versions" panel) to allow or prevent syncing of large files (LFS = Large File Storage).
Before we can start
☑️ You have previously created an ARC using ARCitect or ARC Commander
☑️ You have a DataPLANT account
☑️ Your computer is linked to the DataHUB via personal access token
Large File Storage (LFS)
ARCs and the DataHUB come with a mechanism to sync and store large files called Large File Storage (LFS). LFS is an efficient way to store your large data files, which are then called "LFS objects". Rather than checking every file during every arc sync (ARC Commander) or DataHUB Sync (ARCitect), the tools first check whether anything has changed at all, and only if this is the case do they scan what was changed. This saves time and computing power compared to always scanning all large files for possible changes.
ARCitect
The ARCitect offers to activate or deactivate the use of LFS:
- in the "Download ARC" (1) menu via the "LFS" checkbox (2)
- as well as in the "DataHUB Sync" menu (1) via the "Use Large File Storage" checkbox (2), which is available once an ARC has been opened in ARCitect.
In addition you can set a threshold (2) in megabytes (MB) for what you consider a large file in the "Commit" menu (1).
Finally, you can individually download large files via right-click → "Download LFS File" (1)
ARC Commander
By default, the ARC Commander tracks the following files via LFS:
- All files stored in an assay's dataset folder, and
- All files with a size larger than 150 MB.
The threshold of 150 MB can easily be adjusted using the ARC Commander. For instance, if you want to decrease it to 5 MB (i.e. 5000000 bytes), run
arc config set -g -n "general.gitlfsbytethreshold" -v "5000000"
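Since the threshold is given in bytes, it can help to compute the value rather than count zeros by hand. The following is a minimal sketch, assuming 1 MB = 1,000,000 bytes as in the 5 MB example above. The helper name mb_to_bytes and the variable names are hypothetical and not part of the ARC Commander. The script only prints the resulting command so you can review it first.

```shell
# Hypothetical helper: convert a threshold in megabytes to bytes
# (assumes 1 MB = 1,000,000 bytes, matching the 5 MB example above).
mb_to_bytes() {
  echo $(( $1 * 1000000 ))
}

threshold_mb=5   # illustrative value, adjust to your needs
threshold_bytes=$(mb_to_bytes "$threshold_mb")

# Print the command for review; remove the leading "echo" to actually run it.
echo arc config set -g -n "general.gitlfsbytethreshold" -v "$threshold_bytes"
```

Running the command without the leading echo applies the setting exactly as in the example above.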
💡 The LFS system is also the reason why git LFS needs to be installed prior to using the ARC Commander.
Track files via LFS
In addition to the defaults, you can also actively choose which files to track via LFS.
- Update your local ARC via
arc sync
- Add large files or folders by copying or moving them to your ARC
- Track files via
git lfs track "<path/to/FolderWithLargeFiles/**>"
git add .gitattributes
- Sync your ARC to the DataHUB via
arc sync
- Open your ARC in the DataHUB and navigate to the folder with LFS objects and see them flagged as "LFS" (1).
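Before syncing, you may want to confirm that a tracking pattern really covers the files you have in mind. The sketch below uses only plain git in a throwaway repository (not a real ARC). It writes the kind of line that git lfs track adds to .gitattributes and then asks git which filter applies to a given path. The function name lfs_filter is hypothetical, and the paths are illustrative.

```shell
# Throwaway repository for the demonstration (not a real ARC).
demo=$(mktemp -d)
git init -q "$demo"

# This is the kind of line that
#   git lfs track "assays/RNASeq/dataset/**"
# appends to .gitattributes.
echo 'assays/RNASeq/dataset/** filter=lfs diff=lfs merge=lfs -text' > "$demo/.gitattributes"

# Hypothetical helper: print the filter git would apply to a path.
# The path does not need to exist for this check.
lfs_filter() {
  git -C "$demo" check-attr filter -- "$1" | awk -F': ' '{print $3}'
}

lfs_filter "assays/RNASeq/dataset/R19/R19_1.fq.gz"   # prints: lfs
lfs_filter "assays/RNASeq/isa.assay.xlsx"            # prints: unspecified
```

In a real ARC with git LFS installed, git lfs track without arguments lists the active patterns, and git lfs ls-files lists the files already stored as LFS objects.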
⚠️ Please avoid uploading large files without git LFS (i.e. accidentally with pure git, when git-lfs is not available).
Downloading an ARC without large data files
Sometimes you may want to download your ARC to a computer with less storage, where you do not need a full copy of your ARC including all its large data files. For instance, you may only want to work with smaller derived data sets or update ISA metadata.
In this case, you can add the -n or --nolfs flag to your arc get command:
arc get --nolfs -r https://gitlab.nfdi4plants.de/<YourUser>/<YourARC>
For example, have a look at the ARC https://gitlab.nfdi4plants.de/shiltemann/physcomitrium-patens-light-signaling-2022/.
In the DataHUB, this ARC has a storage volume of ~84 GB (December 2023), most of which comes from the large RNASeq data files flagged as "LFS".
You can download this ARC without the LFS objects via
arc get --nolfs -r https://gitlab.nfdi4plants.de/shiltemann/physcomitrium-patens-light-signaling-2022/
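After such a download, it can be useful to tell which local files hold real data and which are only placeholders. The sketch below assumes that the skipped files are left as standard Git LFS pointer files, i.e. small text files whose first line is "version https://git-lfs.github.com/spec/v1". Whether the ARC Commander leaves exactly this layout is an assumption here. The function name is_lfs_pointer and the demo files are hypothetical.

```shell
# Hypothetical helper: succeed if the file looks like a Git LFS pointer.
# Only the first 100 bytes are read, so this stays fast on large binaries.
is_lfs_pointer() {
  first_line=$(head -c 100 "$1" 2>/dev/null | tr -d '\0' | head -n 1)
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Demo with two made-up files in a temporary folder.
demo_dir=$(mktemp -d)
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%064d\nsize 12345\n' 0 \
  > "$demo_dir/pointer.fq.gz"
printf 'real content\n' > "$demo_dir/real.txt"

for f in "$demo_dir"/*; do
  if is_lfs_pointer "$f"; then
    echo "pointer: ${f##*/}"
  else
    echo "content: ${f##*/}"
  fi
done
```

In a real ARC, you could call is_lfs_pointer on the files of a dataset folder to see which ones still need to be downloaded.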
Selectively download large files
If at some point you wish to selectively download one or more of the LFS objects of your ARC to that machine, you can do so via
git lfs pull --include "<path/to/fileOrFolder>"
For example, the following command will download one of the large RNASeq data files.
git lfs pull --include "assays/RNASeq/dataset/R19/R19_1.fq.gz"
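To fetch several files or folders in one go, the --include option of git lfs pull also accepts a comma-separated list of paths or patterns. The following sketch builds such a list and prints the resulting command for review. The helper name join_by_comma is hypothetical, and the second path (the R20 folder) is a made-up example that may not exist in the ARC above.

```shell
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
  local IFS=','
  echo "$*"
}

# First path is taken from the example above; the second one is illustrative only.
paths=$(join_by_comma \
  "assays/RNASeq/dataset/R19/R19_1.fq.gz" \
  "assays/RNASeq/dataset/R20")

# Print the command for review; remove the leading "echo" to actually run it
# inside your ARC folder.
echo git lfs pull --include "$paths"
```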
Download all large files in the ARC
If at some point you wish to download all LFS files of your ARC, you can use the following command:
git lfs pull --include "*"
Checking usage quota of LFS
If at some point you would like to check how much free storage you have for your ARC, you can easily do so by navigating to your ARC in the DataHUB and clicking on "Project Storage" in the right sidebar (1).
DataPLANT Support
Besides these technical solutions, DataPLANT supports you with community-engaged data stewardship. For further assistance, feel free to reach out via our helpdesk or by contacting us directly.