Making space on the cluster
Disk space is cheap, but we still run out of it on the cluster quite often. Here are some tips for saving space:
- Don't keep SAM files
- After a paper is published remove the non-essential intermediate data files
- Gzip everything you can. Compression can save a lot of space, decompression is fast, and many programs accept gzipped files directly
- Delete SAM files
- All major sequence data formats have a compressed equivalent. If you don't have a good reason to keep the uncompressed version, compress it
- Never create SAM files
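As a rough illustration of the gzip tip above, the sketch below compresses a small FASTA file and then reads it back as a stream, without writing an uncompressed copy to disk. The file name `example.fasta` and its contents are made up for the demonstration.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-record FASTA file, for illustration only.
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > example.fasta

# Compress in place: example.fasta is replaced by example.fasta.gz.
gzip example.fasta

# Stream the compressed file to count sequences; nothing is written back to disk.
n_seqs=$(gzip -dc example.fasta.gz | grep -c '^>')
echo "sequences: $n_seqs"
```

The same pattern (`gzip -dc file.gz | tool`) is a fallback for programs that do not read gzipped input themselves.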
Compressed file alternatives
| File format | Alternative Format | Compression tool | Notes | 
|---|---|---|---|
| .fasta/.fa | .fasta.gz/.fa.gz | gzip/pigz | |
| .fastq/.fq | .fastq.gz/.fq.gz | gzip/pigz | |
| .sam | .bam/.cram | samtools | CRAM uses reference-based compression that improves on BAM |
| .ped/.tped | .bed | plink | |
| .vcf | .bcf | bcftools | bcftools runs much faster on BCF files |
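The conversions in the table could be wrapped roughly as follows. This is a sketch, not a vetted pipeline: the function names are invented for illustration, the commands assume samtools, bcftools and PLINK 1.9 are on your `PATH`, and the SAM input is assumed to carry a proper header.

```shell
# Each function takes input paths as arguments; nothing runs until it is called.
# The function names are hypothetical helpers, not part of any tool.

# SAM -> BAM. $1 = input .sam, $2 = number of extra threads (default 0).
sam_to_bam() {
    samtools view -b -@ "${2:-0}" -o "${1%.sam}.bam" "$1"
}

# BAM -> CRAM. $1 = input .bam, $2 = reference FASTA the reads were aligned to.
bam_to_cram() {
    samtools view -C -T "$2" -o "${1%.bam}.cram" "$1"
}

# VCF -> BCF. $1 = input .vcf.
vcf_to_bcf() {
    bcftools view -O b -o "${1%.vcf}.bcf" "$1"
}

# PED/MAP -> binary BED/BIM/FAM. $1 = prefix shared by the .ped and .map files.
ped_to_bed() {
    plink --file "$1" --make-bed --out "$1"
}
```

For example, `sam_to_bam sample.sam 4` would write `sample.bam` next to the input. None of these functions delete the original, so check the output first (for BAM, `samtools quickcheck sample.bam`) and then remove the uncompressed file yourself. Keep the reference FASTA for as long as you keep any CRAM files made from it, since it is needed to decode them.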
Helpful one-liners
Here are some one-liners for checking disk usage and freeing up space.
Find the biggest subdirectories in the current directory:
du | sort -k1nr | head -20
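The command above lists every nested directory, with sizes in blocks. If you would rather see only the immediate subdirectories with human-readable sizes, a variant along these lines may help. It assumes a `du` that supports `-d` and a `sort` that supports `-h` (both true of recent GNU coreutils); `biggest_subdirs` is a made-up helper name.

```shell
# List the largest immediate subdirectories of a directory, biggest first.
# $1 = directory to inspect (default: current), $2 = lines to show (default: 20).
# The first line is usually the total for the directory itself.
biggest_subdirs() {
    du -h -d 1 "${1:-.}" | sort -hr | head -n "${2:-20}"
}
```

Call it as `biggest_subdirs` for the current directory, or `biggest_subdirs /path/to/project 10` to show the ten largest entries under another path.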
Recursively find all large FASTA files (over 100 MB) in the current directory and all subdirectories, print their names, and gzip them:
find . -type f -size +100M -name "*.fasta" -exec echo {} \; -exec pigz {} \;
Find all large SAM files (over 100 MB) and convert them to BAM (download the sam2bam.py script first; change 20 to the number of available threads):
find . -type f -size +100M -name "*.sam" -exec python sam2bam.py {} 20 \;
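If sam2bam.py is not to hand, something similar can be sketched with samtools directly. This is not a reimplementation of that script (its behaviour is not described here), just one possible approach: `convert_sams` is a hypothetical helper, it has no size filter, and it deletes each SAM file only after the BAM has been written and has passed `samtools quickcheck`.

```shell
# Convert every .sam file under a directory to BAM, deleting the SAM only on success.
# $1 = directory to search (default: current), $2 = extra samtools threads (default: 1).
convert_sams() {
    find "${1:-.}" -type f -name '*.sam' -exec sh -c '
        threads=$1
        shift
        for sam in "$@"; do
            bam="${sam%.sam}.bam"
            # quickcheck confirms the BAM looks intact before the SAM is removed.
            if samtools view -b -@ "$threads" -o "$bam" "$sam" && samtools quickcheck "$bam"; then
                rm -- "$sam"
            else
                echo "conversion failed, keeping $sam" >&2
            fi
        done
    ' sh "${2:-1}" {} +
}
```

For example, `convert_sams . 20` would process everything below the current directory using 20 extra threads. Adding `-size +100M` to the `find` call restricts it to large files, as in the one-liner above.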