# Making space on the cluster
Disk space is cheap, but we still run out on the cluster regularly. Here are some tips for saving space:
- Don't keep SAM files
- After a paper is published, remove the non-essential intermediate data files
- Gzip everything you can. Compression saves a lot of space in some cases and is very fast to decompress. Many programs accept gzipped files directly
- Delete SAM files
- All major sequence data formats have a compressed counterpart. If you don't have a good reason to keep the uncompressed version, compress it
- Never create SAM files
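As a quick sketch of the gzip workflow (the file name `reads.fasta` is just an example):

```shell
# Compress in place; produces reads.fasta.gz and removes the original.
# Swap in pigz for parallel compression if it is installed.
gzip reads.fasta

# Stream the compressed file without writing an uncompressed copy to disk
zcat reads.fasta.gz | head
```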
## Compressed file alternatives
| File format | Alternative format | Compression tool | Notes |
|---|---|---|---|
| .fasta/.fa | .fasta.gz/.fa.gz | gzip/pigz | |
| .fastq/.fq | .fastq.gz/.fq.gz | gzip/pigz | |
| .sam | .bam/.cram | samtools | CRAM uses reference-based compression that improves on BAM |
| .ped/.tped | .bed | plink | |
| .vcf | .bcf | bcftools | bcftools runs a lot faster on BCF files |
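As a sketch of the SAM row above (file names are examples; requires samtools and, for the CRAM step, the reference FASTA the reads were aligned to):

```shell
# SAM -> sorted BAM (-@ sets the number of extra threads)
samtools sort -@ 4 -o aln.bam aln.sam

# BAM -> CRAM, using the alignment reference for reference-based compression
samtools view -T ref.fasta -C -o aln.cram aln.bam

# Sanity-check the BAM before deleting the SAM file
samtools quickcheck aln.bam && rm aln.sam
```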
## Helpful one-liners
Here are some one-liners for finding disk usage and freeing up some space.

Find the biggest subdirectories of the current directory:

```shell
du | sort -k1nr | head -20
```
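If your `du` and `sort` support human-readable sizes (GNU coreutils do; the `-h` flags are the only assumption here), this variant is easier to read:

```shell
# -s: summarize each subdirectory; -h: human-readable sizes; sort -rh orders them largest first
du -sh ./*/ | sort -rh | head -20
```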
Recursively find all large FASTA files in the current directory and its subdirectories and gzip them:

```shell
find . -type f -size +100M -name "*.fasta" -exec echo {} \; -exec pigz {} \;
```
Find all SAM files and convert them to BAM (download the sam2bam.py script first, and change 20 to the number of available threads):

```shell
find . -type f -size +100M -name "*.sam" -exec python sam2bam.py {} 20 \;
```
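If you would rather not download a script, a samtools-only equivalent might look like this (a sketch, assuming samtools is installed; each BAM is written next to its SAM, and the SAM is removed only if the conversion verifies):

```shell
# For each large SAM: sort/compress to BAM with 20 threads, verify, then delete the SAM
find . -type f -size +100M -name "*.sam" -exec sh -c '
  samtools sort -@ 20 -o "${1%.sam}.bam" "$1" &&
  samtools quickcheck "${1%.sam}.bam" &&
  rm "$1"
' _ {} \;
```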