Making space on the cluster

Disk space is cheap, but we still run out on the cluster quite often. Here are some tips for saving space:

  1. Don't keep SAM files
  2. After a paper is published, remove the non-essential intermediate data files
  3. Gzip everything you can. Compression can save a lot of space, gzipped files are very fast to decompress, and many programs accept them directly
  4. Delete SAM files
  5. All major sequence data formats have a compressed format. If you don't have a good reason to keep the uncompressed version, compress it
  6. Never create SAM files
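
As a small illustration of tip 3, the sketch below compresses a FASTA file and then counts its records without writing an uncompressed copy back to disk. The file name example.fasta and its contents are made up for the demonstration; gzip -dc is used because it behaves the same on Linux and macOS (zcat is equivalent on most Linux systems).

```shell
# Hypothetical demo file in a scratch directory; substitute your own FASTA.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/example.fasta"

# Compress in place: example.fasta is replaced by example.fasta.gz.
gzip "$workdir/example.fasta"

# Stream the compressed file straight into another tool; nothing is
# decompressed on disk. Here we count the FASTA header lines.
n_records=$(gzip -dc "$workdir/example.fasta.gz" | grep -c '^>')
echo "records: $n_records"
```

The same pattern (gzip -dc file.gz | tool) works for any program that reads from standard input but does not understand gzipped files itself.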

Compressed file alternatives

File format   Alternative format   Compression tool   Notes
.fasta/.fa    .fasta.gz/.fa.gz     gzip/pigz
.fastq/.fq    .fastq.gz/.fq.gz     gzip/pigz
.sam          .bam/.cram           samtools           CRAM uses reference-based compression that improves on BAM
.ped/.tped    .bed                 plink
.vcf          .bcf                 bcftools           bcftools works a lot faster on BCF files
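
The conversions in the table could be run roughly as follows. This is a sketch that assumes samtools, bcftools and plink (1.9-style options) are on your PATH; the function names are illustrative only, and output files are simply written next to the input.

```shell
# SAM -> BAM. $1 = input .sam, $2 = extra compression threads (default 4).
sam_to_bam() {
    samtools view -@ "${2:-4}" -b -o "${1%.sam}.bam" "$1"
}

# SAM -> CRAM. $1 = input .sam, $2 = the reference FASTA the reads were
# aligned to (CRAM stores differences against this reference).
sam_to_cram() {
    samtools view -C -T "$2" -o "${1%.sam}.cram" "$1"
}

# VCF -> BCF. $1 = input .vcf
vcf_to_bcf() {
    bcftools view -Ob -o "${1%.vcf}.bcf" "$1"
}

# PED/MAP -> binary BED/BIM/FAM. $1 = file prefix (without .ped);
# for .tped/.tfam input, --tfile would be used in place of --file.
ped_to_bed() {
    plink --file "$1" --make-bed --out "$1"
}
```

For example, sam_to_bam sample.sam 8 would write sample.bam. None of these functions delete the original file, so check the output before removing the uncompressed version.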

Helpful one-liners

Here are some one-liners for finding disk usage and freeing up some space.

Find the biggest subdirectories in the current directory:

du | sort -k1nr | head -20
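
The raw block counts printed by plain du can be hard to read, and the listing includes every nested directory. A possible variant, assuming GNU coreutils (--max-depth and sort -h are not available in every du and sort), limits the listing to immediate subdirectories and prints human-readable sizes. The function name biggest_dirs is only illustrative.

```shell
# biggest_dirs [DIR] [N]: the N largest entries one level below DIR
# (defaults: current directory, 20 lines). The total for DIR itself
# is included in the listing.
biggest_dirs() {
    du -h --max-depth=1 "${1:-.}" | sort -rh | head -n "${2:-20}"
}
```

Calling biggest_dirs /some/project 10 would list the ten largest entries under that directory, largest first.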

Recursively find all large FASTA files in the current directory and all subdirectories and gzip them:

find . -type f -size +100M -name "*.fasta" -exec echo {} \; -exec pigz {} \;
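
On machines where pigz is not installed the command above fails for every file. The sketch below wraps it in a function that falls back to gzip; the name compress_large_fasta and its arguments are illustrative, and the default threshold simply mirrors the 100M used above.

```shell
# compress_large_fasta [DIR] [MIN_SIZE]: compress every *.fasta file under
# DIR that is larger than MIN_SIZE (a find -size value such as 100M).
compress_large_fasta() {
    dir=${1:-.}
    min_size=${2:-100M}

    # Prefer pigz (parallel gzip) when it is on PATH, otherwise use gzip.
    if command -v pigz >/dev/null 2>&1; then
        compressor=pigz
    else
        compressor=gzip
    fi

    # -print lists each matching file; "{} +" hands the files to the
    # compressor in batches instead of starting one process per file.
    find "$dir" -type f -size "+$min_size" -name "*.fasta" \
        -print -exec "$compressor" {} +
}
```

To preview what would be compressed without changing anything, run the find command with only the -type, -size and -name tests and no -exec clause.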

Find all large SAM files and convert them into BAM (download the sam2bam.py script first, and change 20 to the number of available threads):

find . -type f -size +100M -name "*.sam" -exec python sam2bam.py {} 20 \;
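
If sam2bam.py is not to hand, a similar result can probably be had with samtools alone. The sketch below is an assumption about what such a conversion involves, not a description of what sam2bam.py does: it converts each large SAM file to an unsorted BAM and deletes the SAM only if samtools succeeded and produced a non-empty file. The function name sam_files_to_bam and its arguments are illustrative.

```shell
# sam_files_to_bam [DIR] [THREADS] [MIN_SIZE]
# Convert every *.sam under DIR larger than MIN_SIZE to BAM with samtools
# (defaults: current directory, 4 threads, 100M).
sam_files_to_bam() {
    find "${1:-.}" -type f -size "+${3:-100M}" -name "*.sam" -exec sh -c '
        threads=$1; shift
        for sam in "$@"; do
            bam="${sam%.sam}.bam"
            # Remove the SAM only after samtools exits cleanly and the
            # BAM exists with non-zero size.
            if samtools view -@ "$threads" -b -o "$bam" "$sam" && [ -s "$bam" ]; then
                rm -- "$sam"
            else
                echo "conversion failed, keeping $sam" >&2
            fi
        done
    ' sh "${2:-4}" {} +
}
```

Running sam_files_to_bam . 20 would mirror the thread count used in the one-liner above.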
