CD-HIT

CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. It was originally developed by Dr. Weizhong Li in Dr. Adam Godzik’s lab at the Burnham Institute (now the Sanford-Burnham Medical Research Institute).

CD-HIT is very fast and can handle extremely large databases. It helps to significantly reduce the computational and manual effort in many sequence analysis tasks, and aids in understanding the data structure and correcting the bias within a dataset. (from the CD-HIT home page)

Usage

module load cdhit

General form (representative sequences are written to my.out and the cluster listing to my.out.clstr)

cd-hit -i someproteins.fasta -o my.out

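Cluster with a minimum of 70% sequence identity (-c 0.7), using word size 5 (-n 5) and keeping each sequence name up to the first whitespace in the .clstr file (-d 0)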

cd-hit -i all.prot.fasta -o cdhit.70.out -c 0.7 -n 5 -d 0
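
The word size (-n) should match the identity threshold (-c). The sketch below maps a protein threshold to a word size, following the ranges given in the CD-HIT user guide (5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, 2 for 0.4 to 0.5). The function name cdhit_word_size is hypothetical and not part of CD-HIT; the ranges apply to cd-hit on proteins, not to cd-hit-est.

```shell
# Hypothetical helper (not part of CD-HIT): print the word size (-n)
# suggested for a protein identity threshold (-c).
cdhit_word_size() {
    awk -v c="$1" 'BEGIN {
        if (c > 1.0 || c < 0.4) {
            # cd-hit does not cluster proteins below 40% identity
            print "threshold must be between 0.4 and 1.0" > "/dev/stderr"
            exit 1
        }
        if      (c >= 0.7) print 5
        else if (c >= 0.6) print 4
        else if (c >= 0.5) print 3
        else               print 2
    }'
}

# Example (requires cd-hit to be loaded), clustering at 60% identity:
#   cd-hit -i all.prot.fasta -o cdhit.60.out -c 0.6 -n "$(cdhit_word_size 0.6)" -d 0
```

Computing the word size from the threshold avoids a mismatched -c/-n pair when the same command is reused at several identity levels.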

Split the clusters into one FASTA file per cluster, keeping only clusters with at least 10 proteins and writing the files to the directory clusters70

make_multi_seq.pl all.prot.fasta cdhit.70.out.clstr clusters70 10
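
To see how many clusters meet a size cutoff, the .clstr file can be summarized directly. This is a minimal sketch, assuming the usual .clstr layout in which each cluster starts with a line of the form ">Cluster N" followed by one line per member sequence. The function name clstr_sizes is hypothetical and not part of CD-HIT.

```shell
# Hypothetical helper (not part of CD-HIT): print "cluster_id<TAB>member_count"
# for every cluster in a .clstr file with at least MIN members (default 1).
# Usage: clstr_sizes FILE.clstr [MIN]
clstr_sizes() {
    awk -v min="${2:-1}" '
        /^>Cluster/ {
            # A new cluster header: report the previous cluster, then reset.
            if (id != "" && n >= min) print id "\t" n
            id = $2; n = 0; next
        }
        NF { n++ }   # every non-empty, non-header line is one member
        END { if (id != "" && n >= min) print id "\t" n }
    ' "$1"
}

# Example: count the clusters with at least 10 members
#   clstr_sizes cdhit.70.out.clstr 10 | wc -l
```

The count from the example should correspond to the number of clusters that meet the cutoff of 10 used above.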

See the user guide for further documentation.