BLOSUM62: more or less than 62% identity?

"The Henikoffs took a big database of trusted alignments (their BLOCKS database), and (in effect) only counted pairwise sequence alignments related by less than some threshold percentage identity. A threshold of 62% identity or less resulted in the target frequencies for the BLOSUM62 matrix. An 80% threshold gave the more highly conserved target frequencies of the BLOSUM80 matrix, and a 45% threshold gave the more divergent BLOSUM45 matrix."

Source: Sean R. Eddy, "Where did the BLOSUM62 alignment score matrix come from?", Nature Biotechnology 22, 1035-1036 (2004), doi:10.1038/nbt0804-1035

http://www.nature.com/nbt/journal/v22/n8/full/nbt0804-1035.html


"In order to avoid over-weighting closely-related sequences, the Henikoffs replaced groups of proteins that have sequence identities higher than a threshold by either a single representative or a weighted average. The threshold of 62% produces the commonly used BLOSUM62 substitution matrix."

Source: Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, 2002, p. 175

Winterschlaefer 15:52, 14 February 2007 (UTC)


As far as I know, a BLOSUM62 matrix is good for alignments that have 62% or MORE identity. XApple 00:32, 25 February 2007 (UTC)


I agree with Winterschlaefer. For BLOSUM62, the Henikoffs weighted all sequences that are 62% or more identical as one single sequence, so that they contribute less to the matrix. As the paper reads,

"To reduce multiple contributions to amino acid pair frequencies from the most closely related members of a family, sequences are clustered within blocks and each cluster is weighted as a single sequence in counting pairs. This is done by specifying a clustering percentage in which sequence segments that are identical for at least that percentage of amino acids are grouped together."

Also, as I can see in the history of this article, the following statement used to be part of the references section: "BLOSUM62 is for sequences of 62% OR GREATER sequence identity, not less than 62% (Voet, D., Voet,J., 2005)", and this may well be what Voet & Voet claim. However, this is different from the following statement, which is now referenced with Voet & Voet: "BLOSUM62 is the matrix calculated by using the observed substitutions between proteins which have 62% or more". What I'm saying is that this reference does not support this claim. The BLOSUM62 matrix is actually calculated (primarily) from sequences that have 62% or less sequence identity. Still, IMHO, BLOSUM62 is designed for sequences with similarities around 62%, not more. If I wanted to compare sequences with a similarity of 80%, I'd choose BLOSUM80.

Source: Henikoff & Henikoff, "Amino acid substitution matrices from protein blocks", PNAS 89, pp. 10915-10919. 134.34.4.5 21:09, 28 May 2007 (UTC)


It is definitely the case that BLOSUM62 is based only on sequences that have 62% or more identity, while BLOSUM80 is based on sequences with 80% or more identity. Which one you use is up to your personal taste, but as far as I know you would use a BLOSUM matrix that is close to your sequence identity; on that point I agree with the speaker above. The error was fixed here. Greetings--hroest 03:39, 4 June 2008 (UTC)


NO! Look at the original paper. If you read the Henikoff & Henikoff paper, it is clear that the earlier comments are correct: BLOSUM62 means that sequences with > 62% identity were averaged, i.e. BLOSUM62 mostly represents changes in sequences with less than 62% conservation. Hroest's assertion above is without a reference and is incorrect. I'm going to revert the article.

Source: Henikoff & Henikoff, "Amino acid substitution matrices from protein blocks", PNAS 89, pp. 10915-10919

http://www.pnas.org/content/89/22/10915.full.pdf

Jnmaloof (talk) 23:46, 9 November 2017 (UTC)
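
For anyone still unsure which way the threshold cuts, here is a minimal Python sketch of the clustering step quoted from Henikoff & Henikoff above. It is only an illustration, not the BLOCKS code: the function names and the toy block are made up, single-linkage clustering is an assumption about how "grouped together" is applied, and the 1/(size x size) weighting is one common way to make each cluster count as a single sequence.

```python
from collections import defaultdict
from itertools import combinations


def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length segments match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_segments(block, threshold):
    """Assumed single-linkage clustering: segments that are at least
    `threshold` percent identical end up in the same cluster."""
    parent = list(range(len(block)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(block))]


def weighted_pair_counts(block, threshold):
    """Count amino acid pairs per column, weighting each cluster as one sequence.
    Pairs of segments inside the same cluster are not counted at all."""
    labels = cluster_segments(block, threshold)
    size = defaultdict(int)
    for label in labels:
        size[label] += 1

    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # closely related segments: no contribution
        weight = 1.0 / (size[labels[i]] * size[labels[j]])
        for x, y in zip(block[i], block[j]):
            counts[tuple(sorted((x, y)))] += weight
    return dict(counts)


# Hypothetical toy block: segments 0 and 1 are 80% identical,
# segment 2 is 40% and 60% identical to them respectively.
toy_block = ["ACDEF", "ACDEY", "GCDHY"]
print(weighted_pair_counts(toy_block, 62))
```

With a clustering percentage of 62, the first two toy segments are merged and together count as one sequence, so only comparisons against the third segment (which is less than 62% identical to both) enter the counts, and the A/G pair totals 1.0 rather than 2.0. That is the sense in which BLOSUM62 is built from pairs below the threshold.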

Illustration

This badly needs a picture of a typical BLOSUM matrix. XApple 14:52, 12 February 2007 (UTC)

It did get one. --hroest 05:50, 7 March 2008 (UTC)

"BLOSUM matrix" is correct

Some smart people say that one must say "BLOSUM" instead of "BLOSUM matrix" because the "M" in BLOSUM already means "matrix". That is true, but the term BLOSUM is by now a name, not just an abbreviation. BLOSUM is a technical term. It is common practice in the scientific community to speak of "BLOSUM matrices". Just saying "BLOSUM" is counterintuitive and not idiomatic.

Furthermore, if we wanted to get it linguistically really right, the article itself contained mistakes. It read: "To calculate a matrix for BLOSUM, ...". This is grammatically wrong, whatever opinion one has about BLOSUM matrices. 134.76.81.25 (talk) 10:49, 22 September 2010 (UTC)

Confusing grammar

I speak and read correct English and yet the following sentences do not seem to make sense to me:

"Scores for each position are obtained frequencies of substitutions in blocks of local alignments of protein sequences."

" By using the block, counting the pairs of amino acids in each column of the multiple alignment."

"Two major forces drive the amino-acid substitution rates away from uniformity: substitutions occur with the different frequencies, and lessen functionally tolerated than others."

Overall, pretty confusing.