I have a weight matrix of size 20 x 15 (amino acids x sequence positions). Each element of the matrix is a relative probability.

If I have a sequence, say "AAPGTGASMHSGLLW", how would I score it against the matrix? I tried taking the product of the probabilities that the sequence selects from the matrix, but I end up with a really small number.

Any ideas?

Edit:

Consider the simple matrix:

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| A | 0.3 | 0.90 | 0.5 | 0.0001 |
| B | 0.2 | 0.05 | 0.4 | 0.2 |
| C | 0.5 | 0.05 | 0.1 | 0.8 |

The best match, with its score, is:

`CAAC = 0.5 * 0.9 * 0.5 * 0.8 = 0.18`

If you change the first letter to B instead of C, you get a match with a score of:

`BAAC = 0.2 * 0.9 * 0.5 * 0.8 = 0.072`

That is a huge difference for such a small change. It is even worse with my larger matrix, since the score is easily affected by small probabilities.

The probabilities are correct. You must take the product (in log space this is equivalent to a sum). The score probably looks small because you expect it to be close to 1, but that is not the case. To get a score of 1, the PWM would need to have 1/0/0/0 at every position and the sequence would have to match perfectly.
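As a minimal sketch of the product versus log-sum equivalence, the toy matrix from the question can be scored both ways. The names `PWM`, `product_score` and `log_score` are made up for this illustration and are not from any library.

```python
import math

# Toy PWM from the question: one row per symbol, one column per position 1-4.
PWM = {
    "A": [0.3, 0.90, 0.5, 0.0001],
    "B": [0.2, 0.05, 0.4, 0.2],
    "C": [0.5, 0.05, 0.1, 0.8],
}


def product_score(seq, pwm):
    """Multiply the probability of each symbol at its position."""
    score = 1.0
    for pos, sym in enumerate(seq):
        score *= pwm[sym][pos]
    return score


def log_score(seq, pwm):
    """Sum of natural logs; exp() of this equals the product score."""
    return sum(math.log(pwm[sym][pos]) for pos, sym in enumerate(seq))


for seq in ("CAAC", "BAAC"):
    # The two columns printed after the sequence should agree up to rounding.
    print(seq, product_score(seq, PWM), math.exp(log_score(seq, PWM)))
```

For a long matrix such as the 20 x 15 one in the question, working with the log sum avoids carrying around very small products, and the ranking of sequences is unchanged.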

So what should you compare to? What people usually do is compare the score to a background distribution, the simplest being uniform, so the background PWM is 0.25 everywhere (for a four-letter alphabet). For your example, the background score is 0.25^4 = ~0.004, and this is what you should expect by chance.

This is why people usually look at the ratio between the score under the PWM and the score under the background model (and usually take the log2 of that). In your case the ratio is 0.18/0.004 = ~46, so the sequence you got is about 46 times more likely under the PWM than you would expect by chance! For your second example, 0.072/0.004 = ~18 times more likely than expected, so that is still high.
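A small sketch of that log-odds calculation, using the toy matrix from the question. The helper name `log_odds` and the `BACKGROUND` dictionary are assumptions for this illustration. The 0.25 value simply mirrors the uniform background used in this answer; for a 20-letter amino acid matrix you would substitute 1/20 or observed residue frequencies.

```python
import math

# Toy PWM from the question: one row per symbol, one column per position 1-4.
PWM = {
    "A": [0.3, 0.90, 0.5, 0.0001],
    "B": [0.2, 0.05, 0.4, 0.2],
    "C": [0.5, 0.05, 0.1, 0.8],
}

# Assumption: uniform background of 0.25 per symbol, as in the answer above.
BACKGROUND = {sym: 0.25 for sym in PWM}


def log_odds(seq, pwm, background):
    """log2 of P(seq | PWM) / P(seq | background), summed per position."""
    return sum(
        math.log2(pwm[sym][pos] / background[sym])
        for pos, sym in enumerate(seq)
    )


for seq in ("CAAC", "BAAC"):
    score = log_odds(seq, PWM, BACKGROUND)
    # 2 ** score recovers the plain likelihood ratio (roughly 46 and 18).
    print(seq, round(score, 3), round(2 ** score, 2))
```

A positive log-odds score means the sequence is more likely under the PWM than under the background, and a negative one means the opposite, which gives a natural zero point that the raw product lacks.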

More conceptually, you are comparing two probabilistic models, your PWM and a background PWM, by the probability each one assigns to your observed sequence. This is a common approach for comparing probabilistic models in general, even when they are more complicated.

According to [this page][1], you should take the sum and not the product:

Once a Profile has been derived from a set of functionally related sites, the Profile can be used to scan a query sequence for the presence of potential sites. Usually you run a window the length of the matrix along the sequence, and sum the coefficients from the matrix corresponding to each nucleotide in each position on the window sequence. Formally, the score of a matrix M for a site s of length l ($s = s_1, \ldots, s_l$, with each $s_k$ being one of {A, C, G, T}) is computed as

$m_s=\sum\limits_{j=1}^{l} M_{s_j j}$
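The window scan described in the quote could be sketched as follows. This assumes the matrix coefficients are log-odds values (here derived from the toy probabilities in the question with an assumed uniform background of 0.25), which is what makes summing them meaningful. The names `M`, `window_score` and `scan`, and the query string `"BCAACB"`, are hypothetical and chosen only for this example.

```python
import math

# Toy probabilities from the question: rows are symbols, columns positions.
PWM = {
    "A": [0.3, 0.90, 0.5, 0.0001],
    "B": [0.2, 0.05, 0.4, 0.2],
    "C": [0.5, 0.05, 0.1, 0.8],
}
BACKGROUND = 0.25  # assumption: uniform background probability per symbol

# Coefficient matrix M: log2 odds of each symbol at each position.
M = {sym: [math.log2(p / BACKGROUND) for p in row] for sym, row in PWM.items()}


def window_score(site, matrix):
    """m_s: sum over positions j of the coefficient for symbol s_j."""
    return sum(matrix[sym][j] for j, sym in enumerate(site))


def scan(sequence, matrix):
    """Score every window of matrix width; return (start, score) pairs."""
    width = len(next(iter(matrix.values())))
    return [
        (start, window_score(sequence[start:start + width], matrix))
        for start in range(len(sequence) - width + 1)
    ]


hits = scan("BCAACB", M)  # hypothetical query sequence
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(hits)
print(best_start, best_score)
```

Under that assumption the sum in the formula and the product of probabilities rank windows identically, since the sum of logs is the log of the product.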

I highly recommend you read the rest of the page; the author, Roderic Guigó, is an authority on the subject.