Problem Set Four

October 17th, 2008

Problem set four was handed out in class and is also available as a PDF here and in the downloads section. It is due at the beginning of class on Thursday, October 23, 2008.

  1. leec
    October 20th, 2008 at 16:07
    #1

    A student asked several questions:

    Hi. I have several questions regarding HW#4

    1) In problem#1d, I’m not clear exactly what was meant by “the fluorescence xi depends linearly on the amount of marker i in the sample”. Also, I’m not entirely sure how I would go about writing an expression for this equation because xi and vi are from different subjects, so how can I determine how xi is conditioned upon vi.

    ANSWER
    The problem clearly states that the crime scene observation x_i is a mixture of two people’s DNA, given by kappa_i and nu_i (what you called vi). Moreover, the problem states that the observation value x_i depends linearly on the amount of marker i in this mixture. That means if the amount of marker i in the sample is 50%, you will get an x_i observation of around 0.5 (but with some error due to the standard deviation sigma_i); similarly, if the amount of marker i in the sample is 20%, you’ll get an x_i observation of around 0.2. That part is straightforward. All that’s left is to ask yourself three questions:

    - Given a value of kappa_i, approximately what value of x_i would I expect in a sample from that person? The problem gives you the answer: kappa_i/2.

    - Given a value of nu_i, approximately what value of x_i would I expect in a sample from that person?

    - Given a mixture of fraction delta from person kappa and 1-delta from person nu, what value of x_i would I expect from that mixture sample?
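
    One way to read "depends linearly" is sketched below in Python. This is only an illustration under stated assumptions, not the official solution: it assumes kappa_i and nu_i are copy counts in {0, 1, 2}, that a sample from person nu alone would give about nu_i/2 (by analogy with the kappa_i/2 stated in the problem), and that the measurement error is Gaussian with standard deviation sigma_i. The function names are made up for this sketch.

```python
import math

def expected_fluorescence(kappa_i, nu_i, delta):
    """Expected x_i for a mixture: fraction delta from person kappa,
    fraction 1 - delta from person nu.

    Assumption: each person's own sample gives about count / 2
    (stated in the problem for kappa_i, assumed here for nu_i).
    """
    return delta * kappa_i / 2.0 + (1.0 - delta) * nu_i / 2.0

def log_likelihood(x_i, kappa_i, nu_i, delta, sigma_i):
    """Log-likelihood of observing x_i, assuming Gaussian error
    with standard deviation sigma_i around the mixture mean."""
    mu = expected_fluorescence(kappa_i, nu_i, delta)
    return (-0.5 * math.log(2.0 * math.pi * sigma_i ** 2)
            - (x_i - mu) ** 2 / (2.0 * sigma_i ** 2))

# Equal mixture of a person with 2 copies and a person with 0 copies.
print(expected_fluorescence(2, 0, 0.5))
```

    Because the expectation is a weighted average, setting delta to 1 recovers the single-person value kappa_i/2 given in the problem.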

    2) In problem#6, I think I am lacking in the biology aspect to understand this problem. For part a, is the likelihood table the table that can be generated by the Viterbi algorithm? If so, then how would I know the probability of each occurring, and the probability of transition to another? Also, I’m still not entirely clear on what’s the pseudocount principle. I’ve looked at the notes, but to me it just means we don’t have enough data for a certain hidden state. But how do I calculate this?

    ANSWER
    There is no “biology” in this problem; the question only asks you about probabilities in a very simple Hidden Markov Model. You are given five observations of an 8 letter “string” and asked to make an HMM that models these data.

    The Viterbi algorithm we studied in Thursday’s lecture requires a likelihood table as input; it does not PRODUCE that likelihood table.
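
    To make the input/output distinction concrete, here is a minimal Viterbi sketch in Python. It is not the lecture's code; the state names, the 0.5 start probabilities, and every number in the tables below are hypothetical placeholders. The point is only that the start, transition, and emission (likelihood) tables are arguments to the function, and the result is a state path.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (computed in log space).

    All probability tables are inputs; nothing here estimates them.
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s] = best log-probability of any path ending in state s
    score = {s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict per later position: best predecessor of each state
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lg(trans_p[r][s]))
            score[s] = prev[best] + lg(trans_p[best][s]) + lg(emit_p[s][x])
            ptr[s] = best
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical tables: R is a uniform "random sequence" state,
# S is a made-up state that strongly prefers the letter A.
states = ["R", "S"]
start_p = {"R": 0.5, "S": 0.5}
trans_p = {"R": {"R": 0.9, "S": 0.1}, "S": {"R": 0.1, "S": 0.9}}
emit_p = {
    "R": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "S": {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
}
print(viterbi("GCAAAATG", states, start_p, trans_p, emit_p))
```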

    Problem 6a asks you to construct a reasonable likelihood table from the observations you were given, using simple inference principles we learned in the preceding section of the course, specifically, the pseudocount principle.

    This is very simple. I derived and discussed the pseudocount principle in lecture, so go back to the audio or your own notes; alternatively see pages 33-34 of my chapter 3. The pseudocount principle just means adding an extra count (i.e. +1) to all the possible observation values (even those that had an observation count of zero in our sample). This ensures that we will not set the likelihood of such a state to ZERO just because we only had a small sample of observations.
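
    As a small illustration of the +1 rule just described, the Python sketch below turns the letters observed at one position into a likelihood table. The function name and the example column are made up for illustration; they are not the homework data.

```python
from collections import Counter

def pseudocount_likelihoods(observed, alphabet="ACGT", pseudocount=1):
    """Estimate Pr(letter) from observed letters, adding `pseudocount`
    to every possible letter so that no estimate is exactly zero.

    Assumes every observed letter belongs to `alphabet`.
    """
    counts = Counter(observed)
    total = len(observed) + pseudocount * len(alphabet)
    return {x: (counts[x] + pseudocount) / total for x in alphabet}

# Made-up example: five observations of one position in the string.
column = "AAAAT"
for letter, p in pseudocount_likelihoods(column).items():
    print(letter, round(p, 3))
```

    With five observations and four possible letters the denominator becomes 5 + 4 = 9, so a letter that was never observed still gets likelihood 1/9 rather than 0.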

    3) Also in problem#6, part b, how will I draw a Markov graph with this? I’m not entirely sure as to what is meant by the first and last nucleotide. Is it of the 8 nucleotide sequence mentioned in the problem? If so, how would I know the transition between the 1st to 2nd, 3rd, etc? Is R just any random nucleotide?

    ANSWER

    A “nucleotide” means a letter. So “first nucleotide” means “first letter” of the 8 letter string; “last nucleotide” means “last letter”. All of this was covered in the assigned reading (Jones & Pevzner chapter 3), so please go back to that reading if this seems unclear.

    Yes, R stands for “random sequence state”, as defined in the problem: “state R emits nucleotides with likelihood Pr(X|R)=0.25 random sequence (uniform probability)”, where the possible values of X are the letters A, T, G, C (again, see JP chapter 3 if this is unfamiliar to you).

    As for the transition probabilities from 1st letter to 2nd, 3rd etc., I think you are making this out to be more complicated than necessary. Just construct the simplest HMM that can produce a good model of the observations and follows the requirements given in part 6b. This is not a trick question…

  2. Yan
    October 21st, 2008 at 22:56
    #2

    For problem#2, can I assume that the probability from the start state to either alpha or beta is 0.5?

  3. mjanis
    October 22nd, 2008 at 12:02
    #3

    Yes