1. Probability

    Consider the domain of dealing five card poker hands from a standard deck of 52 cards, under the assumption the dealer is fair.

    1. How many atomic events are there in the joint probability distribution? (I.e., how many five card hands are there?)

      This is simply the number of ways to choose five things from 52:

      525 = 52! 5!52-5! = 525150 4948 5! = 2598960
    2. What is the probability of each atomic event?

      Each hand is equally likely, so:

      12598960
    3. What is the probability of being dealt a royal straight flush?

      There are four royal straight flushes, so:

      42598960 = 1649740
    4. What is the probability of being dealt four of a kind?

      Each hand of four of a kind is distinguished by two things, the card there is four of and the card that is different. There are 13 different cards to have four of and for each given four of a kind there are 48 cards that are different. So, the number of ways to get four of a kind is:

      1348 = 624

      The probability then would be:

      6242598960 = 14165
  2. Show that the statement:

    PABC = PAC PBC

    is equivalent to either of the statements:

    PABC = PAC and PBAC = PBC

    First, some notation. PA,B(a,b) is the joint probability of A and B:

    PABab = PA=aB=b = PA=a|B=b = PB=b|A=a

    The initial statement:

    PABC = PAC PBC

    Is a statement that A and B are independent, that is:

    PAB = PA PBA = PB

    For conditional probability in general:

    PAB = PAB PB = PAB PB = PAB PB

    This can be expanded to conditional joint probability:

    PABC = PABC PC = PABC PC = PABC PC = PABC PC

    Likewise:

    PABC = PABC PBC = PABC PBC = PABC PBC = PABC PBC

    Combining these two forms along with Bayes' rule gives:

    PABC = PABC PC PABC = PABC PC PBC = PBC PC PBC = PBC PC PABC = PABC PBC = PABC PC PBC = PABC PC PBC PC = PABC PBC

    Incorporating the initial assumption of independence:

    PABC = PABC PBC = PAC PBC PBC = PAC
  3. Suppose you are a witness to a nighttime hit-and-run accident involving a taxi in Athens. All taxis in Athens are either blue or green. You swear, under oath, that the taxi was blue. Extensive testing shows that, under dim lighting conditions, discrimination between blue and green is 75% reliable.

    1. Is it possible to calculate the most likely color for the taxi?

      From the question we are given probabilities about the events:

      • tg — The taxi was green
      • tb — The taxi was blue
      • pg — The perceived color was green
      • pb — The perceived color was blue

      The information given is:

      • All taxis are either blue or green:
        • P(tg ∨ tb) = 1
        • P(tg ∧ tb) = 0
      • Discrimination between blue and green is 75% reliable. (An assumption is made that the person making the statement is being honest and that they are accurately reporting their perceptions.) This is somewhat tricky since it is possible to say, "I know what the perception is, so the perception is given." For the purposes of testing reliability though, the actual color is what was known and the perception was determined:
        • P(pg | tg) = P(pb | tb) = .75
        • P(pg | tb) = P(pb | tg) = .25

      The desired quantity is the most likely color for the taxi:

      xi xixj tgtb ij Pxi> Pxj ij

      It is worth noting two facts:

      • If there is a tie for the top place, then xi is not defined
      • For this situation where there are only two colors, the statement can be simplified to: x x tgtb Px>0.5

      Consider the probability that the taxi was in fact blue given it was perceived to be blue, using Bayes' rule: (If this is less than 0.5 then the taxi was probably green.)

      Ptbpb = Ptbpb Ppb = Ptbpb Ptbpb+ Ptb¬pb = Ptbpb Ptbpb+ Ptbpg = Ppbtb Ptb Ppbtb Ptb + Ppbtg Ptg = Ppbtb Ptb Ppbtb Ptb + Ppbtg 1-Ptb

      All of the conditional probabilities are known, but P(tb) isn't, so the solution isn't known. This makes sense. Say for example there are 10,000 taxis in Athens. 9,999 are green and 1 is blue. The likelihood that I mistook one of the green ones for the blue one is much higher than the probability that I actually saw the one blue one.

    2. How about if it is known that 9 out of 10 Athenian taxis are green?

      This adds an additional known probability:

      • P(tg) = 0.9
      • P(tb) = 1 - P(tg) = 0.1

      With this information it is possible to solve the previous equation:

      Ptbpb = Ppbtb Ptb Ppbtb Ptb + Ppbtg 1-Ptb = 0.750.1 0.750.1+ 0.251-0.1 = 0.25

      So, there is only a 14 probability that the taxi was blue given I thought it was blue. I.e., it was probably green.

  4. Two astronomers in different parts of the world make measurements M1 and M2 of the number of stars, N, for a particular region of the sky. Normally, there is a small probability, e, that there will be an error of up to one star in either direction. Each telescope can also be badly out of focus (events F1 and F2) with much smaller probability f, in which case the scientist will undercount by three or more stars. If N is less than 3, they can fail to detect any stars at all. Consider the following Baysean networks:

    1. Which of these Baysean networks are correct (but not necessarily efficient) representations of the question?

      ii is clearly correct because a change in N or Fi will affect Mi. The statement of the question highly suggests that at least one of the other two is an inefficient solution.

      i can be eliminated because Mi is only dependent on Fi. The count found by the first observer, M1, is very likely correlated to the count found by the second, M2.

      iii doesn't have any conditional independences which is correct, though not terribly efficient.

    2. Which is the best network?

      ii is the most succinct network which requires the minimum of information to do computations.

    3. Write out the conditional probability distribution for P(M1 | N), for the case where N ∈ {1,2,3} and M1 ∈ {0,1,2,3,4}. Each element should be expressed in terms of the parameters e and/or f.

      If a telescope is out of focus it will undercount by at least three. Counting errors will cause an error of at most one, so it is not possible to undercount by two. Also it is not possible to overcount by more than one.

      It is assumed that the likelihood of a counting error is equally likely in either direction, so:

      P Mi=k-1 N=k = P Mi=k+1 N=k = e2
      M1 P(M1 | N = 1) P(M1 | N = 2) P(M1 | N = 3)
      0 e2 0 f
      1 1 - e e2 0
      2 e2 1 - e e2
      3 0 e2 1 - e - f
      4 0 0 e2
    4. Suppose M1 = 1 and M2 = 3. What are the possible number of stars if there is no prior constraint on the value of N?

      Since it is possible to undercount by an arbitrary number of stars, there is no maximum possible. Both M1 and M2 could be underestimating by 100,000 stars for all that is specified in the problem. The question is the minimum number. It is only possible to overcount by 1, so the minimum set by M2 is 2.

      2 N <
    5. What is the most likely number of stars given M1 = 1 and M2 = 3? If it is not possible to compute this, explain what additional information is needed and how it would affect the result.

      The likelihood that the telescopes are out of focus and a serious overcount is happening is constant at f. If this is the case then anything above N = 6 (M2 + 3) has the same probability:

      N M1 - N M2 - N P(N | M1 = 1) P(N | M2 = 3) P(N | M1 = 1,M2 = 3)
      ≤ 1≥ 0≥ 2010
      2 -1 1 e2 e2 e22
      3-20020
      4 -3 -1 f e2 ef2
      5-4-2020
      ≥ 6 ≥ -5 ≥ -3 f f f2
      • 1 — can't overcount by more than one
      • 2 — can't undercount by two

      The most likely number of stars is 2. From the problem specification, it is stated that f is "much smaller" than e. This can be taken to assume:

      f < e P N=4 M1=1 M2=3 = ef2 < e22 = P N=2 M1=1 M2=3 f < e2 P N6 M1=1 M2=3 = f2 < e22 = P N=2 M1=1 M2=3
  5. Consider the following variable elimination algorithm:

    • function Elimination-Ask(X, e, bn) returns a distribution over X
      • inputs:
        • X — the query variable
        • e — evidence of a specified event
        • bn — a Baysean network specifying a joint distribution P(X1,…,Xn)
      • variables:
        • factors ⇐ []
        • varsReverse(Vars[bn])
      • algorithm:
        • for each var in vars do
          • factors ⇐ [Make-Factor(X,e)|factors]
          • if var is a hidden variable then factorsSumOut(var,factors)
        • return Normalize(Pointwise-Product(factors))
    1. Apply variable elimination to the query:

      P Burglary JohnCalls=true MaryCalls=true

      For the purposes of succinct representation, the full predicates are shortened:

      • B = {b, ¬b} = Burglary
      • J = {j, ¬j} = JohnCalls
      • M = {m, ¬m} = MaryCalls

      There are also hidden variables:

      • A = {a, ¬a} = Alarm
      • E = {e, ¬e} = Earthquake

      The basics of this problem are predicated on the following Baysean network:

      Recall, in general, for a Bayes' net:

      P x1xn = Πi=1n P xiparentsXi

      The problem is considered in terms of the entire probability distribution, P, as opposed to the probability of a particular event, P. For a given:

      • X — query variable being investigated
      • o — observed values of evidence variables
      • u — unobserved variables
      PXo = αPXo = αu PXou

      So, for this example:

      PBjm = αPBjm = αea PBeajm

      I really don't understand the ee. I thought e meant there was an earthquake whereas E represented the probability distribution. Why isn't the equation:

      PBjm = αPBjm = αea PBEAjm

      The syntax I'm going to use is:

      yPY = Py+P¬y

      The first step is to reconsider the problem in terms of the known conditional independences from the Baysean network:

      P Bjm = αea PBEAjm = αea PB PE P ABE PjA PmA

      Under the laws of summation, this is equivalent to:

      αPB ePE a PABE PjA PmA

      The next steps involve working through the equation from right to left producing a set of factors. Factors are produced in several ways.

      For PmA , m is a fixed value, and the factor for a fixed value is simply:

      fmA = Pma Pm¬a

      Similarly for PjA , j is a fixed value:

      fjA = Pja Pj¬a

      There is another notational change here. Russell and Norvig use:

      fjA = fJjA = Pja Pj¬a fJJA = Pja Pj¬a Pj¬a P ¬j¬a

      Since fjA is the pairing of j with the elements from A, I am going to use fJA as the pairings of the elements of J with the elements of A. This is more intuitive since what would the notation fJA mean in Russell and Norvig's representation? It is a factor of what with the elements of A? If J needs to be present in the argument list to represent its inclusion in the factor then fJA doesn't make sense. So, for my notation, fJJA is a 2×2×2 matrix since there's a dimension for each axis. Specifically:

      fJJA = fjAJ f¬jAJ = Pjaj P j¬aj P ja¬j P j¬a¬j P ¬jaj P ¬j¬aj P ¬j a¬j P ¬j ¬a¬j = 11 00 00 11

      Where AB represents that A and B are the unstacked layers of a three-dimensional matrix. The above result is not useful in any way to the best of my knowledge, it simply serves to illustrate the meaning of the syntax and show the procedure for a three dimensional expansion.

      For the next term, PABE , dealing with the conditional probability on B and E, the factor is:

      fA BE = faBE f¬aBE = Pabe P ab¬e P a¬be P a¬b¬e P ¬abe P ¬ab¬e P ¬a¬be P ¬a ¬b¬e

      The next element in the equation is the summation over a. Factors may be combined with a type of summation where:

      f A_jm BE = a fABE fjA fmA = faBE fja fma + f¬a BE fj¬a fm¬a

      The operation × is the pointwise product. The pointwise product of two factors yields a new factor that is the union of the variables in the factors. Consider one of the elements in the previous equation: (Note that this operation divides the levels of fA(B,E) and reduces the dimensionality by a level.

      faBE fja fma = Pabe P ab¬e P a¬be P a¬b¬e Pja Pma = 0.950.94 0.290.001 × 0.9 × 0.7 = 0.950.63 0.940.63 0.290.63 0.0010.63 = 0.59850.5922 0.18270.00063

      The process is the same for ¬a, and that factor is:

      f¬aBE fj¬a fm¬a = 0.050.6 0.710.999 × 0.1 × 0.3 = 0.00150.0018 0.02130.02997

      The sum of those factors then is:

      f A_jm BE = a fABE fjA fmA = faBE fja fma + f¬aBE fj¬a fm¬a = 0.59850.5922 0.18270.00063 + 0.00150.0018 0.02130.02997 = 0.5985+0.00150.5922+0.0018 0.1827+0.02130.00063+0.02997 = 0.60.594 0.2040.0306

      Note that the process of summing fAB to fA_B reduced fAB from a 2×2 matrix to a 2×1. In general, summing will remove a dimension from the matrix.

      The next element in the equation is PE this is simply the probabilities of E not conditional on anything, and the factor is represented as:

      fE = Pe P¬e

      This is also a deviation from Russell and Norvig who represent this quantity as fEe .

      The next element in the equation is a summation over e which, as before, produces:

      f E_A_jm = e fE f A_jm BE = fe f A_jm BE + f¬e f A_jm BE = Pe × f A_jm BE + P¬e f A_jm BE = 0.002 × 0.60.594 0.2040.0306 + 0.998 × 0.60.594 0.2040.0306 = 0.002 × 0.60.594 0.2040.0306 + 0.998 × 0.60.594 0.2040.0306 PBjm = 0.284 0.716
    2. Calculate the number of computations performed and compare that with the number performed by enumeration.

      By counting:

      • Enumeration: 24
      • Elimination: 7
    3. Suppose a network has the form of a chain: a sequence of boolean variables X1,…,Xn where Parents(Xi) = {Xi - 1} for i = 2,…,n.

      1. What is the complexity of computing P(X1|Xn = true) using enumeration?

        In general:

        P X1Xn=true = α kn=true kn-1 Xn-1 k2X2 P Xn= kn Xn-1= kn-1 X2= k2 X1

        This equation can then be factored based on the knowledge in the Bayes' diagram:

        α kn=true P Xn=kn Xn-1 kn-1 Xn-1 P Xn-1= kn-1 Xn-2 k2X2 P X2=k2 X1 P X1

        Since |ki| = 2 ∀ k < n, the enumeration search tree will be a full binary tree of depth n - 1.

        The number of non-leaf nodes in such a tree (and required computational complexity) is 2n - 2.

      2. What is the complexity using elimination?

        Each reduction will take two computations and there are n of them, so 2n2.

  6. You're a security guard in an underground installation. You want to know if it's raining on a given day, but your only exposure to the outside world occurs when the director arrives each morning either with, or without an umbrella. Thus, for each day, t, the set of evidince varialbes, Et, contains a single element ut representing the presence or absence of an umbrella. The set of unobservable state variables, Xt, contains a single variable rt representing whether it is raining or not.

    Furthermore, assume the following observed probabilities:

    rt - 1P(rt = T|rt - 1)
    T0.7
    F0.3
    rtP(ut = T|rt)
    T0.9
    F0.2
    1. Suppose you observe an unending sequence of days on which the umbrella appears. Show that, as the days go by, the probability of rain on the current day increases monotonically toward a fixed point. What is that fixed point?

      An examination of this question requires an application of chaining for a first-order Markov process which states:

      P Xt u1t = α P utXt rXt-1 P Xt r u1t-1 P r u1t-1 = α P utXt rXt-1 PXtr P r u1t-1

      This function allows the probability of a given event to be built up from a given set of evidence. Like any chain however, it must have an end. In this case, that end is P(R0) = <P(r0),P(¬r0)> which is the probability distribution for rain on day 0.

      For the sake of simplicity, P(R0) will be assumed to be fixed at <0.5,0.5>. If, as the question asks, the distribution is converging on a fixed value, then the start point for the convergance should not matter.

      The calculations start off with a computation of the probability distribution for rain on day 1 which is slightly different from other calculations because P(R0|u0) is defined as P(R0).

      P R1 u10 = r0 P R1 r0 P r0 u10 = r0 P R1 r0 Pr0 = P R1 r0 Pr0 + P R1 ¬r0 P¬r0 = 0.70.30.5 + 0.30.70.5 = 0.350.15 + 0.150.35 = 0.50.5

      Now, this value may be used for the first link in the chain:

      P R1 u11 = α P u1 R1 P R1 u10 = α Pu1r1 P u1¬r1 P R1 u10 = α 0.90.2 0.50.5 = α 0.450.1 0.450.45+0.1 0.10.45+0.1 0.81820.1818

      The next link progresses in the same way, but u1 is now defined and meaningful:

      P R2 u11 = r1 P R2 r1 P r1 u11 = P R2 r1 P r1 u11 + P R2 ¬r1 P ¬r1 u11 0.70.30.8182 + 0.30.70.1818 0.62730.3727

      The next link in the chain is:

      P R2 u12 = α P u2 R2 P R2 u11 α 0.90.2 0.62730.3727 0.88340.1166

      To do one more iteration to solidify the pattern:

      P R3 u12 = P R3 r2 P r2 u12 + P R3 ¬r2 P ¬r2 u12 0.70.30.8834 + 0.30.70.1166 0.65340.3466 P R3 u13 = α P u3 R3 P R3 u12 α 0.90.2 0.65340.3466 0.89460.1054

      The effect of having an umbrealla present every day means that the factors remain the same. The actual math has to date eluded me, but this program will estimate the converged value to an arbitrary number of decimal places. The value it comes up with is, to 50 places:

      <0.89674554944846594496715149092767174611258015717335, 0.10325445055153405503284850907232825388741984282665>

    2. Consider forecasting further and further into the future, given just the first two umbrella observations.

      1. Compute the probability P(rk + 2|u1, u2) for 1 ≤ k ≤ 20 and plot the results.

        P Rt+k+1 u1t = rt+k P Rt+k+1 rt+k P rt+k u1t

        The value of P r2 u12 is somewhat open to debate. In the previous question the value at t = 2 was based on an initial 50/50 probability of rain. Using that assumption:

        P R3 u12 = r2 P R3 r2 P r2 u12 = P R3 r2 P r2 u12 + P R3 ¬r2 P ¬r2 u12 0.70.30.8834 + 0.30.70.1166 0.65340.3466 P R4 u12 = r3 P R4 r3 P r3 u12 = P R4 r3 P r3 u12 + P R4 ¬r3 P ¬r3 u12 0.56140.4386
      2. You should see that this probability converges toward a fixed point. What is this point?

        <0.5,0.5>