Just as we did in the lecture notes, come up with a si
APA format, in-text citation, references include, 1 page
1. Just as we did in the lecture notes, come up with a simple, real-world example of a series of transitions that can be represented as a Markov chain transmission series ( different example in the attachment below)
2. In an HMM, what are Hidden States? If they are truly hidden, how can we determine what they are?
Hidden Markov Models (HMM)s:
Dr. J. Enrique Herrera, Ph.D.
Modifed by Dr. Wolfgang Rumpf, Ph.D.
What is a Hidden Markov Model, or HMM?
An HMM is a way to represent a real-life problem within a statistical framework – creating a model of a process that you may not know all of the variables for, like flipping a coin that isn't fair, or mutations at a base position of DNA. Basically, the way this works is you start with a Markov chain, and if the Markov chain doesn't completely "work" or create a model that's predictable, you can assume that there is a "hidden" component that is required to make a more accurate model. That's why an HMM is called an HMM – it's a "hidden Markov model" that relies on a Markov chain where you can't see the states, only the results of a state transition.
We've already talked about Markov chains, but let's review briefly. This a very basic representation of a Markov Chain:
There are a couple of things to notice here: this model has two states, “A” and “B”. You can also notice that the model can go from A to B or from A back to A. That is a valid move in the model. So, the model would produce a Markov Chain that would look something like: AABBBABBBAAABBB etc.
As an exercise, please follow the dot (if you don’t see the dot moving, you may need to open the document in MS word) and produce your own Markov Chain out this model mine is:
AABBBBAABBBAABBBBAABAAAAA is your similar?
Every time the dot moves, that is call a transition. If you were to observe this model for awhile and write down a long chain, you would eventually realize that some transitions happen more than others and you could then take the ratio of the number for a particular transition (say AB) and divide it by all the transitions you observe, that is called a transition probability.
Now on a more realistic example, say you want to predict the weather and you start to write down the weather daily on a very simple way, you go “S” for sunny and “R” for rainy. You could simply replace A and B for S and R on the previous model. You would then get a sequence like this:
SSSRRRSSRSSRSSRSRRS
You can calculate the transition probability and have a very basic model to predict the weather… I am not sure how accurate it will be, but you can say you used a Markov Chain to predict the weather…. nice right?
As an exercise, please come up with your own real-life example. If your example has all the components described here: states, transitions and predictions then congrats!! You got all the concepts, if you have any issues coming up with you own example, please do not be shy about sharing your questions with the class instructor!!
And then to tie this to Bioinformatics, DNA is a Markov chain that has 4 states, A, C, G, T. When you think about it this way, you can see why it is natural model DNA as a Markov Chain. If you visit these four states, you will get a sequence like
AAGCCGTTTCGAGGTTAGGTACCCCTTTT which is DNA, right?
Note: this may be a good stopping point, come back a read it again later. Although, these concepts appear simple, it is important to make sure you grasp them before you proceed, concepts may start getting mixed together otherwise.
The hidden part
What is hidden about a hidden Markov Model? It turns out that the states in a model may be hidden within another sequence. There are several "classic" examples – for example, let's say you are trying to construct a model of the weather patterns, but you aren't allowed to see the weather itself (perhaps you live in an inside apartment with no windows, and you never leave the apartment). What you do see is what your fellow tenants are wearing – some days they were shorts and t-shirts, while on other days they wear galoshes and raincoats. So, you can make a guess as to what the weather is based on your observation – not of the weather, but of the results of the weather.
Another example that we can walk through using more detail is that of the Dishonest Casino(Durbin et al., 1998).
In this example, you go to a casino an observe a game played with only one die. The dealer shows a number from 1 to 6 and if roll that number you win. So, the probability of a win is 1/6 for each roll.
You write down a series of rolls:
121432423222334332144323232324553342323423214435242323231
When you calculate the transition probabilities you start suspecting something is wrong, it appears the casino is cheating. Perhaps, the casino is switching the die on the player, sometimes it allows a fair (F) die to be played an in occasion it switches to a loaded die (L), a loaded die is one that lands on number more often that it should.
Loaded die (L):
Let us examine what we have so far, we have a Markov Chain, the states (1-6), we have a sequence, the one shown above. So, we would like to find out a couple of things, first is the casino really cheating? And if so, when is it switching?
Well, you will notice that with the model as it is you can only predict the new roll. You could say the next roll will be a “5” , but that is about it, you cannot say anything about if the casino is cheating. So, you do not know when the die is fair (F) or is loaded(L).
Since the casino is changing the dice on you and you do not see it, those are hidden, and you could consider them states L and F, hidden states.
So now, some things will change, since you are interested in the L,F states and not in the face numbers anymore, you cannot call the face numbers states anymore ( those are not the ones you are interested in for HMM). The new states are loaded(L) or fair(F), so what do you call the face number now? The face numbers are now called emissions (emissions are what you can see, the observable part of the model).
Since now your numbers 1-6 are emissions and your hidden states are L and F, you have a problem, you lost your transition probabilities, so what the transition probabilities since you cannot see when the dealer changes the dice?
This is the central question about HMM and probably the part that is not intuitive, but it turns out it is very easy to follow. Basically, what is done is that people observe the game for a long time and occasionally they force the casino to reveal when the chances occurred, that is how transition probabilities are calculated. It, of course, changes for each model, but that it is basically the principle behind it. In practice, for the purpose of this class you will always be given the HMM transition probabilities.
To recap, so far, we have for this model:
1. Emissions: The roll numbers, the ones you can see.
2. Hidden states: Fair or loaded (F or L), you cannot see them.
3. Emission probabilities: the probabilities associated with the emissions.
4. Transition probabilities: The probability of going from one state to another, for example going from loaded to fair.
5. Initial probabilities: this one we have not defined yet, but it is basically the probability for the model to start on a particular model, in the example, the probability to start with a load die or a fair die.
Exercise: Could you try to provide your own example of a real life HMM? You could try to modify your MC to make it an HMM. If you have any issues coming up with your own example, please do not hesitate on contacting your instructor with any questions.
Now, is it not amazing that with all the components listed above, you can now answer all your questions? not only if the casino is cheating, but in addition what rolls were made with the fair die and ones with the loaded die. Well, maybe you cannot know exactly, but you can assign the labels L,F with some probability. Some models will be better than others and there are entire dissertations on how well the model perform, but for now let’s say the models work well enough that we want to know about HMMs.
When you assign the labels, the result will look something like this:
LFFFLLLFFFFLLLLFFFFFFLLLFFFFFFFLLLLLLFFFFFFFLLLLLLFFLLLLLFFFFFLLL
121432423222334332144323232324553342323423214435242323231
Finding those labels (the L or F) is the bulk of the information we want to obtain out of the HMM. If you see more than just a few "L"s, then chances are the casino is cheating!
How do we determine the assignment (L or F)? Making those assignments is the full purpose of this topic and they are done with different algorithms. Two common algorithms used are the Backward-forward algorithm and the Viterbi algorithm. Let's see how we can create an HMM of the Unfair Casino using a Viterbi algorithm in R!
The process takes the emission probabilities, the transition probabilities and keeps track of a sequence for each hidden state, at each step you stablish the probability, this is called a state path. After the “paths” are completed, you then retrace the paths and identify the state that has the higher probability at each point.
You can see how this is done in practice here:
Running the Viterbi Algorithm in R
If you've never used R before, you'll be surprised at how easy it can be to use.
Installing R and R-Studio
1) Go to https://cran.r-project.org/mirrors.html
2) Select a mirror that is close to you geographically
3) Download an appropriate version of R for your operating system and install it
4) Download R-Studio at https://www.rstudio.com/ide/download/desktop
5) Install R-Studio
For a more detailed tutorial in R, please watch this video:
Adding Packages to R
R comes with many statistical functions built in, but it can also be extended by installing and loading special libraries. You'll need to add one of these special libraries (the HMM library) to run the Viterbi algorithm. To install this library, launch RStudio (it will automatically run the R that you downloaded as a background application). Once it is launched, you will see a window that looks like this:
On the left is the CONSOLE (you can see it's name on the tab). On the right is an ENVIRONMENT window (top) and a FILES window (bottom). You can type directly in the console, but if you want to save your code, you'll want to create a new script. To do that, go to the FILE menu, select NEW FILE, then select R SCRIPT. The CONSOLE window will drop to the bottom of the left-hand side, and you'll have a scripting window that you can type into and edit:
You will notice that this is called "Untitled1" in the tab at the top left. When you save it you can give it a name.
Now let's install the package we need. In the CONSOLE (bottom left), type the following command and press RETURN:
install.packages("HMM")
The package downloads and installs. Now let's start editing our R script (the Untitled1 at the top left). Start by adding these lines:
# Week 7 Viterbi Exercise
# BIFS 613
# Invoke Libraries
library(HMM)
The lines that start with the "#" are comment lines – they don't execute in the code, but they help you know what's going on. You should feel free to add as many comment lines as you want! The last line tells R to actually load the library – this is a nice feature, since over the years you may install dozens of libraries, but you may only need to include one or two in any given R script – this means that R runs faster and with less memory, since it doesn't have to load all of the libraries!
Let's continue. As mentioned earlier, we have to give the Viterbi algorithm the emission probabilities and the transition probabilities (the latter are the same probabilities that you see on a Markov chain's digraph). We'll also describe what we suspect are the hidden states. Add these lines to your R script:
# Emission Labels
emission_labels <- c(1,2,3,4,5,6)
# Transition Probabilities
tranxProbs <- matrix(c(.6,.4,.4,.6),2)
# Emission Probabilities
emissionsProb <- matrix( c(.2,.3,.2,.1,.1,.1 ),1 )
# Hidden States
hidden_states <- c('L','F') # L = loaded and F = fair
You should notice several things here – the syntax used by R is a bit unusual; we put our variable at the left, then use a left-pointing arrow ("<-") to tell it to feed whatever is to the right of that arrow into the variable. For example, the term c(1,2,3,4,5,6) is a way of telling R that we are concatenating the values one through six into the emission labels. You may also have noticed that I added a comment at the end of a code line (in the "hidden_states" definition). That's perfectly valid as long as you don't wrap the line!
Now that we have some code in our script, I'd recommend saving it. The FILE menu has a save option; save the script somewhere handy.
Once that's done, we can start to execute our code to make sure there aren't any bugs or other issues. Put your cursor at the very top of the script (at the # Week 7 line). Now, if you look along the top of the script window, towards the right you will see a RUN button. When you push that button, it will execute the line that the cursor is on and move the cursor down one line. That way you can execute your code one line at a time by repeatedly pushing that button (you can also select a block of code with your mouse and run the entire block at once). Go ahead and run everything you have so far. You'll see the CONSOLE (bottom left) window updates with status messages, but more importantly the upper right ENVIRONMENT window will start to show your variables:
This is where the advantage of the R-studio IDE becomes clear. Your variables aren't "hidden" in the memory of the system, they are immediately accessible and browsable by double-clicking. You can also see what the variable type is – e.g., all of the variables except "hidden_states" are numeric. This is important, since R will often cast a variable as the wrong type[footnoteRef:1]. [1: If this sounds like some sort of arcane problem, don't worry about it now – just remember that you can double-check if your variable is the correct type by just glancing at the window!]
With those four pieces in place, we have everything we need to create our HMM model. We can do that by adding the following code to the script and running it:
# Create the HMM Model
Hmm <- initHMM(hidden_states, emission_labels, startProbs=NULL, transProbs=tranxProbs, emissionProbs=emissionsProb)
Now that we have the model, all that's left is to plug in some observations. In this case, we are going to provide a series of results from rolling a 6-sided die. Add these lines to your code and run them:
# Enter the observations:
observed_emissions <- c(1,1,2,3,1,1,1,2,3,3,4,4,4,5,5,5,6,6,1,2,2,2,2)
The last thing that's left to do is to use our HMM model and the observed emissions (dice rolls) in the Viterbi algorithm. That can be done by adding this code and running it:
# Now let's run the Viterbi Algorithm!
viterbi <- viterbi(hmm, observed_emissions)
If you look at the ENVIRONMENT window, you'll see that there is an entry for the Viterbi results – there are 17 of them (one for each die roll), so you can't see them all in that window. But you can tell R to display them in the console using this command:
print(Viterbi)
If you run that command you'll see the console print out the results:
[1] "F" "F" "F" "F" "L" "L" "L" "L" "L" "L" "L" "F" "F" "F" "L" "L" "L"
That's it! There's certainly far more depth of detail to the Viterbi – and R – than we can cover in one week, but you should have a good idea as to what you can do with it for now!
REFERENCES
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1st ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511790492
Collepals.com Plagiarism Free Papers
Are you looking for custom essay writing service or even dissertation writing services? Just request for our write my paper service, and we'll match you with the best essay writer in your subject! With an exceptional team of professional academic experts in a wide range of subjects, we can guarantee you an unrivaled quality of custom-written papers.
Get ZERO PLAGIARISM, HUMAN WRITTEN ESSAYS
Why Hire Collepals.com writers to do your paper?
Quality- We are experienced and have access to ample research materials.
We write plagiarism Free Content
Confidential- We never share or sell your personal information to third parties.
Support-Chat with us today! We are always waiting to answer all your questions.