--- title: "protHMM" output: rmarkdown::html_vignette bibliography: References.bib vignette: > %\VignetteIndexEntry{protHMM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` \DeclareUnicodeCharacter{05F3}{'} ## 1: Introduction The goal of protHMM is to help integrate profile hidden markov model (HMM) representations of proteins into the machine learning and bioinformatics workflow. protHMM ports a number of features from use in Position Specific Scoring Matrices (PSSMs) to HMMs, along with implementing features used with HMMs specifically, which to our knowledge has not been done before. The adoption of HMM representations of proteins derived from [HHblits](https://doi.org/10.1038/nmeth.1818) and [HMMer](http://hmmer.org) also presents an opportunity for innovation; it has been shown that HMMs can benefit from better multiple sequence alignment than PSSMs and thus get better results than corresponding HMMs using similar feature extraction techniques [@lyons-2015]. protHMM implements 20 different feature extraction techniques to provide a comprehensive list of feature sets for use in bioinformatics tasks ranging from protein fold classification to protein-protein interaction. ```{r setup} library(protHMM) ``` ## 2: Autocorrelation: hmm_MB, hmm_MA, hmm_GA protHMM implements normalized Moreau-Broto, Moran and Geary autocorrelation descriptors, each of which measure the correlation between two amino acid residues separated by a lag value, (lg) along the sequence. Each of these feature vectors return a vector of $20 \times lg$. @liang-2015 provide a mathematical representation for each of these features, having used them to predict protein structural class: #### Moreau-Broto ${N_{lg}^j = \frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} H_{i, j} \times H_{i+lg, j}, \space (j = 1,2..20, 0 < d < L)}$ In which ${N_{lg}^j}$ is the Moreau-Broto correlation factor for column $j$, $lg$ is the lag value, $L$ is the length of the protein sequence and $H$ represents the HMM matrix. #### Moran ${N_{lg}^j = \frac{\frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} (H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{\frac {1} {L} \sum_{i = 1} {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, 0 < d < L)}$ In which ${N_{lg}^j}$ is the Moran correlation factor for column $j$, $lg$ is the lag value, $L$ is the length of the protein sequence, $\bar{H_j}$ represents the average of column $j$ in matrix $H$ and $H$ represents the HMM matrix. #### Geary ${N_{lg}^j = \frac{\frac{1} {2(L-lg)} \sum_{i = 1} ^ {L-lg} (H_{i, j} - H_{i+d, j})^2}{\frac {1} {L-1} \sum_{i = 1} {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2..20, 0 < d < L)}$ In which ${N_{lg}^j}$ is the Geary correlation factor for column $j$, $lg$ is the lag value, $L$ is the length of the protein sequence and $H$ represents the HMM matrix. #### Examples ```{r example} library(protHMM) ## basic example code h_MB<- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM")) h_MA<- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM")) h_GA<- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM")) head(h_MB, 20) head(h_MA, 20) head(h_GA, 20) ``` ## 3: Covariance: hmm_AC, hmm_CC protHMM implements two covariance-based features, autocovariance (AC) and cross covariance (CC). These features calculate the covariance between two residues separated by a lag value lg along the sequence. AC calculates this for positions in the same column, and CC for positions that are not in the same column of the HMM. It should also be noted that for these features, the HMM matrix is not converted to probabilities, unlike other features. Both features were used for protein fold recognition. #### Autocovariance @dong-2009 provide a mathematical representation of autocovariance: $AC(j, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{L-lg}, \space j = 1,2..20$ In which $AC(j,\space lg)$ is the autocovariance factor for column $j$, $lg$ is the lag value, $L$ is the length of the protein sequence, $\bar{H_j}$ represents the average of column $j$ in matrix $H$ and $H$ represents the HMM matrix. #### Cross covariance @dong-2009 also provide a mathematical representation of cross covariance: $AC(j_1, j_2, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j_1} - \bar{H_{j_1}}) \times (H_{i+lg, j_2} - \bar{H_{j_2})}}{L-lg}, \space j_1,j_2 = 1, 2, 3...20, \space j_1 \neq j_2$ In which $AC(j_1,\space j_2,\space lg)$ is the cross correlation factor for columns $j_1, \space j_2$, $lg$ is the lag value, $L$ is the length of the protein sequence, $\bar{H_{j_i}}$ represents the average of column $j_i$ in matrix $H$ and $H$ represents the HMM matrix. #### Examples ```{r} library(protHMM) ## basic example code h_AC<- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM")) h_CC<- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM")) head(h_AC, 20) head(h_CC, 20) ``` ## 4: Bigrams and Trigrams The bigrams and trigrams features are outlined by @lyons-2015, and they interpret the likelihood of 2 (bi) or 3 (tri) amino acids occurring consecutively in the sequence. Thus, they take shape as a 20 x 20 matrix for bigrams and a 20 x 20 x 20 array for trigrams, which are then flattened to create vectors of length 400 and 8000. These features were used by @lyons-2015 for protein fold recognition. #### Bigrams @lyons-2015 outlines bigrams mathematically as $B[i, j]$, where: ${B[i, j] = \sum_{a = 1}^{L-1} H_{a, i}H_{a+1, j}}$ And ${H}$ corresponds to the original HMM matrix, and ${L}$ is the number of rows in ${H}$. #### Trigrams @lyons-2015 outlines bigrams mathematically as $B[i, j, k]$, where: ${B[i, j, k] = \sum_{a = 1}^{L-2} H_{a, i}H_{a+1, j}H_{a+2, k}}$ And ${H}$ corresponds to the original HMM matrix, and ${L}$ is the number of rows in ${H}$. #### Examples ```{r} library(protHMM) ## basic example code h_tri<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM")) h_bi<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM")) head(h_tri, 20) head(h_bi, 20) ``` ## 5: CHMM The CHMM feature begins by creating a CHMM, which is created by constructing 4 matrices, ${A, B, C, D}$ from the original HMM ${H}$. ${A}$ contains the first 75 percent of the original matrix ${H}$ row-wise, ${B}$ the last 75 percent, ${C}$ the middle 75 percent and ${D}$ the entire original matrix [@an-2019]. These are then merged to create the new CHMM ${Z}$. From there, the bigrams feature is calculated with a flattened 20 x 20 matrix ${B}$, in which ${B[i, j] = \sum_{a = 1}^{L-1} Z_{a, i} \times Z_{a+1, j}}$ ${H}$ corresponds to the original HMM matrix, and ${L}$ is the number of rows in ${Z}$ [@an-2019]. Local Average Group, or LAG is calculated by first splitting the CHMM into $j$ groups and then the following: $LAG(k) = \frac{20}{L}\sum_{p = 1}^{N/20} Mt[p+(i-1)\times(N/20),\space j]$ Where ${L}$ is the number of rows in ${Z}$ and $Mt[p+(i-1)\times(N/20),\space j]$ represents the row vector of the CHMM and the $i$th position in the $j$th group [@an-2019]. These features are then fused, and a vector of 800 along the the original length 400 bigrams and LAG vectors are returned. The CHMM features were used to predict protein-protein interactions. #### Examples ```{r} library(protHMM) h_fused<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]] h_lag<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]] h_bigrams<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[3]] head(h_fused, 20) head(h_lag, 20) head(h_bigrams, 20) ``` ## 6: Distance The distance feature calculates the cosine distance matrix between two HMMs ${A}, \space{B}$ before dynamic time warp is applied to the distance matrix calculate the cumulative distance between the HMMs, which acts as a measure of similarity [@lyons-2016]. The cosine distance matrix ${D}$ is calculated with: ${D[a_i, b_j] = 1 - \frac{a_ib_j^{T}}{a_ia_i^Tb_jb_j^T}}$ in which ${a_i}$ and ${a_i}$ refer to row vectors of ${A}$ and ${B}$ respectively [@lyons-2016]. This in turn means that $D$ is of dimensions ${nrow(A), nrow(b)}$. Dynamic time warp then calculates the cumulative distance by calculating matrix: $C[i, j] = min(C[i-1, j], C[i, j-1], C[i-1, j-1]) + D[i, j]$ where ${C_{i,j}}$ is 0 when ${i}$ or ${j}$ are less than 1 [@lyons-2016]. The lower rightmost point of the matrix ${C}$ is then returned as the cumulative distance between proteins. The distance feature was used by @lyons-2016 for protein fold recognition. #### Example ```{r} library(protHMM) # basic example code h<- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"), system.file("extdata", "1TEN-7", package="protHMM")) h ``` ## 7: FP_HMM FP_HMM consists of two vectors, ${d, s}$. Vector ${d}$ corresponds to the sums across the sequence for each of the 20 amino acid columns [@zahiri-2013]. Vector ${s}$ corresponds to a flattened matrix of $S$ where: ${S[i, j] = \sum_{k = 1}^{L} H[k, j] \times \delta[k, i]}$ in which ${\delta[k, i] = 1}$ when ${A_i = H[k, j]}$ and 0 otherwise [@zahiri-2013]. ${A}$ refers to a list of all possible amino acids, ${i, j}$ span from ${1:20}$. This feature has been used for the prediction of protein-protein interactions. #### Example ```{r} library(protHMM) # basic example code h<- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM")) h[[1]] head(h[[2]], 20) ``` ## 8: HMM_GSD This feature was created by @jin-2021 and begins by creates a grouping matrix $G$ by assigning each position a number $1,2, 3$ based on the value at each position of HMM matrix $H$; $1$ represents the low probability group, $2$ the medium and $3$ the high probability group. The number of total points in each group for each column is then calculated, and the sequence is then split based upon the the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) positioned points for each of the three groups, in each of the 20 columns of the grouping matrix. Thus for column $j$: ${S(k, \space j, \space z) = \sum_{i = 1}^{(z) \times.25 \times N} |G[i,\space j] = k|}$ where ${k}$ is the group number, ${z = 1,2,3,4}$ and ${N}$ corresponds to number of rows in matrix ${G}$ [@jin-2021]. #### Example ```{r} library(protHMM) # basic example code h<- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM")) h[1:40] ``` ## 9: Pseudo HMM: IM_psehmm and pse_hmm Both of the pseudo HMM features, pse_hmm and IM_psehmm, are based off of @chou-2007's psuedo PSSM concept. This was engineered in order to create non-variable length representations of proteins, while keeping sequence order information. IM_psehmm stands for improved pseudo hmm, which is based off of @ruan-2020's improvements to the @chou-2007's initial offering; @ruan-2020's offering is still based on the PSSM matrix. As such, this feature is a port for use with HMMs. #### pseudo_hmm Mathematically, pseudo_hmm is made of the fusion of 2 vectors. The first is of the means across the 20 amino acid emission frequency columns of the HMM. The second can be found in the following equation [@chou-2007]: $V_{j}^{i} = \frac{1}{L-i}\sum_{k = 1}^{L-i}H[k, j] \times H[k+i, j], \space j = 1,2..20, \space i < L$ $H$ represents the HMM matrix, $L$ represents the row number of $H$, $i$ represents the amount of contiguous amino acids that correlation is measured across and $V_{j}^{i}$ represents the value of the vector for the $j$th column and $i$th most contiguous amino acids. This results in a vector of $20 + g \times 20$. #### IM_pse_hmm IM_pse_hmm is also the fusion of two vectors, one being the same means vector described above. The second can be found as [@ruan-2020]: $V_{j}^{d} = \frac{1}{20-d}\sum_{k = 1}^{L}H[k, j] \times H[k, j+d], \space j = 1,2..20, \space d < 20$ Where $H$ the HMM matrix, $L$ represents the number of rows in $H$ and $V_{j}^i$ represents the value in the feature vector for column $j$ and columnal distance parameter $d$. This results in a vector of length ${20+20\times d-d\times\frac{d+1}{2}}$. ## 10: Local Binary Pattern This feature is based off of @li-2019's work using local binary pattern to extract features from PSSMs to use in protein-protein interaction prediction. The local binary pattern extraction used here can be defined as $B = b(s(H_{m, n+1}-H_{m, n}), s(H_{m+1, n+1}-H_{m, n}), s(H_{m+1, n}-H_{m, n})...s(H_{m-1, n+1}-H_{m, n})))$, where $H[m,n]$ signifies the center of a local binary pattern neighborhood ($2