Title: | Protein Feature Extraction from Profile Hidden Markov Models |
---|---|
Description: | Calculates a comprehensive list of features from profile hidden Markov models (HMMs) of proteins. Adapts and ports features for use with HMMs instead of Position Specific Scoring Matrices, in order to take advantage of more accurate multiple sequence alignment by programs such as 'HHBlits' <DOI:10.1038/nmeth.1818> and 'HMMer' <http://hmmer.org>. Features calculated by this package can be used for protein fold classification, protein structural class prediction, sub-cellular localization and protein-protein interaction, among other tasks. Some examples of features extracted are found in Song et al. (2018) <DOI:10.3390/app8010089>, Jin & Zhu (2021) <DOI:10.1155/2021/8629776>, Lyons et al. (2015) <DOI:10.1109/tnb.2015.2457906> and Saini et al. (2015) <DOI:10.1016/j.jtbi.2015.05.030>. |
Authors: | Shayaan Emran [aut, cre, cph] |
Maintainer: | Shayaan Emran <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.1 |
Built: | 2025-02-02 03:40:34 UTC |
Source: | https://github.com/semran9/prothmm |
This feature begins by creating a CHMM, which is created by constructing 4 matrices, from
the original HMM
.
contains the first 75 percent of the original matrix
row-wise,
the
last 75 percent,
the middle 75 percent and
the entire original matrix. These are then merged to create the new
CHMM
. From there, the Bigrams feature is calculated with a flattened 20 x 20 matrix
, in which
.
corresponds to the original HMM matrix, and
is the number of rows in
. Local Average Group,
or LAG is then calculated by splitting up the CHMM into 20 groups along the length of the protein sequence and calculating
the sums of each of the columns of each group, making a 1 x 20 vector per group, and a length 20 x 20 vector for all groups. These features are then fused.
chmm(hmm)
chmm(hmm)
hmm |
The name of a profile hidden markov model file. |
A fusion vector of length 800.
A LAG vector of length 400.
A Bigrams vector of length 400.
An, J., Zhou, Y., Zhao, Y., & Yan, Z. (2019). An Efficient Feature Extraction Technique Based on Local Coding PSSM and Multifeatures Fusion for Predicting Protein-Protein Interactions. Evolutionary Bioinformatics, 15, 117693431987992.
h<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature consists of two vectors, . Vector
corresponds to the sums across the sequence
for each of the 20 amino acid columns. Vector
corresponds to a flattened matrix
in which
when
.
refers to a list of
all possible amino acids,
span from
.
fp_hmm(hmm)
fp_hmm(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 20.
A vector of length 400.
Zahiri, J., Yaghoubi, O., Mohammad-Noori, M., Ebrahimpour, R., & Masoudi-Nejad, A. (2013). PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics, 102(4), 237–242.
h<- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature calculates the covariance between two residues separated by a lag value within the same amino acid emission frequency column along the protein sequence.
hmm_ac(hmm, lg = 4)
hmm_ac(hmm, lg = 4)
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
A vector of length 20 the lag value; by default this is a vector of length 80.
The lag value must be less than the length of the protein sequence
Dong, Q., Zhou, S., & Guan, J. (2009). A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 25(20), 2655–2662.
h<- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature is calculated with a 20 x 20 matrix , in which
.
corresponds to the original HMM matrix, and
is the number of rows in
. Matrix
is then flattened to
a feature vector of length 400, and returned.
hmm_bigrams(hmm)
hmm_bigrams(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 400
Lyons, J., Dehzangi, A., Heffernan, R., Yang, Y., Zhou, Y., Sharma, A., & Paliwal, K. K. (2015). Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Transactions on Nanobioscience, 14(7), 761–772.
h<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
The feature calculates the covariance between different residues separated along the protein sequences by a lag value across different amino acid emission frequency columns.
hmm_cc(hmm, lg = 4)
hmm_cc(hmm, lg = 4)
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
A vector of length 20 x 19 x the lag value; by default this is a vector of length 1520.
The lag value must less than the length of the amino acid sequence.
Dong, Q., Zhou, S., & Guan, J. (2009). A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics, 25(20), 2655–2662.
h<- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature calculates the cosine distance matrix between two HMMs and
before dynamic time warp is applied to
the distance matrix calculate the cumulative distance between the HMMs, which acts as a measure of similarity,
The cosine distance matrix
is found to be
,
in which
and
refer to row vectors of
and
respectively.
This in turn means that
is of dimensions
. Dynamic time warp then calculates the
cumulative distance by calculating matrix
,
where
is 0 when
or
are less than 1. The lower rightmost point of the matrix
is then returned as the cumulative distance between proteins.
hmm_distance(hmm_1, hmm_2)
hmm_distance(hmm_1, hmm_2)
hmm_1 |
The name of a profile hidden markov model file. |
hmm_2 |
The name of another profile hidden markov model file. |
A double that indicates distance between the two proteins.
Lyons, J., Paliwal, K. K., Dehzangi, A., Heffernan, R., Tsunoda, T., & Sharma, A. (2016). Protein fold recognition using HMM–HMM alignment and dynamic programming. Journal of Theoretical Biology, 393, 67–74.
h<- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"), system.file("extdata", "1TEN-7", package="protHMM"))
h<- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"), system.file("extdata", "1TEN-7", package="protHMM"))
This feature calculates the Geary autocorrelation of each amino acid type for each distance d less than or equal to the lag value and greater than or equal to 1.
hmm_GA(hmm, lg = 9)
hmm_GA(hmm, lg = 9)
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
A vector of length lg 20, by default this is 180.
The lag value must be less than the length of the protein sequence
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Different Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
h<- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature initially creates a grouping matrix by assigning each position a number
based on
the value at each position of HMM matrix
;
represents the low probability group,
the medium and
the high probability group.
The number of total points in each group for each column is then calculated, and the sequence is then split
based upon the the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) points for each of the three groups,
in each of the 20 columns of the grouping matrix. Thus for column
,
,
where
is the group number,
and
corresponds to number of rows in matrix
.
hmm_GSD(hmm)
hmm_GSD(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 300.
Jin, D., & Zhu, P. (2021). Protein Subcellular Localization Based on Evolutionary Information and Segmented Distribution. Mathematical Problems in Engineering, 2021, 1–14.
h<- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature uses local binary pattern with a neighborhood of radius 1 and 8 sample points to extract features from the HMM. A 256 bin histogram is extracted as a 256 length feature vector.
hmm_LBP(hmm)
hmm_LBP(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 256.
Li, Y., Li, L., Wang, L., Yu, C., Wang, Z., & You, Z. (2019). An Ensemble Classifier to Predict Protein–Protein Interactions by Combining PSSM-based Evolutionary Information with Local Binary Pattern Model. International Journal of Molecular Sciences, 20(14), 3511.
h<- hmm_LBP(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_LBP(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature uses linear predictive coding (LPC) to map each HMM to a dimensional vector,
where for each of the 20 columns of the HMM, LPC is used to extract a 14 dimensional vector
hmm_LPC(hmm)
hmm_LPC(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 280.
Qin, Y., Zheng, X., Wang, J., Chen, M., & Zhou, C. (2015). Prediction of protein structural class based on Linear Predictive Coding of PSI-BLAST profiles. Central European Journal of Biology, 10(1).
h<- hmm_LPC(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_LPC(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature calculates the normalized Moran autocorrelation of each amino acid type, for each distance d less than or equal to the lag value and greater than or equal to 1.
hmm_MA(hmm, lg = 9)
hmm_MA(hmm, lg = 9)
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
A vector of length lg 20, by default this is 180.
The lag value must be less than the length of the protein sequence
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Di fferent Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
h<- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature calculates the normalized Moreau-Broto autocorrelation of each amino acid type, for each distance d less than or equal to the lag value and greater than or equal to 1.
hmm_MB(hmm, lg = 9)
hmm_MB(hmm, lg = 9)
hmm |
The name of a profile hidden markov model file. |
lg |
The lag value, which indicates the distance between residues. |
A vector of length lg 20, by default this is 180.
The lag value must be less than the length of the protein sequence
Liang, Y., Liu, S., & Zhang. (2015). Prediction of Protein Structural Class Based on Different Autocorrelation Descriptors of Position–Specific Scoring Matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.
h<- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
Reads in the amino acid emission frequency columns of a profile hidden markov model matrix and converts each position to frequencies.
hmm_read(hmm)
hmm_read(hmm)
hmm |
The name of a profile hidden markov model file. |
A 20 x L matrix, in which L is the sequence length.
h<- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature returns the 2 and 3-mer compositions of the protein sequence. This is done by first
finding all possible 2 and 3-mers for any protein ( and
permutations for 2 and 3-mers respectively).
With those permutations, vectors of length 400 and 8000 are created, each point corresponding to one 2 or 3-mer.
Then, the protein sequence that corresponds to the HMM scores is extracted, and put into a bipartite graph with the protein sequence.
Each possible path of length 1 or 2 is found, and the corresponding vertices on the graph are noted as 2 and 3-mers.
For each 2 or 3-mer found from these paths, 1 is added to the position that responds to that 2/3-mer in the
2-mer and 3-mer vectors , which are the length 400 and 8000 vectors created previously. The vectors are then returned.
hmm_SCSH(hmm)
hmm_SCSH(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 400.
A vector of length 8000.
Mohammadi, A. M., Zahiri, J., Mohammadi, S., Khodarahmi, M., & Arab, S. S. (2022). PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biology Methods and Protocols, 7(1).
h_400<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]] h_8000<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]]
h_400<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]] h_8000<- hmm_SCSH(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]]
This feature calculates the probabilistic expression of amino acid dimers that are spatially separated by a distance .
Mathematically, this is done with a 20 x 20 matrix
, in which
.
corresponds to the original HMM matrix, and
is the number of rows in
. Matrix
is then flattened to
a feature vector of length 400, and returned.
hmm_SepDim(hmm, l = 7)
hmm_SepDim(hmm, l = 7)
hmm |
The name of a profile hidden markov model file. |
l |
Spatial distance between dimer residues. |
A vector of length 400
Saini, H., Raicar, G., Sharma, A., Lal, S. K., Dehzangi, A., Lyons, J., Paliwal, K. K., Imoto, S., & Miyano, S. (2015). Probabilistic expression of spatially varied amino acid dimers into general form of Chou's pseudo amino acid composition for protein fold recognition. Journal of Theoretical Biology, 380, 291–298.
h<- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature groups together rows that are related to the same amino acid. This is done using a vector
, in which
spans
and
,
in which
is the HMM matrix,
in the protein sequence,
is an ordered set of amino acids,
the variables
, the variable
when creating the vector,
and
represents Kronecker's delta.
hmm_Single_Average(hmm)
hmm_Single_Average(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 400.
Nanni, L., Lumini, A., & Brahnam, S. (2014). An Empirical Study of Different Approaches for Protein Classification. The Scientific World Journal, 2014, 1–17.
h<- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature smooths the HMM matrix by using sliding window of length
to incorporate information
from up and downstream residues into each row of the HMM matrix. Each HMM row
is made into the summation
of
, for
, where
is the number of rows in
.
For rows such as the beginning and ending rows,
matrices of dimensions
are appended to the
original matrix
.
hmm_smooth(hmm, sw = 7)
hmm_smooth(hmm, sw = 7)
hmm |
The name of a profile hidden markov model file. |
sw |
The size of the sliding window. |
A matrix of dimensions L 20.
Fang, C., Noguchi, T., & Yamana, H. (2013). SCPSSMpred: A General Sequence-based Method for Ligand-binding Site Prediction. IPSJ Transactions on Bioinformatics, 6(0), 35–42.
h<- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature uses singular value decomposition (SVD) to reduce the dimensionality of the inputted hidden
markov model matrix. SVD factorizes a matrix C of dimensions to
.
The diagonal values of
are known as the singular values of matrix C, and are what are returned with this function.
hmm_svd(hmm)
hmm_svd(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 20.
Song, X., Chen, Z., Sun, X., You, Z., Li, L., & Zhao, Y. (2018). An Ensemble Classifier with Random Projection for Predicting Protein–Protein Interactions Using Sequence and Evolutionary Information. Applied Sciences, 8(1), 89.
h<- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
This feature is calculated with a 20 x 20 x 20 block , in which
.
corresponds to the original HMM matrix, and
is the number of rows in
. Matrix
is then flattened to
a feature vector of length 8000, and returned.
hmm_trigrams(hmm)
hmm_trigrams(hmm)
hmm |
The name of a profile hidden markov model file. |
A vector of length 8000
Lyons, J., Dehzangi, A., Heffernan, R., Yang, Y., Zhou, Y., Sharma, A., & Paliwal, K. K. (2015). Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Transactions on Nanobioscience, 14(7), 761–772.
h<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
The first twenty numbers of this feature correspond to the means of each column of the HMM matrix
. The rest of the features in the feature vector are found in matrix
, where
.
IM_psehmm(hmm, d = 13)
IM_psehmm(hmm, d = 13)
hmm |
The name of a profile hidden markov model file. |
d |
The maximum distance between residues column-wise. |
A vector of length
d must be less than 20.
Ruan, X., Zhou, D., Nie, R., & Guo, Y. (2020). Predictions of Apoptosis Proteins by Integrating Different Features Based on Improving Pseudo-Position-Specific Scoring Matrix. BioMed Research International, 2020, 1–13.
h<- IM_psehmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- IM_psehmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
The first twenty numbers of this feature correspond to the means of each column of the HMM matrix
. The rest of the features in the feature vector are given by correlation of the
most
contiguous values along the chain per each amino acid column, where
. This creates a vector of 20
g, and this combines with the first 20 features to create the final feature vector.
pse_hmm(hmm, g = 15)
pse_hmm(hmm, g = 15)
hmm |
The name of a profile hidden markov model file. |
g |
The contiguous distance between residues. |
A vector of length , by default this is 320.
g must be less than the length of the protein sequence
Chou, K., & Shen, H. (2007). MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications, 360(2), 339–345.
h<- pse_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h<- pse_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))