This is a python package for genomics study with a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework.
- cython
- numpy
- Biopython
- editdistance
- pytorch 1.7.1
- pytorch_geometric 1.7.0
pip install GCNFrameOr
git clone https://github.com/deepomicslab/GCNFrame.git
cd GCNFrame/GCNFrame
python setup.py build_ext --inplace
cd ../The framework makes it easy to train your customized models with a few lines of codes. The example data can be downloaded from Zenodo.
# This is an example to train a two-classes model.
from GCNFrame import Biodata, GCNmodel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = Biodata(fasta_file="example_data/nature_2017.fasta", 
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")
dataset = data.encode(thread=20)
model = GCNmodel.model(label_num=2, other_feature_dim=206).to(device)
GCNmodel.train(dataset, model, weighted_sampling=True)
GCNmodel.test(model_name="GCN_model.pt", fasta_file="example_data/nature_2017.fasta", feature_file="example_data/CDD_protein_feature.txt")The output is shown bellow:
Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 1| Loss: 0.5605| Train accuracy: 0.8165| Validation accuracy: 0.7032
Epoch 2| Loss: 0.5042| Train accuracy: 0.8469| Validation accuracy: 0.8065
Epoch 3| Loss: 0.4873| Train accuracy: 0.8344| Validation accuracy: 0.7677
Epoch 4| Loss: 0.4559| Train accuracy: 0.8703| Validation accuracy: 0.8194
Epoch 5| Loss: 0.4533| Train accuracy: 0.8763| Validation accuracy: 0.7806
Epoch 6| Loss: 0.4372| Train accuracy: 0.8931| Validation accuracy: 0.8387
Epoch 7| Loss: 0.4409| Train accuracy: 0.8842| Validation accuracy: 0.8581
Epoch 8| Loss: 0.4357| Train accuracy: 0.8858| Validation accuracy: 0.8516
Epoch 9| Loss: 0.4314| Train accuracy: 0.8987| Validation accuracy: 0.8387
Epoch 10| Loss: 0.4246| Train accuracy: 0.8992| Validation accuracy: 0.8581
Epoch 11| Loss: 0.4085| Train accuracy: 0.9180| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
The model with best validation accuracy will be saved as GCN_model.pt
Also, the package provides users with functions to mine gapped patterns or motifs of more significant influence in prediction tasks.
# the pattern_contribution_score function returns a score list to record the contribution scores for the 4,096 gapped patterns. 
score_list = pattern_contribution_score(fasta_file="example_data/nature_2017.fasta",
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")The scores for the gapped-patterns will also be saved in a txt file.
# the pattern_group_contribution_score function groups similar gapped patterns and analyzes the occurrence & scores for each group.
pattern_group_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", score_list=score_list)The results are saved as figures.
 

# the motif_contribution_score calculate the contribution score for a given motif.
score = motif_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", motif="AAAAAATTCG", feature_file="example_data/CDD_protein_feature.txt")
print("The contribution score for AAAAAATTCG is %s."%score)
fasta_file: The DNA sequences used for training and evaluation in fasta format.
label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).
feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).
K: The length of K-mer for encoding (default=3).
d: The number of spaced distance used for encoding (default=3).
seqtype: The type of input sequence, DNA or RNA (default=DNA).
thread: The number of thread used for encoding (default=10).
save_dataset: Save the encoded dataset and use it next time (default=True).
save_path: The path for saving the encoded dataset (default="./").
label_num: The number of labels.
other_feature_dim: The dimension for other features, 0 if not available.
K: The length of K-mer for encoding (default=3).
d: The number of spaced distance used for encoding (default=3).
node_hidden_dim: The size for kmer nodes after transformation(default=3).
gcn_dim: The size of output of SAGEConv (default=128).
gcn_layer_num: The number of SAGEConv layers (default=2).
cnn_dim: The size of output of convolutional layers (default=64).
cnn_layer_num: The number of convolutional layers (default=3).
cnn_kernel_size: The kernel size of convolutional layers (default=8).
fc_dim: The number of neurons for the fully connected layers (default=100).
dropout_rate: The dropout rate (default=0.2).
pnode_nn: Whether transform primary nodes (default=True).
fnode_nn: Whether transform target nodes (default=True).
learning_rate: The learning rate for training (default=1e-4).
batch_size: The batch_size for training (default=64).
epoch_n: The number of training epoches (default=20).
random_seed: The random seed for train-validation split (default=111).
val_split: The validation size (default=0.1).
weighted_sampling: Whether use weighted sampling for training (default=True).
model_name: The saved model name (default="GCN_model.pt").
fasta_file: The DNA sequences used for test in fasta format.
model_name: The saved model name (default="GCN_model.pt").
feature_file: Other features (like gene density) for the DNA sequences for test (should have the same order as fasta_file) (default=None).
output_file: The output file name (default="test_output.txt").
thread: The number of thread used for encoding (default=10).
K: The length of K-mer for encoding (default=3).
d: The number of spaced distance used for encoding (default=3).
fasta_file: The DNA sequences used for training and evaluation in fasta format.
label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).
target_label: The label of the class being analyzed (default=1).
model_name: The saved model name (default="GCN_model.pt").
feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).
output_file: The output file name (default="pattern_contribution_score.txt").
thread: The number of thread used for encoding (default=10).
K: The length of K-mer for encoding (default=3).
d: The number of spaced distance used for encoding (default=3).
fasta_file: The DNA sequences used for training and evaluation in fasta format.
label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).
motif: The motif to be analyzed.
target_label: The label of the class being analyzed (default=1).
model_name: The saved model name (default="GCN_model.pt").
feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).
thread: The number of thread used for encoding (default=10).
K: The length of K-mer for encoding (default=3).
d: The number of spaced distance used for encoding (default=3).
fasta_file: The DNA sequences used for training and evaluation in fasta format.
label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).
score_list: The contribution scores of the 4,096 gapped patterns.
target_label: The label of the class being analyzed (default=1).
d: The number of spaced distance used for encoding (default=3).
- v0.1.2: Extension to RNA sequences & Enabling saving encoded data.
- v0.1.1: Add contribution score functions.
- v0.0.1: Initial version.
WANG Ruohan ruohawang2-c@my.cityu.edu.hk
Wang, R. H., Ng, Y. K., Zhang, X., Wang, J., & Li, S. C. (2024). Coding genomes with gapped pattern graph convolutional network. Bioinformatics, 40(4), btae188.
