6. Advanced usage
triUMPF can be trained in several ways. The training data made available to users consists of i) EC number indices with embeddings ("biocyc205_tier23_9255_Xe.pkl") and ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). To train triUMPF to your own specifications, you need all the necessary data ready for training. If some of it is missing, consider running the preprocessing step first.
Note: Make sure to put the source code triUMPF/ (see Installing triUMPF) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders (if you have not already created them during pathway prediction) in the same triUMPF_materials/ directory. The final structure should look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ └── ...
├── result/
│ └── ...
├── log/
│ └── ...
└── triUMPF/
└── ...
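The two extra folders can also be created from Python. The following is a minimal sketch, assuming you pass in the path to your own triUMPF_materials/ directory; the helper name `ensure_output_dirs` is illustrative and not part of triUMPF.

```python
from pathlib import Path


def ensure_output_dirs(root):
    """Create log/ and result/ under `root` if they are missing.

    `root` is assumed to be the triUMPF_materials/ directory.
    Returns the two folder paths in the order (log, result).
    """
    root = Path(root)
    folders = []
    for name in ("log", "result"):
        folder = root / name
        # exist_ok=True keeps this safe to call repeatedly.
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders
```

Calling `ensure_output_dirs("triUMPF_materials")` leaves existing folders untouched, so it is safe to run before every experiment.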
For all experiments, using a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the triUMPF/ directory and then run the commands as shown in the Examples sections of Preprocessing and Training.
To display triUMPF's running options, use: python main.py --help. The help output should be self-explanatory.
The preprocessing step is only needed if you wish to generate embedding files such as "_Xe.pkl", "_Xa.pkl", etc., from a "[DATANAME]_X.pkl" file in order to use them for training.
The input file used for preprocessing is any matrix file containing EC number indices (e.g. biocyc21_X.pkl, cami_X.pkl). The command also uses the following files:
- biocyc.pkl
- pathway2ec.pkl
- pathway2ec_idx.pkl
- pathway2vec_embeddings.npz
- hin.pkl
The basic command is represented below. Do not use this for preprocessing. This command is only a representation of all the flags used. See Example below on how to preprocess your datasets.
python main.py \
--preprocess-dataset \
--white-links \
--ssample-input-size [NUMERIC] \
--object-name "biocyc.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_X*.pkl" \
--file-name "[FILENAME]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--batch 50 \
--num-jobs 2
The table below summarizes all the command-line arguments that are specific to this framework:
Argument name | Description | Value |
---|---|---|
--preprocess-dataset | Preprocess inputs based on Biocyc collection | True |
--white-links | Add noise to pathway to pathway and enzyme to enzyme association matrices | False |
--ssample-input-size | The size of input subsample | 0.7 |
--object-name | The preprocessed MetaCyc database file | biocyc.pkl |
--pathway2ec-name | The matrix file representing Pathway-EC association | pathway2ec.pkl |
--pathway2ec-idx-name | The pathway2ec association indices file | pathway2ec_idx.pkl |
--hin-name | The heterogeneous information network file | hin.pkl |
--features-name | The features corresponding ECs and pathways | pathway2vec_embeddings.npz |
--X-name | The Input file name to be provided for preprocessing | [DATANAME]_X*.pkl |
--file-name | The names of input preprocessed files (without extension) | [FILENAME] |
--ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl") | [Outside source code] |
--dspath | The path to the datasets | [Outside source code] |
--batch | Batch size | 50 |
--num-jobs | The number of parallel workers | 2 |
The output files generated after running the command are:
File | Description |
---|---|
[FILENAME]_Xa.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features. |
[FILENAME]_Xc.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features. |
[FILENAME]_Xe.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings. |
[FILENAME]_Xea.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and abundance features. |
[FILENAME]_Xec.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and coverage features. |
[FILENAME]_Xm.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, abundance, and coverage features. |
M.pkl | A matrix file (stored in the "dspath" location) representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC numbers indices) in the remaining columns. The file representation is similar to pathway2ec.pkl . |
A.pkl | A matrix file (stored in the "dspath" location) representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns. |
B.pkl | A matrix file (stored in the "dspath" location) representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC numbers indices) shown in the first column with their interactions (3650 EC numbers indices) in the remaining columns. |
P.pkl | A matrix file (stored in the "dspath" location) representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz . |
E.pkl | A matrix file (stored in the "dspath" location) representing the enzyme features (represented as EC numbers indices). It contains 3650 EC numbers indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz . |
Note: Each of these files differs in the total number of columns they contain, which is why the file used for training should also be used during prediction if one decides to train their own model based on certain specifications mentioned above.
Execute the following command to preprocess "cami" data (as an example) with no noise:
python main.py --preprocess-dataset --ssample-input-size 1 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2
After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated. The tree structure for the folder with the outputs will look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ └── ...
├── dataset/
│ ├── cami_Xa.pkl
│ ├── cami_Xc.pkl
│ ├── cami_Xe.pkl
│ ├── cami_Xea.pkl
│ ├── cami_Xec.pkl
│ ├── cami_Xm.pkl
│ ├── M.pkl
│ ├── A.pkl
│ ├── B.pkl
│ ├── P.pkl
│ ├── E.pkl
│ └── ...
├── result/
│ └── ...
└── triUMPF/
└── ...
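A quick way to confirm that preprocessing produced everything listed in the output table is to check the dataset/ folder for the expected file names. The sketch below assumes the naming shown above ([FILENAME]_X*.pkl plus M, A, B, P, and E); the function name `missing_preprocess_outputs` is hypothetical and not part of triUMPF.

```python
from pathlib import Path

# Feature-file suffixes and shared matrices, as listed in the output table.
FEATURE_SUFFIXES = ("Xa", "Xc", "Xe", "Xea", "Xec", "Xm")
SHARED_FILES = ("M.pkl", "A.pkl", "B.pkl", "P.pkl", "E.pkl")


def missing_preprocess_outputs(dataset_dir, file_name):
    """Return the expected output file names that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    expected = [f"{file_name}_{suffix}.pkl" for suffix in FEATURE_SUFFIXES]
    expected += list(SHARED_FILES)
    return [name for name in expected if not (dataset_dir / name).is_file()]
```

For the "cami" example, `missing_preprocess_outputs("dataset", "cami")` returns an empty list when all eleven files are present.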
Execute the following command to preprocess "cami" data (as an example) with 20% noise to the pathway2ec association matrix ("pathway2ec.pkl"):
python main.py --preprocess-dataset --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2
After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.
Execute the following command to preprocess "cami" data (as an example) with 20% noise to the pathway2ec ("pathway2ec.pkl"), the pathway to pathway ("A"), and the enzyme to enzyme ("B") association matrices:
python main.py --preprocess-dataset --white-links --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2
After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.
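The three preprocessing examples differ only in --ssample-input-size and in whether --white-links is present. As an illustration, the sketch below assembles the same argument list programmatically; `build_preprocess_command` is a hypothetical helper (not part of triUMPF), and it uses only the flags and file names shown in the examples. Like the examples, it leaves out --ospath and --dspath.

```python
import shlex


def build_preprocess_command(x_name, file_name, sample_size=1,
                             white_links=False, batch=50, num_jobs=2):
    """Return the argument list for a preprocessing run of main.py."""
    args = ["python", "main.py", "--preprocess-dataset"]
    if white_links:
        # Also perturb the pathway-pathway (A) and enzyme-enzyme (B) matrices.
        args.append("--white-links")
    args += [
        "--ssample-input-size", str(sample_size),
        "--object-name", "biocyc.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", x_name,
        "--file-name", file_name,
        "--batch", str(batch),
        "--num-jobs", str(num_jobs),
    ]
    return args


# Print the command as a single shell line for copy and paste.
print(shlex.join(build_preprocess_command("cami_X.pkl", "cami", 0.2, white_links=True)))
```

The returned list can be passed directly to `subprocess.run(..., cwd=<path to src/>)` if you prefer to launch the run from a script.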
Training can be done using one of the output files from the preprocessing step together with the [DATANAME]_y.pkl file.
Use the same type of feature file during pathway prediction to avoid errors, since each output file from the preprocessing step contains a different number of columns. The recommended command is shown here, but you can also use other flags described in the argument table to suit your requirements.
The input to the command is the output of the preprocessing step above (any one of the [DATANAME]_X*.pkl files, plus the [DATANAME]_y.pkl file).
The basic command is represented below. Do not use this for training. This command is only a representation of all the flags used. See Examples below on how to train a model.
python main.py \
--train \
--no-decomposition \
--fit-features \
--fit-comm \
--num-components 100 \
--num-communities-p 90 \
--num-communities-e 100 \
--ssample-input-size 0.7 \
--ssample-label-size 2000 \
--lambdas 0.01 0.01 0.01 0.01 0.001 10 \
--penalty "l21" \
--M-name "M.pkl" \
--P-name "P.pkl" \
--E-name "E.pkl" \
--A-name "A.pkl" \
--B-name "B.pkl" \
--X-name "[DATANAME]_X*.pkl" \
--y-name "[DATANAME]_y.pkl" \
--model-name "[MODELNAME] (without extension)" \
--W-name "[MODELNAME]_W.pkl" \
--H-name "[MODELNAME]_H.pkl" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 50 \
--max-inner-iter 100 \
--num-epochs 10 \
--num-jobs 2
The table below summarizes all the command-line arguments that are specific to this framework:
Argument name | Description | Default Value |
---|---|---|
--train | Training the triUMPF model | True |
--no-decomposition | Skip decomposing the pathway to enzyme association matrix and reuse previously computed factors (see --W-name and --H-name) | False |
--fit-features | Whether to fit by external features | False |
--fit-comm | Whether to fit community | False |
--num-components | Number of components | 100 |
--num-communities-p | Number of communities for pathways | 90 |
--num-communities-e | Number of communities for enzymes | 100 |
--ssample-input-size | Corresponds to the size of random subsampled inputs | 0.7 |
--ssample-label-size | Corresponds to the size of random subsampled pathway labels | 2000 |
--lambdas | Corresponds to the six hyper-parameters for constraints | 0.01, 0.01, 0.01, 0.01, 0.001, 10 |
--penalty | The type of regularization term to be applied | l21 |
--M-name | The pathway2ec association matrix file name | "M.pkl" |
--P-name | The pathway features file name | "P.pkl" |
--E-name | The enzyme features file name | "E.pkl" |
--A-name | The pathway to pathway association file name | "A.pkl" |
--B-name | The enzyme to enzyme association file name | "B.pkl" |
--X-name | Input space of multi-label data | biocyc_Xe.pkl |
--y-name | Pathway space of multi-label data | biocyc_y.pkl |
--model-name | Corresponds to the name of the model excluding any EXTENSION. The model name will have .pkl extension | "triUMPF" |
--W-name | File name for the pathways latent factors | "[MODELNAME]_W.pkl" |
--H-name | File name for the enzymes (or ECs) latent factors | "[MODELNAME]_H.pkl" |
--mdpath | The path to store model | [Outside source code] |
--rspath | The path to store costs and resulting samples indices | [Outside source code] |
--logpath | The path to the log directory | [Outside source code] |
--ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl") | [Outside source code] |
--dspath | The path to the datasets | [Outside source code] |
--batch | Batch size | 50 |
--max-inner-iter | Corresponds to the number of inner iteration for logistic regression | 100 |
--num-epochs | Corresponds to the number of iterations over the training set | 10 |
--num-jobs | The number of parallel workers | 2 |
The output files generated after running the command are:
Without --fit-comm and --fit-features flags (see Example 1):
File | Description |
---|---|
[MODELNAME].pkl | The trained model |
[MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
[MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
[MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |
With --fit-features and no --fit-comm flags (see Example 2):
File | Description |
---|---|
[MODELNAME].pkl | The trained model |
[MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
[MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
[MODELNAME]_[U, V].pkl | Auxiliary matrices |
[MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |
With --fit-comm and --fit-features flags (see Example 3):
File | Description |
---|---|
[MODELNAME].pkl | The trained model |
[MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
[MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
[MODELNAME]_T.pkl | A matrix containing the pathway community representation |
[MODELNAME]_C.pkl | A matrix containing the pathway community indicators |
[MODELNAME]_R.pkl | A matrix containing the enzyme (or ECs) community representation |
[MODELNAME]_K.pkl | A matrix containing the enzyme (or ECs) community indicators |
[MODELNAME]_[U, V, L, Z].pkl | Auxiliary matrices |
[MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |
With --fit-comm, --fit-features, and --no-decomposition flags (see Example 4):
File | Description |
---|---|
[MODELNAME].pkl | The trained model |
[MODELNAME]_T.pkl | A matrix containing the pathway community representation |
[MODELNAME]_C.pkl | A matrix containing the pathway community indicators |
[MODELNAME]_R.pkl | A matrix containing the enzyme (or ECs) community representation |
[MODELNAME]_K.pkl | A matrix containing the enzyme (or ECs) community indicators |
[MODELNAME]_[U, V, L, Z].pkl | Auxiliary matrices |
[MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |
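The four tables above can be condensed into a small lookup that tells you which files to expect for a given flag combination. This is only a sketch of the documented cases: the function name `expected_training_outputs` is hypothetical, and combinations that the tables do not cover raise an error rather than guess. The *_final.pkl files that appear in the example folder trees are not listed in the tables and are therefore not included.

```python
# Matrix suffixes per documented flag combination, copied from the tables.
# Key order: (fit_features, fit_comm, no_decomposition)
DOCUMENTED_OUTPUTS = {
    (False, False, False): ["W", "H"],
    (True, False, False): ["W", "H", "U", "V"],
    (True, True, False): ["W", "H", "T", "C", "R", "K", "U", "V", "L", "Z"],
    (True, True, True): ["T", "C", "R", "K", "U", "V", "L", "Z"],
}


def expected_training_outputs(model_name, fit_features=False,
                              fit_comm=False, no_decomposition=False):
    """Return expected file names grouped by output folder (model/, result/)."""
    key = (fit_features, fit_comm, no_decomposition)
    if key not in DOCUMENTED_OUTPUTS:
        raise ValueError("flag combination is not covered by the tables")
    model_files = [f"{model_name}.pkl"]
    model_files += [f"{model_name}_{s}.pkl" for s in DOCUMENTED_OUTPUTS[key]]
    return {"model": model_files, "result": [f"{model_name}_cost.txt"]}
```

For instance, `expected_training_outputs("triUMPF_retrained_2", fit_features=True)` lists the model file, the W, H, U, and V matrices, and the cost file.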
Example 1: To decompose M into 100 components, provide the additional argument --num-components. Run the following command:
python main.py --train --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --model-name "triUMPF_retrained_1" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2
After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ ├── triUMPF_retrained_1.pkl
│ ├── triUMPF_retrained_1_W.pkl
│ ├── triUMPF_retrained_1_H.pkl
│ ├── triUMPF_retrained_1_W_final.pkl
│ ├── triUMPF_retrained_1_H_final.pkl
│ └── ...
├── dataset/
│ └── ...
├── result/
│ ├── triUMPF_retrained_1_cost.txt
│ └── ...
├── log/
│ ├── triUMPF_events
│ └── ...
└── triUMPF/
└── ...
Example 2: To decompose M into 100 components using external features, provide the additional arguments --fit-features, --P-name, and --E-name. Then, execute the following command:
python main.py --train --fit-features --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --model-name "triUMPF_retrained_2" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2
After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ ├── triUMPF_retrained_2.pkl
│ ├── triUMPF_retrained_2_W.pkl
│ ├── triUMPF_retrained_2_H.pkl
│ ├── triUMPF_retrained_2_U.pkl
│ ├── triUMPF_retrained_2_V.pkl
│ ├── triUMPF_retrained_2_W_final.pkl
│ ├── triUMPF_retrained_2_H_final.pkl
│ └── ...
├── dataset/
│ └── ...
├── result/
│ ├── triUMPF_retrained_2_cost.txt
│ └── ...
├── log/
│ ├── triUMPF_events
│ └── ...
└── triUMPF/
└── ...
Example 3: To train on a pathway dataset using external features and communities, provide the additional arguments --fit-comm, --num-communities-p, --num-communities-e, --A-name, and --B-name. Then, execute the following command:
python main.py --train --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_3" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2
After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ ├── triUMPF_retrained_3.pkl
│ ├── triUMPF_retrained_3_[W, H, U, V].pkl
│ ├── triUMPF_retrained_3_[C, K, T, R].pkl
│ ├── triUMPF_retrained_3_[L, Z].pkl
│ ├── triUMPF_retrained_3_[W, H, U, V]_final.pkl
│ ├── triUMPF_retrained_3_[C, K, T, R]_final.pkl
│ ├── triUMPF_retrained_3_[L, Z]_final.pkl
│ └── ...
├── dataset/
│ └── ...
├── result/
│ ├── triUMPF_retrained_3_cost.txt
│ └── ...
├── log/
│ ├── triUMPF_events
│ └── ...
└── triUMPF/
└── ...
Example 4: To train on the pathway dataset using a previously decomposed M (100 components), provide the additional arguments --no-decomposition, --W-name, and --H-name. Then, execute the following command:
python main.py --train --no-decomposition --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --W-name "triUMPF_retrained_3_W.pkl" --H-name "triUMPF_retrained_3_H.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_4" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2
After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:
triUMPF_materials/
├── objectset/
│ └── ...
├── model/
│ ├── triUMPF_retrained_4.pkl
│ ├── triUMPF_retrained_3_[W, H].pkl
│ ├── triUMPF_retrained_4_[C, K, T, R].pkl
│ ├── triUMPF_retrained_4_[L, Z].pkl
│ ├── triUMPF_retrained_3_[W, H]_final.pkl
│ ├── triUMPF_retrained_4_[C, K, T, R]_final.pkl
│ ├── triUMPF_retrained_4_[L, Z]_final.pkl
│ └── ...
├── dataset/
│ └── ...
├── result/
│ ├── triUMPF_retrained_4_cost.txt
│ └── ...
├── log/
│ ├── triUMPF_events
│ └── ...
└── triUMPF/
└── ...
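Each training example pairs a mode flag with a set of companion arguments. The sketch below checks a command line for companions that are missing. It is an assumption-laden convenience, not triUMPF behaviour: the mapping is read off the examples above (associating the community arguments with --fit-comm is an inference from Example 3), the name `missing_companions` is hypothetical, and because the argument table lists default values, main.py itself may accept commands that this check flags.

```python
# Companion arguments the examples pair with each mode flag.
REQUIRED_BY_FLAG = {
    "--fit-features": ["--P-name", "--E-name"],
    "--fit-comm": ["--num-communities-p", "--num-communities-e",
                   "--A-name", "--B-name"],
    "--no-decomposition": ["--W-name", "--H-name"],
}


def missing_companions(argv):
    """Return companion arguments absent from argv for the mode flags in use."""
    # Only tokens starting with "--" are option names; values are ignored.
    present = {token for token in argv if token.startswith("--")}
    missing = []
    for flag, companions in REQUIRED_BY_FLAG.items():
        if flag in present:
            missing += [name for name in companions if name not in present]
    return missing
```

An empty result means every mode flag in the command comes with the arguments the examples use alongside it.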