6. Advanced usage

Abdurrahman Abul-Basher edited this page Sep 6, 2021 · 34 revisions

Overview

triUMPF can be trained in several ways. The training data made available to users consists of (i) EC number indices with embeddings ("biocyc205_tier23_9255_Xe.pkl") and (ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). To train triUMPF to your own specifications, you need all the necessary data ready before training. If you do not have it, consider running the Preprocessing step first.

Note: Make sure to put the source code triUMPF/ (see Installing triUMPF) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders (if you have not already created them during pathway prediction) in the same triUMPF_materials/ directory. The final structure should look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── triUMPF/
                └── ...
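
The layout above can be prepared and checked with a short script. This is a minimal sketch and not part of triUMPF: the function name `ensure_layout` is hypothetical, and it assumes the folder names shown in the tree. It creates only log/ and result/ and reports which of the other expected folders are still missing.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
EXPECTED = ["objectset", "model", "dataset", "result", "log", "triUMPF"]
# Only these two are created here; the others come from the download and
# installation steps, so they are reported rather than created.
CREATE_IF_MISSING = ["result", "log"]


def ensure_layout(root):
    """Create result/ and log/ under root; return expected folders still missing."""
    root = Path(root)
    for name in CREATE_IF_MISSING:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

Calling `ensure_layout("triUMPF_materials")` from the parent directory returns an empty list once every folder is in place.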

For all experiments, open a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the triUMPF/ directory, and then run the commands shown in the Examples sections of Preprocessing and Training.

To display triUMPF's running options, use: python main.py --help. The help output should be self-explanatory.

Preprocessing

This step is required only if you wish to generate embedding files such as "_Xe.pkl", "_Xa.pkl", etc., from a "[DATANAME]_X.pkl" file in order to use them for training.

Input:

The input file for preprocessing is any matrix file containing EC number indices (e.g. biocyc21_X.pkl, cami_X.pkl).

Other files required for preprocessing:

  1. biocyc.pkl
  2. pathway2ec.pkl
  3. pathway2ec_idx.pkl
  4. pathway2vec_embeddings.npz
  5. hin.pkl
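
Before preprocessing, it can help to confirm that these files are present. The sketch below is an illustration only: `missing_required_files` is a hypothetical helper, and it assumes all five files sit in a single directory (for example the one passed as --ospath), which this page does not state explicitly.

```python
from pathlib import Path

# File names from the list above.
REQUIRED_FILES = [
    "biocyc.pkl",
    "pathway2ec.pkl",
    "pathway2ec_idx.pkl",
    "pathway2vec_embeddings.npz",
    "hin.pkl",
]


def missing_required_files(directory):
    """Return the required file names that are not present in directory."""
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty list means every file was found in the given directory.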

Command:

The command below is a template that lists all the flags used; do not run it as-is for preprocessing. See the Examples below for how to preprocess your datasets.

python main.py \
--preprocess-dataset \
--white-links \
--ssample-input-size [NUMERIC] \
--object-name "biocyc.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_X*.pkl" \
--file-name "[FILENAME]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--batch 50 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Value |
|---|---|---|
| --preprocess-dataset | Preprocess inputs based on Biocyc collection | True |
| --white-links | Add noise to pathway to pathway and enzyme to enzyme association matrices | False |
| --ssample-input-size | The size of input subsample | 0.7 |
| --object-name | The preprocessed MetaCyc database file | biocyc.pkl |
| --pathway2ec-name | The matrix file representing Pathway-EC association | pathway2ec.pkl |
| --pathway2ec-idx-name | The pathway2ec association indices file | pathway2ec_idx.pkl |
| --hin-name | The heterogeneous information network file | hin.pkl |
| --features-name | The features corresponding ECs and pathways | pathway2vec_embeddings.npz |
| --X-name | The Input file name to be provided for preprocessing | [DATANAME]_X*.pkl |
| --file-name | The names of input preprocessed files (without extension) | [FILENAME] |
| --ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl") | [Outside source code] |
| --dspath | The path to the datasets | [Outside source code] |
| --batch | Batch size | 50 |
| --num-jobs | The number of parallel workers | 2 |
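
For scripted runs, the flags in the table can be assembled programmatically. The following is a rough sketch rather than part of triUMPF: `build_preprocess_command` is a hypothetical helper, the paths are placeholders, and the flag names and default values are copied from the table above.

```python
import shlex


def build_preprocess_command(x_name, file_name, ospath, dspath,
                             ssample_input_size=0.7, white_links=False,
                             batch=50, num_jobs=2):
    """Assemble the argument list for one preprocessing run of main.py."""
    cmd = ["python", "main.py", "--preprocess-dataset"]
    if white_links:
        # Boolean switch: it is either present or absent and takes no value.
        cmd.append("--white-links")
    cmd += [
        "--ssample-input-size", str(ssample_input_size),
        "--object-name", "biocyc.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", x_name,
        "--file-name", file_name,
        "--ospath", str(ospath),
        "--dspath", str(dspath),
        "--batch", str(batch),
        "--num-jobs", str(num_jobs),
    ]
    return cmd


# Placeholder paths, for illustration only.
command = build_preprocess_command(
    "cami_X.pkl", "cami",
    ospath="/path/to/triUMPF_materials/objectset",
    dspath="/path/to/triUMPF_materials/dataset",
    ssample_input_size=1,
)
print(shlex.join(command))  # a shell command that can be copied and pasted
```

The returned list can also be passed to `subprocess.run(command, cwd=..., check=True)` with `cwd` pointing at the src/ folder, since the commands on this page are meant to be run from there.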

Output:

The output files generated after running the command are:

| File | Description |
|---|---|
| [FILENAME]_Xa.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features. |
| [FILENAME]_Xc.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features. |
| [FILENAME]_Xe.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings. |
| [FILENAME]_Xea.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features. |
| [FILENAME]_Xec.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features. |
| [FILENAME]_Xm.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, abundance, and coverage features. |
| M.pkl | A matrix file (stored in the "dspath" location) representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC numbers indices) in the remaining columns. The file representation is similar to pathway2ec.pkl. |
| A.pkl | A matrix file (stored in the "dspath" location) representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns. |
| B.pkl | A matrix file (stored in the "dspath" location) representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC numbers indices) shown in the first column with their interactions (3650 EC numbers indices) in the remaining columns. |
| P.pkl | A matrix file (stored in the "dspath" location) representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. |
| E.pkl | A matrix file (stored in the "dspath" location) representing the enzyme features (represented as EC numbers indices). It contains 3650 EC numbers indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. |

Note: These files differ in their total number of columns. If you train your own model on one of them, use the same type of file during prediction.
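
To compare column counts between two generated files, they can be inspected from Python. This page does not describe the internal layout of the .pkl files, so the sketch below makes two labeled assumptions: a file may hold one or more pickled objects written back to back, and the matrix object exposes a two-dimensional `shape` attribute (as NumPy arrays and SciPy sparse matrices do). Both helper names are hypothetical. Only unpickle files you trust.

```python
import pickle


def load_all_objects(path):
    """Return every object stored in a pickle file, in the order it was written."""
    objects = []
    with open(path, "rb") as handle:
        while True:
            try:
                objects.append(pickle.load(handle))
            except EOFError:
                # No more pickled objects in the file.
                break
    return objects


def column_count(obj):
    """Return the number of columns if obj has a 2-D shape, otherwise None."""
    shape = getattr(obj, "shape", None)
    if shape is not None and len(shape) == 2:
        return shape[1]
    return None
```

Applying `column_count` to each object returned by `load_all_objects` for a training file and a prediction file shows whether their widths agree.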

Examples

Example 1:

Execute the following command to preprocess "cami" data (as an example) with no noise:

python main.py --preprocess-dataset --ssample-input-size 1 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── cami_Xa.pkl
        │       ├── cami_Xc.pkl
        │       ├── cami_Xe.pkl
        │       ├── cami_Xea.pkl
        │       ├── cami_Xec.pkl
        │       ├── cami_Xm.pkl
        │       ├── M.pkl
        │       ├── A.pkl
        │       ├── B.pkl
        │       ├── P.pkl
        │       ├── E.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── triUMPF/
                └── ...
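
A quick way to confirm that a preprocessing run produced everything in the output table is to compare the dataset/ folder against the expected names. This is a small sketch with hypothetical helper names; the file names follow the table and the tree above.

```python
from pathlib import Path

# Suffixes of the per-dataset feature files listed in the output table.
X_SUFFIXES = ["Xa", "Xc", "Xe", "Xea", "Xec", "Xm"]
# Files whose names do not depend on --file-name.
SHARED_FILES = ["M.pkl", "A.pkl", "B.pkl", "P.pkl", "E.pkl"]


def expected_preprocess_outputs(file_name):
    """List the output file names for a given --file-name value."""
    return [f"{file_name}_{suffix}.pkl" for suffix in X_SUFFIXES] + SHARED_FILES


def missing_preprocess_outputs(dspath, file_name):
    """Return the expected output files that are absent from dspath."""
    dspath = Path(dspath)
    return [name for name in expected_preprocess_outputs(file_name)
            if not (dspath / name).is_file()]
```

For the example above, `expected_preprocess_outputs("cami")` starts with `cami_Xa.pkl` and ends with `E.pkl`.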

Example 2:

Execute the following command to preprocess "cami" data (as an example) with 20% noise added to the pathway2ec association matrix ("pathway2ec.pkl"):

python main.py --preprocess-dataset --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.

Example 3:

Execute the following command to preprocess "cami" data (as an example) with 20% noise added to the pathway2ec ("pathway2ec.pkl"), the pathway to pathway ("A"), and the enzyme to enzyme ("B") association matrices:

python main.py --preprocess-dataset --white-links --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.

Training

Training uses one of the output files from the preprocessing step together with the [DATANAME]_y.pkl file.

Use the same type of file during pathway prediction to avoid errors, since each output file from the preprocessing step contains a different number of columns. The recommended command is shown here, but you can also use other flags, as described in the argument descriptions table, to suit your requirements.

Input:

The input to the command is the output obtained from the preprocessing step above (any one of the [DATANAME]_X*.pkl files and the [DATANAME]_y.pkl file).

Recommended command:

The command below is a template that lists all the flags used; do not run it as-is for training. See the Examples below for how to train a model.

python main.py \
--train \
--no-decomposition \
--fit-features \
--fit-comm \
--num-components 100 \
--num-communities-p 90 \
--num-communities-e 100 \
--ssample-input-size 0.7 \
--ssample-label-size 2000 \
--lambdas 0.01 0.01 0.01 0.01 0.001 10 \
--penalty "l21" \
--M-name "M.pkl" \
--P-name "P.pkl" \
--E-name "E.pkl" \
--A-name "A.pkl" \
--B-name "B.pkl" \
--X-name "[DATANAME]_X*.pkl" \
--y-name "[DATANAME]_y.pkl" \
--model-name "[MODELNAME] (without extension)" \
--W-name "[MODELNAME]_W.pkl" \
--H-name "[MODELNAME]_H.pkl" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 50 \
--max-inner-iter 100 \
--num-epochs 10 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Default Value |
|---|---|---|
| --train | Training the triUMPF model | True |
| --no-decomposition | Whether to decompose pathway to enzyme association matrix | False |
| --fit-features | Whether to fit by external features | False |
| --fit-comm | Whether to fit community | False |
| --num-components | Number of components | 100 |
| --num-communities-p | Number of communities for pathways | 90 |
| --num-communities-e | Number of communities for enzymes | 100 |
| --ssample-input-size | Corresponds to the size of random subsampled inputs | 0.7 |
| --ssample-label-size | Corresponds to the size of random subsampled pathway labels | 2000 |
| --lambdas | Corresponds to the six hyper-parameters for constraints | 0.01, 0.01, 0.01, 0.01, 0.001, 10 |
| --penalty | The type of regularization term to be applied | l21 |
| --M-name | The pathway2ec association matrix file name | "M.pkl" |
| --P-name | The pathway features file name | "P.pkl" |
| --E-name | The enzyme features file name | "E.pkl" |
| --A-name | The pathway to pathway association file name | "A.pkl" |
| --B-name | The enzyme to enzyme association file name | "B.pkl" |
| --X-name | Input space of multi-label data | biocyc_Xe.pkl |
| --y-name | Pathway space of multi-label data | biocyc_y.pkl |
| --model-name | Corresponds to the name of the model excluding any EXTENSION. The model name will have .pkl extension | "triUMPF" |
| --W-name | File name for the pathways latent factors | "[MODELNAME]_W.pkl" |
| --H-name | File name for the enzymes (or ECs) latent factors | "[MODELNAME]_H.pkl" |
| --mdpath | The path to store model | [Outside source code] |
| --rspath | The path to store costs and resulting samples indices | [Outside source code] |
| --logpath | The path to the log directory | [Outside source code] |
| --ospath | The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl") | [Outside source code] |
| --dspath | The path to the datasets | [Outside source code] |
| --batch | Batch size | 50 |
| --max-inner-iter | Corresponds to the number of inner iteration for logistic regression | 100 |
| --num-epochs | Corresponds to the number of iterations over the training set | 10 |
| --num-jobs | The number of parallel workers | 2 |

Output:

The output files generated after running the command are:

Without --fit-comm and --fit-features flags (see Example 1):

| File | Description |
|---|---|
| [MODELNAME].pkl | The trained model |
| [MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
| [MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
| [MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
| log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |

With --fit-features and no --fit-comm flags (see Example 2):

| File | Description |
|---|---|
| [MODELNAME].pkl | The trained model |
| [MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
| [MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
| [MODELNAME]_[U, V].pkl | Auxiliary matrices |
| [MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
| log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |

With --fit-comm and --fit-features flags (see Example 3):

| File | Description |
|---|---|
| [MODELNAME].pkl | The trained model |
| [MODELNAME]_W.pkl | A matrix containing the latent factors for pathways |
| [MODELNAME]_H.pkl | A matrix containing the latent factors for enzymes (or ECs) |
| [MODELNAME]_T.pkl | A matrix containing the pathway community representation |
| [MODELNAME]_C.pkl | A matrix containing the pathway community indicators |
| [MODELNAME]_R.pkl | A matrix containing the enzyme (or ECs) community representation |
| [MODELNAME]_K.pkl | A matrix containing the enzyme (or ECs) community indicators |
| [MODELNAME]_[U, V, L, Z].pkl | Auxiliary matrices |
| [MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
| log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |

With --fit-comm, --fit-features, and --no-decomposition flags (see Example 4):

| File | Description |
|---|---|
| [MODELNAME].pkl | The trained model |
| [MODELNAME]_T.pkl | A matrix containing the pathway community representation |
| [MODELNAME]_C.pkl | A matrix containing the pathway community indicators |
| [MODELNAME]_R.pkl | A matrix containing the enzyme (or ECs) community representation |
| [MODELNAME]_K.pkl | A matrix containing the enzyme (or ECs) community indicators |
| [MODELNAME]_[U, V, L, Z].pkl | Auxiliary matrices |
| [MODELNAME]_cost.txt | This file contains error values between predicted values and expected values |
| log file | This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored |
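
The four output tables above can be condensed into a lookup keyed by flag combination. The sketch below is only a restatement of those tables: `documented_model_files` is a hypothetical helper, it covers just the four documented combinations, and it leaves out the log file because its name is not given in the tables.

```python
# Keys: (fit_features, fit_comm, no_decomposition).
# Values: matrix suffixes listed in the corresponding table above.
DOCUMENTED_OUTPUTS = {
    (False, False, False): ["W", "H"],
    (True, False, False): ["W", "H", "U", "V"],
    (True, True, False): ["W", "H", "T", "C", "R", "K", "U", "V", "L", "Z"],
    (True, True, True): ["T", "C", "R", "K", "U", "V", "L", "Z"],
}


def documented_model_files(model_name, fit_features=False, fit_comm=False,
                           no_decomposition=False):
    """Return the output file names the tables list for a flag combination."""
    key = (fit_features, fit_comm, no_decomposition)
    if key not in DOCUMENTED_OUTPUTS:
        raise ValueError("flag combination not covered by the output tables")
    files = [f"{model_name}.pkl"]
    files += [f"{model_name}_{suffix}.pkl" for suffix in DOCUMENTED_OUTPUTS[key]]
    files.append(f"{model_name}_cost.txt")
    return files
```

Raising an error for undocumented combinations is a deliberate choice here, so the helper never guesses at outputs this page does not describe.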

Examples

Example 1:

To decompose M into 100 components, provide the additional argument --num-components. Run the following command:

python main.py --train --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --model-name "triUMPF_retrained_1" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_1.pkl
        │       ├── triUMPF_retrained_1_W.pkl
        │       ├── triUMPF_retrained_1_H.pkl
        │       ├── triUMPF_retrained_1_W_final.pkl
        │       ├── triUMPF_retrained_1_H_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_1_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 2:

To decompose M into 100 components using external features, provide the additional arguments --fit-features, --P-name, and --E-name. Then execute the following command:

python main.py --train --fit-features --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --model-name "triUMPF_retrained_2" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_2.pkl
        │       ├── triUMPF_retrained_2_W.pkl
        │       ├── triUMPF_retrained_2_H.pkl
        │       ├── triUMPF_retrained_2_U.pkl
        │       ├── triUMPF_retrained_2_V.pkl
        │       ├── triUMPF_retrained_2_W_final.pkl
        │       ├── triUMPF_retrained_2_H_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_2_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 3:

To train on a pathway dataset using external features and community fitting, provide the additional arguments --fit-comm, --num-communities-p, --num-communities-e, --A-name, and --B-name, together with the dataset files (--X-name and --y-name). Then execute the following command:

python main.py --train --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_3" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_3.pkl
        │       ├── triUMPF_retrained_3_[W, H, U, V].pkl
        │       ├── triUMPF_retrained_3_[C, K, T, R].pkl
        │       ├── triUMPF_retrained_3_[L, Z].pkl
        │       ├── triUMPF_retrained_3_[W, H, U, V]_final.pkl
        │       ├── triUMPF_retrained_3_[C, K, T, R]_final.pkl
        │       ├── triUMPF_retrained_3_[L, Z]_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_3_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 4:

To train on the pathway dataset using the previously decomposed M (100 components), provide the additional arguments --no-decomposition, --W-name, and --H-name. Then execute the following command:

python main.py --train --no-decomposition --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --W-name "triUMPF_retrained_3_W.pkl" --H-name "triUMPF_retrained_3_H.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_4" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_4.pkl
        │       ├── triUMPF_retrained_3_[W, H].pkl
        │       ├── triUMPF_retrained_4_[C, K, T, R].pkl
        │       ├── triUMPF_retrained_4_[L, Z].pkl
        │       ├── triUMPF_retrained_3_[W, H]_final.pkl
        │       ├── triUMPF_retrained_4_[C, K, T, R]_final.pkl
        │       ├── triUMPF_retrained_4_[L, Z]_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_4_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...
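
Examples 2 to 4 pair each mode flag with companion arguments. A command line can be checked against those pairings before it is run. This is an illustrative sketch, not triUMPF's own validation: `missing_companions` is a hypothetical helper, and because main.py lists default values for these arguments, an omitted companion may simply fall back to its default rather than cause an error.

```python
import shlex

# Companion arguments as they appear in Examples 2 to 4 above.
COMPANIONS = {
    "--fit-features": ["--P-name", "--E-name"],
    "--fit-comm": ["--num-communities-p", "--num-communities-e",
                   "--A-name", "--B-name"],
    "--no-decomposition": ["--W-name", "--H-name"],
}


def missing_companions(command_line):
    """Map each mode flag in command_line to the companion arguments it lacks."""
    tokens = shlex.split(command_line)
    present = {token for token in tokens if token.startswith("--")}
    report = {}
    for flag, companions in COMPANIONS.items():
        if flag in present:
            absent = [name for name in companions if name not in present]
            if absent:
                report[flag] = absent
    return report
```

An empty dictionary means every mode flag in the command is accompanied by the arguments the examples use with it.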
