6. Advanced usage

Overview

triUMPF can be trained in several ways. The training data made available to users consists of i) EC number indices with embeddings ("biocyc205_tier23_9255_Xe.pkl") and ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). To train triUMPF to your own specifications, all the necessary data must be ready before training. If it is not, preprocess your data first (see Preprocessing).

Note: Make sure to put the source code triUMPF/ (see Installing triUMPF) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders (if you have not already created them during pathway prediction) in the same triUMPF_materials/ directory. The final structure should look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── triUMPF/
                └── ...
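
The two extra folders can also be created from Python. The following is a minimal sketch, not part of triUMPF itself; the function name ensure_output_dirs is hypothetical, and the root path is whatever location holds your triUMPF_materials/ directory.

```python
from pathlib import Path


def ensure_output_dirs(root):
    """Create log/ and result/ under root if they are missing.

    Returns the names of the folders that had to be created.
    """
    root = Path(root)
    created = []
    for name in ("log", "result"):
        folder = root / name
        if not folder.is_dir():
            # parents=True also creates root itself when it does not exist yet
            folder.mkdir(parents=True)
            created.append(name)
    return created
```

Calling ensure_output_dirs("triUMPF_materials") a second time returns an empty list, so it is safe to run before every experiment.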

For all experiments, using a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the triUMPF/ directory and then run the commands as shown in the Examples sections of Preprocessing and Training.

To display triUMPF's running options, use: python main.py --help. The help output should be self-explanatory.

Preprocessing

This step is required only if you wish to generate feature files such as "_Xe.pkl", "_Xa.pkl", etc., from a "[DATANAME]_X.pkl" file in order to use them for training.

Input:

The input file used for preprocessing is any matrix file containing EC number indices (e.g. biocyc21_X.pkl, cami_X.pkl).

Other files required for preprocessing:

biocyc.pkl
pathway2ec.pkl
pathway2ec_idx.pkl
pathway2vec_embeddings.npz
hin.pkl
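
Before launching preprocessing, it can help to confirm that the files listed above are actually present. The sketch below is illustrative only: missing_files is a hypothetical helper, and it assumes all five files sit in one directory, which this page does not state. Pass whichever directory holds them in your setup.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = (
    "biocyc.pkl",
    "pathway2ec.pkl",
    "pathway2ec_idx.pkl",
    "pathway2vec_embeddings.npz",
    "hin.pkl",
)


def missing_files(directory, names=REQUIRED_FILES):
    """Return the required file names that are not found in directory."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]
```

An empty list means every required file was found; otherwise the returned names tell you what still needs to be downloaded.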

Command:

The basic command is represented below. Do not use this for preprocessing. This command is only a representation of all the flags used. See Example below on how to preprocess your datasets.

python main.py \
--preprocess-dataset \
--white-links \
--ssample-input-size [NUMERIC] \
--object-name "biocyc.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_X*.pkl" \
--file-name "[FILENAME]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--batch 50 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Value
--preprocess-dataset	Preprocess inputs based on Biocyc collection	True
--white-links	Add noise to pathway to pathway and enzyme to enzyme association matrices	False
--ssample-input-size	The size of input subsample	0.7
--object-name	The preprocessed MetaCyc database file	biocyc.pkl
--pathway2ec-name	The matrix file representing Pathway-EC association	pathway2ec.pkl
--pathway2ec-idx-name	The pathway2ec association indices file	pathway2ec_idx.pkl
--hin-name	The heterogeneous information network file	hin.pkl
--features-name	The features corresponding to ECs and pathways	pathway2vec_embeddings.npz
--X-name	The input file name to be provided for preprocessing	[DATANAME]_X*.pkl
--file-name	The names of input preprocessed files (without extension)	[FILENAME]
--ospath	The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl")	[Outside source code]
--dspath	The path to the datasets	[Outside source code]
--batch	Batch size	50
--num-jobs	The number of parallel workers	2
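
If you run preprocessing for several datasets, assembling the command programmatically avoids copy-paste mistakes. The sketch below only uses flags from the table above; build_preprocess_command and format_command are hypothetical helper names, and the fixed file names are the ones shown on this page.

```python
import shlex


def build_preprocess_command(x_name, file_name, ssample_input_size=0.7,
                             white_links=False, ospath=None, dspath=None,
                             batch=50, num_jobs=2):
    """Return the preprocessing command as an argument list."""
    cmd = ["python", "main.py", "--preprocess-dataset"]
    if white_links:
        cmd.append("--white-links")
    cmd += [
        "--ssample-input-size", str(ssample_input_size),
        "--object-name", "biocyc.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", x_name,
        "--file-name", file_name,
    ]
    # The path flags are only added when the caller supplies them.
    if ospath is not None:
        cmd += ["--ospath", str(ospath)]
    if dspath is not None:
        cmd += ["--dspath", str(dspath)]
    cmd += ["--batch", str(batch), "--num-jobs", str(num_jobs)]
    return cmd


def format_command(cmd):
    """Render the argument list as a single shell-quoted line (Python 3.8+)."""
    return shlex.join(cmd)
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the src/ folder, or printed with format_command for inspection first.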

Output:

The output files generated after running the command are:

File	Description
[FILENAME]_Xa.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features.
[FILENAME]_Xc.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features.
[FILENAME]_Xe.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings.
[FILENAME]_Xea.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and abundance features.
[FILENAME]_Xec.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and coverage features.
[FILENAME]_Xm.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, abundance, and coverage features.
M.pkl	A matrix file (stored in the "dspath" location) representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC numbers indices) in the remaining columns. The file representation is similar to `pathway2ec.pkl`.
A.pkl	A matrix file (stored in the "dspath" location) representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns.
B.pkl	A matrix file (stored in the "dspath" location) representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC numbers indices) shown in the first column with their interactions (3650 EC numbers indices) in the remaining columns.
P.pkl	A matrix file (stored in the "dspath" location) representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to `pathway2vec_embeddings.npz`.
E.pkl	A matrix file (stored in the "dspath" location) representing the enzyme features (represented as EC numbers indices). It contains 3650 EC numbers indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to `pathway2vec_embeddings.npz`.

Note: These files differ in the total number of columns they contain. If you train your own model on one of these feature files, the same type of feature file must also be used during prediction.
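
A quick way to guard against a column mismatch is to compare the widths of the training and prediction matrices before use. The sketch below is an illustration under a strong assumption: that each file unpickles to a single two-dimensional, matrix-like object. This page does not confirm the on-disk layout, and triUMPF's own loader may read the files differently. All function names here are hypothetical.

```python
import pickle


def load_pickled(path):
    """Load one pickled object. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def num_columns(matrix):
    """Return the column count of an array-like or a list of rows."""
    shape = getattr(matrix, "shape", None)
    if shape is not None and len(shape) == 2:
        return shape[1]
    # Fall back to the length of the first row for plain nested lists.
    return len(matrix[0])


def same_width(train_matrix, predict_matrix):
    """True when both matrices have the same number of columns."""
    return num_columns(train_matrix) == num_columns(predict_matrix)
```

If same_width returns False for your training and prediction inputs, they were most likely produced as different feature file types (for example "_Xe.pkl" versus "_Xm.pkl").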

Examples

Example 1:

Execute the following command to preprocess "cami" data (as an example) with no noise:

python main.py --preprocess-dataset --ssample-input-size 1 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── cami_Xa.pkl
        │       ├── cami_Xc.pkl
        │       ├── cami_Xe.pkl
        │       ├── cami_Xea.pkl
        │       ├── cami_Xec.pkl
        │       ├── cami_Xm.pkl
        │       ├── M.pkl
        │       ├── A.pkl
        │       ├── B.pkl
        │       ├── P.pkl
        │       ├── E.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── triUMPF/
                └── ...

Example 2:

Execute the following command to preprocess "cami" data (as an example) with 20% noise added to the pathway2ec association matrix ("pathway2ec.pkl"):

python main.py --preprocess-dataset --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.

Example 3:

Execute the following command to preprocess "cami" data (as an example) with 20% noise added to the pathway2ec ("pathway2ec.pkl"), the pathway to pathway ("A"), and the enzyme to enzyme ("B") association matrices:

python main.py --preprocess-dataset --white-links --ssample-input-size 0.2 --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.

Training

Training can be done using one of the output files from the preprocessing step and the [DATANAME]_y.pkl file.

Keep in mind that the same type of feature file must be used during pathway prediction to avoid errors, since each output file from the preprocessing step contains a different number of columns. Here we show the recommended command, but you can also use other flags, as described in the argument descriptions table, to suit your requirements.

Input:

The input to the command is the output obtained from the preprocessing step above (any one of the [DATANAME]_X*.pkl files and the [DATANAME]_y.pkl file).

Recommended command:

The basic command is represented below. Do not use this for training. This command is only a representation of all the flags used. See Examples below on how to train a model.

python main.py \
--train \
--no-decomposition \
--fit-features \
--fit-comm \
--num-components 100 \
--num-communities-p 90 \
--num-communities-e 100 \
--ssample-input-size 0.7 \
--ssample-label-size 2000 \
--lambdas 0.01 0.01 0.01 0.01 0.001 10 \
--penalty "l21" \
--M-name "M.pkl" \
--P-name "P.pkl" \
--E-name "E.pkl" \
--A-name "A.pkl" \
--B-name "B.pkl" \
--X-name "[DATANAME]_X*.pkl" \
--y-name "[DATANAME]_y.pkl" \
--model-name "[MODELNAME] (without extension)" \
--W-name "[MODELNAME]_W.pkl" \
--H-name "[MODELNAME]_H.pkl" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 50 \
--max-inner-iter 100 \
--num-epochs 10 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Default Value
--train	Training the triUMPF model	True
--no-decomposition	Whether to skip decomposing the pathway to enzyme association matrix (previously computed W and H are used instead)	False
--fit-features	Whether to fit by external features	False
--fit-comm	Whether to fit community	False
--num-components	Number of components	100
--num-communities-p	Number of communities for pathways	90
--num-communities-e	Number of communities for enzymes	100
--ssample-input-size	Corresponds to the size of random subsampled inputs	0.7
--ssample-label-size	Corresponds to the size of random subsampled pathway labels	2000
--lambdas	Corresponds to the six hyper-parameters for constraints	0.01, 0.01, 0.01, 0.01, 0.001, 10
--penalty	The type of regularization term to be applied	l21
--M-name	The pathway2ec association matrix file name	"M.pkl"
--P-name	The pathway features file name	"P.pkl"
--E-name	The enzyme features file name	"E.pkl"
--A-name	The pathway to pathway association file name	"A.pkl"
--B-name	The enzyme to enzyme association file name	"B.pkl"
--X-name	Input space of multi-label data	biocyc_Xe.pkl
--y-name	Pathway space of multi-label data	biocyc_y.pkl
--model-name	Corresponds to the name of the model excluding any EXTENSION. The model name will have .pkl extension	"triUMPF"
--W-name	File name for the pathways latent factors	"[MODELNAME]_W.pkl"
--H-name	File name for the enzymes (or ECs) latent factors	"[MODELNAME]_H.pkl"
--mdpath	The path to store model	[Outside source code]
--rspath	The path to store costs and resulting samples indices	[Outside source code]
--logpath	The path to the log directory	[Outside source code]
--ospath	The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl")	[Outside source code]
--dspath	The path to the datasets	[Outside source code]
--batch	Batch size	50
--max-inner-iter	Corresponds to the number of inner iteration for logistic regression	100
--num-epochs	Corresponds to the number of iterations over the training set	10
--num-jobs	The number of parallel workers	2

Output:

The output files generated after running the command are:

Without --fit-comm and --fit-features flags (see Example 1):

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_W.pkl	A matrix containing the latent factors for pathways
[MODELNAME]_H.pkl	A matrix containing the latent factors for enzymes (or ECs)
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored

With --fit-features and no --fit-comm flags (see Example 2):

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_W.pkl	A matrix containing the latent factors for pathways
[MODELNAME]_H.pkl	A matrix containing the latent factors for enzymes (or ECs)
[MODELNAME]_[U, V].pkl	Auxiliary matrices
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored

With --fit-comm and --fit-features flags (see Example 3):

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_W.pkl	A matrix containing the latent factors for pathways
[MODELNAME]_H.pkl	A matrix containing the latent factors for enzymes (or ECs)
[MODELNAME]_T.pkl	A matrix containing the pathway community representation
[MODELNAME]_C.pkl	A matrix containing the pathway community indicators
[MODELNAME]_R.pkl	A matrix containing the enzyme (or ECs) community representation
[MODELNAME]_K.pkl	A matrix containing the enzyme (or ECs) community indicators
[MODELNAME]_[U, V, L, Z].pkl	Auxiliary matrices
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored

With --fit-comm, --fit-features, and --no-decomposition flags (see Example 4):

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_T.pkl	A matrix containing the pathway community representation
[MODELNAME]_C.pkl	A matrix containing the pathway community indicators
[MODELNAME]_R.pkl	A matrix containing the enzyme (or ECs) community representation
[MODELNAME]_K.pkl	A matrix containing the enzyme (or ECs) community indicators
[MODELNAME]_[U, V, L, Z].pkl	Auxiliary matrices
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored
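
The four tables above can be condensed into a small lookup that lists which model files to expect for a given flag combination. This is a sketch based only on those tables; expected_model_files is a hypothetical helper, combinations not documented on this page raise an error rather than guess, and the cost and log files (stored elsewhere) are left out.

```python
def expected_model_files(model_name, fit_features=False, fit_comm=False,
                         no_decomposition=False):
    """List the model files documented for a combination of training flags."""
    # Keys are (fit_features, fit_comm, no_decomposition).
    suffixes_by_combo = {
        (False, False, False): ["W", "H"],
        (True, False, False): ["W", "H", "U", "V"],
        (True, True, False): ["W", "H", "T", "C", "R", "K", "U", "V", "L", "Z"],
        (True, True, True): ["T", "C", "R", "K", "U", "V", "L", "Z"],
    }
    combo = (fit_features, fit_comm, no_decomposition)
    if combo not in suffixes_by_combo:
        raise ValueError("flag combination is not documented on this page")
    files = [model_name + ".pkl"]
    files += ["{}_{}.pkl".format(model_name, s) for s in suffixes_by_combo[combo]]
    return files
```

For instance, expected_model_files("triUMPF_retrained_2", fit_features=True) lists the model file followed by its W, H, U, and V matrices.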

Examples

Example 1:

To decompose M into 100 components, provide the additional argument --num-components. Run the following command:

python main.py --train --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --model-name "triUMPF_retrained_1" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_1.pkl
        │       ├── triUMPF_retrained_1_W.pkl
        │       ├── triUMPF_retrained_1_H.pkl
        │       ├── triUMPF_retrained_1_W_final.pkl
        │       ├── triUMPF_retrained_1_H_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_1_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 2:

To decompose M into 100 components using external features, provide the additional arguments --fit-features, --P-name, and --E-name. Then, execute the following command:

python main.py --train --fit-features --num-components 100 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --model-name "triUMPF_retrained_2" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_2.pkl
        │       ├── triUMPF_retrained_2_W.pkl
        │       ├── triUMPF_retrained_2_H.pkl
        │       ├── triUMPF_retrained_2_U.pkl
        │       ├── triUMPF_retrained_2_V.pkl
        │       ├── triUMPF_retrained_2_W_final.pkl
        │       ├── triUMPF_retrained_2_H_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_2_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 3:

To train on a pathway dataset using external features and community information, provide the additional arguments --fit-comm, --num-communities-p, --num-communities-e, --A-name, and --B-name, together with the --X-name and --y-name data files. Then, execute the following command:

python main.py --train --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --M-name "M.pkl" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_3" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_3.pkl
        │       ├── triUMPF_retrained_3_[W, H, U, V].pkl
        │       ├── triUMPF_retrained_3_[C, K, T, R].pkl
        │       ├── triUMPF_retrained_3_[L, Z].pkl
        │       ├── triUMPF_retrained_3_[W, H, U, V]_final.pkl
        │       ├── triUMPF_retrained_3_[C, K, T, R]_final.pkl
        │       ├── triUMPF_retrained_3_[L, Z]_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_3_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...

Example 4:

If you wish to reuse the previous decomposition of M (100 components) to train on the pathway dataset, provide the additional arguments --no-decomposition, --W-name, and --H-name. Then, execute the following command:

python main.py --train --no-decomposition --fit-features --fit-comm --num-components 100 --num-communities-p 90 --num-communities-e 100 --ssample-label-size 2000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --P-name "P.pkl" --E-name "E.pkl" --A-name "A.pkl" --B-name "B.pkl" --W-name "triUMPF_retrained_3_W.pkl" --H-name "triUMPF_retrained_3_H.pkl" --X-name "biocyc205_tier23_9255_Xe.pkl" --y-name "biocyc205_tier23_9255_y.pkl" --model-name "triUMPF_retrained_4" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

triUMPF_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── triUMPF_retrained_4.pkl
        │       ├── triUMPF_retrained_3_[W, H].pkl
        │       ├── triUMPF_retrained_4_[C, K, T, R].pkl
        │       ├── triUMPF_retrained_4_[L, Z].pkl
        │       ├── triUMPF_retrained_3_[W, H]_final.pkl
        │       ├── triUMPF_retrained_4_[C, K, T, R]_final.pkl
        │       ├── triUMPF_retrained_4_[L, Z]_final.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── triUMPF_retrained_4_cost.txt
        │       └── ...
	├── log/
        │       ├── triUMPF_events
        │       └── ...
	└── triUMPF/
                └── ...
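
The training examples above each pair a mode flag with a set of companion arguments. The sketch below encodes that pairing as read from the examples, so a command can be checked before a long run starts. It is an illustration, not triUMPF behaviour: missing_arguments is a hypothetical helper, the association of the community arguments with --fit-comm is inferred from Example 3, and main.py may well fall back to its documented defaults when an argument is omitted.

```python
# Companion arguments per mode flag, as used in Examples 2 to 4 above.
EXTRA_ARGUMENTS = {
    "--fit-features": ["--P-name", "--E-name"],
    "--fit-comm": ["--num-communities-p", "--num-communities-e",
                   "--A-name", "--B-name"],
    "--no-decomposition": ["--W-name", "--H-name"],
}


def missing_arguments(argv):
    """Return companion arguments that the given mode flags call for but argv lacks."""
    present = {arg for arg in argv if arg.startswith("--")}
    missing = []
    for flag, needed in EXTRA_ARGUMENTS.items():
        if flag in present:
            missing += [name for name in needed if name not in present]
    return missing
```

An empty result means every mode flag in the command is accompanied by the arguments shown in the matching example.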
