
Preprocessing Script Overview

Marcus Fedarko edited this page Jul 16, 2018 · 13 revisions

MetagenomeScope's "preprocessing script" is a program that takes as input an assembly graph file and produces a SQLite database file that can be visualized in MetagenomeScope's viewer interface.
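Because the output is an ordinary SQLite file, it can be inspected with Python's built-in sqlite3 module once the script has finished. The following is a minimal sketch that only lists table names; it assumes nothing about the database schema, which is not documented on this page. The function name list_tables and the example path are hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path.

    Hypothetical helper: it reads SQLite's own catalog (sqlite_master), so it
    works without knowing which tables the preprocessing script creates.
    """
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For example, list_tables("output/assembly.db") would return the table names in a database written with the (hypothetical) output directory "output" and output prefix "assembly".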

The preprocessing script is located in the graph_collator/ directory of MetagenomeScope and can be run with the command python graph_collator/collate.py.
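When the script is driven from another Python program, the command line can be assembled as a list of arguments. The sketch below uses only the -i, -o, -d, and -w options listed under Usage; the helper names, the default interpreter name "python", and the example file names are assumptions made for illustration.

```python
import subprocess


def build_collate_command(input_file, output_prefix, output_directory=None,
                          overwrite=False, python="python",
                          script_path="graph_collator/collate.py"):
    """Build an argument list for the preprocessing script.

    Hypothetical helper: only flags documented in the Usage section are used.
    """
    command = [python, script_path, "-i", input_file, "-o", output_prefix]
    if output_directory is not None:
        # -d: directory in which all output files will be stored
        command += ["-d", output_directory]
    if overwrite:
        # -w: overwrite output files
        command.append("-w")
    return command


def run_collate(*args, **kwargs):
    """Run the script and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_collate_command(*args, **kwargs), check=True)
```

For example, build_collate_command("assembly.gfa", "assembly", output_directory="output", overwrite=True) produces the arguments for python graph_collator/collate.py -i assembly.gfa -o assembly -d output -w. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.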

Subpages

  • System Requirements: a list of the various libraries needed to run the preprocessing script
  • Installation: a guide to installing the "SPQR version" of the script on your system
    • The "non-SPQR version" of the script is written solely in Python, so it is relatively portable.
    • However, the "SPQR version" of the preprocessing script uses OGDF (a C++ library), along with a C++ script that interfaces with it, to generate SPQR tree decompositions for MetagenomeScope's "decomposition mode." Installing this version therefore takes a few extra steps to compile the C++ script on your system.
  • Settings: information about the various options available when running the script

Usage

./collate.py [-h] -i INPUTFILE -o OUTPUTPREFIX [-d OUTPUTDIRECTORY]
             [-spqr] [-pg] [-px] [-sp] [-w] [-nt] [-b BICOMPONENTFILE]
             [-ub USERBUBBLEFILE] [-ubl] [-up USERPATTERNFILE] [-upl]
             [-nbdf]

  -h, --help            show this help message and exit
  -i INPUTFILE, --inputfile INPUTFILE
                        input assembly graph filename (LastGraph, GFA, or
                        MetaCarvel GML)
  -o OUTPUTPREFIX, --outputprefix OUTPUTPREFIX
                        output file prefix for .db and .xdot/.gv files
  -d OUTPUTDIRECTORY, --outputdirectory OUTPUTDIRECTORY
                        directory in which all output files will be stored;
                        defaults to current working directory
  -spqr, --computespqrdata
                        compute data for the SPQR "decomposition modes" in
                        MetagenomeScope; necessitates a few additional system
                        requirements (see wiki for details)
  -pg, --preservegv     save all .gv (DOT) files generated for nontrivial
                        (i.e. containing more than one node, or at least one
                        edge or node group) connected components
  -px, --preservexdot   save all .xdot files generated for nontrivial
                        connected components
  -sp, --structuralpatterns
                        create .txt files in the output directory containing
                        node information for all structural patterns
                        identified in the graph
  -w, --overwrite       overwrite output files
  -nt, --notriangulation
                        disable triangle smoothing in the SPQR mode
  -b BICOMPONENTFILE, --bicomponentfile BICOMPONENTFILE
                        file containing bicomponent information for the
                        assembly graph (will be generated using the SPQR
                        script in the output directory if not passed)
  -ub USERBUBBLEFILE, --userbubblefile USERBUBBLEFILE
                        file describing pre-identified bubbles in the graph,
                        in the format of MetaCarvel's bubbles.txt output: each
                        line of the file is formatted as (source ID) (tab)
                        (sink ID) (tab) (all node IDs in the bubble, including
                        source and sink IDs, all separated by tabs)
  -ubl, --userbubblelabelsused
                        use node labels instead of IDs in the pre-identified
                        bubbles file
  -up USERPATTERNFILE, --userpatternfile USERPATTERNFILE
                        file describing pre-identified miscellaneous
                        structural patterns in the graph: each line of the
                        file is formatted as (pattern type) (tab) (all node
                        IDs in the pattern, all separated by tabs)
  -upl, --userpatternlabelsused
                        use node labels instead of IDs in the pre-identified
                        misc. patterns file
  -nbdf, --nobackfilldotfiles
                        produces .gv (DOT) files without cluster "backfilling"
                        for each nontrivial connected component in the graph;
                        use of this argument doesn't impact the .db files
                        produced by this script -- it just demonstrates the
                        functionality in layout linearization provided by
                        cluster "backfilling"
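
The tab-separated line formats described for -ub and -up above can be sketched as follows. This is an illustration under stated assumptions rather than the script's own parsing code: the function names are hypothetical, node IDs are treated as plain strings, and "example_type" is a placeholder because the accepted pattern type names are not listed on this page.

```python
def format_bubble_line(source_id, sink_id, node_ids):
    """Format one line of a -ub file.

    Layout: (source ID) TAB (sink ID) TAB (all node IDs in the bubble,
    including the source and sink IDs, separated by tabs).
    """
    node_ids = list(node_ids)
    if source_id not in node_ids or sink_id not in node_ids:
        raise ValueError("node_ids must include the source and sink IDs")
    return "\t".join([source_id, sink_id] + node_ids)


def parse_bubble_line(line):
    """Split one -ub line into (source ID, sink ID, list of node IDs)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected a source ID, a sink ID, and node IDs")
    return fields[0], fields[1], fields[2:]


def format_pattern_line(pattern_type, node_ids):
    """Format one line of a -up file.

    Layout: (pattern type) TAB (all node IDs in the pattern, separated by tabs).
    """
    return "\t".join([pattern_type] + list(node_ids))


if __name__ == "__main__":
    # Hypothetical bubble from node 1 to node 4 passing through nodes 2 and 3.
    print(format_bubble_line("1", "4", ["1", "2", "3", "4"]))
    # "example_type" is a placeholder, not a documented pattern type.
    print(format_pattern_line("example_type", ["5", "6", "7"]))
```

If the files refer to nodes by label rather than by ID, the same layout applies, and the -ubl or -upl option tells the script to interpret the entries as labels.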