Preprocessing Script Options
Running collate.py processes an assembly graph file and produces a SQLite3 .db file that can be loaded in the viewer interface to visualize the graph.
usage: collate.py [-h] -i INPUTFILE -o OUTPUTPREFIX [-d OUTPUTDIRECTORY]
                  [-spqr] [-pg] [-px] [-sp] [-w] [-nt] [-b BICOMPONENTFILE]
                  [-ub USERBUBBLEFILE] [-ubl] [-up USERPATTERNFILE] [-upl]
                  [-nbdf]

Prepares an assembly graph file for visualization, generating a database file
that can be loaded in the MetagenomeScope viewer interface.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFILE, --inputfile INPUTFILE
                        input assembly graph filename (LastGraph, GFA, or
                        MetaCarvel GML)
  -o OUTPUTPREFIX, --outputprefix OUTPUTPREFIX
                        output file prefix for .db and .xdot/.gv files
  -d OUTPUTDIRECTORY, --outputdirectory OUTPUTDIRECTORY
                        directory in which all output files will be stored;
                        defaults to current working directory
  -spqr, --computespqrdata
                        compute data for the SPQR "decomposition modes" in
                        MetagenomeScope; necessitates a few additional system
                        requirements (see wiki for details)
  -pg, --preservegv     save all .gv (DOT) files generated for nontrivial
                        (i.e. containing more than one node, or at least one
                        edge or node group) connected components
  -px, --preservexdot   save all .xdot files generated for nontrivial
                        connected components
  -sp, --structuralpatterns
                        create .txt files in the output directory containing
                        node information for all structural patterns
                        identified in the graph
  -w, --overwrite       overwrite output files
  -nt, --notriangulation
                        disable triangle smoothing in the SPQR mode (this
                        argument is only used if -spqr is passed)
  -b BICOMPONENTFILE, --bicomponentfile BICOMPONENTFILE
                        file containing bicomponent information for the
                        assembly graph (this argument is only used if -spqr is
                        passed; a file containing bicomponent information will
                        be generated if -spqr is passed and this option is not
                        passed)
  -ub USERBUBBLEFILE, --userbubblefile USERBUBBLEFILE
                        file describing pre-identified bubbles in the graph,
                        in the format of MetaCarvel's bubbles.txt output: each
                        line of the file is formatted as (source ID) (tab)
                        (sink ID) (tab) (all node IDs in the bubble, including
                        source and sink IDs, all separated by tabs)
  -ubl, --userbubblelabelsused
                        use node labels instead of IDs in the pre-identified
                        bubbles file
  -up USERPATTERNFILE, --userpatternfile USERPATTERNFILE
                        file describing pre-identified miscellaneous
                        structural patterns in the graph: each line of the
                        file is formatted as (pattern type) (tab) (all node
                        IDs in the pattern, all separated by tabs)
  -upl, --userpatternlabelsused
                        use node labels instead of IDs in the pre-identified
                        misc. patterns file
  -nbdf, --nobackfilldotfiles
                        produces .gv (DOT) files without cluster "backfilling"
                        for each nontrivial connected component in the graph;
                        use of this argument doesn't impact the .db files
                        produced by this script -- it just demonstrates the
                        functionality in layout linearization provided by
                        cluster "backfilling"
The script will always produce a .db file. Certain arguments (-pg, -px, -nbdf, -sp) can be passed to produce more output files; for a thorough description of these arguments, see the command-line argument descriptions.
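As a rough illustration of how these arguments combine, the sketch below assembles a command line for collate.py from Python. The helper name build_collate_command, the "python" interpreter name, and the file names assembly.gfa, e_coli, and output are all placeholders chosen for this example; only the flags themselves come from the usage text.

```python
import shlex


def build_collate_command(input_file, output_prefix, output_dir=None, extra_flags=()):
    """Assemble an argv list for collate.py (hypothetical helper, not part of the script)."""
    # -i and -o are the two required arguments in the usage text.
    cmd = ["python", "collate.py", "-i", input_file, "-o", output_prefix]
    if output_dir is not None:
        # -d is optional; without it, output goes to the current working directory.
        cmd += ["-d", output_dir]
    cmd += list(extra_flags)
    return cmd


# Placeholder file names; -pg, -px and -sp request the extra output files.
cmd = build_collate_command("assembly.gfa", "e_coli", "output", ["-pg", "-px", "-sp"])
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) from the directory containing collate.py, assuming the script's dependencies are installed.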
The script will also generate a few types of auxiliary files containing various information about the structure of the assembly graph. These files are:
- *_links, where * is the output prefix passed via -o. Only one of these files will be generated per execution of collate.py. This file indicates all the edges in the assembly graph. If you pass in -b and the input assembly graph has unoriented contigs, then this file will not be generated (since it would be equivalent to the _single_links file in that case).
- *_single_links, where * is the output prefix passed via -o. This file will only be generated if the input assembly graph has unoriented contigs. In terms of currently supported input filetypes, this means that this file will only be generated when the input assembly graph is of type LastGraph or GFA.
- *_bicmps, where * is the output prefix passed via -o. Only one of these files will be generated per execution of collate.py. This file indicates the various separation pairs contained within the assembly graph (see Nijkamp et al. for a brief overview of separation pairs and their usage in bubble detection). It's possible to pass an existing version of this file to the script using -b, to avoid the work of creating the file again.
- component_D.info, where D is an integer greater than 0. One of these files will be created for every biconnected component contained within the assembly graph: these files indicate the contents of the SPQR tree defined for their corresponding biconnected component.
- spqrD.gml, where D is an integer greater than 0. These files correspond to component_D.info files: they indicate the connections between the metanodes of an SPQR tree.
The script requires all component_D.info and spqrD.gml files to be removed from the output directory before it generates more of them. If -w is enabled, then all existing files with corresponding names in the output directory will be deleted; however, if -w is not enabled, then an error will be raised.
Similarly, if files exist in the output directory with filenames overlapping those of the *_links and *_bicmps files, then those files will be deleted (if -w is enabled) or an error will be raised (if -w is not enabled).
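The delete-or-raise rule described above can be sketched as follows. This is an approximation written for illustration, not the code collate.py actually runs: the function name clear_spqr_files, the glob patterns, and the error message are assumptions.

```python
import glob
import os


def clear_spqr_files(output_dir, overwrite):
    """Delete leftover component_D.info / spqrD.gml files, or raise if overwrite is off.

    Hypothetical helper: the glob patterns approximate the file names described above.
    """
    leftovers = sorted(
        glob.glob(os.path.join(output_dir, "component_*.info"))
        + glob.glob(os.path.join(output_dir, "spqr*.gml"))
    )
    if leftovers and not overwrite:
        # Mirrors the behavior when -w is not passed: refuse to continue.
        raise IOError(
            "Existing SPQR files found (pass -w to overwrite): %s" % ", ".join(leftovers)
        )
    for path in leftovers:
        # Mirrors the behavior when -w is passed: remove the old files first.
        os.remove(path)
    return leftovers
```

Files that do not match either pattern are left untouched, so unrelated content in the output directory is not affected by this check.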
- -i: The input assembly graph file to be used.
  - See the MetagenomeScope README for an up-to-date list of input assembly graph filetypes supported.
- -o: The file prefix to be used for all files generated (with the exception of some SPQR files). As an example, given the argument -o prefix, the file prefix.db would be generated. If .gv and/or .xdot files are created (depending on the -pg or -px arguments, respectively), then those files will be numbered according to the relative size rank (in nodes) of their respective connected component within the assembly graph.
- -d: This optional argument specifies the name of the directory in which all output files will be stored. If this argument is not given, then all files will be generated in the current working directory.
  - If the directory specified here does not already exist, then the preprocessing script will create it. If the directory cannot be created (e.g. there exists a file in the current working directory with the same name as the specified directory), an error will be raised.
- -pg: This optional argument produces DOT files (suffix .gv) in the output directory. As an example, given the arguments -o prefix and -pg for an assembly graph with 3 connected components, the files prefix.db, prefix_1.gv, prefix_2.gv, and prefix_3.gv would be created (where prefix_1.gv indicates the largest connected component by number of nodes, prefix_2.gv indicates the next largest connected component, and so on).
- -px: This optional argument produces .xdot files in the output directory. These files are labelled in an identical fashion to .gv files, with the only difference in naming being the file suffix (.xdot instead of .gv).
- -sp: This optional argument will produce .txt files in the output directory describing the nodes contained in the various types of structural patterns identified in the assembly graph.
  - Each file will be named sp_clustertypes.txt, where clustertypes is one of (bubbles, frayed_ropes, chains, cyclic_chains, misc_patterns).
  - Files will only be created for structural pattern types that were identified in the graph; so if an input assembly graph only contains chains (and no bubbles, frayed ropes, cyclic chains, etc.) then only a file named sp_chains.txt will be produced.
- -w: This optional argument allows the overwriting of output files (.db/.xdot/.gv/links/single_links/bicmps/.info/spqr.gml/structural pattern .txt files). If this argument is not given, then:
  - An error will be raised if writing a .db file would cause another .db file to be overwritten.
  - A warning will be displayed if writing to a .gv or .xdot file would cause another .gv/.xdot file to be overwritten. In this case, the .gv/.xdot file in question simply would not be saved.
  - Note that a directory in the output directory whose name conflicts with an output file (e.g. a directory named e_coli.db/ while attempting to produce a file named e_coli.db) will cause an error/warning to be raised regardless of whether or not -w is set.
  - See this page for details on how this option works, and a few possible boundary conditions.
- -b: This optional argument lets you pass in an existing file indicating the separation pairs in the graph (to be used in the detection of complex bubbles) to the script.
- -ub: This optional argument lets you pass in a file describing pre-identified bubbles in the input graph, which will be automatically highlighted and grouped (as with "normal" bubbles discovered by MetagenomeScope).
  - The format of this file should match MetaCarvel's bubbles.txt output file: each line of the file should be formatted as (source contig ID)\t(sink contig ID)\t(all node IDs in the bubble, including source and sink IDs, all separated by tabs).
  - As with normal MetagenomeScope-identified bubbles, the same contig can't be contained in multiple bubbles. Bubbles specified in the input file here are processed starting from the first line and going down; any bubbles containing already-"used" contigs will be skipped.
  - The contigs contained in these bubbles should at least be contiguous in some fashion. (This will eventually be validated.)
- -ubl: If this optional argument is passed -- and if -ub is passed -- then the pre-identified bubbles file specified by -ub will be processed looking for contig labels instead of IDs.
- -up: Like -ub, this optional argument lets you pass in a file describing pre-identified miscellaneous patterns in the input graph, which will be automatically highlighted and colored.
  - Each line of the file should be formatted as (pattern type)\t(all node IDs in the pattern, separated by tabs).
    - (pattern type) can be any string not containing a tab or newline. It's the name of the pattern, seen when it is selected in the viewer interface.
  - User-defined bubbles, if present, will be processed by MetagenomeScope before user-defined misc. patterns. So if a contig is present in multiple "groups" for whatever reason, the user-defined bubble will be given higher priority.
  - The same contig can't be contained in multiple misc. patterns. As with the user-specified bubbles file, misc. patterns are processed starting from the first line and going down; any patterns containing already-"used" contigs will be skipped.
- -upl: If this optional argument is passed -- and if -up is passed -- then the pre-identified misc. patterns file specified by -up will be processed looking for contig labels instead of IDs.
- -nbdf: If this optional argument is passed, DOT files that don't use backfilling for node groups will be generated in the output directory for each nontrivial standard-mode connected component (see the note below). That is, these files (all with the suffix _nobackfill.gv) will contain all node groups in their respective connected component represented as "clusters" in Graphviz.
  - This differs from the normal way we lay out graphs in standard mode using Graphviz, in which node groups are laid out separately and represented as rectangular nodes in the overall graph layout; these node groups are later "backfilled" to contain their children nodes. Since this makes all node groups "atomic" -- they're represented as nodes, so dot doesn't route any edges through them -- this has the effect of linearizing the graph in many cases.
  - Using the -pg option will produce DOT files that do use backfilling. This is nice if you'd like to compare components' layouts with and without backfilling.
  - Note that this option doesn't actually change the way the .db file is created -- that'll still use backfilling, regardless of this option. All passing -nbdf does is create extra DOT files in the output directory.
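The first-come, first-served rule for user-supplied bubbles (a bubble is skipped if any of its contigs was already claimed by an earlier line) can be sketched as below. The function name select_user_bubbles and the example node IDs are made up for illustration, and treating malformed short lines as skippable is an assumption rather than documented behavior.

```python
def select_user_bubbles(lines):
    """Keep bubbles in file order, skipping any that reuse an already-claimed node.

    Each line follows the documented layout:
    (source ID)<TAB>(sink ID)<TAB>(all node IDs in the bubble, tab-separated).
    """
    used = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # assumption: ignore blank or incomplete lines
        source, sink, nodes = fields[0], fields[1], fields[2:]
        members = set(nodes) | {source, sink}
        if members & used:
            continue  # a contig can't be contained in multiple bubbles
        used |= members
        kept.append((source, sink, nodes))
    return kept


example = [
    "1\t4\t1\t2\t3\t4\n",
    "4\t7\t4\t5\t6\t7\n",  # skipped: node 4 is already in the first bubble
    "8\t11\t8\t9\t10\t11\n",
]
for source, sink, nodes in select_user_bubbles(example):
    print(source, sink, nodes)
```

The same ordering idea would apply to the misc. patterns file, with the first field read as the pattern type instead of a source ID.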
Graphviz seems to round input node dimensions to the nearest point value (where an inch is defined as 72 points). See this issue for details on the rounding process. We don't use these rounded dimensions in the viewer interface, although the rounded dimensions will persist in .xdot files and when Graphviz performs layout on/draws the .gv files produced via -pg. This results in a very slight discrepancy in node sizes between the viewer interface and Graphviz' drawings.
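To give a sense of the size of this discrepancy, the sketch below rounds a dimension given in inches to the nearest point and converts it back. The half-up tie-breaking and the example width of 1.005 inches are assumptions made for illustration; the exact rounding Graphviz applies is described in the linked issue, not here.

```python
import math

POINTS_PER_INCH = 72


def round_to_nearest_point(inches):
    """Round a dimension in inches to the nearest whole point, returned in inches."""
    points = inches * POINTS_PER_INCH
    # Half-up rounding is an assumption; Graphviz's tie handling may differ.
    return math.floor(points + 0.5) / POINTS_PER_INCH


width = 1.005  # arbitrary example width, in inches
rounded = round_to_nearest_point(width)
print(width, rounded, abs(width - rounded))
```

Since rounding to the nearest point moves a dimension by at most half a point, the difference is bounded by 1/144 of an inch under these assumptions.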
To save time and space, we don't actually call Graphviz to lay out connected components containing one node and no edges or node groups. Instead, we position each node in the center of an appropriately-sized connected component, essentially "mimicking" the layout PyGraphviz/Graphviz would have produced in a fraction of the time taken to invoke them. (See this issue for details.) Therefore, .gv/.xdot files for these connected components will not be exported even if -pg or -px is passed.
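The distinction between trivial and nontrivial connected components, and its effect on which .gv files appear, can be sketched as follows. The function names and the (node count, edge count, node group count) tuples are hypothetical, and numbering files by rank among all components (rather than among nontrivial ones only) is an assumption.

```python
def is_nontrivial(node_count, edge_count, node_group_count):
    """A component is nontrivial if it has more than one node, or any edge or node group."""
    return node_count > 1 or edge_count > 0 or node_group_count > 0


def gv_filenames(prefix, components):
    """List the .gv files expected for -pg, largest component (by nodes) first.

    components is a list of (node_count, edge_count, node_group_count) tuples.
    """
    ranked = sorted(components, key=lambda c: c[0], reverse=True)
    return [
        "%s_%d.gv" % (prefix, rank)
        for rank, comp in enumerate(ranked, start=1)
        if is_nontrivial(*comp)  # trivial components are never exported
    ]


# Three components; the single isolated node is trivial and gets no file.
print(gv_filenames("e_coli", [(5, 4, 0), (1, 0, 0), (12, 15, 2)]))
```

Swapping the .gv suffix for .xdot would give the corresponding list for -px, since both are named the same way apart from the suffix.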