Using Cactus Graphs to Build a Pangenome

Susan Tu
Using Cactus Graphs for Multispecies Genomes
Motivation
 Build a reference genome for multiple, related species
 Visualize variation between species
 Display data for several species relative to this reference genome
Prior work
 Minimize the partial order’s weighted symmetric difference
 Minimize the Kendall tau distance (the number of out-of-order pairs)
 But this does not sufficiently penalize the less likely order switches (i.e., between sequences that are far apart)
 What’s better about the new approach: Nguyen et al.’s model explicitly models the double-stranded nature of DNA
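The pair-counting distance mentioned above can be sketched as follows. This is a minimal illustration, assuming two orderings of the same items; the block names and the function name `kendall_tau_distance` are made up for the example. Every discordant pair costs exactly 1, however far apart the two items sit in the ordering.

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count the pairs of items whose relative order differs between two orderings."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    discordant = 0
    for x, y in combinations(order_a, 2):
        # x precedes y in order_a; the pair is discordant if y precedes x in order_b.
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant

# Hypothetical block orderings, for illustration only.
reference = ["A", "B", "C", "D"]
near_swap = ["B", "A", "C", "D"]  # two adjacent blocks exchanged
far_swap = ["D", "B", "C", "A"]   # the two most distant blocks exchanged

print(kendall_tau_distance(reference, near_swap))  # 1
print(kendall_tau_distance(reference, far_swap))   # 5
```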
Problem Formulation
 Let S be the input DNA sequences
 Define an equivalence relation ~ on S (that is, a subset of S x S)
 Review: reflexive, symmetric, transitive
 We enforce strand consistency and strand exclusivity
 Consistency: x ~ y => -y ~ -x
 Exclusivity: x ~ y => neither x ~ -y nor -x ~ y
 We call each equivalence class (each element of S/~) a side, and the forward and reverse complement sides together constitute a block
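One way to maintain such a relation is a union-find structure over signed elements. The sketch below is an assumption made for illustration, not the construction used in the cited papers: elements are nonzero integers, `-x` stands for the reverse complement of `x`, and the class name `StrandUnionFind` is hypothetical. Merging `x ~ y` also merges `-x ~ -y` (consistency), and a merge is refused if it would place an element in the same class as its own reverse complement (exclusivity).

```python
class StrandUnionFind:
    """Union-find over nonzero integers, where -x is the reverse complement of x."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        """Record x ~ y and -x ~ -y. Return False if exclusivity would be violated."""
        if self.find(x) == self.find(-y):
            return False  # x ~ -y already holds, so x ~ y is not allowed
        self.parent[self.find(x)] = self.find(y)
        self.parent[self.find(-x)] = self.find(-y)
        return True

uf = StrandUnionFind()
ok1 = uf.union(1, 2)    # 1 ~ 2, and therefore -1 ~ -2
ok2 = uf.union(2, -3)   # 2 ~ -3, and therefore -2 ~ 3
ok3 = uf.union(1, 3)    # refused: 1 ~ -3 already holds
print(ok1, ok2, ok3)    # True True False
```

Each class found this way plays the role of a side, and a class together with its mirror image (all signs flipped) plays the role of a block.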
Sequence Graphs
 G = (V, E) is a bi-directed sequence graph
 Bi-directed means that each edge is given an orientation at each endpoint
 Each edge is a pair of sides
 A thread path is a sequence of sides such that each consecutive pair of sides is connected by an edge going in that direction
 Transitive sequence graph: add an edge between every pair of sides connected by a thread path
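A thread path check can be sketched under one possible encoding, which is an assumption of this example rather than the papers' notation: a side is a `(block, end)` tuple with `end` either `"L"` or `"R"`, an edge is a `frozenset` of two sides, and a path is a list of signed block ids (positive means the block is traversed left to right, negative means right to left).

```python
def entry_side(oriented_block):
    """Side through which a traversal enters the block."""
    return (abs(oriented_block), "L" if oriented_block > 0 else "R")

def exit_side(oriented_block):
    """Side through which a traversal leaves the block."""
    return (abs(oriented_block), "R" if oriented_block > 0 else "L")

def is_thread_path(path, edges):
    """True if every consecutive pair of blocks is joined by an edge on the right sides."""
    return all(
        frozenset((exit_side(a), entry_side(b))) in edges
        for a, b in zip(path, path[1:])
    )

# Hypothetical graph: block 1 forward, then block 2 forward, then block 3 reversed.
edges = {
    frozenset(((1, "R"), (2, "L"))),
    frozenset(((2, "R"), (3, "R"))),
}

print(is_thread_path([1, 2, -3], edges))   # True
print(is_thread_path([3, -2, -1], edges))  # True: the same path read on the other strand
print(is_thread_path([1, 2, 3], edges))    # False: no edge enters block 3 from the left
```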
Constructing the Cactus Graph
 Merge nodes that are connected only by adjacency or backdoor adjacency edges (the only other kind of edge is a block edge)
 Merge each 3-edge-connected component into a single node
 Merge all leaf nodes and branching nodes of bridge trees into a single node
 (Call the node that contains the backdoor group component the origin node)
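To make the 3-edge-connected step concrete, here is a brute-force sketch. It relies on the fact that two nodes are 3-edge-connected exactly when no removal of at most two edges separates them. The function names and the example graph are hypothetical, and the approach is only practical for tiny graphs; a real implementation would use a linear-time algorithm instead.

```python
from itertools import combinations

def component_labels(nodes, edges, removed):
    """Label each node with a representative of its connected component,
    ignoring the edges whose indices are in `removed`."""
    adj = {n: [] for n in nodes}
    for i, (u, v) in enumerate(edges):
        if i not in removed:
            adj[u].append(v)
            adj[v].append(u)
    label = {}
    for start in nodes:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in label:
                    label[v] = start
                    stack.append(v)
    return label

def three_edge_connected_components(nodes, edges):
    """Group nodes that stay connected under every removal of at most two edges."""
    nodes = list(nodes)
    indices = range(len(edges))
    cuts = [()] + [(i,) for i in indices] + list(combinations(indices, 2))
    signature = {n: [] for n in nodes}
    for cut in cuts:
        label = component_labels(nodes, edges, set(cut))
        for n in nodes:
            signature[n].append(label[n])
    groups = {}
    for n in nodes:
        groups.setdefault(tuple(signature[n]), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical graph: a complete graph on a, b, c, d plus node e hanging off a
# by two parallel edges (so e is only 2-edge-connected to the rest).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("a", "e"), ("a", "e")]
print(three_edge_connected_components(nodes, edges))
# [['a', 'b', 'c', 'd'], ['e']]
```

Each returned group would then be collapsed into a single node.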
Constructing a Pangenome Reference
 A pangenome reference F is a set of non-empty threads such that each block is visited exactly once
 Find the F with the best score, where the score is the sum of the weights of the edges that are consistent with F
 This is an NP-hard problem
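The objective can be illustrated with a simplified scoring function and an exhaustive search. Several simplifications here are assumptions of this sketch and not the papers' exact definitions: F is a single thread given as a list of signed block ids, each weighted edge `(a, b)` means "oriented block a comes before oriented block b", and an edge counts as consistent if the thread, read on either strand, places a before b with those orientations. The exhaustive search tries every order and orientation, which is only feasible for a handful of blocks and is why a heuristic is needed in practice.

```python
from itertools import permutations, product

def score(thread, weights):
    """Sum the weights of the ordered pairs (a, b) that the thread agrees with."""
    forward = {block: i for i, block in enumerate(thread)}
    # Reading the other strand reverses the order and flips every orientation.
    backward = {-block: i for i, block in enumerate(reversed(thread))}
    total = 0
    for (a, b), w in weights.items():
        for reading in (forward, backward):
            if a in reading and b in reading and reading[a] < reading[b]:
                total += w
    return total

def best_reference(blocks, weights):
    """Exhaustively search all orders and orientations (tiny inputs only)."""
    best, best_score = None, float("-inf")
    for order in permutations(blocks):
        for signs in product((1, -1), repeat=len(order)):
            thread = [s * b for s, b in zip(signs, order)]
            current = score(thread, weights)
            if current > best_score:
                best, best_score = thread, current
    return best, best_score

# Hypothetical edge weights, for illustration only.
weights = {(1, 2): 3, (2, 3): 2, (2, -3): 1, (3, 1): 1}
print(best_reference([1, 2, 3], weights))  # ([1, 2, 3], 5)
```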
Heuristic for the Pangenome Reference Problem
 The cactus graph represents the sequence graph in hierarchical form
 Create a pangenome reference independently for each net of the cactus graph (this can be done in parallel)
 Solve each subproblem using a greedy algorithm and simulated annealing
 Greedy: repeatedly add an element of V, choosing the element and the insertion point that maximize consistency with the elements already in F
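The greedy step can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: F is a single thread of signed block ids, `weights` maps an ordered pair of oriented blocks `(a, b)` to the weight of "a before b", and consistency is measured by the hypothetical helper `score`. The simulated annealing refinement is not shown.

```python
def score(thread, weights):
    """Sum the weights of the ordered pairs (a, b) that the thread agrees with,
    reading the thread on either strand."""
    forward = {block: i for i, block in enumerate(thread)}
    backward = {-block: i for i, block in enumerate(reversed(thread))}
    total = 0
    for (a, b), w in weights.items():
        for reading in (forward, backward):
            if a in reading and b in reading and reading[a] < reading[b]:
                total += w
    return total

def greedy_reference(blocks, weights):
    """Grow a thread one block at a time, always taking the (block, orientation,
    insertion point) combination that gives the highest score so far."""
    thread, remaining = [], set(blocks)
    while remaining:
        best = None
        for block in sorted(remaining):
            for oriented in (block, -block):
                for i in range(len(thread) + 1):
                    candidate = thread[:i] + [oriented] + thread[i:]
                    current = score(candidate, weights)
                    if best is None or current > best[0]:
                        best = (current, candidate, block)
        _, thread, chosen = best
        remaining.remove(chosen)
    return thread

# Hypothetical edge weights, for illustration only.
weights = {(1, 2): 3, (2, 3): 2, (2, -3): 1, (3, 1): 1}
print(greedy_reference([1, 2, 3], weights))  # [1, 2, 3]
```

Ties are broken by the iteration order (smallest block id, forward orientation, leftmost position), which is an arbitrary choice made for this sketch.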
Multi-level Cactus Graphs
 A collection of cactus graphs connected in the shape of a tree
 The levels represent progressively more detailed alignments
Maximum weight Cactus subgraph with large chains problem
 This is for constructing the initial cactus graph (we will make it multilevel later)
 Find a cactus graph such that all chains have length >= alpha and the weight is maximal
 Length of a chain: the number of block edges it contains
 Weight of a cactus graph: the sum of the weights of its block edges
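The two definitions above translate directly into code. The sketch below only checks the chain-length constraint and evaluates the objective for a candidate graph; it does not search for the optimal subgraph. Representing a cactus graph as a list of chains, each chain being a list of block-edge weights, is an assumption made for this example.

```python
def chain_length(chain):
    """Length of a chain: the number of block edges it contains."""
    return len(chain)

def is_feasible(chains, alpha):
    """True if every chain has length at least alpha."""
    return all(chain_length(chain) >= alpha for chain in chains)

def cactus_weight(chains):
    """Weight of the graph: the sum of the weights of all its block edges."""
    return sum(sum(chain) for chain in chains)

# Hypothetical candidate with three chains; each number is a block-edge weight.
candidate = [[5, 3, 2], [4, 4], [1]]
alpha = 2

print(is_feasible(candidate, alpha))  # False: the chain [1] is too short
print(cactus_weight(candidate))       # 19
```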
Minimizing entropy of multilevel cactus graph
Results
References
 Paten B, Diekhans M, Earl D, St. John J, Ma J, Suh B, Haussler D. Cactus graphs for genome comparisons. Journal of Computational Biology. 2011;18(3):469-481. doi:10.1089/cmb.2010.0252.
 Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Research. 2011;21(9):1512-1528. doi:10.1101/gr.123356.111.
 Nguyen N, Hickey G, Zerbino DR, Raney B, Earl D, Armstrong J, Kent WJ, Haussler D, Paten B. Building a pan-genome reference for a population. Journal of Computational Biology. 2015;22(5):387-401. doi:10.1089/cmb.2014.0146.