
GraphLab: how I understood it, with sample code
Aapo Kyrola, Carnegie Mellon Univ.
Oct 1, 2009
To test if I got your idea…
•  … I created two imaginary GraphLab sample applications using an imaginary GraphLab Java API
•  Is this how you imagined GraphLab applications would look?
Technology layers
•  GraphLab API
   –  Defined and maintained by us
   –  (Analogous to the OpenGL API, maintained by the Khronos Group: glVertex3f(a,b,c), glTranslatef(…))
•  GraphLab Engine
   –  Reference implementation done by us
   –  Others encouraged to implement their own
   –  (Analogous to OpenGL graphics card drivers by Nvidia, ATI, …, which interface with their hardware)
Contents
1.  GraphLab sample code for belief propagation-based inference
    •  ML practitioner's (end-user's) point of view
    •  What happens in the Engine?
2.  Sample code for stochastic matrix eigenvector calculation by iteration
    •  Issue with syncs and aggregation functions
Note about BP
•  Bishop's text uses BP on a bipartite graph (variable + factor nodes), while Koller's book uses cluster factor graphs
   –  I will use Koller's representation because it is simpler
[Figure: (a) a small network over variables A, B, C, D; (b) the clusters A,B,D and B,C,D joined over their shared variables B, D; (c) a cluster factor graph with factors 1: A,B; 2: B,C; 3: C,D; 4: A,D]
Sample program
•  User has a huge Bayes network that models the weather in the USA
•  He knows it is 37 °F in Philadelphia, that it rained yesterday in Pittsburgh, and that it is October (evidence)
   –  What is the probability of rain in Pittsburgh today?
•  See main() below (no GraphLab stuff here yet)
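The slide's main() is shown only as an image, so here is a minimal sketch of what it might contain; BayesNetwork, Evidence, and the variable names are hypothetical, not part of any real API.

// Sketch only: builds the model and the evidence, no GraphLab calls yet.
public class WeatherExample {
    public static void main(String[] args) {
        // Load the huge Bayes network modeling weather in the USA.
        BayesNetwork net = BayesNetwork.loadFromFile("usa_weather.net");

        // Record the evidence mentioned on the slide.
        Evidence evidence = new Evidence();
        evidence.observe("temperature_philadelphia", 37.0);
        evidence.observe("rain_pittsburgh_yesterday", true);
        evidence.observe("month", "october");
        net.setEvidence(evidence);
        // The query P(rain in Pittsburgh today | evidence) is answered
        // after inference, on the following slides.
    }
}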
Initialization of BP
•  Create a cluster factor graph with special nodes for Belief Propagation (they have storage for messages; edges contain the variables shared between factors)
•  BayesNetwork and ClusterFactorGraph are classes defined by the GraphLab API and/or extend some more abstract Graph class
•  This implicitly marks each node 'dirty' (the engine will add it to the task queue)
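A sketch of this initialization step, assuming hypothetical API names ClusterFactorGraph.fromBayesNetwork and BPFactorNode:

// Build the cluster factor graph from the Bayes network; the BP-specific
// nodes store incoming messages, and each edge records the variables
// shared by the two factors it connects.
ClusterFactorGraph graph = ClusterFactorGraph.fromBayesNetwork(net);
// Constructing the graph implicitly marks every node 'dirty', so the
// engine will place all nodes on the initial task queue.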
Node function (kernel)
•  To run the distributed BP algorithm, we need to define a function (kernel) that runs on each factor node whenever the factor is "dirty" (the task queue itself is not visible to the user?)
•  Only if a message changes significantly do we send it. Sending a message flags the recipient as dirty -> it will be added to the task queue.
•  Note: an edge might be remote or local, depending on the graph partitioning. The kernel may or may not care about this. (For example, the threshold could be higher for remote edges?)
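A sketch of such a kernel; NodeFunction, BPFactorNode, Edge, Message, and IterationContext are assumed names from the imaginary Java API, and the thresholds are made up:

public class BPKernel implements NodeFunction<BPFactorNode> {
    private static final double THRESHOLD = 1e-4;  // assumed tolerance

    @Override
    public void update(BPFactorNode factor, IterationContext ctx) {
        for (Edge edge : factor.outEdges()) {
            // Combine the incoming messages (except the one arriving on
            // this edge) and marginalize onto the variables on the edge.
            Message msg = factor.computeMessageFor(edge);

            // Send only on a significant change; sending flags the recipient
            // dirty, adding it to the task queue. A remote edge uses a
            // looser threshold to save communication.
            double limit = edge.isRemote() ? 10 * THRESHOLD : THRESHOLD;
            if (msg.distanceTo(edge.lastSentMessage()) > limit) {
                edge.send(msg);
            }
        }
    }
}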
Executing the engine
•  User can execute the algorithm on different GraphLab implementations
   –  Data cluster version, multicore version, GPU version, DNA computer version, quantum computer, etc.
•  Advanced users can customize the graph partitioning algorithm, scheduling priority, timeout, etc.
   –  For example, loopy BP may not converge everywhere but may still be usable?? We need a timeout or relaxed convergence criteria.
•  After lightning-fast computation, we have a calibrated belief network. We can use it to ask for marginal distributions efficiently.
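How launching this might look in code; GraphLabEngine, EngineConfig, and every option below are hypothetical names, not a real interface:

// Configure and run; advanced users could swap the partitioner, the
// scheduling priority, or the stopping rule.
EngineConfig config = new EngineConfig()
        .partitioner(Partitioners.MIN_EDGE_CUT)
        .timeoutSeconds(600)        // guard against non-converging loopy BP
        .convergenceLimit(100);     // or stop when < 100 dirty nodes remain

GraphLabEngine engine = GraphLabEngine.create("multicore", config);
engine.run(graph, new BPKernel());

// The calibrated graph can now answer marginal queries efficiently.
Distribution rainToday = graph.marginal("rain_pittsburgh_today");
System.out.println(rainToday);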
What does the Engine do?
1.  Client sends the graph data and the functions to be run to the Computation Parent
    1.  How is the code delivered? Good question. In Java, it is easy to send class files.
2.  Graph is partitioned into logical partitions (minimizing the number of links between partitions)
    1.  Edges that cross partitions are made into remote edges
    2.  Each CPU is assigned one or more logical partitions by the Computation Parent
3.  In each logical partition, computation is done sequentially
    1.  At the beginning of each iteration, the partition collects the dirty nodes (-> taskqueue(T))
    2.  … and calls the node function on each dirty node sequentially
    3.  This results in a new set of dirty nodes (-> taskqueue(T+1))
        •  Via remote edges, nodes in other partitions are flagged dirty
4.  Computation Parent monitors the number of dirty nodes in each logical partition
    1.  When the dirty count is zero or under a defined limit, the computation is finished.
5.  The graph state at the end of the computation is sent back to the client.
The next example, eigenvector calculation, shows how we can compute partition-level accumulative functions efficiently and deliver them to the central unit.
A posteriori note: in this model, nodes are not able to read from other nodes. Instead, they can send data to other nodes, which can then cache this information.
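A sketch of one partition's inner loop in a hypothetical engine implementation; LogicalPartition, Node, NodeFunction, and IterationContext are assumed names:

// One iteration inside a single logical partition.
void runIteration(LogicalPartition partition, NodeFunction<Node> kernel) {
    // Collect the dirty nodes for this iteration: taskqueue(T).
    List<Node> taskQueue = partition.collectDirtyNodes();

    // Call the node function on each dirty node sequentially. Messages sent
    // during an update flag their recipients dirty, forming taskqueue(T+1);
    // messages over remote edges flag nodes in other partitions.
    for (Node node : taskQueue) {
        kernel.update(node, partition.context());
    }

    // Report the new dirty count so the Computation Parent can decide
    // whether the computation is finished.
    partition.reportDirtyCount();
}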
Stochastic Matrix Eigenvector
•  Task: to iterate x = Mx, where x is a probability distribution over (X1..Xn) and M is a stochastic matrix (a Markov transition matrix), until we reach convergence (a "fixed point")
   –  Existence of the eigenvector (limit distribution) is guaranteed in all but pathological cases? (= periodic chains?)
•  Running the iteration in parallel is not stable because of "feedback loops"
   –  In serial computation, |Mx| = 1 (the norm is the L1 norm, right?)
   –  A normalization factor is needed to keep the computation under control
   –  But the calculation of |Mx| needs input from all Xi synchronously
      •  Syncing is costly, so we want to do it infrequently
      –  How well is this effect studied? Are there runaway problems?
[Figure: two players in a Markov chain, talking]
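For reference, the serial version of the iteration in plain Java (this is not GraphLab code, just the baseline the distributed version should reproduce):

// Iterate x = Mx with explicit L1 renormalization until a fixed point.
static double[] powerIterate(double[][] M, double[] x, double tol) {
    int n = x.length;
    while (true) {
        double[] next = new double[n];
        double norm = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                next[i] += M[i][j] * x[j];
            }
            norm += next[i];  // L1 norm; exactly 1 for a column-stochastic M
        }
        double change = 0.0;
        for (int i = 0; i < n; i++) {
            next[i] /= norm;  // renormalize to keep the iteration in control
            change += Math.abs(next[i] - x[i]);
        }
        x = next;
        if (change < tol) {
            return x;         // converged to the limit distribution
        }
    }
}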
Normalization
1.  Each logical partition has its own SumAccumulator
    1.  This is passed to each node during node function computation. The node discounts its previous value and adds its new one (=> we do not need to enumerate all nodes to get an updated sum)
2.  After an iteration, the partition sends its accumulator value to the Computation Parent, which has its own SumAccumulator
    •  Amount of remote accumulator communication = N (number of partitions)
3.  Before each iteration, the partition queries the parent for the current value of the normalization factor. This is passed to all nodes when the node function is computed.
    •  If the normalization factor changes significantly, all nodes are renormalized.
But does it work? Good question!
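The incremental sum is the key trick; a sketch of what the hypothetical SumAccumulator could look like:

public class SumAccumulator {
    private double sum = 0.0;

    // A node discounts its previous value and adds the new one, so the
    // running sum stays correct without enumerating all nodes.
    public void replace(double oldValue, double newValue) {
        sum += newValue - oldValue;
    }

    public double value() {
        return sum;
    }
}
// Per iteration: every node update calls replace(...) on its partition's
// accumulator; the partition then reports value() to the Computation
// Parent (N messages for N partitions), which combines them into the
// global |Mx| used as the next normalization factor.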
Initialization
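The slide's initialization code is shown only as an image; a guess at its shape, with StochasticGraph and ProbabilityNode as hypothetical API names:

// Build the Markov-chain graph and start from the uniform distribution.
StochasticGraph chain = StochasticGraph.fromTransitionMatrix(M);
int n = chain.nodeCount();
for (ProbabilityNode node : chain.nodes()) {
    node.setValue(1.0 / n);
}
// As with BP, constructing the graph marks every node dirty, so the
// first iteration touches all of them.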
Node Function
Invokes updates on outbound nodes only if its value changed significantly. As the computation converges, there are fewer and fewer dirty nodes.
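A sketch of that node function in the same hypothetical API; recall that a node cannot read its neighbors directly, it only sees values cached from earlier messages:

public class EigenvectorKernel implements NodeFunction<ProbabilityNode> {
    private static final double THRESHOLD = 1e-6;  // assumed tolerance

    @Override
    public void update(ProbabilityNode node, IterationContext ctx) {
        // Weighted sum of the cached inbound values, divided by the
        // normalization factor fetched before this iteration.
        double newValue = 0.0;
        for (Edge in : node.inEdges()) {
            newValue += in.weight() * in.cachedSourceValue();
        }
        newValue /= ctx.normalizationFactor();

        // Keep the partition's SumAccumulator up to date incrementally.
        ctx.accumulator().replace(node.value(), newValue);

        // Update outbound nodes only on a significant change; sending a
        // value updates the recipient's cache and flags it dirty, so near
        // convergence fewer and fewer nodes stay dirty.
        if (Math.abs(newValue - node.value()) > THRESHOLD) {
            node.setValue(newValue);
            for (Edge out : node.outEdges()) {
                out.sendValue(newValue);
            }
        }
    }
}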
Partition Interceptor
The interceptor idea is borrowed from certain web application frameworks.
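A sketch of an interceptor that handles the normalization bookkeeping around each iteration; PartitionInterceptor and LogicalPartition are assumed names:

public class NormalizationInterceptor implements PartitionInterceptor {
    @Override
    public void beforeIteration(LogicalPartition partition) {
        // Query the Computation Parent for the current normalization factor
        // and hand it to the nodes via the iteration context.
        double norm = partition.parent().currentNormalization();
        partition.context().setNormalizationFactor(norm);
    }

    @Override
    public void afterIteration(LogicalPartition partition) {
        // Deliver this partition's accumulator value to the parent:
        // one remote message per partition per iteration.
        partition.parent().report(partition.context().accumulator().value());
    }
}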
Computation Parent code
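The Computation Parent's side of the same protocol might look like this (all names hypothetical):

public class ComputationParent {
    private double pendingSum = 0.0;     // partition reports for this round
    private double normalization = 1.0;  // factor handed out next round

    // Partitions report their accumulator values after each iteration.
    public synchronized void report(double partitionSum) {
        pendingSum += partitionSum;
    }

    // Partitions query this before starting their next iteration.
    public synchronized double currentNormalization() {
        return normalization;
    }

    // Called once all partitions have reported for the iteration.
    public synchronized void finishIteration() {
        normalization = pendingSum;  // |Mx| summed over all partitions
        pendingSum = 0.0;
    }
}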
Putting it together…
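Finally, a sketch of the whole pipeline wired together from the pieces above; everything here is the imaginary API, not real GraphLab code:

public static void main(String[] args) {
    double[][] M = TransitionMatrix.load("chain.txt");  // hypothetical loader
    StochasticGraph chain = StochasticGraph.fromTransitionMatrix(M);

    EngineConfig config = new EngineConfig()
            .interceptor(new NormalizationInterceptor())
            .convergenceLimit(0);  // run until no dirty nodes remain

    GraphLabEngine engine = GraphLabEngine.create("multicore", config);
    engine.run(chain, new EigenvectorKernel());

    // Node values now approximate the limit distribution x with x = Mx.
    for (ProbabilityNode node : chain.nodes()) {
        System.out.println(node.id() + ": " + node.value());
    }
}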