GraphLab: how I understood it with sample code
Aapo Kyrola, Carnegie Mellon Univ.
Oct 1, 2009

To test if I got your idea…
•  … I created two imaginary GraphLab sample applications using an imaginary GraphLab Java API
•  Is this how you imagined GraphLab applications would look?

Technology layers

GraphLab
•  GraphLab API
   –  Defined and maintained by us
•  GraphLab Engine
   –  Reference implementation done by us
   –  Others encouraged to implement their own

OpenGL
•  OpenGL API
   –  Maintained by Khronos group
   •  glVertex3f(a,b,c)
   •  glTranslatef(…)
•  OpenGL graphics card drivers
   –  By Nvidia, ATI, …; interface with their hardware

Contents
1.  GraphLab sample code for belief propagation-based inference
    •  ML practitioner’s (end-user’s) point of view
    •  What happens in the Engine?
2.  Sample code for stochastic matrix eigenvector calculation by iteration
    •  Issue with syncs and aggregation functions

Note about BP
•  Bishop’s text uses BP on a bipartite graph (variable + factor nodes), while Koller’s book uses cluster factor graphs
   –  I will use Koller’s representation because it is simpler
[Figure, three panels: (a) a graph over variables A, B, C, D; (b) clusters {A, B, D} and {B, C, D} sharing {B, D}; (c) clusters 1: A, B; 2: B, C; 3: C, D; 4: A, D]

Sample program
•  User has a huge Bayes network that models weather in the USA
•  He knows it is 37F in Philadelphia and it rained yesterday in Pittsburgh, and it is October (evidence)
   –  What is the probability of rain in Pittsburgh today?
•  See main() below (no GraphLab stuff here yet)

Initialization of BP
•  Create a cluster factor graph with special nodes for Belief Propagation (they have storage for messages; edges contain the variables shared between factors)
BayesNetwork and ClusterFactorGraph are classes defined by the GraphLab API and/or extend some more abstract Graph class.
This implicitly marks each node ‘dirty’ (the engine will then add it to the task queue).

Node function (kernel)
•  To run the distributed BP algorithm, we need to define a function (kernel) that runs on each factor node whenever the factor is “dirty” (task queue is not visible?)
We send a message only if it has changed significantly. Sending a message flags the recipient as dirty -> it will be added to the task queue.
Note: an edge might be remote or local, depending on graph partitioning. The kernel may or may not care about it. (For example, the threshold could be higher for remote edges?)

Executing the engine
•  User can execute the algorithm on different GraphLab implementations
   –  Data cluster version, multicore version, GPU version, DNA computer version, quantum computer etc.
•  Advanced users can customize graph partitioning algorithm, scheduling priority, timeout etc.
   –  For example, loopy BP may not converge everywhere, but still be usable?? Need a timeout or relaxed convergence criteria.
After lightning fast computation, we have a calibrated belief network. We can use this to efficiently query marginal distributions.

What does the Engine do?
1.  Client sends the graph data and the functions to be run to the Computation Parent
    1.  How is the code delivered? Good question. In Java, it is easy to send class files.
2.  Graph is partitioned into logical partitions (minimizing the number of links between partitions)
    1.  Edges that cross partitions are made into remote edges
    2.  Each CPU is assigned one or more logical partitions by the Computation Parent
3.  In each logical partition, computation is done sequentially
    1.
        At the beginning of each iteration, the partition collects the dirty nodes (-> taskqueue(T))
    2.  … and calls the node function on each dirty node sequentially
    3.  This results in a new set of dirty nodes (-> taskqueue(T+1))
        •  via remote edges, nodes in other partitions are flagged dirty
4.  Computation Parent monitors the number of dirty nodes in each logical partition
    1.  When the dirty count is zero or under a defined limit, computation is finished.
5.  Graph state at the end of computation is sent back to the client.
Side note: the next example, eigenvalue calculation, shows how we can calculate partition-level accumulative functions efficiently and deliver them to the central unit.
Note: in this model, nodes are not able to read from other nodes. Instead they can send data to other nodes, which can then cache this information.

A posteriori Stochastic Matrix Eigenvector
•  Task: to iterate x = Mx, where x is a probability distribution over (X1..Xn) and M is a stochastic matrix (Markov transition matrix), until we reach convergence (“fixed point”)
   –  Existence of eigenvector (limit distribution) is guaranteed in all but pathological cases? (= periodic chains?)
•  Running the iteration in parallel is not stable because of “feedback loops”
   –  In serial computation, |Mx| = 1 (norm is L1 norm, right?)
   –  A normalization factor is needed to keep the computation under control
   –  But calculation of |Mx| needs input from all Xi synchronously
•  Sync is costly, so we want to do this infrequently
   –  How well is the effect studied? Are there some runaway problems?

Two players in Markov’s chain talking

Normalization
1.  Each logical partition has its own SumAccumulator
    1.  This is passed to each node when the node function is computed. The node discounts its previous value and adds the new one (=> we need not enumerate all nodes to get an updated sum)
2.  After an iteration, the partition sends its accumulator value to the Computation Parent, which has its own SumAccumulator
    •  Amount of remote accumulator communication = N (number of partitions)
3.
    Before each iteration, the partition queries the parent for the current value of the normalization factor. This is passed to all nodes when the node function is computed.
    •  If the normalization factor changes significantly, all nodes are renormalized.
But does it work? Good question!

Initialization

Node Function
Invokes update on outbound nodes only if its value changed significantly. When converging, there are fewer and fewer dirty nodes.

Partition Interceptor
The interceptor idea is copied from certain web application frameworks

Computation Parent code

Putting it together…
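
The normalization scheme from the eigenvector example can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the GraphLab API (which is imaginary in these slides anyway): it is written in Python rather than Java, and apart from SumAccumulator, which the slides name, every identifier (update, stationary, partitions, tol, max_iters) is hypothetical. M is assumed to be column-stochastic, with M[i][j] the probability of moving from state j to state i. Partitions are plain lists of node indices, and the "Computation Parent" is just the line that adds up one number per partition.

```python
class SumAccumulator:
    """Running sum kept up to date incrementally (name from the slides)."""

    def __init__(self):
        self.value = 0.0

    def update(self, old, new):
        # Discount the node's previous value and add the new one, so the
        # sum never has to be recomputed by enumerating all nodes.
        self.value += new - old


def stationary(M, x0, partitions, tol=1e-10, max_iters=10_000):
    """Iterate x = Mx with per-partition sums and a dirty-node set.

    Hypothetical sketch: M[i][j] = P(next = i | current = j).
    """
    n = len(x0)
    x = list(x0)
    accs = [SumAccumulator() for _ in partitions]
    for acc, part in zip(accs, partitions):
        for i in part:
            acc.update(0.0, x[i])
    dirty = set(range(n))  # initialization marks every node dirty

    for _ in range(max_iters):
        if not dirty:
            break  # no dirty nodes left: computation is finished
        # "Computation Parent": combine one number per partition.
        norm = sum(acc.value for acc in accs)
        if abs(norm - 1.0) > tol:
            # Normalization factor changed significantly:
            # renormalize all nodes and recompute everything.
            for acc, part in zip(accs, partitions):
                for i in part:
                    new = x[i] / norm
                    acc.update(x[i], new)
                    x[i] = new
            dirty = set(range(n))
        snapshot = list(x)  # values the nodes "received" last iteration
        next_dirty = set()
        for acc, part in zip(accs, partitions):
            for i in part:  # sequential inside a partition
                if i not in dirty:
                    continue
                new = sum(M[i][j] * snapshot[j] for j in range(n))
                if abs(new - x[i]) > tol:
                    # Significant change: flag the nodes fed by node i.
                    next_dirty.update(
                        k for k in range(n) if M[k][i] != 0.0
                    )
                acc.update(x[i], new)
                x[i] = new
        dirty = next_dirty
    return x


if __name__ == "__main__":
    M = [
        [0.5, 0.25, 0.0],
        [0.5, 0.5, 0.5],
        [0.0, 0.25, 0.5],
    ]
    # Unnormalized start vector; two partitions {0, 1} and {2}.
    print(stationary(M, [3.0, 0.0, 1.0], [[0, 1], [2]]))
```

For this small chain the exact limit distribution is (0.25, 0.5, 0.25), and the sketch stops once no node changes by more than tol. Because every partition reads a snapshot of the previous iteration, this version is effectively synchronous; it does not answer the slides' open question of how stale normalization behaves under truly asynchronous updates.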