What is Computer Science About? Part 2: Algorithms

Design and Analysis of Algorithms
• Why we study algorithms:
– many tasks can be reduced to abstract
problems
– if we can recognize them, we can use known
solutions
• example: Graph Algorithms
– graphs could represent friendships
among people, or adjacency of states on
a map, or links between web pages...
– determining connected components
(reachability)
– finding shortest path between two points
• MapQuest; Traveling Salesman Problem
– finding cliques
• completely connected sub-graphs
– uniquely matching up pairs of nodes
• e.g. a buddy system based on friendships
– determining whether 2 graphs have
same connectivity (isomorphism)
• useful for visual shape recognition (e.g. tanks
from aerial photographs+edge detection)
– finding a spanning tree (acyclic tree that
touches all nodes)
• minimal-cost communication networks
Kruskal’s Algorithm for Minimum-Spanning Trees
// input: graph G with a set of vertices V
// and edges (u,v) with weights (lengths)
KRUSKAL(G):
  A = ∅
  foreach vᵢ ∈ V: cluster[vᵢ] ← i // singletons
  foreach edge (u,v) ordered by increasing weight:
    if cluster[u] ≠ cluster[v]:
      A = A ∪ {(u, v)}
      c ← cluster[u] // remember u's old label before merging
      foreach w ∈ V:
        if cluster[w] = c:
          cluster[w] ← cluster[v] // merge
  return A // subset of edges
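The pseudocode above can be turned into a short runnable sketch. This version assumes the graph is given as a list of vertex names and a list of (weight, u, v) edge tuples; the `cluster` dict plays the role of the per-vertex cluster labels on the slide (a union-find structure would be faster in practice).

```python
def kruskal(vertices, edges):
    # edges: list of (weight, u, v); vertices: list of hashable node names
    cluster = {v: i for i, v in enumerate(vertices)}  # singletons
    A = []                                            # edges of the MST
    for weight, u, v in sorted(edges):                # increasing weight
        if cluster[u] != cluster[v]:                  # different components?
            A.append((u, v))
            old, new = cluster[u], cluster[v]
            for w in vertices:                        # merge the two clusters
                if cluster[w] == old:
                    cluster[w] = new
    return A

# usage: a 4-cycle a-b-c-d-a plus one diagonal a-c
edges = [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd'), (4, 'd', 'a'), (5, 'a', 'c')]
print(kruskal(['a', 'b', 'c', 'd'], edges))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```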
• it is greedy
• is it correct? (does it always produce an MST?)
• is it efficient? (how long does it take?)
• characterize algorithms in terms of efficiency
– note: we count number of steps, rather than seconds
• wall-clock time is dependent on machine, compiler, load, etc...
• however, optimizations are important for real-time sys., games
– are there faster ways to sort a list? invert a matrix? find a
completely connected sub-graph?
– scalability for larger inputs (think: human genome): how
much more time/memory does the algorithm take?
– polynomial vs. exponential run-time (in the worst case)
• depends a lot on the data structure (representation)
– hash tables, binary trees, etc. can help a lot
• proofs of correctness
– can you prove Euclid’s algorithm is correct?
– can you prove an algorithm will guarantee to output the
longest palindrome in a string?
– is the code for billing long-distance calls correct?
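As a small illustration of the Euclid question above: the algorithm itself is a few lines, and the key step in a correctness proof is the loop invariant gcd(a, b) = gcd(b, a mod b).

```python
def gcd(a, b):
    # Euclid's algorithm: the invariant gcd(a, b) == gcd(b, a % b)
    # means the gcd of the pair never changes across iterations,
    # and b strictly decreases, so the loop terminates.
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(1071, 462))  # 21
```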
• Why do we care so much about polynomial run-time?
– consider 2 programs that take an input of size n (e.g. length
of a string, number of nodes in graph, etc.)
– run-time of one scales up as n2 (polynomial), and the other
as 2n (exponential)
[chart: n² vs. 2ⁿ for n = 0..10]
• Why do we care so much about polynomial run-time?
– consider 2 programs that take an input of size n (e.g. length
of a string, number of nodes in graph, etc.)
– run-time of one scales up as n2 (polynomial), and the other
as 2n (exponential)
– exponential algorithms are effectively “unsolvable” for n>~16
even if we used computers that were 100 times as fast!
[charts: n² vs. 2ⁿ for n = 0..10, and for n = 0..25 — the exponential curve hits a computational “cliff”]
n    n²    2ⁿ
1    1     2
2    4     4
3    9     8
4    16    16
5    25    32
6    36    64
7    49    128
8    64    256
9    81    512
10   100   1024
11   121   2048
12   144   4096
13   169   8192
14   196   16384
15   225   32768
16   256   65536
17   289   131072
18   324   262144
19   361   524288
20   400   1048576
21   441   2097152
22   484   4194304
23   529   8388608
24   576   16777216
25   625   33554432
helpful rules of thumb:
2¹⁰ ~ 1 thousand (1,024)
2²⁰ ~ 1 million (1,048,576)
2³⁰ ~ 1 billion (1,073,741,824)
2³² = 4,294,967,296 ~ 4 billion
(so a 32-bit address space spans ~4 GB)
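These growth rates and rules of thumb are easy to verify directly:

```python
# n^2 vs. 2^n: the exponential dwarfs the polynomial well before n = 25
for n in (10, 20, 25):
    print(n, n**2, 2**n)

# the powers-of-two rules of thumb
print(2**10, 2**20, 2**30, 2**32)  # ~1 thousand, ~1 million, ~1 billion, ~4 billion
```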
Moore’s Law
(named after Gordon Moore, founder of Intel)
• Number of
transistors on CPU
chips appears to
double about once
every 18 months
• Similar statements
hold for CPU speed,
network bandwidth,
disk capacity, etc.
• but waiting a couple
years for computers
to get faster is not an
effective solution to
NP-hard problems
[chart: transistor counts per chip over time — Motorola 6800, 80486, Pentium-4, Dual Core Itanium; source: Wikipedia]
P vs. NP
(CSCE 411)
• problems in “P”: solvable in polynomial time with a
deterministic algorithm
– examples: sorting a list, inverting a matrix...
• problems in “NP”: solvable in polynomial time with
a non-deterministic algorithm
– given a “guess”, can check if it is a solution in
polynomial time
– no polynomial-time algorithm is known; enumerating and
checking all the guesses would take exponential time in
the worst case
– example: given a set of k vertices in a graph, can check
if they form a completely connected clique; but there
are exponentially many possible sets to choose from
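The clique example above is easy to make concrete: checking a single guess of k vertices takes only O(k²) pairwise tests, even though the number of possible guesses is exponential. A sketch, assuming the graph is given as a set of edge tuples:

```python
from itertools import combinations

def is_clique(vertices, edges):
    # polynomial-time verification of one "guess":
    # every pair of the guessed vertices must be connected
    return all((u, v) in edges or (v, u) in edges
               for u, v in combinations(vertices, 2))

edges = {('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')}
print(is_clique(['a', 'b', 'c'], edges))  # True  — a 3-clique
print(is_clique(['b', 'c', 'd'], edges))  # False — b and d are not adjacent
```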
P vs. NP
(CSCE 411)
• most computer scientists believe P≠NP,
though it has yet to be rigorously proved
• what does this mean?
– that there are intrinsically “hard” problems for
which a polynomial-time algorithm will never be
found
[diagram: nested complexity classes —
P: sorting a list, inverting a matrix, minimum-spanning tree...
NP: graph clique, subset cover, Traveling Salesman Problem,
satisfiability of Boolean formulas, factoring of integers...
beyond NP: even harder problems (complexity classes)]
• Being able to recognize whether a problem is in
P or NP is fundamentally important to a computer
scientist
• Many combinatorial problems are in NP
– knapsack problem (given n items with sizes wᵢ and values
vᵢ, choose items that fit into a knapsack of limited
capacity L while maximizing total value)
– traveling salesman problem (shortest circuit visiting
every city)
– scheduling – e.g. of machines in a shop to minimize a
manufacturing process
• Finding the shortest path in a graph between 2
nodes is in P
– there is an algorithm that scales up polynomially with the
size of the graph: Dijkstra’s algorithm
– however, finding the longest path is NP-hard! (hence we
do not expect there are complete and efficient solutions
for all cases)
• Applications to logistics, VLSI circuit layout...
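A runnable sketch of Dijkstra's shortest-path algorithm, assuming the graph is given as a dict-of-dicts adjacency map with non-negative edge weights:

```python
import heapq

def dijkstra(graph, source):
    # graph: {node: {neighbor: weight, ...}, ...}
    dist = {source: 0}
    heap = [(0, source)]            # priority queue of (distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                # stale queue entry, skip
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd        # found a shorter path to v
                heapq.heappush(heap, (nd, v))
    return dist

graph = {'a': {'b': 1, 'c': 4}, 'b': {'c': 2, 'd': 6}, 'c': {'d': 3}}
print(dijkstra(graph, 'a'))  # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```

Each node and edge is processed at most once per heap entry, which is what keeps the running time polynomial in the size of the graph.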
• not all hope is lost...
• Even if a problem is in NP, there might be an
approximation algorithm to solve it efficiently (in
polynomial time)
– However, it is important to determine the error
bounds.
– For example, an approx. alg. might find a subset
cover that is “no more than twice the optimal size”
– A simple greedy algorithm for the knapsack problem:
• put in the item with the largest value-to-weight ratio first, then
the next largest, and so on...
• combined with the option of taking just the single most valuable
item, this can be shown to fill the knapsack to within 2 times the
optimal value
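A sketch of that greedy heuristic: take items in decreasing value-to-weight ratio while they fit, then fall back to the single most valuable item if it alone beats the greedy fill (this fallback is what gives the factor-2 guarantee).

```python
def greedy_knapsack(items, capacity):
    # items: list of (value, weight) pairs
    by_ratio = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    chosen, total_v, total_w = [], 0, 0
    for value, weight in by_ratio:
        if total_w + weight <= capacity:   # item still fits
            chosen.append((value, weight))
            total_v += value
            total_w += weight
    # fall back to the single best item if it beats the greedy fill
    best_single = max((it for it in items if it[1] <= capacity),
                      default=None, key=lambda it: it[0])
    if best_single and best_single[0] > total_v:
        return [best_single], best_single[0]
    return chosen, total_v

items = [(60, 10), (100, 20), (120, 30)]   # (value, weight)
print(greedy_knapsack(items, 50))          # ([(60, 10), (100, 20)], 160)
```

Note the approximation at work: the optimal choice here is the second and third items (value 220), while the greedy fill achieves 160 — within the factor-2 bound.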