Data Structures Specification and Implementation, CSE 5350/7350

CSE 5350/7350
Introduction to Algorithms
Data Structures
Specification and Implementation
Textbook readings:
Cormen: Part III, Chapters 10-14
Mihaela Iridon, Ph.D.
CSE 5350 - Fall 2007
Data Structures
Slide 1
Objectives
• Understand what dynamic sets are
• Learn basic techniques for
a) Representing &
b) Manipulating finite dynamic sets
• Elementary Data Structures
– Stacks, queues, heaps, linked lists
• More Complex Data Structures
– Hash tables, binary search trees
• Data Structures in C#.NET 2.0
High-Level Structure (1)
• Arrays
– System.Collections.ArrayList
– System.Collections.Generic.List
• Queue
– System.Collections.Generic.Queue
• Stack
– System.Collections.Generic.Stack
High-Level Structure (2)
• Hashtable
– System.Collections.Hashtable
– System.Collections.Generic.Dictionary
• Trees
– Binary Trees, BST, Self-Balancing BST
– Linked Lists
• System.Collections.Generic.LinkedList
• Graphs
Dynamic Data Sets
• Definition
• Why dynamic
• General examples
• Data structures and the .NET framework
• “An Extensive Examination of Data
Structures Using C# 2.0” – Scott Mitchell
• http://msdn2.microsoft.com/en-us/library/ms364091(VS.80).aspx
Data Structure Design
• Impact on efficiency/running time
• The data structure used by an algorithm
can greatly affect the algorithm's
performance
• Important to have a rigorous method by
which to compare the efficiency of various
data structures
Example: file extension search
// Requires: using System; using System.IO;
public bool DoesExtensionExist(string[] fileNames, string extension)
{
    for (int i = 0; i < fileNames.Length; i++)
    {
        // Case-insensitive comparison of each file's extension
        if (String.Compare(Path.GetExtension(fileNames[i]), extension, true) == 0)
            return true;
    }
    return false; // If we reach here, we didn't find the extension
}
• Search is O(n): in the worst case every file name is examined
The Array
• Linear
• Simple
• Direct Access
• Homogeneous
• Most widely used
The Array (2)
• The contents of an array are stored in contiguous
memory.
• All of the elements of an array must be of the
same type or of a derived type; hence arrays are
referred to as homogeneous data structures.
• Array elements can be directly accessed. If
you want the i-th element of an array, a single
expression is enough:
arrayName[i].
Array Operations
• Allocation
• Accessing
– Declaring an array in C#:
string[] myArray;
(initially myArray reference is null)
– Creating an array in C#:
myArray = new string[5];
Array Allocation
• string[] myArray = new string[someIntegerSize];
• This allocates a contiguous block of
memory on the heap (CLR-managed)
Array Accessing
• Accessing an element at index i: O(1)
• Searching through an array
– Unsorted: O(n)
– Sorted: O(log n) (binary search)
• Array class: static method:
– Array.BinarySearch(Array input, object val)
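To make the two search costs concrete, here is a minimal sketch that contrasts a linear scan with Array.BinarySearch on an already-sorted array. The class and method names (ArraySearchDemo, LinearSearch, SortedSearch) are invented for this illustration.

```csharp
using System;

public static class ArraySearchDemo
{
    // O(n): examine each element until a match is found.
    public static int LinearSearch(int[] data, int target)
    {
        for (int i = 0; i < data.Length; i++)
            if (data[i] == target)
                return i;
        return -1; // not found
    }

    // O(log n): only valid when the array is already sorted.
    public static int SortedSearch(int[] sortedData, int target)
    {
        // Array.BinarySearch returns a negative number when the value is absent.
        return Array.BinarySearch(sortedData, target);
    }
}
```

Calling SortedSearch on an unsorted array gives meaningless results, which is why the O(log n) bound applies only to sorted data.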
Array Resizing
• When the size needs to change:
– Must create a new array instance
– Copy old array into new array:
Array1.CopyTo(Array2, 0)
• Time consuming
• Also, inserting into an array is problematic
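The resizing steps above can be sketched as follows. This is only an illustration under the assumption that the new size is at least the old one; the helper name Grow is hypothetical.

```csharp
using System;

public static class ArrayResizeDemo
{
    // Returns a new, larger array holding the elements of source.
    // The copy touches every element, so the cost is O(n).
    public static string[] Grow(string[] source, int newSize)
    {
        if (newSize < source.Length)
            throw new ArgumentException("newSize must be at least source.Length");

        string[] bigger = new string[newSize]; // new array instance
        source.CopyTo(bigger, 0);              // copy old contents, starting at index 0
        return bigger;
    }
}
```

The caller must replace its reference to the old array with the returned one; the old array is left for the garbage collector.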
Multi-Dimensional Arrays
• Rectangular
– n x n
– n x n x n x …
– Accessing: O(1)
– Searching: O(n^k) for k dimensions
• Jagged/Ragged
– n1 x n2 x n3 x …
Goals
• Type-safe
• Performant
• Reusable
• Example: payroll application
System.Collections.ArrayList
• Can hold any data type: (hybrid)
• Internally: array object
• Automatic resizing
• Not type safe: casting errors are detected
only at runtime
• Boxing/unboxing: an extra level of
indirection that affects performance
• Loose homogeneity
Generics
• Remedy for Typing and Performance
• Type-safe collections
• Reusability
• Example:
public class MyTypeSafeList<T>
{
T[] innerArray = new T[0];
}
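One possible way to flesh out the class above is sketched below. The Add method, the Count property, the indexer, and the starting capacity of 4 are assumptions made for this illustration, not part of the original slide.

```csharp
using System;

public class MyTypeSafeList<T>
{
    private T[] innerArray = new T[0];
    private int count = 0; // number of slots actually in use

    public int Count { get { return count; } }

    public void Add(T item)
    {
        if (count == innerArray.Length)
        {
            // Out of room: allocate a larger array and copy the old contents.
            int newCapacity = innerArray.Length == 0 ? 4 : innerArray.Length * 2;
            T[] bigger = new T[newCapacity];
            Array.Copy(innerArray, bigger, count);
            innerArray = bigger;
        }
        innerArray[count++] = item;
    }

    public T this[int index]
    {
        get
        {
            if (index < 0 || index >= count)
                throw new ArgumentOutOfRangeException("index");
            return innerArray[index]; // no cast needed: the element is already a T
        }
    }
}
```

Because the element type is fixed at compile time, a MyTypeSafeList<int> stores its values without boxing, and adding a string to it is a compile-time error.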
List
• Homogeneous
• Self-Re-dimensioning Array
• System.Collections.Generic.List
List<string> studentNames = new List<string>();
studentNames.Add("John");
…
string name = studentNames[3];
studentNames[2] = "Mike";
List Methods
• Contains()
• IndexOf()
• BinarySearch()
• Find()
• FindAll()
• Sort()
– Asymptotic Running Time: same as array but with
extra overhead
Ordered Requests Processing
• First-come, first-served (FIFO)
• Priority-based processing
• Inefficient to use List<T>
• List will continue to grow (internally, the
capacity is doubled each time it fills up)
• Solution: circular list/array
• Problem: initial size??
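The circular-array idea can be sketched as below. This is a fixed-capacity illustration only; the class name CircularQueue and the choice to throw when full (rather than grow) are assumptions, which leaves the initial-size question open.

```csharp
using System;

public class CircularQueue<T>
{
    private readonly T[] items;
    private int head;  // index of the oldest element
    private int count; // number of stored elements

    public CircularQueue(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        items = new T[capacity];
    }

    public int Count { get { return count; } }

    public void Enqueue(T item)
    {
        if (count == items.Length)
            throw new InvalidOperationException("Queue is full");
        int tail = (head + count) % items.Length; // wraps around the end of the array
        items[tail] = item;
        count++;
    }

    public T Dequeue()
    {
        if (count == 0)
            throw new InvalidOperationException("Queue is empty");
        T item = items[head];
        items[head] = default(T);         // release the slot for reuse
        head = (head + 1) % items.Length; // advance, wrapping around
        count--;
        return item;
    }
}
```

Slots freed by Dequeue are reused by later Enqueue calls, so the array does not keep growing the way an append-only list would.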
Queue
• System.Collections.Generic.Queue
• Operations:
– Enqueue()
– Dequeue()
– Contains()
– ToArray()
– Peek()
• Does not allow random access
• Type-safe; maximizes space utilization
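A short usage sketch of Queue<T>, showing FIFO order. The helper name ProcessInOrder and the idea of treating strings as "requests" are assumptions for this illustration.

```csharp
using System.Collections.Generic;

public static class QueueDemo
{
    // Returns the requests in the order a FIFO queue would serve them.
    public static string[] ProcessInOrder(string[] requests)
    {
        Queue<string> pending = new Queue<string>();
        foreach (string request in requests)
            pending.Enqueue(request); // add at the tail

        List<string> processed = new List<string>();
        while (pending.Count > 0)
            processed.Add(pending.Dequeue()); // remove from the head
        return processed.ToArray();
    }
}
```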
Queue (continued)
• Applications:
– Web servers
– Print queues
• Rate of growth:
– Specified in the constructor
– Default: double initial size
Stack
• LIFO
• System.Collections.Generic.Stack
• Operations:
– Push()
– Pop()
• Doubles in size when more space is
needed
• Applications:
– CLR call stack (function invocation)
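As a small LIFO illustration, the sketch below uses Stack<char> to check that parentheses are balanced. The example problem and the name IsBalanced are chosen for this illustration only.

```csharp
using System.Collections.Generic;

public static class StackDemo
{
    // True when every ')' closes an earlier '(' and nothing is left open.
    public static bool IsBalanced(string text)
    {
        Stack<char> open = new Stack<char>();
        foreach (char c in text)
        {
            if (c == '(')
            {
                open.Push(c);
            }
            else if (c == ')')
            {
                if (open.Count == 0) return false; // nothing to close
                open.Pop();                        // matches the most recent '('
            }
        }
        return open.Count == 0;
    }
}
```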
Limitations of Ordinal Indexing
• Ideal access time: O(1)
• If index is unknown
– O(n) if not sorted
– O(log n) if sorted
• Example: SSN: 10^9 possible
combinations
• Solution: compress the ordinal indexing
domain with a hash function; e.g., use only
4 of the digits
Hash Table
• Hashing:
– Math transformation of one representation
into another representation
• Hash table:
– An array that uses hashing to compress the
index space
• Cryptography (information security)
• Hash function:
– Non-injective (not a one-to-one function)
– “Fingerprint” of initial data
Goals
• Fast access of items in large amounts of
data
• As few collisions as possible
– collision avoidance
• Avalanche effect:
– Minor changes to input cause major changes to
output
Collision Resolution (1)
• Probability to map to a given location:
1/k (k = size = number of slots)
• (1) Linear Probing
Is H[i] empty?
• YES: place item at location i
• NO: i = i + 1; repeat
– Deficiency: clustering
– Access and Insertion: no longer O(1)
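The linear probing loop can be sketched as below. This is a minimal illustration with integer keys and a simple modulo as the home slot; the class name and both choices are assumptions, and deletion is left out.

```csharp
using System;

public class LinearProbingTable
{
    private readonly int?[] slots; // null marks an empty slot

    public LinearProbingTable(int size) { slots = new int?[size]; }

    // Home slot of a key; the mask keeps the value non-negative.
    private int Home(int key) { return (key & 0x7FFFFFFF) % slots.Length; }

    // Returns false only when the table is full.
    public bool Insert(int key)
    {
        int i = Home(key);
        for (int probes = 0; probes < slots.Length; probes++)
        {
            if (slots[i] == null || slots[i] == key) { slots[i] = key; return true; }
            i = (i + 1) % slots.Length; // occupied: try the next slot
        }
        return false;
    }

    // Returns the slot index holding key, or -1 if it is absent.
    public int IndexOf(int key)
    {
        int i = Home(key);
        for (int probes = 0; probes < slots.Length; probes++)
        {
            if (slots[i] == null) return -1; // an empty slot ends the probe run
            if (slots[i] == key) return i;
            i = (i + 1) % slots.Length;
        }
        return -1;
    }
}
```

With 7 slots, the keys 1, 8, and 15 all have home slot 1 and end up in slots 1, 2, and 3. That run of occupied neighbors is the clustering the slide refers to.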
Collision Resolution (2)
• (2) Quadratic Probing
– Check s + 1^2
– Check s – 1^2
– Check s + 2^2
– Check s – 2^2
– …
– Check s +/- i^2
– Clustering is a problem as well
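The probe order s + 1^2, s – 1^2, s + 2^2, s – 2^2, … can be generated as in the sketch below. It assumes 0 <= s < tableSize and wraps indices modulo the table size; the wrap-around and the function name are assumptions not stated on the slide.

```csharp
public static class QuadraticProbing
{
    // Returns the first 2 * steps slots probed after the home slot s.
    public static int[] ProbeSequence(int s, int tableSize, int steps)
    {
        int[] sequence = new int[2 * steps];
        for (int i = 1; i <= steps; i++)
        {
            int offset = (i * i) % tableSize;
            sequence[2 * (i - 1)] = (s + offset) % tableSize;
            // Add tableSize before the final modulo so the index is never negative.
            sequence[2 * (i - 1) + 1] = ((s - offset) % tableSize + tableSize) % tableSize;
        }
        return sequence;
    }
}
```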
Collision Resolution (3)
• (3) Rehashing – used by Hashtable (C#)
• System.Collections.Hashtable
• Operations:
– Add(key, item)
– ContainsKey()
– Keys (property)
– ContainsValue()
– Values (property)
• Key, Value: any type, hence not type safe
Hashtable Data Type – Example
using System;
using System.Collections;

public class HashtableDemo
{
    private static Hashtable employees = new Hashtable();

    public static void Main()
    {
        // Add some values to the Hashtable, indexed by a string key
        employees.Add("111-22-3333", "Scott");
        employees.Add("222-33-4444", "Sam");
        employees.Add("333-44-5555", "Jisun");

        // Access a particular key
        if (employees.ContainsKey("111-22-3333"))
        {
            // Values come back as object, so a cast is required
            string empName = (string) employees["111-22-3333"];
            Console.WriteLine("Employee 111-22-3333's name is: " + empName);
        }
        else
            Console.WriteLine("Employee 111-22-3333 is not in the hash table...");
    }
}
Hashtable
• Key = any type
• Key is transformed into an index via
GetHashCode() function
• Object class defines GetHashCode()
• H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1))] % hashsize
• Values = 0 .. hashsize – 1
Collision Resolution (3 – cont’d)
• Rehashing = double hashing
• Set of hash functions: H1, H2, …, Hn
• Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1)))] % hashsize
• Hashsize must be PRIME
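The Hk formula can be transcribed directly, as in the sketch below. It assumes the raw hash value is non-negative and hashsize is at least 2; the names RehashDemo and Probe are hypothetical, and this is a transcription of the slide's formula rather than the framework's actual source.

```csharp
public static class RehashDemo
{
    // k-th probe location for a key whose raw hash code is hash (assumed >= 0).
    public static int Probe(int hash, int k, int hashsize)
    {
        // The step size depends on the key, so keys that collide on the
        // first probe usually follow different probe sequences afterwards.
        long step = 1 + (((hash >> 5) + 1) % (hashsize - 1));
        return (int)((hash + k * step) % hashsize);
    }
}
```

The step is always between 1 and hashsize – 1. When hashsize is prime, any such step is relatively prime to it, so successive values of k visit every slot before repeating.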
Hashtable
• Load factor = maximum allowed ratio (# items / # slots)
• Optimal: 0.72
• Expanding the hashtable takes 2 (costly) steps:
– Roughly double the # of slots (current prime to the next
prime that is about twice as large)
– Rehash
• High load factor means a dense hashtable
– Less space
– More probes on collision: expected # probes = 1/(1 – LF)
– If LF = 0.72, expected # probes is about 3.5, so still O(1)
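The 1/(1 – LF) estimate is easy to evaluate for other load factors. A minimal sketch, with a hypothetical helper name:

```csharp
using System;

public static class LoadFactorDemo
{
    // Expected number of probes on a collision for a given load factor.
    public static double ExpectedProbes(double loadFactor)
    {
        if (loadFactor < 0.0 || loadFactor >= 1.0)
            throw new ArgumentOutOfRangeException("loadFactor");
        return 1.0 / (1.0 - loadFactor);
    }
}
```

For LF = 0.72 this gives roughly 3.57 probes, while LF = 0.9 already gives 10, which shows why the table is expanded before it gets too dense.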
Hashtable
• Costly to expand
• Set the size in constructor if size is known
• Asymptotic running times:
– Access: O(1)
– Add, Remove: O(1)
– Search: O(1)
System.Collections.Generic.Dictionary
• Type-safe
• Strongly typed KEYS + VALUES
• Operations:
– Add(key, value)
– ContainsKey(key)
• Collision Resolution: CHAINING
– Uses linked lists from an entry where collision
occurs
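Chaining can be sketched with an array of linked lists, one per bucket. This is a teaching illustration only: the class name ChainedTable, the fixed int/string types, and the modulo bucket function are assumptions, and it is not a description of how Dictionary is implemented internally.

```csharp
using System.Collections.Generic;

public class ChainedTable
{
    private readonly LinkedList<KeyValuePair<int, string>>[] buckets;

    public ChainedTable(int bucketCount)
    {
        buckets = new LinkedList<KeyValuePair<int, string>>[bucketCount];
    }

    private int BucketOf(int key) { return (key & 0x7FFFFFFF) % buckets.Length; }

    // O(1): prepend to the chain. Duplicate keys are not checked in this sketch.
    public void Add(int key, string value)
    {
        int b = BucketOf(key);
        if (buckets[b] == null)
            buckets[b] = new LinkedList<KeyValuePair<int, string>>();
        buckets[b].AddFirst(new KeyValuePair<int, string>(key, value));
    }

    // Cost is proportional to the length of one chain.
    public bool TryGet(int key, out string value)
    {
        int b = BucketOf(key);
        if (buckets[b] != null)
        {
            foreach (KeyValuePair<int, string> pair in buckets[b])
            {
                if (pair.Key == key) { value = pair.Value; return true; }
            }
        }
        value = null;
        return false;
    }
}
```

Colliding keys simply share a bucket, so no probing sequence is needed.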
Chaining in Dictionary Data Type
Dictionary Example
Dictionary<keyType, valueType> variableName =
    new Dictionary<keyType, valueType>();

Dictionary<int, Employee> employeeData = new Dictionary<int, Employee>();
// Add some employees (Add takes the key and the value as two arguments)
employeeData.Add(455110189, new Employee("Scott Mitchell"));
employeeData.Add(455110191, new Employee("Jisun Lee"));
...
// See if employee with SSN 123-45-6789 works here
if (employeeData.ContainsKey(123456789))
...
Chaining in the Dictionary type
• Efficiency:
– Add: O(1)
– Remove: O(n/m)
– Search: O(n/m)
Where:
n = number of elements in the hash table
m = number of buckets/slots
• Implemented such that n never exceeds m
– The total # of chained elements can never
exceed the number of buckets, so n/m <= 1 and
Remove and Search run in constant time as well
Trees
• = set of linked nodes where no cycle exists
• (Graph theory) a connected acyclic graph
• Nodes:
– Root
– Leaf
– Internal
• |E| = ?
• Forest = { trees }
Popular Tree-Type Data Structures
• BST: Binary Search Tree
• Heap
• Self-balancing binary search trees
– AVL
– Red-black
• Radix tree
•…
Binary Trees
• Code example for defining a tree data
object
• Tree Traversal
– In-order: L Ro R
– Pre-order: Ro L R
– Post-order: L R Ro
– Running time: Θ(n)
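A minimal sketch of a binary tree node and the three traversals, where L, Ro, and R stand for left subtree, root, and right subtree. The type names TreeNode and Traversal, and the use of int values, are assumptions for this illustration.

```csharp
using System.Collections.Generic;

public class TreeNode
{
    public int Value;
    public TreeNode Left;
    public TreeNode Right;

    public TreeNode(int value) { Value = value; }
}

public static class Traversal
{
    // L Ro R
    public static void InOrder(TreeNode node, List<int> output)
    {
        if (node == null) return;
        InOrder(node.Left, output);
        output.Add(node.Value);
        InOrder(node.Right, output);
    }

    // Ro L R
    public static void PreOrder(TreeNode node, List<int> output)
    {
        if (node == null) return;
        output.Add(node.Value);
        PreOrder(node.Left, output);
        PreOrder(node.Right, output);
    }

    // L R Ro
    public static void PostOrder(TreeNode node, List<int> output)
    {
        if (node == null) return;
        PostOrder(node.Left, output);
        PostOrder(node.Right, output);
        output.Add(node.Value);
    }
}
```

Each traversal visits every node exactly once, which is where the linear running time comes from.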
Binary Tree Data Structure
Tree Operations
• Search: Recursive: O(h)
– h = height of the tree
• Max & Min Search: search right/left
• Successor & Predecessor Search
• Insertion (easy: always add a new leaf) &
Deletion (more complicated as it may
cause the tree structure to change)
• Running time:
– function of the tree topology
Binary Search Tree
• Improves the search time (and lookup
time) over the binary tree in general
• BST property:
– for any node n, every descendant node's value
in the left subtree of n is less than the value of
n, and every descendant node's value in the
right subtree is greater than the value of n
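The BST property is what lets search discard one subtree at every step. A minimal sketch of insertion and search follows; the names BstNode and Bst are hypothetical, and duplicates are ignored as a simplifying assumption.

```csharp
public class BstNode
{
    public int Value;
    public BstNode Left;
    public BstNode Right;

    public BstNode(int value) { Value = value; }
}

public static class Bst
{
    // Inserts value as a new leaf and returns the (possibly new) root.
    public static BstNode Insert(BstNode root, int value)
    {
        if (root == null) return new BstNode(value);
        if (value < root.Value)
            root.Left = Insert(root.Left, value);
        else if (value > root.Value)
            root.Right = Insert(root.Right, value);
        return root; // equal values are ignored in this sketch
    }

    // O(h), where h is the height of the tree.
    public static bool Contains(BstNode root, int value)
    {
        BstNode current = root;
        while (current != null)
        {
            if (value == current.Value) return true;
            // Smaller values can only be on the left, larger ones on the right.
            current = value < current.Value ? current.Left : current.Right;
        }
        return false;
    }
}
```

Inserting keys in already-sorted order produces a degenerate, list-like tree, in which case Contains degrades to O(n).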
Non-BST vs BST
(a) Non-BST
(b) BST
Linear Search Time in BST
The search time for a BST
depends upon its topology.
BST continued
• Perfectly balanced BST:
– Search: O(log n)
[ height = log n]
• Sub-linear search running time
• Balanced Binary Tree:
– Exhibits a good ratio of breadth to depth
• Self-balancing trees
The Heap
• Specialized tree-based data structure that
satisfies the heap property: if B is a child node of
A, then key(A) ≥ key(B). [max-heap]
• Operations:
– delete-max or delete-min: removing the root node of a
max- or min-heap, respectively
– increase-key or decrease-key: updating a key within a
max- or min-heap, respectively
– insert: adding a new key to the heap
– merge: joining two heaps to form a valid new heap
containing all the elements of both
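A minimal array-backed max-heap showing insert and delete-max. Storing the tree in a List<int>, with the children of index i at 2i + 1 and 2i + 2, is a common layout assumed here; the slide does not prescribe a representation.

```csharp
using System;
using System.Collections.Generic;

public class MaxHeap
{
    private readonly List<int> keys = new List<int>();

    public int Count { get { return keys.Count; } }

    // O(log n): append at the bottom, then sift up while larger than the parent.
    public void Insert(int key)
    {
        keys.Add(key);
        int i = keys.Count - 1;
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (keys[parent] >= keys[i]) break; // heap property holds
            Swap(parent, i);
            i = parent;
        }
    }

    // O(log n): remove the root, move the last key to the top, then sift down.
    public int DeleteMax()
    {
        if (keys.Count == 0) throw new InvalidOperationException("Heap is empty");
        int max = keys[0];
        int last = keys.Count - 1;
        keys[0] = keys[last];
        keys.RemoveAt(last);

        int i = 0;
        while (true)
        {
            int left = 2 * i + 1, right = left + 1, largest = i;
            if (left < keys.Count && keys[left] > keys[largest]) largest = left;
            if (right < keys.Count && keys[right] > keys[largest]) largest = right;
            if (largest == i) break; // both children are no larger
            Swap(i, largest);
            i = largest;
        }
        return max;
    }

    private void Swap(int a, int b)
    {
        int tmp = keys[a];
        keys[a] = keys[b];
        keys[b] = tmp;
    }
}
```

Repeatedly calling DeleteMax returns the keys in descending order, which is the idea behind heapsort and priority-based processing.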
Max Heap Example
• Example of max-heap:
Linked Lists
• No resizing necessary
• Search: O(n)
• Insertion
– O(1) if unsorted
– O(n) if sorted
• Access: O(n)
• System.Collections.Generic.LinkedList
– Doubly-linked; type-safe (values typed via generics)
– Element: LinkedListNode
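The O(n) cost of inserting into a sorted list comes from finding the position; the link update itself is constant time. A sketch using LinkedList<int> and LinkedListNode<int>, with a hypothetical helper name:

```csharp
using System.Collections.Generic;

public static class LinkedListDemo
{
    // Inserts value into an ascending list, keeping it sorted.
    public static void InsertSorted(LinkedList<int> list, int value)
    {
        // O(n): walk until the first node that is not smaller than value.
        LinkedListNode<int> node = list.First;
        while (node != null && node.Value < value)
            node = node.Next;

        // O(1): splice in the new node; no elements are shifted or copied.
        if (node == null)
            list.AddLast(value);
        else
            list.AddBefore(node, value);
    }
}
```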
Skip List
• Linked list with a self-balancing BST-like
property
• The elements are sorted
• Height = log n
• Problems with insert & delete
• Solution: randomized distribution
• Overall: O(log n)
• Worst case: O(n), but there is only a very slim
chance of reaching the worst case
Skip List Examples
Graphs
• A collection of interconnected nodes
• A graph or undirected graph G is an
ordered pair G := (V, E) that is subject to
the following conditions:
– V is a set, whose elements are called vertices or nodes,
– E is a set of pairs (unordered) of distinct vertices, called
edges or lines.
• Edges (1):
– Directed
– Undirected
– Weighted
– Unweighted
Graph (cont’d)
• Sparse: |E| << |Emax|, i.e. |E| is much smaller than n^2 (n = |V|)
• Representation:
– Adjacency List
– Adjacency Matrix
– (Packed Edge List)
• Problems applicable to graphs:
– Minimum spanning tree (Kruskal, Prim)
– Shortest Path (Dijkstra)
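Of the representations listed above, the adjacency list is often the natural fit for sparse graphs because it stores only the edges that exist. A minimal undirected sketch follows; the class name Graph and the use of string vertex labels are assumptions for this illustration.

```csharp
using System.Collections.Generic;

public class Graph
{
    // Maps each vertex to the list of its neighbors.
    private readonly Dictionary<string, List<string>> adjacency =
        new Dictionary<string, List<string>>();

    public void AddVertex(string v)
    {
        if (!adjacency.ContainsKey(v))
            adjacency[v] = new List<string>();
    }

    // Undirected edge: each endpoint records the other as a neighbor.
    public void AddEdge(string u, string v)
    {
        AddVertex(u);
        AddVertex(v);
        adjacency[u].Add(v);
        adjacency[v].Add(u);
    }

    // Throws KeyNotFoundException for an unknown vertex.
    public IList<string> Neighbors(string v)
    {
        return adjacency[v];
    }
}
```

Space is proportional to |V| + |E|, whereas an adjacency matrix always takes |V|^2 entries regardless of how many edges exist.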
Website Navigation as a Graph
Distance Graph Example
Graph Representation
Minimum Spanning Tree
• Spanning tree of a connected, undirected
graph = a subset of the edges that
connects all the nodes and does not
contain a cycle
• Minimum spanning tree = a spanning tree
with the smallest total edge weight
Kruskal’s Algorithm
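A compact sketch of Kruskal's algorithm: sort the edges by weight, then add each edge that joins two different components, tracked with a simple union-find structure. The Edge struct, the integer vertex numbering 0 .. vertexCount – 1, and the path-halving Find are choices made for this illustration.

```csharp
using System.Collections.Generic;

public struct Edge
{
    public int U, V, Weight;

    public Edge(int u, int v, int weight) { U = u; V = v; Weight = weight; }
}

public static class Kruskal
{
    // Returns the representative of the component containing x.
    private static int Find(int[] parent, int x)
    {
        while (parent[x] != x)
        {
            parent[x] = parent[parent[x]]; // path halving keeps the trees shallow
            x = parent[x];
        }
        return x;
    }

    // Returns the edges of a minimum spanning tree (a forest if disconnected).
    public static List<Edge> MinimumSpanningTree(int vertexCount, List<Edge> edges)
    {
        List<Edge> sorted = new List<Edge>(edges);
        sorted.Sort(delegate(Edge a, Edge b) { return a.Weight.CompareTo(b.Weight); });

        int[] parent = new int[vertexCount];
        for (int i = 0; i < vertexCount; i++) parent[i] = i; // every vertex starts alone

        List<Edge> tree = new List<Edge>();
        foreach (Edge e in sorted)
        {
            int rootU = Find(parent, e.U);
            int rootV = Find(parent, e.V);
            if (rootU != rootV)        // the edge joins two components: no cycle
            {
                parent[rootU] = rootV; // merge the components
                tree.Add(e);
            }
        }
        return tree;
    }
}
```

Sorting dominates the running time, giving O(|E| log |E|) overall.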
Prim’s Algorithm