Introduction to Algorithms - Lecture 4: Hashing
This post is a derivative from Introduction to Algorithms of MIT OpenCourseWare, used under CC BY-NC-SA 4.0.
Review
- Set data structures: store items so that you can query them by their key
- Sequence data structures: impose an external order on the items and maintain that order
Comparison Model
- In this model, assume algorithm can only differentiate items via comparisons
- Comparable items: black boxes only supporting comparisons between pairs
- Comparisons are <, ≤, >, ≥, =, ≠; outputs are binary: True or False
- Goal: Store a set of n comparable items, support find(k) operation
- Running time is lower bounded by # comparisons performed, so count comparisons!
Decision Tree
- Any algorithm can be viewed as a decision tree of operations performed
- An internal node represents a binary comparison, branching either True or False
- For a comparison algorithm, the decision tree is binary
- A leaf represents algorithm termination, resulting in an algorithm output
- A root-to-leaf path represents an execution of the algorithm on some input
- Need at least one leaf for each algorithm output, so search requires ≥ n + 1 leaves (one per item found, plus one for "not found")
Comparison Search Lower Bound
- What is worst-case running time of a comparison search algorithm?
- running time ≥ # comparisons ≥ max length of any root-to-leaf path ≥ height of tree
- The longest path is the same as the height of the tree
- Minimum height when binary tree is complete (all rows full except last)
- Height ≥ ⌈lg(n + 1)⌉ − 1 = Ω(log n), so the running time of any comparison search is Ω(log n)
- Sorted arrays achieve this bound via binary search (see the sketch after this list)! Yay!
- More generally, height of a tree with Θ(n) leaves and max branching factor b is Ω(log_b n)
- To get faster, need an operation that allows super-constant ω(1) branching factor. How??
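Here is a minimal binary-search sketch showing how a sorted array meets the lower bound (the function and variable names are my own, not the lecture's): each loop iteration performs a constant number of comparisons and halves the remaining range, so the total number of comparisons is O(log n).

```python
# Binary search on a sorted array: O(log n) comparisons, matching the
# Omega(log n) decision-tree lower bound for comparison search.
def find(sorted_keys, k):
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] == k:
            return mid            # found: return the index of key k
        elif sorted_keys[mid] < k:
            lo = mid + 1          # k can only be in the right half
        else:
            hi = mid - 1          # k can only be in the left half
    return None                   # not found (the extra "n + 1st" output)

print(find([2, 3, 5, 7, 11, 13, 17, 19], 13))  # -> 5 (index of key 13)
```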
Direct Access Array
If I had a constant branching factor, the height of the tree would still be bounded by a logarithm, base that constant, of the number of leaves. But I don't want logarithmic time. I want faster. So how can I do that?
So I need, in some sense, to be able to branch a non-constant amount. How can I branch a non-constant amount?
We had this operation in the random access machine that could go to any place in memory in constant time based on a number. That was a powerful thing, because within a single constant-time operation I could go to any location in memory. Can we use that to find items faster?
What we're going to do is, if I have an item with key 10, I'm going to keep an array and store that item 10 slots away from the front of the array.
This is amazing. But why don't we just do this all the time? Because we don't know how high the numbers go. And order operations (find_min, find_max, …) are bad because if I'm storing these things non-contiguously, I have to scan down the array to find the next element.
To be able to use the direct access array, I need to make sure that u (the largest key) is less than 2^w, where w is the word size of the machine.
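As a rough sketch of the idea (the class and method names are my own, assuming keys are non-negative integers smaller than a known bound u): allocate one slot per possible key, so find/insert/delete are worst-case O(1), but space is Θ(u).

```python
# A minimal direct access array sketch (names are mine, not the lecture's).
class DirectAccessArray:
    def __init__(self, u):
        self.A = [None] * u        # one slot per possible key: Theta(u) space

    def insert(self, key, item):
        self.A[key] = item         # worst-case O(1)

    def find(self, key):
        return self.A[key]         # worst-case O(1)

    def delete(self, key):
        self.A[key] = None

# The item with key 10 lives exactly 10 slots from the front of the array.
D = DirectAccessArray(100)
D.insert(10, "item with key 10")
print(D.find(10))
```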
Hashing
We have a problem when we have a large universe of keys. So how do we get around that?
- Map the space of keys {0, ..., u − 1} down to a smaller range {0, ..., m − 1}, where m = Θ(n)
- Hash function: h(k) : {0, ..., u − 1} → {0, ..., m − 1} (also called a hash map)
- We really want a hash function that will evenly distribute keys over this space.
The problem is that we might have to store more than one item at the same index location. These are collisions, and I have two options for dealing with them:
- If I choose m to be larger than n, there will be extra space in the array, so I can just stick a colliding item somewhere else in the existing array (open addressing)
- Instead of storing it somewhere else in the hash table, store a pointer to another data structure, like a dynamic array or a linked list (chaining)
Chaining
- Idea! Store collisions in another data structure (a chain)
- If keys roughly evenly distributed over indices, chain size is n/m = n/Ω(n) = O(1)!
- If chain has O(1) size, all operations take O(1) time! Yay!
- If not, many items may map to same location, e.g. h(k) = constant, chain size is Θ(n) :(
- Need good hash function! So what’s a good hash function?
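Before looking at hash functions, here is a minimal chaining sketch (the class and its names are my own, not the lecture's), using the division method h(k) = k mod m from the next section as a placeholder hash function and a Python list as each chain:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                    # division method (a heuristic)

    def insert(self, key, item):
        chain = self.chains[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, item)         # overwrite an existing key
                return
        chain.append((key, item))

    def find(self, key):
        for k, item in self.chains[self._h(key)]:
            if k == key:
                return item
        return None                            # key not stored

T = ChainedHashTable(m=7)
T.insert(10, "a"); T.insert(17, "b")           # 10 and 17 collide (both hash to slot 3)
print(T.find(17))                              # -> "b", found by scanning a short chain
```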
Hash Functions
Any fixed hash function is going to experience collisions.
Division (bad): h(k) = (k mod m)
Just take the key mod m; this is called the division method.
If all the keys are uniformly distributed over the larger space, this isn't such a bad thing. But it imposes a distribution requirement on the inputs in order to get good performance.
- Heuristic, good when keys are uniformly distributed!
- m should avoid symmetries of the stored keys
- Large primes far from powers of 2 and 10 can be reasonable
- Python uses a version of this with some additional mixing
- Because they want a deterministic hash function: if I run the program again, it will do the same thing underneath
- If u ≫ n, every fixed hash function will have some input set that creates an Ω(n)-size chain (see the division-method sketch below)
- Idea! Don’t use a fixed hash function! Choose one randomly (but carefully)!
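To make that concrete for the division method: with a fixed h(k) = k mod m, any set of keys that are all congruent mod m lands in a single chain. A small illustration, with m chosen arbitrarily:

```python
# For a fixed h(k) = k mod m, the keys m, 2m, 3m, ... all hash to slot 0,
# so the chain there grows to length Theta(n).
m = 8
bad_keys = [i * m for i in range(1, 11)]   # 8, 16, 24, ..., 80
print({k % m for k in bad_keys})           # -> {0}: every key collides
```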
If I make this data structure and put some inputs into it, I want a running time that is independent of which keys I decided to store. That leads to a universal hash function.
Universal (good, theoretically)
h_ab(k) = (((a·k + b) mod p) mod m), where p is a prime larger than u, and a, b are chosen at random from {0, ..., p − 1} with a ≠ 0
So when I start up my program, I'm going to instantiate this thing with some random a and b, not deterministically. The user of this thing doesn't know which a and b I picked, so it's really hard for them to give me a bad example.
If we use this universal hash family, the length of each chain is expected to be constant.
- Expected size of the chain at index h(k_i) is at most 1 + (n − 1)/m, since each other key collides with k_i with probability at most 1/m
- Since m = Ω(n), load factor α = n/m = O(1), so O(1) in expectation!
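A minimal sketch of drawing a function from this family (the class name and the particular prime p = 2^31 − 1 are my assumptions for illustration; any prime larger than the largest key u works):

```python
import random

class UniversalHash:
    def __init__(self, m, p=2**31 - 1):
        self.m, self.p = m, p
        self.a = random.randrange(1, p)    # random a != 0
        self.b = random.randrange(0, p)    # random b

    def __call__(self, key):
        # h_ab(k) = ((a*k + b) mod p) mod m
        return ((self.a * key + self.b) % self.p) % self.m

h = UniversalHash(m=100)
print(h(12345))    # some slot in {0, ..., 99}; differs between runs
```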
Dynamic
- If n/m far from 1, rebuild with new randomly chosen hash function for new size m
- Same analysis as dynamic arrays, cost can be amortized over many dynamic operations
- So a hash table can implement dynamic set operations in expected amortized O(1) time! :)
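A rough sketch of the rebuild step, reusing the hypothetical ChainedHashTable from the chaining sketch above: when n drifts too far from m, allocate a new table of size Θ(n), ideally with a freshly drawn random hash function, and re-insert everything; the O(n) rebuild cost amortizes over the many operations that changed n, just as with dynamic arrays.

```python
def rebuild(old_table, n):
    # Allocate a new table of size Theta(n) and re-insert every stored
    # (key, item) pair: O(n) work, amortized over the operations that
    # changed n. A full implementation would also draw a fresh random
    # hash function (e.g. a new UniversalHash) for the new table.
    new_table = ChainedHashTable(m=2 * n)
    for chain in old_table.chains:
        for key, item in chain:
            new_table.insert(key, item)
    return new_table
```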