Introduction to Algorithms - Lecture 4: Hashing
This post is a derivative from Introduction to Algorithms of MIT OpenCourseWare, used under CC BY-NC-SA 4.0.
Review
- Set data structures: store items so that you can query them by their key
- Sequence data structures: impose an external order on the items and maintain that order
Comparison Model
- In this model, assume algorithm can only differentiate items via comparisons
- Comparable items: black boxes only supporting comparisons between pairs
- Comparisons are <, ≤, >, ≥, =, ≠; outputs are binary: True or False
- Goal: Store a set of n comparable items, support find(k) operation
- Running time is lower bounded by # comparisons performed, so count comparisons!
Decision Tree
- Any algorithm can be viewed as a decision tree of operations performed
- An internal node represents a binary comparison, branching either True or False
- For a comparison algorithm, the decision tree is binary
- A leaf represents algorithm termination, resulting in an algorithm output
- A root-to-leaf path represents an execution of the algorithm on some input
- Need at least one leaf for each algorithm output, so search requires ≥ n + 1 leaves (one per item found, plus one for "not found")
Comparison Search Lower Bound
- What is worst-case running time of a comparison search algorithm?
- running time ≥ # comparisons ≥ max length of any root-to-leaf path ≥ height of tree
- The longest path is the same as the height of the tree
- Minimum height when binary tree is complete (all rows full except last)
- Height ≥ ⌈lg(n + 1)⌉ − 1 = Ω(log n), so the running time of any comparison search is Ω(log n)
- Sorted arrays achieve this bound via binary search (see the sketch after this list)! Yay!
- More generally, height of a tree with Θ(n) leaves and max branching factor b is Ω(log_b n)
- To get faster, need an operation that allows super-constant ω(1) branching factor. How??
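Here is a minimal binary-search sketch showing how a sorted array meets the lower bound (the function and variable names are my own, not the lecture's): each loop iteration performs a constant number of comparisons and halves the remaining range, so the total number of comparisons is O(log n).

```python
# Binary search on a sorted array: O(log n) comparisons, matching the
# Omega(log n) decision-tree lower bound for comparison search.
def find(sorted_keys, k):
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] == k:
            return mid            # found: return the index of key k
        elif sorted_keys[mid] < k:
            lo = mid + 1          # k can only be in the right half
        else:
            hi = mid - 1          # k can only be in the left half
    return None                   # not found (the extra "n + 1st" output)

print(find([2, 3, 5, 7, 11, 13, 17, 19], 13))  # -> 5 (index of key 13)
```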
Direct Access Array
If I had a constant branching factor, the height of the tree would still be bounded by a logarithm, base that constant, of the number of leaves. But I don't want logarithmic time. I want faster. So how can I do that?
So I need, in some sense, to be able to branch a non-constant amount. How can I branch a non-constant amount?
We had this operation in the random access machine that could go to any place in memory in constant time based on a number. That was a powerful thing, because within a single constant-time operation I could go to any location in memory. Can we use that to find items faster?
What we're going to do is, if I have an item with key 10, I'm going to keep an array and store that item 10 slots away from the front of the array.
This is amazing. But why don't we just do this all the time? Because we don't know how high the numbers go. And order operations (find_min, find_max, …) are bad because if I'm storing these things non-contiguously, I have to scan down the array to find the next element.
To be able to use the direct access array, I need to make sure that u (the largest key) is less than 2^w, where w is the word size of the machine.
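As a rough sketch of the idea (the class and method names are my own, assuming keys are non-negative integers smaller than a known bound u): allocate one slot per possible key, so find/insert/delete are worst-case O(1), but space is Θ(u).

```python
# A minimal direct access array sketch (names are mine, not the lecture's).
class DirectAccessArray:
    def __init__(self, u):
        self.A = [None] * u        # one slot per possible key: Theta(u) space

    def insert(self, key, item):
        self.A[key] = item         # worst-case O(1)

    def find(self, key):
        return self.A[key]         # worst-case O(1)

    def delete(self, key):
        self.A[key] = None

# The item with key 10 lives exactly 10 slots from the front of the array.
D = DirectAccessArray(100)
D.insert(10, "item with key 10")
print(D.find(10))
```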
Hashing
We have a problem when we have a large universe of keys. So how do we get around that?
- Map the space of keys {0, ..., u − 1} down to a smaller range {0, ..., m − 1}, where m = Θ(n)
- Hash function: h(k) : {0, ..., u − 1} → {0, ..., m − 1} (also called a hash map)
- We really want a hash function that will evenly distribute keys over this space.
The problem is that we might have to store more than one item at the same index location. These are collisions, and I have two options for dealing with them:
- If I choose m to be larger than n, there will be extra space in the array, so I can just stick a colliding item somewhere else in the existing array (open addressing)
- Instead of storing it somewhere else in the hash table, store a pointer to another data structure, like a dynamic array or a linked list (chaining)
Chaining
- Idea! Store collisions in another data structure (a chain)
- If keys roughly evenly distributed over indices, chain size is n/m = n/Ω(n) = O(1)!
- If chain has O(1) size, all operations take O(1) time! Yay!
- If not, many items may map to same location, e.g. h(k) = constant, chain size is Θ(n) :(
- Need good hash function! So what’s a good hash function?
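Before looking at hash functions, here is a minimal chaining sketch (the class and its names are my own, not the lecture's), using the division method h(k) = k mod m from the next section as a placeholder hash function and a Python list as each chain:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                    # division method (a heuristic)

    def insert(self, key, item):
        chain = self.chains[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, item)         # overwrite an existing key
                return
        chain.append((key, item))

    def find(self, key):
        for k, item in self.chains[self._h(key)]:
            if k == key:
                return item
        return None                            # key not stored

T = ChainedHashTable(m=7)
T.insert(10, "a"); T.insert(17, "b")           # 10 and 17 collide (both hash to slot 3)
print(T.find(17))                              # -> "b", found by scanning a short chain
```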
Hash Functions
Any fixed hash function is going to experience collisions.
Division (bad): h(k) = (k mod m)
Just take the key mod m; this is called the division method.
If all the keys are uniformly distributed over the larger space, this isn't such a bad thing. But it imposes a distribution requirement on the inputs in order to get good performance.
- Heuristic, good when keys are uniformly distributed!
- m should avoid symmetries of the stored keys
- Large primes far from powers of 2 and 10 can be reasonable
- Python uses a version of this with some additional mixing
- Because they want a deterministic hash function: if I run the program again, it will do the same thing underneath
- If u ≫ n, every fixed hash function will have some input set that creates an Ω(n)-size chain (see the division-method sketch below)
- Idea! Don’t use a fixed hash function! Choose one randomly (but carefully)!
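To make that concrete for the division method: with a fixed h(k) = k mod m, any set of keys that are all congruent mod m lands in a single chain. A small illustration, with m chosen arbitrarily:

```python
# For a fixed h(k) = k mod m, the keys m, 2m, 3m, ... all hash to slot 0,
# so the chain there grows to length Theta(n).
m = 8
bad_keys = [i * m for i in range(1, 11)]   # 8, 16, 24, ..., 80
print({k % m for k in bad_keys})           # -> {0}: every key collides
```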
If I make this data structure and put some inputs into it, I want a running time that is independent of which keys I decided to store. That leads to a universal hash function.
Universal (good, theoretically)
h_ab(k) = (((a·k + b) mod p) mod m), where p is a prime larger than u, and a, b are chosen at random from {0, ..., p − 1} with a ≠ 0
So when I start up my program, I'm going to instantiate this thing with some random a and b, not deterministically. The user of this thing doesn't know which a and b I picked, so it's really hard for them to give me a bad example.
If we use this universal hash family, the length of each chain is expected to be constant.
- Expected size of the chain at index h(k_i) is at most 1 + (n − 1)/m, since each other key collides with k_i with probability at most 1/m
- Since m = Ω(n), load factor α = n/m = O(1), so O(1) in expectation!
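A minimal sketch of drawing a function from this family (the class name and the particular prime p = 2^31 − 1 are my assumptions for illustration; any prime larger than the largest key u works):

```python
import random

class UniversalHash:
    def __init__(self, m, p=2**31 - 1):
        self.m, self.p = m, p
        self.a = random.randrange(1, p)    # random a != 0
        self.b = random.randrange(0, p)    # random b

    def __call__(self, key):
        # h_ab(k) = ((a*k + b) mod p) mod m
        return ((self.a * key + self.b) % self.p) % self.m

h = UniversalHash(m=100)
print(h(12345))    # some slot in {0, ..., 99}; differs between runs
```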
Dynamic
- If n/m far from 1, rebuild with new randomly chosen hash function for new size m
- Same analysis as dynamic arrays, cost can be amortized over many dynamic operations
- So a hash table can implement dynamic set operations in expected amortized O(1) time! :)
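A rough sketch of the rebuild step, reusing the hypothetical ChainedHashTable from the chaining sketch above: when n drifts too far from m, allocate a new table of size Θ(n), ideally with a freshly drawn random hash function, and re-insert everything; the O(n) rebuild cost amortizes over the many operations that changed n, just as with dynamic arrays.

```python
def rebuild(old_table, n):
    # Allocate a new table of size Theta(n) and re-insert every stored
    # (key, item) pair: O(n) work, amortized over the operations that
    # changed n. A full implementation would also draw a fresh random
    # hash function (e.g. a new UniversalHash) for the new table.
    new_table = ChainedHashTable(m=2 * n)
    for chain in old_table.chains:
        for key, item in chain:
            new_table.insert(key, item)
    return new_table
```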