Sometimes, we need to use a data structure to associate an identifier with another value (e.g., a restaurant menu). Technically, an array of subarrays would help us define such an association.
1const menu = [
2 ["hamburger", 3.5],
3 ["soda", 0.75],
4 ["hash browns", 1.8],
5 // ...
6];
However, it’s quite troublesome as we would have to traverse the whole array to find an item as worst-case (i.e., $O(n)$). There’s another strategy to look for a value in $O(1)$ time though! Dictionaries, associative arrays, or also maps. A very well-known data structure that helps us to implement a dictionary is the Hash Table.
Dictionaries, Associative Arrays, or Maps
We usually denote $A[i]$ to the association that maps $i \mapsto v$ where $i$ is in the range of the finitary domain $[0, \text{len}(A))$. Nonetheless, there are cases where we are interested in avoiding counting the index as an integer[1]. For instance, our previous example could be understood as a map with something like this:
1const menu = {
2 hamburger: 3.5,
3 soda: 0.75,
4 "hash browns": 1.8,
5 // ...
6};
These generalized arrays are known as Dictionaries, or Associative Arrays, Maps, and Hash Tables are a good way to implement them, because they generalize the simpler notion of an ordinary array[1].
“An ordinary array data type is a structure that contains an ordered collection of data elements in which each element can be referenced by its ordinal position in the collection"[2].
“Directly addressing into an ordinary array makes effective use of our ability to examine an arbitrary position in an array in $O(1)$ time. We can take advantage of direct addressing when we can afford to allocate an array that has one position for every possible key"[3].
Keys and their Entries or Values
The purpose of dictionaries is to store complex data and access it using a key derived from the data.
We assume that the keys in a dictionary are unique, meaning that each key maps to at most one data item.
The data in the dictionary is referred to as either an entry ($e$) when the key is part of the data (e.g., the person’s id number), or value when it’s not.
We aim to add as many keys as necessary to the dictionary, using more than just simple sequential numbers (i.e., strings or non-sequential numbers), always maintaining the property that each key maps to only one data item[1].
Chain, or Direct-Address Table
Let’s suppose an application that needs a dynamic set of keys. We have a universe of keys $U = \lbrace 0, 1, …, m-1 \rbrace$ with $m$ not being too large, where the possible keys for the dynamic set will be drawn from.
By default all the slots in the Direct-Address Table Linked List (it could be an array as well) are null. We denote this Linked List as $T[0 .. m-1]$. This Linked List will hold in each slot the key from the universe $U$. However, we don’t expect to hold all the keys in the universe $U$, but a subset of actual used keys $K$, such as $K \subset U$. Of course, some applications might decide not to store the key but the actual data in the slot.
If the Linked List is unordered, we might take $O(n)$ steps to find the $n^{\text{th}}$ element as we need to perform a linear search. On the other hand, if the Linked List is ordered we might improve the search with a binary search with an average $O(\log(n))$ time[3].
Hashing
What happens if $U$ us extremely large? Storing a Linked List $T$ of size $|U|$ is impractical, and impossible given certain physical constraints. On the other hand the set $K$ of keys actually used may be too small in comparison to $U$. We might end wasting allocated space. Storing the set $K$ of keys requires $\Theta(|K|)$ space while we maintain the benefit of searching for an element in only $O(1)$ time[3].
In the Linked List we store the element in the slot $k$, whereas in the hash table it’s stored in the slot $h(k)$. That means we use a hash function $h$ to determine the slot from the element $e_k$. Thus, we say $h$ maps the universe of keys $U$ into the slots of a hash table $T[0..m-1]$:
$$ h:U \mapsto \lbrace 0, 1, ..., m-1 \rbrace $$where $m$ (i.e., the size of the hash table $T$) is considerably much less than $|U|$. Then, we say an element $e_k$ hashes to slot $h(k)$. One of the effects of taking the keys from a Linked List or Array into a Hash Table is that the range of indices is reduced and hence its size. We take an array of possible $|U|$ size to one with size $m$[3].
There is still one catch with this technique though. We might have a situation where two or more keys maps to the same slot. We call this situation a collision.
Ideally, we would like to guarantee a hash function that avoids collisions at all. What would an ideal hash function look like? It must guarantee three key properties:
Preimage Resistance or One-Wayness
Simple as it is: hash functions need to be one-way.
“Given a hash output $z$ it must be computationally infeasible to find an input message $k$ such that $z = h(k)$“[4].
Second Preimage Resistance or Weak Collision Resistance
Two different messages do not hash to the same value.
“This means it should be computationally infeasible to create two different messages $k_1 \neq k_2$ with equal hash values $z_1 = h(k_1) = h(k_2) = z_2$“[4].
Can we prevent this? No. Unfortunatelly, this is impossible due to the Dirichlet’s drawer principle, or the pigeonhole principle (i.e., suppose you owe 100 pigeons, but your pigeon loop only has 99 holes, which make at least two pigeons to share one hole). As weak collisions theoretically exist the best we can do is reducing their appearance in practice.
Collision Resistance or Strong Collision Resistance
We call a hash function strong collision resistance if…
“it is computationally infeasible to find two different inputs $k_1 \neq k_2$ with $h(k_1) = h(k_2)$“[4].
This one is harder than the second preimage resistance to achieve. One way to achieve it is by increasing the length of the output.
Separate Chaining
We can effectively still resolve the issue created by collisions by using another level of chaining. Let’s assume a situation that violates the Strong Collision Resistance, $h(k_1) = i = h(k_2)$ with $k_1 \neq k_2$.
We know that such keys will be stored in the same slot, but we can think of that slot (and all the slots in the first linked list) as linked lists by itself. So, we would store those entries in a second resolving Linked List in $O(1)$ time.
Nevertheless, how well would it perform to search for an element? Let’s dive deeper.
- We find the $i^{\text{th}}$ position as $i = h(k)$. This is $O(1)$ if we assume constant time to computhe the hash.
- We find the $T[i]$ position, which again is $O(1)$ time.
- We find the element in the slot’s Linked List.
How long would it take to find the element in such slot’s Linked List? We might end taking $O(n)$ in the worst-case if it has too many elements. Such scenario would be true if we map all the keys to the same slot.
Let’s define the load factor $\alpha$ for $T$ as the average number of elements stored in a chain[3].
$$ \alpha_T = \frac{n}{m} $$Worst-Case
If $n > m$, we are not performing better than a Linked List or an Array. The worst-case would be $\Theta(n)$ plus the time to compute the hash.
Average-Case
This will depend on how well $h$ distributes all the keys among the $m$ slots in $T$. If we assume a well distributed hash function we call it simple uniform hashing[3].
We might distinguish between the case when the element is in the hash table and when is not. However, this could be neglacted in favor of treat them as equals. Under this assumption, we can safely state that the successful and unsuccessful search is the same and it’s $\Theta(1 + \alpha)$ (Cormen[3] p. 264, demonstrates this). By assuming a proportional relation between the number of slots and elements we can safely say that $n = O(m)$, consequently having $\alpha = \frac{n}{m} = \frac{O(m)}{m} = O(1)$.
We can conclude then:
- Searching takes $O(1)$ on average-case.
- Inserting takes $O(1)$ on worst-case.
- Deletion takes $O(1)$ on worst-case on doubly linked lists.
Thus, all the operations can be taken down to $O(1)$ for dictionaries on average-case.
Hash Functions
A good hash function ensures the asssumption of uniform hashing: one function where a key is equally hashed to any of the $m$ slots in $T$. The case is that we rarely know how to check such a condition as we might not know the probaility distribution from where the keys are fetch from. Sometimes, we know it. For instance, consider the case where “the keys are random real numbers $k$ independently and uniformly distributed in the range $0 \leq k < 1$, then the hash function
$$ h(k) = \lfloor km \rfloor $$satisfies the condition of simple uniform hashing"[3].
Universal Hashing
Suppose an attack where the adversary has the possibility to choose the keys to be hashed by some fixed hash function, then they can choose $n$ keys that all hash to the same slot, yielding an average retrieval time of $\Theta(n)$. Any fixed hash function is prone to such worst-case behavior. The only effective way to improve the situation is to choose the hash function randomly in a way it is independent of the keys that are actually going to be stored. This is what Universal Hashing is, and can prove good average performance no matter which keys the adversary chooses[3].
What does it mean for $h$ to be random?
“Let $\mathcal{H}$ be a finite collection of hash functions that map a given universe $U$ of keys such as $U \mapsto \lbrace 0, 1, …, m-1 \rbrace$. Such a collection is said to be universal if for each pair of distinct keys $k, l \in U$, the number of hash functions $h \in \mathcal{H}$ for which $h(k) = h(l)$ is at most $\frac{|\mathcal{H}|}{m}$[3].
This means that, with a randomly chosen hash function $h$ from $\mathcal{H}$, the chance of getting a collision between distinct keys $k$ and $l$ is not greater than the chance $\frac{1}{m}$ of a collision if $h(k)$ and $h(l)$ were randomly and independently chosen from the set $\lbrace 0, 1, …, m-1 \rbrace$[3].
Designing the Universal Family of Hash Functions
A little of Number Theory will help us to come up with the construction of a Universal Class for Hash Functions.
Let’s imagine a number $p$ so large that every key $k$ is in the range $[0, p-1]$. Let’s denote $\mathbb{Z}^*_p = [1, p-1]$ and $\mathbb{Z}_p = [0, p-1]$.
We define the hash function $h_{ab}$ as
$$ h_{ab}(k) = ((ak+b) \mod p) \mod m $$where $a \in \mathbb{Z}^{*}_{p}$ and $b \in \mathbb{Z}_p$.
Thus, we can define the family of all the hash functions as
$$ \mathcal{H}_{pm} = \lbrace h_{ab} : a \in \mathbb{Z}^*_p, b \in \mathbb{Z}_p \rbrace $$where $h_{ab}$ is a hash function that maps $\mathbb{Z}_p \mapsto \mathbb{Z}_m$.
From this definition we can deduct that we have precisely $p-1$ choices for $a$ and $p$ for $b$ making $|\mathcal{H}_{pm}| = p(p-1)$ (i.e., the size of hash functions in that family of hash functions).
Let’s prove it by defining
$$ f_{ab}(x) = (ax + b) \mod p $$and applying the function to two values $k$ and $l$ where $k \neq l$
$$ r = f_{ab}(k) = (ak + b) \mod p \\ s = f_{ab}(l) = (al + b) \mod p \\ $$Let’s solve for both equations
$$ (ak + b) \equiv r \mod p \\ (al + b) \equiv s \mod p $$By substracting both equations, we get
$$ a(k-l) \equiv (r-s) \mod p $$Now, as $p$ is prime and $k \neq l$, this equation has only one solution for $a \in \mathbb{Z}_p$. Then,
$$ b \equiv (r - ak) \mod p $$so we have found the unique $a$ and $b$ such that $f_{ab}(k) = r$ and $f_{ab}(l) = s$. However, notice how if $a = 0$, then $r = s = b$, and that’s precisely why $a$ cannot be $0$[3].
$$ \begin{aligned} \left \lceil \frac{p}{m} \right \rceil - 1 &\leq \frac{p+m-1}{m}-1 \\ &= \frac{p-1}{m} \end{aligned} $$Therefore, the probability that distinct keys $k$ and $l$ collide is equal to the probability that $r \equiv s (\mod p)$ when $r$ and $s$ are randomly chosen as distinct values modulo $p$. For a given value of $r$, of the $p-1$ possible remaining values for $s$, the number of values $s$ such as $s \neq r$ and $s \equiv r (\mod p)$ is at most
$$ \frac{(p-1)/m}{(p-1)} = \frac{1}{m} $$The probability that $s$ collides with $r$ when reduced modulo $m$ is at most
$$ Pr \lbrace h_{ab}(k) = h_{ab}(l) \rbrace \leq \frac{1}{m} $$Therefore, for any pair of distinct values $k, l \in \mathbb{Z}_p$,
$$ \mathcal{H}_{pm} $$So that
is indeed universal[3].
The Implementation
The Hash Table Class
Our hash table implementation will have a default value of 32
null slots (i.e., $m_T = 32$) each one pointing to a singly linked list through an Array
for the first chaining layer. But it can be modified via its constructor to specify more or less. An interesting test would be to create a hash table with $m_T=3$ to get a high load factor and get to understand better how to deal with collisions.
The keys
dictionary will help us to get track of the keys in a faster way.
1import LinkedList from "../singly-linked-lists/LinkedList.js";
2
3const defaultHashTableSize = 32;
4
5export default class HashTable {
6 constructor(hashTableSize = defaultHashTableSize) {
7 this.buckets = Array(hashTableSize)
8 .fill(null)
9 .map(() => new LinkedList());
10
11 this.keys = {};
12 }
13}
This implementation will get a hash table with default size equal to 32 linked list slots. However you can always specify how many slots you want.
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const defaultHashTable = new HashTable();
4const biggerHashTable = new HashTable(64);
5const smallerHashTable = new HashTable(3);
6
7console.log("defaultHashTable size:", defaultHashTable.buckets.length);
8console.log("biggerHashTable size:", biggerHashTable.buckets.length);
9console.log("smallerHashTable size:", smallerHashTable.buckets.length);
defaultHashTable size: 32
biggerHashTable size: 64
smallerHashTable size: 3
hashTable.hash
The following is a simple way to hash the values for a given key: we will sum up all the characters’ codes of the key to calculate its hash via modulo $m_T$. Of course, you can always come up with better ways to achieve a hash to reduce the number of collisions.
1hash(key) {
2 const hash = Array.from(key).reduce(
3 (hashAcc, keySymbol) => hashAcc + keySymbol.charCodeAt(0),
4 0,
5 );
6
7 return hash % this.buckets.length;
8}
Here’s an example of how we can use this simple hash function:
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const hashTable = new HashTable();
4
5console.log("hash for key 'a':", hashTable.hash("a"));
6console.log("hash for key 'b':", hashTable.hash("b"));
7console.log("hash for key 'abc':", hashTable.hash("abc"));
hash for key 'a': 1
hash for key 'b': 2
hash for key 'abc': 6
hashTable.has
We determine if the hash table actually has a key via the helper dictionary we defined in the HashTable
class’ constructor.
1has(key) {
2 return Object.hasOwnProperty.call(this.keys, key);
3}
This is a simple one, let’s check how it works.
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const hashTable = new HashTable(3);
4
5hashTable.set("a", "sky");
6console.log("Has 'a'?:", hashTable.has("a"));
7console.log("Has 'x'?:", hashTable.has("x"));
Has 'a'?: true
Has 'x'?: false
hashTable.get
First, we locate the right bucket. As we get the key
, we can calculate its hash via the hash
function. Then, we retrieve the bucket using that hash. Once we get the bucket, we get the right node doing a normal linked list search through a custom callback search function where we return the node that matches with the key
. If it does exist, we return its value, otherwise, we return undefined
as it has not been set before.
1get(key) {
2 const bucketLinkedList = this.buckets[this.hash(key)];
3 const node = bucketLinkedList.find({
4 callback: (nodeValue) => nodeValue.key === key,
5 });
6
7 return node ? node.value.value : undefined;
8}
Let’s set a value for a key and retrieve it back. Also, let’s retrieve an undefined key.
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const hashTable = new HashTable(3);
4
5hashTable.set("a", "sky");
6
7const a = hashTable.get("a");
8const x = hashTable.get("x");
9
10console.log("'a' value:", a);
11console.log("'x' value:", x);
'a' value: sky
'x' value: undefined
hashTable.delete
To delete a key-value pair, we calculate the hash of the passed key
. First, we delete the reference from our auxiliary dictionary. Then, we proceed with the actual deletion of the key-value pair from our bucketed array of linked lists. To do that, we locate the right bucket via the key
hash we just calcualted. Then, we find the node with the actual value using the custom callback search function in the linked list. If it actually exists, we delete it from the linked list using the node value. Otherwise, we just return a null
indicating there was nothing to delete.
1delete(key) {
2 const keyHash = this.hash(key);
3 delete this.keys[key];
4 const bucketLinkedList = this.buckets[keyHash];
5 const node = bucketLinkedList.find({
6 callback: (nodeValue) => nodeValue.key === key,
7 });
8
9 if (node) {
10 return bucketLinkedList.delete(node.value);
11 }
12
13 return null;
14}
Let’s set a value for a key, then let’s delete it and try to get it after the deletion.
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const hashTable = new HashTable(3);
4
5hashTable.set("a", "sky");
6
7const a = hashTable.get("a");
8console.log("'a' value:", a);
9hashTable.delete("a");
10const aa = hashTable.get("a");
11console.log("'aa' value:", aa);
'a' value: sky
'aa' value: undefined
hashTable.getKeys
This method allows us to get all the keys stored in the hash table. This is no other than quickly getting the keys right from the auxiliary keys
dictionary.
1getKeys() {
2 return Object.keys(this.keys);
3}
Let’s set some key-values and let’s retrieve only the keys.
1import HashTable from "./src/data-structures/hash-table/HashTable.js";
2
3const hashTable = new HashTable(3);
4
5hashTable.set("a", "sky-old");
6hashTable.set("a", "sky");
7hashTable.set("b", "sea");
8hashTable.set("c", "earth");
9hashTable.set("d", "ocean");
10
11console.log("All the keys", hashTable.getKeys());
All the keys [ 'a', 'b', 'c', 'd' ]
hashTable.getValues
This method allows us to get all the values stored in the hash table. It goes through all the bucketed linked lists, and concatenates the values of each linked list producing a unique array of elements no matter what keys they belong to.
1getValues() {
2 return this.buckets.reduce((values, bucket) => {
3 const bucketValues = bucket
4 .toArray()
5 .map((linkedListNode) => linkedListNode.value.value);
6 return values.concat(bucketValues);
7 }, []);
8}
Finally, let’s get all the values for some set key-value pairs.
const hashTable = new HashTable(3);
hashTable.set("a", "sky-old");
hashTable.set("a", "sky");
hashTable.set("b", "sea");
hashTable.set("c", "earth");
hashTable.set("d", "ocean");
console.log("All the keys", hashTable.getValues());
All the keys [ 'earth', 'sky', 'ocean', 'sea' ]
Notice how sky
and ocean
are actually together as this hash table size is 3, so we have a collision for both a
and d
, so we store those values in the same linked list bucket.
Putting All Together
You can find the whole implementation with its tests at https://github.com/mesirendon/datastructures-and-algorithms-js/tree/main/src/data-structures/hash-table
Bibliography
- CS 15-122: Principles of Imperative Computation (Summer 2024) - Lecture 12. Hash Tables, 2024. [Online]. Available: https://www.cs.cmu.edu/~15122/handouts/lectures/12-hashing.pdf
- Ordinary array data type, 2024. [Online]. Available: https://www.ibm.com/docs/en/db2/11.5?topic=types-ordinary-array-data-type
- T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, third edition. Cambridge, MA, USA: MIT Press, 2009. [Online]. Available: https://books.google.com.co/books/about/Introduction_to_Algorithms_third_edition.html?id=i-bUBQAAQBAJ&redir_esc=y
- C. Paar and J. Pelzl, Understanding Cryptography. Berlin, Germany: Springer, 2009. [Online]. Available: https://books.google.com.co/books?id=f24wFELSzkoC&newbks=1&newbks_redir=0&printsec=frontcover&dq=understanding+cryptography+christof+paar&hl=en&redir_esc=y#v=onepage&q=understanding%20cryptography%20christof%20paar&f=false