Hash Tables

Hash tables are data structures that store key-value pairs and provide efficient lookup, insertion, and deletion operations. This module covers the fundamentals of hash tables, their implementations, collision resolution strategies, and common applications.

Introduction to Hash Tables

A hash table (also known as a hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Key Characteristics

  • Fast operations: Average case O(1) for search, insert, and delete operations
  • Key-value storage: Data is stored as key-value pairs
  • Unordered: Elements are not stored in a specific order
  • Dynamic size: Can grow or shrink as needed
  • Collision handling: Must handle cases where different keys hash to the same index

Time and Space Complexity

OperationAverage CaseWorst Case
SearchO(1)O(n)
InsertO(1)O(n)
DeleteO(1)O(n)
SpaceO(n)O(n)

Note: The worst-case time complexity occurs when there are many collisions, causing the hash table to degenerate into a linked list.

Hash Functions

A hash function is a function that maps data of arbitrary size to fixed-size values. In the context of hash tables, it converts keys into array indices.

Properties of a Good Hash Function

  • Deterministic: The same key should always produce the same hash value
  • Efficient: It should be fast to compute
  • Uniform distribution: It should distribute keys uniformly across the hash table to minimize collisions
  • Low collision rate: Different keys should hash to different indices as much as possible

Common Hash Functions

1. Division Method

The simplest hash function is the division method, which takes the remainder of the key divided by the table size.

// Division method hash function
int hash(int key, int tableSize) {
    return key % tableSize;
}

2. Multiplication Method

The multiplication method multiplies the key by a constant, takes the fractional part, and scales it to the table size.

// Multiplication method hash function
int hash(int key, int tableSize) {
    double A = 0.6180339887; // (sqrt(5) - 1) / 2
    double val = key * A;
    val = val - Math.floor(val); // Extract fractional part
    return (int)(tableSize * val);
}

3. Universal Hashing

Universal hashing uses a randomly selected hash function from a family of hash functions to minimize the probability of collisions.

4. String Hashing

For string keys, a common approach is to use polynomial hashing.

// String hash function
int hash(String key, int tableSize) {
    int hash = 0;
    for (int i = 0; i < key.length(); i++) {
        hash = (hash * 31 + key.charAt(i)) % tableSize;
    }
    return hash;
}

Collision Resolution

A collision occurs when two different keys hash to the same index. There are several strategies to handle collisions in hash tables.

1. Separate Chaining

In separate chaining, each bucket in the hash table contains a linked list (or another data structure) of all key-value pairs that hash to the same index.

Advantages:

  • Simple to implement
  • Hash table never fills up
  • Less sensitive to the hash function or load factors

Disadvantages:

  • Requires additional memory for linked list pointers
  • Cache performance can be poor due to linked list traversal
  • If many keys hash to the same index, performance degrades to O(n)
// Separate chaining implementation
class HashTable<K, V> {
    private static class Node<K, V> {
        K key;
        V value;
        Node<K, V> next;
        
        Node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }
    
    private Node<K, V>[] buckets;
    private int size;
    private int capacity;
    
    @SuppressWarnings("unchecked")
    public HashTable(int capacity) {
        this.capacity = capacity;
        this.buckets = (Node<K, V>[]) new Node[capacity];
        this.size = 0;
    }
    
    private int hash(K key) {
        return Math.abs(key.hashCode()) % capacity;
    }
    
    public void put(K key, V value) {
        int index = hash(key);
        Node<K, V> head = buckets[index];
        
        // Check if key already exists
        Node<K, V> current = head;
        while (current != null) {
            if (current.key.equals(key)) {
                current.value = value; // Update value
                return;
            }
            current = current.next;
        }
        
        // Key doesn't exist, add new node at the beginning of the list
        Node<K, V> newNode = new Node<>(key, value);
        newNode.next = head;
        buckets[index] = newNode;
        size++;
    }
    
    public V get(K key) {
        int index = hash(key);
        Node<K, V> current = buckets[index];
        
        while (current != null) {
            if (current.key.equals(key)) {
                return current.value;
            }
            current = current.next;
        }
        
        return null; // Key not found
    }
    
    public void remove(K key) {
        int index = hash(key);
        Node<K, V> head = buckets[index];
        
        // If bucket is empty
        if (head == null) {
            return;
        }
        
        // If key is in the first node
        if (head.key.equals(key)) {
            buckets[index] = head.next;
            size--;
            return;
        }
        
        // Search for the key in the rest of the list
        Node<K, V> prev = head;
        Node<K, V> current = head.next;
        
        while (current != null) {
            if (current.key.equals(key)) {
                prev.next = current.next;
                size--;
                return;
            }
            prev = current;
            current = current.next;
        }
    }
}

2. Open Addressing

In open addressing, all key-value pairs are stored in the hash table itself. When a collision occurs, the algorithm searches for the next available slot according to a probing sequence.

a. Linear Probing

In linear probing, if a collision occurs at index i, the algorithm checks index i+1, then i+2, and so on, until an empty slot is found.

// Linear probing hash function
int linearProbe(int key, int i, int tableSize) {
    return (hash(key) + i) % tableSize;
}

b. Quadratic Probing

In quadratic probing, if a collision occurs at index i, the algorithm checks index i+1², then i+2², and so on.

// Quadratic probing hash function
int quadraticProbe(int key, int i, int tableSize) {
    return (hash(key) + i*i) % tableSize;
}

c. Double Hashing

Double hashing uses a second hash function to determine the step size for probing.

// Double hashing
int doubleHash(int key, int i, int tableSize) {
    int hash1 = hash1(key);
    int hash2 = hash2(key);
    return (hash1 + i * hash2) % tableSize;
}

Advantages:

  • Better cache performance
  • No extra memory for pointers

Disadvantages:

  • More sensitive to the hash function and load factor
  • Can suffer from clustering (especially with linear probing)
  • Requires a good probing strategy
  • Table can fill up, requiring resizing

Implementation

Hash Table with Open Addressing in Java

// Hash Table implementation with linear probing
class HashTable<K, V> {
    private static class Entry<K, V> {
        K key;
        V value;
        boolean isDeleted;
        
        Entry(K key, V value) {
            this.key = key;
            this.value = value;
            this.isDeleted = false;
        }
    }
    
    private Entry<K, V>[] table;
    private int size;
    private int capacity;
    private final double loadFactor = 0.75;
    
    @SuppressWarnings("unchecked")
    public HashTable(int capacity) {
        this.capacity = capacity;
        this.table = (Entry<K, V>[]) new Entry[capacity];
        this.size = 0;
    }
    
    private int hash(K key) {
        return Math.abs(key.hashCode()) % capacity;
    }
    
    public void put(K key, V value) {
        if (size >= capacity * loadFactor) {
            resize();
        }
        
        int index = hash(key);
        int i = 0;
        
        while (i < capacity) {
            int j = (index + i) % capacity; // Linear probing
            
            if (table[j] == null || table[j].isDeleted) {
                table[j] = new Entry<>(key, value);
                size++;
                return;
            } else if (table[j].key.equals(key)) {
                table[j].value = value; // Update value
                return;
            }
            
            i++;
        }
    }
    
    public V get(K key) {
        int index = hash(key);
        int i = 0;
        
        while (i < capacity) {
            int j = (index + i) % capacity;
            
            if (table[j] == null) {
                return null; // Key not found
            } else if (!table[j].isDeleted && table[j].key.equals(key)) {
                return table[j].value;
            }
            
            i++;
        }
        
        return null; // Key not found
    }
    
    public void remove(K key) {
        int index = hash(key);
        int i = 0;
        
        while (i < capacity) {
            int j = (index + i) % capacity;
            
            if (table[j] == null) {
                return; // Key not found
            } else if (!table[j].isDeleted && table[j].key.equals(key)) {
                table[j].isDeleted = true;
                size--;
                return;
            }
            
            i++;
        }
    }
    
    @SuppressWarnings("unchecked")
    private void resize() {
        int oldCapacity = capacity;
        capacity = capacity * 2;
        Entry<K, V>[] oldTable = table;
        table = (Entry<K, V>[]) new Entry[capacity];
        size = 0;
        
        for (int i = 0; i < oldCapacity; i++) {
            if (oldTable[i] != null && !oldTable[i].isDeleted) {
                put(oldTable[i].key, oldTable[i].value);
            }
        }
    }
}

Hash Table in Python

# Hash Table implementation with separate chaining
class HashTable:
    class Node:
        def __init__(self, key, value):
            self.key = key
            self.value = value
            self.next = None
    
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.size = 0
        self.buckets = [None] * capacity
    
    def _hash(self, key):
        return hash(key) % self.capacity
    
    def put(self, key, value):
        index = self._hash(key)
        
        # If bucket is empty
        if self.buckets[index] is None:
            self.buckets[index] = self.Node(key, value)
            self.size += 1
            return
        
        # If key already exists, update value
        current = self.buckets[index]
        while current:
            if current.key == key:
                current.value = value
                return
            if current.next is None:
                break
            current = current.next
        
        # Key doesn't exist, add to the end of the list
        current.next = self.Node(key, value)
        self.size += 1
    
    def get(self, key):
        index = self._hash(key)
        current = self.buckets[index]
        
        while current:
            if current.key == key:
                return current.value
            current = current.next
        
        return None  # Key not found
    
    def remove(self, key):
        index = self._hash(key)
        current = self.buckets[index]
        prev = None
        
        while current:
            if current.key == key:
                # If it's the first node in the bucket
                if prev is None:
                    self.buckets[index] = current.next
                else:
                    prev.next = current.next
                self.size -= 1
                return
            prev = current
            current = current.next

Applications

Hash tables are widely used in various applications due to their efficient lookup, insertion, and deletion operations.

1. Database Indexing

Hash tables are used to create indexes in databases for fast data retrieval. They allow for O(1) lookup time for exact match queries.

2. Caching

Hash tables are used in caching systems to store and retrieve data quickly. Examples include browser caches, DNS caches, and web server caches.

3. Symbol Tables in Compilers

Compilers use hash tables to store information about variables, functions, and other identifiers during the compilation process.

4. Associative Arrays

Hash tables are used to implement associative arrays or dictionaries in programming languages, such as Python's dict, Java's HashMap, and JavaScript's Object.

5. Counting Frequencies

Hash tables are used to count the frequency of elements in a collection, such as counting the occurrences of words in a document.

// Count word frequencies in a document
Map<String, Integer> wordFrequency = new HashMap<>();

for (String word : document.split("\s+")) {
    word = word.toLowerCase().replaceAll("[^a-zA-Z]", "");
    if (!word.isEmpty()) {
        wordFrequency.put(word, wordFrequency.getOrDefault(word, 0) + 1);
    }
}

6. De-duplication

Hash tables are used to remove duplicates from a collection by storing unique elements as keys.

// Remove duplicates from an array
public int[] removeDuplicates(int[] nums) {
    Set<Integer> set = new HashSet<>();
    for (int num : nums) {
        set.add(num);
    }
    
    int[] result = new int[set.size()];
    int i = 0;
    for (int num : set) {
        result[i++] = num;
    }
    
    return result;
}

Practice Problems

Apply your knowledge of hash tables by solving these carefully selected problems:

1. Two Sum

Given an array of integers and a target sum, return indices of the two numbers that add up to the target.

EasyHash TableArray

2. Valid Anagram

Given two strings, determine if one is an anagram of the other.

EasyHash TableString

3. Group Anagrams

Given an array of strings, group anagrams together.

MediumHash TableString

4. LRU Cache

Design and implement a data structure for Least Recently Used (LRU) cache.

MediumHash TableLinked List