master-algorithms-py/hash_objects at master · autistic-symposium/master-algorithms-py

Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
hash_map_array.py	hash_map_array.py
hash_set_bst.py	hash_set_bst.py
hash_set_linked_list.py	hash_set_linked_list.py

hash objects

hash tables (also known as hash maps) are data structure that organizes data using hash functions, in order to support quick insertion and search (map keys to buckets).

collision

ideally, a perfect hash function will be a 1-1 mapping between the key and the buckets. however, in most cases, a hash function is not perfect and there is a tradeoff between the amount of buckets and the capacity of a bucket. if the hash function is not a 1-1 mapping, collisions must be handled:
- how to organize the values in the same bucket?
- what if too many values are assigned to the same bucket?
- how to search for a target value in a specific bucket?
popular collision techniques are:
- separate chaining: a linked list is used for each value, so that it stores all the collided items.
- open addressing: all entry records are stored in the bucket array itself. when a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.

keys

when the order of each element in the string/array doesn't matter, you can use the sorted string/array to calculate a key.
if you only care about the offset of each value, you can use it as the key.
in a tree, you might want to directly use the Node() class as key sometimes, but in general, the serialization of the subtree might be a better idea.
in a matrix, you might want to use the row index or the column index as key. sometimes you might want to aggregate the values in the same diagonal line.

implementing a hash set

the difference between a hash set and a hash map is that the set can never have repeated elements.
to implement a hash set data structure, you need to implement:
- a hash function (to assign an address to store a given value), and
- a collision handling (since the nature of a hash function is to map a value from a space A to a corresponding smaller space B).
overall, there are several strategies to resolve the collisions:
- separate chaining: for value with the same hash key, we keep them in a bucket, and each bucket is independent of each other.
- open addressing: whenever there is a collision, we keep on probing the main space with certain strategy until a free slot is found.
- 2-choices hashing: we use two hash functions rather than one, and pick the generated address with fewer collisions.
if we were to implement separate chaining, the primary storage underneath a hashset is a continuous memory as array, where each element in this array corresponds to a bucket that stores the actual values.
- given a value, first we generate a key for the value via the hash function (the generated key serves as the index to locate the bucket).
- once the bucket is located, we then perform the desired operation on the bucket (such as add, remove, and contain).
- use a prime number as the base of the module to reduce collisions.

class HashSet:

    def __init__(self, size):
        self.size = size
        self.bucket = [Bucket() for _ in range(self.size)]

    def _get_hash_key(self, key):
        return key % self.size

    def add(self, element: int):
        bucket_index = self._get_hash_key(element)
        self.bucket[bucket_index].insert(element)
    
    def remove(self, element: int)L
        bucket_index = self._get_hash_key(element)
        self.bucket[bucket_index].delete(element)

    def contains(self, element: int:
        bucket_index = self._get_hash_key(element)
        return self.bucket[bucket_index].exists(element)

buckets as linked lists

a good choice for buckets are linked lists, as their time complexity for insert and delete is constant (once the position to be updated is located).
- you just need to be sure you never insert repeated elements.
time complexity for search is O(N/K) where N is the number of all possible values and K is the number of predefined buckets (the average size of bucket is N/K).
space complexity is O(K+M), where K is the number of predefined buckets, and M is the number of unique values that have been inserted in the HashSet.
lastly, to optimize search, we could maintain the buckets as sorted lists (and obtain O(log(N)) time complexity for the lookup operation).
- however, insert and delete are linear time (as elements would need to be shifted).

class Node:
    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next


class Bucket:
    def __init__(self):
        self.head = Node(0)

    def insert(self, value):
        if not self.exists(value):
            self.head.next =  Node(value, self.head.next)
        else:
            print(f'node {value} already exists')
        
    def delete(self, value):
        prev = self.head
        current = self.head.next
        while current is not None:
            if current.value == value:
                prev.next = current.next
                return True
            prev = current
            current = current.next
        return False
    
    def exists(self, value):
        current = self.head.next
        while current is not None:
            if current.value == value:
                return True
            current = current.next
        return False

buckets as binary search trees

another option for a bucket is a binary search tree, with O(log(N)) time complexity for search, insert, and delete. in addition, bst can not hold repeated elements, just like sets.
time complexity for search is O(log(N/K), where N is the number of all possible values and K is the number of predefined buckets.
space complexity is O(K+M) where K is the number of predefined buckets, and M is the number of unique values in the hash set.

class Node:
    def __init__(self, value=None):
        self.val = value
        self.left = None
        self.right = None


class Bucket:
    def __init__(self):
        self.tree = BSTree()

    def insert(self, value):
        self.tree.root = self.tree.insert(self.tree.root, value)
        
    def delete(self, value):
        self.tree.root = self.tree.delete(self.tree.root, value)
    
    def exists(self, value):
        return (self.tree.search(self.tree.root, value) is not None)


class BSTree():
    def __init__(self):
        self.root = None

    def search(self, root, value) -> Node:
        if root is None or value == root.val:
            return root

        return self.search(root.left, value) if val < root.value \
            else self.search(root.right, value)
    
    def insert(self, root, value) -> Node:
        if not root:
            return Node(value)
        
        if value > root.val:
            root.right = self.insert(root.right, value)
        elif value == root.val:
            return root
        else:
            root.left = self.insert(root.left, value)
    
    def successor(self, root):
        # one step right and then all left
        root = root.right
        while root.left:
            root = root.left
        return root.value
    
    def predecessor(self, root):
        # one step left and then always right
        root = root.left
        while root.right:
            root = root.right
        return root.value

    def delete(self, root, key) -> Node:
        if not root:
            return None

        if key > root.val:
            root.right = self.delete(root.right, key)
        
        elif key < root.val:
            root.left = self.delete(root.left, key)
        
        else:
            if not (root.left or root.right):
                root = None
            elif root.right:
                root.val = self.sucessor(root)
                root.right = self.delete(root.right, root.val)
            else:
                root.val = self.predecessor(root)
                root.left = self.delete(root.left, root.val)
        
        return root

implementing a hash map

same as before, we need to tackle two main issues: hash function design and collision handling.
a good approach is using a module function with an array or linked list. at this time, there is no constraint for repeated numbers.

class Bucket:

    def __init__(self):
        self.bucket = []

    def get(self, key):
        for (k, v) in self.bucket:
            if k == key:
                return v
        return -1
    
    def put(self, key, value):
        found = False
        for i, k in enumerate(self.bucket):
            if key == k[0]:
                self.bucket[i] = (key, value)
                found = True
                break
        if not found:
            self.bucket.append((key, value))

    def remove(self, key):
        for i, k in enumerate(self.bucket):
            if key == k[0]:
                del self.bucket[i]

del is an O(N) operation, as we would copy all the i elements. to make it O(1) we could swap the element we want to remove with the last element in the bucket.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hash_objects

hash_objects

README.md

hash objects

collision

keys

implementing a hash set

buckets as linked lists

buckets as binary search trees

implementing a hash map

Files

hash_objects

Directory actions

More options

Directory actions

More options

Latest commit

History

hash_objects

Folders and files

parent directory

README.md

hash objects

collision

keys

implementing a hash set

buckets as linked lists

buckets as binary search trees

implementing a hash map