ex40.tex

\chapter{Exercise 40: Binary Search Trees}

The binary tree is the simplest tree based data structure and while it has been
replaced by Hash Maps in many languages is still useful for many applications.
Variants on the binary tree exist for very useful things like database indexes,
search algorithm structures, and even graphics processing.

I'm calling my binary tree a \ident{BSTree} for "binary search tree" and the best
way to describe it is that it's another way to do a \ident{Hashmap} style key/value
store.  The difference is that instead of hashing the key to find a location,
the \ident{BSTree} compares the key to nodes in a tree, and then walks the
tree to find the best place to store it based on how it compares to other nodes.

Before I really explain how this works, let me show you the \file{bstree.h} header
file so you can see the data structures, then I can use that to explain how
it's built.

\begin{code}{src/lcthw/bstree.h}
<< d['code/liblcthw/src/lcthw/bstree.h|pyg|l'] >>
\end{code}

This follows the same pattern I've been using this whole time where I have a 
base "container" named \ident{BSTree} and that then has nodes names \ident{BSTreeNode} that make up the actual contents.  Bored yet?  Good, there's no reason to
be clever with this kind of structure.

The important part is how the \ident{BSTreeNode} is configured and how it gets
used to do each operation: set, get, and delete.  I'll cover get first since
it's the easiest operation and I'll pretend I'm doing it manually against
the data structure:

\begin{enumerate}
\item I take the key you're looking for and I start at the root.  First thing
    I do is compare your key with that node's key.
\item If your key is less-than the \ident{node.key}, then I traverse down the tree using
    the \ident{left} pointer.
\item If your key is greater-than the \ident{node.key}, then I go down with \ident{right}.
\item I repeat step 2 and 3 until either I find a matching node.key, or I get to
    a node that has no left and right.  In the first case I return the \ident{node.data}, in 
    the second I return \ident{NULL}.
\end{enumerate}

That's all there is to \ident{get}, so now to do \ident{set} it's nearly the
same thing, except you're looking for where to put a new node:

\begin{enumerate}
\item If there is no \ident{BSTree.root} then I just make that and we're done.  That's the first node.
\item After that I compare your key to \ident{node.key}, starting at the root.
\item If your key is less-than or equal to the \ident{node.key} then I want to 
    go left.  If your key is greater-than (not equal) then I want to go right.
\item I keep repeating 3 until I reach a node where the left or right doesn't
    exist, but that's the direction I need to go.
\item Once there I set that direction (left or right) to a new node for the
    key and data I want, and set this new node's parent to the previous node
    I came from.  I'll use the parent node when I do delete.
\end{enumerate}

This also makes sense given how get works.  If finding a node involves going
left or right depending on how they key compares, well then setting a node 
involves the same thing until I can set the left or right for a new node.

Take some time to draw out a few trees on paper and go through some setting
and getting nodes so you understand how it work.  After that you are ready
to look at the implementation so that I can explain delete.  Deleting in
trees is a \emph{major} pain, and so it's best explained by doing a 
line-by-line code breakdown.


\begin{code}{src/lcthw/bstree.c}
<< d['code/liblcthw/src/lcthw/bstree.c|pyg|l'] >>
\end{code}

Before getting into how \ident{BSTree\_delete} works I want to explain a
pattern I'm using for doing recursive function calls in a sane way.
You'll find that many tree based data structures are easy to write
if you use recursion, but that forumlating a single recursive function
is difficult.  Part of the problem is that you need to setup some
initial data for the first operation, \emph{then} recurse into 
the data structure, which is hard to do with one function.

The solution is to use two functions.  One function "sets up" the 
data structure and initial recursion conditions so that a second
function can do the real work.   Take a look at \ident{BSTree\_get}
first to see what I mean:

\begin{enumerate}
\item I have an initial condition to handle that if \verb|map->root|
    is \ident{NULL} then return \ident{NULL} and don't recurse.
\item I then setup a call to the real recursion, which is in
    \ident{BSTree\_getnode}.  I create the initial condition of
    the root node to start with, the key, and the \ident{map}.
\item In the \ident{BSTree\_getnode} then I do the actual recursive
    logic.  I compare the keys with \verb|map->compare(node->key, key)|
    and go left, right, or equal depending on that.
\item Since this function is "self-similar" and doesn't have to handle
    any initial conditions (because \ident{BSTree\_get} did) then
    I can structure it verys simply.  When it's done it returns
    to the caller, and that return then comes back to \ident{BSTree\_get}
    the result.
\item At the end, the \ident{BSTree\_get} then handles getting the
    \ident{node.data} element but only if the result isn't \ident{NULL}.
\end{enumerate}

This way of structuring a recursive algorithm matches the way I structure
my recursive data structures.  I have an initial "base function" that handles
initial conditions and some edge cases, then it calls a clean recursive
function that does the work.  Compare that with how I have a "base struct"
in \ident{BStree} combined with recursive \ident{BSTreeNode} structures
that all reference each other in a tree.  Using this pattern makes it
easy to deal with recursion and keep it straight.

Next, go look at \ident{BSTree\_set} and \ident{BSTree\_setnode} to see
the exact same pattern going on.  I use \ident{BSTree\_set} to configure
the initial conditions and edge cases.  A common edge case is that there's
no root node, so I have to make one to get things started.

This pattern will work with nearly any recursive algorithm you have to figure
out.  The way I do this is I follow this pattern:

\begin{enumerate}
\item Figure out the initial variables, how they change, and what the stopping
    conditions are for each recursive step.
\item Write a recursive function that calls itself, with arguments for 
    each stopping condition and initial variable.
\item Write a setup function to set initial starting conditions for the
    algorithm and handle edge cases, then it calls the recursive function.
\item Finally, the setup function returns the final result and possibly
    alters it if the recursive function can't handle final edge cases.
\end{enumerate}

Which leads me finally to \ident{BSTree\_delete} and \ident{BSTree\_node\_delete}.
First you can just look at \ident{BSTree\_delete} and see that it's the setup 
function, and what it is doing is grabbing the resulting node data and freeing
the node that's found.  In \ident{BSTree\_node\_delete} is where things
get complex because to delete a node at any point in the tree, I have to
\emph{rotate} that node's children up to the parent.  I'll break this
function down and the ones it uses:

\begin{description}
\item[bstree.c:190] I run the compare function to figure out which direction I'm going.
\item[bstree.c:192-198] This is the usual "less-than" branch where I want to go left.
    I'm handling the case that left doesn't exist here and returning \ident{NULL}
    to say "not found".  This handles deleting something that isn't in the 
    \ident{BSTree}.
\item[bstree.c:199-205] The same thing but for the right branch of the tree.  Just keep
    recursing down into the tree just like in the other functions, and return
    \ident{NULL} if it doesn't exist.
\item[bstree.c:206] This is where I have found the node since the key is equal (\ident{compare}
        return 0).
\item[bstree.c:207] This node has both a \ident{left} and \ident{right} branch, so it's
    deeply embedded in the tree.
\item[bstree.c:209] To remove this node I need to first find the smallest node that's
    greater than this node, which means I call \ident{BSTree\_find\_min} on
    the right child.
\item[bstree.c:210] Once I have this node, I will do a swap on its \ident{key} and
    \ident{data} with the current node's.  This will effectively take this
    node that was down at the bottom of the tree, and put it's contents
    here so that I don't have to try and shuffle this node out by its
    pointers.
\item[bstree.c:214] The \ident{successor} is now this dead branch that has the current
    node's values.  It could just be removed, but there's a chance that it has
    a right node value, which means I need to do a single rotate so that
    the successor's right node gets moved up to completely detach it.
\item[bstree.c:217] At this point, the successor is removed from the tree, its values
    replaced the current node's values, and any children it had are moved up
    into the parent.  I can return the \ident{successor} as if it were the
    \ident{node}.
\item[bstree.c:218] At this branch I know that the node has a left but no right, so
    I want to replace this node with its left child.
\item[bstree.c:219] I again use \ident{BSTree\_replace\_node\_in\_parent} to do the
    replace, rotating the left child up.
\item[bstree.c:220] This branch of the if-statement means I have a right child but 
    no left child, so I want to rotate the right child up.
\item[bstree.c:221] Again, use the function to do the rotate, but this time of the
    right node.
\item[bstree.c:222] Finally, the only thing that's left is the condition
    that I've found the node, and it has no children (no left or right).
    In this case, I simply replace this node with NULL using the same
    function I did with all the others.
\item[bstree.c:210] After all that, I have the current node rotated out of the tree
    and replaced with some child element that will fit in the tree.  I just
    return this to the caller so it can be freed and managed.
\end{description}

This operation is very complex, and to be honest, in some tree data structures
I just don't bother doing deletes and treat them like constant data in my 
software.  If I need to do heavy insert and delete, I use a \ident{Hashmap}
instead.

Finally, you can look at the unit test to see how I'm testing it:

\begin{code}{tests/bstree\_tests.c}
<< d['code/liblcthw/tests/bstree_tests.c|pyg|l'] >>
\end{code}

I'll point you at the \ident{test\_fuzzing} function, which is an interesting
technique for testing complex data structures.  It is difficult to create a
set of keys that cover all the branches in \ident{BSTree\_node\_delete}, and
chances are I would miss some edge case.  A better way is to create a "fuzz"
function that does all the operations, but does them in as horrible and random
a way as possible.  In this case I'm inserting a set of random string keys,
then I'm deleting them and trying to get the rest after each delete.

Doing this prevents the situation where you test only what you know to work,
which means you'll miss things you don't know.  By throwing random junk at
your data structures you'll hit things you didn't expect and work out any
bugs you have.


\section{How To Improve It}

Do \emph{not} do any of these yet since in the next exercise I'll be using this
unit test to teach you some more performance tuning tricks.  You'll come back and do
these after you do Exercise 41.

\begin{enumerate}
\item As usual, you should go through all the defensive programming checks and add
    \ident{assert}s for conditions that shouldn't happen.  For example, you should
    not be getting \ident{NULL} values for the recursion functions, so assert that.
\item The traverse function traverses the tree in order by traversing left, then right,
    then the current node.  You can create traverse functions for reverse order as well.
\item It does a full string compare on every node, but I could use the \ident{Hashmap}
    hashing functions to speed this up.  I could hash the keys, then keep the hash in
    the \ident{BSTreeNode}.  Then in each of the set up functions I can hash the 
    key ahead of time, and pass it down to the recursive function.  Using this hash I can
    then compare each node much quicker similar to I do in \ident{Hashmap}.
\end{enumerate}

\section{Extra Credit}

Again, do \emph{not} do these yet, wait until Exercise 41 when you can use performance
tuning features of Valgrind to do them.

\begin{enumerate}
\item There's an alternative way to do this data structure without using recursion.  The Wikipedia
    page shows alternatives that don't use recursion but do the same thing.  Why would this 
    be better or worse?
\item Read up on all the different similar trees you can find. There's AVL trees, Red-Black trees,
    and some non-tree structures like Skip Lists.
\end{enumerate}