\documentclass[12pt]{article}
\usepackage{html}
\latex{\usepackage{fullpage}}
\usepackage{/u/mlittman/papers/tex/newalg}
\usepackage{epsfig}
\newcommand{\ul}{\begin{itemize}}
\newcommand{\li}{\item}
\newcommand{\el}{\end{itemize}}
\renewcommand{\ni}{\noindent}
\newcommand{\href}[2]{\htmladdnormallink{#2}{#1}}
\begin{document}
\title{Analysis of Hashing (12)}
\author{Michael L. Littman}
\date{October 21st, 1997}
\maketitle
\section{REVIEW}
\subsection{Announcements}
\ni Returning exams today.
\ni Next time: stable marriage or string matching?
\subsection{Hashing Idea}
\ni Direct-address table gives us fast insertions, deletions, and
finds (store item with key $k$ in entry $T[k]$).
\ni Hashing emulates direct-address table when the table is much
smaller than the size of the universe of keys. Key $k$ is stored in
entry $T[h(k)]$.
\ni Need to cope with {\em collisions\/}: chaining (linked list), open
addressing (extended hash function and probing).
\ni Today: Analyze expected running time.
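To make the review concrete, here is a minimal chained hash table in Python (my sketch, not part of the lecture); \verb|ChainedHashTable| and its method names are illustrative, and it uses Python's built-in \verb|hash| reduced mod $m$ as the hash function $h$:

```python
class ChainedHashTable:
    """Hash table with chaining: slot j holds a list of (key, value) pairs."""

    def __init__(self, m=97):
        self.m = m                        # number of slots
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        return hash(key) % self.m         # h(k): maps the universe into {0, ..., m-1}

    def insert(self, key, value):
        slot = self.table[self._h(key)]
        for i, (k, _) in enumerate(slot):
            if k == key:                  # key already present: overwrite
                slot[i] = (key, value)
                return
        slot.append((key, value))

    def find(self, key):
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None                       # key is not in the table

    def delete(self, key):
        slot = self.table[self._h(key)]
        self.table[self._h(key)] = [(k, v) for (k, v) in slot if k != key]
```

Insert, find, and delete all walk only the one chain the key hashes to, which is what the analysis below charges them for.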
\section{ANALYSIS}
\subsection{Worst Case}
\ni Let's say we insert $n$ distinct keys into an initially empty hash
table of size $m$ and then we want to do a find. What's the running
time for the find?
\ni Load factor is $\alpha = n/m$.
\ni Worst case for chaining?
\ni Worst case for open addressing?
\subsection{Uniform-Hashing Assumption}
\ni Clearly, hashing with a bad hash function is not terribly useful.
\ni So, let's assume we have a good one.
\ul
\li For chaining: $\Pr(h(k) = j) = 1/m$ (the probability that a key
ends up in slot $j$ is $1/m$ for all slots).
\li For open addressing: $\Pr(h'(k,i) = j) = 1/m$ (the probability
that a key ends up in slot $j$ on the $i$th probe is $1/m$ for all
slots).
\el
\subsection{Comments on the Assumption}
\ni Very odd assumption (even somewhat ill-defined).
\ni Stronger (less general, worse) than the assumption in randomized
quicksort. There, we showed that the randomization in the algorithm
made it so no particular input elicits worst-case behavior.
\ni Weaker (more general, better) than the assumption needed to show
non-randomized quicksort efficient. There, we'd need to assume that
the probability of a bad input (reversed list) was small.
\ni Here, we are putting a restriction on the {\em combination} of our
hash function and our inputs. The better our hash function is, the
less we care about the key sequence.
\subsection{Chaining Analysis}
\ni Expected number of steps to discover that a key $k$ is not in the
hash table:
\ul
\li We need to look at each item in the list $T[h(k)]$.
\li Under our uniform-hashing assumption, the $h(k)$ slot looks like
all the others.
\li What's the expected number of keys in a slot?
\li By ``linearity of expectation'', the expected number of keys in a
slot $j$ is equal to $\sum_{i=1}^n \Pr(\mbox{key\ } i \mbox{\ goes\
in\ slot\ } j) = \sum_{i=1}^n 1/m = n/m = \alpha$.
\el
\ni So, running time is proportional to the load factor: $\Theta(1+\alpha)$.
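The $\Theta(1+\alpha)$ claim is easy to check by simulation. This sketch (mine, not from the lecture) models the uniform-hashing assumption directly by assigning each key a uniformly random slot, then measures the chain an unsuccessful find would scan:

```python
import random

def average_chain_length(n, m, trials=200, seed=0):
    """Insert n keys into an m-slot chained table (each key gets a uniformly
    random slot), then return the average length of the chain that a fresh,
    absent key hashes to -- the work done by an unsuccessful find."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[rng.randrange(m)] += 1   # uniform-hashing assumption
        total += counts[rng.randrange(m)]   # chain scanned by a failed find
    return total / trials

# alpha = n/m = 5, so the linearity-of-expectation argument predicts
# an average chain length near 5
print(average_chain_length(n=500, m=100))
```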
\subsection{Open-Addressing Analysis}
\ni To discover that $k$ is not in the table, we need to keep probing
until we find an empty slot.
\ni In math, using the fact that for a nonnegative integer random
variable $X$, $E[X] = \sum_{l \ge 1} \Pr(X \ge l)$, and that probes
visit distinct slots (after $l-1$ misses, $n-l+1$ filled slots remain
among $m-l+1$ unprobed ones):
\begin{eqnarray*}
\lefteqn{\mbox{expected\ running\ time}}\\
&=& \mbox{expected\ probes\ until\ we\ hit\ an\ empty\ slot} \\
&=& \sum_{l=1}^{n+1} \Pr(\mbox{first\ $l-1$\ probes\ all\ hit\ filled\ slots}) \\
&=& \sum_{l=1}^{n+1} \frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-l+2}{m-l+2} \\
&\le& \sum_{l=1}^{\infty} \left(\frac{n}{m}\right)^{l-1} \\
&=& \sum_{l=0}^{\infty} \alpha^{l} \\
&=& 1/(1-\alpha)
\end{eqnarray*}
\ni (Each factor $(n-i)/(m-i) \le n/m = \alpha$, since $n \le m$.)
\subsection{In Words}
\ni As an upper bound, we assume that each probe hits a completely
random slot. This is an upper bound because probes actually visit
distinct slots, so each miss rules out one filled slot and the true
probability of hitting an empty slot only goes up.
$$\Pr(\mbox{collide}) = \frac{\mbox{filled\ slots}}{\mbox{slots}} =
\frac{n}{m} = \alpha.$$
\ni $T = 1 + \alpha T$ (look familiar?)
\ni $T = 1/(1-\alpha)$.
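A quick simulation (my sketch, assuming the random-permutation probe model above) confirms the $1/(1-\alpha)$ bound:

```python
import random

def avg_probes(n, m, trials=2000, seed=1):
    """Average number of probes an unsuccessful search makes before finding
    an empty slot, when the probe sequence is a uniformly random permutation
    of the m slots (the uniform-hashing assumption for open addressing)."""
    rng = random.Random(seed)
    filled = [True] * n + [False] * (m - n)
    total = 0
    for _ in range(trials):
        rng.shuffle(filled)              # random table layout = random probe order
        probes = 0
        for slot_is_full in filled:      # probe slots in shuffled order
            probes += 1
            if not slot_is_full:         # found an empty slot: search fails, stop
                break
        total += probes
    return total / trials

# alpha = 0.9: the bound says at most 1/(1 - 0.9) = 10 probes on average
print(avg_probes(n=90, m=100))
```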
\subsection{Punchline}
\ni As long as we keep the table only a constant-fraction full, we are
ok. For $\alpha = 9/10$, $T = 10$.
\ni But, if the table is more full than that: $\alpha = (m-1)/m$,
$T=m$!
\section{UNIVERSAL HASHING}
\subsection{Motivation}
\ni Even a clever hash function is vulnerable to bad expected-case
performance.
\ni Any fixed hash function has some set of keys that all land in the
same slot, producing a pile-up: if $|U| \ge nm$, then by pigeonhole
some slot receives at least $n$ of the universe's keys.
\ni We can use randomness to get around this, mostly.
\ni Resulting algorithm is the only truly provably good hashing scheme
(uniform hashing is a crock), but I don't think anyone uses it in
practice.
\subsection{A Universal Hash Function}
\ni Let's assume 20 bit keys, so $|U| = 2^{20}$, about a million.
\ni Further, assume $m = 97$ (a prime), and chaining for collision
resolution.
\ni Break the key $x$ into 5 blocks of 4 bits each ($x_i$ is the
number represented by the bits in the $i$th block of $x$).
\ni Before beginning, pick (randomly uniformly) a multiplier for each
block $0 \le a_i < m$.
\ni $$h(x) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5) \bmod m.$$
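Here is that hash family in Python (a sketch; \verb|make_hash| is an illustrative name). Drawing fresh multipliers corresponds to picking a random member of the family:

```python
import random

M = 97  # table size, prime

def make_hash(a=None, seed=None):
    """Return a member of the family: one multiplier per 4-bit block,
    drawn uniformly from {0, ..., M-1} unless a is given explicitly."""
    rng = random.Random(seed)
    if a is None:
        a = [rng.randrange(M) for _ in range(5)]   # 0 <= a_i < m
    def h(x):
        # split the 20-bit key into 5 blocks of 4 bits, x_1 (high) to x_5 (low)
        blocks = [(x >> (4 * (4 - i))) & 0xF for i in range(5)]
        return sum(ai * xi for ai, xi in zip(a, blocks)) % M
    return h

# x = 1 << 16 has only the high block x_1 = 1 set, so h(x) = a_1 mod M
h = make_hash(a=[1, 0, 0, 0, 0])
print(h(1 << 16))   # prints 1
```

Since every term is reduced mod $M$, the value always lands in $\{0, \ldots, 96\}$, which answers the first question below.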
\subsection{Some Facts}
\ni Hash value always in the correct range. Why?
\ni Consider $x=1$, for all slots $j$, $\Pr(h(x) = j) = 1/m$. Why?
\ni What about $x=2$? For what value of $a_5$ does $h(x) = 1$?
% 49
\ni Basically, any given $x\neq 0$ is equally likely to map to any
slot.
\subsection{Universal Hashing Summary}
\ni Like using random pivots in quicksort, the idea is that we'll pick
a random hash function from a family of hash functions.
\ni If the family has a particular ``universal'' property (not
difficult to have), we can prove a true expected running time for
worst-case inputs.
\ni The critical property is that any pair of keys collides in just
$1/m$ of the possible hash functions.
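The $1/m$ collision property can be checked empirically for the block-multiplier family from the previous subsection (my sketch; \verb|collision_fraction| is an illustrative name): fix one pair of distinct keys, draw many random multiplier vectors, and count how often the pair collides.

```python
import random

M = 97  # table size, prime

def collision_fraction(x, y, trials=20000, seed=0):
    """Fraction of randomly drawn members of the block-multiplier family
    under which the fixed pair of distinct 20-bit keys x and y collide."""
    rng = random.Random(seed)
    bx = [(x >> s) & 0xF for s in range(0, 20, 4)]   # 4-bit blocks of x
    by = [(y >> s) & 0xF for s in range(0, 20, 4)]   # 4-bit blocks of y
    hits = 0
    for _ in range(trials):
        a = [rng.randrange(M) for _ in range(5)]     # random family member
        hx = sum(ai * xi for ai, xi in zip(a, bx)) % M
        hy = sum(ai * yi for ai, yi in zip(a, by)) % M
        hits += (hx == hy)
    return hits / trials

# universality predicts a fraction near 1/97, about 0.0103
print(collision_fraction(12345, 67890))
```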
\ni Not always useful in practice, because often we use our hash table
only once (and the randomization, in this case, only occurs once per
use).
\ni Still, gives us a warm, fuzzy feeling.
\section{DYNAMIC OPERATIONS}
\subsection{Data Structures}
\ni Now we've got another data structure to add to our collection
(hashing joins heaps, lists, binary search trees, splay trees, etc.).
\ni Each has different running time properties.
\ni We can also combine them in various ways.
\subsection{Operations}
\ni Here is a collection of dynamic operations:
\ul
\li insert: put item into data structure
\li find min: return pointer to item with minimum key
\li find max: return pointer to item with maximum key
\li delete: delete item given pointer
\li find: return pointer to item with given key
\li successor: return pointer to next largest item given pointer to item
\li first: return pointer to the item inserted earliest
\li next: given a pointer to an item, return a pointer to the next
most recently inserted item.
\el
\subsection{Combining Data Structures}
\ni How would you:
\ul
\li insert, delete, find max, find min: all amortized $O(1)$, except
insert $O(\log n)$.
\li insert, delete, first, next: all $O(1)$.
\li insert, delete, find, first, next: all expected $O(1)$, except
first, next $O(1)$ worst case.
\li insert, delete, find, find min: all $O(\log n)$, except find is
$O(1)$ expected.
\el
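As one plausible realization of the insert/delete/find/first/next combination (a sketch, not the official answer): a hash table indexing the nodes of a doubly linked list kept in insertion order. The hash table gives expected $O(1)$ insert, delete, and find; the list gives worst-case $O(1)$ first and next.

```python
class _Node:
    """A doubly-linked-list node holding one key."""
    __slots__ = ("key", "prev", "next")
    def __init__(self, key):
        self.key, self.prev, self.next = key, None, None

class OrderedSet:
    """Hash table + doubly linked list in insertion order."""

    def __init__(self):
        self.index = {}              # key -> node (the hash-table part)
        self.head = self.tail = None # list endpoints (the insertion-order part)

    def insert(self, key):           # expected O(1)
        if key in self.index:
            return
        node = _Node(key)
        self.index[key] = node
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev, self.tail.next = self.tail, node
            self.tail = node

    def find(self, key):             # expected O(1); returns node or None
        return self.index.get(key)

    def delete(self, key):           # expected O(1): unlink node, drop index entry
        node = self.index.pop(key, None)
        if node is None:
            return
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev

    def first(self):                 # worst-case O(1): earliest-inserted item
        return self.head

    def next(self, node):            # worst-case O(1): next most recently inserted
        return node.next
```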
\section{EXAM 1}
\subsection{Comments}
\ni Lots of people forgot to circle the problems being done for
credit. Annoying. Didn't take it out on you this time...
\section{HOMEWORK}
\subsection{Homework 6}
\ni Background: Ch. 18.1, 18.2, 12.
\ni Due October 23rd:
\ul
\li 6.1: Last time, you created a binary search tree from the letters
of the word ``algorithm.'' Do it again, and annotate the tree with
the number of credits the splay analysis would assign to each node.
Now splay on ``m.'' What is the depth of the resulting tree? Where
are the credits now?
\li 6.2: Use the amortized analysis template to argue that we can
satisfy a sequence of $n$ operations to a heap (initially empty) in
$O(\log n)$ amortized time for inserts and $O(1)$ amortized time for
delete-max.
\li 6.3: Give worst-case bounds for executing min, max, min, max, min,
max, ... $k$ times on a balanced binary search tree and a splay tree.
Which is asymptotically more efficient if $k = \Theta(\sqrt{n})$?
\li 6.4: CLR Exercise 12.2-2 (pg 226).
\li 6.5: CLR Exercise 12.2-6 (pg 226).
\li 6.6: CLR Exercise 12.3-1 (pg 231).
\li 6.7: When a hash table fills up, it is often necessary to expand
it (especially with open addressing). Let's say that we start off
with a table of size $m=101$ and every time we reach $\alpha \ge 1/2$,
we set $m \leftarrow 2m+1$ and then reinsert all the items into the
new table. Argue that the {\em amortized\/} time to insert a single
item into the hash table over a sequence of $n$ insertions is $O(\log
n)$. Use the uniform-hashing assumption if you need it.
\el
\subsection{Homework 7 (Incomplete)}
\ni Background: Ch. 27.3.
\ni Due October 30th:
\end{document}