Reminder: A hash table uses a hash function to convert a record or a key field into an integer. The information is then stored in the hash table, keyed to that integer. Inserting an element into the hash table takes "approximately" constant time. Looking it up and removing it are also "approximately" constant time operations. The reason it is not exactly constant is that there is a certain risk of collision, two or more elements hashing to the same integer, so that more than one element might be stored at that integer. No matter how carefully you design your hash table and hash function, there is still a risk (hopefully small) of collision. You cannot completely eliminate collisions.

So, the way we store hash tables in modern languages is as an array of linked lists. If my hash function takes us (say) from 0 to 65536, then I would have LinkedList[] HT = new LinkedList[65537], or something like that. If my string "Hey There" hashed to 44291, then I would say HT[44291].add("Hey There") or something like that.

0 -> _____ -> ______
1 -> _____ -> ______
2 -> null
3 -> _____
4 -> null

and so forth. You hope that every list in the array contains no more than one element, but if your hope is unrealistic, then at least there probably won't be too many records to search. What *I* do is estimate how large I expect my hash table to be, and then I find a prime approximately DOUBLE that, and that's how big I make my array. (This does NOT completely eliminate collisions.) And this actually works pretty well. I sometimes DEFINE hash tables as "arrays of linked lists."

So here's the question. Hash tables were known long before there were programming languages with pointers. (BASIC, ForTran, COBOL, all them "classic" languages, didn't have pointers, so therefore they didn't have linked lists, which require pointers.) Turing-complete: a language that can solve any problem a Turing machine can solve, under the assumption that integers never go out of bounds and you never run out of RAM. Now, BASIC, COBOL, ForTran, and those guys, are Turing-complete. They don't have recursion and they don't have pointers, but they can still solve the same class of problems, even without that stuff. So how could a language without pointers solve a problem that you would normally use pointers on? Like a linked list. Like a binary tree.

Use a big-ass array. And your pointers, instead of pointing to a memory location, point to a location in the big-ass array. P->next might be something like MEM[P[1]]. Can be done. Pain in the neck.
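Before we go on, here's the array-of-linked-lists table from above in actual code. This is a minimal sketch in Java, under my own assumptions: String keys, Java's built-in hashCode(), and names (ChainedHashTable, TABLE_SIZE) that are just mine.

    import java.util.LinkedList;

    public class ChainedHashTable {
        // A prime roughly double the anticipated data size, per the rule of thumb above.
        private static final int TABLE_SIZE = 65537;

        @SuppressWarnings("unchecked")
        private final LinkedList<String>[] table = new LinkedList[TABLE_SIZE];

        private int hash(String key) {
            // floorMod keeps the index non-negative even when hashCode() is negative.
            return Math.floorMod(key.hashCode(), TABLE_SIZE);
        }

        public void add(String key) {
            int i = hash(key);
            if (table[i] == null) table[i] = new LinkedList<>();
            table[i].add(key); // a collision just makes this chain one node longer
        }

        public boolean contains(String key) {
            int i = hash(key);
            return table[i] != null && table[i].contains(key);
        }

        public void remove(String key) {
            int i = hash(key);
            if (table[i] != null) table[i].remove(key);
        }
    }

So HT[44291].add("Hey There") from above becomes table[hash("Hey There")].add("Hey There") here; add, contains, and remove each cost one hash plus a walk down one (hopefully short) chain.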
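And here's the big-ass-array trick itself, sketched in Java (a language that doesn't actually need it) just to make the idea concrete. The parallel arrays and the names (data, next, head, free) are my own illustration, not anybody's historical ForTran:

    // A linked list with no pointers: a "pointer" is an index into big
    // parallel arrays, and -1 plays the role of null.
    public class ArrayLinkedList {
        private static final int CAPACITY = 1000;            // the big-ass array
        private final String[] data = new String[CAPACITY];  // the record at each "address"
        private final int[] next = new int[CAPACITY];        // P->next becomes next[P]
        private int head = -1; // index of the first node; -1 while the list is empty
        private int free = 0;  // next never-used slot in the pool

        // Insert at the front: the new node's next "points" at the old head.
        public void addFirst(String value) {
            if (free == CAPACITY) throw new IllegalStateException("out of slots");
            int p = free++;
            data[p] = value;
            next[p] = head;
            head = p;
        }

        // Walk the list by following indices instead of pointers.
        public boolean contains(String value) {
            for (int p = head; p != -1; p = next[p]) {
                if (data[p].equals(value)) return true;
            }
            return false;
        }
    }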
You could do something like this with hash tables, but the big stumbling block is the big-ass array. And so for hash tables, a different strategy was used: "linear probing tables." (This is an obsolete data structure.)

Linear probing table: Instead of an array of linked lists, it's just an array of stuff. Instead of 65537 different linked lists, I have an array of 65537 different records. If my string "Hey There" hashed to 44291, I would just stick "Hey There" into position 44291 in the array. Just as fast as an array of linked lists, just as easy to compute, just as easy to extract, absolutely perfect, except for one teeny-tiny detail, barely worth mentioning: collisions. What if something else is already there?

That's why it's called "linear probing." If I have a collision, I just keep moving forward until I find a blank space. So, if 44291 is filled, I try 44292, then 44293, and so forth until I find a blank space. Then I add it to the blank space. In general, if position p is already filled, I try (p+1)%size until I find a blank space to put it in. When I want to search later, same thing. If position p doesn't have what I want, I move forward until I either find what I want or find a blank space. (Blank space meaning it ain't there.)

So, what are the problems with this? Once a lot of stuff gets into the array, it's a lot harder to find a blank space. So how would you address this problem? Make the array MUCH larger than your anticipated data size.

What happens with delete?

44291 -> "Hey There"            (hashes to 44291)
44292 -> "George of the Jungle" (hashes to 44291)
44293 -> "Cake For Breakfast"   (hashes to 44291)
44294 -> BLANK

I delete "George of the Jungle".

44291 -> "Hey There"            (hashes to 44291)
44292 -> BLANK
44293 -> "Cake For Breakfast"   (hashes to 44291)
44294 -> BLANK

Search for "Cake For Breakfast". What happens? I hash to 44291. It's not "Hey There", so I advance to 44292. It's blank, so I'm done and report that "Cake For Breakfast" isn't there, even though it totally is.

So, how could I address this problem? SCREW the blank, search the whole array anyway? That throws away the whole point of hashing. The classic solution is to use a marker, so that BLANK and DELETED are not the same thing. When the hash table is created, everything is BLANK, but when an item is removed, it's replaced with the DELETED marker. (In other words, once an entry stops being BLANK, it will never be BLANK again.)

44291 -> "Hey There"            (hashes to 44291)
44292 -> DELETED
44293 -> "Cake For Breakfast"   (hashes to 44291)
44294 -> BLANK

The DELETED marker can be replaced with a new record in the event of an add, but it is not considered a BLANK for the purposes of searching. If you see a DELETED marker, you skip the record and keep going. Don't stop until you hit a BLANK. (There's a sketch of all this in code below.) And this will work for a while, but BLANKs only ever get used up, never created, so if the table keeps growing, eventually you will run out of BLANKs, and every search for a nonexistent record will require you to search the whole table.

There has also been some work with "quadratic probing tables." In that case, instead of stepping forward one slot at a time with (p+1)%size, the probe sequence jumps quadratically: the i-th probe after a collision at home position h tries something like (h + i^2)%size. This makes it so that colliding elements are not stored in adjacent positions; they're spaced out a bit, which breaks up the long runs of occupied slots that linear probing builds, making it less likely for elements of different hash values to collide with each other. However, the problem of the table filling up still remains.
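Here's the whole linear probing scheme, DELETED marker and all, as a minimal sketch in Java. The class name, the sentinel object, and the String-key assumption are mine, and the sketch quietly assumes the table never completely fills (otherwise add would loop forever):

    public class LinearProbingTable {
        private static final int SIZE = 65537; // a prime, same rule of thumb as before

        // A unique object compared by reference (==), so no real key can ever
        // be mistaken for it. null plays the role of BLANK.
        private static final String DELETED = new String("<deleted>");

        private final String[] table = new String[SIZE];

        private int hash(String key) {
            return Math.floorMod(key.hashCode(), SIZE);
        }

        public void add(String key) {
            int p = hash(key);
            // Keep probing past live records; a BLANK or DELETED slot stops the probe,
            // so a DELETED slot gets reused on add.
            while (table[p] != null && table[p] != DELETED) {
                p = (p + 1) % SIZE; // linear probe: just move forward
            }
            table[p] = key;
        }

        public boolean contains(String key) {
            int p = hash(key);
            while (table[p] != null) { // BLANK means "it ain't there"
                if (table[p] != DELETED && table[p].equals(key)) return true;
                p = (p + 1) % SIZE; // skip DELETED markers and mismatches
            }
            return false;
        }

        public void remove(String key) {
            int p = hash(key);
            while (table[p] != null) {
                if (table[p] != DELETED && table[p].equals(key)) {
                    table[p] = DELETED; // once not BLANK, never BLANK again
                    return;
                }
                p = (p + 1) % SIZE;
            }
        }
    }

For quadratic probing you'd keep everything else and change only the probe step: remember the home index h and a probe counter i, and compute p = (h + i * i) % SIZE instead of p = (p + 1) % SIZE.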