Binary Search is mathematically the fastest search algorithm there is.

Today we'll learn something faster.

Fallacy of equivocation:  I use a word with more than one meaning in an argument, using different definitions at different points of the argument.

Hashing:  What is hashing?  A hash function is a function that takes search data (usually a string) and converts it to a number.  Ideally, very few other strings will hash to that same number.  So, if you look up a piece of information using that number, you'll find the record / search data you're looking for.

Let's say I want to look up "Lucky The Leprechaun".  If I had a huge array or a flat file, and I use binary search, that would take lg n time, over the total number of records I have.  It's lg n time whether or not "Lucky The Leprechaun" is even in there.

But, let's say I'm able to hash "Lucky The Leprechaun" into a number, say, 12345.  Then I can look in an array at position 12345 and this takes constant time, because I can easily jump directly to whatever position I want in an array or flat file.  And if that's where "Lucky The Leprechaun" is, I win.  And constant time beats lg n time, so I've got method that works better than binary search.

Binary Search is GUARANTEED to be lg n.  Hashing is not GUARANTEED to be constant.  If there are 100 other strings that ALSO hash to 12345, then I have to go through those to find the one I want.

Hash functions:  convert a string to an integer.  They should look random but not actually be random.  If I convert "Lucky The Leprechaun" to 12345, that that string should ALWAYS convert to 12345 every time I run the algorithm.

There are some really bad hashing functions.

Based on n-ary numbers.  What is the difference between decimal numbers and decary numbers?  decary numbers have no digit of zero.

1 2 3 4 5 6 7 8 9 X

Under this system, I can uniquely express every positive number.

37   is still 37

50   is now 4X

100  is now 9X

101 is now  X1

1005 is now 9X5     9x100 + 10x10 + 5 = 1005

There aren't a lot of advantages to doing it this way.  In particular, dealing with fractions is a real pain in the neck:  1.05 ==> .X5  0.05...is this even possible?

P = 1,  Z = 0  M = -1,  base 3 system.

P = 1
PM = 2
PZ = 3
PP = 4
PMM = 5
Z = 0
M = -1
MP = -2

The decary has ONE BIG ADVANTAGE:  The representation is unique.

In our number system, 5  can be 5, 05, 005, 0005, etc.  But in decary, 5 is only 5.  That's the only way to do it.

1.099     1.09X   1 = .X = .9X = .99X = .999X = .9999999...

Let's say I have character set:  For now, I'll say capital letters (but in reality, it could be the entirety of 8-bit ASCII, 256 total characters).

A  B  C  ... Z
1  2  3      26

Imagine the alphabet and words in the alphabet to represent numbers written in a 26-ary system.

CAT would be 3*26^2 + 1*26+20.  Every string represents a unique non-negative integer, and every non-negative integer represents a unique string.  ("" translates to 0).  "CAT" translates to 2074, and it is the ONLY string that translates to 2074.

So, if I have some sort of database keyed on animal names, I don't have to binary search the word "CAT" to see if it's in there.  I just check position 2074 to see if its in there.

What's the teeny tiny barely worth mentioning problem with this strategy?

"CAT" hashes to 2074.  What does "ELEPHANT" hash to?  Do I have enough array space to store all strings up to the word ELEPHANT?  That's the down side.  (No, you don't have enough space.)

So, to solve this problem, we have to mod our number by the size of the array to keep the space issue in check.  I usually mod by a prime.  (However, this is not strictly necessary).  65537 is prime, and it's one more than a power of 2.  (65537 = 2^16 + 1).  But you can use any prime that you feel is approximately the amount of information that you're going to store.

What is the downside to modding by a prime.  NOW we can get duplicates.  It is no longer unique.

If I mod by 65537, "CAT" hashes to 2074, but "CAT" will not be the only string that hashes to 2074.  However, the other strings that hash to 2074 will look like "FDFDSREWQR".  It is extremely unlikely that the other strings that hash to 2074 will look like anything you want to use anyway.

If more than one item hashes to the same thing we call this a "collision."  You can't eliminate them completely, but if you can minimize the probability of this, you have a very fast algorithm for data storage and retrieval.  So important is this that Java has hash tables as part of its standard library.

The modern way to implement hash tables is as an array of linked list.

Let's say my array size is 65537.  So I have entries from 0 to 65536.  When I add "CAT", "CAT" hashes to 2074 and so I add "CAT" to the 2074th. linked list in the array.  I have linked lists in there so that if, just by chance, something else also hashes to 2074 then I have room for both.  I just add the new one to the linked list.

In actual practice, the linked lists seldom get larger than two elements.  Trying to linked lists sorted, say, is more work than necessary.  When you look up "CAT", and you check the 2074th. entry in the array, there probably won't be more than two things in there, so checking them all isn't much of a hardship.

Before we had languages with easy pointers, then hash tables were stored in linear or quadratic probing tables.