NEURAL NETWORKS and adjusting the weights.

Simulated annealing is not actually used for adjusting the weights in a neural network. It's too general for this problem: simulated annealing is designed for harder problems, in particular problems where there might be several local maxima. Here the error function is convex, and what that boils down to is that there is only going to be one minimum anyway. Once you find a point that's smaller than its neighbors, you've found the minimum.

What is normally used is "stochastic gradient descent." In general, "gradient descent" algorithms work like this. The gradient of a multivariable function f(x1, ..., xn) is the vector of its partial derivatives. With multivariable functions we don't really use a d symbol; we use the "partial" symbol, a stylized d that looks more like a backward 6. It turns out the gradient at any given point ALSO gives the direction of steepest slope. So, if you follow the direction indicated by the gradient vector, you'll reach that max or min point faster. No need to choose a random direction.

If the error function is simple enough (and it never is), you could hypothetically set the gradient vector equal to zero, solve for the variables, and that would give you the minimum of the function. But these equations are generally too complex for easy algebra, so we use gradient descent, an iterative method designed specifically to find where that gradient vector is zero.

Compare Newton's method, a similar iterative idea. To find the square root of n: let x1 be an "estimate" of the square root. Then (x1 + n/x1) / 2 is a better estimate. Then you use that estimate to find an even better estimate.

For each weight in our network, we can adjust the weight using w := w - alpha * (d/dw)Loss(w), where w is our vector of weights, Loss is the error function, d/dw means the partial derivative with respect to the weight we're looking at, and alpha is the "step size."
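Both iterative ideas above can be sketched in a few lines of Python. This is a toy illustration, not production training code: the function names and the example loss (w - 3)^2 are my own choices, picked so the answers are easy to check.

```python
def newton_sqrt(n, x=1.0, iters=20):
    # Newton's method for sqrt(n): repeatedly replace the estimate x
    # with the average of x and n/x; each pass gives a better estimate.
    for _ in range(iters):
        x = (x + n / x) / 2
    return x

def gradient_descent(grad, w=0.0, alpha=0.1, steps=200):
    # The update rule from the notes: w := w - alpha * (d/dw)Loss(w),
    # repeated until w settles near the point where the gradient is zero.
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy convex loss: Loss(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# The single minimum is at w = 3, which gradient descent should find.
w_min = gradient_descent(lambda w: 2 * (w - 3))
```

Here the loss is a single-variable function so the "gradient" is just one derivative; for a real network the same update is applied to every component of the weight vector.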
You do this a bunch of times and it will eventually converge on the minimum point, provided alpha isn't too large; so you decrease alpha as you proceed with the algorithm (much like simulated annealing decreases its acceptance probability as time goes on). Gradient descent algorithms work like this. The idea is that FINDING the partial derivatives is easy; it's setting them to zero and solving that's hard.

Stochastic gradient descent is when you don't change all of the weights at once. You grab a fixed number of weights randomly and update those, then grab another set and update those, and so on. If you have a parallel architecture, grabbing a random subset and farming it out to different processors works to your advantage as well. If you shrink alpha appropriately throughout the process, this will converge (it might take a while) to the minimum point.

ENCODING INFORMATION into inputs for a neural network.

Take the restaurant-type attribute. Interpreting it as an integer (Thai=0, Italian=1, French=2, Burger=3) is not particularly useful, because neural networks aren't based on integers; they compute continuous functions. For example, if Thai=0 and Italian=1, what would 0.7 mean in this context? Not a whole lot, because there's no real continuum here: Thai, Italian, French, and Burger are not related to each other at all. A better system is a four-bit system, one bit for each type of restaurant, with the assurance that no two of the four bits will ever both be 1 on input: a "one-hot" bit system. Every value of the attribute gets its own bit, but we never have more than one set to 1 at the same time.

Images/image processing present the opposite problem. Let's say a pixel channel has a value between 0 and 255. If my image has m rows of pixels and n columns of pixels, how many total input values do we need? 3mn, because each pixel has three values in the 0-255 range: the red, green, and blue components of its color.
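The one-hot encoding above is easy to show concretely. A minimal sketch, using the four restaurant types from the notes (the function name and list are my own):

```python
CUISINES = ["Thai", "Italian", "French", "Burger"]

def one_hot(value, categories=CUISINES):
    # One input bit per category; exactly one bit is 1, the rest are 0,
    # so no spurious ordering or "0.7 of a cuisine" is implied.
    return [1.0 if c == value else 0.0 for c in categories]

# one_hot("French") produces [0.0, 0.0, 1.0, 0.0]
```

The cost is one input unit per possible value, but the network no longer has to untangle a fake numeric ordering among unrelated categories.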
So you would imagine that to process an m x n image, you would need 3mn inputs into your neural network...but this also causes a problem, because these inputs are not independent of each other. The three color values of a pixel are strongly related to each other, and each pixel is certainly related to other nearby pixels. If you want your neural network to recognize a thumb, a leaf, or really anything, you are looking not at a single pixel and not at the whole image, but at a portion of the image, such that if a leaf appears ANYWHERE in the image, you recognize it as a leaf.
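The two ideas above, 3mn inputs per image and looking at local portions, can be sketched with plain Python lists. This is only an illustration under my own conventions: an image is a list of rows, each row a list of (r, g, b) tuples, and the helper names are made up.

```python
def image_to_inputs(image):
    # Flatten an m x n RGB image into 3*m*n inputs, scaling each
    # 0-255 channel value into [0, 1] for the network.
    return [channel / 255.0
            for row in image
            for pixel in row
            for channel in pixel]

def patches(image, k):
    # Every k x k window of the image. A detector run over all of
    # these can fire wherever the feature appears, not just at one
    # fixed location.
    m, n = len(image), len(image[0])
    return [[row[c:c + k] for row in image[r:r + k]]
            for r in range(m - k + 1)
            for c in range(n - k + 1)]

# A 2 x 2 test image: red, green / blue, white pixels.
img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
inputs = image_to_inputs(img)   # 3 * 2 * 2 = 12 input values
```

Scanning every k x k window this way is the brute-force version of the "recognize it anywhere" idea; convolutional networks build the same locality assumption directly into the architecture.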