AT-MOST-ONCE PROPERTY

A "critical reference" is a reference to a variable that is modified by another thread.

    y = x+1   %%   x = 12
        ^
        That reference to x is a critical reference: this thread reads x while the other thread modifies it.

Now consider the assignment statement var = expr. If expr contains NO MORE THAN ONE critical reference and var is not read by any other thread, OR if expr contains no critical references at all, then the AT-MOST-ONCE property is satisfied. The AT-MOST-ONCE property means that, for all intents and purposes, the assignment can be thought of as atomic, even if it actually isn't, and even if the concept of atomicity does not exist in your language or on your platform. (We are assuming that individual memory reads and writes are already atomic.)

Let's say x was 5. The first thread starts running y = x+1. It loads the 5 from x, but before the addition happens, the second thread runs and sets x to 12. How does this screw up my answer? Not a whole lot. y is going to end up as 6 whether x = 12 fires after my statement or during my statement. I may still have a race condition, but I know that I'm not going to get some screwy answer, and I certainly don't have to worry about the details of how arithmetic expressions are parsed and computed.

PARALLEL SORTING

What is the asymptotic runtime of a regular old sequential sorting algorithm? n lg n. If I have a sequential algorithm that runs in n lg n time and I run it on n processors, how much time will it take now? At best, n lg n / n = lg n; I can't do any better than that. That doesn't mean I can actually achieve lg n time: Amdahl's law says that a program can only be sped up so much no matter how many processors you have, because some aspects of an algorithm are inherently serial and can't be sped up by parallelizing.

It turns out that you CAN sort n items using n processors in lg n time. This was worked out in 1983, and it's a complicated mess. But there are some easy algorithms that will sort n items in lg^2(n) time. Now, this is another example of theoretical computer science not agreeing with real-life practical matters: in real life, the number of processors I have doesn't change based on my data size.

PARALLEL MERGE SORT

1. Divide the array in half. (Doesn't matter how.)
2. Recursively sort the first half.
3. Recursively sort the second half.
4. Merge the sorted halves together.
(BASE CASE: arrays of size 0 and 1 don't need to be sorted.)

Dividing the list in half takes constant time. Merging the halves back together takes n time. Using math and stuff (or the Master Theorem), we can see that the recursion takes n lg n time. (At each level of the recursion I have to scan every entry in the array as part of a merge operation, and I have to keep recursing until I'm dealing with arrays of size 1. At the top level I deal with an array of size n, at the next level I have arrays of size n/2, at the next arrays of size n/4, all the way down to 1. That means I have lg n levels of recursion, each level has to scan every single element in the array, and that's where the n lg n comes from.)

What would be an easy first step to adjusting this algorithm so that it runs in parallel with at least some meaningful speedup? Steps 2 and 3 could be run in parallel. The first half and the second half don't interfere with each other at all, so if I sort the first half and the second half simultaneously I'm going to get some speedup: the sort now runs in O(n) time. Which is still cool; that's still faster than a regular merge sort. We can still improve upon it, though. The bottleneck is now in the merging process.
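To make that first step concrete, here is a minimal Java sketch (the class and helper names are my own, not from the notes): the first half is handed to a new thread while the current thread sorts the second half, and the merge is still the ordinary sequential one, which is exactly the bottleneck just described. A real implementation would more likely use a ForkJoinPool and fall back to a sequential sort for small subarrays instead of spawning a thread per recursive call.

    import java.util.Arrays;

    public class ParallelMergeSort {

        // Sort a[lo..hi); the two halves (steps 2 and 3) are sorted in parallel.
        static void sort(int[] a, int lo, int hi) throws InterruptedException {
            if (hi - lo <= 1) return;                // base case: size 0 or 1 is already sorted
            int mid = (lo + hi) / 2;                 // step 1: divide in half

            Thread left = new Thread(() -> sortQuietly(a, lo, mid));  // step 2 in a new thread
            left.start();
            sort(a, mid, hi);                        // step 3 in the current thread
            left.join();                             // both halves must finish before merging

            merge(a, lo, mid, hi);                   // step 4: still a plain sequential merge
        }

        // Wrapper so the Runnable lambda doesn't have to handle the checked exception.
        static void sortQuietly(int[] a, int lo, int hi) {
            try {
                sort(a, lo, hi);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // Ordinary O(n) merge of a[lo..mid) and a[mid..hi); this is now the bottleneck.
        static void merge(int[] a, int lo, int mid, int hi) {
            int[] tmp = new int[hi - lo];
            int i = lo, j = mid, k = 0;
            while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi)  tmp[k++] = a[j++];
            System.arraycopy(tmp, 0, a, lo, tmp.length);
        }

        public static void main(String[] args) throws InterruptedException {
            int[] a = {5, 2, 9, 1, 7, 3, 8, 4, 6, 0};
            sort(a, 0, a.length);
            System.out.println(Arrays.toString(a));  // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        }
    }

So how do we parallelize the merge itself?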
I have two sorted lists and I want to merge them together.

    List A:  _ _ _ X _
    List B:  _ _ _ _ _

Assign a thread to each item in the unmerged lists. In the above example, I have ten threads, and one of those threads is dedicated to that X. How can that thread tell (relatively easily) where the X will go in the merged list? So, first of all, where do we know the X CANNOT go? We know that there are three positions that come before it in its own list. So, if we also knew exactly how many elements come before it in the other list, we would know exactly where it goes in the merged list. Use binary search on the other list to find out exactly how many of its elements come before the X. You now know how many elements come before the X in the first list and in the second list, so you know exactly where it goes in the merged list... so put it there. Voila. Every thread does its binary search at the same time, so each merge now takes lg n time instead of n, and with lg n levels of recursion your sort now runs in O(lg^2(n)) time.
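Here is a matching Java sketch of that parallel merge (again, the names are mine, not from the notes). Instead of literally creating one thread per element, it uses a parallel stream to stand in for the per-element threads. Each element finds its own slot independently: its index in its own list says how many of its own elements precede it, and a binary-search rank in the other list says how many of the other list's elements precede it. The two lists use opposite tie-breaking rules so equal keys don't collide on the same slot.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class ParallelMerge {

        // Merge two sorted arrays. Every element's final slot is computed independently,
        // so each iteration could be its own thread; a parallel stream stands in here.
        // Each slot of out[] is written by exactly one thread, so there is no race.
        static int[] merge(int[] a, int[] b) {
            int[] out = new int[a.length + b.length];
            IntStream.range(0, a.length).parallel()
                     .forEach(i -> out[i + rank(b, a[i], true)] = a[i]);   // i elements of a precede a[i]
            IntStream.range(0, b.length).parallel()
                     .forEach(j -> out[j + rank(a, b[j], false)] = b[j]);  // j elements of b precede b[j]
            return out;
        }

        // Binary search: how many elements of sorted[] go before key in the merged order.
        // strict counts elements < key, non-strict counts elements <= key; using one rule
        // for a's elements and the other for b's keeps equal keys out of the same slot.
        static int rank(int[] sorted, int key, boolean strict) {
            int lo = 0, hi = sorted.length;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (strict ? sorted[mid] < key : sorted[mid] <= key) lo = mid + 1;
                else hi = mid;
            }
            return lo;
        }

        public static void main(String[] args) {
            int[] a = {1, 4, 6, 8, 9};
            int[] b = {2, 3, 5, 7, 10};
            System.out.println(Arrays.toString(merge(a, b)));  // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        }
    }

Plugging a merge like this into the parallel merge sort above is what brings the overall running time down to O(lg^2(n)), assuming you really do have a processor (or thread) per element.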