******************************************************************************************
* SORT WITHIN CASES: IMPLEMENTATION OF SOME POPULAR SORTING ALGORITHMS *
******************************************************************************************
*Kirill Orlov. kior@comtv.ru
* 2007/03/16
*Below is SPSS syntax of some well-known sorting algorithms.
*In this case, the algos are used to sort values within cases.
*Replace number 13 with the number of variables you are going to submit to sorting, and run
*the corresponding syntax of this or that algo.
*The following algos considered:
*SIMPLE SORTING ALGORITHMS: BUBBLE, SELECTION, INSERTION.
*ALGORITHMS FOR QUICK SORTING OF LARGE VECTORS: SHELL's INSERTION, HEAPSORT, QUICKSORT.
*What's a "stable" sorting algorithm:
*If in the data (vector) there are identical values ("ties") then a "stable" sorting algo
*guarantees that original order among them will be kept, that is, the value occupying initially earlier (more left)
*position in the vector than another value, equal to it, will stay more left than it in the output vector.
*"Non-stable" sorting algo does not quarantee it.
******************************************************************************************
*Create data.
input prog.
vector v (13 f2).
loop #c= 1 to 20.
-loop #v= 1 to 13.
- comp v(#v)= rnd(rv.uniform(1,13)).
-end loop.
-end case.
end loop.
end file.
end input prog.
exec.
*************** BUBBLE (=EXCHANGE) algorithm, STABLE VERSION ******************.
*Vars are looked through in pairs and swap their values when necessary.
*The procedure repeates until no more swap is needed.
*This algo is regarded as the most slow among simple sorting algorithms. It is "stable".
vector v= v1 to v13.
loop #it= 1 to 13. /*It can take maximum as much iterations of the procedure as there're vars
-comp #sorted= 1. /*That's signal that the vector is sorted, let's think meanwhile it's sorted
-loop #i= 2 to 13-#it+1. /*Run through vars, taking pair: this var and preceding one (or could take this and the following - doesn't matter)
/*Could write just 'to 13', but 'to 13-#it+1' is time-thrifty (the thing is that at each iteration, the largest [smallest, for desc sort] value
/*of all the series will definitely find itself at the vector's tail, therefore we need not to touch this tail during next iteration)
- do if v(#i) instead <]
-end loop. /*so that on exit from the run m is indeed the pos of min (max) value
-comp #temp= v(#i). /*Now do swap:
-comp v(#i)= v(#m). /*put min (max) value in pos i
-comp v(#m)= #temp. /*and the value having been in i - put it in pos m
end loop. /*Take next value (i+1) as supposed min (max)
exec.
*If data contain missing values then they will stay in their positions. To sort them all to the vector's tail
*add to the condition of the IF command (that is inside loop): 'or miss(v(#m))'.
******** SELECTION (=EXTREMUM) algorithm, SHIFT VERSION (STABLE) *********.
*This version is based on the previous one except that instead of swapping 2 values (current and the extremum)
*it undertakes additional run-through to move all values that are on the left of the extremun to the right by 1 position.
*At a cost of this prolongation (slowing) of the work "stability" is achieved.
vector v= v1 to v13.
loop #i= 1 to 13-1.
-comp #m= #i.
-loop #j= #i+1 to 13.
- if v(#j)#val. /*And we must find a place within them where to insert our value
/*Therefore, if the next value at the left to it is greater [lesser - for descending sort] then run through the sorted subvector from right to left
- comp v(#i+1)= v(#i). /*shifting values there 1 pos to the right until we run into value which is not greater than ours
-end loop.
-comp v(#i+1)= #val. /*There, we insert our value beside it, at the right; now the left, sorted subvector expanded by one value
end loop. /*Take next value in order, etc
exec.
*If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go
*within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6.
*If you want to escape this division and want to sort all missing to the vector's tail, to if-condition inside loop add: 'or miss(v(#i))'.
*************************** SHELL's INSERTION algorithm ****************************.
*Shell (1959) offered how to speed up Insertion algo in situations of large as well as far-from-sortedness vectors.
*The main point is that values "on the left" - among which a place to insert a taken value is searched - are scanned through
*with large step (i.e. step not 1 but larger) which accelerates finding a place for insertion. After all values have been thus inserted
*and the vector becomes closer to sordetness the procedure repeates with a lesser step, then again with still lesser step. In the end, with step 1
*(i.e. usual Insertion algo is applied when the vector is almost sorted).
*For instance, there's a vector of 13 values. Initially take such a step so that the vector split in subvectors of 2 values,
*distanced from each other by step: 13/2=6.5, i.e. step'll be 6. These are values with positions 1-7, 2-8, 3-9, etc.
*Within these subvectors - within pairs of values - insertion sorting will be done. After that, another splitting of vector is done, 4 values in subvector,
*separated by step 3 (13/4=3.25). These are positions 1-4-7-10, 2-5-8-11, 3-6-9-12. And these subvectors undergo sorting.
*At the end, the whole vector will be sorted.
*Shell's sort is not "stable", nor it is "online" sort.
vector v= v1 to v13.
loop #n= 1 to 13. /*We'll do "splittings" of vector (number of splittings in fact will be less then 13, we'll be over before we reach n=13)
-comp #n= 2**#n. /*with n elements (values) in a subvector: 1-st splitting n=2, at 2-nd splitting n=4, at 3-rd n=8 etc
-comp #step= trunc(13/#n). /*Elements in subvector are not adjacent, rather, they are separated by gap step which's therefore equals the number of full (with n elements) subvectors
-if #step=1 #n= 13. /*If vector contains only one full subvector - take the whole vector instead
-loop #sv= 1 to #step. /*For each "splitting", undertake Insertion sorting in each of its subvectors, sv is the number of a subvector
*****Highlighted section is Insertion algo, the only nuance being that it goes with step step instead of step 1.
- loop #ins= #sv+#step to #sv+#step*(#n-1) by #step.
- comp #val= v(#ins).
- loop #i= #ins-#step to #sv by -#step if v(#i)>#val. /*[for desc sort: < instead of >]
- comp v(#i+#step)= v(#i).
- end loop.
- comp v(#i+#step)= #val.
- end loop.
*****.
-end loop.
end loop if #step=1. /*If the subvector was the only one (and it is the whole vector) that's signal to end job, sorting is done
exec.
*If there are missing values in your data then to if-condition in 'loop #i' add: 'or miss(v(#i))', and they will be sorted to vector's tail.
*Without this addition, missing values will produce absurd sorting result.
************************ HEAPSORT algorithm ******************************.
*This is a quick algo for large vectors. It is not "stable".
*The job comprises 2 stages. On the 1st the vector is sorted into partly sorted vector called heap.
*On the 2nd stage the heap is further sorted into the result.
*Heap is a hierarchical structure of "parent with 2 children" type where a parent element is greater than its children
*(such heap is called max-heap) or a parent element is lesser than its children (heap called min-heap).
*Heap is stored as vector wherein for each element on position p
*there are corresponding children elements on positions 2*p and 2*p+1. An example of max-heap in form of figure and vector:
* 10
* 8 6
* 3 6 5 1
* 3 2 3 5 4
*
*posit 1 2 3 4 5 6 7 8 9 10 11 12
*value 10 8 6 3 6 5 1 3 2 3 5 4
*
*On both stages, the algo is the same, swapping: elements on parental positions are compared with their children,
*and if it occures one of the children is greater (lesser, for building min-heap) than its parent the child swaps positions with the parent.
*The difference between the stages is in regulations for the algo: on the 1st stage all parents are being checked which lead to heap where
*the 1st element - root parent - is the largest (smallest) of all. On the 2nd stage that parent is withdrawn from swapping and the rest of vector
*undergoes heapifying, then the root parent is withdrawn again, one more shortening the part which's left for heapifying, etc.:
*each time that part becomes shorter and, besides, needs less ans less further sorting.
*It may seem paradoxal at first glance but it is bulding of max-heap that results in ascending sort, and it is bulding of mix-heap that
*brings descending sort.
vector v= v1 to v13.
comp #end= 13.
loop #stage= 1 to 2. /*Two stages: create heap on the 1st, sort heap on the 2nd
-loop #p= trunc(13*#stage/2) to 1 by -1. /*On the 1st stage p is parent's position: scan all parents (i e all elements from vector's mid to the left)
/*On the 2nd stage value p runs through all elements from vector's end to the left
- do if #stage=2. /*On the 2nd stage there're differences with the 1st:
- comp #temp= v(1). /*Swap 1st element (most grand parent) with last element (rightmost child) in the vector
- comp v(1)= v(#p).
- comp v(#p)= #temp.
- comp #end= #p-1. /*And "withdraw" the last element from the vector, for it's already sorted: it's the largest (smallest, for desc sort) of all currently;
/*And in so doing the still-not-sorted part of the vector will gradually shorten
- comp #p= 1. /*As for parent - to compare with children - take always currently the 1st element of vector
- end if.
*****Highlighted is comparison/swap of parent and child ("sifting"="percolating").
- loop #i= 1 to 13 if #p*2<=#end. /*Will repeat while a parent has a child (i e parent's pos p is such that 2*p doesn't exit vector)
/*i will be not used (introduced only to escape SET MXLOOPS) and cycles will terminate before i=13
- comp #c= #p*2. /*Position of the child (left one of the two)
- do if #c+1<=#end. /*When there're 2 children
- if v(#c+1)>v(#c) #c= #c+1. /*then, if the right child is greater [lesser, for desc sort: < instead >] than the left, take the right one into comparison
- end if.
- do if v(#p) instead <] than the parent
- comp #temp= v(#p).
- comp v(#p)= v(#c).
- comp v(#c)= #temp.
- comp #p= #c. /*so that the parent is now on the child's pos (and so we'll then compare it with its futher children: its former grand-children)
- else. /*And if parent is greater than the largest [lesser than the smallest] child, exit
- break.
- end if.
- end loop. /*to take next parent in a queue for parent-children comparison sift
*****.
-end loop.
end loop.
exec.
*If there are missing values the sort will be with errors. To ward it off and sort all missing to the vector's tail,
*to condition 'if v(#c+1)>v(#c)...' add: 'or miss(v(#c+1))', and to condition 'do if v(#p)0. /*r is position in vectors s and e to read from: read s (start pos) and e (end pos) of the input subvector; the whole job lasts while there's something written in vector s (and e)
-comp #s= #s(#r). /*Copy start pos because it will change (while e, end pos, will remain for a while)
-loop #w= #w to 13 if #s<#e(#r). /*The task with the taken subvector is to repeatedly divide it into left and right wings with recoil from left (i e divide, then divide its right wing,
/*then divide the right wing of the right wing, etc) while there remains smth to divide: end pos is beyond start pos;
/*and w being pos in vectors s รจ e on which to write the whereabouts of emergent left wings, for we'll deal with them later
****Hightlited section is the division of subvector****.
- comp #p= trunc((#s+#e(#r))/2). /*Select pivot value: let it be the middle element in the subvector
- comp #temp= v(#p). /*Carry it to the head of the subvector (it is taking it away to a safe place)
- comp v(#p)= v(#s). /*while the 1st element goes to pivot's former pos (swap)
- comp v(#s)= #temp.
- comp #p= #e(#r). /*Now, p will be the reception pos for values being thrown over to the right wing (let p be the end of the subvector initially)
- loop #i= #e(#r) to #s+1 by -1. /*Do throwing over of values: take value one by one starting from the end (do not touch the pivot value kept on pos s)
- do if v(#i)>v(#s). /*If the value is greater [lesser, for desc sort: < instead >] than the pivot value
- comp #temp= v(#i). /*then put it on the right wing on the reception pos p
- comp v(#i)= v(#p). /*(swap with the value occupying it)
- comp v(#p)= #temp.
- comp #p= #p-1. /*And move the reception pos by 1 to the left
- end if.
- end loop. /*Take next value to compare it with the pivot value and to through over to the right if necessary
- comp #temp= v(#s). /*After throwings-over are done, put the pivot value on the pos p, wherever it is now
- comp v(#s)= v(#p). /*(swap with the value occupying it)
- comp v(#p)= #temp. /*Two wings are ready: all values on the right of the pivot value (its pos p) are larger [smaller], while on the left of it are not larger [not smaller] than it
****.
- comp #s(#w)= #s. /*Start pos of left wing: write it to our memorandum
- comp #e(#w)= #p-1. /*Start pos of left wing: write it to our memorandum
- comp #s= #p+1. /*Start pos of right wing: on next loop we'll devide subvector which starts here and ends on pos e
-end loop. /*And now we turn to do it
end loop. /*When there's nothing to divide anymore, next subvector to divide is taken by reading its whereabouts from s(r) e(r)
exec.
*In the presence of missing values sort is not affected but positions of missing values move. If you don't like it and want
*to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i))'.
*Though a pivot element may be selected arbitrary, speed depends on its selection.
*If the input vector is already partly sorted then choosing a pivot element by an edge of (sub)vector considerably slows down the job.
*Therefore it's most optimal and universal to take a pivot element from a middle part of (sub)vector, like the above syntax does.
*Or you might choose it randomly.