Parallel Cartographic Modeling (Collaboration with U. South Carolina)

Publication

Discussions

  • Initial problem description by April:
    1. Can a flux footprint model be implemented in a GIS framework? I think Sara has already shown that yes it can.
    2. Is parallelization appropriate?
    3. Will the results have scientific validity and an improvement over current footprint model techniques?

Flux footprint modeling

Sara's sequential python code

More datasets by April (shared through dropbox)

24 hours of data. The files are different versions of the same information:

  1. CR5000-Sugar_FastData_2012_07_21_0700.dat = a complete 24-hour raw data file as it is collected from the data logging system in the field. I don't think that you will need to work with this file at all, but I thought it could be useful for everyone to have a feel for the starting point.
  2. 07_21_2012_Footprint_input.dat = This file represents the input formatted for Sara's python code at the full data resolution of 0.1 seconds. To create it I ran a script in IDL on the original data; long term this pre-processing could be something we examine in Python as well, but for now I'm comfortable with it happening outside of the GIS framework.
  3. 07_21_2012_Footprint_input_1s.dat = This file is identical to the file above, except that the wind speed, direction and stability have been calculated on a one-second averaging basis, so it has 1/10th of the data points. It may be better for a scientific investigation to use time-averaged data like this, but based on Sara's estimates, I think it would still represent substantial processing time on a normal PC.

Week 1's updates (March 02-09, 2012) by Babak

  • The first thing we did to understand how this Python program works was to instrument it with three timers: an input timer, a compute timer and an output timer. These three timers were added to the code, one per phase, as shown below:

  • All of the application's input work is done in this region.

  • All of the application's computation work is done in this region.
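
A minimal sketch of the instrumentation pattern (the comments stand in for the real phases; this is not the actual FluxFootprintModel code):

import time

input_start = time.time()
# ... read the input file and build the in-memory arrays ...
input_time = time.time() - input_start

compute_start = time.time()
# ... loop over timesteps and accumulate the footprint ...
compute_time = time.time() - compute_start

output_start = time.time()
# ... write the result raster to the output file ...
output_time = time.time() - output_start

print("Input: %.3f s  Compute: %.3f s  Output: %.3f s"
      % (input_time, compute_time, output_time))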

  • We call this code FluxFootprintModel_code_orig.py.
  • The results of running the instrumented version of the original code are as follows:

Optimization 1: Loop Invariant Code Motion

  • After looking at the code more carefully, we realized that the two phases “Calculate Euclidean Distance” and “Calculate Euclidean Direction” are completely invariant with respect to the loop over timesteps, so they can be hoisted out of the loop and computed just once instead of on every iteration (a sketch of this transformation appears after this list).
  • Since these two phases seem to be expensive, we expect a large performance improvement.
  • We call this second code FluxFootprintModel_code_opt1.py.
  • The results of running the instrumented version of opt1 code are as follows:
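
For reference, the hoisting itself looks roughly like this (a sketch with illustrative names, not the actual code):

def calc_euclidean_distance(grid):      # stand-in for the real raster calculation
    return [0.0] * len(grid)

def calc_euclidean_direction(grid):     # stand-in for the real raster calculation
    return [0.0] * len(grid)

def footprint_original(timesteps, grid):
    for t in timesteps:
        # original code: both rasters recomputed on every iteration,
        # even though they never change between timesteps
        distance_raster = calc_euclidean_distance(grid)
        direction_raster = calc_euclidean_direction(grid)
        # ... accumulate the footprint for timestep t ...

def footprint_opt1(timesteps, grid):
    # opt1: the loop-invariant rasters are computed once, outside the loop
    distance_raster = calc_euclidean_distance(grid)
    direction_raster = calc_euclidean_direction(grid)
    for t in timesteps:
        pass  # ... accumulate the footprint for timestep t ...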

Correctness of Optimization 1

  • We have to make sure that the results of the two programs are the same, so that our optimization has not affected the correctness of the program. To perform this verification, we visualized the results of the two programs using the QGIS application.
  • Result of FluxFootprintModel_code_orig.py:

  • Result of FluxFootprintModel_code_opt1.py:

  • As we can see, the two outputs are exactly the same, so the optimization is valid.

Parallelization

  • The first step in parallelizing this application is within the Python language itself, using parallelization strategies developed for Python. We looked at different parallel libraries for Python, such as ParallelPython, but the one we chose for this application is the built-in multiprocessing library.
  • The most natural way to parallelize this application is to divide the rows of the input file (each of which requires one iteration of the main loop) into as many blocks as there are processors and give each processor one block. The changes we made to the application are as follows:
  • The multiprocessing package needs the parallel portion of the program to be in a separate function, so that when the programmer creates the processes, he/she specifies that function (probably with different arguments) to be run by each process. Therefore, we put the for loop into a separate function named compute().

  • Therefore, the big picture looks like this: the main() function of the application starts, reads the input file and creates the 5 arrays from the file in memory. It then needs to divide these big arrays into P roughly equal pieces, where P is the number of processors. To do this division, we can use a well-known Python trick, as follows:
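
A sketch of one common way to do this split (not necessarily the exact trick used in the original code):

import numpy as np

P = 4                                   # number of processors
data = np.arange(10)                    # stand-in for one of the five input arrays
chunks = np.array_split(data, P)        # P pieces whose sizes differ by at most one
print([c.tolist() for c in chunks])     # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]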

  • We use Optimization 1, defined in the last section, in our parallel program too. So, at this stage the main() function calculates the DistanceRaster and DirectionRaster arrays.
  • After computing those arrays, we create P processes in a for loop and store them in a list of Process objects. The processes are started by calling their start() method (a combined sketch of the process creation and result gathering appears after this list):

  • Therefore, every process gets its own unique rank, the number of processes (referred to as size), a queue (for sharing information between the parent and children), DistanceRaster, DirectionRaster and a chunk of each array, and starts to work on them.
  • To share the result of each process, we use the Queue data structure implemented in multiprocessing. The idea is that every process puts its own sumall array in the queue at the end and exits. The parent process knows how many items will appear in this queue, so it waits on the queue and removes each element as it arrives. At the end of this process, it adds up all of the partial sums to get the final sum array.
  • The parent process then writes the final results to the output file.
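
A minimal sketch of this parent/child structure (names are illustrative; the real compute() also receives the DistanceRaster, DirectionRaster and a chunk of each of the five input arrays):

from multiprocessing import Process, Queue
import numpy as np

def compute(rank, size, queue, chunk):
    partial_sum = np.asarray(chunk).sum()    # stand-in for the footprint loop over this chunk
    queue.put(partial_sum)                   # send this process's partial sumall to the parent

if __name__ == "__main__":
    P = 4
    data = np.arange(100, dtype=float)       # stand-in for one of the input arrays
    chunks = np.array_split(data, P)

    queue = Queue()
    processes = [Process(target=compute, args=(rank, P, queue, chunks[rank]))
                 for rank in range(P)]
    for p in processes:
        p.start()

    # The parent knows exactly P results will arrive; a blocking get() waits
    # for each one, so no sleep-based polling of the queue is needed.
    total = sum(queue.get() for _ in range(P))
    for p in processes:
        p.join()
    print(total)                             # equals data.sum()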

Correctness of Parallelization

  • Our parallel implementation of the application produces correct results as well. Here is the visualization of the output of the parallel application:

Week 2's updates (March 12-16, 2012) by Babak

Evaluation of Parallelization

  • We evaluated the parallel code on the “07_21_2012_Footprint_input_1s.dat” dataset provided by April on the Ember shared-memory computer, varying the number of processors from 1 to 64. Here are the results:

Week 3's updates (March 19-24, 2012) by Babak

  • Our first explanation for not getting linear, or even close to linear, speedup on this program was that the dataset is not large enough to saturate the processors. Therefore, we obtained new, larger datasets from April. We ran our program on the data file “07_23_2011_Footprint_input.dat”; here are the results:

  • Note that we did not run the code with fewer than 16 processors for this experiment, because it would have taken too long.
  • As we can see from this table, we are again not getting the speedup that we want. Therefore, we need to reconsider our parallelization strategy.

Week 5's updates (April 02-08, 2012) by Babak

  • After looking more carefully at the code, we realized that the multiprocessing approach to sharing data between processes (using queues) has an inefficiency: once the parent process has started all of the worker processes, it has to check the queue for the results. This checking is done inside a while loop, and as originally written the loop calls time.sleep() for 5 seconds on each iteration. As we increase the number of processors, these sleeps accumulate and become a huge waste of time. We removed that sleep(5) call, and the results are much better.
  • We also instrumented the code at a finer granularity by adding several new timers.
  • The new results for the smaller dataset (“07_21_2012_Footprint_input_1s.dat”) can be seen below:

  • The new results for the larger dataset (“07_23_2012_Footprint_input.dat”) can be seen below:

  • If we compare these results with the previous ones, we can see that the timings have improved considerably. However, we are unfortunately still far from linear speedup. The second table shows that some processors have much less work to do (judging by the minimum compute time), so we have now encountered a load-imbalance problem. To resolve it, new decomposition and parallelization strategies are required.
  • Because of these results, we rethought the program and came up with the following idea: instead of decomposing the data into chunks and assigning them statically to processors, we can create all of the chunks first and put them in a queue (the job queue), and every processor takes a job from this job queue as soon as it becomes idle. This continues until all of the jobs in the job queue are done (a sketch of this scheme follows).
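
A minimal sketch of the job-queue scheme (illustrative; in the real code each job carries the chunked input arrays plus DistanceRaster and DirectionRaster, and the worker runs the footprint computation):

from multiprocessing import Process, Queue

def worker(job_queue, result_queue):
    # Each process keeps pulling chunks until it receives the sentinel.
    while True:
        chunk = job_queue.get()
        if chunk is None:              # sentinel: no more work
            break
        partial = sum(chunk)           # stand-in for the real footprint computation
        result_queue.put(partial)

if __name__ == "__main__":
    n_procs, n_chunks = 4, 16
    data = list(range(160))
    chunks = [data[i::n_chunks] for i in range(n_chunks)]

    job_queue, result_queue = Queue(), Queue()
    for chunk in chunks:
        job_queue.put(chunk)
    for _ in range(n_procs):
        job_queue.put(None)            # one sentinel per worker

    procs = [Process(target=worker, args=(job_queue, result_queue))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    total = sum(result_queue.get() for _ in range(n_chunks))
    for p in procs:
        p.join()
    print(total)                       # equals sum(data)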

Week 6's updates (April 09-14, 2012) by Babak

  • As we can see, for some processor counts we are getting better speedup with the new code than with the older version; for the processor counts where the running time is worse, we expect to do better by varying the number of chunks (and hence the granularity of the application). Therefore, we have to find the optimal number of chunks for each number of processors.
  • Here are some of the results we got from our experiments with the new version of the code:

  • Therefore, by changing the number of chunks to a better value, we were able to get much better speedup and scalability than with the older version of the code:

Updates (June 11-25, 2012) By Kiumars

  • I was assigned to improve the code's performance and potentially implement it on other platforms, such as OpenMP and CUDA. Therefore, the first thing I did was to port the code to C, so that we can easily use OpenMP and CUDA.

IMPORTANT NOTICE: The codes have been tested only on my work PC, so the output below is based on using 4 cores.

  • I investigated the code and found that we can improve it in three ways (in addition to Babak's improvements).

Making the Euclidean Direction and Distance calculations parallel

  • In the original version these two calculations were implemented as serial code; given their embarrassingly parallel nature, it is natural to parallelize them as well.
  • In addition, the two calculations were implemented in two separate loops, which can be fused into one: both arrays can be computed in a single pass over the raster (a sketch follows this list).
  • The following is the result of parallelizing the Euclidean direction and distance calculations. From now on, I'm going to refer to Babak's code as “parallel code version 1” and to mine as “parallel code version 2”. The running time in the table below is the time it took to calculate the Euclidean directions and Euclidean distances combined.
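
For illustration (independent of the timing results), here is a sketch of the fused, single-pass computation. This is a Python/NumPy sketch only; the actual version 2 code is C with OpenMP, and the parameter names (tower_row, tower_col, cellsize) and the direction convention are assumptions, not the real interface.

import numpy as np

def euclidean_direction_and_distance(nrows, ncols, tower_row, tower_col, cellsize):
    # Compute both rasters in a single pass instead of two separate loops.
    rows, cols = np.indices((nrows, ncols))
    dy = (rows - tower_row) * cellsize
    dx = (cols - tower_col) * cellsize
    direction = np.arctan2(dy, dx)   # Euclidean direction raster (radians)
    distance = np.hypot(dx, dy)      # Euclidean distance raster
    return direction, distance

direction_raster, distance_raster = euclidean_direction_and_distance(600, 600, 300, 300, 1.0)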

Summarizing the vector operations

  • Vector operations in high-level languages like Python are implemented very efficiently. However, it is common practice to combine (“summarize”) them as much as possible. There are two reasons to do so: first, it saves memory, which is a precious resource in parallel programs; second, it speeds up the program, especially when the vector operations run over very large arrays (as in our test cases). For instance, take a look at the following piece of code:
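
(The original snippet is not reproduced here; the following is an illustrative example of the pattern, with made-up variable names.)

import numpy as np

# Illustrative only: four vector operations that create four temporary vector
# variables; together with x, five vector variables are in play.
x = np.random.rand(1000000)
a = x * 2.0
b = a + 1.0
c = np.exp(b)
d = c / 3.0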

  • Here we do 4 vector operations and use 5 vector variables to do so. Now compare it to the following code:
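
(Again, this is an illustration of the collapsed form, not the original snippet.)

import numpy as np

# Illustrative only: the same computation collapsed into a single expression,
# so only one named vector variable (result) is created instead of the four
# temporaries above.
x = np.random.rand(1000000)
result = np.exp(x * 2.0 + 1.0) / 3.0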

  • Here we use only one vector variable and perform the whole computation in a single vector expression. In conclusion, we should try to merge vector operations into a final closed formula, as this saves memory and speeds up the program.

Loop-Carried Dependencies and Data Races

  • One of the main principles of parallel programming is that we should be aware of data races and loop-carried dependencies and avoid them at all costs. However, the main loop of our program that computes the result has two data-race issues, which we have to resolve before going any further.
  • Please note that data races do not necessarily lead to wrong results, but they make the implementation unpredictable, and at some point the code may not behave as we expect.
  • The first data race happens when we update WindDirect_Arr on every iteration. Take a look at the following code:

  • Since we don't know the order in which the loop iterations execute in the parallel model, this can make WindDirect_Arr's value unpredictable: the value of the array depends on its value in the previous iteration, and we don't know which iteration ran previously. This can be resolved (as I did in the final code) by changing the WindDirect_Arr update so that each iteration computes all of the updates it needs at once. Then there is no loop-carried dependency, and the new model is embarrassingly parallel.
  • The other issue happens when we try to update the sumall variable with
    • sumall = sumall + phi
    • Since, in the parallel model, two processors may try to update sumall simultaneously, the outcome of the operation is unpredictable.
  • There are two ways to solve this issue (a sketch of both approaches follows this list):
    1. We can put the operation in a critical section block, meaning that only one processor at a time runs this operation (Parallel Code version 2.1).
    2. This one is a little bit trickier. We can also swap the inner and outer loops to make each processor responsible for calculating one element of sumall. In the original model, each processor calculates a complete phi array and then tries to add it to the sumall array; in other words, each processor computes a small fraction of every element of sumall. In the new model, each processor computes whole elements of the sumall array (not fractions of them) and leaves the other elements to the other processors (Parallel Code version 2.2).
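
A Python multiprocessing analog of the two strategies (the actual versions 2.1 and 2.2 are written in C with OpenMP; all names here, such as worker_v21 and worker_v22, are illustrative, and trivial stand-ins replace the real phi computation):

import numpy as np
from multiprocessing import Array, Lock, Process

def worker_v21(rank, size, n_cells, sumall, lock):
    # Version 2.1 analog: each worker computes its own phi and adds it to the
    # shared sumall inside a critical section (here, a Lock).
    phi = np.full(n_cells, float(rank + 1))      # stand-in for the real phi
    with lock:                                   # only one process updates at a time
        shared = np.frombuffer(sumall.get_obj())
        shared += phi

def worker_v22(rank, size, n_cells, sumall):
    # Version 2.2 analog: each worker owns a disjoint set of sumall elements,
    # so no two processes ever write to the same cell and no lock is needed.
    shared = np.frombuffer(sumall.get_obj())
    for cell in range(rank, n_cells, size):      # the cells owned by this rank
        shared[cell] = 10.0                      # stand-in for the full sum over timesteps

if __name__ == "__main__":
    n_cells, size = 16, 4
    sumall = Array("d", n_cells)                 # shared array, initialised to zero
    lock = Lock()
    procs = [Process(target=worker_v21, args=(r, size, n_cells, sumall, lock))
             for r in range(size)]               # swap in worker_v22 (without the lock) to try 2.2
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(np.frombuffer(sumall.get_obj()))       # all cells equal 10.0 either way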

Running time and the output

  • First of all, we can take a look at the output we get from the two different approaches:

  • We can see that the output stays the same as in the previous version (Babak's). This is the running time of the four sections:

Updates (June 26-July 6, 2012) By Kiumars

  • Improvements:
  • Made a couple of changes to the Euclidean direction and distance calculation method.
  • Changed the timing method of the program to use Linux's own gettimeofday function.

  • Profiled the code and found out the time that each core is using in every section.
  • Ran the code on the Trestles supercomputer at SDSC.

Euclidean Direction & Distance

  • The sqrt function can be very time-consuming, so I used the computation that we already do for the Euclidean direction to compute the Euclidean distance.
  • As we calculate alpha = atan(y_temp/x_temp), we can use directiondistance = abs(y_temp/sin(alpha)).

  • In this particular case, since the function is already quite fast (on the order of milliseconds), there is no real need to make it faster. However, this can be seen as a general approach when designing the DSL: we should monitor complicated calculations to see whether we can substitute faster alternatives. In many cases this leads to a trade-off between time/storage and accuracy.
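
A quick numerical check of the identity behind this substitution (illustrative values; note that degenerate offsets such as y_temp = 0 need special handling, since sin(alpha) would be zero):

import math

x_temp, y_temp = 3.0, 4.0                           # arbitrary test offsets
alpha = math.atan(y_temp / x_temp)                  # Euclidean direction
directiondistance = abs(y_temp / math.sin(alpha))   # distance recovered without sqrt
print(directiondistance)                            # approximately 5.0
print(math.sqrt(x_temp**2 + y_temp**2))             # the same value, computed with sqrt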

Running the code on Trestles

  • I ran the code on one node, containing 32 cores.
  • Parallel version 2.2 now uses static scheduling with a chunk size of 1. The reason is that the earlier iterations in version 2.2 are much lighter than the later ones; with static, chunk-size-1 scheduling, the harder and easier iterations are distributed evenly among the processors, which improves load balancing. However, the scheduling method is still open to discussion, as we might get better results with dynamic or guided scheduling.
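
A minimal sketch of what static scheduling with a chunk size of 1 means: iterations are dealt out round-robin, so light early iterations and heavy late iterations are interleaved across workers. The function name my_iterations is illustrative; the real code presumably uses OpenMP's schedule(static, 1) clause.

def my_iterations(rank, size, n_iterations):
    # Static scheduling, chunk size 1: iteration i is assigned to worker i % size.
    return range(rank, n_iterations, size)

print(list(my_iterations(0, 4, 10)))   # worker 0 -> [0, 4, 8]
print(list(my_iterations(1, 4, 10)))   # worker 1 -> [1, 5, 9]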


Version Input Time Euclidean D&D Compute Time Output Time
Version 1.0 0.620 5.130 1067 1.000
Version 2.1 0.184 0.035 246 0.379
Version 2.2 0.184 0.033 56 0.381


Here you can also check the output of the codes.

Version 2.1 Version 2.2


NOTE: The previous version of the code had a bug that led to wrong results. The problem was in the routine that calculates DirectionRaster.

for(i = 0; i < 600; i++)
{
  int count = i * 600;  /* offset of the i-th row in the flattened raster */
  for(j = 0; j < 600; j++)
  {
    /*
    ...
    calculate the Direction Raster value DRIJ for cell (i, j)
    ...
    */
    DirectionRaster[count + j] = DRIJ;
    /*
    In the previous version this line was:
      DirectionRaster[count] = DRIJ;
    so our DirectionRaster vector was filled with wrong values in most of its elements.
    */
  }
}

Results on Ember

I ran the code (model 1) on Ember using different numbers of cores and two different datasets. The results are shown in the following tables.

Results for the 90k dataset

Number of Cores Input Time Euclidean D&D Compute Time Output Time
1 0.145 0.015 1726.677 0.461
2 0.143 0.008 867.163 0.465
4 0.142 0.006 434.856 0.307
8 0.135 0.012 352.227 0.640
16 0.146 0.023 111.516 0.311
32 0.154 0.093 55.700 0.525
64 0.189 0.203 47.116 0.662
128 0.131 1.854 26.478 0.642


Results for the 900k dataset

Number of Cores Input Time Euclidean D&D Compute Time Output Time
1 1.957 0.015 17704.774 0.536
2 1.969 0.009 8893.676 0.702
4 1.667 0.007 4547.601 0.417
8 1.809 0.010 3577.133 0.594
16 2.013 0.145 1134.169 0.561
32 1.991 0.050 690.254 0.341
64 1.981 0.177 493.950 0.393
128 1.688 0.614 251.200 0.552


I plotted the computation time for the 900k dataset with respect to the number of cores.

Direction Raster and Distance Raster Operators

I split the method that was responsible for calculating the Euclidean directions and distances into two separate methods, one for each calculation. This part shows the power of parallelism in implementing the usual operators of cartographic modeling.

Euclidean Direction

# of Cores/Raster size 500×500 2000×2000 8000×8000 32000×32000 64000×64000
1 0.009 0.154 2.460 39.333 165.832
2 0.004 0.078 1.236 19.772 82.069
4 0.002 0.037 0.590 9.428 37.718
8 0.002 0.020 0.316 5.065 20.969
16 0.002 0.030 0.396 3.983 16.466
32 0.014 0.059 0.187 2.885 10.787
64 0.007 0.018 0.132 1.687 6.512


You can see the speedup obtained by using OpenMP in the following picture.

Euclidean Distance

# of Cores/Raster size 500×500 2000×2000 8000×8000 32000×32000 64000×64000
1 0.005 0.064 1.026 16.106 65.782
2 0.003 0.033 0.514 8.139 32.556
4 0.018 0.030 0.242 3.624 14.403
8 0.010 0.016 0.141 2.225 9.892
16 0.063 0.033 0.242 1.931 8.643
32 0.105 0.093 0.242 2.510 6.059
64 0.277 0.042 0.267 1.440 5.431


The speedup of the Euclidean Distance calculation can be seen below.