Data-Intensive Parallel Spatial Analysis

Problem Definition

  1. How can spatial k-nearest neighbour search over 1 billion points be performed efficiently using cloud computing technologies (e.g., Google MapReduce/Bigtable/GFS)?
  2. Building on the answer to question 1, how can spatial interpolation and local clustering analysis be performed?

Team

  • Yan Liu
  • Shaowen Wang
  • Kaichao Wu
  • Yanli Zhao

Resources

Roadmap

Week 03/08 - 03/14
  • Set up resources by re-organizing our previous efforts. Done
  • Set up a weekly discussion schedule: Mondays 4-5pm, weekly
Week 03/15 - 03/21
  • Paper reading on Gi*(d) decomposition and computation (see the formula sketch after this week's items)
  • Discussion of the original idea for HBase/HDFS/MapReduce-based Gi*(d) computation
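  For reference, the standardized Getis-Ord statistic that this decomposition targets can be written as below. This is the textbook form and is stated here as an assumption about which variant the papers use; the MapReduce sketch under Week 03/29 - 04/04 computes these same sums:

    G_i^*(d) = \frac{\sum_j w_{ij}(d)\, x_j - \bar{x}\, W_i^*}{s \sqrt{\frac{n\, S_{1i}^* - (W_i^*)^2}{n-1}}},
    \qquad W_i^* = \sum_j w_{ij}(d), \qquad S_{1i}^* = \sum_j w_{ij}^2(d)

  where \bar{x} and s are the mean and standard deviation of the n attribute values x_j, and w_{ij}(d) is the spatial weight between points i and j within distance d. Every term is a sum over j, so partial sums (W*, S*, \sum_j w_{ij} x_j) computed over disjoint subsets of points can be merged into the global value.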
Week 03/22 - 03/28
  • Study of adaptive quadtree decomposition and its role in the Hadoop-Gi*(d) solution
  • Discussion of scalability issues of Hadoop-Gi*(d).
Week 03/29 - 04/04
  • Continue the discussion on Hadoop-Gi*(d) algorithms. Kaichao's slides
    1. Storage models: the storage model may be affected by user query patterns, e.g., point query vs. polygon query
    2. Query pattern: when a user selects a single point or the points within a polygon, how do we identify the points whose Gi*(d) values should be calculated?
  • Clarify the tasks to be done on Hadoop
    1. Construction of the data and index tables (the index table structure does not need to be complete). Data table construction and access are not progressing well with Kaichao's code
    2. Storage of sample datasets at large scale, and investigation of how to fetch data/rows distributed across multiple data blocks and data nodes.
    3. How to retrieve HBase table rows? Commands and code examples; Kaichao has developed such code (see the retrieval sketch after this week's items)
    4. How to write map tasks to fetch HBase table rows? What if the specified row range spans more than one data node? Does that indicate the need for a reduce task?
  • Kaichao and Yanli have implemented a prototype Gi*(d) program that computes one point's Gi*(d) value using HBase and MapReduce (a sketch of this structure also follows this week's items)
    1. The sample data has 96 points, stored in an HBase table as 10 rows (10 points in each of the first 9 rows, 6 in the last row)
    2. Each map task takes a point as input (easily extensible to a point set) and computes partial results for W*, S*, etc. from this point to all points contained in the row the map task is assigned to
    3. The reduce task aggregates the partial results to form the global Gi*(d) value for this point
  • Next steps
    1. To extend the prototype into a modular function gidQuery(pointSet)
    2. To figure out how to do adaptive quadtree decomposition and space-filling curve indexing for an input point dataset, based on point density
    3. To figure out how an HBase table is stored on HDFS, i.e., the location of rows in data blocks, data nodes, or memory. Touch base on Hadoop's support for manipulating data location
    4. To analyze and evaluate storage strategies to increase data parallelism within gidQuery(): identify and represent hot spots; possibly interleave the storage of hot-spot quads so that the computation in gidQuery() uses as many data nodes as possible simultaneously
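  A minimal sketch of HBase row retrieval from Java, for item 3 under "Clarify the tasks" above ("How to retrieve HBase table rows?"). The table name "points" and the column "loc:coords" are placeholders rather than the schema in Kaichao's code, and exact client calls may differ slightly by HBase version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowFetchExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "points");   // hypothetical table name

        // Fetch a single row by key.
        Result row = table.get(new Get(Bytes.toBytes("row0003")));
        byte[] coords = row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coords"));
        System.out.println(Bytes.toString(coords));

        // Scan a key range; the scanner crosses region (and hence data node) boundaries transparently.
        Scan scan = new Scan(Bytes.toBytes("row0000"), Bytes.toBytes("row0010"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
      }
    }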
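  The prototype described above (one map task per HBase row computing partial sums, one reduce task merging them into the global Gi*(d) value) could be organized roughly as in the following sketch. The names are assumptions, not taken from the actual prototype: the table "points", the column family "loc" with per-point qualifiers x<i>/y<i>/v<i>, a binary weight w_ij(d) = 1 when the distance is at most d, and a single query point passed through the job configuration:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GiStarJob {

      // Map side: one task per table split; accumulate partial sums over the points in its rows.
      public static class PartialSumMapper extends TableMapper<Text, Text> {
        private double qx, qy, d;   // query point and distance threshold, from the job configuration

        protected void setup(Context ctx) {
          Configuration c = ctx.getConfiguration();
          qx = Double.parseDouble(c.get("gistar.qx"));
          qy = Double.parseDouble(c.get("gistar.qy"));
          d  = Double.parseDouble(c.get("gistar.d"));
        }

        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
            throws IOException, InterruptedException {
          double sumWx = 0, sumW = 0, sumX = 0, sumX2 = 0, n = 0;
          // Assumed schema: each row stores several points as loc:x<i>, loc:y<i>, loc:v<i>.
          for (int i = 0; ; i++) {
            byte[] xb = row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("x" + i));
            if (xb == null) break;
            double x = Bytes.toDouble(xb);
            double y = Bytes.toDouble(row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("y" + i)));
            double v = Bytes.toDouble(row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("v" + i)));
            double w = (Math.hypot(x - qx, y - qy) <= d) ? 1.0 : 0.0;   // binary weight w_ij(d)
            sumWx += w * v; sumW += w; sumX += v; sumX2 += v * v; n++;
          }
          ctx.write(new Text("gistar"),
                    new Text(sumWx + "," + sumW + "," + sumX + "," + sumX2 + "," + n));
        }
      }

      // Reduce side: merge the partial sums and evaluate the global Gi*(d) value.
      public static class GiStarReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          double sumWx = 0, sumW = 0, sumX = 0, sumX2 = 0, n = 0;
          for (Text v : vals) {
            String[] p = v.toString().split(",");
            sumWx += Double.parseDouble(p[0]); sumW  += Double.parseDouble(p[1]);
            sumX  += Double.parseDouble(p[2]); sumX2 += Double.parseDouble(p[3]);
            n     += Double.parseDouble(p[4]);
          }
          double mean = sumX / n;
          double s = Math.sqrt(sumX2 / n - mean * mean);
          // With binary weights, S1* (sum of w^2) equals W* = sumW.
          double gi = (sumWx - mean * sumW)
                    / (s * Math.sqrt((n * sumW - sumW * sumW) / (n - 1)));
          ctx.write(key, new Text(Double.toString(gi)));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("gistar.qx", args[0]);
        conf.set("gistar.qy", args[1]);
        conf.set("gistar.d",  args[2]);
        Job job = new Job(conf, "gistar");
        job.setJarByClass(GiStarJob.class);
        TableMapReduceUtil.initTableMapperJob("points", new Scan(),
            PartialSumMapper.class, Text.class, Text.class, job);
        job.setReducerClass(GiStarReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[3]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }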
Week 04/05 - 04/11
  • To extend the prototype into a modular function gidQuery(pointSet)
  • To figure out how to do adaptive quadtree decomposition and space-filling curve indexing for an input point dataset, based on point density (a quadtree/Z-order sketch follows this week's items)
  • To figure out how an HBase table is stored on HDFS, i.e., the location of rows in data blocks, data nodes, or memory. Touch base on Hadoop's support for manipulating data location
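  A minimal sketch of the second item above (adaptive quadtree decomposition plus Z-order/Morton indexing). The class names and parameters (cell capacity, coordinate range, bit depth) are illustrative assumptions, not decisions made by the project: a quad splits only when it holds more than a capacity of points, so dense regions are decomposed more finely, and a point's Morton code can later serve as its HBase row key so that spatially close points get adjacent keys:

    import java.util.ArrayList;
    import java.util.List;

    public class AdaptiveQuadtree {

      // Interleave the bits of x/y cell indices to obtain a Z-order (Morton) key.
      static long mortonCode(int ix, int iy, int bits) {
        long code = 0;
        for (int b = bits - 1; b >= 0; b--) {
          code = (code << 1) | ((ix >> b) & 1);
          code = (code << 1) | ((iy >> b) & 1);
        }
        return code;
      }

      static class Node {
        double x0, y0, size;                       // lower-left corner and side length of this quad
        List<double[]> points = new ArrayList<double[]>();
        Node[] children;                           // null for a leaf

        Node(double x0, double y0, double size) { this.x0 = x0; this.y0 = y0; this.size = size; }

        // Adaptive decomposition: split only when a quad holds more than `capacity` points.
        void insert(double[] p, int capacity) {
          if (children == null) {
            points.add(p);
            if (points.size() > capacity) split(capacity);
            return;
          }
          child(p).insert(p, capacity);
        }

        private void split(int capacity) {
          double h = size / 2;
          children = new Node[] {
            new Node(x0, y0, h), new Node(x0 + h, y0, h),
            new Node(x0, y0 + h, h), new Node(x0 + h, y0 + h, h) };
          for (double[] p : points) child(p).insert(p, capacity);
          points.clear();
        }

        private Node child(double[] p) {
          int i = (p[0] >= x0 + size / 2 ? 1 : 0) + (p[1] >= y0 + size / 2 ? 2 : 0);
          return children[i];
        }
      }

      public static void main(String[] args) {
        Node root = new Node(0, 0, 1024);          // assume points fall in [0,1024)^2
        double[][] pts = { {1, 2}, {3, 1}, {500, 600}, {2, 2}, {4, 3} };
        for (double[] p : pts) root.insert(p, 2);
        // A point's Morton code (here at 10 bits per axis) could be its HBase row key.
        System.out.println(mortonCode((int) pts[0][0], (int) pts[0][1], 10));
      }
    }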
List of pending TODOs from previous discussions
  • Introduction to the MapIMG raster processing algorithm
    1. A straightforward, less computation-intensive geoprocessing method, but data-intensive and widely and frequently used
    2. MapIMG is sequential code developed by USGS. MapIMG uses GCTP, which includes C code auto-translated from Fortran
    3. Question 1: how complicated is the software?
    4. Question 2: how complicated are the algorithms?
    5. Question 3: based on the answers to Q1 and Q2, how should we approach it in the Hadoop context?

Archive

References