Gi*(d) Hadoop: Data-Oriented Computation of the Gi*(d) Statistic using Google MapReduce and BigTable

Problem Definition

Team

  • Qian Huang
  • Yan Liu
  • Shaowen Wang

Resources

Development for the Gi*(d) Hadoop project takes place on a virtual machine (VM). The VM host is hadoop.cigi.uiuc.edu.

Roadmap

Week 09/20 - 09/27
  • Set up and experiment with Hadoop/HBase
    • Install Hadoop in pseudo-distributed mode
    • Develop a simple MapReduce application that uses two data servers
    • Extend the MapReduce application to use HBase data
Week 09/28 - 10/04
Week 10/05 - 10/11
  • Finish a MapReduce application that creates multiple map tasks to process an HBase table collectively. Each map task processes a disjoint set of rows in the HBase table.
  • Wrap up MapReduce and HTable practice
  • Read the Gi*(d) paper
  • Discuss Hadoop-based Gi*(d) computation
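
(A minimal sketch of how the items above could fit together: rows are split into disjoint ranges, each simulated "map task" computes Gi*(d) for its own rows, and the partial results are merged. It is plain Python, not Hadoop or HBase code. It assumes binary distance-band weights and the standardized z-score form of the statistic; this page does not say which formulation the project uses. The function names split_rows, gi_star_map and gi_star_all are hypothetical.)

```python
import math

def split_rows(n_rows, n_tasks):
    """Return n_tasks disjoint, contiguous [start, stop) ranges covering all rows."""
    base, extra = divmod(n_rows, n_tasks)
    ranges, start = [], 0
    for t in range(n_tasks):
        stop = start + base + (1 if t < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def gi_star_map(row_range, points, values, d):
    """Simulated map task: compute Gi*(d) for the rows in row_range only.

    The full point and value lists are read-only inputs, because the
    neighbors of a row may fall in another task's range.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    out = []
    for i in range(*row_range):
        xi, yi = points[i]
        # Assumed binary weights: w_ij = 1 if dist(i, j) <= d, with j == i included.
        nbrs = [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= d]
        w_sum = len(nbrs)  # equals both sum(w_ij) and sum(w_ij^2) for binary weights
        local = sum(values[j] for j in nbrs)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z = (local - mean * w_sum) / denom if denom > 0 else float("nan")
        out.append((i, z))
    return out

def gi_star_all(points, values, d, n_tasks):
    """Run one simulated map task per row range and merge the results by row index."""
    merged = {}
    for row_range in split_rows(len(points), n_tasks):
        merged.update(gi_star_map(row_range, points, values, d))
    return [merged[i] for i in range(len(points))]

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
    vals = [1, 1, 1, 5]
    for i, z in enumerate(gi_star_all(pts, vals, d=1.0, n_tasks=2)):
        print(i, round(z, 3))
```

(Because every row belongs to exactly one range, the merge step is a plain union. Each task still scans all points to find neighbors within d, so in this sketch only the output rows are decomposed, not the neighbor search.)
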
Week 10/12 - 10/18
  • Confirm that dfs.block.size can be changed on a per-file basis; if not, redesign

(Confirmed that dfs.block.size can be changed on a per-file basis, but the mapper task still reads the default value stored in the conf file. QianHuang)

  • Explore how data files/blocks are distributed and replicated over multiple data nodes
  • Can we control how data blocks are distributed or replicated, so that MapReduce computation is balanced across the data nodes where the decomposed datasets are stored?

(There is a new class, ReplicationTargetChooser, in the unreleased version of Hadoop that may help us control where blocks are stored.)