USGS Collaboration

pRasterBlaster

Identified Tasks

pGeotiff
  1. resolve compile issue on local box
  2. modify I/O code to use fwrite() in order to understand the output data pattern
  3. replace output code with
    1. an MPI-IO file view setup for the output raster
    2. MPI-IO calls to write the data
NetCDF/HDF5
  1. Test the parallel I/O features of NetCDF/HDF5: work out examples that benefit from parallel I/O
  2. Use the NetCDF/HDF5 libraries for outputting geospatial datasets (*the focus before May 14*)
  3. Combine the above two and code the parallel NetCDF/HDF5 geospatial data output library

Parallel I/O

In the pRasterBlaster code, each process writes a number of rows of output raster cells to a location specified by an offset in the output raster file. Currently, the write is incorrect in a parallel computing environment because it simply uses GDAL's GeoTIFF data driver, which is sequential and does not consider file locking. When many processes write the output raster file, some processes fail to acquire the lock (we suspect the number of simultaneous file opens is limited). This task tackles that issue and aims to find a parallel I/O solution for pRasterBlaster.

MPI-IO is what we think we should leverage as a parallel I/O solution. There are several ways to use MPI-IO in pRasterBlaster:

  1. each process outputs its part as a single TIFF; after mpirun, one process aggregates all TIFF files into the final TIFF. This is a shortcut that we should avoid unless there is a deadline to meet
  2. write our own library to output GeoTIFF using MPI-IO. This can be costly, too, because the GeoTIFF format can be complex. There is reference code in TauDEM: src/tiffIO.cpp, src/tiffIO.h, and tiffTest.cpp are the relevant files. We don't need to care about the TIFF reading part; we just need to write one format. This solution might be worthwhile as an effort to build our own data I/O capabilities
  3. change the pRasterBlaster output format to NetCDF/HDF5. NetCDF/HDF5 has built-in MPI-IO support if NetCDF-4 is compiled with HDF5 and HDF5 is compiled with the parallel I/O option. On most XSEDE clusters, HDF5 is compiled with the parallel I/O option, but the clusters that have NetCDF-4 installed are all using sequential HDF5. We need to do our own NetCDF-4 install and test it out. NetCDF is recognized by ArcGIS and GDAL, which is good
  4. extend the GDAL GeoTIFF data driver to use MPI-IO. This method can be costly because it mixes several things together: GDAL data driver development, rewriting the GeoTIFF driver to support MPI-IO, and making it work in pRasterBlaster. There may be a quicker way: use the GDAL data driver to create the output raster and metadata on rank 0, then close it; then replace the row-writing part of the code with an MPI-IO implementation and make sure the resulting raster can be recognized by common GIS software such as GDAL

At this point, methods 2 and 3 seem to be the way to go.
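
To make the MPI-IO idea concrete, below is a minimal file-view sketch of the row-offset write pattern that methods 2 and 4 would build on. The fixed header size, the row-wise uncompressed layout, and all function/parameter names are illustrative assumptions, not pRasterBlaster's actual structure.

#include <mpi.h>

/* Each rank owns `my_rows` consecutive output rows starting at `first_row`,
 * each row `row_bytes` bytes, placed after a fixed header of `header_bytes`. */
int write_row_block(char *path, void *rows, MPI_Offset first_row, int my_rows,
                    MPI_Offset row_bytes, MPI_Offset header_bytes)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Point the file view at this rank's first row; rows are contiguous,
     * so MPI_BYTE suffices as both etype and filetype for a row-wise layout. */
    MPI_File_set_view(fh, header_bytes + first_row * row_bytes,
                      MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    /* Collective write of this rank's block of rows. */
    MPI_File_write_all(fh, rows, (int)(my_rows * row_bytes), MPI_BYTE,
                       MPI_STATUS_IGNORE);

    return MPI_File_close(&fh);
}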

For method 2, the following tasks are identified:

  1. read the TauDEM tiffIO code and test it out
  2. evaluate the cost of writing our own code for TIFF I/O

For method 3, the following tasks are identified to make it happen:

  1. Configure XSEDE clusters to enable parallel NetCDF/HDF5
  2. Test the write performance of NetCDF/HDF5 MPI-IO
  3. Find/develop examples that write a raster into NetCDF/HDF5 and convert it into a common GIS raster format, e.g., GeoTIFF (using GDAL?); see the sketch after this list
  4. Integrate the found solution into pRasterBlaster
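
For the conversion step in task 3, a minimal sketch using GDAL's C API could look like the following. It assumes GDAL was built with the netCDF driver; files containing several variables may require GDAL's NETCDF:"file.nc":variable subdataset syntax instead of the bare path.

#include <stdio.h>
#include <gdal.h>

/* Copy a NetCDF raster to GeoTIFF via GDAL's CreateCopy. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s in.nc out.tif\n", argv[0]);
        return 1;
    }
    GDALAllRegister();

    GDALDatasetH src = GDALOpen(argv[1], GA_ReadOnly);
    if (src == NULL) return 1;

    GDALDriverH gtiff = GDALGetDriverByName("GTiff");
    GDALDatasetH dst = GDALCreateCopy(gtiff, argv[2], src,
                                      0 /* bStrict */, NULL, NULL, NULL);
    if (dst != NULL) GDALClose(dst);
    GDALClose(src);
    return dst == NULL;
}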

Resources

Datasets

Map Generalization

Larry@USGS will visit us on the second day of the USGS visit. We will summarize the TauDEM experience with him and discuss potential collaboration directions. Choonhan@SDSC was able to run TauDEM successfully on Trestles and is willing to share his experience with us within the CyberGIS project. We need manpower to test his solution and conduct a preliminary performance study.

Open Service API to pRasterBlaster computation

Computation management of pRasterBlaster via an open service API has been developed by Yan. During the USGS visit, we will demonstrate this capability in terms of batch processing, triggering discussion on how to improve USGS operation of pRasterBlaster and allow the CyberGIS community to access the pRasterBlaster service. Topics include:

  • What are the online GIS solutions for mapping non-Mercator projections?
  • Where should the computed results of pRasterBlaster be stored in USGS production mode? If they are transferred to USGS storage, we might want to open a data transfer service to CyberGIS users by extending the open service API

Updates from 1st Week (March 16.) Babak, Eric, Yan

  • TIFF structure: header (fixed), IFDs (singly linked), binary data
  • Geotiff Spec
  • GTiff2000
  • We don't need to support a lot of formats; since we are only writing, we just need to support one (probably row-wise, without compression) that other applications can read.

Tasks:

  1. Yan, Eric: Read GDAL's geotiff driver.
  2. Babak: Support NetCDF-HDF5 for GeoTIFF (NetCDF-4) → first step: load a NetCDF file, read it, and write it again in NetCDF format. Yan has a sample NetCDF file from Mike.

Updates from 2nd Week (March 22.) Babak

  • Tried to understand the HDF5 and NetCDF file formats. I am writing a report based on the tutorials and examples that I've been reading.
  • Got a relatively large (200 MB) NetCDF file from Mike. Wrote a small program to extract some facts from this file (number of dimensions, number of attributes, etc.); see the sketch below.
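
A sketch of that kind of inspection program, using the standard NetCDF inquiry calls (the actual program may differ):

#include <stdio.h>
#include <netcdf.h>

/* Open a NetCDF file and report the number of dimensions, variables,
 * and global attributes it contains. */
int main(int argc, char **argv)
{
    int ncid, ndims, nvars, natts, unlimdimid, status;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file.nc\n", argv[0]);
        return 1;
    }

    status = nc_open(argv[1], NC_NOWRITE, &ncid);
    if (status != NC_NOERR) {
        fprintf(stderr, "%s\n", nc_strerror(status));
        return 1;
    }

    nc_inq(ncid, &ndims, &nvars, &natts, &unlimdimid);
    printf("dimensions: %d, variables: %d, global attributes: %d\n",
           ndims, nvars, natts);

    nc_close(ncid);
    return 0;
}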

Updates from 4th Week (Apr 09.) Babak

  • After reading about how to use the parallel function calls of NetCDF-4, I tried to compile code containing an nc_create_par() call on my desktop and got an error that the function could not be found in the NetCDF library.
  • The problem is that it is hard to compile MPICH2 + HDF5 + NetCDF-4 with parallel I/O enabled. Here is how I made it work:
  • How to compile MPICH2:
$ export CFLAGS="-fPIC"
$ export CXXFLAGS="-fPIC"
$ ./configure --enable-romio --enable-shared --prefix=/opt/mpich2-1.4.1p1-gcc44/
$ make && make install
  • How to compile HDF5:
$ export CC=mpicc
$ export CXX=mpicxx
$ ./configure --enable-parallel --with-szlib=/opt/szip-2.1-gcc44/ --prefix=/opt/hdf5-1.8.6-gcc44 --with-pic
$ make check install
  • How to compile NetCDF-4:
$ export CC="mpicc"
$ export CXX="mpicxx"
$ export CPPFLAGS="-I/opt/hdf5-1.8.6-gcc44/include/ -I/opt/szip-2.1-gcc44/include/"
$ export LDFLAGS="-L/opt/hdf5-1.8.6-gcc44/lib/ -L/opt/szip-2.1-gcc44/lib/"
$ ./configure --prefix=/opt/netcdf-4.2-gcc44 --with-pic
$ make && make install
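
With that stack installed, a quick sanity check is a tiny program in which every rank calls nc_create_par(); if it links and runs under mpirun, parallel NetCDF is working. A sketch (the output file name is just a placeholder):

#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int main(int argc, char **argv)
{
    int ncid, status;
    MPI_Init(&argc, &argv);

    /* Collectively create a NetCDF-4 file with MPI-IO parallel access. */
    status = nc_create_par("test_par.nc", NC_NETCDF4 | NC_MPIIO,
                           MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
    if (status != NC_NOERR)
        fprintf(stderr, "nc_create_par failed: %s\n", nc_strerror(status));
    else
        nc_close(ncid);

    MPI_Finalize();
    return status != NC_NOERR;
}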

Updates Apr 13 - Eric

  • I have finished scanning through the GeoTIFF documentation
  • I also reviewed the pRasterBlaster code to decide the best approach for integrating parallel I/O
  • I believe the best approach will be the following:
    • Add an MPI-IO file handle to the ProjectedRaster class
    • The file handle will be opened upon creation
    • writeRaster will use MPI-IO parallel file writing functions (see the sketch below)
    • Then the file handle will be closed
  • To test correctness I will compare the output of a parallel write and a serial write to make sure they are binary-identical
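
A minimal sketch of the kind of call writeRaster could issue through that handle, using MPI-IO's explicit-offset write so that no file locking or shared file pointer is involved; the fixed-header, row-wise layout and all names here are illustrative, not the actual pRasterBlaster code:

#include <mpi.h>

/* Write this rank's rows at an explicit byte offset through an already
 * opened MPI_File handle (e.g. one stored in ProjectedRaster). */
int write_rows_at(MPI_File fh, void *rows, int nbytes,
                  MPI_Offset first_row, MPI_Offset row_bytes,
                  MPI_Offset header_bytes)
{
    MPI_Offset offset = header_bytes + first_row * row_bytes;
    return MPI_File_write_at(fh, offset, rows, nbytes,
                             MPI_BYTE, MPI_STATUS_IGNORE);
}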

Updates Apr 28 - Babak

I have been given a NetCDF file named L3b_20111101-20111130__GLOB_4_GSM-MERMODSWF_CHL1_MO_00.nc. When I looked at the header of this NetCDF file (using the ncdump command), I found the website that provides it: http://www.globcolour.info/

The summary of the product information from this website is as follows:

  • GlobColour Level-3 output data includes binned, mapped and browse products which are described in the following sections. The binned and mapped products are stored in netCDF files. The netCDF library or third-party tools including netCDF readers must be used to read the GlobColour products. The browse products are written in PNG format.
  • netCDF (Network Common Data Form) is a machine-independent, self-describing, binary data format standard for exchanging scientific data.
  • The ncdump utility, available on the UCAR server, generates a CDL text representation of a netCDF file on the standard output, optionally excluding some or all of the variable data in the output.

The following rules are applied when writing the global binned (ISIN grid) and mapped products:

  • Each parameter is stored in a single file including metadata and accumulated statistical data.
  • Global metadata are stored as global attributes.
  • Accumulated statistical data are stored as variables.
  • Metadata related to statistical data are stored as variable attributes.

Naming Convention: Lzz_date_time_ROI_SR_INS_PRD_TC_nn.ext

  • Lzz is the product level (L3b for level 3 binned ISIN grid, L3m for level 3 mapped grid)
  • date is specified in UTC format as yyyymmdd
  • time is specified in UTC format as hhmmss
  • ROI is the name of the region of interest (e.g. GLOB for global coverage)
  • SR indicates the resolution of the grid (e.g. 4 for 1/24° ISIN grid)
  • INS is the instrument acronym (MER for MERIS, MOD for MODIS, SWF for SeaWiFS, or any combination of these names for the merged products)
  • PRD is the product type (CHL for chlorophyll...).
  • TC is the time coverage (TR for track-level products, DAY for daily, MO for monthly, YR for annual products)
  • nn is a counter. For track products, we store in this counter the data-day in yyyymmdd format.
  • ext is the file extension (nc for netCDF files, png for PNG files)

A netCDF dataset is made up of the following basic components:

  • dimensions
  • variables
  • variable attributes
  • global attributes

The variables store the actual data, the dimensions give the relevant dimension information for the variables, and the attributes provide auxiliary information about the variables or the dataset itself.

First Version of Parallel I/O Library using NetCDF (Jun 2012, Babak)

We are going to document our first version of a parallel I/O library for GIS, which reads from and writes to NetCDF files using NetCDF-4 parallel operations. This library is written in C and is stored at this SVN link: <SVN LINK TO THE LIBRARY>.

This library consists of a general info module (info.c and info.h) and two components: one for reading (par_data_read.h, par_data_read.c) and one for writing (par_data_write.h, par_data_write.c), each described below.

Info Module

In the info.h file we have defined two important data structures for the two main concepts of the NetCDF file format: Dimensions and Variables.

Dimension Data Structure

Dimensions describe the lengths of the data you want to store. As we will see in the next section, the variables that store the data in NetCDF are defined over dimensions. Our abstraction of a NetCDF dimension in the info module is simply a C struct containing an integer for the id of the dimension, a string for its name, and a size_t for its length.
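
A sketch of that struct (the exact field names live in info.h and may differ):

#include <stddef.h>

typedef struct dim_info {
    int    id;      /* NetCDF dimension id                    */
    char  *name;    /* dimension name                         */
    size_t length;  /* number of entries along this dimension */
} dim_info;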

Variable Data Structure

Variables are the most important NetCDF constructs; they are where the data get stored. Our data structure for representing a variable is again a C struct containing an integer for the id of the variable, a string for its name, an nc_type for the type of the data we want to store, an integer for the number of dimensions the variable has, an array of integers for the ids of the dimensions the variable is made up of, and an integer for the number of attributes assigned to the variable.

In the first version of this library, we are going to skip attributes, as they are used for metadata.
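
A sketch of the variable struct as described above (again, the real field names are in info.h; the fixed-size dimension-id array is just one possible choice):

#include <netcdf.h>

typedef struct var_info {
    int     id;                       /* NetCDF variable id                     */
    char   *name;                     /* variable name                          */
    nc_type type;                     /* type of the stored data                */
    int     ndims;                    /* number of dimensions                   */
    int     dim_ids[NC_MAX_VAR_DIMS]; /* ids of the dimensions it is built from */
    int     natts;                    /* number of attributes (skipped for now) */
} var_info;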

Parallel Data Read Module

This module contains the general functions for reading data from a NetCDF file in parallel. For now we have implemented read functions for variables of type short and double. The general declaration of a read function looks like this: <type> *read_<type>_variable_by_name(char *var_name, int ncid, var_info *vinfo, dim_info *dinfo);

Therefore, our current implementation provides this function for <type> = short and double. All of these functions are declared in par_data_read.h and defined in par_data_read.c. As can be seen from the declaration, we take the name of the variable the user wants to read; the ncid of the NetCDF file the user has opened with the appropriate NetCDF open functions; vinfo, a pointer to an array of Variable structures filled by the get_vars_info() function of the Info module; and dinfo, a pointer to an array of Dimension structures filled by the get_dims_info() function of that module. In this way, read_<type>_variable_by_name() has access to all the variables and dimensions in the file.

Based on the name, we look up the variable the user wants to read and obtain its dimensions. From the number of dimensions we fill the arrays needed for reading the data in parallel (the start[] and count[] arrays that nc_get_vara_<type> needs). When calculating these arrays we must decide whether to decompose the read in 1D, 2D, or even higher dimensions; for now we use just 1D. We then malloc a buffer the size of the whole variable and call the NetCDF get-variable functions to read the data into memory in parallel. After this stage we have an MPI_Barrier to synchronize, and then we return the pointer to the variable in memory.
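
The following sketch illustrates this flow for <type> = short, using a 1D decomposition along the first (slowest-varying) dimension. Error handling is omitted, and the sketch re-inquires NetCDF directly rather than consulting vinfo/dinfo, so it approximates rather than reproduces the real implementation:

#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include "info.h"   /* var_info / dim_info from the Info module */

short *read_short_variable_by_name(char *var_name, int ncid,
                                   var_info *vinfo, dim_info *dinfo)
{
    /* (vinfo and dinfo are unused in this simplified sketch) */
    int varid, ndims, dimids[NC_MAX_VAR_DIMS], rank, nprocs;
    size_t dimlen[NC_MAX_VAR_DIMS], start[NC_MAX_VAR_DIMS], count[NC_MAX_VAR_DIMS];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Look up the variable by name and get its dimensions. */
    nc_inq_varid(ncid, var_name, &varid);
    nc_inq_varndims(ncid, varid, &ndims);
    nc_inq_vardimid(ncid, varid, dimids);

    size_t slice = 1;   /* number of elements per index of dimension 0 */
    for (int i = 0; i < ndims; i++) {
        nc_inq_dimlen(ncid, dimids[i], &dimlen[i]);
        if (i > 0) slice *= dimlen[i];
    }

    /* 1D partition along dimension 0: rank r reads rows [r*chunk, r*chunk+count)
     * and the full extent of every other dimension. */
    size_t chunk = dimlen[0] / nprocs;
    start[0] = (size_t)rank * chunk;
    count[0] = (rank == nprocs - 1) ? dimlen[0] - start[0] : chunk;
    for (int i = 1; i < ndims; i++) { start[i] = 0; count[i] = dimlen[i]; }

    nc_var_par_access(ncid, varid, NC_COLLECTIVE);   /* collective parallel reads */

    /* Buffer sized for the whole variable; each rank fills its own slice. */
    short *data = malloc(dimlen[0] * slice * sizeof *data);
    nc_get_vara_short(ncid, varid, start, count, data + start[0] * slice);

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before returning the buffer */
    return data;
}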

Parallel Data Write Module