OSG Troubleshooting Problems

This page explores challenges associated with the current OSG infrastructure.

Problems experienced by VOs submitting jobs to OSG

Based on our earlier experience helping VOs run their jobs on OSG, we found that VOs and application developers encounter a number of problems. Here are some of the causes:

  • From the perspective of the user/VO, it is quite unclear what a green dot in VORS really means.
    • Currently VORS uses site_verify, which performs some minimal testing such as a ping, a simple fork job, and a GridFTP test (a sketch of equivalent manual probes appears after this list).
    • Does it sufficiently exercise the OSG software infrastructure? If not, we need to identify which components are missing and should be added.
    • The goal is not to provide a monitoring system that resource administrators rely on (they should be using tools like RSV); rather, it is aimed at job submitters and VOs.
    • Further, we have a plethora of monitoring systems in OSG (VORS, RSV, GIP Validate, GridScan, etc.). These tools may help resource administrators find problems, but they can be quite confusing to users, so many users end up running test jobs to exercise the system themselves. Though in some cases it is useful for users to run test jobs under their own DN, it would be preferable to avoid such repetitive test jobs from a large number of users.
      • Ideally, a green dot in VORS would indicate that all of the software components distributed by the OSG (and deemed critical) have been installed and configured correctly to OSG's satisfaction.
      • There would be a clear and well-known set of components that are tested.
      • This would satisfy the requirements of many VOs, which could then test any additional requirements specific to their VO.
  • How accurate is the information about VO support that a site advertises through various monitoring sources?
    • Based on our experience, it is not very accurate, especially for small and newer VOs.
    • There is a need to investigate how best to detect this problem and notify administrators.
    • There is also a need to verify that this information is reported consistently across different monitoring systems such as GIP and VORS.
      • In the past some discrepancies have been reported.
  • Some sites place extra constraints on the jobs that are submitted.
    • For example, on a few sites (typically those using NFSLite) the RSL parameter remote_initial_dir is not respected.
    • We need a standard mechanism to publish such site-specific information.
  • It is also important for a VO to know what resources it can expect from a site, in particular what priority the user's VO will have for running jobs there and how often. Ideally, jobs would be able to use resources as and when they become available, but in the current environment VOs spend some upfront cost to make their applications run successfully at a site, and this information will help them identify the resources on which that effort is worthwhile.
  • How to help application developers implement tools for using OSG effectively.
    • Consider the example of the OSG storage parameters/model. The release document provides details of the various models, but from the perspective of users and application developers there is very little information available on how to develop software that correctly uses the storage parameters.
    • For example, a commonly asked question from LIGO users has been whether $OSG_DATA at a site is mounted using NFS or some other shared file system. This problem could be resolved if we had a defined set of rules/APIs that application developers can easily adopt (see the sketch after this list).
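
As one illustration of the last point, the sketch below checks what kind of file system backs $OSG_DATA by reading /proc/mounts on a Linux worker node. It is only a sketch of the sort of rule/API an application developer might want: the set of "shared" file system types is an assumption, and a real API would have to be defined and published by OSG.

  # Minimal sketch (Linux only): report the file system type backing $OSG_DATA.
  # The list of "shared" types below is an assumption for illustration.
  import os

  SHARED_FS_TYPES = {"nfs", "nfs4", "lustre", "gpfs", "panfs", "afs"}

  def fs_type(path):
      """Return (mount_point, fs_type) of the longest mount prefix covering path."""
      path = os.path.realpath(path)
      best = ("/", "unknown")
      with open("/proc/mounts") as mounts:
          for line in mounts:
              _dev, mnt, fstype = line.split()[:3]
              if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
                  if len(mnt) >= len(best[0]):
                      best = (mnt, fstype)
      return best

  osg_data = os.environ.get("OSG_DATA")
  if osg_data:
      mnt, fstype = fs_type(osg_data)
      shared = fstype in SHARED_FS_TYPES
      print(f"$OSG_DATA={osg_data} is {fstype} (mounted at {mnt}); "
            f"{'looks like a shared file system' if shared else 'may be node-local'}")
  else:
      print("$OSG_DATA is not defined on this node")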
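
Similarly, as a rough illustration of what stands behind a VORS green dot, the sketch below drives two probes comparable to what site_verify runs: a simple fork job through the gatekeeper and a small GridFTP transfer. It assumes the Globus clients (globus-job-run, globus-url-copy) are installed and a valid grid proxy exists; ce.example.edu and the destination path are placeholders, not real endpoints.

  # Minimal sketch of site_verify-style probes (fork job + GridFTP copy).
  # ce.example.edu and the destination path are placeholders for illustration.
  import subprocess

  SITE = "ce.example.edu"

  def probe(cmd):
      """Run one probe command and report whether it exited successfully."""
      try:
          result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
          ok = result.returncode == 0
      except (OSError, subprocess.TimeoutExpired):
          ok = False
      print(("PASS " if ok else "FAIL ") + " ".join(cmd))
      return ok

  # Probe 1: simple fork job through the gatekeeper.
  fork_ok = probe(["globus-job-run", SITE + "/jobmanager-fork", "/bin/hostname"])

  # Probe 2: GridFTP transfer of a small local file to the site.
  gridftp_ok = probe(["globus-url-copy", "file:///etc/hostname",
                      "gsiftp://" + SITE + "/tmp/osg_probe_test"])

  print("basic probes passed" if (fork_ok and gridftp_ok) else "basic probes failed")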

The first three points imply a need for more precise information about OSG sites that users and VOs can confidently rely upon. The fourth bullet conveys a need for additional information that is currently unavailable. The last point can be thought of as a need for documentation and communication about aspects of the present OSG software stack, as well as for development of new features to enable application developers.

The Monitoring System

In OSG we have

  • Information Services: GIP, CEMon, BDII and ReSS
  • Monitoring Services: VORS, Grid Scanner, RSV, GIP Validation, VOMS Monitor, LDAP Information Display, VO-centric tool
  • Central Information Sources: GOC registration database, Maintenance scheduling, ticketing system

https://twiki.grid.iu.edu/twiki/bin/view/MonitoringInformation/WebHome provides a list of OSG monitoring and information systems. We have many components that monitor the OSG software stack, including VORS, Grid Scanner, RSV, GIP Validate, LDAP Information Display, and VOMS Monitor. This raises a few questions:

  • With so many different sources of monitoring information, it is unclear which sources need to be checked and which is considered definitive.
    • There is a need to consolidate the monitoring sources and possibly provide a single reference point for validating an OSG site.
  • A larger question is how accurate the various OSG monitoring services are and how we can verify that.
    • For example, during the past year multiple tickets were opened when LIGO reported discrepancies between ReSS queries performed using the VO-centric kit and information provided by contacting resource administrators.
  • Which OSG infrastructure components need to be tested, which are deemed critical, and what is the best way to monitor them?
  • Other unresolved questions include: how should the information system handle the need of individual VOs to publish information they would like to use for matchmaking?

The Condor-G Grid Monitor Scalability

To improve the scalability of GRAM-2 jobs, Condor-G provides the grid_monitor, which essentially replaces the polling function of the jobmanager.

This interaction has resulted in a number of problems that we have addressed. For example, LIGO workflows were failing and their outputs were getting lost; troubleshooting identified a race condition between the jobmanager and the grid monitor as the cause. We have also seen problems in which job status was not propagated back to Condor-G. High gatekeeper latency has been reported at http://www.mwt2.org/sys/gatekeeper/, with grid monitor restarts being the suspected cause.
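
For reference, whether Condor-G uses the grid monitor at all, and how long the gridmanager waits for it before falling back to jobmanager polling, is controlled by Condor configuration. The fragment below is only a sketch: ENABLE_GRID_MONITOR and GRID_MONITOR are standard Condor configuration macros, but the exact names, defaults, and the path to the grid_monitor script should be verified against the Condor manual for the deployed version, and the timeout shown is an illustrative value.

  # Sketch of condor_config settings related to the Condor-G grid monitor
  # (verify macro names and defaults against the Condor manual for your version).
  ENABLE_GRID_MONITOR = True
  GRID_MONITOR = $(SBIN)/grid_monitor.sh
  # Illustrative value: how long the gridmanager tolerates silence from the
  # grid monitor before reverting to per-job jobmanager polling.
  GRID_MONITOR_HEARTBEAT_TIMEOUT = 300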

Since Condor-G forms a critical piece of the OSG infrastructure, and its interaction with the jobmanager has been the cause of a number of problems investigated by the troubleshooting team, we believe a thorough investigation of this piece, along with some scalability testing, will be useful in improving the infrastructure.

Job and Site Reliability Metric

Each VO has some measurement of the reliability/efficiency of its jobs running on the Grid. Some classify it based on application success rate, while others differentiate between Grid and application errors. Currently Gratia uses application error codes to determine success rates, but there is no uniform OSG-wide measurement of Grid site or job reliability. Such measures would help us identify problems with the infrastructure as well as provide additional benchmarks for evaluating new infrastructure components. There are many open questions to resolve: how do we define job reliability and site reliability, how do they relate to one another, how do we identify success/failure (e.g., checking logs to see whether the user successfully retrieved the sandbox, and whether resubmissions by Condor-G should be counted), and how do we handle pilot jobs? One possible starting point is sketched below.
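
As an illustration of the kind of definition that would need to be agreed upon, the sketch below computes one candidate pair of metrics from a list of job records: site reliability as the fraction of jobs whose grid-level handling succeeded (regardless of the application exit code), and an end-to-end job success rate that also requires a zero application exit code. The record format and the grid/application error split are assumptions for illustration, not an existing Gratia schema.

  # Sketch of one candidate job/site reliability metric. The record layout is a
  # made-up example (not Gratia's schema); it only illustrates separating grid
  # errors from application errors.
  from collections import defaultdict

  # Hypothetical job records: (site, grid_ok, application_exit_code)
  jobs = [
      ("SiteA", True, 0),      # grid handling and application both succeeded
      ("SiteA", True, 1),      # grid OK, application failed
      ("SiteA", False, None),  # lost to a grid error (e.g. sandbox never retrieved)
      ("SiteB", True, 0),
  ]

  per_site = defaultdict(lambda: {"total": 0, "grid_ok": 0, "app_ok": 0})
  for site, grid_ok, exit_code in jobs:
      stats = per_site[site]
      stats["total"] += 1
      stats["grid_ok"] += int(grid_ok)
      stats["app_ok"] += int(grid_ok and exit_code == 0)

  for site, s in sorted(per_site.items()):
      site_reliability = s["grid_ok"] / s["total"]   # did the infrastructure do its part?
      job_success_rate = s["app_ok"] / s["total"]    # end-to-end success
      print(f"{site}: site reliability {site_reliability:.2f}, "
            f"job success rate {job_success_rate:.2f}")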

Implementing Grid Passport for Troubleshooting

This will be a longer-term project aimed at improving troubleshooting capability in OSG, helping users and VOs better understand their job execution process, and providing better error-tracking information when a failure occurs.