Medical Bioinformatics and e-Bioscience
MOTEUR: Advanced Page
This page is intended for advanced users and developers of the e-BioInfra. Please read the
MOTEUR page first.
Details of workflow execution
Figure 1 - Detailed sequence diagram of workflow execution with MOTEUR on the e-BioInfra
The execution of a workflow by MOTEUR involves various software components, illustrated in figure 1. The main steps are listed below:
- the process is activated when the user presses the "RUN" button on the MOTEUR plugin in the VBrowser.
- the plugin then contacts the MOTEUR web service that runs on the e-BioInfra server (alegre). The workflow description (SCUFL file), the input values (XML file) and the user proxy are sent to the server over HTTPS.
- the web service is implemented by a CGI program (moteur_server) that writes its own log file (moteur_service.log). It creates a new directory (with a unique name) to store all data related to this workflow execution. It then runs a script (submitWorkflow.sh).
- the script configures environment variables and then starts a MOTEUR engine to execute the workflow on the grid. This script has been adapted for the e-BioInfra environment.
- the engine is a Java program that interprets the SCUFL, GASW descriptor(s) and the input files, and executes the workflow on the grid using the user grid proxy. It also generates monitoring information for the workflow and individual jobs as HTML pages. The engine keeps running until the workflow is complete.
- for each workflow component and combination of input values (also called workflow task), an individual grid job is executed. The engine downloads the GASW descriptor, generates a script to perform the task on the grid, and generates the corresponding job description file (jdl). Each job is identified by the name of the workflow component followed by a unique identifier. The gasw, shell and jdl files are stored using this job identifier. All job names are stored in a single file (jobs.txt).
- the jobs are submitted to the gLite Workload Management System (WMS). After some interaction with other gLite middleware components, the WMS submits the job to the queue of some computing resource and returns a job identifier (job id). MOTEUR uses this identifier to follow the status of the job. The user can also follow the progress of jobs using the JobMonitoring plugin or the gLite command-line utilities. All job ids are stored in a single file (jobs.vljids).
- ultimately the jobs run on some worker node. Because of the wrapping done automatically by GASW, each job performs the following steps:
  - download programs (application executable and dependencies)
  - download the data (input files to the workflow component)
  - run the executable
  - upload results (output files of the workflow component)
  - return exit code (0 = ok, non-zero = error)
- the engine monitors the status of all jobs, updating it in the corresponding <job name>.jdl.log file
  - jobs that are not completed successfully (either aborted or completed with a non-zero exit code) are retried a given number of times (RETRYCOUNT)
  - when a job is not completed within a given time (TIMEOUT), it is killed by MOTEUR and retried a given number of times (RETRYCOUNT)
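The retry behaviour described in the last step can be illustrated with a minimal Python sketch. This is not MOTEUR code: the function name run_with_retries and the submit callable are hypothetical, "COMPLETED" stands here for a job that finished with exit code 0, and it is an assumption that RETRYCOUNT counts resubmissions (so a job is attempted at most 1 + RETRYCOUNT times).

```python
from typing import Callable, Tuple


def run_with_retries(submit: Callable[[], str], retry_count: int) -> Tuple[str, int]:
    """Submit a job and resubmit it while it keeps failing.

    `submit` is a hypothetical stand-in for one submission; it returns the
    final status of that attempt ("COMPLETED", "ABORTED", "ERROR", "TIMEOUT").
    Returns the last status seen and the total number of attempts.
    """
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")

    attempts = 0
    status = ""
    # Assumption: RETRYCOUNT counts resubmissions, so the job is tried
    # at most 1 + retry_count times in total.
    while attempts <= retry_count:
        attempts += 1
        status = submit()
        if status == "COMPLETED":
            break  # success, no further retry needed
    return status, attempts
```

For example, a job that is first ABORTED, then hits the TIMEOUT, and finally completes would be reported as completed after three attempts, provided RETRYCOUNT is at least 2.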
See also the official MOTEUR troubleshooting page.
Generated files
For each workflow executed with MOTEUR, various files are generated and collected into a single directory with a fixed structure (see below). A new directory is created for each workflow, with a unique name:
workflow-<unique id>
This directory can be reached in several ways:
- through the 'Log' link on the workflow HTML monitoring page
- via the web server:
  https://alegre.science.uva.nl:9443/workflows/workflow-<unique id>
- directly on the e-BioInfra server (or with SFTP):
  /var/www/html/workflows/workflow-<unique id>
The most relevant files/directories in this directory are:
- workflow.xml: workflow description (SCUFL file) executed by MOTEUR
- inputs.xml: contains all workflow input values
- workflow.err: stderr for MOTEUR engine during the workflow execution
- workflow.out: stdout for MOTEUR engine during the workflow execution
- gasw/: contains the GASW descriptors for all the executed workflow components. The files in this directory are named with the workflow component (e.g. hello-GASW.xml)
- jdl/: contains the jdl files automatically generated by MOTEUR for all the executed workflow components based on the GASW descriptor. The files are named after the job name ( <job-name>.jdl, e.g. hello-GASW-12345.jdl). Note that a new .jdl file is generated every time the job is submitted (also when it is retried because of some error). The file <job-name>.jdl.log (e.g. hello-GASW-12345.jdl.log) contains the job history, for example:
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:06:24:SUCCESSFULLY_SUBMITTED
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:06:24:WAITING
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:06:34:READY
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:06:44:QUEUED
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:08:05:RUNNING
https://wmslb2.grid.sara.nl:9000/a:11:05:2009:16:10:05:COMPLETED
- sh/: contains the shell scripts (<job-name>.sh) automatically generated by MOTEUR for all the executed workflow components based on the GASW descriptor. The files are named with the workflow component name followed by a random number (e.g. hello-GASW-12345.sh). Note that a new file is generated every time the job is submitted (also when it is retried because of some error).
- jobs.txt: contains the names of all the submitted jobs (e.g. hello-GASW-12345)
- jobs.php: a PHP script that displays the job monitoring page. It reads the files containing the job status (jdl/*.jdl.log) and presents the status in a human-readable format. In the current version of MOTEUR, this page opens when the boxes are clicked.
- jobs.vljids: contains the job ids of all the jobs submitted for the workflow. When this file is left-clicked in the VBrowser, the JobMonitoring plugin opens and displays the job status.
Notes:
- the moteur_server.log file is not accessible to the user.
- the new MOTEUR server/plugin enables the user to configure TIMEOUT and RETRYCOUNT (and other parameters) for each new workflow. The configuration is saved in the directory conf.
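The job history lines shown above for <job-name>.jdl.log can be split into their parts with a few lines of Python. The sketch below only assumes the layout visible in the example (job id, three date fields, three time fields, status, all separated by colons); the names JobEvent, parse_event and last_status are illustrative and not part of MOTEUR. Because the job id itself contains colons, the line is split from the right. The date is kept verbatim, since the example does not show which field is the day and which is the month.

```python
from typing import Iterable, NamedTuple, Optional


class JobEvent(NamedTuple):
    job_id: str   # e.g. "https://wmslb2.grid.sara.nl:9000/a"
    date: str     # kept verbatim, e.g. "11:05:2009" (field order not interpreted)
    time: str     # e.g. "16:06:24"
    status: str   # e.g. "RUNNING"


def parse_event(line: str) -> JobEvent:
    """Parse one history line of a <job-name>.jdl.log file."""
    # Split from the right: the job id is a URL and contains colons itself.
    parts = line.strip().rsplit(":", 7)
    if len(parts) != 8:
        raise ValueError("unexpected history line: %r" % line)
    job_id, d1, d2, d3, hh, mm, ss, status = parts
    return JobEvent(job_id, ":".join((d1, d2, d3)), ":".join((hh, mm, ss)), status)


def last_status(lines: Iterable[str]) -> Optional[str]:
    """Return the status of the most recent event, or None for an empty log."""
    events = [parse_event(line) for line in lines if line.strip()]
    return events[-1].status if events else None
```

Applied to the six example lines above, last_status returns the status of the final line, COMPLETED.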
Handling Errors
Errors may occur during the execution of a workflow, mostly during the execution of jobs on the grid.
In some cases, the causes of failure are transient, and simply retrying the jobs will work. These errors can be detected by inspecting the workflow logs, but they need no further action, since MOTEUR has already recovered from them automatically.
Other errors persist during the whole workflow execution (so retrying the jobs does not work), but they are solved soon afterwards (e.g. by intervention of the grid admins). Recovery can be achieved by running the complete workflow again, or by rerunning the relevant part of the workflow for the input values whose results are missing.
Finally, some errors persist because they are due to application failure (bugs) or permanent failure in the grid resources (e.g., some file was lost). In this case, manual intervention is needed to fix the problem.
To find out the type of error at hand, it is necessary to dive into the files generated during workflow execution.
| What you see | Possible cause | Files to look at |
| workflow monitoring page only shows Submitting Workflow | Error in the web service, workflow not started | moteur_server.log, workflow.err/.out |
| workflow never ends (green and grey boxes) | Error in the engine, workflow died | moteur_server.log, workflow.err/.out |
| red boxes in the monitoring page | failure in running the job on the grid (ABORTED, TIMEOUT); application terminates with non-zero exit code (ERROR) | jdl/*.log, sh/*, stdout/stderr of jobs |
Troubleshooting
- find the job-name of the jobs that failed (jdl/*.log):
  - look for jobs with status ABORTED, ERROR or TIMEOUT
    - ABORTED means some problem on the grid; inform grid.support if it persists
    - TIMEOUT means that MOTEUR killed the job because it was taking too long to complete. If the job was already RUNNING, the current timeout is too small. Increase it and rerun the jobs/workflows
    - ERROR means that the job completed with an error (non-zero exit code). These are the jobs that need most attention because they may point to application errors.
- recover the job-id for the job of interest
  - match the line number in jobs.txt with the same line number in jobs.vljids
- recover the std.out and std.err files
- examine the files and check whether you can find the following cases
  - can't download program(s)
  - can't download data
  - error when running the application
  - can't upload the result
  - other strange things, such as missing files, libraries, executables, etc.
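The step of matching line numbers in jobs.txt with jobs.vljids can be automated. The following Python sketch assumes that both files hold one entry per line and that line n of jobs.txt corresponds to line n of jobs.vljids, as described above; the function names pair_jobs and job_ids_for are illustrative. The result is kept as a list of pairs rather than a dictionary, in case a job name occurs more than once.

```python
from typing import Iterable, List, Tuple


def _entries(lines: Iterable[str]) -> List[str]:
    """Strip line endings and drop trailing blank lines only,
    so that the line-number correspondence is preserved."""
    entries = [line.strip() for line in lines]
    while entries and not entries[-1]:
        entries.pop()
    return entries


def pair_jobs(name_lines: Iterable[str], id_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair line n of jobs.txt (job name) with line n of jobs.vljids (job id)."""
    names = _entries(name_lines)
    ids = _entries(id_lines)
    if len(names) != len(ids):
        raise ValueError(
            "jobs.txt has %d entries but jobs.vljids has %d" % (len(names), len(ids))
        )
    return list(zip(names, ids))


def job_ids_for(pairs: List[Tuple[str, str]], job_name: str) -> List[str]:
    """Return every job id recorded for the given job name."""
    return [job_id for name, job_id in pairs if name == job_name]
```

Since open files iterate line by line, the pairs for a workflow directory could be built with pair_jobs(open("jobs.txt"), open("jobs.vljids")), and job_ids_for(pairs, "hello-GASW-12345") would then return the job id(s) to use when retrieving the std.out and std.err files.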
See Also