.. include:: ../../global.inc
.. include:: chapter_numbers.inc

.. _manual.tasks_as_recipes:

###################################################################
|manual.tasks_as_recipes.chapter_num|: `Tasks and Recipes`
###################################################################

* :ref:`Manual overview `

.. index::
    pair: tasks as recipes; Manual

| The python functions which do the actual work of each stage or :term:`task` of a **Ruffus** pipeline are written by you.
| The role of **Ruffus** is to make sure these functions are called in the right order, with the right parameters, running in parallel (using multiprocessing) if desired.

**Ruffus** manages the data flowing through your pipeline by supplying the correct parameters to your pipeline functions. In this way, you get the following features for free:

    #. only out-of-date parts of the pipeline will be re-run
    #. multiple jobs can be run in parallel (on different processors if possible)
    #. pipeline stages can be chained together automatically

Much of the functionality of **Ruffus** involves determining the data flow through your pipeline, by governing how the output of one stage of the pipeline is supplied as parameters to the functions of the next.

.. index::
    pair: skip up-to-date; Manual

.. _manual.skip_up_to_date:

=======================================
Skip jobs which are up to date
=======================================

Very often it will be necessary to re-run a computational pipeline because part of the data has changed. **Ruffus** will run only those stages of the pipeline which are absolutely necessary.

By default, **Ruffus** uses file modification times to see which parts of the pipeline are out of date, and which :term:`task`\s need to be run again. This is so convenient that even if a pipeline is not file-based (if, for example, it uses database tables instead), it may be worthwhile to use dummy "sentinel" files to manage the stages of a pipeline.
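The modification-time rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not **Ruffus**'s actual implementation, and ``needs_update`` is a hypothetical helper name:

```python
import os
import tempfile

def needs_update(inputs, outputs):
    """Illustrative sketch of a file-based up-to-date check:
    rerun if any output is missing, or if any input is at least
    as new as the oldest output (equal timestamps are treated
    conservatively as out of date)."""
    if not all(os.path.exists(o) for o in outputs):
        return True
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return any(os.path.getmtime(i) >= oldest_output for i in inputs)

# Demonstrate with explicit, deterministic timestamps:
tmp = tempfile.mkdtemp()
infile = os.path.join(tmp, "a.1")
outfile = os.path.join(tmp, "a.2")
open(infile, "w").close()
open(outfile, "w").close()

os.utime(infile, (100, 100))     # input older than output
os.utime(outfile, (200, 200))
print(needs_update([infile], [outfile]))   # False: job is up to date

os.utime(infile, (300, 300))     # input now newer than output
print(needs_update([infile], [outfile]))   # True: job must be re-run
```

Note the ``>=`` comparison: as discussed in the *Timestamp resolution* caveat later in this chapter, files with identical time stamps are treated as out of date.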
(It is also possible, as we shall see later, to add custom functions to determine which parts of the pipeline are out of date. See :ref:`@parallel ` and :ref:`@check_if_uptodate `.)

.. index::
    single: inputs / outputs parameters

.. _manual.io_parameters:

.. index::
    pair: inputs / outputs parameters; Manual

=================================
*Inputs* and *Outputs* parameters
=================================

**Ruffus** treats the first two parameters of each job in each task as the *inputs* and *outputs* parameters respectively. If these parameters are strings, or are sequences which contain strings, these will be treated as the names of files required by and produced by that job. The presence and modification times of the *inputs* and *outputs* files will be used to check whether it is necessary to re-run the job.

Apart from this, **Ruffus** imposes no other restrictions on the parameters for jobs, which are passed verbatim to task functions. Most of the time, it is sensible to stick with file names (strings) in the *inputs* and *outputs* parameters, but **Ruffus** does not try to second-guess what sort of data you will be passing through your pipelines (except that strings represent file names).
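The rule that only (nested) strings count as file names can be sketched in plain Python. This is an illustration only, not **Ruffus**'s internal code, and ``collect_filenames`` is a made-up helper name:

```python
def collect_filenames(param):
    """Gather every string nested anywhere inside a parameter,
    treating each one as a file name. Illustrative sketch only:
    not Ruffus's actual implementation."""
    if isinstance(param, str):
        return [param]
    if isinstance(param, (list, tuple, set)):
        names = []
        for item in param:
            names.extend(collect_filenames(item))
        return names
    return []    # numbers and other non-string objects are ignored

print(collect_filenames([[1, 3], "afile.name", ("bfile.name", 72)]))
# ['afile.name', 'bfile.name']
```

A scan like this, applied to the *inputs* and *outputs* parameters, yields the file names whose presence and modification times are checked; everything else in the parameters is left untouched.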
Thus, given the following over-elaborate parameters (parameter passing will be discussed in more detail from :ref:`|manual.files.chapter_num| `):
    ::

        [ [[1, 3], "afile.name", ("bfile.name", 72)],
          [[56, 3.3], set([custom_object(), "output.file"])],
          33.3,
          "oops"]

This will be passed `"as is"` to your task function:
    ::

        do_something([[1, 3], "afile.name", ("bfile.name", 72)],         # input
                     [[56, 3.3], set([custom_object(), "output.file"])], # output
                     33.3,                                               # extra parameter
                     "oops")                                             # extra parameter

**Ruffus** will interpret this as:
    ::

        Input_parameter   = [[1, 3], "afile.name", ("bfile.name", 72)]
        Output_parameter  = [[56, 3.3], set([custom_object(), "output.file"])]
        Other_parameter_1 = 33.3
        Other_parameter_2 = "oops"

**Ruffus** disregards the *structure* of your data, only identifying the (nested) strings. Thus there are 2 input files:
    ::

        "afile.name"
        "bfile.name"

and 1 output file:
    ::

        "output.file"

.. index::
    pair: rules; for rerunning jobs

.. _manual.skip_up_to_date.rules:

=======================================
Checking if files are up to date
=======================================

**Ruffus** uses the following simple rules:

#. A pipeline job will be re-run if:

   * any of the *inputs* files are new (newer than the *outputs* files)
   * any of the *outputs* files are missing

#. In addition, it is possible to run jobs which create files from scratch:

   * if no *inputs* file names are supplied, the job will only run if any *outputs* file is missing.

#. Finally, if no *outputs* file names are supplied, the job will always run.

The :ref:`example ` in the next chapter shows how this works in practice.

.. index::
    pair: Exception; Missing input files

=======================================
Missing files
=======================================

If the *inputs* files for a job are missing, the task function will have no way to produce its *outputs*. In this case, a ``MissingInputFileError`` exception will be raised automatically.
For example,
    ::

        task.MissingInputFileError: No way to run job:
            Input file ['a.1'] does not exist for
            Job = ["a.1" -> "a.2", "A file"]

.. index::
    single: Timestamp resolution

=======================================
Caveats: Timestamp resolution
=======================================

| Note that modification times have a precision of one second under some older file systems (ext2/ext3?). This may also be true for networked file systems.
| **Ruffus** is very conservative: it assumes that files with *exactly* the same time stamp might have been created in the wrong order, and will treat the job as out of date.

This would result in some jobs re-running unnecessarily, simply because an underlying coarse-grained file system does not distinguish between successively created files with sufficient accuracy. To get around this, **Ruffus** makes sure that each task is punctuated by a one second pause (via ``time.sleep()``).

If this gets in the way, and you are using a modern file system with nanosecond timestamp resolution, you can turn off the delay by setting ``one_second_per_job`` to ``False`` in :ref:`pipeline_run `.

Later versions of **Ruffus** will allow file modification times to be saved at higher precision in a log file or database to get around this.
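The effect of coarse timestamps can be demonstrated in plain Python: two files stamped half a second apart become indistinguishable once their times are truncated to whole seconds, as they would be on a file system with one-second resolution. The file names below are made up for the illustration:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "stage1.out")   # hypothetical pipeline files
b = os.path.join(tmp, "stage2.out")
open(a, "w").close()
open(b, "w").close()

# Pretend the files were created 0.5 s apart ...
os.utime(a, (100.4, 100.4))
os.utime(b, (100.9, 100.9))

# ... truncated to whole seconds (one-second file system
# resolution), both stamps collapse to the same value, so the
# creation order is lost:
coarse_a = int(os.path.getmtime(a))
coarse_b = int(os.path.getmtime(b))
print(coarse_a == coarse_b)   # True
```

Because such equal stamps are ambiguous, **Ruffus** treats them as out of date, and the one second ``time.sleep()`` pause between tasks guarantees that a coarse-grained file system can still order the files correctly.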