doit for task automation

Managing computational workflows can be difficult. In particular, manually tracking which input and output files have recently been updated, so that you don’t repeat work, will likely result in you screwing something up — or repeating work anyway.

doit solves this problem by tracking which files within a workflow need to be created or updated at any given time, and executing the necessary processes in the correct order. If you’ve ever written a makefile, none of this will be new to you; I do hope you’ll appreciate the much simpler and easier-to-use syntax, though!

Here I’ll go over a few examples and use cases.

Getting pydoit

pip install doit

Hello World

doit allows you to build a series of rules that describe tasks to be completed. Each task has a set of dependencies, an action, and a target (or output). The simplest doit workflow would be something like this:

file_in --> processing_program --> file_out

When you run doit a few things can happen:

  • pydoit sees that file_out doesn’t exist, so it executes the task to create it.

  • pydoit sees that file_out exists, but file_in has changed since the last time the task ran, so it re-executes the task to regenerate file_out.

  • pydoit sees that file_out exists and nothing upstream of it has changed, so it does nothing.

Minimum Example:

Put all three files into a folder:

├── dodo.py
├── ShouldISayHello
└── HelloWorld.py
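The original files aren’t reproduced in this post, but a minimal dodo.py for this example might look something like the sketch below. The task name, the shell action, and the assumption that HelloWorld.py prints a greeting to stdout are all mine, chosen to match the resulting Hello.txt:

```python
# dodo.py -- a hypothetical sketch of the Hello World task.
# ShouldISayHello is treated as an (empty) input file; HelloWorld.py is
# assumed to print a greeting to stdout.

def task_hello():
    """Regenerate Hello.txt whenever either input file changes."""
    return {
        "file_dep": ["ShouldISayHello", "HelloWorld.py"],
        "targets": ["Hello.txt"],
        "actions": ["python HelloWorld.py > Hello.txt"],
    }
```

doit discovers any function whose name starts with `task_` and reads the returned dict: if Hello.txt is missing, or either file in `file_dep` has changed, the action runs.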

and run:

doit

resulting in:

.
├── dodo.py
├── Hello.txt
├── HelloWorld.py
└── ShouldISayHello

Pretty simple.

Doit for batch processing

By using a yield statement in our task, we can dynamically generate tasks over any arbitrary list. The dodo file is pretty self-explanatory; just take special notice of the use of Python’s yield and the name keyword.
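The linked files aren’t shown here; a sketch of what such a dodo.py could look like follows. The copy action standing in for real processing, and the naming scheme, are my assumptions, chosen to match the task names in the output further down:

```python
# dodo.py -- hypothetical sketch of a batch-ingest workflow.
import glob
from pathlib import Path


def task_ingest():
    """Yield one sub-task per *.in file found in the data/ landing area."""
    for path in sorted(glob.glob("data/*.in")):
        target = str(Path(path).with_suffix(".out"))
        yield {
            "name": f"ingest_{path}",   # becomes e.g. ingest:ingest_data/1.in
            "file_dep": [path],
            "targets": [target],
            "actions": [f"cp {path} {target}"],  # stand-in for real processing
        }
```

Because the function yields dicts instead of returning one, doit creates an independent sub-task per file, each with its own dependency tracking — which is what lets it later skip files it has already processed.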

Grab these files and put them in a folder.

Run the following commands:

mkdir data
cd data
for i in {1..3}; do touch $i.in; done
cd ..
doit -n 3

Outputs something like:

.  ingest:ingest_data/1.in
.  ingest:ingest_data/2.in
.  ingest:ingest_data/3.in

Doit just processed those files in parallel using 3 processes.

Let’s add some more files to our landing area:

cd data
for i in {5..20}; do touch $i.in; done
cd ..
doit -n 3

Outputs something like:

-- ingest:ingest_data/2.in
-- ingest:ingest_data/1.in
-- ingest:ingest_data/3.in
.  ingest:ingest_data/5.in
.  ingest:ingest_data/17.in
.  ingest:ingest_data/18.in
.  ingest:ingest_data/16.in
.  ingest:ingest_data/15.in
.  ingest:ingest_data/14.in
.  ingest:ingest_data/20.in
.  ingest:ingest_data/12.in
.  ingest:ingest_data/10.in
.  ingest:ingest_data/6.in
.  ingest:ingest_data/7.in
.  ingest:ingest_data/8.in
.  ingest:ingest_data/11.in
.  ingest:ingest_data/9.in
.  ingest:ingest_data/13.in
.  ingest:ingest_data/19.in

It skipped over the first 3 ingestion tasks, as they’d already been completed.

It should be pretty obvious how one might hook doit up to a cron job to watch a landing area for new data entering a pipeline.
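For instance, a crontab entry along these lines (the path is hypothetical) would re-run the workflow every five minutes, picking up whatever files have landed since the last run:

```shell
# Hypothetical crontab entry: every 5 minutes, run doit with 3 parallel
# processes from the directory containing dodo.py.
*/5 * * * * cd /path/to/pipeline && doit -n 3
```

Since doit skips up-to-date tasks, each cron invocation only processes files that are new or changed.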

Final Thoughts

Because doit describes tasks in a simple and modular way, it offers a lightweight method of defining and executing entire workflows.

It’s a powerful framework with plenty of knobs to turn. Documentation of some of the cooler features is here