Help: GMI Outbreak Detection Methods




Repository

Source code for the GMI Outbreak detection workflow can be found in this repository

Datasets

The datasets needed by the workflow have been collected in the datasets folder:

  1. Input dataset: FASTQ input data obtained from the GMI WGS standards and benchmarks repository. Here you can find instructions for download.
  2. Gold standard dataset: confirmed phylogeny for the outbreak being investigated.
  3. Input dataset IDs: the input dataset IDs in .txt and .json format.
  4. Test dataset: a test tree to be compared with the gold standard result. In this case it is simply the same gold standard dataset, so the Robinson-Foulds metric must be 0 (see the check sketched just after this list).
  5. benchmark_data: path where benchmark results are stored.
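
Since the test dataset and the gold standard are the same tree, their Robinson-Foulds distance must be 0. A minimal sketch of that check, assuming the dendropy library and illustrative file names (the actual dataset file names may differ):

```python
import dendropy
from dendropy.calculate import treecompare

# Both trees must share a taxon namespace to be comparable
tns = dendropy.TaxonNamespace()
gold = dendropy.Tree.get(path="datasets/gold_standard.nwk", schema="newick", taxon_namespace=tns)
test = dendropy.Tree.get(path="datasets/test_tree.nwk", schema="newick", taxon_namespace=tns)
gold.encode_bipartitions()
test.encode_bipartitions()

# Robinson-Foulds (symmetric difference) distance; identical topologies give 0
rf = treecompare.symmetric_difference(gold, test)
assert rf == 0, f"expected identical topologies, got RF = {rf}"
```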

Nextflow pipeline and containers

Second, a pipeline has been developed that is split into three steps following the OpenEBench specifications, using this repo as an example:

Nextflow processes

  1. Validation and data preprocessing:

    1. Check results format:

      • Tree input: the format of the user's input tree is validated; Nexus and Newick formats are allowed, with Newick being the canonical format. If the format is valid, the tree is written out in the canonical format (.nwk). A sketch of this step is shown after this list.
      • VCF input:
    2. Get query ids:

      • Tree input: IDs are extracted from the user input tree in Newick or Nexus format and written to queryids.json.
    3. Validate query ids:
      • Tree input: query IDs are validated against the reference input IDs.
  2. Metrics:

    1. Precision/Recall calculation: edges common to both trees (TP), edges found only in the source (test) tree (FP) and edges found only in the reference tree (FN) are counted when comparing the reference and test tree topologies. Precision and recall are calculated from these values and stored in a JSON file called {participant_id}_snprecision.json (see the sketch after this list).
    2. Robinson-Foulds metric calculation: a normalized Robinson-Foulds test is performed between the user tree and every participant tree already analyzed and stored in the benchmark_data folder, in order to compare their topologies. The resulting values are written to the participant_matrix.json file (see the sketch after this list).
  3. Data visualization and consolidation:
    1. A Precision/Recall graph is created, classifying each participant into a quartile.
    2. An all-participants vs. all-participants heatmap is created using the normalized Robinson-Foulds matrix (see the sketch after this list).
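
To make step 1 more concrete, here is a minimal sketch of tree validation, canonical Newick output and query ID extraction (steps 1.1 and 1.2 for tree input). It assumes the dendropy library and illustrative file names; the actual workflow scripts may differ:

```python
import json
import dendropy

def validate_and_canonicalize(tree_path, out_nwk="canonical.nwk", out_ids="queryids.json"):
    """Accept a Newick or Nexus tree, write it in the canonical Newick format,
    and dump its leaf IDs to a JSON file."""
    tree = None
    for schema in ("newick", "nexus"):
        try:
            tree = dendropy.Tree.get(path=tree_path, schema=schema)
            break
        except Exception:
            continue
    if tree is None:
        raise ValueError(f"{tree_path} is neither valid Newick nor valid Nexus")

    # Canonical output format is Newick (.nwk)
    tree.write(path=out_nwk, schema="newick")

    # Query IDs are the leaf labels of the validated tree
    ids = [leaf.taxon.label for leaf in tree.leaf_node_iter()]
    with open(out_ids, "w") as fh:
        json.dump(ids, fh, indent=2)
    return ids
```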
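For step 2.1, a minimal sketch of the precision/recall calculation, again assuming dendropy; shared internal edges are treated as TP, test-only edges as FP and reference-only edges as FN (the exact edge counting used by the workflow may differ):

```python
import json
import dendropy
from dendropy.calculate import treecompare

def snprecision(ref_path, test_path, participant_id):
    tns = dendropy.TaxonNamespace()
    ref = dendropy.Tree.get(path=ref_path, schema="newick", taxon_namespace=tns)
    test = dendropy.Tree.get(path=test_path, schema="newick", taxon_namespace=tns)
    ref.encode_bipartitions()
    test.encode_bipartitions()

    # FP: edges only in the test tree; FN: edges only in the reference tree
    fp, fn = treecompare.false_positives_and_negatives(ref, test)

    # TP: edges shared by both trees (internal edges of the reference minus FN)
    ref_internal = sum(1 for nd in ref.preorder_internal_node_iter() if nd.parent_node is not None)
    tp = ref_internal - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    with open(f"{participant_id}_snprecision.json", "w") as fh:
        json.dump({"participant_id": participant_id,
                   "precision": precision, "recall": recall}, fh, indent=2)
    return precision, recall
```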
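For step 2.2, a minimal sketch of the normalized Robinson-Foulds comparison against previously analysed participant trees, assuming they are stored as Newick files under benchmark_data/ and normalizing by the maximum possible RF distance for unrooted binary trees (both assumptions, like the file names, are illustrative):

```python
import glob
import json
import os
import dendropy
from dendropy.calculate import treecompare

def rf_matrix(user_tree_path, benchmark_dir="benchmark_data", out_path="participant_matrix.json"):
    tns = dendropy.TaxonNamespace()
    user = dendropy.Tree.get(path=user_tree_path, schema="newick", taxon_namespace=tns)

    results = {}
    for path in glob.glob(os.path.join(benchmark_dir, "*.nwk")):
        other = dendropy.Tree.get(path=path, schema="newick", taxon_namespace=tns)
        rf = treecompare.symmetric_difference(user, other)
        # Assumed normalization: maximum RF is 2 * (n_taxa - 3) for unrooted binary trees
        max_rf = 2 * (len(tns) - 3)
        results[os.path.basename(path)] = rf / max_rf if max_rf else 0.0

    with open(out_path, "w") as fh:
        json.dump(results, fh, indent=2)
    return results
```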
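And for step 3.2, a minimal sketch of the all-vs-all heatmap, assuming matplotlib and that the normalized Robinson-Foulds matrix has already been assembled (function and file names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rf_heatmap(participants, matrix, out_png="rf_heatmap.png"):
    """participants: list of participant names; matrix: square list of normalized RF values."""
    data = np.array(matrix)
    fig, ax = plt.subplots()
    im = ax.imshow(data, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(participants)))
    ax.set_yticks(range(len(participants)))
    ax.set_xticklabels(participants, rotation=90)
    ax.set_yticklabels(participants)
    fig.colorbar(im, ax=ax, label="Normalized Robinson-Foulds distance")
    fig.tight_layout()
    fig.savefig(out_png)
```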

Containers info

Each step runs in its own container. Containers are built from a Dockerfile recipe that uses SCI-F recipes for software installation. All SCI-F recipes are available in the scif_app_recipes repository. Singularity recipes are also provided (not yet adapted to the Nextflow pipeline).