SID-Chart is an automated pipeline for characterizing the biosynthetic gene clusters (BGCs) of genomes. It is also designed to integrate further data layers, such as a phylogenetic tree and the similarity to specific proteins.
The following software is required:
- Nextflow installed
- Docker installed
- Docker images for:
  - antiSMASH
  - MMseqs2
  - BiG-SCAPE
  - chewBBACA
  - MEGA-CC
- Streamlit installed
- pandas installed
- seaborn installed
- matplotlib installed
- Java 17 or later
- Python 3.0
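The prerequisites above can be checked before a first run with a small script such as the sketch below. It assumes the executables are available on the PATH as nextflow, docker, and java, and that the Python packages import as streamlit, pandas, seaborn, and matplotlib. The helper name check_requirements is hypothetical and not part of SID-Chart, and the sketch checks neither the Docker images nor the Java version.

```python
import importlib.util
import shutil

# Executable and module names are assumptions based on the requirements list.
EXECUTABLES = ["nextflow", "docker", "java"]
MODULES = ["streamlit", "pandas", "seaborn", "matplotlib"]


def check_requirements(executables=EXECUTABLES, modules=MODULES):
    """Return a dict mapping each requirement name to True if it was found."""
    found = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on the PATH.
        found[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module is not installed.
        found[mod] = importlib.util.find_spec(mod) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING would need to be installed, or added to the PATH, before starting the pipeline.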
SID-Chart expects input data in the following format.
└── Data
    └── dataset_staphy
        ├── ncbi_dataset
        │   ├── GCA_000433035.1
        │   │   ├── GCA_000433035.1_MGS324_genomic.fna
        │   │   ├── genomic.gbff
        │   │   └── protein.faa
        │   ├── ...
        │   └── ...
        ├── reference_BGCs
        │   ├── BGC0000943.gbk
        │   ├── ...
        │   └── ...
        ├── reference_genome
        │   └── GCF_001027105.1_ASM102710v1_genomic.fna
        ├── metadata.tsv
        ├── overviewBGCs.csv
        ├── proteins_uptake.fa
        └── Staphylococcus_aureus.trn
- ncbi_dataset/: Contains one subdirectory per genome to be analyzed, named by its assembly accession and holding the genomic FASTA (.fna), the GenBank annotation (genomic.gbff), and the protein FASTA (protein.faa).
- reference_BGCs/: Contains all biosynthetic gene clusters (BGCs) to be included in the analysis, as GenBank (.gbk) files.
- reference_genome/: Contains the FASTA file of the reference genome required by chewBBACA.
- metadata.tsv: Lists the accession numbers and the corresponding NCBI organism names.
- overviewBGCs.csv: Maps BGCs to lipoproteins.
- proteins_uptake.fa: Contains the lipoproteins to be analyzed.
- Staphylococcus_aureus.trn: A Prodigal training file required by chewBBACA (can be created with Pyrodigal).
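The layout described above can be verified before launching a run. The following is a minimal sketch, assuming the default file and folder names from the example tree (these may differ if the defaults were changed) and assuming that every genome directory holds genomic.gbff, protein.faa, and one file ending in _genomic.fna. The function name find_layout_problems is hypothetical and not part of SID-Chart.

```python
from pathlib import Path

# Default names taken from the example tree; adjust them if your configuration differs.
REQUIRED_DIRS = ["ncbi_dataset", "reference_BGCs", "reference_genome"]
REQUIRED_FILES = [
    "metadata.tsv",
    "overviewBGCs.csv",
    "proteins_uptake.fa",
    "Staphylococcus_aureus.trn",
]
PER_GENOME_FILES = ["genomic.gbff", "protein.faa"]


def find_layout_problems(dataset_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(dataset_dir)
    problems = []
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    ncbi = root / "ncbi_dataset"
    if ncbi.is_dir():
        # Each subdirectory of ncbi_dataset is treated as one genome.
        for genome in sorted(p for p in ncbi.iterdir() if p.is_dir()):
            for name in PER_GENOME_FILES:
                if not (genome / name).is_file():
                    problems.append(f"missing file: {genome.name}/{name}")
            if not any(genome.glob("*_genomic.fna")):
                problems.append(f"missing genomic FASTA in: {genome.name}")
    return problems
```

Calling find_layout_problems("Data/dataset_staphy") and printing the returned list gives a quick overview of what is still missing. The check only looks at names, not at file contents.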
File and folder naming expected by SID-Chart can be customized via nextflow.conf.
- Check that the input file names correspond to the default parameters defined in nextflow.conf.
- If they differ, either modify the values in nextflow.conf or provide the correct file names as arguments in run_pipeline.sh.
- Set the input directory inside the run_pipeline.sh script by defining it as the --input parameter in the Nextflow command.
- Inside the nf directory, run:
    ./run_pipeline.sh [RUN_NAME]
- Inside the web directory, run:
    ./run_visualization.sh [RUN_NAME]
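The two steps above can also be chained from one script. The sketch below assumes that the nf and web directories sit side by side under a common repository root and that both scripts are executable. The helper names build_steps and run_all are hypothetical and not part of SID-Chart.

```python
import subprocess
from pathlib import Path


def build_steps(repo_root, run_name):
    """Return (working_directory, command) pairs for the two steps, in order."""
    root = Path(repo_root)
    return [
        (root / "nf", ["./run_pipeline.sh", run_name]),
        (root / "web", ["./run_visualization.sh", run_name]),
    ]


def run_all(repo_root, run_name):
    """Run the pipeline, then the visualization, stopping at the first failure."""
    for cwd, command in build_steps(repo_root, run_name):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=cwd, check=True)
```

For example, run_all(".", "my_run") would start the pipeline from the repository root and launch the visualization only if the pipeline exits successfully.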