How to simulate amplicon-seq data? #221

@capoony

Hi all,

Apologies for yet another request! I want to simulate ONT amplicon-seq reads with NanoSim, but the simulation step never finishes (at least not within hours).

I have a reference sequence based on Sanger sequencing of the amplicon (Stor1_cox1.fa). In addition, I have ONT data of the same amplicon (COX1.fastq), which I could use for model training.

Following your suggestion in issue 112, I am using the "transcriptome" method.

conda activate nanosim

read_analysis.py transcriptome \
    -i ${wd}Syrphid/results/demo_ext/data/demultiplexed/Stor-1/COX1.fastq \
    -rg ${wd}simulations/data/Stor1_cox1.fa \
    -rt ${wd}simulations/data/Stor1_cox1.fa \
    -o ${wd}simulations/data/COX1_training \
    --no_intron_retention \
    -t 100

This finishes without error. However, when I try to use the model for simulation, the script gets stuck, even when simulating only 100 reads.

printf 'target_id\test_counts\ttpm\nENSStor-1\t1000\t1000\n' > ${wd}simulations/data/Stor1_cox1.exp

simulator.py transcriptome \
    -rt ${wd}simulations/data/Stor1_cox1.fa \
    -c ${wd}simulations/data/COX1_training \
    -o ${wd}simulations/data/Stor1_cox1_sim \
    -e ${wd}simulations/data/Stor1_cox1.exp \
    -n 100 \
    --no_model_ir \
    -t 4

Can you help me with this?
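
One thing that may be worth ruling out first is a mismatch between the `target_id` in the expression file and the sequence name in the reference FASTA. Whether that is related to the hang is an assumption, not something confirmed here. A minimal sketch that builds the expression table directly from the FASTA headers, so the IDs match by construction (`make_expression_table` is a hypothetical helper, not part of NanoSim; the default of 1000 for both columns simply mirrors the values used above):

```shell
# Sketch: derive the expression table from the reference FASTA headers so that
# every target_id is exactly the first token of a header line.
# make_expression_table is a hypothetical helper, not part of NanoSim.
make_expression_table() {
    local fasta="$1"
    local value="${2:-1000}"   # assumed placeholder for est_counts and tpm

    # Header row: target_id, est_counts, tpm (tab-separated).
    printf 'target_id\test_counts\ttpm\n'

    # One row per FASTA record; anything after the first whitespace in the
    # header is treated as a description and dropped.
    awk -v v="$value" '/^>/ { sub(/^>/, "", $1); printf "%s\t%s\t%s\n", $1, v, v }' "$fasta"
}
```

It could be called as `make_expression_table ${wd}simulations/data/Stor1_cox1.fa > ${wd}simulations/data/Stor1_cox1.exp`, after which `cat -A` on the output makes the tabs and line endings visible.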

Moreover, can this model also be used for other amplicons with longer read lengths? If I understand the logic correctly, I fear not. What would you recommend in that case, when no amplicon-specific training data is available?
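
To make the length question concrete, here is a minimal sketch that summarises the read lengths in the training FASTQ and the total length of a reference FASTA, so the gap between the trained amplicon and a longer target can be quantified. Both helper names are hypothetical and not part of NanoSim, and the FASTQ is assumed to use plain four-line records:

```shell
# Sketch: summarise read lengths in a FASTQ file (four-line records assumed).
# fastq_length_summary and fasta_total_length are hypothetical helpers,
# not part of NanoSim.
fastq_length_summary() {
    awk 'NR % 4 == 2 {
             n++; len = length($0); sum += len
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END {
             if (n == 0) { print "reads=0"; exit }
             printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
         }' "$1"
}

# Total number of bases in a FASTA file (header lines are skipped).
fasta_total_length() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}
```

Running `fastq_length_summary` on COX1.fastq and `fasta_total_length` on each candidate reference shows how far a longer amplicon lies outside the read lengths the model was trained on.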

Thanks a lot,

Martin

Testdata.zip
