Due: End of Week 8
Points: 100
By the end of this lab, you will:
- Analyze Sanger sequencing chromatograms (forward and reverse reads)
- Perform quality control on DNA sequences
- Create consensus sequences from F+R pairs
- Build phylogenetic trees with reference sequences
- Identify mosquito species using COI barcodes
- Click the green "Code" button on GitHub
- Select the "Codespaces" tab
- Click "Create codespace on main"
- Wait for the environment to load (~2-3 minutes)
- You're ready! Use the -cs scripts (see below)
If you have Docker installed locally, you can run the analysis on your own computer. See the README for Docker installation instructions.
Before analyzing the class data, you MUST complete the interactive tutorial:
In Codespaces:
./tutorial-cs.sh
With local Docker:
./tutorial.sh
Why this is mandatory:
- ✓ Teaches you all 5 steps of the workflow
- ✓ Uses test data (you can't break anything)
- ✓ Shows you what results should look like
- ✓ Takes only 15-20 minutes
- ✓ Makes the actual assignment much easier!
Do NOT skip this! Students who skip the tutorial get confused and make mistakes.
Important: Everyone will analyze the same class dataset. These are the pooled sequences from all students this semester. Sample names are initials only.
The .ab1 chromatogram files are in: data/student_sequences/
The dataset includes:
- Forward reads (sample names ending in F)
- Reverse reads (sample names ending in R)
- Example: AT-HV1F and AT-HV1R are a pair
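The naming convention above can also be applied programmatically. The following is a minimal sketch, assuming read names are plain strings whose last character is F or R; the helper name `pair_reads` and the sample `XX-1` are hypothetical and not part of the pipeline.

```python
from collections import defaultdict


def pair_reads(names):
    """Group read names by sample, using the trailing F/R as the read direction."""
    samples = defaultdict(set)
    for name in names:
        if not name:
            continue  # skip empty names
        base, direction = name[:-1], name[-1].upper()
        if direction in ("F", "R"):
            samples[base].add(direction)
    complete = sorted(s for s, d in samples.items() if d == {"F", "R"})
    incomplete = sorted(s for s, d in samples.items() if d != {"F", "R"})
    return complete, incomplete


# AT-HV1 is the example pair from above; XX-1 is a made-up unpaired sample.
complete, incomplete = pair_reads(["AT-HV1F", "AT-HV1R", "XX-1F"])
print("Complete pairs:", complete)
print("Missing a read:", incomplete)
```

Samples listed under "Missing a read" have only one direction in the dataset.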
OPTION 1 (RECOMMENDED): Use the automated script
In Codespaces:
./run-analysis-cs.sh
With local Docker:
./run-analysis.sh
This runs all 6 steps below automatically. This is the easiest way!
OPTION 2: Run each step individually (shown below)
The commands below show you what happens in each step. You can run them one by one if you want to understand the process better, or if you need to re-run just one step.
Note: The individual commands below use Docker. In Codespaces, you can run Python directly (e.g., python3 modules/01_quality_control/qc_chromatograms.py ...).
Check which sequences are good enough to use:
docker run --rm --entrypoint="" -v $(pwd):/workspace -w /workspace \
cosmelab/dna-barcoding-analysis:latest \
python3 modules/01_quality_control/qc_chromatograms.py \
data/student_sequences/ \
results/my_analysis/qc/ \
--open
Look at the HTML report that opens. Count:
- How many sequences PASSED QC?
- How many samples have BOTH F and R that passed?
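The exact pass/fail criteria used by qc_chromatograms.py are not described here. As a rough illustration of the idea behind sequence QC, the sketch below uses Phred quality scores, where a score Q corresponds to an error probability of 10^(-Q/10). The mean-quality cutoff of 20 and the function names are assumptions made for this example only.

```python
def phred_to_error_prob(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_qc(qualities, min_mean_q=20):
    """Hypothetical rule: pass if the mean Phred score reaches min_mean_q."""
    if not qualities:
        return False
    return sum(qualities) / len(qualities) >= min_mean_q


# Made-up per-base quality scores, not taken from the class data.
good_read = [38, 40, 41, 39, 35]
poor_read = [12, 9, 15, 20, 11]

# Q20 means roughly 1 expected error per 100 base calls.
print(phred_to_error_prob(20))
print(passes_qc(good_read), passes_qc(poor_read))
```

Use the HTML report, not this sketch, to answer the QC questions: the pipeline's own thresholds decide which reads pass.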
Combine forward and reverse reads:
docker run --rm --entrypoint="" -v $(pwd):/workspace -w /workspace \
cosmelab/dna-barcoding-analysis:latest \
python3 modules/02_consensus/create_consensus.py \
results/my_analysis/qc/passed_sequences.fasta \
results/my_analysis/consensus/ \
--pairs-only \
--open
The --pairs-only flag means: Only keep samples where BOTH F and R passed QC. This ensures high-quality consensus sequences.
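To see why a forward and reverse read can be combined at all, recall that the reverse read comes from the opposite strand, so it must be reverse-complemented before it can be compared with the forward read. The sketch below is a simplified illustration, not how create_consensus.py works: it assumes both reads cover exactly the same bases, and marks any disagreement as N. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def simple_consensus(forward, reverse):
    """Merge a forward read with a reverse read covering the same region.

    Assumes both reads span exactly the same bases, so no trimming or
    alignment is needed. Real reads rarely satisfy this.
    """
    reverse_rc = reverse_complement(reverse)
    if len(forward) != len(reverse_rc):
        raise ValueError("this sketch needs reads of equal length")
    return "".join(
        f if f == r else "N" for f, r in zip(forward.upper(), reverse_rc)
    )


# Made-up 10-base reads; the reverse read is the exact reverse complement.
forward = "ATGGCATTAC"
reverse = "GTAATGCCAT"
print(simple_consensus(forward, reverse))
```

Where the two reads agree, the base is supported by two independent observations; where they disagree, the position is flagged instead of guessed.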
Add the class consensus sequences to the database of known SoCal mosquitoes:
cat results/my_analysis/consensus/consensus_sequences.fasta \
data/reference_sequences/socal_mosquitoes.fasta \
> results/my_analysis/consensus/combined_with_references.fasta
This creates a file with the CLASS sequences + 52 reference sequences.
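A quick way to confirm the combined file looks right is to count its FASTA records: the total should be the number of class consensus sequences plus the 52 references. This is a minimal sketch with a hypothetical helper, `count_fasta_records`, demonstrated on tiny made-up sequences held in memory.

```python
import io


def count_fasta_records(handle):
    """Count sequences in FASTA text by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory stand-ins for the real files (made-up names and sequences).
consensus = ">AT-HV1\nATGGCATTAC\n"
references = ">ref_species_1\nATGGCATTAT\n>ref_species_2\nATGGCTTTAC\n"
combined = consensus + references

print(count_fasta_records(io.StringIO(combined)))  # 1 class + 2 references = 3
```

On the real data, open results/my_analysis/consensus/combined_with_references.fasta and pass the file handle to the same function. If the count is one short, check that the first file ends with a newline, since cat joins files byte for byte.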
Line up all sequences so we can compare them:
docker run --rm --entrypoint="" -v $(pwd):/workspace -w /workspace \
cosmelab/dna-barcoding-analysis:latest \
python3 modules/03_alignment/align_sequences.py \
results/my_analysis/consensus/combined_with_references.fasta \
results/my_analysis/alignment/
Build an evolutionary tree showing relationships:
docker run --rm --entrypoint="" -v $(pwd):/workspace -w /workspace \
cosmelab/dna-barcoding-analysis:latest \
python3 modules/04_phylogeny/build_tree.py \
results/my_analysis/alignment/aligned_sequences.fasta \
results/my_analysis/phylogeny/
This takes ~2-3 minutes. Be patient!
Compare your sequences to the GenBank database:
docker run --rm --entrypoint="" -v $(pwd):/workspace -w /workspace \
cosmelab/dna-barcoding-analysis:latest \
python3 modules/05_identification/identify_species.py \
results/my_analysis/consensus/consensus_sequences.fasta \
results/my_analysis/blast/
After completing Steps 1-6 above, run the interactive question script:
python3 answer_assignment.py
What this script does:
- Asks you questions about your analysis results
- Guides you through the HTML reports (QC, BLAST, phylogeny)
- Collects your answers in a structured format
- Saves to answers.json for automatic grading
Why use this script?
- ✓ No formatting errors (it creates perfect JSON)
- ✓ Interactive and easy to use
- ✓ Automatically graded when you push to GitHub
- ✓ Immediate feedback on correctness
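If answers.json is ever edited by hand, it is worth confirming that it still parses as JSON before pushing. The sketch below only checks syntax; the layout the grader expects is not described here, so the keys shown are invented for illustration.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The key below is invented; the real answers.json may look different.
print(is_valid_json('{"example_question": "example answer"}'))
print(is_valid_json('{"example_question": "missing closing brace"'))
```

To check the real file, read it with `open("answers.json").read()` and pass the text to `is_valid_json`.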
You will answer questions about:
- Species identification (BLAST results)
- Quality control statistics
- Phylogenetic tree interpretation
- Mosquito diversity assessment
NOTE: If you prefer to fill in answers manually instead of using the interactive script, you can fill in the tables and questions below. However, the interactive script (python3 answer_assignment.py) is recommended!
Fill in this table with the BLAST results for the class dataset:
| Sample | Species Identified | % Identity | Common Name |
|---|---|---|---|
Instructions:
- Only include samples with consensus sequences (had both F and R pass QC)
- Use the top BLAST hit from each sample
- Species names must be in italics: *Genus species*
- Get % identity from the BLAST HTML report
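For reference, % identity is the share of alignment columns where the query and the database hit carry the same base. BLAST computes this for you; the sketch below only illustrates the arithmetic on made-up aligned strings, and the function name is hypothetical.

```python
def percent_identity(query_aln, subject_aln):
    """Identical columns divided by alignment length, times 100.

    Gap columns ('-') count toward the length but never as matches.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    if not query_aln:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != "-"
        for q, s in zip(query_aln.upper(), subject_aln.upper())
    )
    return 100.0 * matches / len(query_aln)


# Made-up 10-column alignment with a single mismatch in the last column.
print(percent_identity("ATGGCATTAC", "ATGGCATTAT"))  # 90.0
```

A real COI barcode alignment is several hundred columns long, so a single mismatch changes the percentage far less than in this toy example.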
a) How many of the 30 class sequences (.ab1 files) passed QC?
b) How many samples had BOTH forward AND reverse reads pass?
c) Why is it important to have both F and R reads?
Your answer:
a)
b)
c)
Look at the phylogenetic tree (results/my_analysis/phylogeny/tree.png).
a) Do the class samples cluster together, or are they spread across different parts of the tree?
b) Which reference species are the class samples most closely related to?
c) What does this tell you about mosquito diversity in the class sampling locations?
Your answer:
a)
b)
c)
a) What mosquito species did the class identify? List all unique species found.
b) Do the BLAST results (% identity) agree with where samples clustered on the tree?
c) Are these species known to occur in Southern California? (You may need to Google this!)
d) How confident are you in the species identifications? (Consider % identity scores)
Your answer:
a)
b)
c)
d)
Submit your work by committing and pushing to GitHub:
# Add your completed assignment and results
git add answers.json results/
# Commit your work
git commit -m "Complete DNA barcoding analysis and assignment"
# Push to GitHub
git push origin main
Important: Make sure to add answers.json (created by python3 answer_assignment.py).
Verify your submission:
- Go to your GitHub repository
- Click the "Actions" tab
- Check that "Auto-Grading" workflow passed ✅
- If it failed ❌, read the error message and fix missing files
What auto-grading checks:
- ✅ Tutorial completed (results/tutorial/ has all reports)
- ✅ Analysis completed (results/my_analysis/ has all reports)
- ✅ Assignment file exists (assignment.md)
- ✅ Answers are correct (since everyone has the same data, answers should match!)
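Before pushing, you can check locally that the files and folders named in this assignment exist. This is a minimal sketch, not the auto-grader itself: the helper `missing_paths` is hypothetical, the path list is taken from this document, and the real workflow may check more (for example, the contents of each report).

```python
from pathlib import Path
import tempfile

# Paths named in this assignment; the real workflow may check more than this.
REQUIRED = ["results/tutorial", "results/my_analysis", "assignment.md", "answers.json"]


def missing_paths(repo_root, required=REQUIRED):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in required if not (root / p).exists()]


# Demo in a throwaway directory so nothing in your repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "results" / "tutorial").mkdir(parents=True)
    (Path(tmp) / "answers.json").write_text("{}")
    print(missing_paths(tmp))  # ['results/my_analysis', 'assignment.md']
```

From the top of your repository, `missing_paths(".")` should return an empty list once the tutorial, the analysis, and the answer script have all been run.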
Auto-grading checks your:
- Species identifications (BLAST results table)
- QC statistics (number of sequences that passed)
- Written answers (keyword matching for concepts)
| Component | Points | Criteria |
|---|---|---|
| Part 1: Commands executed correctly | 60 | All 6 steps completed, files generated |
| Part 2: Results table | 20 | Accurate data, correct format |
| Part 3: Question 1 | 5 | Complete, accurate answers |
| Part 3: Question 2 | 7 | Thoughtful tree interpretation |
| Part 3: Question 3 | 8 | Complete species analysis |
| Total | 100 | |
- Re-run the tutorial: ./tutorial-cs.sh (Codespaces) or ./tutorial.sh (Docker)
- Read the visual workflow: docs/pipeline_workflow.md
- Understand IQ-TREE: docs/iqtree_guide.md
- Check your work: Compare your results to the tutorial results
- Codespaces issues: See .devcontainer/README.md
- Ask your instructor
This assignment teaches you:
- How DNA barcoding identifies species
- Why quality control matters in sequencing
- How consensus sequences improve accuracy
- How to interpret phylogenetic trees
- How to use bioinformatics tools for real research
Remember: The goal is to LEARN the workflow, not just get it done. Take your time, look at the results, and think about what they mean!
Good luck! 🧬🦟