
Kraken Tools


KrakenTools is a suite of scripts to be used alongside the Kraken, KrakenUniq, Kraken 2, or Bracken programs. These scripts are designed to help Kraken users with downstream analysis of Kraken results.

For news and updates, refer to the GitHub page: https://github.com/jenniferlu717/KrakenTools/

Citation


KrakenTools was published on September 28, 2022, as part of a protocol paper for using the Kraken software suite. Please cite the following when using any KrakenTools script:

[Lu J, Rincon N, Wood D E, Breitwieser F P, Pockrandt C, Langmead B, Salzberg S L, Steinegger M. Metagenome analysis using the Kraken software suite. Nature Protocols, doi: 10.1038/s41596-022-00738-y (2022)](https://www.nature.com/articles/s41596-022-00738-y)

Please also cite the relevant paper for usage of KrakenTools with any of the listed programs.

Kraken 1
Kraken 2
KrakenUniq
Bracken

For issues with any of the above programs, please open a GitHub issue on their respective GitHub pages. This repository is dedicated only to the scripts provided here.

Scripts included in KrakenTools


extract_kraken_reads.py
combine_kreports.py
kreport2krona.py
kreport2mpa.py
combine_mpa.py
filter_bracken_out.py
fix_unmapped.py
make_ktaxonomy.py
make_kreport.py
alpha_diversity.py (see Diversity/README.md)
beta_diversity.py (see Diversity/README.md)

Running Scripts:


No installation required. All scripts are run on the command line as described.

Users can make scripts executable by running

```sh
chmod +x myscript.py
./myscript.py -h
```

extract_kraken_reads.py


This program extracts reads classified at any user-specified taxonomy IDs. Users must specify the Kraken output file, the sequence file(s), and at least one taxonomy ID. Additional options are specified below. As of April 19, 2021, this script is compatible with KrakenUniq/Kraken2Uniq reports.

1. extract_kraken_reads.py usage/options


python extract_kraken_reads.py

-k, --kraken MYFILE.KRAKEN............. Kraken output file
-s, -s1, -1, -U SEQUENCE.FILE.......... FASTA/FASTQ sequence file (may be gzipped)
-s2, -2 SEQUENCE2.FILE................. FASTA/FASTQ sequence file (for paired reads, may be gzipped)
-o, --output OUTPUT.FASTA.............. output FASTA/Q file with extracted seqs
-t, --taxid TID TID2 etc............... list of taxonomy IDs to extract (separated by spaces)

Optional:

-o2, --output2 OUTPUT.FASTA............. second output FASTA/Q file with extracted seqs (for paired reads)
--fastq-output.......................... Instead of producing FASTA files, print FASTQ files (requires FASTQ input)
--exclude............................... Instead of finding reads matching specified taxids, finds reads NOT matching specified taxids.
-r, --report MYFILE.KREPORT............. Kraken report file (required if specifying --include-children or --include-parents)
--include-children...................... include reads classified at more specific levels than specified taxonomy ID levels.
--include-parents....................... include reads classified at all taxonomy levels between root and the specified taxonomy ID levels.
--max #................................. maximum number of reads to save.
--append................................ if output file exists, appends reads
--noappend.............................. [default] rewrites existing output file

2. extract_kraken_reads.py input files


Input sequence files must be either FASTQ or FASTA files, and may be gzipped or uncompressed. The program automatically detects whether a file is gzipped, and whether it is FASTQ or FASTA formatted, based on the first character in the file (">" for FASTA, "@" for FASTQ).
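The detection described above can be sketched roughly as follows. This is an illustrative sketch, not the script's actual code: the function name `sniff_sequence_file` is hypothetical, and checking the two gzip magic bytes is an assumption about how compression might be detected.

```python
import gzip


def sniff_sequence_file(path):
    """Return (is_gzipped, fmt), where fmt is 'fasta' or 'fastq'.

    Hypothetical helper: gzip detection via magic bytes is an assumption.
    """
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == b"\x1f\x8b"

    # Read the first text character through the matching opener.
    opener = gzip.open if is_gzipped else open
    with opener(path, "rt") as handle:
        first_char = handle.read(1)

    if first_char == ">":
        return is_gzipped, "fasta"
    if first_char == "@":
        return is_gzipped, "fastq"
    raise ValueError(
        f"{path}: expected '>' or '@' as first character, got {first_char!r}"
    )
```

Calling `sniff_sequence_file("reads.fq.gz")` on a gzipped FASTQ file would return `(True, "fastq")`; any other leading character raises an error instead of guessing.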

3. extract_kraken_reads.py paired input/output


Users that ran Kraken using paired reads should input both read files into extract_kraken_reads.py as follows:

```sh
extract_kraken_reads.py -k myfile.kraken -s1 read1.fq -s2 reads2.fq
```

Given paired reads, the script requires users to provide two output file names to contain extracted reads:

```sh
extract_kraken_reads.py -k myfile.kraken -s1 read1.fq -s2 reads2.fq -o extracted1.fq -o2 extracted2.fq
```

The delimiter (--delimiter or -d) option has been removed.

```sh
extract_kraken_reads.py -k myfile.kraken ... -o reads_S1.fa -o2 reads_s2.fa
```

4. extract_kraken_reads.py --exclude flag


By default, reads classified at the specified taxonomy IDs are extracted, along with reads at any taxids selected using --include-parents/--include-children. Specifying --exclude instead extracts the reads NOT classified at any of the specified taxonomy IDs.

For example:

extract_kraken_reads.py -k myfile.kraken ... --taxid 9606 --exclude ==> extract all reads NOT classified as Human (taxid 9606).
extract_kraken_reads.py -k myfile.kraken ... --taxid 2 --exclude --include-children ==> extract all reads NOT classified as Bacteria (taxid 2) or any classification in the Bacteria subtree.
extract_kraken_reads.py -k myfile.kraken ... --taxid 9606 --exclude --include-parents ==> extract all reads NOT classified as Human or any classification in the direct ancestry of Human (e.g. will exclude reads classified at the Primate, Chordata, or Eukaryota levels).
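The selection logic behind --exclude can be illustrated with a small sketch. It assumes the standard tab-delimited Kraken output layout (classification flag, read ID, taxonomy ID, length, LCA mapping) with numeric taxonomy IDs. The function name `select_read_ids` is hypothetical, and treating unclassified reads (taxid 0) as "not matching" is a property of this sketch, not a confirmed behavior of the script.

```python
def select_read_ids(kraken_lines, taxids, exclude=False):
    """Yield read IDs whose assigned taxid is in taxids (or is not, if exclude)."""
    wanted = {str(t) for t in taxids}
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, taxid = fields[1], fields[2]
        # Keep matching reads by default; invert the test when excluding.
        if (taxid in wanted) != exclude:
            yield read_id


# Made-up example lines in Kraken output format.
example = [
    "C\tread1\t9606\t150\t9606:116",
    "C\tread2\t562\t150\t562:116",
    "U\tread3\t0\t150\t0:116",
]
print(list(select_read_ids(example, [9606])))                # ['read1']
print(list(select_read_ids(example, [9606], exclude=True)))  # ['read2', 'read3']
```

With --include-children or --include-parents, the set of taxonomy IDs would first be expanded using the report file, and the same test would then be applied to the expanded set.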

5. extract_kraken_reads.py --include-parents/--include-children flags


By default, only reads classified exactly at the specified taxonomy IDs will be extracted. The options --include-children and --include-parents can be used to extract reads classified within the same lineage as a specified taxonomy ID. For example, given a Kraken report containing the following:

```sh
[%]     [reads] [lreads][lvl]   [tid]       [name]
100     1000    0       R       1           root
100     1000    0       R1      131567        cellular organisms
100     1000    50      D       2               Bacteria
95      950     0       P       1224              Proteobacteria
95      950     0       C       1236                Gammaproteobacteria
95      950     0       O       91347                 Enterobacterales
95      950     0       F       543                     Enterobacteriaceae
95      950     0       G       561                       Escherichia
95      950     850     S       562                         Escherichia coli
5       50      50      S1      498388                        Escherichia coli C
5       50      50      S1      316401                        Escherichia coli ETEC
```

extract_kraken_reads.py [options] -t 562 ==> 850 reads classified as E. coli will be extracted
extract_kraken_reads.py [options] -t 562 --include-parents ==> 900 reads classified as E. coli or Bacteria will be extracted
extract_kraken_reads.py [options] -t 562 --include-children ==> 950 reads classified as E. coli, E. coli C, or E. coli ETEC will be extracted
extract_kraken_reads.py [options] -t 498388 ==> 50 reads classified as E. coli C will be extracted
extract_kraken_reads.py [options] -t 498388 --include-parents ==> 950 reads classified as E. coli C, E. coli, or Bacteria will be extracted
extract_kraken_reads.py [options] -t 1 --include-children ==> All classified reads will be extracted
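The read counts in these examples can be reproduced with a short sketch that rebuilds the lineage from the report's indentation depth and sums the reads assigned directly to each selected taxon. The helper names (`build_parents`, `expand_taxids`, `count_reads`) are hypothetical and do not come from the script itself; the rows are transcribed from the example report above.

```python
# (indentation depth, taxid, reads assigned exactly at this taxon)
ROWS = [
    (0, 1, 0), (1, 131567, 0), (2, 2, 50), (3, 1224, 0), (4, 1236, 0),
    (5, 91347, 0), (6, 543, 0), (7, 561, 0), (8, 562, 850),
    (9, 498388, 50), (9, 316401, 50),
]


def build_parents(rows):
    """Map each taxid to its parent taxid using the indentation depth."""
    parents, stack = {}, []
    for depth, taxid, _ in rows:
        del stack[depth:]  # drop entries at this depth or deeper
        parents[taxid] = stack[-1] if stack else None
        stack.append(taxid)
    return parents


def ancestors(taxid, parents):
    """Return the set of all ancestors of taxid, up to the root."""
    found, node = set(), parents[taxid]
    while node is not None:
        found.add(node)
        node = parents[node]
    return found


def expand_taxids(taxid, parents, include_children=False, include_parents=False):
    """Return the taxids whose directly assigned reads would be extracted."""
    selected = {taxid}
    if include_parents:
        selected |= ancestors(taxid, parents)
    if include_children:
        selected |= {t for t in parents if taxid in ancestors(t, parents)}
    return selected


def count_reads(taxid, rows, **flags):
    selected = expand_taxids(taxid, build_parents(rows), **flags)
    return sum(reads for _, tid, reads in rows if tid in selected)


print(count_reads(562, ROWS))                            # 850
print(count_reads(562, ROWS, include_parents=True))      # 900
print(count_reads(562, ROWS, include_children=True))     # 950
print(count_reads(498388, ROWS, include_parents=True))   # 950
print(count_reads(1, ROWS, include_children=True))       # 1000
```

Only Bacteria contributes extra reads under --include-parents because it is the only ancestor of E. coli with reads assigned directly to it (50).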

combine_kreports.py


This script combines multiple Kraken reports into a combined report file.

1. combine_kreports.py usage/options


python combine_kreports.py

-r 1.KREPORT 2.KREPORT........................ Kraken-style reports to combine
-o COMBINED.KREPORT........................... Output file

Optional:

--display-headers.............................. include headers describing the samples and columns [all headers start with #]
--no-headers................................... do not include headers in output
--sample-names................................. give abbreviated names for each sample [default: S1, S2, ... etc]
--only-combined................................ output uses exact same columns as a single Kraken-style report file. Only total numbers for read counts and percentages will be used. Reads from individual reports will not be included.

2. combine_kreports.py output


Percentage is only reported for the summed read counts, not for each individual sample.

The output file therefore contains the following tab-delimited columns:

perc............ percentage of total reads rooted at this clade
tot_all ........ total reads rooted at this clade (including reads at more specific clades)
tot_lvl......... total reads at this clade  (not including reads at more specific clades)
1_all........... reads from Sample 1 rooted at this clade
1_lvl........... reads from Sample 1 at this clade
2_all........... ""
2_lvl........... ""
etc..
lvl_type........ Clade level type (R, D, P, C, O, F, G, S....)
taxid........... taxonomy ID of this clade
name............ name of this clade
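As a rough illustration of how these columns relate to each other, the sketch below merges per-sample counts into combined rows. It is not the script's implementation: `combine_counts` is a hypothetical name, the lvl_type and name columns are omitted for brevity, and computing the percentage against the sum of all directly assigned reads is an assumption.

```python
def combine_counts(samples):
    """Combine per-sample counts into rows of
    [perc, tot_all, tot_lvl, 1_all, 1_lvl, 2_all, 2_lvl, ..., taxid].

    samples: list of {taxid: (clade_reads, level_reads)}, one dict per report.
    """
    # Assumption: total reads = sum of reads assigned exactly at each taxon.
    total = sum(lvl for sample in samples for _, lvl in sample.values())
    taxids = sorted({t for sample in samples for t in sample})

    rows = []
    for taxid in taxids:
        # A taxon missing from a sample contributes zero reads.
        per_sample = [sample.get(taxid, (0, 0)) for sample in samples]
        tot_all = sum(clade for clade, _ in per_sample)
        tot_lvl = sum(lvl for _, lvl in per_sample)
        perc = round(100.0 * tot_all / total, 2) if total else 0.0
        row = [perc, tot_all, tot_lvl]
        for clade, lvl in per_sample:
            row += [clade, lvl]
        row.append(taxid)
        rows.append(row)
    return rows


# Made-up counts for two samples.
s1 = {1: (100, 10), 2: (90, 90)}
s2 = {1: (50, 50)}
for row in combine_counts([s1, s2]):
    print(row)
# [100.0, 150, 60, 100, 10, 50, 50, 1]
# [60.0, 90, 90, 90, 90, 0, 0, 2]
```

The percentage is computed once from the summed counts, which matches the note above that no per-sample percentage is reported.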

kreport2krona.py


This program takes a Kraken report file and prints out a Krona-compatible text file.

1. kreport2krona.py usage/options


python kreport2krona.py

-r/--report MYFILE.KREPORT........ Kraken report file
-o/--output MYFILE.KRONA.......... Output Krona text file

Optional:

--no-intermediate-ranks........... [default] only output standard levels [D,P,C,O,F,G,S]
--intermediate-ranks.............. include non-standard levels

2. kreport2krona.py example usage


```sh
kraken2 --db KRAKEN2DB --threads THREADNUM --report MYSAMPLE.KREPORT \
    --paired SAMPLE_1.FASTA SAMPLE_2.FASTA > MYSAMPLE.KRAKEN2
python kreport2krona.py -r MYSAMPLE.KREPORT -o MYSAMPLE.krona
ktImportText MYSAMPLE.krona -o MYSAMPLE.krona.html
```

Krona information: see https://github.com/marbl/Krona.

3. kreport2krona.py example output


--no-intermediate-ranks

```sh
6298        Unclassified
8           k__Bacteria
4           k__Bacteria     p__Proteobacteria
6           k__Bacteria     p__Proteobacteria   c__Gammaproteobacteria
...
```

--intermediate-ranks

```sh
6298        Unclassified
79          x__root
0           x__root     x__cellular_organisms
8           x__root     x__cellular_organisms   k__Bacteria
4           x__root     x__cellular_organisms   k__Bacteria     p__Proteobacteria
6           x__root     x__cellular_organisms   k__Bacteria     p__Proteobacteria   c__Gammaproteobacteria
....
```

kreport2mpa.py


This program takes a Kraken report file and prints out an MPA (MetaPhlAn)-style text file.

1. kreport2mpa.py usage/options


python kreport2mpa.py

-r/--report MYFILE.KREPORT........ Kraken report file
-o/--output MYFILE.MPA.TXT........ Output MPA-STYLE text file

Optional:

--display-header.................. display header line (#Classification, MYFILE.KREPORT) [default: no header]
--no-intermediate-ranks........... [default] only output standard levels [D,P,C,O,F,G,S]
--intermediate-ranks.............. include non-standard levels
--read-count...................... [default] use read count for output
--percentages..................... use percentage of total reads for output

2. kreport2mpa.py example usage


```sh
kraken2 --db KRAKEN2DB --threads THREADNUM --report MYSAMPLE.KREPORT \
    --paired SAMPLE_1.FASTA SAMPLE_2.FASTA > MYSAMPLE.KRAKEN2
python kreport2mpa.py -r MYSAMPLE.KREPORT -o MYSAMPLE.MPA.TXT
```

3. kreport2mpa.py example output


The output will contain one tab character between the classification and the read count.

--no-intermediate-ranks/--read-count

```sh
#Classification                                           SAMPLE.KREPORT
k__Bacteria                                               36569
k__Bacteria|p__Proteobacteria                             21001
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria      11648
...
```

--intermediate-ranks/--read-count

```sh
#Classification                                           SAMPLE.KREPORT
x__cellular_organisms                                     38462
x__cellular_organisms|k__Bacteria                         36569
x__cellular_organisms|k__Bacteria|p__Proteobacteria       21001
...
```
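The way the pipe-delimited classification strings are assembled can be sketched as follows. This is a simplified illustration, not the script's code: `report_to_mpa` is a hypothetical name, the rank-to-prefix mapping and the omission of the root row are inferred from the example outputs above, and using the clade-level read count is an assumption.

```python
# Prefixes inferred from the example output (D is written as k__).
RANK_PREFIX = {"D": "k", "P": "p", "C": "c", "O": "o", "F": "f", "G": "g", "S": "s"}


def report_to_mpa(rows, intermediate_ranks=False):
    """Build MPA-style lines from (depth, rank_code, name, clade_reads) rows."""
    lines, stack = [], []  # stack holds (depth, label or None) for the lineage
    for depth, rank, name, reads in rows:
        # Leave any lineage entries that are not ancestors of this row.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if rank == "R":
            prefix = None  # assumption: root is never printed
        else:
            prefix = RANK_PREFIX.get(rank)
            if prefix is None and intermediate_ranks:
                prefix = "x"  # non-standard ranks get the x__ prefix
        label = f"{prefix}__{name.replace(' ', '_')}" if prefix else None
        stack.append((depth, label))
        if label is not None:
            path = "|".join(lbl for _, lbl in stack if lbl is not None)
            lines.append(f"{path}\t{reads}")
    return lines


rows = [
    (0, "R", "root", 38462),
    (1, "R1", "cellular organisms", 38462),
    (2, "D", "Bacteria", 36569),
    (3, "P", "Proteobacteria", 21001),
]
print(report_to_mpa(rows))
# ['k__Bacteria\t36569', 'k__Bacteria|p__Proteobacteria\t21001']
print(report_to_mpa(rows, intermediate_ranks=True)[0])
# x__cellular_organisms	38462
```

Skipped ranks still sit on the lineage stack so that depth tracking stays correct; they are just left out of the printed path.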

combine_mpa.py


This program combines multiple outputs from kreport2mpa.py. Files to be combined must have been generated using the same kreport2mpa.py options.

Important:

Input files to combine_mpa.py cannot be a mix of intermediate/no intermediate rank outputs.
Input files should be generated using the same Kraken database.
Input files cannot be a mix of read counts/percentage kreport2mpa.py outputs.

combine_mpa.py will not test the input files prior to combining.

If no header is in a given sample file, the program will number the files "Sample #1", "Sample #2", etc.

1. combine_mpa.py usage/options


python combine_mpa.py

-i/--input MYFILE1.MPA MYFILE2.MPA....... Multiple MPA-STYLE text files (separated by spaces)
-o/--output MYFILE.COMBINED.MPA.......... Output MPA-STYLE text file

2. combine_mpa.py example output


```sh
#Classification                                           Sample #1    Sample #2
k__Bacteria                                               36569         20034
k__Bacteria|p__Proteobacteria                             21001         18023
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria      11648         15000
```
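A table like the one above can be produced by a simple keyed merge, sketched below. The function name `combine_mpa_tables` is hypothetical, and filling a classification that is missing from one sample with 0 is an assumption of this sketch rather than documented behavior.

```python
def combine_mpa_tables(tables):
    """Merge MPA-style tables into one list of rows, header first.

    tables: list of {classification: value}, one dict per input file.
    """
    header = ["#Classification"] + [
        f"Sample #{i}" for i in range(1, len(tables) + 1)
    ]
    # Preserve the first-seen order of classifications across all inputs.
    order = list(dict.fromkeys(key for table in tables for key in table))
    # Assumption: a classification absent from a sample is reported as 0.
    rows = [[key] + [table.get(key, 0) for table in tables] for key in order]
    return [header] + rows


t1 = {"k__Bacteria": 36569, "k__Bacteria|p__Proteobacteria": 21001}
t2 = {"k__Bacteria": 20034}
for row in combine_mpa_tables([t1, t2]):
    print("\t".join(str(cell) for cell in row))
```

Because the merge is keyed on the full classification string, mixing intermediate-rank and standard-rank inputs would leave the same taxon under two different keys, which is why such mixes are disallowed above.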

filter_bracken_out.py


This program takes a Bracken output file and filters it by the desired taxonomy IDs.

1. filter_bracken_out.py usage/options


python filter_bracken_out.py

-i/--input MYFILE.BRACKEN.......... Bracken output file
-o/--output MYFILE.BRACKEN_NEW..... Bracken-style output file with filtered taxids
--include TID TID2................. taxonomy IDs to include in output file [space-delimited]
--exclude TID TID2................. taxonomy IDs to exclude in output file [space-delimited]

Users should specify taxonomy IDs with either --include or --exclude. If both are specified, no taxonomy ID should appear in both lists, and only the taxonomies to include will be evaluated.

When specifying the --include flag, only lines for the included taxonomy IDs will be extracted to the filtered output file. The percentages in the filtered file will be re-calculated so the total percentage in the output file will sum to 100%.

When specifying the --exclude flag alone, all lines in the Bracken file will be preserved EXCEPT for the lines matching taxonomy IDs provided.
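The filtering and re-normalization can be sketched as below. It assumes rows keyed by the standard Bracken output column names (`taxonomy_id`, `new_est_reads`, `fraction_total_reads`). The function name `filter_bracken_rows` is hypothetical, and because the text above only describes re-calculation for --include, this sketch leaves the fractions untouched when only --exclude is given.

```python
def filter_bracken_rows(rows, include=None, exclude=None):
    """Filter Bracken rows by taxonomy ID.

    rows: list of dicts with 'taxonomy_id', 'new_est_reads' and
    'fraction_total_reads' keys. Returns new dicts; input is not modified.
    """
    include = {str(t) for t in include or []}
    exclude = {str(t) for t in exclude or []}

    if include:
        # Only the include list is evaluated when both lists are given.
        kept = [dict(r) for r in rows if str(r["taxonomy_id"]) in include]
        # Re-calculate fractions so the kept rows sum to 1 (100%).
        total = sum(r["new_est_reads"] for r in kept)
        for r in kept:
            r["fraction_total_reads"] = r["new_est_reads"] / total if total else 0.0
        return kept

    # Exclude only: drop matching rows, keep everything else as-is.
    return [dict(r) for r in rows if str(r["taxonomy_id"]) not in exclude]


# Made-up rows for illustration.
rows = [
    {"taxonomy_id": 1764, "new_est_reads": 300, "fraction_total_reads": 0.3},
    {"taxonomy_id": 1773, "new_est_reads": 100, "fraction_total_reads": 0.1},
    {"taxonomy_id": 562, "new_est_reads": 600, "fraction_total_reads": 0.6},
]
for r in filter_bracken_rows(rows, include=[1764, 1773]):
    print(r["taxonomy_id"], r["fraction_total_reads"])
# 1764 0.75
# 1773 0.25
```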

2. filter_bracken_out.py example usage


This program can be useful for isolating a subset of species to better understand the distribution of those particular species in the sample.

For example:

python filter_bracken_out.py [options] --include 1764 1769 1773 1781 39689 will allow users to get the relative percentages
Last Updated: 2023-06-23 00:03:28