Code Monkey home page Code Monkey logo

seqfilter's Introduction

Usage:
      SeqFilter <fa/fq>
      SeqFilter [filter] <fa/fq> --out <fa/fq>
      cat <fa/fq> | SeqFilter [filter] --out - | ...

Options:
    -o|--out <FILE/-> [OFF]
    -c|--stdout
        Output to file. -c for STDOUT.

    --stats <FILE> [-]
        Print stats to file. Default STDOUT or STDERR if STDOUT is in use.

    --ids <FILE/-/IDLIST>
        File of sequence IDs or literal list of IDs to be reported. Reads
        comma-, whitespace and newline separated lists. Leading '>' or '@'
        are ignored.

          SeqFilter fasta.fa --ids "seq35,seq49"  # list
          SeqFilter fasta.fa --ids ids.list       # file

    --ids-pattern <FILE/-/PATTERNLIST>
        Match a perl PATTERN or a link to a file containing multiple
        PATTERNs, one per line, against sequence ids. Keep matching
        sequences.

          SeqFilter seqs.fq --ids-patt 'comp2_c13_seq.*|comp2_c88_seq.*'

        Extended usage: add capture groups to perl P(A)TT(ER)N using "()".
        Splits matched sequences to different output files based on capture.
        Use sprintf conversions (%s, %02d, ...) in --out to define a
        template for the output file names.

          # split seqs by library (LIB1, LIB2, LIB14)
          SeqFilter multilib.fq --ids-pattern '\w+(\d+)' --out multilib_%02d.fq
          # creates multilib_01.fq, multilib_02.fq, multilib_14.fq

        NOTE: Perl needs to open a filehandle to every split file, this can
        slow things down considerably if you want to split into more than
        1000 different files with occurances of patterns randomly mixed in
        source file.

    --ids-exclude
        Reverse behaviour of --ids and --ids-pattern to excluding matched
        sequences, and keeping unmatched ones.

    --ids-rename <PATTERN>
        Provide a perl substitution pattern as string. The pattern is
        applied to every id. Use global "$COUNT" to access the output
        sequence counter, $PICARD to access the picard number.

          SeqFilter library_1.fq --ids-rename='s/.*\//sprintf("MYLIB_%05d\/%d", $COUNT, $PICARD)/e'
          # creates ids:
          # @MYLIB_00001/1
          # @MYLIB_00002/1 ...

    --desc-replace
    --desc-append
        Remove (no arg)/replace/append description of sequences.

    -l|--min-length <INT>
        Minimum sequence length.

    -L|--max-length <INT>
        Maximum length for sequences to be retrieved. Default off.

    -A|--fasta
    -Q|--fastq <CHAR>
        Convert FASTQ/FASTA to FASTA/FASTQ, respectively.

    -w|--line-width [80 for FASTA]
        FASTA only. Set 0 for single line. Ignored for FASTQ.

    --rc|--rev-comp <FILE/-/LIST>
        File of sequence IDs or list of IDs to be transformed, no argument
        to transform all sequences. Formatting follows the same rules as
        --ids files.

    --careful [OFF]
        FASTQ only. Check every FASTQ record to have a valid format. '@/+'
        at the start of id lines, sequence and quality of identical length,
        phred within boundaries. Slows down the parsing.

    --lower-case
    --upper-case
        Convert output sequence to lower/upper case.

    --iupac-to-N
        Convert non [ATGCNatgcn] characters to N.

    --phred-offset [auto]
        FASTQ only. Specify Phred offset for quality scores. Default
        auto-detect.

    --phred-transform
        FASTQ only. Transform phreds from input offset to specified
        "--phred-transform" offset, usually 33 to 64 or wise versa.

    --phred-mask
        FASTQ only. At least two values separated by ",", e.g "0,10" to mask
        all Nucleotides with phred below 10 with an "N". Optional additional
        values control:

          minimum length of unmasked regions
          minimum length of masked regions
          bps to ignore at the ends of masked regions (shorten masked regions to their
            core)
          a ratio that determines whether to mask/unmasked terminal regions that are
            smaller than are minimum unmasked region

        NOTE: Requires "--phred-offset".

    --trim-window <INT1>,[<INT2>],[<INT3>]
        FASTQ only. Trim sequences to quality >= SOFT,HARD,SIZE in a sliding
        window, default 10. The sliding window allows to have positions
        below the SOFT cutoff provided the window mean is higher than SOFT.
        Qualities below HARD, default 0, will always terminate a stretch. It
        is made sure that a) positions with quality below cutoff only occur
        within the remaining sequence, not at its start/end and b) windows
        never overlap eachother.

    --trim-lcs <INT,INT,INT>
        FASTQ only. Three values separated by ",", e.g. "30,40,50" to grep
        all stretches of quality >= 30 and minimum length 50 from the
        sequences. Faster than "--trim-window" yet breaks sequences even on
        a single low quality position.

        NOTE: "--trim-lcs" and "--trim-window" can be combined, e.g.

          --trim-lcs 5,40,100 --trim-window 10

        will generate sequences with qualities of at least 5 at every
        position and a window mean of 10.

    --substr <FILE/-/LIST>
        Pathname to a FILE containing information for subseq
        extraction/modification. The format is a tsv, by default lines of
        the format ID FROM TO are expected. Lines prepened by '#' are
        treated as comments and therefore ignored. If --substr-perl-style is
        set, the lines must start with the ID of the read, followed by the
        substr values OFFSET,LENGTH,REPLACESEQ,REPLACEQUAL. The parameter
        usage is than the same as for perl builtin "substr" function,
        meaning an OFFSET alone is sufficient, a positive value is set from
        the start of the sequence, a negative offset from the end, without
        LENGTH, the sequence is returned from OFFSET to its end.
        REPLACEMENTS are introduced at the OFFSET position, if LENGTH is 0,
        it is a simple insertion, else a part is deleted first and the
        REPLACEMENT is then inserted. Substring extraction is of course
        performed prior to any other trimming. To trim all reads use '*'
        instead of the read id. This command will be performed prior to any
        indiviual substr command.

          FROM TO:
            # extract sequence from pos 10 to pos 50
            read1 10 50

          OFFSET [LENGTH [REPLACEMENT]]
            # trim read1 head and tail by 10
            read1   10
            # extract from read2 250 nts starting at pos 15
            read2   15   250
            # replace 3 nt by an "N"" with qual "!" (for FASTQ)
            read3   3   1   N   !
            # trim from all reads 5nts at the beginning and the end.
            *   5
            *   -5

    --substr-perl-style
        By default, substr information are read according to the format FROM
        TO. Set this flag to switch the behaviour to perl substr() like
        style of "OFFSET [LENGTH [REPLACEMENT]]"

    -N|--Nx <INT,INT...>
        Report Nx value (N50, N90...). Default "50,90".

    -C|--base-composition <BASE(S),BASE(S),BASE(S),...>
        Report relative amount of given bases. Takes a "," separated list,
        each element of the list can be one or more bases (cummulative).

          --base-composition=GC,N        # combined GC and N content

    -H|--histogram
        Plot distribution of bases by length as ASCII plot. Uses linear
        scale for data sets with difference in order of magnitude < 2, log
        scale otherwise.

    --[no]-smart-labels
        Toggle shortening filepaths to shortest unique labels.

    -p|--progress
        Display progress bars (eq. '--verbose 2')

    -q|--quiet
        Omit all verbose messages. The same as --verbose=0, Superceeds
        --verbose settings.

    --verbose <INT>
        Toggle verbose level, default 2, which outputs statistics and
        progress. Set 1 for statistics only or 0 for no verbose output.

    -h|--help
        Display this help

    -V|--version
        Display current version

seqfilter's People

Contributors

thackl avatar greatfireball avatar

Stargazers

 avatar johnsonz avatar  avatar Luis Tataje avatar Federico López-Osorio avatar  avatar Antonio Rodriguez Franco avatar Markus J. Ankenbrand avatar  avatar

Watchers

James Cloos avatar  avatar Markus J. Ankenbrand avatar  avatar  avatar Simon Pfaff avatar José Afonso Guerra-Assunção avatar

seqfilter's Issues

--substr outputs too many sequences

When running SeqFilter with the --substr option, sequences that are not represented in the substring-file show up completely in the output. I would expect them to not be present at all.

Example:

$ cat test.fa
>1
AGTCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGTACGAGCTACGAGCTCGATCGGACTG
>2
CGATCGACTACGACGATCGAGCATCAGAGCTCGCTCGCTCGCTCGGCTGATCGAAAGCAG
>3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG

$ cat substr.tab
1   3   7
1   2   42
2   25  12

Running

SeqFilter --substr substr.tab --out test.out test.fa

Result:

$ cat test.out 
>1.1 SUBSTR:3,5
CTGCT
>1.2 SUBSTR:2,41
TCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGT
>2 SUBSTR:12,14
TGATGCTCGATCGT
>3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG

Expected Result:

$ cat test.out 
>1.1 SUBSTR:3,5
CTGCT
>1.2 SUBSTR:2,41
TCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGT
>2 SUBSTR:12,14
TGATGCTCGATCGT

--base-content output is not ordered correctly

Just a small bug in the output of --base-content results. Apparently, the order in the header is the same as specified in the command and the order of the results is alphabetical.

Example:

$ SeqFilter --base-content N,GC my-sequences.fa
#source	state	reads	bases	max	min	N50	N90	%N	%GC
my-sequences.fa	RAW	3981	268398063	958516	1560	91768	30966	38.96	0.00
$ SeqFilter --base-content GC,N my-sequences.fa
#source	state	reads	bases	max	min	N50	N90	%GC	%N
my-sequences.fa RAW	3981	268398063	958516	1560	91768	30966	38.96	0.00

Moving from verbose module to Log4perl?

Originally, I planned to use the well established Log::Log4perl module and Term::ProgressBar to substitute the Verbose module. @PfaffS already started the implementation. Should I finish the integration? Or are not not willing to replace your Verbose module?

Maybe with the help of @PfaffS?

Output for multiple Input file with same filename is confusing for different SeqFilter runs

Due to only the filename is printed for the output table if SeqFilter was called with a single filename, it is confusing for multiple runs of SeqFilter each with a single input file with identical filenames.

Currently I am comparing different assemblies. To achieve faster output, I am using multiple SeqFilter calls. The assembly output filename is called transcripts.fasta:

mkdir run_a run_b && \
echo -e ">blub\nACGT" >run_a/transcripts.fasta && \
echo -e ">bla\nACGT" >run_b/transcripts.fasta

In case I am using a single SeqFilter call, a unique path is printed:

SeqFilter run_a/transcripts.fasta run_b/transcripts.fasta 2>/dev/null | column -t
#source                  state  reads  bases  max  min  N50  N90
run_a/transcripts.fasta  RAW    1      15     15   15   15   15
run_b/transcripts.fasta  RAW    1      15     15   15   15   15
TOTAL                    RAW    2      30     15   15   15   15

but for two SeqFilter calls, it will result in an ambiguous list:

(SeqFilter run_a/transcripts.fasta 2>/dev/null; \
> SeqFilter run_b/transcripts.fasta 2>/dev/null) | column -t | | sed '2,$s/^#.*$//' | sed '/^$/d'
#source            state  reads  bases  max  min  N50  N90
transcripts.fasta  RAW    1      15     15   15   15   15
transcripts.fasta  RAW    1      15     15   15   15   15

Tested with SeqFilter 2.1.7 and 2.1.9; column and sed commands are not required and will only reformat the output.

The missing TOTAL line is obviously, but unfortunately, it is hard to identify the correct input file.

This might be solved by:

  1. printing the whole filename parameter instead of only the file name if specified
  2. enable printing of whole filenames via command line switch
  3. something different :)

Parameter `--ids -` for an empty set returns all sequences

Description
I want to use the option --ids inside a pipeline. In case the input steps generate no output, SeqFilter will return the complete sequence set instead of an empty file.

Version
2.1.8

Expected behavior
Returning an empty sequence file

Example

cat >seq.fa <<EOF
>A
ACG
>B
HIJ
>C
KLM
EOF

# If IDs are provided via pipeline, output is as expected
grep "^>[AB]" seq.fa | SeqFilter/bin/SeqFilter --ids - seq.fa 
# [17:24:14] SeqFilter/bin/SeqFilter-2.1.8
# [17:24:14] Detected FASTA format
# [17:24:14] --ids: STDIN
# [17:24:14] --in: seq.fa
# #source	state	reads	bases	max	min	N50	N90
# seq.fa	RAW	3	9	3	3	3	3
# seq.fa	FIL	2	6	3	3	3	3

# If no IDs are provided via pipeline, output contains complete sequence set
grep "^>[D]" seq.fa | SeqFilter/bin/SeqFilter --ids - seq.fa 
# [17:26:09] SeqFilter/bin/SeqFilter-2.1.8
# [17:26:09] Detected FASTA format
# [17:26:09] --ids: STDIN
# [17:26:09] --in: seq.fa
# #source	state	reads	bases	max	min	N50	N90
# seq.fa	RAW	3	9	3	3	3	3
# seq.fa	FIL	3	9	3	3	3	3

`--ids-pattern` returns 0 sequences

Hi Thomas,

When running SeqFilter with the --ids-pattern option, I don't get any matching sequences.

Example:

test.fa:

>blablub1
AGTCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGTACGAGCTACGAGCTCGATCGGACTG
>blubbla2
CGATCGACTACGACGATCGAGCATCAGAGCTCGCTCGCTCGCTCGGCTGATCGAAAGCAG
>bla3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG
$ SeqFilter --ids_pattern 'bla.*' test.fa
[14:04:18] /storage/genomics/scripts/bin/SeqFilter-2.1.6
[14:04:18] Detected FASTA format
[14:04:18] --in: test.fa
#source	state	reads	bases	max	min	N50	N90
test.fa	RAW	3	192	68	60	64	60
test.fa	FIL	0	0	NA	NA	NA	NA

I would expect, to get 2 Sequences (the first and the last). Am I using this option correctly or is there a bug somewhere?

--ids-patt with spaces

When applying the --ids-patt with headers that include white spaces/descriptions, the patterns are only searched until the first space occurs.

I used sed -e "s/\[//g" -e "s/\]//g" protall.fasta | sed -e "s/ /_/g" now to fix my file prior to SeqFilter, but it might be good to add info about this to the help.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.