bioinf-wuerzburg / seqfilter Goto Github PK

9.0 7.0 6.0 106 KB

Versatile FASTA/FASTQ sequence file analysis and modification tool

License: MIT License

Perl 98.93% Makefile 1.07%

seqfilter's Introduction

Usage:
      SeqFilter <fa/fq>
      SeqFilter [filter] <fa/fq> --out <fa/fq>
      cat <fa/fq> | SeqFilter [filter] --out - | ...

Options:
    -o|--out <FILE/-> [OFF]
    -c|--stdout
        Output to file. -c for STDOUT.

    --stats <FILE> [-]
        Print stats to file. Default STDOUT or STDERR if STDOUT is in use.

    --ids <FILE/-/IDLIST>
        File of sequence IDs or literal list of IDs to be reported. Reads
        comma-, whitespace and newline separated lists. Leading '>' or '@'
        are ignored.

          SeqFilter fasta.fa --ids "seq35,seq49"  # list
          SeqFilter fasta.fa --ids ids.list       # file

    --ids-pattern <FILE/-/PATTERNLIST>
        Match a perl PATTERN or a link to a file containing multiple
        PATTERNs, one per line, against sequence ids. Keep matching
        sequences.

          SeqFilter seqs.fq --ids-patt 'comp2_c13_seq.*|comp2_c88_seq.*'

        Extended usage: add capture groups to perl P(A)TT(ER)N using "()".
        Splits matched sequences to different output files based on capture.
        Use sprintf conversions (%s, %02d, ...) in --out to define a
        template for the output file names.

          # split seqs by library (LIB1, LIB2, LIB14)
          SeqFilter multilib.fq --ids-pattern '\w+(\d+)' --out multilib_%02d.fq
          # creates multilib_01.fq, multilib_02.fq, multilib_14.fq

        NOTE: Perl needs to open a filehandle to every split file, this can
        slow things down considerably if you want to split into more than
        1000 different files with occurances of patterns randomly mixed in
        source file.

    --ids-exclude
        Reverse behaviour of --ids and --ids-pattern to excluding matched
        sequences, and keeping unmatched ones.

    --ids-rename <PATTERN>
        Provide a perl substitution pattern as string. The pattern is
        applied to every id. Use global "$COUNT" to access the output
        sequence counter, $PICARD to access the picard number.

          SeqFilter library_1.fq --ids-rename='s/.*\//sprintf("MYLIB_%05d\/%d", $COUNT, $PICARD)/e'
          # creates ids:
          # @MYLIB_00001/1
          # @MYLIB_00002/1 ...

    --desc-replace
    --desc-append
        Remove (no arg)/replace/append description of sequences.

    -l|--min-length <INT>
        Minimum sequence length.

    -L|--max-length <INT>
        Maximum length for sequences to be retrieved. Default off.

    -A|--fasta
    -Q|--fastq <CHAR>
        Convert FASTQ/FASTA to FASTA/FASTQ, respectively.

    -w|--line-width [80 for FASTA]
        FASTA only. Set 0 for single line. Ignored for FASTQ.

    --rc|--rev-comp <FILE/-/LIST>
        File of sequence IDs or list of IDs to be transformed, no argument
        to transform all sequences. Formatting follows the same rules as
        --ids files.

    --careful [OFF]
        FASTQ only. Check every FASTQ record to have a valid format. '@/+'
        at the start of id lines, sequence and quality of identical length,
        phred within boundaries. Slows down the parsing.

    --lower-case
    --upper-case
        Convert output sequence to lower/upper case.

    --iupac-to-N
        Convert non [ATGCNatgcn] characters to N.

    --phred-offset [auto]
        FASTQ only. Specify Phred offset for quality scores. Default
        auto-detect.

    --phred-transform
        FASTQ only. Transform phreds from input offset to specified
        "--phred-transform" offset, usually 33 to 64 or wise versa.

    --phred-mask
        FASTQ only. At least two values separated by ",", e.g "0,10" to mask
        all Nucleotides with phred below 10 with an "N". Optional additional
        values control:

          minimum length of unmasked regions
          minimum length of masked regions
          bps to ignore at the ends of masked regions (shorten masked regions to their
            core)
          a ratio that determines whether to mask/unmasked terminal regions that are
            smaller than are minimum unmasked region

        NOTE: Requires "--phred-offset".

    --trim-window <INT1>,[<INT2>],[<INT3>]
        FASTQ only. Trim sequences to quality >= SOFT,HARD,SIZE in a sliding
        window, default 10. The sliding window allows to have positions
        below the SOFT cutoff provided the window mean is higher than SOFT.
        Qualities below HARD, default 0, will always terminate a stretch. It
        is made sure that a) positions with quality below cutoff only occur
        within the remaining sequence, not at its start/end and b) windows
        never overlap eachother.

    --trim-lcs <INT,INT,INT>
        FASTQ only. Three values separated by ",", e.g. "30,40,50" to grep
        all stretches of quality >= 30 and minimum length 50 from the
        sequences. Faster than "--trim-window" yet breaks sequences even on
        a single low quality position.

        NOTE: "--trim-lcs" and "--trim-window" can be combined, e.g.

          --trim-lcs 5,40,100 --trim-window 10

        will generate sequences with qualities of at least 5 at every
        position and a window mean of 10.

    --substr <FILE/-/LIST>
        Pathname to a FILE containing information for subseq
        extraction/modification. The format is a tsv, by default lines of
        the format ID FROM TO are expected. Lines prepened by '#' are
        treated as comments and therefore ignored. If --substr-perl-style is
        set, the lines must start with the ID of the read, followed by the
        substr values OFFSET,LENGTH,REPLACESEQ,REPLACEQUAL. The parameter
        usage is than the same as for perl builtin "substr" function,
        meaning an OFFSET alone is sufficient, a positive value is set from
        the start of the sequence, a negative offset from the end, without
        LENGTH, the sequence is returned from OFFSET to its end.
        REPLACEMENTS are introduced at the OFFSET position, if LENGTH is 0,
        it is a simple insertion, else a part is deleted first and the
        REPLACEMENT is then inserted. Substring extraction is of course
        performed prior to any other trimming. To trim all reads use '*'
        instead of the read id. This command will be performed prior to any
        indiviual substr command.

          FROM TO:
            # extract sequence from pos 10 to pos 50
            read1 10 50

          OFFSET [LENGTH [REPLACEMENT]]
            # trim read1 head and tail by 10
            read1   10
            # extract from read2 250 nts starting at pos 15
            read2   15   250
            # replace 3 nt by an "N"" with qual "!" (for FASTQ)
            read3   3   1   N   !
            # trim from all reads 5nts at the beginning and the end.
            *   5
            *   -5

    --substr-perl-style
        By default, substr information are read according to the format FROM
        TO. Set this flag to switch the behaviour to perl substr() like
        style of "OFFSET [LENGTH [REPLACEMENT]]"

    -N|--Nx <INT,INT...>
        Report Nx value (N50, N90...). Default "50,90".

    -C|--base-composition <BASE(S),BASE(S),BASE(S),...>
        Report relative amount of given bases. Takes a "," separated list,
        each element of the list can be one or more bases (cummulative).

          --base-composition=GC,N        # combined GC and N content

    -H|--histogram
        Plot distribution of bases by length as ASCII plot. Uses linear
        scale for data sets with difference in order of magnitude < 2, log
        scale otherwise.

    --[no]-smart-labels
        Toggle shortening filepaths to shortest unique labels.

    -p|--progress
        Display progress bars (eq. '--verbose 2')

    -q|--quiet
        Omit all verbose messages. The same as --verbose=0, Superceeds
        --verbose settings.

    --verbose <INT>
        Toggle verbose level, default 2, which outputs statistics and
        progress. Set 1 for statistics only or 0 for no verbose output.

    -h|--help
        Display this help

    -V|--version
        Display current version

seqfilter's People

Contributors

Stargazers

Watchers

Forkers

el-mat nterhoeven mz-cy-han1998 marcomeola fbemm greatfireball

seqfilter's Issues

Please add a LICENSE file

Suggestion: MIT or Public Domain

--substr outputs too many sequences

When running SeqFilter with the --substr option, sequences that are not represented in the substring-file show up completely in the output. I would expect them to not be present at all.

Example:

$ cat test.fa
>1
AGTCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGTACGAGCTACGAGCTCGATCGGACTG
>2
CGATCGACTACGACGATCGAGCATCAGAGCTCGCTCGCTCGCTCGGCTGATCGAAAGCAG
>3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG

$ cat substr.tab
1   3   7
1   2   42
2   25  12

Running

SeqFilter --substr substr.tab --out test.out test.fa

Result:

$ cat test.out 
>1.1 SUBSTR:3,5
CTGCT
>1.2 SUBSTR:2,41
TCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGT
>2 SUBSTR:12,14
TGATGCTCGATCGT
>3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG

Expected Result:

$ cat test.out 
>1.1 SUBSTR:3,5
CTGCT
>1.2 SUBSTR:2,41
TCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGT
>2 SUBSTR:12,14
TGATGCTCGATCGT

Implementation of a docker container providing SeqFilter software

I did implement a docker container for the SeqFilter package. The Dockerfile is available from GitHub and the readly to use image is available from Docker-Hub.

Maybe we can include that here or add it into the documentation.

Add documentation and bump version to 2.1.10

After closing two issues #7 and #12 I think it will be good to add a little documentation about #12 and bump the version number to 2.1.10.

Will you be able to do it @thackl ?

--base-content output is not ordered correctly

Just a small bug in the output of --base-content results. Apparently, the order in the header is the same as specified in the command and the order of the results is alphabetical.

Example:

$ SeqFilter --base-content N,GC my-sequences.fa
#source	state	reads	bases	max	min	N50	N90	%N	%GC
my-sequences.fa	RAW	3981	268398063	958516	1560	91768	30966	38.96	0.00
$ SeqFilter --base-content GC,N my-sequences.fa
#source	state	reads	bases	max	min	N50	N90	%GC	%N
my-sequences.fa RAW	3981	268398063	958516	1560	91768	30966	38.96	0.00

Moving from verbose module to Log4perl?

Originally, I planned to use the well established Log::Log4perl module and Term::ProgressBar to substitute the Verbose module. @PfaffS already started the implementation. Should I finish the integration? Or are not not willing to replace your Verbose module?

Maybe with the help of @PfaffS?

Output for multiple Input file with same filename is confusing for different SeqFilter runs

Due to only the filename is printed for the output table if SeqFilter was called with a single filename, it is confusing for multiple runs of SeqFilter each with a single input file with identical filenames.

Currently I am comparing different assemblies. To achieve faster output, I am using multiple SeqFilter calls. The assembly output filename is called transcripts.fasta:

mkdir run_a run_b && \
echo -e ">blub\nACGT" >run_a/transcripts.fasta && \
echo -e ">bla\nACGT" >run_b/transcripts.fasta

In case I am using a single SeqFilter call, a unique path is printed:

SeqFilter run_a/transcripts.fasta run_b/transcripts.fasta 2>/dev/null | column -t
#source                  state  reads  bases  max  min  N50  N90
run_a/transcripts.fasta  RAW    1      15     15   15   15   15
run_b/transcripts.fasta  RAW    1      15     15   15   15   15
TOTAL                    RAW    2      30     15   15   15   15

but for two SeqFilter calls, it will result in an ambiguous list:

(SeqFilter run_a/transcripts.fasta 2>/dev/null; \
> SeqFilter run_b/transcripts.fasta 2>/dev/null) | column -t | | sed '2,$s/^#.*$//' | sed '/^$/d'
#source            state  reads  bases  max  min  N50  N90
transcripts.fasta  RAW    1      15     15   15   15   15
transcripts.fasta  RAW    1      15     15   15   15   15

Tested with SeqFilter 2.1.7 and 2.1.9; column and sed commands are not required and will only reformat the output.

The missing TOTAL line is obviously, but unfortunately, it is hard to identify the correct input file.

This might be solved by:

printing the whole filename parameter instead of only the file name if specified
enable printing of whole filenames via command line switch
something different :)

Parameter `--ids -` for an empty set returns all sequences

Description
I want to use the option --ids inside a pipeline. In case the input steps generate no output, SeqFilter will return the complete sequence set instead of an empty file.

Version
2.1.8

Expected behavior
Returning an empty sequence file

Example

cat >seq.fa <<EOF
>A
ACG
>B
HIJ
>C
KLM
EOF

# If IDs are provided via pipeline, output is as expected
grep "^>[AB]" seq.fa | SeqFilter/bin/SeqFilter --ids - seq.fa 
# [17:24:14] SeqFilter/bin/SeqFilter-2.1.8
# [17:24:14] Detected FASTA format
# [17:24:14] --ids: STDIN
# [17:24:14] --in: seq.fa
# #source	state	reads	bases	max	min	N50	N90
# seq.fa	RAW	3	9	3	3	3	3
# seq.fa	FIL	2	6	3	3	3	3

# If no IDs are provided via pipeline, output contains complete sequence set
grep "^>[D]" seq.fa | SeqFilter/bin/SeqFilter --ids - seq.fa 
# [17:26:09] SeqFilter/bin/SeqFilter-2.1.8
# [17:26:09] Detected FASTA format
# [17:26:09] --ids: STDIN
# [17:26:09] --in: seq.fa
# #source	state	reads	bases	max	min	N50	N90
# seq.fa	RAW	3	9	3	3	3	3
# seq.fa	FIL	3	9	3	3	3	3

`--ids-pattern` returns 0 sequences

Hi Thomas,

When running SeqFilter with the --ids-pattern option, I don't get any matching sequences.

Example:

test.fa:

>blablub1
AGTCTGCTCGCTCGATCGACTACGATCCGATCGCTAGCTACGTACGAGCTACGAGCTCGATCGGACTG
>blubbla2
CGATCGACTACGACGATCGAGCATCAGAGCTCGCTCGCTCGCTCGGCTGATCGAAAGCAG
>bla3
ACTGCCGACTACTGCATAACCCATCGGCATTACGCGATTCGATACGACTGACGATCTAGTCACG

$ SeqFilter --ids_pattern 'bla.*' test.fa
[14:04:18] /storage/genomics/scripts/bin/SeqFilter-2.1.6
[14:04:18] Detected FASTA format
[14:04:18] --in: test.fa
#source	state	reads	bases	max	min	N50	N90
test.fa	RAW	3	192	68	60	64	60
test.fa	FIL	0	0	NA	NA	NA	NA

I would expect, to get 2 Sequences (the first and the last). Am I using this option correctly or is there a bug somewhere?

--ids-patt with spaces

When applying the --ids-patt with headers that include white spaces/descriptions, the patterns are only searched until the first space occurs.

I used sed -e "s/\[//g" -e "s/\]//g" protall.fasta | sed -e "s/ /_/g" now to fix my file prior to SeqFilter, but it might be good to add info about this to the help.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.