
fastaai's People

Contributors

cruizperez, kgerhardt, lmrodriguezr

fastaai's Issues

Data type of HMM best hit scores

In the kmer_extract function, selection of best HMM hits happens via parsing the filtered HMM file and selecting the highest score for each protein.

The problem is that str.split() returns a list of strings; it does not convert each chunk to a numeric type. The HMM scores are therefore compared as strings, which is a lexicographic comparison, so "80" > "300" for the protein scores.

score = line[8]

in that function must be replaced by

score = float(line[8])
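
To illustrate the fix, here is a minimal sketch of best-hit selection with the score converted before comparison. It is not the actual kmer_extract code: the function name best_hits and the protein name sitting in column 0 are assumptions for illustration, and only the score index 8 comes from the snippet above.

```python
def best_hits(lines):
    """Return {protein: best_score} from whitespace-delimited HMM hit lines.

    Column positions are assumptions for illustration: protein name in
    column 0, score in column 8 (the index used in the snippet above).
    """
    best = {}
    for raw in lines:
        # Skip blank lines and comment lines.
        if not raw.strip() or raw.startswith("#"):
            continue
        fields = raw.split()          # every field is still a str here
        protein = fields[0]
        score = float(fields[8])      # convert before comparing
        if protein not in best or score > best[protein]:
            best[protein] = score
    return best


# String comparison is lexicographic; numeric comparison is not.
print("80" > "300")                # True
print(float("80") > float("300"))  # False
```

With the conversion in place, a protein with hits scoring 80 and 300 keeps the 300 hit rather than the 80 one.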

UnicodeDecodeError

Hey there! Thanks so much for this tool, it's exactly what I was looking for!

I was able to install FastAAI easily using pip install, and made sure all the dependencies installed properly. However, I keep getting the same error message anytime I run build_db (this is the only module I've tried thus far). Here's the line of code I'm running:

fastaai build_db --genomes Genomes_85_5/ --threads 20 --verbose --output Halomonas_fAAI_Build --database Halomonas_Build_DB.db --compress

I tried it both on a server that has Python 3.6 installed, and on my own computer that has Python 3.9 installed. Here's the error I get with Python 3.6:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte

And here's the error I get with Python 3.9 (similar but slightly different):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 26: invalid continuation byte

Any suggestions on what could be going wrong, or how to resolve this issue?

Thank you!!
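
One possible cause, not confirmed in this thread, is an input file that is not plain UTF-8 text (for example, a file saved in another encoding or containing a stray non-breaking space in a header). The sketch below scans a directory and reports which files fail to decode; the helper name find_non_utf8 is made up for this example and is not part of FastAAI.

```python
from pathlib import Path


def find_non_utf8(directory):
    """Yield (path, error) for each regular file that is not valid UTF-8."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            # Reads the whole file into memory; fine for a quick check.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            yield path, err


# Example use (directory name taken from the command above):
# for path, err in find_non_utf8("Genomes_85_5/"):
#     print(f"{path}: byte {err.object[err.start]:#04x} at offset {err.start}")
```

Note that compressed files would also be reported by this check, since their contents are binary.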

Description of output

Hi,
I successfully installed fastaai. However, I could not find the fastaai conda package in your repo (I installed the other dependencies using conda and cloned your GitHub repo).

I also succeeded in running a first sample pair (two FASTA files).

However, I do not understand the output:
sample-107.fasta sample-100.fasta 0.45 0.2385 83 83 65.25653091282805

Could you also provide the column headers or describe the output in the readme?

Thanks for this very nice tool

Discrepancies with CompareM results

Dear developer,
I tested three protein sets and found that the FastAAI results are about 5% lower than those from CompareM. Have you compared this software against the classic blastp method, and how accurate is it?
FastAAI:

query_genome  A_Mi.faa  B_Ms.faa  C_Ms1.faa
A_Mi          >90%      84.67     84.67
B_Ms          84.67     >90%      >90%
C_Ms1         84.67     >90%      >90%

CompareM:

#Genome A  Genes in A  Genome B  Genes in B  # orthologous genes  Mean AAI  Std AAI  Orthologous fraction (OF)
A_Mi       11642       C_Ms1     13182       8222                 89.58     10.79    70.62
A_Mi       11642       B_Ms      13182       8222                 89.58     10.79    70.62
C_Ms1      13182       B_Ms      13182       13090                100.00    0.00     99.30

Another question: is there a way to show the exact AAI value when it is below 30 or above 90, instead of the placeholder token?
Looking forward to your reply. Thanks a lot.

build_db: Device or resource busy

shutil.rmtree(td)

in the build_db code tries to remove a directory while processes are still using files inside it. Wrapping the call in try/except would help, as would using a different temp directory for each run (e.g., naming the directories based on UUIDs).
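
A rough sketch of that suggestion, using only the standard library: tempfile.mkdtemp already gives each run a uniquely named directory, and the cleanup is wrapped so that a busy file produces a warning instead of a crash. The function run_with_private_tempdir and the callable work are hypothetical stand-ins for the build step, not FastAAI code.

```python
import shutil
import tempfile


def run_with_private_tempdir(work, parent=None):
    """Run work(tmp_path) in a uniquely named temp directory, then clean up.

    `work` is a hypothetical callable standing in for the build step.
    """
    td = tempfile.mkdtemp(prefix="fastaai_", dir=parent)  # unique per run
    try:
        return work(td)
    finally:
        try:
            shutil.rmtree(td)
        except OSError as err:
            # On NFS, files that are still open can remain as .nfsXXXX
            # entries and cannot be unlinked yet; warn instead of crashing.
            print(f"Warning: could not fully remove {td}: {err}")
```

The trade-off is that a leftover directory may need manual cleanup later, but the run itself still finishes.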

The specific error:

  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/bin/fastaai", line 8, in <module>
    sys.exit(main())
  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/lib/python3.7/site-packages/fastaai/FastAAI.py", line 3927, in main
    build_db(genomes, proteins, hmms, db_name, output, threads, verbose, do_comp)
  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/lib/python3.7/site-packages/fastaai/FastAAI.py", line 1618, in build_db
    shutil.rmtree(td)
  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/lib/python3.7/shutil.py", line 494, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/lib/python3.7/shutil.py", line 452, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/tmp/global2/nyoungblut/code/dev/ll_pipelines/llg/.snakemake/conda/e597b1bc4c3f6c65a46887160aeefc74/lib/python3.7/shutil.py", line 450, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs0000001c0007a6050021804d'

Conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
ca-certificates           2022.6.15            ha878542_0    conda-forge
fastaai                   0.1.15                   pypi_0    pypi
hmmer                     3.3.2                h87f3376_2    bioconda
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgfortran-ng            12.1.0              h69a702a_16    conda-forge
libgfortran5              12.1.0              hdcd56e2_16    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libsqlite                 3.39.3               h753d276_0    conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
libzlib                   1.2.12               h166bdaf_2    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.21.6           py37h976b520_0    conda-forge
openssl                   3.0.5                h166bdaf_1    conda-forge
pigz                      2.6                  h27826a3_0    conda-forge
pip                       22.2.2             pyhd8ed1ab_0    conda-forge
prodigal                  2.6.3                hec16e2b_4    bioconda
psutil                    5.9.2                    pypi_0    pypi
pyhmmer                   0.6.2                    pypi_0    pypi
pyrodigal                 1.1.2                    pypi_0    pypi
python                    3.7.12          hf930737_100_cpython    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
setuptools                65.3.0           py37h89c1867_0    conda-forge
sqlite                    3.39.3               h4ff8645_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.12               h166bdaf_2    conda-forge

Removal of aai_index module

I recently updated FastAAI, and now I get the following error when running fastaai aai_index:

 I couldn't find the module you specified. Please select one of the following modules:

So, it appears that the UI was dramatically changed, but I can't find a changelog or release notes stating this. Which is the last pypi release to contain aai_index?

Also, it appears that your docs still include aai_index:

	If you wish to query multiple genomes against themselves in all vs. all AAI search, use aai_index instead.
	If you wish to query multiple genomes against multiple targets, use multi_query instead.

...but aai_index was commented-out a few lines lower in the code:

		#print("    multi_query  |" + " Create a query DB and a target DB, then calculate query vs. target AAI")
		#print("    aai_index    |" + " Create a database from multiple genomes and do an all vs. all AAI index of the genomes")

Install failure with python 3.12.1

Please make FastAAI compatible with the latest version of Python.

Commands:

conda create -n fastaai python=3.12.1
conda activate fastaai
conda install pip -y
pip install FastAAI

Error:

....
pyrodigal/_pyrodigal.c:80984:16: note: in expansion of macro ‘__Pyx_IsTracing’
80984 | return __Pyx_IsTracing(tstate, 0, 0) && retval;
| ^~~~~~~~~~~~~~~
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyrodigal
Failed to build pyrodigal
ERROR: Could not build wheels for pyrodigal, which is required to install pyproject.toml-based projects

sqlite3.OperationalError: unrecognized token

When a genome name starts with a number, the tool fails:

Traceback (most recent call last):
  File "...../multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "...../site-packages/fastaai/FastAAI.py", line 2281, in do_sql_query_no_SD
    database.cursor.execute(temp_tab)
sqlite3.OperationalError: unrecognized token: "**11_0_1__xyz_PF01813_17**"

These are the lines apparently causing the error:

temp_tab = "CREATE TEMP TABLE " + temp_name + " (kmer INTEGER)"
database.cursor.execute(temp_tab)

do_sql_query() has a similar fragment.
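
A possible fix is to quote the table name so SQLite treats it as an identifier even when it starts with a digit. The sketch below reproduces the failure on an in-memory database and shows the quoted form working. The helper quote_identifier is an assumption for illustration, and the table name is modeled on the one in the error message.

```python
import sqlite3


def quote_identifier(name):
    """Quote a string for use as an SQLite identifier (table or column name)."""
    # Double quotes delimit identifiers; embedded double quotes are doubled.
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
temp_name = "11_0_1__xyz_PF01813_17"  # modeled on the name in the traceback

try:
    # Unquoted: the leading digits cannot be parsed as a table name.
    cur.execute("CREATE TEMP TABLE " + temp_name + " (kmer INTEGER)")
except sqlite3.OperationalError as err:
    print("unquoted name fails:", err)

# Quoted: the same name is accepted.
quoted = quote_identifier(temp_name)
cur.execute("CREATE TEMP TABLE " + quoted + " (kmer INTEGER)")
cur.execute("INSERT INTO " + quoted + " VALUES (?)", (42,))
print(cur.execute("SELECT kmer FROM " + quoted).fetchall())
```

Every later statement that refers to the table would need the same quoting. An alternative is to prefix generated names with a letter.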

`single_query` is not working

Hello @KGerhardt

I noticed fastaai single_query seems to be broken.

The following command:

fastaai single_query -qp data/06.cds/gb_AQYU00000000.faa.gz -tp data/06.cds/Microbacterium_sediminis_GCA_001689915_1.faa.gz -o xxxx

Produces the output:

Query start:   Genome [Protein] Protein+HMM
Target start:  Genome [Protein] Protein+HMM

Output will be located at xxxx/results/gb_AQYU00000000_vs_Microbacterium_sediminis_GCA_001689915_1.aai.txt
/home/migagw/miniconda3/envs/miga-beta/share/rubygems/gems/miga-base-1.3.4.2/utils/FastAAI/fastaai/fastaai:273: DeprecationWarning: `Sequence.taxonomy_id` is not supported consistently in Easel and will be removed in `v0.8.0`
  easel_seq = easel_seq.digitize(pyhmmer.easel.Alphabet.amino())
Traceback (most recent call last):
  File "/home/migagw/miniconda3/envs/miga-beta/share/rubygems/gems/miga-base-1.3.4.2/utils/FastAAI/fastaai/fastaai", line 4803, in <module>
    main()
  File "/home/migagw/miniconda3/envs/miga-beta/share/rubygems/gems/miga-base-1.3.4.2/utils/FastAAI/fastaai/fastaai", line 4622, in main
    single_query(query_file, target_file, output, verbose, threads, do_compress)
  File "/home/migagw/miniconda3/envs/miga-beta/share/rubygems/gems/miga-base-1.3.4.2/utils/FastAAI/fastaai/fastaai", line 3784, in single_query
    print(query.partial_timings())
  File "/home/migagw/miniconda3/envs/miga-beta/share/rubygems/gems/miga-base-1.3.4.2/utils/FastAAI/fastaai/fastaai", line 919, in partial_timings
    protein_pred = self.prot_pred_time-self.init_time
TypeError: unsupported operand type(s) for -: 'float' and 'datetime.datetime'

The resulting folder doesn't have results other than the HMMs.
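
The TypeError suggests that one timestamp is stored as a float (as returned by time.time()) and the other as a datetime. Which attribute holds which type in the real code is an assumption here, but a sketch of normalizing both to epoch seconds before subtracting could look like this (as_epoch_seconds is a hypothetical helper):

```python
import time
from datetime import datetime


def as_epoch_seconds(t):
    """Return t as float seconds since the epoch.

    Accepts either a datetime or a number already in epoch seconds.
    """
    if isinstance(t, datetime):
        return t.timestamp()
    return float(t)


init_time = datetime.now()     # recorded as a datetime
prot_pred_time = time.time()   # recorded as a float

# prot_pred_time - init_time would raise the TypeError shown above.
elapsed = as_epoch_seconds(prot_pred_time) - as_epoch_seconds(init_time)
print(f"protein prediction took {elapsed:.3f} s")
```

Recording every timestamp with the same call in the first place would avoid the conversion entirely.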

Query database improperly formatted. Exiting FastAAI

Hi,

Thanks so much for the nice tool! I encountered an error when using this tool.

Code and error:

fastaai build_db -p split -o fastaai_out --threads 90 --verbose --compress

Processing inputs
Completion |##################################################| 100.00% ( 229 of 229 ) at 04/09/2023 20:36:00

Collecting results
Database build complete!

fastaai db_query -q fastaai_out/database/FastAAI_database.sqlite.db -t fastaai_out/database/FastAAI_database.sqlite.db -o out --threads 80 --verbose

Query database improperly formatted. Exiting FastAAI

Do you have any suggestions? Thank you!
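
As a first diagnostic, it may help to check what the database file actually contains. The sketch below lists the tables in an SQLite file using the standard sqlite3 module. It makes no assumption about which tables FastAAI expects, and list_tables is a made-up helper name.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example use (path taken from the command above):
# print(list_tables("fastaai_out/database/FastAAI_database.sqlite.db"))
```

An empty list, or an sqlite3.DatabaseError, would indicate that the file is empty or is not a readable SQLite database.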

Matrix output is broken

fastaai db_query runs smoothly on a formatted database. However, it breaks when matrix output is requested:

fastaai db_query --query ./FastAAI/database/bac_proteidb --target ./FastAAI/database/bac_proteidb --threads 14 --verbose --output FastAAI_matrix --output_style matrix

Performing an all vs. all query on ./FastAAI/database/bac_proteidb
Perusing database metadata

Calculating AAI
Completion |##################################################| 100.00% ( 548 of 548 ) at 25/05/2023 14:44:00

Finalizing results.
Completion |################# | 35.71% ( 5 of 14 ) at 25/05/2023 14:44:00
Traceback (most recent call last):
  File "/home/filipe/.local/bin/fastaai", line 8, in <module>
    sys.exit(main())
  File "/home/filipe/.local/lib/python3.10/site-packages/fastaai/fastaai.py", line 4613, in main
    db_query(query, target, verbose, output, threads, do_stdev, style, in_mem, store)
  File "/home/filipe/.local/lib/python3.10/site-packages/fastaai/fastaai.py", line 3399, in db_query
    mdb.run()
  File "/home/filipe/.local/lib/python3.10/site-packages/fastaai/fastaai.py", line 3335, in run
    self.db_on_disk()
  File "/home/filipe/.local/lib/python3.10/site-packages/fastaai/fastaai.py", line 3265, in db_on_disk
    self.write_mat_from_files(result_files, tempdir_path)
  File "/home/filipe/.local/lib/python3.10/site-packages/fastaai/fastaai.py", line 3288, in write_mat_from_files
    fh = open(f, "r")
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpgi3knqay/partial_results_group_7.txt'

The matrix file is produced with only a small part of the accessions. Any idea how to fix this?

Private repo

The MiGA base code uses this repo, but it cannot be pulled because it is private.
