guigolab / geneidx Goto Github PK

Nextflow pipeline for genome annotation of protein-coding genes

License: GNU General Public License v3.0

Shell 0.31% Nextflow 37.14% Perl 14.53% Dockerfile 0.59% Awk 1.78% Jupyter Notebook 15.58% Python 7.43% HTML 1.19% Groovy 21.46%

gene-prediction genome-annotation nextflow-pipeline

geneidx's People

Contributors

Stargazers

Watchers

Forkers

robertftenbosch apollo994

geneidx's Issues

"command not found"

Hi, I'm using WSL2 Ubuntu for Windows 10.

Ubuntu 22.04.3 LTS
Windows 10 Version 10.0.19045.3803

I get the error below with all the methods I've tried. Downloading the geneidx release from Github with or without using a terminal and then running it, or using the first method given in "Running GeneidX example" in readme.

$ nextflow run ./main.nf -profile docker \
                                        --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv \
                                        --column_taxid_value GCA_905340225.1 \
                                        --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true&gzip=true \
                                        --column_id_value 938171 \
                                        --outdir /mnt/d/biotools-linux/geneidx/output
[1] 543
--column_id_value: command not found
username@MSI:/mnt/d/biotools-linux/geneidx$ N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [curious_ride] DSL2 - revision: 2c57827f27

GENOMEMASKING+GENEID+BLASTx - NextflowPipeline
=============================================
output path: /mnt/d/biotools-linux/geneidx/output
tsv input path: /mnt/d/biotools-linux/geneidx/data/assemblies.tsv
id column: Assembly Accession
taxid column: GCA_905340225.1
path column: https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true
mask genomes? false

After this the terminal is stuck (even after 3-4 hours) and there doesn't seem to be any activity until I type a random prompt, like s.

s
s: command not found

[1]+  Stopped                 nextflow run ./main.nf -profile docker --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv --column_taxid_value GCA_905340225.1 --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true

I've tried excluding column_id_value command and this time it said command not found for outdir. I've excluded both and got a similar message again.

$ nextflow run ./main.nf -profile docker \
                                        --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv \
                                        --column_taxid_value GCA_905340225.1 \
                                        --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true&gzip=true
[1] 874
username@MSI:/mnt/d/biotools-linux/geneidx$ N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [prickly_leavitt] DSL2 - revision: 2c57827f27

GENOMEMASKING+GENEID+BLASTx - NextflowPipeline
=============================================
output path: /mnt/d/biotools-linux/geneidx/output
tsv input path: /mnt/d/biotools-linux/geneidx/data/assemblies.tsv
id column: Assembly Accession
taxid column: GCA_905340225.1
path column: https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true
mask genomes? false

Then I have to do the same random typing to stop the process again, even though there is no activity, and I get the same "Stopped" message.

Edit: Confirmed it happens in v2.0.0 too.

Protein set and parameter file the output folder

The current output folder structure makes complicated to understand what proteins and what parameter file were used when multiple species are run from the same input csv. This is especially true when multiple species use the same protein evidence/parameter file. Here I suggest to add a sym link to the protein database as well as the parameter filem in the species specific folder.

geneidx tries to write in location where it should only read & taxon id not found

Hi,

I tried to run geneidx. With singularity, I failed to pull the image - but possibly that's an issue on my own server.

Error executing process > 'param_value_selection_workflow:getParamName (7240)'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ferriolcalvet-geneidx.img.pulling.1679713549962 docker://ferriolcalvet/geneidx > /dev/null
  status : 2
  message:
    Usage: singularity [options]
    
    singularity: error: no such option: --name

Here's my output of singularity --version (installed via apt-get on Ubuntu ):

Singularity 1.00 (commit: 2ebc2f3f2059b96885416167363bde2e27ece106)
Running under Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
pygame 2.1.2 (SDL 2.0.20, Python 3.10.6)
Hello from the pygame community. https://www.pygame.org/contribute.html

With docker, I execute with root permissions. The image can be pulled. The problem is that apparently the container tries to write in places where the input data sits.

Here is my command:

sudo nextflow run main.nf -profile docker --genome /nas-hs/projs/data/Drosophila_melanogaster/data/genome.fasta.masked.gz --taxid 7240 --outdir .

Here is my error message:

Error executing process > 'UncompressFASTA (genome.fasta.masked.gz)'

Caused by:
  Process `UncompressFASTA (genome.fasta.masked.gz)` terminated with an error exit status (125)

Command executed:

  if [ ! -s  genome.fasta.masked ]; then
      echo "unzipping genome genome.fasta.masked.gz"
      gunzip -c genome.fasta.masked.gz > genome.fasta.masked;
  fi

Command exit status:
  125

Command output:
  (empty)

Command error:
  docker: Error response from daemon: error while creating mount source path '/nas-hs/projs/data/Drosophila_melanogaster/data': mkdir /nas-hs: read-only file system.
  time="2023-03-25T04:07:50+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/66/4a183e6f9fa512936ae52466a8b48b

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

It is rather clear how to fix (don't pipe the unpacked genome to any other place but the output directory).

Next, I copied the gzipped genome into a folder where root has writing permissions and tried again. I am using reference species Drosophila simulans with taxon id 7240. That is indeed the taxon id in NCBI Taxonomy

Here is my new call:

sudo nextflow run main.nf -profile docker --genome genome.fasta.masked.gz --taxid 7240 --outdir .

It fails to find the taxon for some obscure reason. There are most definitely D. simulans proteins at NCBI for this taxon, I have the protein set on my harddrive, too. I just don't know how to start the pipeline with a local protein set. Or maybe it finds the proteins, but s.th. goes wrong looking for geneid parameters for this taxon?

Error messages:

N E X T F L O W  ~  version 22.10.7
Launching `main.nf` [clever_bardeen] DSL2 - revision: bb4f07340a

GeneidX
=============================================
output			: /home/katharina/git/geneidx/output
genome			: genome.fasta.masked.gz
taxon			: 7240

WARN: A process with name 'getFASTA2' is defined more than once in module script: /home/katharina/git/geneidx/subworkflows/CDS_estimates.nf -- Make sure to not define the same function as process
[-        ] process > UncompressFASTA -
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[-        ] process > prot_down_wo... -
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
executor >  local (4)
[fe/ef4540] process > UncompressFA... [  0%] 0 of 1
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [  0%] 0 of 1
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
executor >  local (4)
[fe/ef4540] process > UncompressFA... [  0%] 0 of 1
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [  0%] 0 of 1
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[88/8e185c] process > param_select... [  0%] 0 of 1
[-        ] process > param_select... -
[56/3eda8f] process > param_value_... [  0%] 0 of 1
[-        ] process > param_value_... -
[-        ] process > creatingPara... -
[-        ] process > geneid_WORKF... -
[-        ] process > geneid_WORKF... -
[-        ] process > prep_concat     -
[-        ] process > concatenate_... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff34portal     -
Error executing process > 'param_selection_workflow:getParamName (7240)'

Caused by:
  Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)

Command executed:

  #!/usr/bin/env python3
  # coding: utf-8
  
  import os, sys
  import pandas as pd
  import requests
  from lxml import etree
  
  
  # Define an alternative in case everything fails
  selected_param = "Homo_sapiens.9606.param"
  
  
  # Define functions
  def choose_node(main_node, sp_name):
      for i in range(len(main_node)):
          if main_node[i].attrib["scientificName"] == sp_name:
              #print(main_node[i].attrib["rank"],
              #    main_node[i].attrib["taxId"],
              #    main_node[i].attrib["scientificName"])
              return main_node[i]
      return None
  
  
  # given a node labelled with the species, and with
  # lineage inside it returns the full path of the lineage
  def sp_to_lineage_clean ( sp_sel ):
      lineage = []
  
      if sp_sel is not None:
          lineage.append(sp_sel.attrib["taxId"])
  
      for taxon in sp_sel:
          #print(taxon.tag, taxon.attrib)
          if taxon.tag == 'lineage':
              lin_pos = 0
              for node in taxon:
                  if "rank" in node.attrib.keys():
                      lineage.append( node.attrib["taxId"] )
                  else:
                      lineage.append( node.attrib["taxId"] )
                  lin_pos += 1
      return(lineage)
  
  
  
  def get_organism(taxon_id):
      response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false") ##
      if response.status_code == 200:
          root = etree.fromstring(response.content)
          species = root[0].attrib
          lineage = []
          for taxon in root[0]:
              if taxon.tag == 'lineage':
                  for node in taxon:
                      lineage.append(node.attrib["taxId"])
      return lineage
  
  
  if 0:
      ###
      # We want to update the lists as new parameters may have been added
      ###
  
      # List files in directory
      list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
      list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]
  
  
      # Put the list into a dataframe
      data_names = pd.DataFrame(list_species_taxid, columns = ["Species_name", "taxid"])
  
  
      # Generate the dataframe with the filename and lineage information
      list_repeats_taxids = []
      for species_no_space, taxid in zip(data_names.loc[:,"Species_name"], data_names.loc[:,"taxid"]):
          species = species_no_space.replace("_", " ")
          response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
          xml = response.content
          if xml is None or len(xml) == 0:
              continue
  
          root = etree.fromstring(xml)
      #     print(species)
          sp_sel = choose_node(root, species)
          if sp_sel is None:
              continue
      #     print(sp_sel.attrib.items())getParamName
          lineage_sp = sp_to_lineage_clean(sp_sel)
  
          param_species = f"{species_no_space}.{taxid}.param"
          list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
          # print((ens_sp, species, link_species, lineage_sp))
  
  
      # Put the information into a dataframe
      data = pd.DataFrame(list_repeats_taxids, columns = ["species", "taxid", "parameter_file", "taxidlist"])
  
      data.to_csv("Parameter_files.taxid/params_df.tsv", sep = "", index = False)
      # print("New parameters saved")
  
  
  
  else:
      ###
      # We want to load the previously generated dataframe
      ###
      data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep = "	")
  
      def split_n_convert(x):
          return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
      data.loc[:,"taxidlist"] = data.loc[:,"taxidlist"].apply(split_n_convert)
  
  # Following either one or the other strategy we now have N parameters to choose.
  # print(data.shape[0], "parameters available to choose")
  
  
  
  ###
  # Separate the lineages into a single taxid per row
  ###
  exploded_df = data.explode("taxidlist")
  exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
  exploded_df.loc[:,"taxid"] = exploded_df.loc[:,"taxid"].astype(int)
  
  ###
  # Get the species of interest lineage
  ###
  query = pd.DataFrame(get_organism(int(7240)))
  query.columns = ["taxid"]
  query.loc[:,"taxid"] = query.loc[:,"taxid"].astype(int)
  # print(query)
  
  ###
  # Intersect the species lineage with the dataframe of taxids for parameters
  ###
  intersected_params = query.merge(exploded_df, on = "taxid")
  # print(intersected_params.shape)
  
  ###
  # If there is an intersection, select the parameter whose taxid appears
  #   less times, less frequency implies more closeness
  ###
  if intersected_params.shape[0] > 0:
      #print(intersected_params.loc[:,"taxid"].value_counts().sort_values())
  
      taxid_closest_param = intersected_params.loc[:,"taxid"].value_counts().sort_values().index[0]
      #print(taxid_closest_param)
  
      selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:,"parameter_file"].iloc[0]
executor >  local (4)
[fe/ef4540] process > UncompressFA... [100%] 1 of 1, failed: 1 ✘
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [100%] 1 of 1, failed: 1 ✘
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[88/8e185c] process > param_select... [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_select... -
[56/3eda8f] process > param_value_... [  0%] 0 of 1
[-        ] process > param_value_... -
[-        ] process > creatingPara... -
[-        ] process > geneid_WORKF... -
[-        ] process > geneid_WORKF... -
[-        ] process > prep_concat     -
[-        ] process > concatenate_... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff34portal     -
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'param_selection_workflow:getParamName (7240)'

Caused by:
  Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)

Command executed:

  #!/usr/bin/env python3
  # coding: utf-8
  
  import os, sys
  import pandas as pd
  import requests
  from lxml import etree
  
  
  # Define an alternative in case everything fails
  selected_param = "Homo_sapiens.9606.param"
  
  
  # Define functions
  def choose_node(main_node, sp_name):
      for i in range(len(main_node)):
          if main_node[i].attrib["scientificName"] == sp_name:
              #print(main_node[i].attrib["rank"],
              #    main_node[i].attrib["taxId"],
              #    main_node[i].attrib["scientificName"])
              return main_node[i]
      return None
  
  
  # given a node labelled with the species, and with
  # lineage inside it returns the full path of the lineage
  def sp_to_lineage_clean ( sp_sel ):
      lineage = []
  
      if sp_sel is not None:
          lineage.append(sp_sel.attrib["taxId"])
  
      for taxon in sp_sel:
          #print(taxon.tag, taxon.attrib)
          if taxon.tag == 'lineage':
              lin_pos = 0
              for node in taxon:
                  if "rank" in node.attrib.keys():
                      lineage.append( node.attrib["taxId"] )
                  else:
                      lineage.append( node.attrib["taxId"] )
                  lin_pos += 1
      return(lineage)
  
  
  
  def get_organism(taxon_id):
      response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false") ##
      if response.status_code == 200:
          root = etree.fromstring(response.content)
          species = root[0].attrib
          lineage = []
          for taxon in root[0]:
              if taxon.tag == 'lineage':
                  for node in taxon:
                      lineage.append(node.attrib["taxId"])
      return lineage
  
  
  if 0:
      ###
      # We want to update the lists as new parameters may have been added
      ###
  
      # List files in directory
      list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
      list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]
  
  
      # Put the list into a dataframe
      data_names = pd.DataFrame(list_species_taxid, columns = ["Species_name", "taxid"])
  
  
      # Generate the dataframe with the filename and lineage information
      list_repeats_taxids = []
      for species_no_space, taxid in zip(data_names.loc[:,"Species_name"], data_names.loc[:,"taxid"]):
          species = species_no_space.replace("_", " ")
          response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
          xml = response.content
          if xml is None or len(xml) == 0:
              continue
  
          root = etree.fromstring(xml)
      #     print(species)
          sp_sel = choose_node(root, species)
          if sp_sel is None:
              continue
      #     print(sp_sel.attrib.items())getParamName
          lineage_sp = sp_to_lineage_clean(sp_sel)
  
          param_species = f"{species_no_space}.{taxid}.param"
          list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
          # print((ens_sp, species, link_species, lineage_sp))
  
  
      # Put the information into a dataframe
      data = pd.DataFrame(list_repeats_taxids, columns = ["species", "taxid", "parameter_file", "taxidlist"])
  
      data.to_csv("Parameter_files.taxid/params_df.tsv", sep = "", index = False)
      # print("New parameters saved")
  
  
  
  else:
      ###
      # We want to load the previously generated dataframe
      ###
      data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep = "	")
  
      def split_n_convert(x):
          return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
      data.loc[:,"taxidlist"] = data.loc[:,"taxidlist"].apply(split_n_convert)
  
  # Following either one or the other strategy we now have N parameters to choose.
  # print(data.shape[0], "parameters available to choose")
  
  
  
  ###
  # Separate the lineages into a single taxid per row
  ###
  exploded_df = data.explode("taxidlist")
  exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
  exploded_df.loc[:,"taxid"] = exploded_df.loc[:,"taxid"].astype(int)
  
  ###
  # Get the species of interest lineage
  ###
  query = pd.DataFrame(get_organism(int(7240)))
  query.columns = ["taxid"]
  query.loc[:,"taxid"] = query.loc[:,"taxid"].astype(int)
  # print(query)
  
  ###
  # Intersect the species lineage with the dataframe of taxids for parameters
  ###
  intersected_params = query.merge(exploded_df, on = "taxid")
  # print(intersected_params.shape)
  
  ###
  # If there is an intersection, select the parameter whose taxid appears
  #   less times, less frequency implies more closeness
  ###
  if intersected_params.shape[0] > 0:
      #print(intersected_params.loc[:,"taxid"].value_counts().sort_values())
  
      taxid_closest_param = intersected_params.loc[:,"taxid"].value_counts().sort_values().index[0]
      #print(taxid_closest_param)
  
      selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:,"parameter_file"].iloc[0]
executor >  local (4)
[fe/ef4540] process > UncompressFA... [100%] 1 of 1, failed: 1 ✘
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [100%] 1 of 1, failed: 1 ✘
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[88/8e185c] process > param_select... [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_select... -
[56/3eda8f] process > param_value_... [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_value_... -
[-        ] process > creatingPara... -
[-        ] process > geneid_WORKF... -
[-        ] process > geneid_WORKF... -
[-        ] process > prep_concat     -
[-        ] process > concatenate_... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff34portal     -
Execution cancelled -- Finishing pending tasks before exit
Oops ...

Error executing process > 'param_selection_workflow:getParamName (7240)'

Caused by:
  Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)

Command executed:

  #!/usr/bin/env python3
  # coding: utf-8
  
  import os, sys
  import pandas as pd
  import requests
  from lxml import etree
  
  
  # Define an alternative in case everything fails
  selected_param = "Homo_sapiens.9606.param"
  
  
  # Define functions
  def choose_node(main_node, sp_name):
      for i in range(len(main_node)):
          if main_node[i].attrib["scientificName"] == sp_name:
              #print(main_node[i].attrib["rank"],
              #    main_node[i].attrib["taxId"],
              #    main_node[i].attrib["scientificName"])
              return main_node[i]
      return None
  
  
  # given a node labelled with the species, and with
  # lineage inside it returns the full path of the lineage
  def sp_to_lineage_clean ( sp_sel ):
      lineage = []
  
      if sp_sel is not None:
          lineage.append(sp_sel.attrib["taxId"])
  
      for taxon in sp_sel:
          #print(taxon.tag, taxon.attrib)
          if taxon.tag == 'lineage':
              lin_pos = 0
              for node in taxon:
                  if "rank" in node.attrib.keys():
                      lineage.append( node.attrib["taxId"] )
                  else:
                      lineage.append( node.attrib["taxId"] )
                  lin_pos += 1
      return(lineage)
  
  
  
  def get_organism(taxon_id):
      response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false") ##
      if response.status_code == 200:
          root = etree.fromstring(response.content)
          species = root[0].attrib
          lineage = []
          for taxon in root[0]:
              if taxon.tag == 'lineage':
                  for node in taxon:
                      lineage.append(node.attrib["taxId"])
      return lineage
  
  
  if 0:
      ###
      # We want to update the lists as new parameters may have been added
      ###
  
      # List files in directory
      list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
      list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]
  
  
      # Put the list into a dataframe
      data_names = pd.DataFrame(list_species_taxid, columns = ["Species_name", "taxid"])
  
  
      # Generate the dataframe with the filename and lineage information
      list_repeats_taxids = []
      for species_no_space, taxid in zip(data_names.loc[:,"Species_name"], data_names.loc[:,"taxid"]):
          species = species_no_space.replace("_", " ")
          response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
          xml = response.content
          if xml is None or len(xml) == 0:
              continue
  
          root = etree.fromstring(xml)
      #     print(species)
          sp_sel = choose_node(root, species)
          if sp_sel is None:
              continue
      #     print(sp_sel.attrib.items())getParamName
          lineage_sp = sp_to_lineage_clean(sp_sel)
  
          param_species = f"{species_no_space}.{taxid}.param"
          list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
          # print((ens_sp, species, link_species, lineage_sp))
  
  
      # Put the information into a dataframe
      data = pd.DataFrame(list_repeats_taxids, columns = ["species", "taxid", "parameter_file", "taxidlist"])
  
      data.to_csv("Parameter_files.taxid/params_df.tsv", sep = "", index = False)
      # print("New parameters saved")
  
  
  
  else:
      ###
      # We want to load the previously generated dataframe
      ###
      data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep = "	")
  
      def split_n_convert(x):
          return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
      data.loc[:,"taxidlist"] = data.loc[:,"taxidlist"].apply(split_n_convert)
  
  # Following either one or the other strategy we now have N parameters to choose.
  # print(data.shape[0], "parameters available to choose")
  
  
  
  ###
  # Separate the lineages into a single taxid per row
  ###
  exploded_df = data.explode("taxidlist")
  exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
  exploded_df.loc[:,"taxid"] = exploded_df.loc[:,"taxid"].astype(int)
  
  ###
  # Get the species of interest lineage
  ###
  query = pd.DataFrame(get_organism(int(7240)))
  query.columns = ["taxid"]
  query.loc[:,"taxid"] = query.loc[:,"taxid"].astype(int)
  # print(query)
  
  ###
  # Intersect the species lineage with the dataframe of taxids for parameters
  ###
  intersected_params = query.merge(exploded_df, on = "taxid")
  # print(intersected_params.shape)
  
  ###
  # If there is an intersection, select the parameter whose taxid appears
  #   less times, less frequency implies more closeness
  ###
  if intersected_params.shape[0] > 0:
      #print(intersected_params.loc[:,"taxid"].value_counts().sort_values())
  
      taxid_closest_param = intersected_params.loc[:,"taxid"].value_counts().sort_values().index[0]
      #print(taxid_closest_param)
  
      selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:,"parameter_file"].iloc[0]
      print("/home/katharina/git/geneidx/data/Parameter_files.taxid/", selected_param, sep = "/", end = '')

Command exit status:
  127

Command output:
  (empty)

Command error:
  docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/etc/shadow" to rootfs at "/etc/shadow": mount /etc/shadow:/etc/shadow (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown.
  time="2023-03-25T04:11:47+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/88/8e185cac6455345234538354fbf905

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Best wishes,

Katharina

EmptyDataError and empty annotation file

Hi Ferriol,

I hope you're doing well. I've been trying to use the geneidx Nextflow pipeline that you developed, but I've run into a couple of issues that I wanted to bring to your attention.

I encountered an error when trying to predict genes in the genome assembly GCA_000002985.3 (C. elegans). The pipeline failed with the following error message: pandas.errors.EmptyDataError: No columns to parse from file at the "pyComputeIntrons" subworkflow.
I also noticed that when I tried to annotate only chromosome V, the pipeline completed successfully, but the annotation file was empty, even though we know there are thousands of genes in chromosome V.

I would really appreciate it if you could take a look into these issues and let me know if there's anything I can do to work around them or if there's a fix that you can implement.
Let me know if you need more information or data to help reproduce the problem.

Thanks in advance,
Pablo

Failing on sequence without introns?

Hello,
I wanted to try your pipeline on a small Wheat sequence, but I faced several issues, starting from pyComputeIntrons process. Because intron_l is empty (I posit it's because I do not have introns in this little sequence?), the pandas indexing returns a Keyerror on dd_intron = dd_intron[[0,4,1,3]].
I fixed this in a dirty way by specifying columns=range(5) in the previous pd.DataFrame call.

Then, findORF fails, because <name>_gffread.fa is empty (so not a fasta file):

RuntimeError: <name>_gffread.fa is not fasta or fastq sequence file

I am not sure what I could do to avoid this issue.

For reference, the sequence I used is available on https://framagit.org/geniomhe/genomics/structural-genomics-project-r4-2023/-/blob/main/data/region4.fasta, and I used the following command:

nextflow run geneidx/main.nf -profile docker \
    --tsv ./data/geneidx.tsv \
    --column_taxid_value TAXID \
    --column_path_value PATH \
    --column_id_value ID \
    --outdir results/geneidx

on the following TSV:

ID	PATH	TAXID
region4	data/region4.fasta.gz	4565

How to run geneidx in local workstation via docker?

Hi,
I have this humble question, newbie to this domain of container.
I have all pre-requested already installed

Docker
Singularity
Nextflow
and the input data

The first thing I did I googled for the docker container :
docker pull ferriolcalvet/geneidx
It seems to be installed with success.
this is log from terminal
latest: Pulling from ferriolcalvet/geneidx 23858da423a6: Pulling fs layer 326f452ade5c: Pulling fs layer a42821cd14fb: Pulling fs layer 8471b75885ef: Pull complete 8ffa7aaef404: Pull complete 15132af73342: Pull complete f4aeaae2f8da: Pull complete 5d8aa591a389: Pull complete 4645a1f8dbad: Pull complete b8861be24c63: Pull complete 5067931dafb0: Pull complete 2af56fb3cf52: Pull complete 253e785a8a50: Pull complete f42d46aac9d9: Pull complete f9f8117fb535: Pull complete e15f4b46da28: Pull complete 99bd3a5629a5: Pull complete 27043e5ccd2d: Pull complete 1292524c3142: Pull complete f7b36a643fbe: Pull complete 860293db19cb: Pull complete f3dafd5c9e29: Pull complete b1325f591394: Pull complete 5065735ed845: Pull complete ab5a5ae1904e: Pull complete ad9050f7ac01: Pull complete Digest: sha256:de5f906aa7ac2fccc68235d3329eb89a06b811bc6cc7515035f1bc770127a1b5 Status: Downloaded newer image for ferriolcalvet/geneidx:latest docker.io/ferriolcalvet/geneidx:latest
2nd
I use nexflow cmd
nextflow run guigolab/geneidx -with-docker --genome reference.fa.gz --taxid 562
and this the msg I got
N E X T F L O W ~ version 22.10.0 Project guigolab/geneidxis currently stickied on revision: main -- you need to explicitly specify a revision with the option-r in order to use it

So how can I run geneidx?

Thank you in advance
Ben

Low BUSCO score on predicted geneset

Hi Ferriol!

I hope you're doing well. I recently tried out your gene prediction tool and wanted to give you some feedback. I found that the results were not as good as those obtained with more popular tools. In fact, the completeness of the gene predictions was only around 34%, compared to ~90% using other tools.
I obtained the genemark and augustus scores from the genesets generated by the braker pipeline. For the protein mode I used Swissprot as reference. I ran geneidx with default parameters. It was very fast indeed. Do you know what could improve the results?

All the best,
Pablo

Run plain geneid

Add nextflow parameter to run plain geneid after geneid parameter file selection.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.