Some new questions about how to process sequences longer than 512, about dnabert (5 comments, CLOSED)

jerryji1993 commented on August 26, 2024

Some new questions about how to process sequences longer than 512

from dnabert.

Comments (5)

BinchaoPeng commented on August 26, 2024

Sorry to disturb you. I have read issues #5, #11, #16, and #21.
First, in #5 (11 Dec 2020), the answer was to set MODEL_TYPE to dnalongcat to process sequences with seq_len > 512. Then, in #11 (17 Jan 2021), the answer was to truncate or split the sequence. And in #18 (8 Apr 2021), you said to use --model_type dnalong. With one question and three different answers, I am confused and don't know how to process long sequences correctly.
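For reference, the "truncate or split" option can be sketched in plain Python. This is only an illustration under stated assumptions, not DNABERT's own preprocessing: it assumes overlapping k-mers with stride 1 as tokens and a budget of 510 tokens per chunk (512 positions minus [CLS] and [SEP]). The helper names `seq_to_kmers` and `chunk_tokens` are hypothetical.

```python
def seq_to_kmers(seq, k=6):
    """Turn a DNA string into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def chunk_tokens(tokens, max_tokens=510, overlap=0):
    """Split a token list into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens, so context at the
    window boundaries is not lost entirely.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Toy example: a 1000-base sequence gives 995 6-mers, i.e. two chunks.
sequence = "ACGT" * 250
kmers = seq_to_kmers(sequence, k=6)
chunks = chunk_tokens(kmers, max_tokens=510)
print(len(kmers), [len(c) for c in chunks])
```

Each chunk can then be fed to the model separately and the per-chunk predictions or embeddings combined (for example by averaging). How best to combine them is a modelling choice that this sketch does not settle.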

Second, I followed your answer in #18 (8 Apr 2021), since it is the latest: I changed "model_type": "bert" to "model_type": "dnalongcat" in config.json, and changed "max_len": 512 to "max_len": 3072 in tokenizer_config.json. However, it doesn't work. Could you provide a demo or write a detailed note? After all, you are the most familiar with how to do it.

Third, regarding #16: is output[1] the [CLS] token's hidden state, i.e. the sentence vector? And which is better for a classification task, output[0] (word vectors) or output[1] (sentence vector)?

Finally, a suggestion: sometimes it is more convenient to be shown code that we can add to our own project to solve the problem, rather than only a final command line.

Please reply soon if you see this. Thanks very much!

Sincerely, PBC

sheetalgiri commented on August 26, 2024

@BinchaoPeng did you find a workaround?

victormaricato commented on August 26, 2024

Third, regarding #16: is output[1] the [CLS] token's hidden state, i.e. the sentence vector? And which is better for a classification task, output[0] (word vectors) or output[1] (sentence vector)?

About this:

Recent Transformer papers and the Hugging Face documentation suggest that averaging the last hidden states (mean(output[0])) is often better than using [CLS] (output[1]), as the average is a better "semantic" representation.

Since DNABERT uses the [CLS] embedding, and Transformers for DNA sequences are quite new and may behave differently from NLP Transformers, it is probably best to test both aggregation techniques.
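To make the two aggregation options concrete, here is a minimal NumPy sketch. It assumes output[0] has shape (batch, seq_len, hidden) and that an attention mask marks the real (non-padding) tokens. The arrays are random stand-ins rather than real DNABERT outputs, and `mean_pool` and `cls_hidden_state` are hypothetical helper names. Note that in Hugging Face's BertModel, output[1] is the pooled output, i.e. the [CLS] hidden state passed through an extra dense layer and tanh, whereas output[0][:, 0] is the raw [CLS] hidden state. Whether your DNABERT checkpoint follows the same convention is an assumption worth checking.

```python
import numpy as np


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over non-padding positions only."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid division by zero
    return summed / counts


def cls_hidden_state(last_hidden_state):
    """Raw hidden state of the first token ([CLS]), without the pooler."""
    return last_hidden_state[:, 0, :]


# Random stand-in for output[0]: 2 sequences, 5 tokens, hidden size 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])  # second sequence has 2 padding tokens

mean_vec = mean_pool(hidden, mask)
cls_vec = cls_hidden_state(hidden)
print(mean_vec.shape, cls_vec.shape)
```

Either vector can be passed to the same classification head, so comparing the two on a validation set is a matter of swapping one function call.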

BinchaoPeng commented on August 26, 2024

So that means output[0] is better. Thanks!

jerryji1993 commented on August 26, 2024

Closed.
