Comments (5)
I am sorry to disturb you. I have seen issues such as #5, #11, #16, and #21.
First, in #5 (11 Dec 2020) you suggested setting MODEL_TYPE to dnalongcat to process sequences with seq_len > 512; then, in #11 (17 Jan 2021), the suggestion was to truncate or split the sequence; and in #18 (8 Apr 2021) you said to use --model_type dnalong. With one question and three different answers, I am confused and don't know how to process long sequences correctly.
Second, following your answer in #18 (8 Apr 2021), since it is the latest, I changed "model_type": "bert" to "model_type": "dnalongcat" in config.json, and changed "max_len": 512 to "max_len": 3072 in tokenizer_config.json. However, it doesn't work, so I hope you can make a demo or write a detailed note; after all, you are the most familiar with how to do it.
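For reference, the two JSON edits I made can be applied with a short script like the following (a sketch only; the file paths are assumptions and should point at your fine-tuned model directory):

```python
import json

def patch_config(config_path="config.json",
                 tokenizer_path="tokenizer_config.json",
                 model_type="dnalongcat", max_len=3072):
    """Apply the two edits described above to the model's JSON files."""
    # Change "model_type" in config.json
    with open(config_path) as f:
        config = json.load(f)
    config["model_type"] = model_type
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

    # Change "max_len" in tokenizer_config.json
    with open(tokenizer_path) as f:
        tok = json.load(f)
    tok["max_len"] = max_len
    with open(tokenizer_path, "w") as f:
        json.dump(tok, f, indent=2)
```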
Third, in #16, is the [CLS] token's hidden state output[1]? Does that also mean it is the sentence vector? And which is better for a classification task: output[0] (the word vectors) or output[1] (the sentence vector)?
Finally, a suggestion: sometimes it is more convenient to add your code into our project to solve a problem, rather than giving only a final command line.
Please reply soon if you see this. Thanks very much!
Sincerely, PBC
from dnabert.
@BinchaoPeng did you find a workaround?
Third, in #16, is the [CLS] token's hidden state output[1]? Does that also mean it is the sentence vector? And which is better for a classification task: output[0] (the word vectors) or output[1] (the sentence vector)?
About this: in recent Transformer papers and in the Hugging Face documentation, averaging the last hidden states (the mean of output[0]) is preferred over using the [CLS] embedding (output[1]), as it gives a better "semantic" representation.
Since DNABERT uses the [CLS] embedding, and transformers for DNA sequences are quite new and may behave differently from NLP transformers, it is probably best to test both aggregation techniques.
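To make the two aggregation options concrete, here is a small sketch using random arrays in place of real model outputs. Note that in Hugging Face's BertModel, output[1] is actually the [CLS] hidden state passed through an extra pooler layer; here it is approximated by just taking the [CLS] position:

```python
import numpy as np

# Toy stand-in for a transformer's outputs (NOT DNABERT itself):
# output[0] has shape (batch, seq_len, hidden); output[1] derives from [CLS].
batch, seq_len, hidden = 2, 8, 4
rng = np.random.default_rng(0)
last_hidden_states = rng.normal(size=(batch, seq_len, hidden))  # ~ output[0]

# Option 1: the [CLS] position as the sentence vector (~ output[1], minus the pooler)
cls_vec = last_hidden_states[:, 0, :]

# Option 2: mean of the last hidden states, often a better semantic representation
mean_vec = last_hidden_states.mean(axis=1)

print(cls_vec.shape, mean_vec.shape)  # prints (2, 4) (2, 4)
```

Either vector can then be fed to a downstream classifier; testing both on your task is the safest choice.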
That means output[0] is better. Thanks!
Closed.