
finbert's Introduction

FinBERT: Financial Sentiment Analysis with BERT

The FinBERT sentiment analysis model is now available on the Hugging Face model hub. You can get the model here.

FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus and then fine-tuning it for financial sentiment classification. For details, please see FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.
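For quick experimentation with the hub version, the model can be loaded directly through the transformers library. A minimal sketch (the ProsusAI/finbert model id and the positive/negative/neutral label order are taken from the model's config.json quoted later on this page; the input sentence is purely illustrative):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

inputs = tokenizer("Operating profit rose clearly above market expectations.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label in the model config maps 0/1/2 to positive/negative/neutral
print({model.config.id2label[i]: round(p.item(), 4) for i, p in enumerate(probs)})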

Important Note: The FinBERT implementation relies on Hugging Face's pytorch_pretrained_bert library and its implementation of BERT for sequence classification tasks. pytorch_pretrained_bert is an earlier version of the transformers library. Migrating the FinBERT code to transformers is at the top of our priorities for the near future.

Installing

Install the dependencies by creating the Conda environment finbert from the given environment.yml file and activating it.

conda env create -f environment.yml
conda activate finbert

Models

The FinBERT sentiment analysis model is available on the Hugging Face model hub. You can get the model here.

Or, you can download the models from the links below:

For both of these models, the workflow is as follows:

  • Create a directory for the model. For example: models/sentiment/<model directory name>
  • Download the model and put it into the directory you just created.
  • Put a copy of config.json in this same directory.
  • Call the model with .from_pretrained(<model directory name>)
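A rough sketch of that workflow in code, using the pytorch_pretrained_bert API this repo currently relies on (the directory name finbert-sentiment below is only a placeholder example):

from pytorch_pretrained_bert.modeling import BertForSequenceClassification

# models/sentiment/<model directory name> should contain the downloaded pytorch_model.bin plus config.json
model_dir = "models/sentiment/finbert-sentiment"   # placeholder directory name
model = BertForSequenceClassification.from_pretrained(model_dir, cache_dir=None, num_labels=3)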

Datasets

There are two datasets used for FinBERT. Further training of the language model is done on a subset of the Reuters TRC2 dataset. This dataset is not public, but researchers can apply for access here.

For sentiment analysis, we used the Financial PhraseBank from Malo et al. (2014). The dataset can be downloaded from this link. If you want to train the model on the same dataset, after downloading it you should create three files under the data/sentiment_data folder: train.csv, validation.csv, and test.csv. To create these files, follow these steps:

  • Download the Financial PhraseBank from the above link.
  • Get the path of the Sentences_50Agree.txt file inside the FinancialPhraseBank-v1.0 zip.
  • Run the datasets script: python scripts/datasets.py --data_path <path to Sentences_50Agree.txt>
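If you prefer to build the splits yourself instead of using scripts/datasets.py, the idea is roughly the following. This is a hedged sketch, not the repository's script: it assumes the PhraseBank lines look like "sentence@label", that latin-1 decoding is needed, and that the split ratios and column names can be chosen freely; _read_tsv in finbert/utils.py expects tab-separated files, so the splits are written with a tab delimiter.

import pandas as pd
from sklearn.model_selection import train_test_split

# each PhraseBank line is "<sentence>@<label>"
rows = []
with open("Sentences_50Agree.txt", encoding="latin-1") as f:
    for line in f:
        sentence, _, label = line.rstrip("\n").rpartition("@")
        rows.append({"text": sentence.strip(), "label": label.strip()})
data = pd.DataFrame(rows)

# illustrative 60/20/20 split; the repository's script may use different proportions
train, rest = train_test_split(data, test_size=0.4, random_state=0)
valid, test = train_test_split(rest, test_size=0.5, random_state=0)

for name, split in [("train", train), ("validation", valid), ("test", test)]:
    split.to_csv(f"data/sentiment_data/{name}.csv", sep="\t", index=False)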

Training the model

Training is done in the finbert_training.ipynb notebook. The trained model will be saved to models/classifier_model/finbert-sentiment. The training parameters can be found in the notebook as follows:

config = Config(   data_dir=cl_data_path,
                   bert_model=bertmodel,
                   num_train_epochs=4.0,
                   model_dir=cl_path,
                   max_seq_length = 64,
                   train_batch_size = 32,
                   learning_rate = 2e-5,
                   output_mode='classification',
                   warm_up_proportion=0.2,
                   local_rank=-1,
                   discriminate=True,
                   gradual_unfreeze=True )

The last two parameters, discriminate and gradual_unfreeze, determine whether to apply the corresponding techniques against catastrophic forgetting.
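In the notebook, this config feeds a trainer object that loads the data splits and runs the fine-tuning. A condensed sketch of the flow (the FinBert wrapper and its prepare_model/create_the_model helpers are assumptions based on the notebook and on the calls quoted in the issues further down this page; paths are placeholders):

from pathlib import Path
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from finbert.finbert import FinBert, Config   # assumed import path

lm_path = 'models/language_model/finbertTRC2'                 # further-trained language model (placeholder)
cl_path = Path('models/classifier_model/finbert-sentiment')
cl_data_path = Path('data/sentiment_data')

bertmodel = BertForSequenceClassification.from_pretrained(lm_path, cache_dir=None, num_labels=3)
config = Config(data_dir=cl_data_path, bert_model=bertmodel, model_dir=cl_path,
                max_seq_length=64, train_batch_size=32, learning_rate=2e-5,
                num_train_epochs=4.0, output_mode='classification', warm_up_proportion=0.2,
                local_rank=-1, discriminate=True, gradual_unfreeze=True)

finbert = FinBert(config)                                              # assumed wrapper class
finbert.prepare_model(label_list=['positive', 'negative', 'neutral'])  # assumed helper
model = finbert.create_the_model()                                     # assumed helper

train_data = finbert.get_data('train')     # reads data/sentiment_data/train.csv
trained_model = finbert.train(train_examples=train_data, model=model)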

Getting predictions

We provide a script for quickly getting sentiment predictions with FinBERT. Given a .txt file, predict.py produces a .csv file containing the sentences in the text, the softmax probabilities for the three labels, the predicted label, and a sentiment score (calculated as the probability of positive minus the probability of negative).

Here's an example with the provided example text: test.txt. From the command line, simply run:

python predict.py --text_path test.txt --output_dir output/ --model_path models/classifier_model/finbert-sentiment
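The same prediction can also be run from Python by importing predict from finbert.finbert (a minimal sketch based on the imports and calls quoted in the issues below; the classifier directory is a placeholder and must contain pytorch_model.bin plus config.json):

from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from finbert.finbert import predict

model_path = 'models/classifier_model/finbert-sentiment'   # placeholder
model = BertForSequenceClassification.from_pretrained(model_path, cache_dir=None, num_labels=3)

with open('test.txt') as f:
    text = f.read()

# per the README, the result holds the sentences, label probabilities, prediction and sentiment score
result = predict(text, model)
result.to_csv('output/predictions.csv', index=False)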

Disclaimer

This is not an official Prosus product. It is the outcome of an intern research project in the Prosus AI team.

About Prosus

Prosus is a global consumer internet group and one of the largest technology investors in the world. Operating and investing globally in markets with long-term growth potential, Prosus builds leading consumer internet companies that empower people and enrich communities. For more information, please visit www.prosus.com.

Contact information

Please contact Dogu Araci dogu.araci[at]prosus[dot]com and Zulkuf Genc zulkuf.genc[at]prosus[dot]com with any FinBERT-related issues and questions.

finbert's People

Contributors

ashater, doguaraci, pvdb2178, theofpa


finbert's Issues

Unable to train the model

Hi,

I downloaded the dataset from the Financial PhraseBank of Malo et al. (2014) and created train.csv using the data.
train_data = finbert.get_data('train')
For the above code snippet in the finbert_training notebook, the error message below was generated.

Is there any method to resolve this issue?

Thanks..


UnicodeDecodeError Traceback (most recent call last)
in
7 #print(cl_data_path)
8 # Get the training examples
----> 9 train_data = finbert.get_data('train')

~\Documents\FIN_BERT\finBERT-master\finbert\finbert.py in get_data(self, phase)
192 self.num_train_optimization_steps = None
193 examples = None
--> 194 examples = self.processor.get_examples(self.config.data_dir, phase)
195 self.num_train_optimization_steps = int(
196 len(

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in get_examples(self, data_dir, phase)
89 Name of the .csv file to be loaded.
90 """
---> 91 return self._create_examples(self._read_tsv(os.path.join(data_dir, (phase + ".csv"))), phase)
92
93 def get_labels(self):

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in _read_tsv(cls, input_file)
66 reader = csv.reader(f, delimiter="\t")
67 lines = []
---> 68 for line in reader:
69 if sys.version_info[0] == 2:
70 line = list(unicode(cell, 'utf-8') for cell in line)

D:\Python\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7919: character maps to <undefined>

Unable to download models

I am unable to download files through git-lfs because
"" This repository is over its data quota. Account responsible for LFS bandwidth should purchase
more data packs to restore access. ""

Could you please provide the models through an alternate resource or upgrade the LFS data packs? If that is not possible could you provide the train/test/validation sets used to train the classifier? Thank you!

model not valid

The first model to download, "Language model trained on TRC2", is not valid; I get wrong predictions with it.
The second one seems to be fine.

torch error / environment file on windows

Hi,

When installing the environment file the following error occurs:

Pip subprocess output:
Collecting joblib==0.13.2
  Using cached joblib-0.13.2-py2.py3-none-any.whl (278 kB)
Collecting pytorch-pretrained-bert==0.6.2
  Using cached pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123 kB)
Collecting scikit-learn==0.21.2
  Using cached scikit_learn-0.21.2-cp37-cp37m-win_amd64.whl (5.9 MB)
Collecting spacy==2.1.4
  Using cached spacy-2.1.4-cp37-cp37m-win_amd64.whl (29.0 MB)

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5))


CondaEnvException: Pip failed

As a consequence, packages cannot be loaded successfully in the created environment.

The error seems common for Windows users (#21). However, updating Python did not do the trick for me here.

I use Windows 10, Python 3.7.7, torch 1.5.1 and conda 4.8.3.

Hopefully you can provide me with some assistance to get FinBERT up and running on Windows.

Training division by 0 error

Thanks for sharing your code on this matter. I have used your trained model; however, I want to try to train the model on my own using the datasets you mentioned.
But when running finbert_training.ipynb I hit an error on trained_model = finbert.train(train_examples = train_data, model = model):
[error screenshot]

I have tried to debug the code; for some reason step is always 0. I was wondering if you could give me some hints on how to fix this issue :)

Does it support paragraph prediction?

I am predicting paragraphs with this model.
The output separates sentences randomly.
Certainly it has something to do with the fact that my text data is not clean.
I wonder whether this is a need for some others too.
Thank you.

Is finBert cased or uncased?

Hi! Thanks for developing and sharing the code.

I wonder which vanilla BERT model you used for post-training on financial domain text.

To be specific, I wonder whether this FinBERT model can tell the difference between uppercase and lowercase.

predict() in Predict.py function issue "only one element tensors can be converted to Python scalars"

When I call predict, I get the error "only one element tensors can be converted to Python scalars" on line 618 of finbert.py.
When I modify the line from:
logits = softmax(np.array(logits))
to
logits = softmax(np.array(logits[0]))

I get no error, but the predictions and sentiment scores do not seem right when I test on examples.csv. The logits look like [small number, .99..., small number], so the labels are all negative and the scores are all around -0.99.

For reference, I copied predict.py, finbert.py, and utils.py into a Jupyter notebook and used the following as my model:
model = BertForSequenceClassification.from_pretrained("ipuneetrathore/bert-base-cased-finetuned-finBERT", num_labels=3, cache_dir=None)

pytorch_model.bin file

Hi!

I just downloaded this .bin file and was wondering how to get the following two files (Language model trained on TRC2 & Sentiment analysis model trained on Financial PhraseBank) from it?

Thanks!!

Error in _read_tsv when trying to read in the data

Hey there!
I found that when trying to train the model there was an error, because the _read_tsv call to open on line 63 of utils.py didn't have the encoding specified.

To fix it I changed it from:

def _read_tsv(cls, input_file):
    with open(input_file, "r") as f:

to:

def _read_tsv(cls, input_file):
    with open(input_file, "r", encoding='utf-8') as f:

Now it works great! Thanks!

loading pytorch_model.bin

I downloaded and placed the language model (pytorch_model.bin) and config.json in the same directory, and when I run the "configuring training parameters" cell I get the error below:

INFO - pytorch_pretrained_bert.modeling - loading archive file /directory/pytorch_model.bin
INFO - pytorch_pretrained_bert.modeling - extracting archive file /directory/pytorch_model.bin to temp dir /tmp/tmppjpupb
ReadError: not a gzip file

Looking in more detail:
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1645 fileobj.close()
1646 if mode == 'r':
-> 1647 raise ReadError("not a gzip file")
1648 raise
1649 except:

Preprocessing using TRC2

Hello, you mentioned in one of your responses that you used 50 finance keywords to extract the finance-related text. Do you mind sharing the keywords you used?

Thanks

Sentence Representation Layer

Is it possible to acquire a sentence representation for each sentence by removing the last few layers of the model?
If yes, which layer outputs the sentence representation?

Thanks!

Update the environment file

Hi,

I am trying to set up the environment. It always pops up with a pip & torch version error. Can you please update the yml for me?
I wanted to check your solution.

script for further pre-training on Reuters TRC2

Hi,

Thank you for making the code available.

As per the readme file, I understand that there are two models:

  1. language_model that has been further pre-trained on Reuters TRC2
  2. classifier_model that has been fine-tuned on Financial Phrasebank.

finbert_training.ipynb is used to load the language_model and fine-tune it on Financial Phrasebank.

I was wondering if you could make the script used for further pre-training the language_model available too.

Thanks!

lm_fine_tuning

Hi,

With reference to my previous issue posted here, I followed what you suggested but I still couldn't fine-tune on my sample domain. I was wondering if you could take a look at this.

Thanks!

Error running the configuring parameters cell

Good morning,

I am running the configuring parameters cell and I am getting the below error:


UnpicklingError Traceback (most recent call last)
in
5 pass
6
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
8
9

~/anaconda3/envs/finbert/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
601 if state_dict is None and not from_tf:
602 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603 state_dict = torch.load(weights_path, map_location='cpu')
604 if tempdir:
605 # Clean up temp dir

~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
385 f = f.open('rb')
386 try:
--> 387 return _load(f, map_location, pickle_module, **pickle_load_args)
388 finally:
389 if new_fd:

~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
562 f.seek(0)
563
--> 564 magic_number = pickle_module.load(f, **pickle_load_args)
565 if magic_number != MAGIC_NUMBER:
566 raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, 'v'.

Moreover, can you kindly explain how I can construct the files train.csv, validation.csv, test.csv?

Regards,
Bernard

TypeError: unsupported operand type(s) for /: 'str' and 'str' at trained_model = finbert.train(train_examples = train_data, model = model)

I'm facing an issue with this code:

trained_model = finbert.train(train_examples = train_data, model = model)

Error is

TypeError                                 Traceback (most recent call last)
<ipython-input-11-2ebf0cb3d4e8> in <module>
----> 1 trained_model = finbert.train(train_examples = train_data, model = model)

~\finBERT-master\finbert\finbert.py in train(self, train_examples, model)
    482                     print('No best model found')
    483                 torch.save({'epoch': str(i), 'state_dict': model.state_dict()},
--> 484                            self.config.model_dir / ('temporary' + str(i)))
    485                 best_model = i
    486 

TypeError: unsupported operand type(s) for /: 'str' and 'str'

Vocabulary

Hello,

Could you include the vocab.txt for finBERT? I don't see it in the model's directory, and it seems like you are using the bert-base-uncased vocabulary for constructing the tokenizer (https://github.com/ProsusAI/finBERT/blob/master/finbert/finbert.py):

self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=self.config.do_lower_case)

Thank you,
Andrei

Where is the config.json for Sentiment analysis model trained on Financial PhraseBank

Please provide the information where to download the config.json for Sentiment analysis model trained on Financial PhraseBank

The instructions say to place a copy of config.json. The link to download the Sentiment analysis model trained on Financial PhraseBank gives the bin file only; it does not include the config.json.

For both of these model, the workflow should be like this:

Create a directory for the model. For example: models/sentiment/<model directory name>
Download the model and put it into the directory you just created.
Put a copy of config.json in this same directory. <------------------------------
Call the model with .from_pretrained(<model directory name>)

The config.json on the Hugging Face hub looks to be for TRC2.

{
  "_name_or_path": "/home/ubuntu/finbert/models/language_model/finbertTRC2",   <-----
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "positive": 0,
    "negative": 1,
    "neutral": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "type_vocab_size": 2,
  "vocab_size": 30522
}

DataFrame id overlap

On line 628 of finbert.py you use result = pd.concat([result,batch_result]) when it should be result = pd.concat([result,batch_result], ignore_index=True).

In your result DataFrame, when you concatenate multiple batches together you will have ids that are the same, e.g. with 2 batches of 3 items the indexes in result will be 0,1,2,0,1,2.

If you were to convert the DataFrame to a dictionary, the results would override each other because duplicate index keys exist.
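A small standalone illustration of the index overlap, independent of FinBERT:

import pandas as pd

a = pd.DataFrame({'prediction': ['positive', 'neutral', 'negative']})
b = pd.DataFrame({'prediction': ['neutral', 'neutral', 'positive']})

print(pd.concat([a, b]).index.tolist())                      # [0, 1, 2, 0, 1, 2] -> duplicate ids
print(pd.concat([a, b], ignore_index=True).index.tolist())   # [0, 1, 2, 3, 4, 5]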

pip install transformers is necessary to Dockerfile

Hello. I just would like to report that running a container created from the image built from the Dockerfile in the current master repository causes a ModuleNotFoundError.

I solved this issue by changing code in Dockerfile.

Before:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors

After:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors transformers

I would appreciate if you check it out.

pre-training script on trc2

Hi there, great work and thanks for sharing!

I am currently trying to reproduce the pre-training of BERT on the TRC2 corpus for research purposes (I have access to the corpus, so this is not a request for data). Could you please share the pre-processing code you used for pre-training BERT to produce FinBERT, how you ingested the data into BERT pre-training, etc.?

Best regards-

Using Finbert for 240 multilabel multiclass classification

I have labels of dimension 240; it is a multi-label classification problem.

I downloaded the FinBERT model:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("/home/pratik/finbert", use_fast=True)

train_tokenizer_texts = list(map(lambda t: tokenizer.tokenize(t,add_special_tokens=True,max_length=512,padding='max_length'), tqdm(train_sentences)))

np.array(train_tokenizer_texts[0])
%%time
#Initiating a BERT model
model = AutoModelForSequenceClassification.from_pretrained("/home/pratik/finbert", num_labels = 240)
model.cuda()

This gives me the error:

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
	size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([240, 768]).
	size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([240]).
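The mismatch occurs because the checkpoint carries a 3-way sentiment head while a 240-way head is requested. One possible workaround, assuming a reasonably recent transformers version (this is a suggestion, not something confirmed in this thread), is to let from_pretrained drop the incompatible head and initialize a fresh one:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "/home/pratik/finbert",                        # path from the snippet above
    num_labels=240,
    problem_type="multi_label_classification",     # use a sigmoid/BCE head for multi-label targets
    ignore_mismatched_sizes=True,                  # discard the 3-way classifier weights, init a 240-way head
)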

sample set of TRC2

Hi,

Since the TRC2 dataset cannot be made publicly available, I was wondering if a sample training dataset in a similar format could be provided.

Thanks!

How to force predict.py to use GPU? Please help.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.cuda(device)

gives this error!

File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 989, in forward
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 267, in forward
words_embeddings = self.word_embeddings(input_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select
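The traceback indicates the model weights are on the GPU while the input tensors built inside predict are still on the CPU. A hedged sketch of the usual fix, moving each batch to the model's device before the forward pass (the tensor names follow the snippet quoted in the next issue; this patches the prediction loop in finbert.py rather than being a standalone script):

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# inside the prediction loop in finbert.py, before calling the model:
all_input_ids = all_input_ids.to(device)
all_attention_mask = all_attention_mask.to(device)
all_token_type_ids = all_token_type_ids.to(device)

with torch.no_grad():
    logits = model(all_input_ids, all_attention_mask, all_token_type_ids)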

AxisError when calling predict via REST API on Flask

I succeeded in training the model by executing finbert_training.ipynb. Then I ran a Docker container from the Dockerfile and threw a POST request at the container (localhost:8080), which showed me the following error.

08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   loading archive file /src/models/classifier_model/finbert-sentiment
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "_name_or_path": "c:\\Users\\user\\projects\\finbert\\finBERT\\models\\language_model\\finbertTRC2",
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "type_vocab_size": 2,
  "vocab_size": 30522
}

08/27/2021 17:03:45 - INFO - pytorch_pretrained_bert.modeling -   Weights from pretrained model not used in BertForSequenceClassification: ['bert.embeddings.position_ids']
 * Serving Flask app 'main' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
08/27/2021 17:03:45 - WARNING - werkzeug -    * Running on all addresses.
   WARNING: This is a development server. Do not use it in a production deployment.
08/27/2021 17:03:45 - INFO - werkzeug -    * Running on http://172.17.0.2:8080/ (Press CTRL+C to quit)
08/27/2021 17:10:41 - INFO - filelock -   Lock 140200106040192 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 10.4kB/s]
08/27/2021 17:10:42 - INFO - filelock -   Lock 140200106040192 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
08/27/2021 17:10:43 - INFO - filelock -   Lock 140200106041056 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 406kB/s]
08/27/2021 17:10:43 - INFO - filelock -   Lock 140200106041056 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
08/27/2021 17:10:44 - INFO - filelock -   Lock 140200106040720 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 490kB/s]
08/27/2021 17:10:45 - INFO - filelock -   Lock 140200106040720 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
08/27/2021 17:10:46 - INFO - filelock -   Lock 140200106040768 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 595kB/s]
08/27/2021 17:10:47 - INFO - filelock -   Lock 140200106040768 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
['The Federal Reserve is committed to using its full range of tools to support the U.S. economy in this challenging time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:10:49 - INFO - finbert.utils -   *** Example ***
08/27/2021 17:10:49 - INFO - finbert.utils -   guid: 0
08/27/2021 17:10:49 - INFO - finbert.utils -   tokens: [CLS] the federal reserve is committed to using its full range of tools to support the u . s . economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:10:49 - INFO - finbert.utils -   input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 1057 1012 1055 1012 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 
3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1849d90>]
08/27/2021 17:10:49 - INFO - root -   tensor([ 2.1882, -2.1247, -0.7895])
[ 2.1882384  -2.1246738  -0.78948754]
08/27/2021 17:10:49 - ERROR - main -   Exception on / [POST]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/src/main.py", line 21, in score
    return(predict(text, model).to_json(orient='records'))
  File "/src/finbert/finbert.py", line 615, in predict
    logits = softmax(np.array(logits))
  File "/src/finbert/utils.py", line 215, in softmax
    e_x = np.exp(x - np.max(x, axis=1)[:, None])
  File "<__array_function__ internals>", line 5, in amax
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:10:49 - INFO - werkzeug -   172.17.0.1 - - [27/Aug/2021 17:10:49] "POST / HTTP/1.1" 500 -
['The Federal Reserve is committed to using its full range of tools to support the US economy in this challenging 
time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:22:27 - INFO - finbert.utils -   *** Example ***
08/27/2021 17:22:27 - INFO - finbert.utils -   guid: 0
08/27/2021 17:22:27 - INFO - finbert.utils -   tokens: [CLS] the federal reserve is committed to using its full range of tools to support the us economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:22:27 - INFO - finbert.utils -   input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 2149 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1004d90>]
08/27/2021 17:22:27 - INFO - root -   tensor([ 2.0840, -2.0827, -0.8532])
[ 2.0839536  -2.0827212  -0.85315543]
08/27/2021 17:22:27 - ERROR - main -   Exception on / [POST]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/src/main.py", line 21, in score
    return(predict(text, model).to_json(orient='records'))
  File "/src/finbert/finbert.py", line 615, in predict
    logits = softmax(np.array(logits))
  File "/src/finbert/utils.py", line 215, in softmax
    e_x = np.exp(x - np.max(x, axis=1)[:, None])
  File "<__array_function__ internals>", line 5, in amax
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:22:27 - INFO - werkzeug -   172.17.0.1 - - [27/Aug/2021 17:22:27] "POST / HTTP/1.1" 500 -

I have solved this issue by changing code in finbert.py
Before:

with torch.no_grad():
            logits = model(all_input_ids, all_attention_mask, all_token_type_ids)[0]

After:

with torch.no_grad():
            logits = model(all_input_ids, all_attention_mask, all_token_type_ids)

I would appreciate it if you could check this out.

Train model

Dear all,

When I train the model, the following error occurred. Is there any method to solve this issue?

Thanks.

[error screenshot]

no code for FiQA sentiment classification task?

I did not find code for the FiQA aspect-based sentiment task, and I wonder how the RoBERTa model handles aspect-based sentiment, which is different from the vanilla sentiment classification task. Thanks a lot.

The vocab.txt for finbertTRC2 model

Hi Sir,
Thank you so much for sharing the code.
I notice that the vocab.txt file is missing from the finbertTRC2 folder; could you tell me where I can find this file?

Thanks!

finBERT pretrained model giving issues while calling it

Hi, I have Ubuntu version 16 and I have cloned finBERT on my machine. I have also created a folder and put the required files from the Hugging Face website into it. Everything was working fine; I had the environment created by running the environment.yml file provided on GitHub.

But when I call AutoModelForSequenceClassification.from_pretrained(file_path, cache_dir=None, num_labels=3):

1. I am getting an "Unable to open file (file signature not found)" error.

2. Also, when I check the version of torch, it is 1.1.0 in my terminal whereas it is 1.6.0 in the Jupyter notebook.

All this code is being executed in a Jupyter notebook.

Here is the image of the error I am getting:
[error screenshot]

Incorrect prediction

Hi,

I used the pre-trained models on financial news of a company for text classification. Given below are three sentences which should have been predicted as positive. Is there a more refined model available, or do we need to fine-tune it further ourselves?
[screenshot of the sentences and predictions]

Thanks,

error using predict.py

Hi, I tried to run predict.py as described in the readme:
!python predict.py --text_path="test.txt" --output_dir="output/" --model_path="pytorch_model.bin"

but I got the following error:
usage: predict.py [-h] [--data_path DATA_PATH]
predict.py: error: unrecognized arguments: --text_path=test.txt --output_dir=output/ --model_path=pytorch_model.bin

What is the problem? It looks like it is using another file, but I can't figure out which one and why!
Would you please help me?
Regards

Dot at the end of short sentences

I was playing around with the finBERT model a bit and I noticed that for short sentences, having a period at the end makes a big difference to the model's predictions (see Figures 1-2 below).

Any idea why that is the case? Could it be that the model was fine-tuned on sentences with a period at the end and that's why it makes such a difference? Or does it have to do with the BERT embeddings, i.e. is there a special embedding for a period?

Figure 1 (short sentence, no period at the end):

[screenshot]

Figure 2 (short sentence, a period at the end):

[screenshot]

pytorch_pretrained_bert not found

Thank you for your great work!

When I run the example predict.py I get the errors below. Should you add pytorch_pretrained_bert to your environment.yml?

Traceback (most recent call last):
File "predict.py", line 1, in
from finbert.finbert import predict
File "C:\Projects\Python\GitHub\finBERT\finbert\finbert.py", line 6, in
from pytorch_pretrained_bert.tokenization import BertTokenizer
ModuleNotFoundError: No module named 'pytorch_pretrained_bert'

How to use finBERT using Hugging Face model?

I tried this code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

but get this error.

OSError: Can't load 'ProsusAI/finbert'. Make sure that:

- 'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'ProsusAI/finbert' is the correct path to a directory containing a 'config.json' file

Cannot load the model

When trying to load the model I get the following error. Do I need to download any additional model files? Thank you.

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-8-19ad01fc2649> in <module>
      5     pass
      6 
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
      8 
      9 

/anaconda/envs/bert10k/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    601         if state_dict is None and not from_tf:
    602             weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603             state_dict = torch.load(weights_path, map_location='cpu')
    604         if tempdir:
    605             # Clean up temp dir

/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    424         if sys.version_info >= (3, 0) and 'encoding' not in pickle_load_args.keys():
    425             pickle_load_args['encoding'] = 'utf-8'
--> 426         return _load(f, map_location, pickle_module, **pickle_load_args)
    427     finally:
    428         if new_fd:

/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
    601             f.seek(0)
    602 
--> 603     magic_number = pickle_module.load(f, **pickle_load_args)
    604     if magic_number != MAGIC_NUMBER:
    605         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, 'v'.

UnicodeDecodeError when trying to load model saved locally

Hello, I trained the model with my own parameters, and saved it.
However, whenever I try to use it, I get the following error:

UnicodeDecodeError Traceback (most recent call last)
in
4 tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
5
----> 6 model = AutoModelForSequenceClassification.from_pretrained("C:/Users/Verena/Documents/finbert_new/models/classifier_model/finbert-sentiment.bin")
7 label_list = label_list=['positive','negative','neutral']

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\modeling_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1237 if not isinstance(config, PretrainedConfig):
1238 config, kwargs = AutoConfig.from_pretrained(
-> 1239 pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
1240 )
1241

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
339 {'foo': False}
340 """
--> 341 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
342
343 if "model_type" in config_dict:

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
387 )
388 # Load config dict
--> 389 config_dict = cls._dict_from_json_file(resolved_config_file)
390
391 except EnvironmentError as err:

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in _dict_from_json_file(cls, json_file)
470 def _dict_from_json_file(cls, json_file: str):
471 with open(json_file, "r", encoding="utf-8") as reader:
--> 472 text = reader.read()
473 return json.loads(text)
474

~\anaconda3\envs\finbert\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The same happens when I try to load the language model, even though both models are downloaded locally. I was only able to use finbert through transformers.

Can you please help me? Thanks!
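The traceback shows AutoConfig trying to parse the .bin file itself as a JSON config, so the likely cause (an assumption from the stack trace, not confirmed in the thread) is that from_pretrained was given the weights file instead of the model directory. Pointing it at the directory that contains both config.json and pytorch_model.bin should avoid the decode error; a sketch with a placeholder path:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "C:/Users/Verena/Documents/finbert_new/models/classifier_model/finbert-sentiment"  # directory, not the .bin file
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
label_list = ['positive', 'negative', 'neutral']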

Error when calling finbert.train()

I'm trying to run FinBERT for stock market prediction based on SEC filings. I am using the finbert_training notebook as a reference.

When running:

trained_model = finbert.train(train_examples = train_data, model = model)

I get the following error:

[error screenshot]

I ran the datasets.py script, and my dataset looks like this:
[dataset screenshot]

Could you please guide me as for what I could do?

Thank you!

Sentiment classifier finetuning Input Format

Hi,

Thanks for sharing the work.
In order to run the notebook finbert_training.ipynb for fine-tuning a sentiment classifier, I could not understand the train.csv (test, validation) format.
The Financial PhraseBank dataset has files where the sentences and labels are separated by @.

Can you tell me the format in which these csv files should be prepared?

I see the following code for reading the .csv in utils.py but could not get it:

    def _read_tsv(cls, input_file):
        """Reads a tab separated value file."""
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t")
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
                lines.append(line)
        return lines

Thanks.
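For reference, the _read_tsv reader above splits each line on tab characters, so each row of train.csv (and validation.csv, test.csv) should be the fields of one example separated by tabs; the "@" separator of the raw PhraseBank files has to be converted first, which is what scripts/datasets.py does. A hedged sketch of that conversion (file names, encoding and the absence of a header row are illustrative assumptions; check the output of scripts/datasets.py for the exact layout):

import csv

# convert one raw PhraseBank file ("sentence@label" per line) into the
# tab-separated layout that _read_tsv can parse
with open("Sentences_50Agree.txt", encoding="latin-1") as src, \
     open("data/sentiment_data/train.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for line in src:
        sentence, _, label = line.rstrip("\n").rpartition("@")
        writer.writerow([sentence.strip(), label.strip()])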
