jyonn / gnrs

news representation learning

Python 100.00%

gnrs's Introduction

GreenRec: Green AI Benchmarking for News Recommendation

Environment

pip install -r requirements.txt

Data Processing

Before running the processor, set the path to the data inside the Python file:

cd process/mind
python processor.py

Configuration

Data

Please refer to config_v2/data/mind.yaml for the data configuration.

Model

We support the following models on both MIND small and large datasets:

	NAML	LSTUR	NRMS	DCN	DIN	BST
ID-based	ID-NAML	ID-LSTUR	ID-NRMS	DCN	DIN	BST
text-based	NAML	LSTUR	NRMS	text-DCN	text-DIN	text-BST
PLMNR	PLMNR-NAML	PLMNR-LSTUR	PLMNR-NRMS	PLMNR-DCN	PLMNR-DIN	PLMNR-BST
BERT	BERT-NAML	BERT-LSTUR	BERT-NRMS	BERT-DCN	BERT-DIN	BERT-BST
MFT	MFT-NAML	MFT-LSTUR	MFT-NRMS	MFT-DCN	MFT-DIN	MFT-BST

Training and Testing

python worker.py \
    --config config/data/mind.yaml \
    --model config/model/nrms.yaml \
    --exp config/exp/tt-nrms.yaml \
    --embed config/embed/null.yaml \
    --version small-v2
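
To run several models with the same flags, the command above can be assembled programmatically. The sketch below is only an illustration: the helper name build_command is hypothetical, and it assumes every model has config files named config/model/<model>.yaml and config/exp/tt-<model>.yaml, a pattern the README only confirms for nrms.

```python
import shlex
from typing import List


def build_command(model: str, version: str = "small-v2") -> List[str]:
    """Build the worker.py argument list shown in the README.

    Assumption: other models follow the same file naming as nrms.
    """
    return [
        "python", "worker.py",
        "--config", "config/data/mind.yaml",
        "--model", f"config/model/{model}.yaml",
        "--exp", f"config/exp/tt-{model}.yaml",
        "--embed", "config/embed/null.yaml",
        "--version", version,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("nrms")))
```

Passing the returned list to subprocess.run would launch the job; printing it first makes it easy to check that the config files actually exist.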


gnrs's Issues

How to load bert embeddings

I tried to load BERT embeddings of the news texts with 'bert-token.yaml', using 'dcn.yaml' as the recommendation model. After preprocessing the data with bert_processor.py, I realized that it only tokenizes the text. When data.npy is loaded in embedding_loader.py, printing the embedding shows only token IDs and no BERT embeddings. How can I extract the BERT embeddings and load them into the model?

Printing the embedding variable gives:
{'nid': array([0, 1, 2, ..., 65235, 65236, 65237], dtype=object), 'cat': array([list([9580]), list([2740]), list([2739]), ..., list([2739]), dtype=object),
'title': array([list([1996, 9639, 3035, 3870, 1010, 3159, 2798, 1010, 1998, 3159, 5170, 8415, 2011]),..., list([3901, 1997, 4916, 2237, 5998, 2007, 3571, 2044, 9288]),dtype=object),
'abs': array([list([4497, 1996, 14960, 2015, 1010, 17764, 1010, 1998, 2062, 2008, 1996, 15426, 2064, 1005, 1056, 2444, 2302, 1012]), list([2122, 9428, 19741, 14243, 2024, 3173, 2017, 2067, 1998, 4363, 2017, 2013, 8328, 4667, 2008, 18162, 7579, 6638, 2005, 2204, 1012]),...,list([])], dtype=object)}

The error:
Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 196, in __init__
    self.embedding_manager.load_pretrained_embedding(**Obj.raw(embedding_info))
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_manager.py", line 66, in load_pretrained_embedding
    self.pretrained[vocab_name] = EmbeddingInfo(**kwargs).load()
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 39, in load
    self.embedding = getter(self.path)
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 21, in get_numpy_embedding
    return torch.tensor(embedding, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Also, the configs are a bit confusing to me. If I want to load BERT embeddings without using image features, can I use the following config chain?
mind.yaml-->dcn/din/bst/pnn.yaml-->tt.yaml-->bert-token.yaml
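
The TypeError in this issue comes from passing a NumPy array with dtype=object to torch.tensor. The printed dump suggests that the file holds ragged lists of token IDs rather than a float matrix. The sketch below uses only NumPy to show one way to tell the two apart and to pad ragged token lists into a rectangular integer array. The helper names are hypothetical and not part of the repository, and the padding ID of 0 is an assumption. Producing actual BERT embeddings would still require running the padded IDs through a BERT model, which is not shown here.

```python
import numpy as np


def looks_like_embedding_matrix(arr) -> bool:
    """True for a 2-D floating-point array, the layout of an embedding table."""
    return (
        isinstance(arr, np.ndarray)
        and arr.ndim == 2
        and np.issubdtype(arr.dtype, np.floating)
    )


def pad_token_lists(token_lists, pad_id: int = 0, max_len=None) -> np.ndarray:
    """Pad (or truncate) ragged lists of token IDs into an int64 array."""
    if max_len is None:
        max_len = max((len(t) for t in token_lists), default=0)
    out = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        tokens = list(tokens)[:max_len]
        if tokens:  # empty entries stay fully padded
            out[row, :len(tokens)] = tokens
    return out


# Ragged token lists like the 'title' field in the dump (values shortened).
titles = np.empty(3, dtype=object)
titles[0] = [1996, 9639, 3035]
titles[1] = [3901, 1997]
titles[2] = []

print(titles.dtype)                         # object: torch.tensor() rejects this dtype
print(looks_like_embedding_matrix(titles))  # False
padded = pad_token_lists(titles)
print(padded.shape, padded.dtype)           # (3, 3) int64
```

An array like padded converts to a tensor without error, but it is still token input for an encoder, not a pretrained embedding table.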

TypeError: model.operator.attention_operator.AttentionOperatorConfig() got multiple values for keyword argument 'hidden_size'

I followed the README exactly, but after running worker.py with the same configs, I got the TypeError shown in the traceback below. Could you please tell me how to fix it? @Jyonn Thanks.
Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 219, in __init__
    self.recommender = self.recommender_class(
  File "/Users/chuanqijiao/GNRS-master/model/recommenders/base_neg_recommender.py", line 26, in __init__
    super().__init__(**kwargs)
  File "/Users/chuanqijiao/GNRS-master/model/recommenders/base_recommender.py", line 76, in __init__
    self.user_config = self.user_encoder_class.config_class(
TypeError: model.operator.attention_operator.AttentionOperatorConfig() got multiple values for keyword argument 'hidden_size'
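
In Python, this message is raised when a keyword argument is passed explicitly and also appears in a dictionary unpacked with ** in the same call. The traceback suggests that hidden_size reaches AttentionOperatorConfig both ways, though where the duplicate comes from in the repository's configs is not confirmed here. The sketch below reproduces the mechanism with a hypothetical stand-in class (DemoConfig) and shows one way to avoid it by merging the values before the call.

```python
class DemoConfig:
    """Hypothetical stand-in for a config class that accepts hidden_size."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        self.hidden_size = hidden_size
        self.num_heads = num_heads


# Settings as they might come from a YAML file (illustrative values).
yaml_settings = {"hidden_size": 256, "num_heads": 4}

message = ""
try:
    # hidden_size is given explicitly AND inside the unpacked dict.
    DemoConfig(hidden_size=64, **yaml_settings)
except TypeError as err:
    message = str(err)
    print(message)

# One fix: merge into a single dict first; later keys override earlier ones.
merged = {**yaml_settings, "hidden_size": 64}
config = DemoConfig(**merged)
print(config.hidden_size, config.num_heads)  # 64 4
```

In practice this usually means checking whether hidden_size is set both in the model YAML and by the code that builds the user encoder config, and keeping only one of them.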

How to get original text data in the training process

I noticed that the data in a batch is represented as Unitok objects, which are the result of tokenization with Unitok (after processor.py). Is there a way to map these tokenized results back to the original text data? For example, if an nid token of 234 maps to N25648 in the original dataset, can the original title then be looked up using N25648?
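
One way to do this kind of reverse lookup, sketched under assumptions: if the nid vocabulary is available as an ordered list in which position i holds the original news ID for token index i, the index can be mapped back to the ID and then looked up in MIND's news.tsv, whose tab-separated columns begin with news ID, category, subcategory, title, and abstract. How Unitok actually stores its vocabulary is not shown in this document, so the in-memory vocabulary list and the helper names below are hypothetical.

```python
import csv
import io


def load_titles(news_tsv_text: str) -> dict:
    """Map news ID -> title from news.tsv content (column 0 = ID, column 3 = title)."""
    reader = csv.reader(
        io.StringIO(news_tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row[0]: row[3] for row in reader if len(row) > 3}


def index_to_title(index: int, nid_vocab: list, titles: dict) -> str:
    """Translate a token index to the original news ID, then to its title."""
    return titles[nid_vocab[index]]


# Tiny illustrative data; real files are much larger.
nid_vocab = ["N55528", "N25648"]  # position = token index (assumed layout)
news_tsv = (
    "N55528\tlifestyle\troyals\tFirst example title\tFirst abstract\n"
    "N25648\tnews\tworld\tSecond example title\tSecond abstract\n"
)

titles = load_titles(news_tsv)
print(nid_vocab[1])                          # N25648
print(index_to_title(1, nid_vocab, titles))  # Second example title
```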
