
Comments (4)

kkalouli commented on July 16, 2024

OK, I figured out the answer myself, so I'm posting it here in case it helps someone else:

The float[][] array that you get back (e.g. by indexing embeddings[0]) always has 128 rows, because this configuration value (max_seq_len) is loaded together with the pretrained model (in my case: bert-uncased-L-12-H-768-A-12). The first position of this array is always occupied by [CLS], [SEP] appears in its expected place, and the remaining positions that do not "correspond" to the given sentence are filled with padding. See https://github.com/hanxiao/bert-as-service#getting-elmo-like-contextual-word-embedding for what this looks like (check the "Getting ELMo-like contextual word embedding" section) and https://github.com/google-research/bert (check the Tokenization section). This means that you cannot do a one-to-one translation of the 128-position array to your original sentence.

There are two options as I see it:

  1. Change max_seq_len from the original configuration when loading the model so that it exactly fits the size of your sentence ("Set it to NONE for dynamically using the longest sequence in a (mini)batch.", from https://github.com/hanxiao/bert-as-service#getting-elmo-like-contextual-word-embedding). This is more straightforward in the original Python implementation. In the Java implementation, maxSequenceLength is an integer (it cannot be set to None as in Python), so one would need to overwrite the current value. But I think this is easy. In the getInputs() method of the Bert class we could add something like:
maxSequenceLength = tokens.length + 2; // +2 is needed for the reserved [CLS] and [SEP] at the beginning and end of the sequence

@robrua would you consider adding this? According to https://github.com/hanxiao/bert-as-service, a smaller maxSequenceLength is also more efficient (faster); see the question "How about the speed? Is it fast enough for production?".

  2. Write a piece of code to map the positions of the BERT array back to the original tokens of the sentence. This is also proposed at https://github.com/google-research/bert. I converted it to Java and am pasting it here in case somebody else wants to use it. It gives you a mapping from each of your original tokens to the position of its first subtoken within the BERT array, similarly to the link above.
public HashMap<String, Integer> matchOriginalTokens2BERTTokens(String[] originalTokens) {
    ArrayList<String> bertTokens = new ArrayList<String>();
    HashMap<String, Integer> orig2TokenMap = new HashMap<String, Integer>();
    // create a wordpiece tokenizer
    FullTokenizer tokenizer = new FullTokenizer(new File("/path/to/file/vocab.txt"), true);
    // BERT sequences start with the reserved [CLS] token
    bertTokens.add("[CLS]");
    // go through the original tokens
    for (String origToken : originalTokens) {
        // record the position of the first subtoken of this word
        // (note: if the same token string occurs twice in the sentence,
        // the later occurrence overwrites the earlier mapping)
        orig2TokenMap.put(origToken, bertTokens.size());
        // tokenize the current original token with the wordpiece tokenizer
        String[] subTokens = tokenizer.tokenize(origToken);
        // add each subtoken to bertTokens, so that its running size stays
        // in step with the positions of the BERT output array
        for (String tok : subTokens) {
            bertTokens.add(tok);
        }
    }
    // BERT sequences end with the reserved [SEP] token
    bertTokens.add("[SEP]");
    return orig2TokenMap;
}
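The index bookkeeping can be checked without the model or the vocabulary file. In the sketch below, splitWordpiece is a toy stand-in for the real wordpiece tokenizer (an assumption for illustration only); the position arithmetic is the same as in the method above:

```java
import java.util.Arrays;

public class WordpieceMapping {
    // Toy stand-in for the wordpiece tokenizer (an assumption for
    // illustration): the real FullTokenizer splits words against a
    // vocabulary file instead of this hard-coded rule.
    static String[] splitWordpiece(String word) {
        if (word.equals("faked")) {
            return new String[]{"fake", "##d"};
        }
        return new String[]{word};
    }

    // Map each original token's index to the BERT-array position of its
    // first subtoken; position 0 is reserved for [CLS], mirroring the
    // bookkeeping in matchOriginalTokens2BERTTokens.
    static int[] mapOriginalToBert(String[] originalTokens) {
        int[] map = new int[originalTokens.length];
        int bertPos = 1; // position 0 holds [CLS]
        for (int i = 0; i < originalTokens.length; i++) {
            map[i] = bertPos;
            bertPos += splitWordpiece(originalTokens[i]).length;
        }
        return map;
    }

    public static void main(String[] args) {
        String[] sentence = {"he", "faked", "it"};
        // "he" -> 1, "faked" -> 2 (its first subtoken "fake"), "it" -> 4
        System.out.println(Arrays.toString(mapOriginalToBert(sentence)));
    }
}
```

Running this prints [1, 2, 4]: "faked" occupies two positions, so "it" lands at position 4 rather than 3.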

from easy-bert.

robrua commented on July 16, 2024

Reopening this to remind myself to add this in the future.

On (2): Right then, I hadn't read it closely enough and failed to notice you were tracking the start indices for each token. I'll probably end up including something very similar to this, just integrated into the tokenizer itself to avoid needing to run it twice on each sequence.


robrua commented on July 16, 2024

Hey, thanks for the research and the detailed issue.

For (1) that's an excellent idea to add here and I'll look into allowing dynamic max sequence length on both the Python and Java ends next time I sit down and do some work on this project.

For (2) there's an extra complication involved here in matching the output token vectors back to the original source: BERT uses a wordpiece vocabulary which may split a single word from your sequence into multiple subtokens before inputting it to the model. Because of this, the output size doesn't necessarily match the number of words in the input sequence (even after considering the [CLS] and [SEP] tokens); you'd need to inject some "tracking" logic into the tokenizer to keep track of any words that are getting subdivided during tokenization. There's no reason this wouldn't work, and I think providing a way to match the output vectors to each word in the input sequence would be useful, so I'll also take a look at this in the future.


kkalouli commented on July 16, 2024

Hi Rob!

Thanks for considering adding (1) to the code.

About (2): you are right that this is not straightforward because of the special wordpiece tokenizer BERT uses, but the code in the link I posted (https://github.com/google-research/bert), which I converted to Java above, takes this into account. It uses that same wordpiece tokenizer to tokenize the words and keeps track of how each word is split: e.g. the verb "faked" is tokenized as "fake" + "d", each of these two subtokens getting its own vector. The code tracks the position of the first subtoken of each word, so for "faked" it gives you back the vector of "fake" rather than that of "d". In other words, the code always tracks the first token of each word, which is also the base form of the word.
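Given that mapping, picking out the "representative" vector for a split word is a one-line lookup. The embedding values below are toy data (an assumption), standing in for the model output, with row 0 playing the role of [CLS]:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class FirstSubtokenLookup {
    // Select the vector of a word's first subtoken, given the mapping
    // produced by matchOriginalTokens2BERTTokens and a BERT output array.
    static float[] vectorFor(String word, Map<String, Integer> orig2TokenMap,
                             float[][] embeddings) {
        return embeddings[orig2TokenMap.get(word)];
    }

    public static void main(String[] args) {
        // Toy output: one 3-dimensional vector per array position (real
        // easy-bert output would have 768 dimensions and max_seq_len rows).
        float[][] embeddings = {
            {0.0f, 0.0f, 0.0f}, // 0: [CLS]
            {0.1f, 0.2f, 0.3f}, // 1: "he"
            {0.4f, 0.5f, 0.6f}, // 2: "fake" (first subtoken of "faked")
            {0.7f, 0.8f, 0.9f}, // 3: "d"    (second subtoken of "faked")
            {1.0f, 1.1f, 1.2f}  // 4: "it"
        };
        // The mapping as matchOriginalTokens2BERTTokens would produce it.
        Map<String, Integer> orig2TokenMap = new HashMap<>();
        orig2TokenMap.put("he", 1);
        orig2TokenMap.put("faked", 2);
        orig2TokenMap.put("it", 4);
        // "faked" resolves to the vector at position 2 ("fake"),
        // not the one for the "d" suffix.
        System.out.println(Arrays.toString(vectorFor("faked", orig2TokenMap, embeddings)));
    }
}
```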

Thanks!


