Comments (13)

bminixhofer commented on May 16, 2024

Please rebase this onto main; there's now a simple bench for rules and tokenizer loading speed, and a CI fix.

The Tagger is not well documented at the moment; I'll try to get around to better documentation a bit later so you know roughly what's going on if you're working on it :)

bminixhofer commented on May 16, 2024

This is somewhat addressed by #66 and #70. I'll keep this open because there are currently (at least) two more things worth investigating in this area:

  1. Track loading speed (and possibly some other metrics) over time using e.g. https://github.com/rhysd/github-action-benchmark to prevent regressions.
  2. Investigate parallelization of deserialization across structs. For example, the Tagger and the three Chunker models are all quite expensive to deserialize, and all of that happens sequentially at the moment although it could be done in parallel (a rough sketch follows below). I'm not sure how difficult this would be; there's some related discussion in bincode-org/bincode#139, but I don't think this will land in serde anytime soon.
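
To make point 2 concrete, here's a minimal sketch of what parallel deserialization could look like, assuming hypothetical deserialize_tagger / deserialize_chunker helpers standing in for the bincode work that currently runs sequentially:

    use std::thread;

    // Placeholders for the real nlprule structs and their bincode deserialization.
    struct Tagger;
    struct Chunker;
    fn deserialize_tagger(_bytes: &[u8]) -> Tagger { Tagger }
    fn deserialize_chunker(_bytes: &[u8]) -> Chunker { Chunker }

    // Deserialize the two expensive structs on separate scoped threads instead of
    // one after the other; both results are joined before returning.
    fn load_parallel(tagger_bytes: &[u8], chunker_bytes: &[u8]) -> (Tagger, Chunker) {
        thread::scope(|s| {
            let tagger = s.spawn(|| deserialize_tagger(tagger_bytes));
            let chunker = s.spawn(|| deserialize_chunker(chunker_bytes));
            (tagger.join().unwrap(), chunker.join().unwrap())
        })
    }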

bminixhofer commented on May 16, 2024

The biggest issue using this library currently is the fact that on each startup a lot of regular expressions are compiled.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

If regex (or whatever crate is being used) implements serialization of the compiled regex, this could be entirely avoided

This would be really nice. I believe regex-automata already implements this kind of serialization but fancy-regex builds on regex, not regex-automata.
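
For reference, the rough shape of that regex-automata support (a sketch against regex-automata 0.3+; the API differs between versions, and nlprule can't use it directly as long as fancy-regex sits on top of regex rather than regex-automata):

    use regex_automata::dfa::dense::DFA;

    fn main() {
        // Compile once (e.g. at build time) and keep the raw DFA bytes around.
        let dfa = DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
        let (bytes, pad) = dfa.to_bytes_native_endian();

        // Loading later validates the bytes but does not recompile the automaton.
        let loaded: DFA<&[u8]> = DFA::from_bytes(&bytes[pad..]).unwrap().0;
        let _ = loaded;
    }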

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there:

fn regex(&self) -> &regex_impl::Regex {
    if let Some(regex) = self.regex.borrow() {
        regex
    } else {
        let regex = regex_impl::Regex::new(&self.regex_str).unwrap_or_else(|_| {
            panic!("regex string should be pre-tested: {}", self.regex_str)
        });
        self.regex.fill(regex).ok();
        self.regex.borrow().unwrap()
    }
}

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

impl From<TaggerFields> for Tagger {

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where there are speed issues.
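
Something along these lines as a starting point (a sketch only; the .bin paths are placeholders and the bench would go into benches/ with criterion as a dev-dependency):

    use criterion::{criterion_group, criterion_main, Criterion};
    use nlprule::{Rules, Tokenizer};

    fn load(c: &mut Criterion) {
        // Measure how long loading the tokenizer and rules binaries takes.
        c.bench_function("load tokenizer", |b| {
            b.iter(|| Tokenizer::new("en_tokenizer.bin").unwrap())
        });
        c.bench_function("load rules", |b| {
            b.iter(|| Rules::new("en_rules.bin").unwrap())
        });
    }

    criterion_group!(benches, load);
    criterion_main!(benches);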

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now (for example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of using a fixed u32 id; that would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time).
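
A minimal sketch of that word_store idea, with simplified stand-ins for the real types (the actual code uses a BiMap and a dedicated WordIdInt):

    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashSet;
    use std::hash::{Hash, Hasher};

    // Store only the words themselves; the id is derived by hashing at lookup
    // time instead of being persisted as a String -> u32 map.
    struct WordStore {
        words: HashSet<String>,
    }

    impl WordStore {
        fn id(&self, word: &str) -> Option<u64> {
            self.words.get(word).map(|w| {
                let mut hasher = DefaultHasher::new();
                w.hash(&mut hasher);
                hasher.finish()
            })
        }
    }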

drahnr commented on May 16, 2024

The biggest issue using this library currently is the fact that on each startup a lot of regular expressions are compiled.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

Note the constraint: CLI - and there the load time is very significant for the end user.

If regex (or whatever crate is being used) implements serialization of the compiled regex, this could be entirely avoided

This would be really nice. I believe regex-automata already implements this kind of serialization but fancy-regex builds on regex, not regex-automata.

I've had a quick peek into the source of regex, and that task is actually non-trivial unfortunately.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there:

fn regex(&self) -> &regex_impl::Regex {
    if let Some(regex) = self.regex.borrow() {
        regex
    } else {
        let regex = regex_impl::Regex::new(&self.regex_str).unwrap_or_else(|_| {
            panic!("regex string should be pre-tested: {}", self.regex_str)
        });
        self.regex.fill(regex).ok();
        self.regex.borrow().unwrap()
    }
}

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

impl From<TaggerFields> for Tagger {

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where there are speed issues.

Yeah, I'll add some criterion benches sometime next week.

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now (for example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of using a fixed u32 id; that would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time).

Just making sure I am not investing time + effort for the 🗑️ 😉

bminixhofer commented on May 16, 2024

Note the constraint: CLI - and there the load time is very significant for the end user.

Right, for CLI that is definitely true.

Yeah, I'll add some criterion benches sometime next week.

That would be great! A very simple benchmark + cargo-flamegraph should be enough to track down the biggest contributors; at least that's how I did it for the core.

drahnr commented on May 16, 2024

I took the lazy path; below is the current flamegraph of cargo-spellcheck w/ 0.5.3. You are correct, most time is spent in deserialization, which is quite unfortunate.

flamegraph.svg

I'm not quite sure how to approach this. I'd like to just do all the hard work in build.rs, serialize to something that is zero-copy, and deserialize that in the actual application, e.g. https://lib.rs/crates/rkyv or https://capnproto.org/ - both of which are clearly out of scope for this crate.

So my question here is how much can be exposed for deserialization purposes to make it easier to swap out as needed.
I'll think a bit about this at some point this week.
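
To make the rkyv direction concrete, a rough sketch against rkyv 0.7 (the derive/access API differs in other versions, and TaggerData is just a made-up stand-in for whatever nlprule would archive):

    use rkyv::{Archive, Deserialize, Serialize};

    // Made-up stand-in for the data that would be prepared once in build.rs.
    #[derive(Archive, Serialize, Deserialize)]
    struct TaggerData {
        words: Vec<String>,
    }

    fn main() {
        let data = TaggerData { words: vec!["hello".into()] };

        // build.rs side: do the expensive construction once and persist the bytes.
        let bytes = rkyv::to_bytes::<_, 256>(&data).unwrap();

        // application side: access the archived data in place, without a full
        // deserialization pass.
        let archived = unsafe { rkyv::archived_root::<TaggerData>(&bytes[..]) };
        assert_eq!(archived.words.len(), 1);
    }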

bminixhofer commented on May 16, 2024

Thanks for the graph! It seems almost all of the work is indeed done in the Tagger deserialization. In general, deserializing from an FST is necessary since otherwise the size of the binary blows up, but it makes sense that it's slow.

Here's the problematic code:

impl From<TaggerFields> for Tagger {
    fn from(data: TaggerFields) -> Self {
        // Rebuild the word store from its FST: every key becomes an owned String
        // mapped to its u32 id.
        let word_store_fst = Map::new(data.word_store_fst).unwrap();
        let word_store: BiMap<String, WordIdInt> = word_store_fst
            .into_stream()
            .into_str_vec()
            .unwrap()
            .into_iter()
            .map(|(key, value)| (key, WordIdInt(value as u32)))
            .collect();

        let mut tags = DefaultHashMap::new();
        let mut groups = DefaultHashMap::new();

        // Walk the tag FST and unpack each value into an inflection id and a POS id.
        let tag_fst = Map::new(data.tag_fst).unwrap();
        let mut stream = tag_fst.into_stream();

        while let Some((key, value)) = stream.next() {
            let word = std::str::from_utf8(&key[..key.len() - 1]).unwrap();
            let word_id = *word_store.get_by_left(word).unwrap();

            let value_bytes = value.to_be_bytes();
            let inflection_id = WordIdInt(u32::from_be_bytes([
                value_bytes[0],
                value_bytes[1],
                value_bytes[2],
                value_bytes[3],
            ]));
            let pos_id = PosIdInt(u16::from_be_bytes([value_bytes[6], value_bytes[7]]));

            let group = groups.entry(inflection_id).or_insert_with(Vec::new);
            if !group.contains(&word_id) {
                group.push(word_id);
            }

            tags.entry(word_id)
                .or_insert_with(IndexMap::new)
                .entry(inflection_id)
                .or_insert_with(Vec::new)
                .push(pos_id);
        }

        Tagger {
            tags,
            tag_store: data.tag_store,
            word_store,
            groups,
            lang_options: data.lang_options,
            ..Default::default()
        }
    }
}

I can give parallelizing this to some degree a go; that's not trivial here, but it should help.

(also, I just noticed the into_iter + map + collect used to build the word_store above can easily be avoided, so that should already measurably improve things, but it seems the bulk of the work is done in the while let loop over the tag FST)

drahnr commented on May 16, 2024

tl;dr need a good test case to improve this for real.


My measurements really were only a few runs of cargo-spellcheck on the same file with different impls, comparing the total runtime percentage of the Tagger::from impl. So take it with a grain of salt.


The above code snippet seems slow due to the additional allocations; in reality, the change is insignificant and the code is slightly faster as is (probably due to some cache locality) compared to:

        let word_store_fst = Map::new(data.word_store_fst).unwrap();
        let mut word_store = FastBiMap::<String, WordIdInt>::with_capacity_and_hashers(
            word_store_fst.len(),
            Default::default(),
            Default::default(),
        );
        let mut stream = word_store_fst.into_stream();
        while let Some((key, value)) = stream.next() {
            if let Ok(key) = std::str::from_utf8(key) {
                word_store.insert(key.to_owned(), WordIdInt(value as u32));
            }
        }

The fnv hasher is supposed to be pretty fast for small keys, which is really all there is here, but it appears hashbrown beats fnv, maybe due to the SIMD lookup that is advertised.

drahnr commented on May 16, 2024

Some more digging revealed that the data really is very sparse. We do two allocations for the structure and the intermediate combinator key, where everything holds only one element in 99% of all cases. So at this point, I am tempted to say that using Vec for the inner 2 layers should already give a pretty good speedup.
Another approach would be to eliminate the intermediate layer,

                .entry(inflection_id)

which is also barely used, only in get_raw afaiu.
Using a Vec<(WordIdInt, PosIdInt)> should yield significantly better performance given our data sizes of mostly one, up to three items.
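
To illustrate the flattened shape, a small sketch with simplified stand-ins for the id types (the real WordIdInt/PosIdInt newtypes live in nlprule):

    use std::collections::HashMap;

    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    struct WordIdInt(u32);
    #[derive(Clone, Copy, PartialEq, Eq)]
    struct PosIdInt(u16);

    // The intermediate IndexMap per word is replaced by a small Vec of
    // (inflection, POS) pairs; a linear scan is cheap since it holds 1-3 entries.
    type Tags = HashMap<WordIdInt, Vec<(WordIdInt, PosIdInt)>>;

    fn pos_ids_for(tags: &Tags, word: WordIdInt, inflection: WordIdInt) -> Vec<PosIdInt> {
        tags.get(&word)
            .map(|entries| {
                entries
                    .iter()
                    .filter(|(infl, _)| *infl == inflection)
                    .map(|(_, pos)| *pos)
                    .collect()
            })
            .unwrap_or_default()
    }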

bminixhofer commented on May 16, 2024

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that. We can try Vec<(WordIdInt, PosIdInt)>; maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs + deserializing in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable by nlprule-build (which would have to do deserialization + update n + serialization) but that would be significantly more work and add quite a bit of complexity. But the size difference is actually not too bad - using two FSTs instead of one adds ~1MB.
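
A rough sketch of the multi-FST direction, just to show the shape of it; to stay self-contained it merges per-thread maps at the end instead of inserting into a concurrent map like flurry, and decoding keys/values into word and tag ids is left out:

    use fst::{Map, Streamer};
    use std::collections::HashMap;
    use std::thread;

    // Each element of `parts` is the raw bytes of one FST covering a disjoint
    // slice of the tag store; every part is deserialized on its own thread.
    fn load_tag_parts(parts: Vec<Vec<u8>>) -> HashMap<Vec<u8>, u64> {
        let locals: Vec<HashMap<Vec<u8>, u64>> = thread::scope(|s| {
            let handles: Vec<_> = parts
                .iter()
                .map(|bytes| {
                    s.spawn(move || {
                        let map = Map::new(bytes.as_slice()).unwrap();
                        let mut stream = map.stream();
                        let mut local = HashMap::new();
                        while let Some((key, value)) = stream.next() {
                            // Real code would decode key/value into ids here.
                            local.insert(key.to_vec(), value);
                        }
                        local
                    })
                })
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });

        // Keys don't overlap across parts, so merging is a plain extend.
        let mut merged = HashMap::new();
        for local in locals {
            merged.extend(local);
        }
        merged
    }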

drahnr commented on May 16, 2024

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

One retains the current two-layer inner nesting structure; the other flattens the IndexMap<_, Vec<_>> to a single Vec.

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that. We can try Vec<(WordIdInt, PosIdInt)>; maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

That could work, but the most significant part is not the insertion, but just calling next() on the Stream. I tried to write a safe wrapper as an iterator to then be able to use rayon with a parallel iterator adapter, but that suffers from the same issue.

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs + deserializing in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable by nlprule-build (which would have to do deserialization + update n + serialization) but that would be significantly more work and add quite a bit of complexity. But the size difference is actually not too bad - using two FSTs instead of one adds ~1MB.

I think fixing the topology of the Tagger should be the first step. I am not very deep into the code yet, so I don't know whether that already implies adding multiple FSTs or not.

drahnr commented on May 16, 2024

That'd be much appreciated; I'll have some time tomorrow night to look into this further.

bminixhofer commented on May 16, 2024

Alright, there are better doc comments for the tagger on main now. I think I covered the important parts; let me know if anything is unclear!
