Comments (13)

jhashemi avatar jhashemi commented on May 27, 2024 1

Also on that subject: that's a reason to separate the analyzer and tokenizer into separate interfaces, especially with regard to concept identification, where a concept can span or encapsulate multiple terms. Using an analyzer with a tightly coupled tokenization function makes this a nightmare.
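To make the separation concrete, here is a minimal Python sketch (Resin itself is C#, so this is only an illustration). The names `Tokenizer`, `whitespace_tokenizer`, `ConceptAnalyzer` and the tuple-keyed vocabulary are all hypothetical and not part of Resin's API. The analyzer receives the tokenizer as a dependency and does a greedy longest match over the token stream, so a concept can cover more than one term.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a tokenizer is any function from text to terms.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


class ConceptAnalyzer:
    """Maps runs of one or more terms to concept labels (illustrative only)."""

    def __init__(self, tokenizer: Tokenizer,
                 vocabulary: Dict[Tuple[str, ...], str]) -> None:
        self.tokenizer = tokenizer
        self.vocabulary = vocabulary
        # Longest phrase in the vocabulary, measured in terms.
        self.max_len = max((len(key) for key in vocabulary), default=1)

    def analyze(self, text: str) -> List[str]:
        tokens = self.tokenizer(text)
        output: List[str] = []
        i = 0
        while i < len(tokens):
            # Try the longest candidate phrase first, then shorter ones.
            for n in range(min(self.max_len, len(tokens) - i), 0, -1):
                key = tuple(tokens[i:i + n])
                if key in self.vocabulary:
                    output.append(self.vocabulary[key])
                    i += n
                    break
            else:
                # No concept matched here; keep the plain term.
                output.append(tokens[i])
                i += 1
        return output


vocabulary = {
    ("new", "york"): "concept:nyc",
    ("king",): "concept:ruler",
    ("emperor",): "concept:ruler",
}
analyzer = ConceptAnalyzer(whitespace_tokenizer, vocabulary)
result = analyzer.analyze("The king visited New York")
```

Because the tokenizer is injected, swapping in a different tokenization strategy does not touch the concept-matching logic. Disambiguation is not handled in this sketch.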

from resin.

jhashemi avatar jhashemi commented on May 27, 2024 1

check out https://github.com/jhashemi/resin/tree/master/src/Resin.Analyses.Concept

Ideally, concepts are represented as graphs. A separate concept index will most definitely be needed, along with an implementation of IVocabulary that depends on a Resin index. This is very, very rudimentary and untested; I just wanted to get my ideas on paper.

from resin.

mdissel avatar mdissel commented on May 27, 2024

Can this also be used for synonyms?

from resin.

kreeben avatar kreeben commented on May 27, 2024

This is instead of synonyms. "King" and "emperor" should both be part of the same (oppressive and undemocratic ruler) concept.

from resin.

jhashemi avatar jhashemi commented on May 27, 2024

Concepts also span multiple terms, then you have to deal with disambiguation ;-)

I'll give it a very, very basic go and submit a PR.

from resin.

kreeben avatar kreeben commented on May 27, 2024

@jhashemi you're giving this a go?

I could tell you about my ideas but I'm not going to. You seem to have an itch. Just a basic proof of concept will do :)

from resin.

jhashemi avatar jhashemi commented on May 27, 2024

Before I go deeper into it, I wanted to chat about whether this should be knowledge-base, ontology-based, or API-based.

Maybe I'll UML it out and push an architecture project.

from resin.

kreeben avatar kreeben commented on May 27, 2024

Sounds good.

Some thoughts on this issue. When I added it I was thinking about (1) how to implement word2vec, simplified in the same way the vector space model is simplified in Resin and in Lucene, but also (2) how to produce word vectors at indexing time. If you add one document at a time to your index, how would you then be able to produce word vectors? That doesn't seem possible. So perhaps "concepts" or word vectors or sentiment analysis, or whatever you want to call it, is an operation you perform on an existing index. That operation could produce a new concept-based index that complements the term-based one.

The concept-based index would contain pointers into the term-based index which in turn has pointers into the postings and document store.

Having a concept-based index would mean you could make more directed lookups into the term-based index instead of large scans.
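A rough Python sketch of that layering, under stated assumptions: the documents, the `concept_terms` mapping, and the function names below are made up for illustration and do not reflect Resin's storage format. The concept index is derived from an already built term index and holds only term keys, which in turn point at postings (document IDs).

```python
from typing import Dict, List, Set

# Hypothetical document store: document ID -> text.
documents: Dict[int, str] = {
    0: "the king ruled the land",
    1: "an emperor rose to power",
    2: "the farmer sold wheat",
}


def build_term_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Term-based index: term -> postings (set of document IDs)."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def build_concept_index(term_index: Dict[str, Set[int]],
                        concept_terms: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Concept-based index built on top of an existing term index.

    Each concept keeps pointers (here simply term keys) to terms
    that are actually present in the term index.
    """
    return {
        concept: [term for term in terms if term in term_index]
        for concept, terms in concept_terms.items()
    }


def lookup_concept(concept: str,
                   concept_index: Dict[str, List[str]],
                   term_index: Dict[str, Set[int]]) -> Set[int]:
    """Follow concept -> terms -> postings, with no scan over all terms."""
    doc_ids: Set[int] = set()
    for term in concept_index.get(concept, []):
        doc_ids |= term_index[term]
    return doc_ids


term_index = build_term_index(documents)
concept_index = build_concept_index(
    term_index, {"ruler": ["king", "emperor", "tsar"]}
)
ruler_docs = lookup_concept("ruler", concept_index, term_index)
```

Here the concept-to-term mapping is supplied by hand; how it would actually be produced (word vectors, an ontology, something else) is exactly the open question in this thread.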

All in theory and a bit diffuse in my mind at the moment.

Also, we would need a new tree to represent words instead of just characters. Does it have to be a B+ tree? I mean sure, all devs should roll a B+ tree once in their lives, I guess. Maybe it'll be fun?
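For illustration only, a tree keyed on whole words could look like the Python sketch below. This is a plain word-level trie, not a B+ tree, and `WordTrie` and `WordTrieNode` are hypothetical names; it only shows the idea of each edge being a word rather than a character.

```python
from typing import Dict, Optional


class WordTrieNode:
    """A node whose outgoing edges are labelled with whole words."""

    def __init__(self) -> None:
        self.children: Dict[str, "WordTrieNode"] = {}
        self.concept: Optional[str] = None  # set only where a phrase ends


class WordTrie:
    def __init__(self) -> None:
        self.root = WordTrieNode()

    def insert(self, phrase: str, concept: str) -> None:
        """Store a phrase word by word and label its final node."""
        node = self.root
        for word in phrase.lower().split():
            node = node.children.setdefault(word, WordTrieNode())
        node.concept = concept

    def find(self, phrase: str) -> Optional[str]:
        """Return the concept for an exact phrase, or None if absent."""
        node = self.root
        for word in phrase.lower().split():
            child = node.children.get(word)
            if child is None:
                return None
            node = child
        return node.concept


trie = WordTrie()
trie.insert("holy roman emperor", "ruler")
trie.insert("king", "ruler")
```

A B+ tree would add ordered keys and range scans on top of this, which the sketch does not attempt.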

from resin.

kreeben avatar kreeben commented on May 27, 2024

This looks pretty good: https://github.com/asengupta/BPlusTree/blob/master/BPlusTree/BTreeNode.cs

from resin.

jhashemi avatar jhashemi commented on May 27, 2024

Also, for graphs, a sparse adjacency matrix implementation typically works best, with each axis being your node IDs and relationships recorded as 0 or 1. You can use a bitmap index to make traversal extremely fast.
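As a small Python sketch of the bitmap idea (hypothetical `BitmapGraph` class, not taken from any library): each row of the adjacency matrix is kept as one integer used as a bitset, so expanding a whole frontier of nodes is a handful of bitwise ORs. Note that this stores one bitmap per row and is not a compressed sparse format; a real implementation would likely use a compressed bitmap.

```python
from typing import List


class BitmapGraph:
    """Directed graph with one bitmap (a Python int) per adjacency row."""

    def __init__(self, node_count: int) -> None:
        # rows[src] has bit dst set when there is an edge src -> dst.
        self.rows: List[int] = [0] * node_count

    def add_edge(self, src: int, dst: int) -> None:
        self.rows[src] |= 1 << dst

    def has_edge(self, src: int, dst: int) -> bool:
        return bool((self.rows[src] >> dst) & 1)

    def neighbors(self, src: int) -> List[int]:
        """List the node IDs whose bit is set in the row of src."""
        row, node_id, result = self.rows[src], 0, []
        while row:
            if row & 1:
                result.append(node_id)
            row >>= 1
            node_id += 1
        return result

    def reachable(self, start: int) -> int:
        """Breadth-first traversal; returns a bitmap of reachable nodes."""
        visited = 1 << start
        frontier = visited
        while frontier:
            next_frontier = 0
            bits, node_id = frontier, 0
            while bits:
                if bits & 1:
                    # OR in every neighbor of this frontier node at once.
                    next_frontier |= self.rows[node_id]
                bits >>= 1
                node_id += 1
            frontier = next_frontier & ~visited  # drop nodes already seen
            visited |= frontier
        return visited


graph = BitmapGraph(4)
graph.add_edge(0, 1)
graph.add_edge(1, 2)
reached = graph.reachable(0)
```

Node 3 has no incoming edges in this example, so its bit stays unset in `reached`.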

from resin.

kreeben avatar kreeben commented on May 27, 2024

I will check this out shortly. I ran through the code and it looked very promising.

from resin.

kreeben avatar kreeben commented on May 27, 2024

This issue is still open but needs a new strategy because of the new type of index introduced here: 5f85425

from resin.

kreeben avatar kreeben commented on May 27, 2024

Will be solved at a later time.

from resin.
