Comments (13)
Also on that subject, that's a reason to separate the analyzer and tokenizer into separate interfaces, especially for concept identification, where a concept can span or encapsulate multiple terms. Using an analyzer with a tightly coupled tokenization function makes this a nightmare.
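A minimal sketch of that separation, in Python for brevity (Resin itself is C#). The names `Tokenizer`, `Analyzer`, `WhitespaceTokenizer`, `ConceptAnalyzer` and `run_pipeline` are hypothetical and are not Resin's API. The point is only that the analyzer sees the whole token list, so it can match a concept that spans several terms:

```python
from typing import Protocol


class Tokenizer(Protocol):
    def tokenize(self, text: str) -> list[str]: ...


class Analyzer(Protocol):
    def analyze(self, tokens: list[str]) -> list[str]: ...


class WhitespaceTokenizer:
    """Hypothetical tokenizer: lowercases and splits on whitespace."""

    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()


class ConceptAnalyzer:
    """Hypothetical analyzer: replaces the longest known run of tokens
    with a concept label and passes unknown tokens through unchanged."""

    def __init__(self, concepts: dict[tuple[str, ...], str]) -> None:
        self._concepts = concepts
        self._max_len = max((len(key) for key in concepts), default=1)

    def analyze(self, tokens: list[str]) -> list[str]:
        out: list[str] = []
        i = 0
        while i < len(tokens):
            # Try the longest candidate span first, then shrink it.
            for n in range(min(self._max_len, len(tokens) - i), 0, -1):
                label = self._concepts.get(tuple(tokens[i:i + n]))
                if label is not None:
                    out.append(label)
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out


def run_pipeline(text: str, tokenizer: Tokenizer, analyzer: Analyzer) -> list[str]:
    # The two stages only meet here, so either one can be swapped out.
    return analyzer.analyze(tokenizer.tokenize(text))
```

Because the tokenizer knows nothing about concepts, changing how text is split does not touch the concept matching, and vice versa.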
from resin.
Check out https://github.com/jhashemi/resin/tree/master/src/Resin.Analyses.Concept
Ideally, concepts are represented as graphs. A separate concept index will most definitely be needed, along with an implementation of IVocabulary that depends on a Resin index. This is very rudimentary and untested; I just wanted to get my ideas down on paper.
from resin.
Can this also be used for synonyms?
from resin.
This is instead of synonyms. "King" and "emperor" should both be part of the same (oppressive and undemocratic ruler) concept.
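As an illustration only, here is a small Python sketch of a concept lookup standing in for a synonym list. `ConceptVocabulary` and its methods are hypothetical names, and the sketch assumes each term belongs to exactly one concept, which sidesteps disambiguation entirely:

```python
from collections import defaultdict


class ConceptVocabulary:
    """Hypothetical term <-> concept mapping (one concept per term)."""

    def __init__(self) -> None:
        self._concept_of: dict[str, str] = {}
        self._terms_of: dict[str, set[str]] = defaultdict(set)

    def add(self, concept: str, terms: list[str]) -> None:
        for term in terms:
            self._concept_of[term] = concept
            self._terms_of[concept].add(term)

    def expand(self, term: str) -> set[str]:
        """Return every term sharing the query term's concept,
        or just the term itself if it has no known concept."""
        concept = self._concept_of.get(term)
        if concept is None:
            return {term}
        return set(self._terms_of[concept])


vocab = ConceptVocabulary()
vocab.add("ruler", ["king", "emperor"])
query_terms = vocab.expand("king")  # both "king" and "emperor"
```

A query for "king" would then be evaluated against every term of the "ruler" concept, without "king" and "emperor" ever being declared synonyms of each other.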
from resin.
Concepts also span multiple terms, and then you have to deal with disambiguation ;-)
I'll give it a very basic go and submit a PR.
from resin.
@jhashemi you're giving this a go?
I could tell you about my ideas but I'm not going to. You seem to have an itch. Just a basic proof of concept will do :)
from resin.
Before I go deeper into it, I wanted to chat about whether this should be knowledge-base, ontology, or API based.
Maybe I'll UML it out and push an architecture project.
from resin.
Sounds good.
Some thoughts on this issue. When I added it I was thinking about (1) how to implement word2vec, simplified in the same way the vector space model is simplified in Resin and in Lucene, but also (2) how to produce word vectors at indexing time. If you add one document at a time to your index, how would you be able to produce word vectors? That doesn't seem possible. So perhaps "concepts" or word vectors or sentiment analysis, or whatever you want to call it, is an operation you run on an existing index. That operation could produce a new concept-based index that complements the term-based one.
The concept-based index would contain pointers into the term-based index which in turn has pointers into the postings and document store.
Having a concept-based index would mean you could make more directed lookups into the term-based index instead of large scans.
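A rough Python sketch of that layering, under stated assumptions: the term index is modelled as a plain dict from term to a set of document IDs, and `build_concept_index` and `search_concept` are hypothetical helpers, not anything in Resin. The concept index is derived from an index that already exists and stores only pointers (term keys) into it:

```python
def build_concept_index(
    term_index: dict[str, set[int]],
    concept_terms: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Run over an existing term index and keep, per concept, only the
    terms that actually occur in it. The values are keys into the term
    index, not copies of its postings."""
    return {
        concept: [term for term in terms if term in term_index]
        for concept, terms in concept_terms.items()
    }


def search_concept(
    concept: str,
    concept_index: dict[str, list[str]],
    term_index: dict[str, set[int]],
) -> set[int]:
    """Follow concept -> terms -> postings with direct lookups only."""
    doc_ids: set[int] = set()
    for term in concept_index.get(concept, []):
        doc_ids |= term_index[term]
    return doc_ids


# Assumed toy data: postings are sets of document IDs.
term_index = {"king": {1, 2}, "emperor": {2, 3}, "cat": {4}}
concept_index = build_concept_index(term_index, {"ruler": ["king", "emperor", "tsar"]})
hits = search_concept("ruler", concept_index, term_index)
```

Each concept lookup touches only the terms listed for that concept, which is the "directed lookup instead of a large scan" idea in miniature.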
All in theory and a bit diffuse in my mind at the moment.
Also, we would need a new tree to represent words instead of just characters. Does it have to be a B+ tree? I mean sure, all devs should roll a B+ tree once in their lives, I guess. Maybe it'll be fun?
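For comparison, one simple structure keyed on words rather than characters is a word-level trie. The Python sketch below is that, not a B+ tree, and `WordNode` and `WordTrie` are hypothetical names; it is only meant to show what "a tree of words" could look like:

```python
from typing import Optional


class WordNode:
    def __init__(self) -> None:
        self.children: dict[str, "WordNode"] = {}
        self.concept: Optional[str] = None  # set when a phrase ends here


class WordTrie:
    """Trie whose edges are whole words instead of single characters."""

    def __init__(self) -> None:
        self.root = WordNode()

    def insert(self, words: list[str], concept: str) -> None:
        node = self.root
        for word in words:
            node = node.children.setdefault(word, WordNode())
        node.concept = concept

    def longest_match(self, words: list[str], start: int = 0) -> tuple[Optional[str], int]:
        """Return (concept, number of words consumed) for the longest
        phrase beginning at `start`, or (None, 0) if nothing matches."""
        node = self.root
        best: tuple[Optional[str], int] = (None, 0)
        for consumed, word in enumerate(words[start:], start=1):
            node = node.children.get(word)
            if node is None:
                break
            if node.concept is not None:
                best = (node.concept, consumed)
        return best
```

Whether an in-memory trie like this or a disk-friendly B+ tree is the better fit depends on how large the vocabulary gets and how it is persisted.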
from resin.
This looks pretty good: https://github.com/asengupta/BPlusTree/blob/master/BPlusTree/BTreeNode.cs
from resin.
Also, for graphs, a sparse adjacency matrix implementation typically works best, with each axis holding your node IDs and relationships recorded as 0 or 1. You can use a bitmap index to make traversal extremely fast.
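A small Python sketch of the idea, with assumptions spelled out: only non-empty rows are stored (that is the "sparse" part), each row is a Python integer used as a bitmask of neighbour IDs, and `BitmapGraph` is a hypothetical name. A production bitmap index would normally use a compressed bitmap format rather than raw integers:

```python
class BitmapGraph:
    """Directed graph stored as sparse rows of neighbour bitmasks."""

    def __init__(self) -> None:
        self._rows: dict[int, int] = {}  # node id -> bitmask of neighbours

    def add_edge(self, src: int, dst: int) -> None:
        self._rows[src] = self._rows.get(src, 0) | (1 << dst)

    def has_edge(self, src: int, dst: int) -> bool:
        return (self._rows.get(src, 0) >> dst) & 1 == 1

    def reachable(self, start: int) -> set[int]:
        """Breadth-first traversal where each level is a bitmask."""
        visited = 1 << start
        frontier = visited
        while frontier:
            next_frontier = 0
            bits = frontier
            while bits:
                low = bits & -bits            # isolate the lowest set bit
                node = low.bit_length() - 1   # its position is the node id
                next_frontier |= self._rows.get(node, 0)
                bits ^= low
            frontier = next_frontier & ~visited  # drop already seen nodes
            visited |= frontier
        return {i for i in range(visited.bit_length()) if (visited >> i) & 1}
```

Merging a whole row of neighbours into the frontier is a single bitwise OR, and filtering out visited nodes is a single AND, which is where the traversal speed comes from.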
from resin.
I will check this out shortly. I ran through the code and it looked very promising.
from resin.
This issue is still open but needs a new strategy because of the new type of index introduced here: 5f85425
from resin.
Will be solved at a later time.
from resin.