goliasz / pio-template-text-similarity

Text similarity based on Word2Vec vectors.

Scala 99.51% Shell 0.49%

word2vec-model machine-learning text-classification

pio-template-text-similarity's Introduction

Text Similarity Based on Word2Vec

Text similarity engine based on the Word2Vec algorithm. It builds vectors of full documents in the training phase and finds similar documents in the query phase.
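The two phases can be illustrated with a minimal Python sketch. It is not the template's Scala implementation: the README does not say how word vectors are combined into a document vector, so the sketch assumes simple averaging and cosine similarity. The toy vectors and the names doc_vector, cosine and most_similar are hypothetical.

```python
import math

# Toy 2-dimensional word vectors (assumption); a real model learns these with Word2Vec.
WORD_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.2],
    "car": [0.0, 1.0],
}


def doc_vector(text):
    """Average the vectors of all known words; return None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, docs, limit):
    """Score every document against the query and return the top `limit` (id, score) pairs."""
    q = doc_vector(query)
    scored = []
    for doc_id, text in docs.items():
        d = doc_vector(text)
        # A score of 0.0 is used when either side has no known words.
        score = cosine(q, d) if q is not None and d is not None else 0.0
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]
```

In this sketch, a query made only of words the model has never seen scores 0.0 against every document, so the order of the results carries no information.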

Model Training Modes

The similarity model can be trained using two sources of information.

The training mode is selected in the engine.json configuration file. It cannot be changed through PubNub queues.

Basic Training

Basic training uses only the "text" field of the training event. In this mode a score of 0 means that the model lacks training information for the query phrase. In that case the results and matches are random.

Composite Training

In this mode there are two sources of information: the fields "text" and "extTrainWords" are concatenated. This gives much more flexibility. To distinguish two quite similar phrases, the second field can carry additional deciding information, so that matches are precise and fit our needs.

Note that both fields can contain free text. They do not need to hold single words.
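The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: the function name training_text, the single-space separator and the handling of missing fields are assumptions, not details taken from the template.

```python
def training_text(event, use_ext_train_words):
    """Return the text a training event contributes to the model.

    `event` is a dict with a "text" field and an optional "extTrainWords" field.
    Joining with a single space is an assumption made for this sketch.
    """
    text = event.get("text", "")
    if not use_ext_train_words:
        # Basic training: only the "text" field is used.
        return text
    # Composite training: "text" and "extTrainWords" are concatenated.
    extra = event.get("extTrainWords", "")
    return (text + " " + extra).strip()
```

For example, two events with nearly identical "text" values can be separated by giving each a different "extTrainWords" value, since those extra words end up in the training text only in composite mode.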

Engine Configuration

The engine.json file contains the engine configuration.

minTokenSize

Only tokens whose size is equal to or greater than minTokenSize are used to train the similarity model.
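A minimal Python sketch of this filter follows. Splitting on whitespace is an assumption; the README does not describe the template's tokenizer, and filter_tokens is a hypothetical name.

```python
def filter_tokens(text, min_token_size):
    """Keep only tokens with at least `min_token_size` characters.

    Whitespace tokenization is an assumption made for this sketch.
    """
    return [token for token in text.split() if len(token) >= min_token_size]
```

With min_token_size set to 3, the text "a an the word" keeps only "the" and "word".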

showText

If set to true, the query result displays the "text" field from the training events.

showDesc

If set to true, the query result displays the "desc" field from the training events.

useExtTrainWords

If set to true, composite training is used: the concatenated "text" and "extTrainWords" fields are the source of training data. If set to false, only the "text" field is used as the source of training data.

storeClearText

Default: false. If set to true, all texts used for training are stored inside the model together with the text vectors. The model is kept in memory, so a huge training set can fill up a significant portion of memory. If set to false, the model is very memory efficient: only vectors of doubles are stored. By default vectors have 100 dimensions. This is configurable using "vectorSize".
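A rough back-of-the-envelope estimate of the vector storage can be written as a short Python sketch. It assumes 8 bytes per double and counts only the raw vector payload; JVM object overhead, document ids and any stored clear text are ignored. The function name vector_payload_bytes is hypothetical.

```python
BYTES_PER_DOUBLE = 8  # a 64-bit floating point value


def vector_payload_bytes(num_docs, vector_size=100):
    """Raw size of the stored document vectors, ignoring all runtime overhead."""
    return num_docs * vector_size * BYTES_PER_DOUBLE
```

With the default of 100 dimensions, each document needs 800 bytes of vector data, so one million documents need about 800 MB before any overhead.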

Docker Part

docker pull goliasz/docker-predictionio
docker run --hostname tc1 --name tc1 -it goliasz/docker-predictionio /bin/bash

PIO Part

root@tc1:/# pio-start-all
root@tc1:/# mkdir MyEngine
root@tc1:/# cd MyEngine
root@tc1:/MyEngine# pio template get goliasz/pio-template-text-similarity --version "0.9.1" textsim
root@tc1:/MyEngine# cd textsim
root@tc1:/MyEngine/textsim# vi engine.json

Set the application name to "textsim" in engine.json.

root@tc1:/MyEngine/textsim# pio build --verbose
root@tc1:/MyEngine/textsim# pio app new textsim --access-key 1234
root@tc1:/MyEngine/textsim# sh ./data/import_test.sh 1
root@tc1:/MyEngine/textsim# pio train
root@tc1:/MyEngine/textsim# pio deploy --port 8000 &

Test

Event Server Status

curl -i -X GET http://localhost:7070

Event Server: get all events (replace YOUR_ACCESS_KEY with the access key from the "pio app new textsim" output)

curl -i -X GET "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY"

Query the similarity score for a text that is slightly similar to the document with id 6

curl -X POST -H "Content-Type: application/json" -d '{"doc": "DJs flock by when MTV ax quiz prog. Five quacking zephyrs jolt my wax bed.", "limit": 3}' http://localhost:8000/queries.json
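The same query can be sent from Python. The sketch below uses only the standard library and builds the request body with json.dumps, so the payload is always valid JSON. The function names build_query, query_engine and top_ids are hypothetical; the "doc" and "limit" fields and the "docScores" response key are taken from the examples in this document.

```python
import json
import urllib.request


def build_query(doc, limit):
    """Serialize the query as UTF-8 encoded JSON."""
    return json.dumps({"doc": doc, "limit": limit}).encode("utf-8")


def query_engine(doc, limit, url="http://localhost:8000/queries.json"):
    """POST the query to a deployed engine and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_query(doc, limit),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def top_ids(response):
    """Extract (id, score) pairs from a response that contains "docScores"."""
    return [(d["id"], d["score"]) for d in response.get("docScores", [])]
```

Calling query_engine requires an engine deployed on port 8000 as in the steps above; top_ids then reduces the response to the ids and scores.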

Requirements

Template versions >= 0.8 require Spark 1.6.1. PredictionIO bundled with Spark 1.6.1 is available here: https://hub.docker.com/r/goliasz/docker-predictionio-dev/

License

This software is licensed under the Apache License, Version 2.0, found here: http://www.apache.org/licenses/LICENSE-2.0

pio-template-text-similarity's Issues

Fine-tune or make more accurate

Hello,
I have the expected result sentence below in my engine, but I get a different one instead.

Expected result :
Hi Anuj, if you can afford to study in USA then no doubt its a best option. Also you canhave a lot ofopportunities in Europe in field of Robotics, especially inGermany(for free highereducation) and France and Denmark (but you have to pay for studies and accommodation). ", "extTrainWords": "USA option lot ofopportunities Europe field Robotics highereducation France Denmark studies accommodation"

Actual result:

Below is my posted request:
http://localhost:8000/queries.json
{"doc": "robotics USA Europe opportunities", "limit": 5}

{
"docScores": [
{
"score": 0.7516887446185359,
"id": "217454",
"similarText": "In my opinion, the difference between these two choices are not as big as I would have said even two years ago. Forbes did a story that said pretty much the same that nursing is overestimated as a career choice (Source:https://www.forbes.com/sites/alisongriswold/2012/06/18/has-nursing-been-overhyped-as-a-career-choice/#77c3ece73555).In the US, there will be a demand for nursing but I dont believe that you will be able to recapture your investment in tuition as half will quit nursing. At the one year mark, 20% will quite (http://www.rwjf.org/en/library/articles-and-news/2014/09/nearly-one-in-five-new-nurses-leave-first-job-within-a-year--acc.html) and the annual rate is 17.2% (http://strategicprogramsinc.com/nursing-shortage-statistics/). Nursing is demanding and many dont realize how difficult this career field is.You might lke business because it is a much less demanding field but you get to stay in business much longer than nursing.",
"textDesc": "desc for 1"
},
{
"score": 0.6727393563685875,
"id": "369330",
"similarText": "Well the requirements for pursuing master in supply chain management vary from university to university. Normally they required bachelor in business administration or in economics or may be some credit must be studied in these fields. But if u have enough experience in supply chain so u might get admission without studying these subjects ..Hope it helps u...",
"textDesc": "desc for 1"
},
{
"score": 0.6593310504608081,
"id": "148663",
"similarText": "After finish bachelor you can check universities that match your subject and start sending email to professors.",
"textDesc": "desc for 1"
},
{
"score": 0.6551619888373317,
"id": "168868",
"similarText": "Hello Mahdi,At first you have to decide which country you want to go. In my opinion, you can try in Japan. There are many good universities for computer science students.They are technically economically very good. There many universities give half full scholarship(MEXT Scholarship). For getting scholarship you have to send research proposal to professor of that universities and you can get scholarship easily. Their medium of instruction is English Japanese. You can choose English as a medium of instruction.In Japan, you can do part time job and their getting part time job is very easy. You can do part time job 28 hours in a week. So, I think Japan will be best for you. Bye.",
"textDesc": "desc for 1"
},
{
"score": 0.6398356787598534,
"id": "297521",
"similarText": "Hi, Aditya. A friend asked me to answer this question for you. I didnt go to Caltech. I have a degree from Stanford in English. But Ill do my best.Caltech is a school that specializes in math and science. So if you want to go there you should want a career in math and science. In your preparation for the school, you should take lots of math and science classes, and you should work very hard and do very well in them. But you should also do well in your other courses, as Im sure Caltech wants students who are good at academics in general.You will have to do such things as take the SAT, take an SAT Subject Test in Mathematics Level 2, and take 1 SAT science subject test: biology (ecological), biology (molecular), chemistry, or physics. On this page you can see a complete list: http://www.admissions.caltech.edu/content/how-apply-first-year-applicant. I imagine you will have to do very well in them.I do know someone who has a bachelors and masters from Caltech. He said that in his opinion most of the people who apply to Caltech have great math and science scores. So he thought maybe the way Caltech decides who to admit is based on their extracurriculars. I asked him what extracurriculars he thought might be good. So far he hasnt replied, but when he does Ill add it to this answer. I would guess you want extracurriculars that are really solid, where youre spending your free time doing something challenging and worthwhile. What do you do now in your free time when youre not in school? Are you involved in any clubs at your school? Do you play sports? Study music? 
I would think some of your extracurriculars should probably involve math and science, but perhaps you should do some that arent math and science so you dont look too one-dimensional.These are your various deadlines related to applying: http://www.admissions.caltech.edu/content/deadlines-and-forms-first-year-applicants Heres a general statement as to what Caltech looks for: http://www.admissions.caltech.edu/content/admissions-process-first-year-applicantsI think Caltech only admits about 230 new freshmen each year. So it may be difficult. But if you think you have a chance, go ahead and apply.",
"textDesc": "desc for 1"
}
]
}

So how can I get an accurate result?

[ERROR] [Engine$] No engine found. Your build might have failed. Aborting.

Hello,

I've tried building this template on steveny2k/docker-predictionio but I get this error:

[ERROR] [Engine$] No engine found. Your build might have failed. Aborting.

Do you know how to fix this?

Feature or help to build answer suggestions

I want to build something similar to the Quora feature (How does Quora decide how to suggest users to answer?):
- get answer suggestions based on a user's past answers, skills and number of likes (upvotes)

So how can we do this with this template?

Thanks

Id with Match Text

Hello,
I have implemented the API call below, but I only get the ID of the matched text. I want the whole matched sentence for that ID.
How can I get that?

curl -X POST -H "Content-Type: application/json" -d '{"doc": "DJs flock by when MTV ax quiz prog. Five quacking zephyrs jolt my wax bed.", "limit", 3}' http://localhost:8000/queries.json

{"docScores":[{"score":0.0,"id":"8","similarText":"","textDesc":""},{"score":0.0,"id":"14","similarText":"","textDesc":""},{"score":0.0,"id":"7","similarText":"","textDesc":""}]}

Thanks

similarText and textDesc are always empty

Hello,
I have trained this engine template with "showText": true and "showDesc": true enabled,
but similarText and textDesc are still empty in the result.
How can I get values for those fields as well?
Thanks

pio train

[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[INFO] [Engine$] org.template.similarity.TSModel does not support data sanity check. Skipping check.

template requires at least PredictionIO 0.10.0-incubating

[ERROR] [Template$] This engine template requires at least PredictionIO 0.10.0-incubating. The template may not work with PredictionIO 0.9.5.

Train error

After running the "pio train" command I get this error:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0