spencermountain / compromise
modest natural-language processing
Home Page: http://compromise.cool
License: MIT License
Is it possible to add an exception for the following regex?
/c\.(\ ?[0-9]+)/
Right now I'm using a small script to pre-process the text that I want to analyze with nlp_compromise. The current solution that I am using looks like this:
raw = raw.replace(/c\.(\ ?[0-9]+)/g, 'circa $1');
Basically, any c. YEAR will be replaced by circa YEAR, so nlp doesn't trip over that c. While c. alone might not be significant enough to add to the abbreviations list, this expression matches c. NUMBER, which I think is unambiguous enough. What do you think? Is there a way to add this or other similar, case-specific abbreviations?
(I am not proposing to replace c. with circa; that is just my workaround. I am proposing to add, if possible, an exception for c. YEAR so that sentences are not broken at that period.)
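To illustrate the kind of exception being requested, here is a minimal sketch (not the library's API; the helper name is made up) of a check that only treats "c." as an abbreviation when digits follow:

```javascript
// Hypothetical helper: decide whether the period in "c." ends a sentence.
// Only treat "c." as an abbreviation when a number follows (e.g. "c. 1850").
function isCircaAbbreviation(text, dotIndex) {
  // look at the "c." plus whatever comes after the period
  return /^c\.\s?[0-9]/.test(text.slice(dotIndex - 1));
}

console.log(isCircaAbbreviation('built c. 1850 by monks', 7)); // true
console.log(isCircaAbbreviation('etc. and so on', 3));         // false
```

The regex requires a digit after the period, which is what makes the pattern unambiguous enough to special-case.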
I'd consider these words common enough to be included in the lexicon.
Title says it all.
Tried on both my home and work computers; same error.
Also, it looks like the version on npm is still 0.0.7.
Uncaught TypeError: Cannot read property 'match' of undefined
if (w.match(/^(over|under|out|-|un|re|en).{4}/)) {
  var attempt = w.replace(/^(over|under|out|.*?-|un|re|en)/, '')
  return parts_of_speech[lexicon[attempt]]
}
Scenario: using the library for natural language processing in a calendar assistant. It doesn't recognise "schedule" as a verb.
Would it be possible to pass in some configuration when instantiating the library, e.g. an array of verbs, nouns etc., to let users inject extra words?
In my case I might extend the verbs by passing in an array of my own:
["schedule"]
That seems to work for me if I hack the code and add 'schedule' to the list of verbs...but I don't grok grammar well enough to know if it's completely correct (it becomes an infinitive verb, VBP)
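A sketch of the configuration hook being requested (this API does not exist in the library; the names here are hypothetical): merge user-supplied words into the lexicon before tagging.

```javascript
// Stand-in for the built-in word -> POS-tag lexicon.
var lexicon = { walk: 'VB', give: 'VB' };

// Hypothetical extension point: copy user words into the lexicon.
function extendLexicon(extraWords) {
  Object.keys(extraWords).forEach(function (word) {
    lexicon[word] = extraWords[word];
  });
  return lexicon;
}

extendLexicon({ schedule: 'VBP' }); // tag "schedule" as a present-tense verb
console.log(lexicon.schedule);      // "VBP"
```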
Example: nlp.sentences('How are you! That is great.') returns one sentence not two.
The README mentions it, but I don't see it in the exports nor does the current version published to npm have it.
I have yet to go over the code, but that specific example does not yield a result.
see lexidates:
res.dayS = '\b('.concat(Object.keys(res.days).join('|'), ')\b');
When a string becomes a regex in JavaScript, you must double-escape anything with special regex meaning. So \b should be \\b here - see my original code...
If you want to use it as above, you need to pass it through an escaping function first, e.g. Mozilla's:
function escapeRegExp(string) {
  return string.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}
or see dojo's .string utilities ...
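A quick demonstration of why the doubling matters: in a string literal, '\b' is the backspace character, not a word boundary, so the constructed RegExp silently never matches.

```javascript
var days = ['monday', 'tuesday'];

// '\b' in a string is backspace (U+0008), so this pattern contains
// literal backspace characters and can never match normal text.
var broken = new RegExp('\b(' + days.join('|') + ')\b');

// '\\b' survives string parsing as the two characters \b,
// which the RegExp engine reads as a word boundary.
var fixed = new RegExp('\\b(' + days.join('|') + ')\\b');

console.log(broken.test('on monday')); // false
console.log(fixed.test('on monday'));  // true
```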
Very nice library.
When playing with some text pulled from a web article, I noticed that sentence boundary detection does not always work.
For example, the text below is not split into sentences correctly (note there is no space after the periods).
The man who tried to kill former Pope John Paul II 33 years ago showed up at the Vatican on Saturday to put white roses on his tomb and said he wanted to meet Pope Francis.Mehmet Ali Agca, a Turk, left John Paul critically injured after firing several shots in the failed assassination attempt in St. Peter's Square on May 13, 1981.The former pope forgave Agca, once a member of a Turkish far right group known as the Grey Wolves, and went to meet him in 1983 in the Rome prison where he had been sentenced to life imprisonment for the attack.Agca called the Italian daily la Repubblica on Saturday to announce he had arrived in the Vatican, his first visit since the assassination attempt and exactly 31 years after John Paul met him in prison.The visit was confirmed to Reuters by Father Ciro Benedettini, the Vatican's deputy spokesman, who said Agca stood for a few moments in silent meditation over the tomb in St. Peter's Basilica before leaving two bunches of white roses.Agca, 56, was pardoned by Italy in 2000 and extradited to Turkey where he was imprisoned for the 1979 murder of a journalist and other crimes. He was released from jail in 2010.The attack against John Paul, who died in 2005, has remained clouded by unanswered questions over who may have been behind it. An Italian investigative parliamentary commission said in 2006 it was "beyond reasonable doubt" that it was masterminded by leaders of the former Soviet Union.The Vatican on Saturday gave a cool response to Agca's request to meet with Pope Francis. "He has put his flowers on John Paul's tomb; I think that is enough," Vatican spokesman father Federico Lombardi told la Repubblica.
var text = "She was dead. He was ill."
nlp.sentences(text)
// returns only ["She was dead."]
I think it's because the abbreviation regex is picking up ill. as an abbreviation rather than the end of the sentence.
Similarly, nlp.sentences("It was Sunday. He attended mass.") only returns ["It was Sunday."] too.
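A sketch of a stricter check (illustrative, not the library's code): only skip a period when the preceding token is in a known abbreviation list, so short words like "ill" and "mass" still end sentences.

```javascript
// A small whitelist of known abbreviations (illustrative subset).
var abbreviations = ['mr', 'mrs', 'dr', 'st', 'etc', 'vs'];

// A period ends the sentence unless the token before it is a known abbreviation.
function endsSentence(token) {
  var word = token.replace(/\.$/, '').toLowerCase();
  return abbreviations.indexOf(word) === -1;
}

console.log(endsSentence('ill.')); // true  -> split the sentence here
console.log(endsSentence('Dr.'));  // false -> keep reading
```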
In the default mode [without {dont_combine: true}] it would be nice to have phrasal verbs recognized, as they can have a totally new meaning. For example:
My grandfather likes to look back on his childhood.
`look back`
[taken from http://www.englisch-hilfen.de/grammar/phrasal_verbs.htm]
Hm, the last commit does not work properly because in pluralize_rules we have rules for both singular-to-plural AND plural-to-plural, while in singularize_rules there are only plural-to-singular rules (???)
In general I am working on a factory method called "dictionary", based on the "words" and "rules", which our database can auto-translate into several languages, covering the ngrams and metrics etc. And I want to write code against this mode.
For example, in file client_side/nlp.js:5652 (release 1.1.0):
uncountable_nouns = uncountables.reduce(function(h, a) {
h[a] = true
return h
}, {})
ReferenceError: uncountable_nouns is not defined
But it is stated otherwise at https://github.com/spencermountain/nlp_comprimise#named-entity-recognition
We are using nlp_compromise to parse requests for data pulls. In many cases, a product or retailer will get parsed in an undesirable fashion, i.e. "Stop & Shop" will not be thought of as a noun.
Is it possible today, or would it be possible, to allow double-quotes to group words together and default them to a particular part of speech, like NN?
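A sketch of the requested behaviour (hypothetical, no such API exists today): join double-quoted phrases into a single token so the tagger can treat them as one noun.

```javascript
// Hypothetical pre-processing step: collapse double-quoted phrases
// into one underscore-joined token (which could then default to NN).
function groupQuotedPhrases(text) {
  return text.replace(/"([^"]+)"/g, function (match, phrase) {
    return phrase.replace(/ /g, '_');
  });
}

console.log(groupQuotedPhrases('I went to "Stop & Shop" today'));
// I went to Stop_&_Shop today
```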
Regarding a7f1e68 -> src/parents/noun/conjugate/inflect.js, please see e.g.
http://stackoverflow.com/questions/5717126/var-or-no-var-in-javascripts-for-in-loop
It is usually considered better practice to use var in each for loop...
Personal opinion, just saying.
Hey there,
when I initially compared the data, it ignored the prepositions (IN) due to a typo, and our db splits them into pre- and postpositions. I've got that sorted now, and found some prepositions which are listed in other categories:
"before": is also CC,
"round": is also JJ,
"apart": is also RB (but can be preposition "apart from this" OR postposition "this apart")
And we list some prepositions which are not in the array yet (I did NOT check other categories):
[
{en: 'a'},
{en: 'an'},
{en: 'abaft'},
{en: 'abeam'},
{en: 'aboard'},
{en: 'absent'},
{en: 'afore'},
{en: 'alongside'},
{en: 'amidst'},
{en: 'amongst'},
{en: 'anenst'},
{en: 'apropos'},
{en: 'apud'},
{en: 'aside'},
{en: 'astride'},
{en: 'athwart'},
{en: 'atop'},
{en: 'barring'},
{en: 'beneath'},
{en: 'beside'},
{en: 'beyond'},
{en: 'but'},
{en: 'chez'},
{en: 'circa'},
{en: 'concerning'},
{en: 'excluding'},
{en: 'failing'},
{en: 'following'},
{en: 'for'},
{en: 'forenenst'},
{en: 'given'},
{en: 'including'},
{en: 'inside'},
{en: 'like'},
{en: 'mid'},
{en: 'midst'},
{en: 'minus'},
{en: 'modulo'},
{en: 'near'},
{en: 'next'},
{en: 'notwithstanding'},
{en: 'opposite'},
{en: 'outside'},
{en: 'pace'},
{en: 'past'},
{en: 'plus'},
{en: 'pro'},
{en: 'qua'},
{en: 'regarding'},
{en: 'sans'},
{en: 'save'},
{en: 'times'},
{en: 'toward'},
{en: 'underneath'},
{en: 'unto'},
{en: 'worth'},
{en: 'together', description: 'questionable'},
{en: 'vis-à-vis', description: 'questionable'},
{en: 'thru', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'thruout', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'till', description: 'same as "until", wikipedia: "with prosodic restrictions"'},
{en: 'versus', description: 'NAB conflict: commonly abbreviated as "vs.", or (law or sports) as "v."'},
{en: 'vice', description: 'used as "in place of"'},
{en: 'with', description: 'sometimes written as "w/"'},
{en: 'w/', meta: {entitySubstitution: ['en']}},
{en: 'within', description: 'sometimes written as "w/in" or "w/i"'},
{en: 'w/in', meta: {entitySubstitution: ['en']}},
{en: 'w/i', meta: {entitySubstitution: ['en']}},
{en: 'without', description: 'sometimes written as "w/o"'},
{en: 'w/o', meta: {entitySubstitution: ['en']}},
{en: 'o\'', description: 'apocopic form of "of"', meta: {entitySubstitution: ['en']}}
]
btw - a nice one: https://www.youtube.com/watch?t=108&v=MHX-CiJBVy0
I think maybe this is not working correctly, but since it seems broken for a whole class of verbs, maybe I'm missing something...
nlp.verb('study').to_past()
"studyed"
nlp.verb('apply').to_past()
"applyed"
I'm writing an AngularJS module for this - https://github.com/Kroid/angular-nlp-compromise - in case someone needs it.
I'm terrible with GitHub, and I'll probably screw stuff up trying to do this myself. But anyway, these need adding. Thanks!
Are you using your own internal dictionary / algorithms to do the transformations etc.? If so, and I believe this is the case, there is something off with the conjugation of the verb "load":
{ infinitive: 'loa',
present: 'loads',
past: 'loaded',
gerund: 'loading',
doer: 'loaer',
future: 'will loa' }
Now if I try something else, like "to load":
{ infinitive: 'to load',
present: 'to loads',
past: 'to loaded',
gerund: 'to loading',
doer: 'to loader',
future: 'will to load' }
This doesn't seem right either. Am I doing something wrong with the entry of the string word(s) - some form I am missing? Or is this a corner case in the algorithm, perhaps? Thought I'd at least report it 📦
Otherwise, great solution: much thanks!
for example:
[Orig]
They are based on different physical effects use to guarantee a stable grasping between a gripper and the object to be grasped.
[negate]
They are not based on different physical didn't effects use to doesn't guarantee a stable grasping between a gripper and the object to be grasped.
Maybe negating just the first verb would be sufficient.
For normalizing the input: how about normalizing all typographic characters, like curly and special quotes, to their plain equivalents?
Maybe useful, e.g. for turning O’Reilly into O'Reilly etc.
see http://practicaltypography.com/straight-and-curly-quotes.html
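A minimal sketch of such a normalization step (the character ranges are the Unicode curly-quote code points; this is not the library's code):

```javascript
// Replace typographic quotes with their plain ASCII equivalents.
function normalizeQuotes(str) {
  return str
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'")  // curly/low single quotes
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"'); // curly/low double quotes
}

console.log(normalizeQuotes('O\u2019Reilly said \u201Chello\u201D'));
// O'Reilly said "hello"
```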
And a note to myself: maybe it would be useful to write a "preprocess" test, checking that everything in .js and .min.js ("expanded") is the same.
Hey,
sorry for opening a new one (trying to separate the different questions) :
This is about possessive pronouns (PP), which were not covered in the initial comparison (same reason: they split into different categories in our db - I've made the db compatible now; when you look for PP it will join all 3 categories). I am pasting the question I just added to the code (not sure if there are better expressions for the cases):
// TODO - this covers more than the original :
// possessive pronouns (should) have 3 forms :
// as a possessive (adjective) determiner pronoun (my) OR
// as a possessive (noun) pronoun (mine) OR
// as a reflexive pronoun (myself)
What do you think?
btw: some changes you proposed in contributing.md have also made it into the fork.
E.g. it now has JSDoc documentation (WIP, standard template for now) ...
nlp.pos("he's eating a veggie burger").sentences[0].negate().text();
'he's isn't eating a veggie burger'
adding a quick fix...
First off, this is such a great project!
Do you have any thoughts on returning an array of dates from a parsed sentence? Or more advanced logic like ranges?
It looks like this does a lot of what I've done in https://github.com/silentrob/normalizer but I suspect much faster (basic normalization and commonwealth => american conversion).
I also have some code that deals with numbers and parsing math expressions here: https://github.com/silentrob/superscript/blob/master/lib/math.js.
On version 1.1.3, if you try typing in something like:
require("nlp_compromise").britishize("color");
> "color"
require("nlp_compromise").britishize("favorite");
> "favorite"
require("nlp_compromise").britishize("internationalization");
> "internationalization"
It just returns whatever the input is.
The americanize function works perfectly fine, though.
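The expected behaviour, sketched (the word pairs below are examples for illustration, not the library's real data file): britishize should apply the inverse of the americanize mapping.

```javascript
// Illustrative american -> british spelling pairs.
var american_to_british = {
  color: 'colour',
  favorite: 'favourite',
  internationalization: 'internationalisation'
};

// Look the word up in the mapping; fall through to the input otherwise.
function britishize(word) {
  return american_to_british[word] || word;
}

console.log(britishize('color'));    // "colour"
console.log(britishize('favorite')); // "favourite"
```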
Below is the stack trace --
npm http GET https://registry.npmjs.org/nlp_comprimise
npm http 304 https://registry.npmjs.org/nlp_comprimise
npm http GET https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm http 404 https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! fetch failed https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! Error: 404 Not Found
npm ERR! at WriteStream.<anonymous> (/usr/local/Cellar/node/0.10.25/lib/node_modules/npm/lib/utils/fetch.js:57:12)
npm ERR! at WriteStream.EventEmitter.emit (events.js:117:20)
npm ERR! at fs.js:1596:14
npm ERR! at /usr/local/Cellar/node/0.10.25/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:103:5
npm ERR! at Object.oncomplete (fs.js:107:15)
npm ERR! If you need help, you may report this *entire* log,
npm ERR! including the npm and node versions, at:
npm ERR! <http://github.com/isaacs/npm/issues>
npm ERR! System Darwin 13.0.0
npm ERR! command "/usr/local/Cellar/node/0.10.25/bin/node" "/usr/local/bin/npm" "install" "nlp_comprimise" "--save"
npm ERR! cwd /Users/WS/nlp/natural
npm ERR! node -v v0.10.25
npm ERR! npm -v 1.3.24
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /Users/WS/nlp/natural/npm-debug.log
npm ERR! not ok code 0
From Kroid/angular-nlp-compromise#1 :
nlp.spot("joe carter loves toronto");
From docs:
nlp.spot("joe carter loves toronto")
// ["joe carter", "toronto"]
I checked it from the Chrome console on the example page http://rawgit.com/spencermountain/nlp_compromise/master/client_side/cute_demo/index.html
Please note that he's and she's become ['he', 'is'] and ['she', 'is'], but they could also be ['he', 'has'] and ['she', 'has'] (stackexchange).
• How about `it's`?
• Shouldn't the negative contractions be handled here too?
"cannot": ["can", "not"] is the only one.
But how about stuff like
"shouldn't": ["should", "not"]
This would affect logic_negate, I assume.
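A sketch of the suggested mapping (the word list is illustrative, not the library's data):

```javascript
// Negative contractions expanded to their two-word forms.
var negativeContractions = {
  "shouldn't": ['should', 'not'],
  "wouldn't":  ['would', 'not'],
  "couldn't":  ['could', 'not'],
  "isn't":     ['is', 'not'],
  "cannot":    ['can', 'not']
};

// Expand a contraction if we know it; otherwise return the word as-is.
function splitContraction(word) {
  return negativeContractions[word.toLowerCase()] || [word];
}

console.log(splitContraction("shouldn't")); // [ 'should', 'not' ]
console.log(splitContraction('dogs'));      // [ 'dogs' ]
```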
Using this from node gives an error:
Error: Cannot find module './dates'
Steps to reproduce:
nlp = require('nlp_compromise');
Amazing library, thanks guys!
Please add your lib to the bower registry: http://bower.io/ http://bower.io/docs/creating-packages/
Heya,
This is a wonderful library. I'm hoping to use it to extract dates in a project, but I noticed that January consistently fails to be extracted properly in tests. I'm wondering if this is a subtle bug with indexes / accidental type coercion of 0 to false in date_extractor.coffee.
I'd be happy to help you track down the issue if you have trouble.
Here is an example I just tried on master:
nlp.value("Today is January 7, 2015").date()
{ month: null,
day: 7,
year: 2015,
to_day: null,
to_year: 2015,
to_month: null }
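The suspected bug, isolated (a sketch, not the extractor's actual code): months are 0-indexed, so January is 0, and a plain truthiness test treats it as "no month found".

```javascript
// Buggy pattern: 0 is falsy, so January falls through to null.
function pickMonthBuggy(month) {
  return month ? month : null;
}

// Fixed: test for null/undefined explicitly so 0 survives.
function pickMonthFixed(month) {
  return month != null ? month : null;
}

console.log(pickMonthBuggy(0)); // null -> January is lost
console.log(pickMonthFixed(0)); // 0    -> January survives
```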
"Twenty five" should register as one number;
"sixteen one" should not.
Hey there,
again: this is not an issue.
The changes recently done are totally fine, but let me explain why I made (or am planning to make) certain changes in the fork https://github.com/redaktor/nlp_compromise
As a European I would love this project to be as multilingual as possible ;)
The changes have these goals:
• for contributing: be fully self-explanatory and readable
• for transport: be browser-friendly and thus very small
• completely separate data / language logic / project logic
Three new files in src/data:
dictionary.js - the file where we can contribute multilingual words in the categories, like in the readme.
dictionary_rules.js (tba) - the file where we can contribute multilingual rules.
_build.js - builds the data modules for one/some/all languages. This could also be the first grunt step. It will generate or overwrite a folder like 'en'. Check it out: node _build -l
Basically I am planning to let the build script generate a customized client-side file and additional AMD browser modules.
See for instance the module.exports lines: there are more than 30 of them, but they are useless in the browser, and apart from that I'd optimize the compression for the browser a bit further.
I am also trying to avoid further duplicates. For example, in phrasal verbs: some verbs are already in the verb data module and some adjectives are already in the adj. module ...
When it is complete:
• each module, e.g. in /parents, should only be a little bit of 'project logic'
• our database can auto-translate
• I could attach our web interface to encourage translators even more ;)
Hey,
please see https://github.com/spencermountain/nlp_compromise/blob/master/src/parents/noun/index.js
referenced_by uses the var posessives (typo?). It is defined in the scope of the module and is
{
"his": "he",
"her": "she",
"hers": "she",
"their": "they",
"them": "they",
"its": "it"
}
while reference_to uses the var possessives, defined in the scope of the function, which is just
{
"his":"he",
"her":"she",
"their":"they"
}
Shouldn't both be the same and maybe
{
mine: 'i',
yours: 'you',
his: 'he',
her: 'she',
its: 'it',
our: 'we',
their: 'they',
them: 'they'
}
?
Hey there,
contributing from my fork doesn't make sense because the structure will soon change to 'only the 3 dictionary files and a factory'.
However, let me ask some performance questions.
Maybe I missed something hidden in the code, but several 'autoclosure' (IIFE) functions run every time a module is required.
Let's take an example: the conjugation of verbs, which is used quite often.
I'll use simple console.log calls to demonstrate it.
Put some logs in the conjugate function:
the.conjugate = function() {
  console.log('BEWARE! conjugate is conjugating');
  verb_conjugate = require('./conjugate');
  var conjugated = verb_conjugate(the.word);
  console.log('conjugate result', conjugated);
  return conjugated; // verb_conjugate(the.word);
}
and in the 'autoclosure' form function
the.form = (function() {
  console.log('BEWARE! the.form is conjugating');
  verb_conjugate = require('./conjugate');
  // don't choose infinitive if infinitive == present
  var order = [
    'past',
    'present',
    'gerund',
    'infinitive'
  ];
  var forms = verb_conjugate(the.word);
  console.log('forms result', forms);
  for (var i = 0; i < order.length; i++) {
    if (forms[order[i]] === the.word) {
      return order[i];
    }
  }
})()
When I do
console.log( nlp.verb('last') );
it will conjugate (even though nothing asked for a conjugation), and when I do
console.log( nlp.verb('last').conjugate() );
it will conjugate twice.
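One way to avoid the double work, sketched as generic memoization (the names are illustrative, not the library's API): conjugate each word at most once and let both .form and .conjugate() share the cached result.

```javascript
// Cache the result of an expensive single-argument function per word.
function memoize(fn) {
  var cache = {};
  return function (word) {
    if (!(word in cache)) {
      cache[word] = fn(word);
    }
    return cache[word];
  };
}

var calls = 0;
var conjugate = memoize(function (word) {
  calls += 1; // stands in for the expensive rule matching
  return { past: word + 'ed' };
});

conjugate('last');
conjugate('last');
console.log(calls); // 1
```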
Sentence:
joe carter plays patiently in toronto
Steps to reproduce:
Result:
joe carter didn't playe patiently in toronto
Currently there is no way to use nlp.ngram() to perform a simple word-frequency calculation (i.e. ngrams of size one). Setting the max_size option to 1 produces ngrams of size 2; setting max_size to 0 gives the same result. I suspect these two lines are responsible: https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L11 (where max_size is incremented - why?) and https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L6 (where, since 0 is falsy, max_size is assigned the value 5).
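The two suspected problems, isolated in a sketch (not the module's actual code): `requested || 5` turns a requested 0 into 5, and an off-by-one increment shifts every requested size up by one.

```javascript
// Reproduces the suspected behaviour of the two lines in ngram.js.
function resolveMaxSizeBuggy(requested) {
  var max_size = requested || 5; // 0 and undefined both become 5
  return max_size + 1;           // the questionable increment
}

// One possible fix: default only when the option is truly absent.
function resolveMaxSizeFixed(requested) {
  return requested === undefined ? 5 : requested;
}

console.log(resolveMaxSizeBuggy(1)); // 2 -> bigrams instead of unigrams
console.log(resolveMaxSizeFixed(1)); // 1
```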
Pretty self-explanatory: if there are multiple whitespace characters between words, sentence detection collapses these characters together.
Hey, I am quite new to Meteor and to computer science generally. I am a linguist trying to learn computational linguistics, and I guess I am still struggling. I was wondering how it would be possible to use this on my own corpus? Let's say that I have a list of sentences, and whenever I choose a sentence I want to see its properties. Would that be possible?
I'm new to nlp and am weak on my grammar, so maybe I'm barking up the wrong trees.
I'm using nlp_compromise
to switch the verb-tense in sentences from past to present, or present to past, using nlp.verb(vb).to_present()
or nlp.verb(vb).to_past()
as required.
It's working great for the most part, except when I try to swap the tense of "They are friends" or "They were friends".
Is there some other way I should be going about this, am I using the wrong tools, or is this something that can be extended with some new rules?
Hi,
in date_extractor.js, lines 24 to 35, the replace regex turns dates in the format "Feb. 14, 1969" into "February14, 1969" (no space between the month and the day), leading the parser to skip the date and match only the year.
Fixed by surrounding the replaced month names with spaces:
text = text.replace(/ Feb\.? /g, ' February ');
text = text.replace(/ Mar\.? /g, ' March ');
text = text.replace(/ Apr\.? /g, ' April ');
text = text.replace(/ Jun\.? /g, ' June ');
text = text.replace(/ Jul\.? /g, ' July ');
text = text.replace(/ Aug\.? /g, ' August ');
text = text.replace(/ Sep\.? /g, ' September ');
text = text.replace(/ Oct\.? /g, ' October ');
text = text.replace(/ Nov\.? /g, ' November ');
text = text.replace(/ Dec\.? /g, ' December ');
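The same fix can be written table-driven (a sketch, under the same assumption that the abbreviations appear space-delimited): keeping the surrounding spaces means "Feb. 14" expands to "February 14", not "February14".

```javascript
// Abbreviation -> full month name lookup table.
var months = {
  Feb: 'February', Mar: 'March', Apr: 'April', Jun: 'June',
  Jul: 'July', Aug: 'August', Sep: 'September',
  Oct: 'October', Nov: 'November', Dec: 'December'
};

// Replace each abbreviated month, preserving the spaces around it.
function expandMonthAbbreviations(text) {
  return text.replace(/ (Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? /g,
    function (match, abbr) { return ' ' + months[abbr] + ' '; });
}

console.log(expandMonthAbbreviations('Born on Feb. 14, 1969'));
// Born on February 14, 1969
```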
I looked through a variety of files but haven't found either a list or a method where I can append known people/organization names for recognition via .spot -- does this exist and I'm just missing it?
@spencermountain
Please see this demo http://expresso-app.org/tutorial ...
I made the same demo with your nice project, more or less lazily, by porting the "python metrics logic" to .js.
The advantages are: .js only and it updates on keypress ... Think of a better http://www.hemingwayapp.com ;))
I will work on it later today. I also pointed the author of expresso to your project.
The method could be contributed either as a .metrics() function at the "root level" used in a demo, or as a standalone demo. Just tell me if you are interested by writing to @redaktor (I'll close this directly).
Thank you for starting to produce this missing JavaScript puzzle piece!
Hey,
I just committed nearly the last changes to the fork
https://github.com/redaktor/nlp_compromise
before I can open a pull request.
I still need to eliminate:
• the 'hardcoded' dups in lexicon generation
• the last 37/1360(?) failing tests
The lexicon will then be at least 10% smaller, and I really think that, starting with this structure, language-dependent contributing can become easy.
Just mentioning it because I saw you were recently active ...
Hi,
I'm getting this error when trying to parse certain strings.
I've put in a hack for the function to always return null, as I'm not using date extraction, but it's not a real fix.
/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224
h[k] = arr[places[k]];
^
TypeError: Cannot read property '1' of null
at /node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224:21
at Array.reduce (native)
at Object.regexes.process (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:223:36)
at main (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:334:20)
at the.date (/node_modules/nlp_compromise/src/parents/value/index.js:13:11)
at /node_modules/nlp_compromise/src/parents/value/index.js:38:11
at new Value (/node_modules/nlp_compromise/src/parents/value/index.js:45:4)
at Object.parents.value (/node_modules/nlp_compromise/src/parents/parents.js:22:10)
at /node_modules/nlp_compromise/src/pos.js:366:47
at Array.map (native)