aigents / aigents-java Goto Github PK

Aigents Java Core Platform

License: MIT License

Java 44.01% HTML 19.02% Shell 0.04% JavaScript 5.23% CSS 0.12% PHP 8.01% M4 7.85% Makefile 0.05% OpenEdge ABL 0.04% Roff 12.47% Assembly 2.86% Yacc 0.25% Perl 0.05%

aigents-java's People

Contributors

Stargazers

Watchers

Forkers

beranitservice rssdev10 dagiopia alexei-gl akolonin vasvl123 socioprophet extremely-professional-bot-development zeionara quarymen

aigents-java's Issues

Prevent forgetting of manually authored things without topics

In extension to #9
Make sure forgetting does not remove authored news without topics (by 'authors' link)
Options:

setting default topic author's topic for the news without explicit topics (use area or author by default)
enforcing the author's topic top be selected by the author (with default as above)
use author-authors property - will work if is (topic) is not set (best option !!!???)
1.1. set author true when authoring (needed by #6 )
1.1. make sure referral by complex id is working - need to have Thing.getQuery() generating unique reference by key attributes
1.1. forgetting - check if author is present and refers to trusted authors
1.1. scope by the author as it were a thing!?

Provide images more relevant to texts and headers of the news items

Real Problem:
Currently, the value of image supplied for news items with values of title, text, and sources (link) may be relevant to the text and title or not. This is because the image is located with ContentLocator based on logic found in Matcher:
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Matcher.java#L170
The logic expects proximity of the image to the located text in terms of raw HTML text, and not on spatial proximity in terms of visual appearance in HTML browser or semantic proximity from human point of view.

Need to search for a way to improve the current behavior, while we can't have HLAI exposed to virtual pages generated by a virtual browser and pretending the HLAI is seeing the texts and images the same way as humans do.

Possible Solutions:

Evaluate image proximity by title, if present, and only if the title is not present, then use the text.
Give precedence to larger images, so if there are two images that are close to text (or title), use the larger image. Possibly, use complex metric of "applicability" of an image where "applicability" = "size" / "distance", so closer are and larger images are appearing more applicable - but this will need to load and analyze images or image attributes at least (bearing in mind that attributes may be missed in HTML).
Try to use proximity based om positions is parsed/stripped text, instead of proximity based on positions in eaw HTML.
Disregard wide and tall images, one where width > height * 2 or height > width * 2 - but this will need to load and analyze images or image attributes at least (bearing in mind that attributes may be missed in HTML).
Simulate 2D layout computation algorithm employed by web browser, with account to HTML and CSS specifications so every matched text and ever image on a page are given 2D coordinates, then we can do proximity computation based on visual distance. Make sure the distance is computed in regard to image boundaries and not image centers (otherwise smaller images may be gaining precedence).
Consider relying on extra hints in HTML structure even though this is expected to be very unreliable, being obscured with css styling policies.
TBD any other options that would come to mind...

Let “public area sharers” to share news without explicit "shares" to peers

Need: Make it possible for “public channel owners” to share news to those who “trust” them with no need of the explicit setting to “share” news to each of the “trusters” individually (based on presence of "shares" option)

Task: Make news item available for trusting peers in case if sharing peer either
a) is sharing it to trusters (like it is done now)
b) has shared areas in shares and areas list (TODO)

Smart Web Page Analysis

Goal
There is a need to refactor/extend existing HTML stripper to have textual and semantic information extraction more reliable than it is currently happening in legacy HtmlStripper https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java
Each of the following sub-tasks may be considered as a separate issue and respective project.

Sub-tasks

There is a need to extract schema.org embeddings in any possible representation (JSON-LD/microdata/RDFa)
There is a need to extract structural information from the HTML markup
There is a need to extract spatial html+css information from the loaded web page
There is a need to extract DOM representation from web pages dynamically created by javascript/DHTML
There is a need to extract semantic relationships from web pages, same as would be encoded with 1 (above) but using NLP and text mining techniques accompanied with 2, 3, 4 (above)

Sub-task details
1. There is a need to extract schema.org embeddings in any possible representation (JSON-LD/microdata/RDFa)
Many of modern web pages contain lots of semantic information not visible to the human eye of a web user, according to specification in https://schema.org/ - the parser may be capable to extract this information when loading the page and apply the monitoring/extraction policies to the explicit semantic graph data rather than plain text.

2. There is a need to extract structural information from the HTML markup
The existing HTML stripper blindly removes HTML tags, having some of them replaced with periods which makes it possible to do account for sentence and paragraph boundaries when doing the text pattern matching - in some cases. However, the use of HTML tags is site-specific and developer-specific, so this may not work in some cases. Fore more precise identification of the sentence boundaries, the hierarchical structure of a HTML document should be preserved in the stripped text, so the sentence/paragraph boundaries should be detected based on the structure of the hierarchical text and not on the presence of the tags.

3. There is a need to extract spatial HTML+CSS information from the loaded web page
In some cases, the above maybe not enough because the relevance of particular pieces of texts to the images, links and even each other may be based not on spatial relationships between them in the HTML text body and even not in the hierarchical structure of it, but rather on 2-dimensional spatial proximity, provided by HTML+CSS markup rendered by the browser (with the account to screen resolution and layout). That means, the Ideal Web Page Analyser would simulate the real web browser computing coordinates of pixels for every element and scarping the screen elements the same way as a human eye would do.

4. There is a need to extract DOM representation from web pages dynamically created by JavaScript/DHTML
All of the above may not work in case of WebPages generated by DHTML (such as https://aigents.com/ for instance), so there is a need to simulate the browser executing the complete suite of WWW technologies including CSS and JavaScript like it is done by Selenium WebDriver and WebKit - the simplex example of how it could be done is provided by https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/util/WebKiter.java

5. There is a need to extract semantic relationships from web pages, same as would be encoded with 1 (above) but using NLP and text mining techniques accompanied with 2, 3, 4 (above)
Since we can extract semantic relationships from the raw web page according to 1 (above), the entire process of Aigents web monitoring may be changed so the framework expects WepPage to be stripped down not to the plain text (like the HtmlStripper currently does), but rather to a subgraph of semantic relationship (like the Matcher is expected to do) - involving all of the techniques 2,3,4 (above). In such a case, we would end up with a design with semantic parsing of every web page and then subgraph monitoring and extraction applied to the page.

Aigents Chrome Plugin

Aigents Chrome Plugin can be created for seamless integration with Web browsing and transparent reinforcement learning. Substantial part of existing Aigents client JavaScript code is expected to get re-used: https://github.com/aigents/aigents-web/tree/master/html/ui
Like in #25 , the user should have the ability to change the "home Aigents Server" destination in the plugin settings, so the same plugin can be used to access some public "Aigents Servers" as well as private "Aigents Servers" owned by user or user's company, for example.
The important possibility of such plugin is that user would be able to explicitly point to particular pieces of the content in the browser to the Aigetns while the Aigents would be able to do the same for the user so the efficiency of Aigents' help to user as well as user's ability to train Aigents would get enormously increased.

Reputationer - predictiveness internals are rounded up to 1

The following Python code (based on https://github.com/singnet/reputation) is showing the individual ratings by period as well as predictiveness values used for blending are being rounded up to 1.0, which have to be fixed eventually:

import unittest
from datetime import datetime, date
import time
import logging
import pandas as pd
import numpy as np
from reputation_service_api import *
from reputation_calculation import *
from reputation_base_api import *
from aigents_reputation_api import AigentsAPIReputationService

rs = AigentsAPIReputationService('http://localtest.com:1180/', '[email protected]', 'q', 'a', False, 'test', True)
#rs = PythonReputationService() ###Change
rs.clear_ranks()
rs.clear_ratings()
dt1 = date(2018, 1, 1)
dt2 = date(2018, 1, 2)
dt3 = date(2018, 1, 3)
dt4 = date(2018, 1, 4)
rs.set_parameters({'default':0.5,'decayed':0.5,'conservatism':0.25,
'fullnorm':False,'logratings':False,'liquid':True,'rating_bias':False,'predictiveness':1,
'aggregation':True})
rs.put_ratings([{'from':'1','type':'rating','to':'4','value':0.5,'weight':10,'time':dt1}])
rs.put_ratings([{'from':'2','type':'rating','to':'5','value':1.0,'weight':10,'time':dt1}])
rs.put_ratings([{'from':'3','type':'rating','to':'6','value':0,'weight':10,'time':dt1}])
rs.put_ratings([{'from':'2','type':'rating','to':'5','value':1.0,'weight':10,'time':dt1}])
rs.update_ranks(dt1)
ranks = rs.get_ranks_dict({'date':dt1})
ranks#,{'4': 90.0, '5': 100.0, '6': 14.0})
rs.put_ratings([{'from':1,'type':'rating','to':'5','value':0.75,'weight':10,'time':dt2}])
rs.put_ratings([{'from':2,'type':'rating','to':'6','value':0.25,'weight':10,'time':dt2}])
rs.put_ratings([{'from':3,'type':'rating','to':'4','value':0.75,'weight':10,'time':dt2}])
rs.update_ranks(dt2)
ranks = rs.get_ranks_dict({'date':dt2})
print("my ranks:",ranks)

Aigents Web Client Plus

The existing Aigents Web Demo User Interface present in https://github.com/aigents/aigents-web and available https://aigents.com/ is 3 years old and does not seem attractive enough for many. We consider changing it which would take 2-3 person-months project, presumably keeping use of jQuery used by current Web UI so the most of JavaScript code may get re-used: https://github.com/aigents/aigents-web/tree/master/html/ui

Aigents Lite Client App for iPhone

Like #25 for Android, Lite Client app can be created for iPhone - either re-using existing JavaScript https://github.com/aigents/aigents-web/tree/master/html/ui or porting it to native Objective C

Question-Answering Engine based on Natural Language Generation

Overall task and design:
Based on #22, we need to provide an extended version of the Question Answering to replace or texted the current placeholder:
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Answerer.java
The code may go to org.aigents.nlp.qa or to respective package of the Aigents Platform Core.
There are few things to be done, written in the following pseudo-code to be refined during the implementation phase:

interface Indexer {
    void clear();//clears the current index
    void index(String text);//indexes text in the internal model where the model can be any
    Linker retrieve(String query);//retrieve the ranked list of relevant words based on the single query applied to the scope of all texts indexed by date, see https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Linker.java 
}

//Candidate implementation of the Indexer relying on the existing code
class GraphIndexer implements Indexer {
    Graph graph;//see https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Graph.java 
    int Mskip = 2;//width of skipping window to build word pairs
    // will be used to index any number of input texts in a graph object
    @Override
    index(String text){
        // tokenise text with Parser.parse https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Miner.java#L580
        // build word-word links based on per-sentence word pairs co-occurring in a distance of Mskip using link types "pred" and "succ" and store them in a graph with link weight set as W = Mskip / distance (so the closer words are given larger weight, the closest word weighted as Mskip and the most distant word weighted as 1) 
    }
    @Override
    Linker retrieve(String query){
        // tokenize query with Parser.parse https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Miner.java#L580
        // compute the ranks of nodes in the graph using algorithm GraphOrder.directed https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-graph.js#L537 (need to add this function to Graph class) initialized with word nodes found in the query, with every word node weight to be 1 denominated with word frequency from https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/LangPack.java#L85.  
        // retrieve the computed ranks of words from Graph and return in Linker implementation such as https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Counter.java or https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Summator.java having it returned  
    }
}

class AnswerGenerator extends Answerer { //to be re-used in https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Answerer.java 
    Indexer indexer; //see above
    Generator generator; //see above 
    in max words;//configured hard cap limit on number of words to be used to build the reply
    String answer(String query){
        Linker words = indexer.retrieve(query);
        if (words == null || words.size() ==0)
            return "No.";
        Collection<String> top = getTopWordsFromLinker(linker);
        String response = generator.generate(top); //see #22 
        return response;
    }
}

Task outline:

Complete #22
Implement the above
Find the baseline/train/test set for Question Answering from Kaggle or papers online
Fine-tune the design, implementation, and parameters to provide results reasonable according to item 3 above
Integrate with Aigents chat-script functionality
5.1. Extend, replace or override the existing Aigents Answerer https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Answerer.java using Intenter plugin replacement design https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/agent/Demo.java#L82
5.1.1. Solve the simplest summarization problem so given a single text as an input and few words as a seed, a brief summary out of the larger text body is created like with public static String summarize(java.util.Set words, String text) function in https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Answerer.java#L163
5.1.2. Solve the more complex answering problem where multiple texts are given and need to extract the relevant summary answering the question from the combination of the multiple text bodies, like with Collection searchSTMwords(Session session, final SearchContext sc) function in https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Answerer.java#L82
5.2. Extend unit test such as https://github.com/aigents/aigents-java/blob/master/php/agent/agent_chat.php
5.3. Test in Telegram chat-bot
5.4. Consider if some code should be moved to Aigents Core Platform from the org.aigents.nlp.qa
5.5. TBD
TBD

References:
https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271

Refactor Siter so extensions can override it and add custom plugins/adapters for online/social media processing

What is done so far:

Imager is moved to separate class and renamed to ContentLocator
Siter has constructors changed and init function changed with respective argument revamping

What will come next:
3. Siter will be split in Siter and WebCrawler.
4. Siter will hold overall crawling framework and be configurable at Body level so one can create/extend/override it
5. WebCrawler will do actual web crawling and implement Crawler interface (this interface will be also implemented by Redditer, Twitterer and Discourser), so you could extend override the WebCrawler itself or add custom Crawlers
6. The current readChannel method of Redditer, Twitterer and Discourser will be moved to Crawled interface renamed to "crawl"
7. RSSer will be created and implementing Crawled interface as an example of custom crawler (do can do Arxiv and PsyArxiv plugins)

Aggregated content generation - summarisation in digests

Currently, Aigents extracts pattern-based "news items" on per-topic (is attribute) and per-url (sources attribute) basis for a specific day (times attribute), which are represented as short excepts.

Task: Create aggregated content generation based on the "news items" found above

Level 1: Simple aggregation: can be defined as a user-specific property on how to cook the news for every specific user - using "news aggregation" property with 5 values (none, summary, overview, digest, history)

no aggregation - "none"
aggregation per day+url+topic - "summary"
aggregation per day+url - "overview"
aggregation per day+topic - "digest" (similarly to the format of the digests currently sent by Aigents email notifications)
aggregation per topic (across days) - history

Level 2: Complex formation: In addition to the above, combinations of the topics corresponding to each other and clusters of related topics can be used together with LinkGrammar-based formal grammar (and possibly some underlying ontology) to generate literary content generation describing novel (salient and "surprising") combinations of topics - based on progress with #22 .

Aigents Android Lite Client

As a simplified version of #24, we may have lightweight Aigents client with native Android user interface exposing the functions present in https://github.com/aigents/aigents-web and exposed at https://aigents.com/
The lightweight client may be optionally done in JavaScript instead of the Java.
In this case, the client data will be stored in the cloud (as opposed to #24 where the data is stored on the mobile device), but the user should have ability to change the "home Aigents Server" destination in the application settings, so the same client application can be used to access some public "Aigents Servers" as well as private "Aigents Servers" owned by user or user's company, for example.

Telegram reputation and reports

Reputation computation (DONE)
Reports (DONE)
Profiling news creating topics (TODO)
SSO login (TODO)

Ethereum billing support

Need to provide integration to Ethereum, so the payments can be conducted in ETH and accuonted by billing - the same way it is done to PayPal
https://github.com/aigents/aigents-java/tree/master/src/main/java/net/webstructor/comm/paypal
https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-wui.js#L1055

May be integrated with existing Infura-based Ethereum logging and analysis support
https://github.com/aigents/aigents-java/tree/master/src/main/java/net/webstructor/comm/eth

GooglePay support

Need to provide integration to GooglePay - the same way it is done to PayPal
https://github.com/aigents/aigents-java/tree/master/src/main/java/net/webstructor/comm/paypal
https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-wui.js#L1055

Natural language production based on formal grammar

Overview:
In the end, ideally, we want the natural language text to be produced in a quality higher than provided by modern conversational intelligence chatbots (such as https://replika.ai/ ) however we want the AI to be "explainable" ("interpretable"), like presented in https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271

The language production should be based on underlying ontology plus formal grammar, even though we may use ML/DL to create these underlying ontology and formal grammar and we may use NN (such as graph networks) to operate with these underlying ontology and formal grammar. It is intended to serve an extended solution for tasks #34 and #21.

Goals:
Anyhow, as part of the whole NLP pipeline, we should be able, given a finite list of words (or semantic concepts associated with these words) combined with a formal grammar for a natural language (such as English or Russian), produce a grammatically valid sentence or series of sentences - that is the scope of this particular task.

Tentative TODO items:

Decide with formal grammar to use - it should be both human-readable and machine-readable, be adopted by the community, and must have language models for at least English and Russian. Link Grammar (LG) is the first candidate but other options may be considered). - Decided to use LG.
Implement a loader of the formal grammar (e.g. Link Grammar) dictionary file format (or find an existing implementation in Java or port an exiting implementation in other language) so any of the existing dictionaries can be loaded into java memory or internal database for further processing. Initial implementation should be done in Java (so it can be incorporated in the Aigents project) but later it can be ported to Python for other applications. The implementation should be accompanied by unit tests and me be placed in "aigents-java" repository or separate "aigents-java-nlp" repository under the "agents" project. As a result of this task item, we would get an "internal" API to get the LG rules given a word as input (like function Collection getRules(String word);).
2.1. Start with the grammar file https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict and read it along with the manuals until have a solid understanding of how it works; - DONE
2.2. Design Java structures/classes/containers to keep the loaded LG dictionary in memory; - DONE
2.3. Implement a simplified version of LG loader capable to parse http://langlearn.singularitynet.io//test/nlp/poc-english_5C_2018-06-06_0004.4.0.dict.txt referring to JavaScript parser https://github.com/aigents/aigents-java/blob/master/html/graph.html#L157 which can be tested in a web browser at "View Link Grammar" button http://langlearn.singularitynet.io/graph.html; - DONE
2.4. Implement a full-blown version of LG loader capable to parse English grammar https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict (including support for "macros" like "<post-nominal-u>"); - DONE
2.5. Add a unit test for full-blown version of LG loader capable to parse English grammar involving parse of the same sentences that we used in 2.3, but relying on complete English LG.
2.6. Make sure the full-blown version of LG loader works to parse Russian grammar https://github.com/opencog/link-grammar/blob/master/data/ru/4.0.dict and confirm this with unit parsing "мама мыла раму" and "папа сидел на диване". (will do later or defer to separate task because of the need to handle morphology)
2.7. TBD
Implement the language production engine which would take input as a list of words plus loaded formal grammar dictionary and produce the sentence including all of the words. In order to do this, the approach similar to the Link Grammar parsing or MST-parsing would get applied, so we get all rules involving all referenced words, build all possible sentence trees and then select the tree satisfying some criteria or combination of criteria (like maximum overall mutual information, minimum length, minimum tree depth, etc.). As a possibility, "SAT solver" approach may be employed ( https://sahandsaba.com/understanding-sat-by-implementing-a-simple-sat-solver-in-python.html ).
3.1. Have the minimally viable functionality working and passing the following test - DONE :
3.1.1. Load dictionary http://langlearn.singularitynet.io/data/clustering_2018/POC-English-2018-12-31/POC-English-Amb_LG-English_dILEd_gen-rules/dict_30C_2018-12-31_0006.4.0.dict
3.1.2. Write the test script which can fo the following, having the dictionary loaded and file http://langlearn.singularitynet.io/data/poc-english/poc_english.txt applied as an input:
3.1.2.1. Load every sentence form individual line;
3.1.2.2. Disassemble (tokenize) the sentence into individual words;
3.1.2.3. Use the loaded LG dictionary to create a grammatically valid sentence from the word with one of the following approaches;
3.1.2.3.1. Read and understand the concept of "SAT-solver" and apply this idea to it to implement sentence generator building sentences from list of words and loaded grammatical rules connecting these words;
3.1.2.3.2. Re-use some existing "SAT-solver" code and adapt it to given task;
3.1.2.3.3. Do everything from the scratch - THAT'S HOW IT WAS DONE
3.1.2.3.4. Lookup OpenCog Scheme code doing this and borrow ideas from there
3.1.2.3.5. Port OpenCog Scheme code to Java
3.1.2.3.6. Any combination of the above
3.1.2.3. Compare the generated sentence against the input sentence and provide diagnostics on a mismatch.
3.1.3. Keep fixing bugs till the amount of mismatches is minimized.
3.1.4. If there are still any mismatches, analyze the reasons of them and suggest solutions and directions for further exploration.
3.2. Test on SingularityNET extract from Gutenberg Children corpus used in Unsupervised Language Learning project
3.2.1. Use the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
3.2.2. Create "extra-cleaned" corpus removing all sentences with quotes and brackets like [ ] ( ) { } ' " and all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE"
3.2.3. Evaluate the accuracy and the other metrics fo the entire "extra-cleaned" corpus seeing if we can generate sentences correctly from the words.
3.3. Test on the sentences randomly found on Wikipedia using full-blown English LG dictionary
3.4. Test on the sentences from some (TBD) corpus for QuestionAnswering challenge (need to google for such corpora or lookup on Kaggle)
3.5. Test on the words extracted from graph/network model learned from Wikipedia or a QuestionAnswering challenge corpus mentioned above - according to #33
3.6. Make sure the capitalization is handled properly, so the text can be generated regardless of the case of the input words - TBD
3.6. TBD
Handle the following problems that will arise along the way:
4.1. It might get possible, that no one sentence might be built due to the words missed in the input so no complete sentence can be built. In this case, the engine should be able to provide lists of words that could be used to fill all of the gaps needed to fill and capable to ask callback of the caller to rank the suggested word options (it may be an iterative process, so one the most critical gap is filled, the list of remaining options may get changed).
4.2. It might get possible, that multiple sentences might be built so the engine should be able to provide its own ratings to these candidate sentences as well as ask callback of the caller to rank the suggested sentence options.
Given there is no control test set for such task, may need to come up with a control set to be used for hyper-parameter tuning, according to the "Baby Turing Test" paradigm: https://arxiv.org/abs/2005.09280
5.1. Simplest case - use: http://langlearn.singularitynet.io/data/poc-english/poc_english.txt
5.2. More complex case - use same as above, but having some words removed base on some test configuration
5.3. See if there are some existing "baseline" test sets for Natural Language Generation or Question Answering challenges... TBD
Integrate the engine into the Aigents chat-bot framework available for Web, Telegram, Facebook Messenger, and Slack (related task issue will be created).
There are many issues expected to arise along the way so the scope of the work is expected to be adjusted along the way (related task issues will be created it needed).
Recommended package name org.aigents.nlp

References:
https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271
http://aigents.com/papers/2019/ExplainableLanguageProcessing2019.pdf
https://www.youtube.com/watch?v=ABvopAfc3jY
https://www.youtube.com/watch?v=cwgtcOfA3KI
https://arxiv.org/abs/1401.3372
https://arxiv.org/abs/2005.09280
http://langlearn.singularitynet.io/data/docs/

In case if Link-Grammar (LG) is chosen:

https://en.wikipedia.org/wiki/Link_grammar
https://github.com/opencog/link-grammar
Reference LG dict files can be taken from here https://github.com/singnet/language-learning/tree/master/tests/test-data/dict/poc-turtle
More dict files may be found under subfolders of "tests" folder here https://github.com/singnet/language-learning/tree/master/tests
Some Python code for reading and writing LG dict files may be found here https://github.com/singnet/language-learning/tree/master/src
For the LG questions, join the mailing list https://groups.google.com/forum/#!forum/link-grammar
Testing LG parser for Russian: http://sz.ru/parser/

On Natural Language Generation with Link Grammar:
https://books.google.ru/books?id=HwW6BQAAQBAJ&pg=PA459&lpg=PA459&dq=link+grammar+language+generation&source=bl&ots=Lnj2CmORKC&sig=ACfU3U3QjcHw-ruEN0hh95hVZ32Mu78yfg&hl=ru&sa=X&ved=2ahUKEwj628PW57zqAhX1wsQBHTIcB7AQ6AEwBHoECAkQAQ#v=onepage&q=link%20grammar%20language%20generation&f=false
https://wiki.opencog.org/w/Natural_language_generation
http://www.frontiersinai.com/turingfiles/December/lian.pdf

On SAT-solver and Grammars:
https://www.hf.uio.no/iln/om/organisasjon/tekstlab/aktuelt/arrangementer/2015/nodalida15_submission_91.pdf
https://books.google.ru/books?id=xBJVDQAAQBAJ&pg=PA67&lpg=PA67&dq=sat+solver+grammar&source=bl&ots=IOSARwDh2b&sig=ACfU3U0IooczXG8sDnK5K2yr9jmY0pRHzQ&hl=ru&sa=X&ved=2ahUKEwjW5IfwlqHqAhUNEJoKHVg1AzQQ6AEwAnoECAUQAQ#v=onepage&q=sat%20solver%20grammar&f=false
https://www.semanticscholar.org/paper/Analyzing-Context-Free-Grammars-Using-an-SAT-Solver-Axelsson-Heljanko/0fd33fd35fc8a8b32287d906cf6d3576d0a294b2
https://books.google.ru/books?id=-jVxBAAAQBAJ&pg=PA35&lpg=PA35&dq=language+generation+sat+solver&source=bl&ots=V1hzzi1xJA&sig=ACfU3U3CL00HJVknvEUADMWvucLkvefMEw&hl=ru&sa=X&ved=2ahUKEwi3_dbll6HqAhWswqYKHY-mB-sQ6AEwDHoECAwQAQ#v=onepage&q=language%20generation%20sat%20solver&f=false

Telegram moderation and analytics

Subtasks:

Send notifications to users for certain user-configured content settings (topic templates) in groups as we have (DONE)
Draw reputation charts and graphs for users (DONE)
Provide sentiment-basis feedback on posts (DONE, TODO configuration)
Remove posts for a certain level of admin-configured content restrictions (DONE, TODO configuration)
Notify admins for a certain level of admin-configured content restrictions (DONE, TODO configuration and indication of a source group in alert)
Ban post authors for a certain level of admin-configured content restrictions
Provide private warnings for a certain level of admin-configured content restrictions
Provide a thematic search for community members
Draw content preferences charts and graphs for users (group "semantic core", user "semantic core", dynamics, etc.)
Do all sorts of reputation, content, etc. charts and graphs like we have for Twitter, Reddit, etc.
Write digests or reports for admins or users on specific users or topics for specific periods
Improve usability - make everything above increasingly less stupid, less ugly, more intelligent, and more friendly
Make the options above configurable by chat

References:
For content restrictions (5, 6), can use the following datasets and corpora:
a) Obscene lexicon for Russian https://github.com/odaykhovskaya/obscene_words_ru/blob/master/obscene_corpus.txt
b) Bad words in English https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/

Structured rich text stripping and matching

In order to better understand boundaries of the matching text spots, both HTML and PDF (and DOC, ODT, etc. in the future) rich texts should be stripped not to text (like HtmlStripper.convert does now), but to intermediate hierarchical representation preserving both structure of text organization and links, images and titles (kind of internal unified DOM representation).

Actions:

Add StructuredText class
Change HtmlStripper.convert to HtmlStripper.convertToStructuredText
Add PdfStripper.convertToStructuredText (instead of using PDFTextStripper)
Refactor the HttpFileReader and net.webstructor.self.Cacher so they get the structured data in StructuredText instead of "String text" in unified way
Fix/extend entire pattern matching kitchenery to use StructuredText instead of String
Make pattern matching kitchenery to use structure to understand the text spot boundaries
Make sure that unit texts are still passing and maybe fix them if needed

Note:
Current HtmlStripper.convert inserts periods "." in the places of structured HTML tags but this is not done for PDF. Now it is the time to do this consistently for any rich text source, not breaking the other working parts.

Improve purity of RSS format

Based on warnings seen in https://validator.w3.org/feed/

line 13, column 4: Missing enclosure attribute: length (7 occurrences) [help]
<enclosure url="https://www.youtube.com/yts/img/pixel-vfl3z5WfW.gif" typ ...
line 13, column 4: type attribute of enclosure must be a valid MIME type (7 occurrences) [help]
<enclosure url="https://www.youtube.com/yts/img/pixel-vfl3z5WfW.gif" typ ...

Need to EITHER
A) identify size and mime type of the enclosed image properly (see https://stackoverflow.com/questions/705224/how-do-i-add-an-image-to-an-item-in-rss-2-0 saying "The length attribute doesn't need to be completely accurate but it's required for the RSS to be considered valid")
OR
B) include image as <img ... /> tag in the having the content of the description framed into ![CDATA[...]], like ![CDATA[<img src="https://my.site.com/my_image.jpg"]]My text (see https://www.aitrends.com/feed/ for example)

line 17, column 2: item should contain a guid element (13 occurrences) [help]

Need to have every news item to have a guid (better being a permlink) associated with a feed, like
https://www.aitrends.com/?p=18403

line 132, column 0: Missing atom:link with rel="self" [help]

Need to have RSS feed url to be part of the feed channel, like:

xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>

...
<atom:link href="https://www.aitrends.com/feed/" rel="self" type="application/rss+xml" />

See https://www.aitrends.com/feed/ for example.

Web paths formation improvements

The Problem:
The PathFinder/PathTracker components responsible for building the "path" navigation across web links from page to page starting from the "root site URL" (rootPath) have two issues:

Redundant bath entires are formed sometimes (which causes over-consumption of memory and CPU cycles)
Empty path enties are formed sometimes (which causes exceptions like the following):

Fri Jun 05 13:47:30 UTC 2020:Site crawling failed unknown https://blog.wechat.com/category/news/ java.lang.ArrayIndexOutOfBoundsException: 0,:0
java.lang.ArrayIndexOutOfBoundsException: 0
        at net.webstructor.al.Set.get(Set.java:35)
        at net.webstructor.self.PathTracker.run(PathTracker.java:136)
        at net.webstructor.self.PathTracker.run(PathTracker.java:110)
        at net.webstructor.self.PathTracker.run(PathTracker.java:96)
        at net.webstructor.self.PathTracker.run(PathTracker.java:58)
        at net.webstructor.self.WebCrawler.crawl(WebCrawler.java:66)
        at net.webstructor.self.Siter.read(Siter.java:171)
        at net.webstructor.self.Spider$1.call(Spider.java:191)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We need to solve both.

Extra:
In addition to that, for each of the "sites" configured for crawling, we may have the option "crawl mode" (SMART|FIND|TRACK) set other than default "SMART" so the "path" can not be modified and always re-used as configured manually ("TRACK" mode) or never used so the exhaustive crawl applies every time ("FIND" mode).

Profiling native user's texts obtained from the higher-level integrations like browser plugins

DONE:

Need to let AL interface to accept user's actions related to texts:
1.1. Searches of texts/images
1.2. Clicks on texts/images
1.3. Selections of texts/images
1.4. Copypastes of texts/urls
Building Aigents report on overall trusts to the news items with social reporting and profiling under "aigents" social provider

TODO:

Explicitly ranks of selections of text for the following
1.1. relevance
1.2. positive or negative sentiment - sentiment mining (with either "there text 'good stuff', is good." or "there text 'good stuff', good true." !?)
1.3. any other categories
Involve all of the above into social reporting
Need to involve all of the above along with "trust true" relationships between given user as an author and other users as readers and the other way around (may be done later)
Reputation Graphs based on the above

Aigents Desktop Plus for Linux, Mac OSX and Windows

There is an existing old Aigents Desktop App in Java based on the Aigents Core https://github.com/aigents/aigents-java/tree/master/src/main/java/net/webstructor/gui
which is using java.awt framework and can work under Linux, Mac OSX and Windows
However, its functionality is pretty much outdated and does not contain many latest features present in Aigents Web User interface https://github.com/aigents/aigents-web available at https://aigents.com/
Also, it makes sense to have the Aigents Desktop App to be based on java.fx framework instead of java.awt with built-in Web browser, like it is done in the existing Android App https://github.com/aigents/aigents-android so more tight integration between Aigetns Core functionality and Web browsing operations can be achieved, like in case of Chrome browser plugin per #27

Fix support for robots.txt patterns

Example:
https://www.joom.com/robots.txt
...
Disallow: */q.*
...
not matched for
https://www.joom.com/ru/search/q.xiaomi
in
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HttpFileReader.java#L211
and file is tried to get read with error 400

Add "title" to "text" in news items

Wanted
Have news items supplied with "title" property, in addition to currently existing "text", "sources", "times" and "image".

One way to solve this is do the same trick as it is done with images and links - provide another container to the html stripper so it collects all tags that you have identified and keeps them with indexes to the original positions and then when the text is matched it can lookup back for the closest title candidate.

Here is where the image indexing happens:
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java#L202

Here where is is used:

aigents-java/src/main/java/net/webstructor/self/Siter.java

Line 636 in 34a507f

String image = imager.getAvailableImage(path,textPos);

I guess one can just re-use the Imager class for the purpose. Then one just needs two hacks nearby the points that I have indicated:

Index all "title", "h1", "h2", "h3" tags plus may be some other collecting their interiors in the collector structure same as called to collect the image urls.
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java#L202
When the news item is created, lookup the closest indexed title candidate occurring before, like it is done when attaching image urls:

aigents-java/src/main/java/net/webstructor/self/Siter.java

Line 636 in 34a507f

String image = imager.getAvailableImage(path,textPos);
Put the found title candidate into the "title" property of the news item

4. Optionally: If no title candidate found, we MAY don't leave no title or create blank "title", but may use alternative strategy like using the most salient/interesting words in a text placed in the title in the same order as they appear in the text)

Sentiment analysis support

Need to provide sentiment analysis support for English, Russian and Chinese

Sources:

English:

Russian:

RuSentiLex: https://www.labinform.ru/pub/rusentilex/index.htm
Linis-Crowd: http://www.linis-crowd.org/

DONE:

basically functionality
compute sentiment on n-grams first, if any found
Make sentiment analysis exposed to AL interface

TODO:

weight sentiment features by inverse frequency of words/terms in news agenda!? ('вылечились от короновируса' = 'cured from coronavirus')
custom user-specific lexicons (hierarchy of custom "subgraphs" with lexicons in hierarchy of extensions)
context-specific lexicons based on broad topic and/or sentence contexts
context-specific sentiment based on topic location and its surroundings within a sentence
More: https://blog.singularitynet.io/aigents-sentiment-detection-personal-and-social-relevant-news-be989d73b381

P.S.:
File merging tips: https://stackoverflow.com/questions/4366533/how-to-remove-the-lines-which-appear-on-file-b-from-another-file-a

Restructure storage of lexicons

At the moment, lexicons are stored in the root:
lexicon_english.txt
lexicon_negative_english.txt
lexicon_negative_russian.txt
lexicon_positive_english.txt
lexicon_positive_russian.txt
lexicon_rude_english.txt
lexicon_rude_russian.txt
lexicon_russian.txt

We want to change it, along with adding support for cognitive distortions, to be like this:

data
  dict
    en
       lexicon.txt
       negative.txt
       positive.txt
       rude.txt
       mentalfiltering.txt
       magnification.txt
       ...
    ru
       lexicon.txt
       negative.txt
       positive.txt
       rude.txt
    zh

Aigents Cloud Storage

For the high-performance and high-capacity Aigents Servers supporting thousands and millions of users, we would need to change the existing storage design of the Aigents.
Currently, it involves:
A) in-memory custom graph DB https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/core/Storager.java (stored in al.txt snapshots)
B) "temporal graphs" for indexing source-specific historical graph data https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/GraphCacher.java
C) "long-term memory" storage of the object instances https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/core/Archiver.java
D) cache of the web data https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Cacher.java
While the above work fine for single-user Aigents instances and up to few-hundred-user instances, it may not scale well if we get thousands and millions of concurrent Line Clients (such as per #25 , #26 #27 and #28 ) so the following would have to get done:

Redesign and refactor the above so we use interfaces instead of classes and the implementations of those interfaces are served by factories obtained at the Body level https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/agent/Body.java (see getPublisher singleton factory for instance)
Choose the Graph/SQL/Object DB engine for alternative implementation of these interfaces (like Neo4J/PostgreSQL/MongoDB)
Have the job done :-)

Smarter formation of topic patterns

At the current time to topics are generated along with patterns created as topi names via the TextMiner class
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/TextMiner.java
and its underlying clustering implementation
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/data/Miner.java
The patterns created this way are just disjunctive sets of words, missing the few things

conjunctions
ordered conjunctions ("frames")
regular expressions (to handle suffixes)

It should be improved with more complex pattern formation involving symbolic pattern regression producing hierarchical patterns like discussed here:
https://www.youtube.com/watch?v=FzKMtNILmDk

Aigents Self-Server App for Android

The complete and self-contained Aigents application with server capabilities, built-in privacy protection and peer-to-peer capabilities already exists in Java:
https://github.com/aigents/aigents-android
with latest build scrips in Graddle:
https://github.com/aigents/aigents-android-graddle
The functionality is pretty outdated and does not include all the latest features present in the Aigents Core.
We are looking forward to have the new version created with all bells and whistles from the Aigents Core exposed to Android User interface

RSS support

Need to provide RSS channels support like it is done for Reddit subreddits and user activity logs and will be done for Twitter (#4 )

For entry point, you will need new class RSSeer - see:
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Siter.java#L295
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L99
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L169

For file reading and content type checking - look up
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Cacher.java#L118
lines 118-123

A) reader.allowedForRobots(path) and if allowed
B) Use reader.canReadDocContext(path,context) or reader.readDocData(path," ",context) or something like that to

check if file is either RSS or Atom
AND if so
process RSS/Atom items one by one

Support both:
https://sawv.org/2019/11/12/rss-vs-atom-vs-json-feed-vs-hfeed-vs-whatever.html
https://www.saksoft.com/rss-vs-atom/
https://problogger.com/rss-vs-atom-whats-the-big-deal/

RSS Feed Example:
https://www.feedforall.com/sample.xml


<?xml version="1.0" encoding="windows-1252"?>
--
  | <rss version="2.0">
  | <channel>
  | <title>FeedForAll Sample Feed</title>
  | <description>RSS is a fascinating technology. The uses for RSS are expanding daily. Take a closer look at how various industries are using the benefits of RSS in their businesses.</description>
  | <link>http://www.feedforall.com/industry-solutions.htm</link>
  | <category domain="www.dmoz.com">Computers/Software/Internet/Site Management/Content Management</category>
  | <copyright>Copyright 2004 NotePage, Inc.</copyright>
  | <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  | <language>en-us</language>
  | <lastBuildDate>Tue, 19 Oct 2004 13:39:14 -0400</lastBuildDate>
  | <managingEditor>[email protected]</managingEditor>
  | <pubDate>Tue, 19 Oct 2004 13:38:55 -0400</pubDate>
  | <webMaster>[email protected]</webMaster>
  | <generator>FeedForAll Beta1 (0.0.1.8)</generator>
  | <image>
  | <url>http://www.feedforall.com/ffalogo48x48.gif</url>
  | <title>FeedForAll Sample Feed</title>
  | <link>http://www.feedforall.com/industry-solutions.htm</link>
  | <description>FeedForAll Sample Feed</description>
  | <width>48</width>
  | <height>48</height>
  | </image>
  | <item>

Atom Feed Example:
https://validator.w3.org/feed/docs/atom.html

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>

</feed>

Use XML:
https://www.viralpatel.net/java-xml-xpath-tutorial-parse-xml/

RSS test feeds:
http://feeds.reuters.com/reuters/businessNews
http://feeds.reuters.com/reuters/technologyNews
http://feeds.reuters.com/reuters/politicsNews
http://feeds.reuters.com/news/wealth
https://blog.feedspot.com/bitcoin_rss_feeds/
https://blog.feedspot.com/reuters_rss_feeds/
https://gist.github.com/hamzamu/5c2fa2907ec507f4aba3ba6fcce2d21b

Aigents Self-Server App for iPhone

Like #24 for Android, complete Self-Server with peer-to-peer capabilities and secure storage app can be created for iPhone - porting original Aigents Java code https://github.com/aigents/aigents-java/tree/master/src/main/java/net/webstructor to native Objective C.
Expected to be minimally half human-year project.

Conversational interface for Reputation bot

As it has been suggested by Ibby Benali:
I think if the bot would respond with something like this when you do /start :
Hi! I am the SingularityNET Aigents Reputation Bot. I calculate xyz for you. I can provide you personal reputation reports. In order to start, please tell me your name.
next message
Thanks! Nice to meet you name. Can you please tell me your email, I need that for xxx.
Next message.
Awesome! Now let’s look at your reputation. If you would like to get a reputation report for yourself, please type /reputation @your_username. Let’s try it out!
Provide report. Next message
Isn’t that cool? If you would like to use me in groups, just add me to your chatgroup. If you would like to know the reputation of a user, just reply to their message with /reputation.
For now, it is great to meet you. You can follow my progress here and here. If you would like to opt-in for updates to my software, just type /updates and I will ping you when I learned a new trick.

the above is just an idea but maybe it will guide the conversation and interaction a bit more smoothly with the above

and perhaps as a fallback:
Uh oh, I am not sure what you mean. Please type /help to see what I can do, or let’s pick up where we left off: (insert the thing where you left off.. e.g. “I wanted to know your email for xxx”)

TODO:

Implement recommendations per https://core.telegram.org/bots - DONE in b900408
/start
/help
/settings
Make sure bot provides registration prompt after the first encounter - DONE in b900408
Make sure bot provides GDPR-compliant prompt before registration, like discussed in #12 - DONE in b900408
Enable free-text conversations configurable - TODO
TBD

EOS/CyberWay support

Need to have EOS and CyberWay/Golos.io integrated similarly to Ethereum, Steemit and Golos.id
Resources:
https://developers.eos.io/manuals/eosjs/latest/index

Arxiv and PsyArchiv PDF parsing as a custom plugin(s)

Need separate Socializer-derived plugin(s) for Arxiv and PsyArchiv PDF parsing

Arxiv - follow one of the options:
1.1. Use Arxiv search API with results returned in Atom Feed format, see: https://arxiv.org/help/api#using , https://arxiv.org/help/api/user-manual, https://arxiv.org/help/api/user-manual#query_details and http://export.arxiv.org/api/query?search_query=all:agi
1.1.1. In custom version, can use query parameters "start" and "max_results" to iterate over the full document collections "search_query=anton kolonin&id_list=&start=0&max_results=10" (can be also done as a hack in RSSer translation URLs containing "arxiv.org" into API calls like "http://export.arxiv.org/api/query?search_query=agi&start=2&max_results=2")
1.1.2. In custom version extra fields of the feed can be used, see https://arxiv.org/help/api/user-manual#query_details
1.2. Implement custom crawling with custom crawler plugin (like RSS) on Aigents side, based on #5
1.3. Implement Aigents-side URL filtering logic per site/user/instance for A) URLs not crawled and B) URLs not used to create news items
PsyArchiv:
2.1. TODO
Random Issues:
3.1. pdfs not read from site in agi channel
3.2. https://arxiv.org/list/cs.AI/recent
3.3. enable scope=web as default ?
3.4. missed https://arxiv.org/list/cs.AI/recent for 'knowledge representation'

P.S.: Suggestions from Eyob:

For arxiv, we need to crawl only “abs” links (eg. https://export.arxiv.org/abs/2005.05255). Some weird pages like formats are being crawled. I think the crawler needs to crawl smartly, in a site specific way (although hard coded for now, future AGI-sh implementation of this would take care of this automatically : D). Only article like pages should be shown to the user.
Eg. bad contents crawled in the current feed setup on the staging site (staging.xcceleran.do) are
https://export.arxiv.org/list/cs.SY/pastweek?skip=65&show=25 https://export.arxiv.org/format/2005.04589
The above links have ‘list’ and ‘format’ in their url instead of ‘abs’
Titles that come from arxiv are not properly crawled. E.g the title from https://export.arxiv.org/abs/2005.05178 is “other ] title: reinforced rewards framework for text style transfer learning ( cs.lg ) [ 110 ] arxiv:2005.05178 ( cross-list from cs.ro ) [ pdf” . We need to have a mechanism to get this done correctly. Eg. using the title tag of the page?

Text fragmentation/segmentation based on formal grammar

Base on the progress with issue #22 , we want to use the formal grammar to identify boundaries of sentences in the token (word) streams in two cases:

When the token (word) stream is provided by the speech recognition engine.
When the token (word) stream is provided by the HTML stripper applied to HTML texts where the natural language sentences are split not by conventional periods, explanations and question marks, but with some weird HTML tags with some custom styles applied to them.

The solution would have at least two applications:
A) Split the stream of tokens/words into sentences for further linguistic processing such as parsing and entity extraction
B) Split the stream of tokens/words into sentences for selecting the "featured" sentences containing some "hot" keywords for summarization purposes.

Initial progress has been reached with
https://github.com/aigents/aigents-java-nlp/blob/master/src/main/java/org/aigents/nlp/gen/Segment.java
in
aigents/aigents-java-nlp#11

Still, there is more work to do to improve the accuracy.

For testing purposes, we can use (for example) the SingularityNET extract from Gutenberg Children corpus used in Unsupervised Language Learning project, using the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
then creating "extra-cleaned" corpus removing all sentences with quotes and brackets like [ ] ( ) { } ' " and all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE"
then gluing sentences together on per-file or per-chapter basis and evaluate the accuracy based on the number of correctly identified sentence boundaries.

Any alternative corpora for testing against any baseline results achieved by any other authors may be considered as well.

References:
https://www.researchgate.net/publication/321227216_Text_Segmentation_Techniques_A_Critical_Review
https://www.google.com/search?q=natural+language+segmentation%20papers

Discourse support

Task:

Make it possible for single Aigents installation configure one Discourse site (like https://community.singularitynet.io), so the following should be possible:

Maintain profiles per-user based on what users' like and what they post/comment
Maintain connection (with social reports) between users based on what users' like and comment
Maintain input for reputation system (in temporal graphs) based on what users' like and comment
Provide news monitoring and content extraction for posts under: A) entire site, B) category, C) topic based on what is set as a site, D) user
Provide Single Sign On (SSO) for Discourse (1 week?)

Tips:

Data model:
category<-topic<-post(ordered by local numbers within topics)

API:

All categories:
https://community.singularitynet.io/categories.json
Topics in category:
https://community.singularitynet.io/c/66.json
Get topics:
https://community.singularitynet.io/latest.json
Get topic with post stream:
https://community.singularitynet.io/t/2753.json
Get user:
https://community.singularitynet.io/users/akolonin.json
Get users:
https://community.singularitynet.io/admin/users/list/active.json
{"errors":["The requested URL or resource could not be found."],"error_type":"not_found"}
Get posts:
https://community.singularitynet.io/posts.json
https://community.singularitynet.io/posts.json?before=8098
Get post:
https://community.singularitynet.io/posts/8099.json
Get likes:
https://meta.discourse.org/t/getting-who-liked-a-post-from-the-api/103618/3
curl 'https://community.singularitynet.io/post_action_users?id=8098&post_action_type_id=2' -H 'Accept: application/json'
{"post_action_users":[{"id":118,"username":"Patrik_Gudev","name":null,"avatar_template":"/user_avatar/community.singularitynet.io/patrik_gudev/{size}/430_2.png","post_url":null,"username_lower":"patrik_gudev"},{"id":24,"username":"akolonin","name":null,"avatar_template":"/user_avatar/community.singularitynet.io/akolonin/{size}/146_2.png","post_url":null,"username_lower":"akolonin"}]}
Get user actions:
curl https://community.singularitynet.io/user_actions.json?username=akolonin
https://github.com/discourse/discourse_api/blob/master/lib/discourse_api/api/user_actions.rb
action_type:
1 - liked by me
2 - liked by other
3 - unknown TODO?
4 - my topic posts
5 - my reply posts
6 - reply posts on my reply posts (except reply posts on my topic post) TODO?
7 - mentions of me

Work Items

Add peer discourse id, self discourse id, self discourse key, self discourse url fields (1 day, DONE)
Add social plugins for the following:
2.1. social analytics and reports (1 weeks, DONE)
2.2. reputation data extraction (1 weeks, DONE)
2.3. reporting and graph visualisation (1 weeks, DONE)
Add plugin for content analysis and news extraction like Reddit or RSS (1 week, DONE)
Refactor Steemit to use the same SocialFeederHelper base class (1 week)
Add plugin for SSO (1 week)

SSO Resources:

https://www.discourse.org/plugins/oauth.html
https://meta.discourse.org/t/official-single-sign-on-for-discourse-sso/13045
https://meta.discourse.org/t/using-discourse-as-a-sso-provider/32974
https://www.jokecamp.com/blog/examples-of-creating-base64-hashes-using-hmac-sha256-in-different-languages/#java

API Resources:

Here are some resources to check out for the Discourse API + some extras:
https://docs.discourse.org/ for the API
https://meta.discourse.org/t/data-explorer-plugin/32566 an official discourse plugin that allows for live database queries
https://meta.discourse.org/t/discourse-voting/40121 voting functionalities
On badges and communities:
https://meta.discourse.org/t/what-are-badges/32540
https://meta.discourse.org/t/how-to-grant-a-custom-badge-through-the-api/103270 (e.g. a badge for creating an aigents feed with x readers perhaps?)
Our own badges (using the standard ones: https://community.singularitynet.io/badges
https://blog.discourse.org/2018/06/understanding-discourse-trust-levels/
https://meta.discourse.org/t/discourse-moderation-guide/63116
https://blog.discourse.org/2014/08/building-a-discourse-community/
On discobot (although our integration needs a bit more work to make it autogenerated text):
https://meta.discourse.org/t/how-to-customize-discobot/103633
https://blog.discourse.org/2017/08/who-is-discobot/
Zapier + Discourse:
https://zapier.com/apps/discourse/integrations

Twitter support

Need to provide Twitter support, including

OAuth2
User profiling (like in Facebook, Steemit, VKontakte and Golos)
News monitoring (like in Reddit, Steemit and Golos)

See:
1)
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L80
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Redditer.java#L61
2)
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L74
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L185
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/RedditFeeder.java#L116
3)
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/comm/reddit/Reddit.java#L99

Pay Attention:
"If you need to share Twitter content you obtained via the Twitter APIs with another party, the best way to do so is by sharing Tweet IDs, Direct Message IDs, and/or User IDs, which the end user of the content can then rehydrate (i.e. request the full Tweet, user, or Direct Message content) using the Twitter APIs. This helps ensure that end users of Twitter content always get the most current information directly from us.
We permit limited redistribution of hydrated Twitter content via non-automated means. If you choose to share hydrated Twitter content with another party in this way, you may only share up to 50,000 hydrated public Tweet Objects and/or User Objects per recipient, per day, and should not make this data publicly available (for example, as an attachment to a blog post or in a public Github repository)."
Source: https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases

Search support

Task:

Add option to search web for A) texts with urls, B) image urls, C) videos with urls - using one of the following options. - DONE
Make search results subject to monitoring - DONE
Make search available in chat mode as a Question Answering, based on #22 - PROGRESS

Primary options:

https://developers.google.com/custom-search/v1/overview#api_key - paid API ($5 per 1000 queries), "You can fine-tune the ranking, add your own promotions and customize the look and feel of the search results", with image search (Option 1) - DONE
1.2. Need to add pagination support - TODO
https://serpapi.com/ - paid API ($50/month+), Google scalping, seems like the best choice, need to confirm with lawyer, with image search and video search (Option 2)
2.2. Basically - DONE
2.2. Need to add pagination support - TODO
https://www.gigablast.com/ - paid API (Min $5, $1 per 1000 queries), https://www.gigablast.com/searchfeed.html (Option 3) - TODO

Secondary options:

https://www.bing.com/ - has API, use is restricted explicitly: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/use-display-requirements
https://duckduckgo.com https://duckduckgo.com/ - no API (existing API is limited to "instant answers")
https://lite.qwant.com https://lite.qwant.com/ - unofficial API https://github.com/NLDev/qwant-api/, discontinued, rate limit, blocking with captcha
https://searx.me https://searx.me/ - metasearch, hard rate limit, blocking with captcha
https://www.hotbot.com https://www.hotbot.com - no API, based on Bing
https://gibiru.com https://gibiru.com/ - no API, IP blocked in Russia?
https://swisscows.com https://swisscows.com - no API
https://www.ask.com https://www.ask.com/ - no API
https://www.dogpile.com https://www.dogpile.com - no API
https://www.startpage.com https://www.startpage.com/ - no API
https://metager.org https://metager.org/ - no API
https://search.norton.com https://search.norton.com/ - no API
https://yacy.net/ - decentralized p2p, API, demo http://yacy.searchlab.eu/ is glitchy, bad coverage for Russian
https://contextualwebsearch.com/, paid API, coverage is limited, no Russian results, "semantic search" results make no sense
https://seekstorm.com/ (was faroo.com), claims to have paid API, but the site is not operable now, says 2018 "Coming soon"
https://swiftype.com/ - paid API, site search - crawl only specified sites
https://www.cludo.com/ - paid API, site search - crawl only specified sites
https://grader.algolia.com/ - paid API, site search and structure analysis - TODO explore!

GitHub found 6 vulnerabilities on aigents/aigents-java's default branch (6 moderate)

Antons-MacBook-Pro:aigents-java akolonin$ git commit -m "update version and year 2024"
[master feb5d93] update version and year 2024
2 files changed, 4 insertions(+), 4 deletions(-)
Antons-MacBook-Pro:aigents-java akolonin$ git push

Counting objects: 12, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (12/12), 813 bytes | 813.00 KiB/s, done.
Total 12 (delta 6), reused 0 (delta 0)
remote: Resolving deltas: 100% (6/6), completed with 6 local objects.
remote:
remote: GitHub found 6 vulnerabilities on aigents/aigents-java's default branch (6 moderate). To find out more, visit:
remote: https://github.com/aigents/aigents-java/security/dependabot
remote:
To https://github.com/aigents/aigents-java.git
bd4b073..feb5d93 master -> master

Support Link-Grammar-based parsing

We want to be able to do parsing of any language supported by LinkGrammar, starting with English, to be available both internally in Aigents framework and via Aigents Language API.

Specs:

Integrate https://github.com/aigents/aigents-java-nlp into https://github.com/aigents/aigents-java as a dependency (the simpler the better, just having an extra jar file built from the former and required by the latter is fine).
1.1. Link Grammar dictionaries are assumed to be deployed in the same folder structure as in https://github.com/aigents/aigents-java-nlp/tree/master/ and https://github.com/opencog/link-grammar/tree/master (./data/en/*)
1.2. The aigents-java-nlp can be either A) built as a separate jar or B) just built as an external dependency from source files or C) cloning contents of "/aigents/aigents-java-nlp/src/main/java" to "/aigents/src/main/java" (having the package names fixed along the way to "org.aigents") - whichever is easier and more logical
1.3. Tests from aigents-java-nlp should not be part of the jar (A above) or Aigents build (B above)
Have internal https://github.com/aigents/aigents-java package responsible for NLP and parsing in particular, add a wrapper(s) to the Link Grammar loader and Link Parser to it (based on https://github.com/aigents/aigents-java-nlp ).
2.1. Parsing means "parsing", which is not a "generation" or "segmentation" from aigents-java-nlp
2.2. Parsing is what conventional LinkGrammar Parser (C++) does - takes the single sentence into a graph of linked words (it is close to what Segmentation code does, but it is different, so can look up the Segmentation but have different code).
2.3. Code should be placed in "net.webstructor.nlp" of aigents-java project and called LinkGrammarParser, being a wrapper of the new class org.aigents.nlp.Parser created as modified/extended version of main.java.org.aigents.nlp.gen.Segment
Do dictionary load only once per application startup in constructors or init function of the new LinkGrammarParser which should be implementor of GrammarParser interface. LangPack class should initialize it as member in LangPack constructor and it can be used later when doing parsing.
Setup default storage for Link Grammar dictionary for Aigents Server deployment, update project documentation respectively
Implement Link Grammar parser based parsing, extending the existing parsing API - tryParse - https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Conversation.java#L814 - will have extra "mode" option with "link-grammar"/"link grammar"/"lg" value for that
Add integration tests, extending the existing ones https://github.com/aigents/aigents-java/blob/master/php/agent/agent_cat.php#L404

Use existing LinkGrammar in Java implementation https://arxiv.org/pdf/2105.00830.pdf

Subtasks:

Basic porting without of account of cost - done in b2ae519
Assemble based on disjuncts - 2 weeks
Assemble with cost account - 2 weeks
Upgrade to support the latest Link Grammar? - ? weeks

Extension for segmentation and punctuation - subtasks:
5. Segmentation by sentence - 4 weeks
6. Adding punctuation - 4 weeks
7. Russian dictionary load - 2 weeks (need only for Russian)
8. Assemble with the account to morphology - 2 weeks (need only for Russian)

Support reputation computation to the users

Requirement:

Maintain information regarding the following:
-- sources of the comments made in respect to either news items or other comments, so the origin of the comments may be tracked, if needed
-- authors of the news items and comments so the ratings of the latter may be used to compute reputation levels of the former
Have reputation of the authors updated on a daily basis, based on the feedback provided to posts and comments that they have authored.
Provide reputation levels of the users on user-specific basis, giving reputation levels of those who shares the news only to those who shared that, having the public users sharing their “channels” (Aigents “areas”) as sharing to everyone by default.

Subtasks:

add "authors" field pointing to peers (and attached to them under the hood automatically) who either
a) create content
b) share content created based on setup of their own
add "parents" field pointing to the parents of a news item, so "comments" may refer to the parent items ("posts"), make it possible to link children to parents (like saying There text 'That is great news', times 2020-01-31 parents text 'Aigents news feed on Reddit', times 2020-01-30.)
make "liquid rank" reputation computed for authors based on "trusts" given to news items authored by them and comments associated with those items - like it is done in conventional Aigents social analytics but using Reputationer engine
Make reputation, along with relevance, retrieved for every trusted peer saying: What is peer, friend true, trust true email, relevance, reputation?

Cosmos API integration

We want to have the integration with Cosmos API and ecosystem over the REST API in the same way that we already have for Steemit, Golos and Ethereum.

REST API spec:
https://docs.cosmos.network/
https://cosmos.network/rpc/v0.37.9

Discourse forum:
https://forum.cosmos.network/t/integrating-aigents-with-cosmos-tendermint-and-this-discourse-forum/4233

Desktop/Server Graph Rendereder

Task: We need to have Aigents Graphs rendering framework
https://blog.singularitynet.io/graphs-part-3-aigents-graph-analysis-for-blockchains-and-social-networks-142fc8182389
present for Aigents Web version
https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-graph.js
https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-gui.js
https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-map.js
ported to the Aigents Desktop and Server version
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gui/App.java
having some UI/UX design decisions discussed and updated along the way.
We may re-use UI/graph rendering Java code from Webstructor project
http://webstructor.net/ (which will have to be open-sourced along the way)

Reason: There is a capping limit on the number of transactions returned by the server to the web client (because web client just hangs rendering more than a few thousand transactions).

Design: It may be implemented as
A) Server library serving huge graphs rendering to any canvas (based on https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-graph.js and code from http://webstructor.net/)
B) Desktop GUI presenting canvas to the graph renderer and user interaction combining both graph rendering/interaction paradigms currently present in Aigents Web client (https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-gui.js and https://github.com/aigents/aigents-java/blob/master/html/ui/aigents-map.js)
C) Client-server protocol enabling to render huge graphs in png/svg file and rendering them on wen client as png/svg images. PNG/SVG should be likely a configurable option (because SVG would not work for huge graphs over the web anyway).

Option: Support for 3-dimensional graphs may be implemented based on http://webstructor.net/ code.