bzz / scholar-alert-digest
Aggregate unread emails from Google Scholar alerts
License: Apache License 2.0
Don't use deprecated API from Go gmail lib
gmail.go:91:14 gmail.New is deprecated: please use NewService instead. To provide a custom HTTP client, use option.WithHTTPClient. If you are using google.golang.org/api/googleapis/transport.APIKey, use option.WithAPIKey with NewService instead. (SA1019)
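A minimal sketch of the suggested migration, assuming the usual Quickstart-style setup where an OAuth2-authorized *http.Client is already available (the package and function names below are illustrative, not copied from gmail.go):

```go
package gmailclient

import (
	"context"
	"net/http"

	gmail "google.golang.org/api/gmail/v1"
	"google.golang.org/api/option"
)

// newGmailService replaces the deprecated gmail.New(client) call with
// gmail.NewService, passing the authorized HTTP client via option.WithHTTPClient.
func newGmailService(ctx context.Context, client *http.Client) (*gmail.Service, error) {
	return gmail.NewService(ctx, option.WithHTTPClient(client))
}
```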
I've noticed that any Scholar alert emails configured with 'all results' rather than 'most relevant' result in an error when processed by this tool. This might be because each email starts with:
"Showing less relevant results because there are no great results
Update alert to receive fewer, more relevant results"
Am I correct in this, and if so, would this be an easy fix to implement? Here is my command (note this happens with JSON/HTML output or with just the minimal flags):
go run main.go -l 'GScholar' -read -authors
2022/04/11 10:04:41 searching and fetching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 searching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 14 messages found (took 0 sec)
14 / 14 [-----------------------------------------------------] 100.00% ? p/s 1s
2022/04/11 10:04:42 14 messages fetched (took 0 sec)
2022/04/11 10:04:42 14 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 searching and fetching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 searching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 1 messages found (took 0 sec)
1 / 1 [-------------------------------------------------------] 100.00% ? p/s 0s
2022/04/11 10:04:42 1 messages fetched (took 0 sec)
2022/04/11 10:04:42 1 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 rendering 2 papers
# Google Scholar Alert Digest
**Date**: 2022-04-11T10:04:42+01:00
**Unread emails**: 14
**Paper titles**: 2
**Uniq paper titles**: 2
## New papers
- [Cerebellar Transcranial Magnetic Stimulation (TMS) Impairs Visual Working Memory](https://link.springer.com/article/10.1007/s12311-022-01396-2), <i>N Viñas</i> (1)
<details>
<summary>… As a precaution, the coil was positioned using the Brainsight navigator and the</summary>
<div>experimenter monitored for potential deviation of the target, the “bullseye,” and maintained the coil position targeting the cerebellum targets if needed. Details of this …</div>
</details>
- [Short-term facilitation effects elicited by cortical priming through theta burst stimulation and functional electrical stimulation of upper-limb muscles](https://link.springer.com/article/10.1007/s00221-022-06353-3), <i>Update Alert To Receive Fewer, More Relevant Results</i> (1)
<details>
<summary>… The coil position and orientation were monitored throughout the experiment using a</summary>
<div>neuronavigation system (Brainsight, Rogue Research, Montreal, Canada). Ten TMS stimuli, with approximately 5–7 s inter-stimulus intervals, were delivered for …</div>
</details>
## Old papers
<details id="archive">
<summary>Archive</summary>
</details>
2022/04/11 10:04:42 Errors: 13
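Purely a guess at an easy mitigation (the real fix likely belongs in the email HTML parser, and the names below are hypothetical, not taken from papers.go): detect the 'all results' banner text and skip it before treating a block as a paper title or author line.

```go
package papers

import "strings"

// Text of the banner that 'all results' alerts prepend to the message body.
var lessRelevantBanner = []string{
	"showing less relevant results",
	"update alert to receive fewer, more relevant results",
}

// isAlertBanner reports whether a parsed text block is the 'all results' banner
// rather than an actual paper title or author line, so callers can skip it.
func isAlertBanner(text string) bool {
	t := strings.ToLower(strings.TrimSpace(text))
	for _, b := range lessRelevantBanner {
		if strings.Contains(t, b) {
			return true
		}
	}
	return false
}
```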
When following the instructions, under Ubuntu 20.04 the package (currently) gets downloaded to /home/USER/go/pkg/mod/github.com/bzz/[email protected]
To make it easier to access, I wrote a little script which can be run from anywhere, starts the service, and automatically opens the browser. No magic, but convenient:
export SAD_LABEL='XXX'
export SAD_GOOGLE_ID='YYY'
export SAD_GOOGLE_SECRET='ZZZ'
xdg-open http://localhost:8080
cd /home/USER/go/pkg/mod/github.com/bzz/[email protected]
go run ./cmd/server [-compact]
Not sure if this is the optimal solution, though. Maybe one could add this (or a better/different approach) to the Readme as a starting point for new users? Also, it could be stated that the credentials.json needs to be placed in the root directory, even though this was easy to figure out.
In order for the front-end app to have access to a complete data model, please add the following to the JSON response from the server:
It would be nice to have an option to deploy it somewhere and get reports generated so there is no need to run it locally for every individual user.
This will require a shared Gmail API 3-legged OAuth 2.0 app configuration from #7.
Actual deployment is going to be handled by a separate issue.
Right now, for authorization with the server-side OAuth 2.0 flow from Google, we tell the user to create a new Quickstart project in their own API console (under their own account).
Such a project only has a limited number of API requests, and some permissions/scopes are severely capped (only 100 -modify
calls), e.g. those used to mark emails as read.
A better idea might be a documented option of using a pre-registered, verified app, so that users can skip the API console configuration steps and avoid the hassle of app verification (which can take days :/).
Hi,
Really liking this tool. Going through the papers in my alerts, I would highly appreciate an option to remove papers from the list (for example, a small button next to each paper that makes it disappear). I do not need that to be reflected in the actual emails; I would just like to click the papers away one by one as I go through them. Do you think that would be possible? If you point me in the right direction, I am happy to set that up and make a PR.
Thanks!
Modelling the state of individual papers on the backend (and not only in Gmail) will open up a lot of opportunities, e.g. tagging papers by topic, or using that state as training data for classifiers targeting sub-fields. These go beyond our current use-cases, which, of course, we also want to keep supporting (for background on the current use-cases see #19).
In order to decide how to proceed, we will need to answer the question: how does one mark papers as 'read' and then get back to them later? Our current approach, a single, ever-growing "read" section on the same page, works but does not seem very productive.
I see two main alternative interaction models for managing the state of a paper:
To grasp the idea of the tool in 5 seconds, some screenshots with input/output examples would come in very handy.
Trying out https://scitldr.apps.allenai.org manually on a small number of papers seems to give quite good results.
The code is at https://github.com/allenai/scitldr; inference for one paper seems to take on the order of seconds, but it could quite possibly be batched for efficiency.
In the current implementation, authors are always hidden and are shown only when abstracts are expanded. If someone is interested in the authors much more than in the abstracts, they have to click a lot to see all of the authors.
As a workaround, maybe the authors could be shown between the title and the abstract?
But I guess that makes the normal mode even less compact. :) So another way is to add the authors after the title, like this:
Would this pattern introduce less noise?
Technical things that need to be done before thinking about public deployment.
TODOs:
Sometimes (8 out of ~1200 papers), Scholar returns results that look like http://scholar.google.com/scholar?cluster=.
Right now, they are skipped as not matching papers.scholarURLPrefix.
Right now, when parsing the papers we only count the errors, silently skipping failures and thus hiding their root cause:
scholar-alert-digest/papers/papers.go, lines 46 to 48 in 7d2e4de
This leads to difficulties reproducing bugs like #76, as one needs to modify the source code (by adding log.Print(err) to scholar-alert-digest/papers/papers.go, lines 83 to 84 in 7d2e4de).
Instead, it would be nice to have:
- a -v flag that would print all the errors;
- errors collected (e.g. as map[Subject][]error) during paper parsing, as part of Stats.
That would simplify debugging, as users would be able to identify the offending emails and attach the specific HTML that causes the errors.
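A minimal sketch of what that could look like, assuming a Stats-like struct (the real type in papers.go may be shaped differently): collect parse errors per email Subject instead of only counting them, and print them all when verbose output is requested.

```go
package papers

import "log"

// Stats is a hypothetical stand-in for the existing aggregation stats.
type Stats struct {
	Errs       int
	ErrsBySubj map[string][]error // proposed: email Subject -> parse errors
}

// recordErr counts an error and remembers it under the email subject it came from.
func (s *Stats) recordErr(subject string, err error) {
	if s.ErrsBySubj == nil {
		s.ErrsBySubj = map[string][]error{}
	}
	s.Errs++
	s.ErrsBySubj[subject] = append(s.ErrsBySubj[subject], err)
}

// report prints only the error count by default, and every collected error when
// verbose is true (e.g. driven by a -v flag), so users can identify the offending
// emails and attach the failing HTML.
func (s *Stats) report(verbose bool) {
	log.Printf("Errors: %d", s.Errs)
	if !verbose {
		return
	}
	for subj, errs := range s.ErrsBySubj {
		for _, err := range errs {
			log.Printf("  %q: %v", subj, err)
		}
	}
}
```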
Add a mode that, on each report generation, marks all the emails aggregated into the current report as read in Gmail.
HT @m09
When following the initial instructions (I think this is the go get step), I get
(base) kunal@kunal-hp:/tmp/tmp.84voCCm03m$ go get github.com/bzz/scholar-alert-digest
go: finding github.com/bzz/scholar-alert-digest latest
go: downloading github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go: extracting github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go get: github.com/bzz/[email protected] requires
gitlab.com/golang-commonmark/[email protected] requires
gopkg.in/russross/[email protected]: invalid version: unknown revision 000000000000
How can I debug it?
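Not part of the original report, but this looks like the well-known invalid pseudo-version problem with gopkg.in/russross/blackfriday.v2. A commonly suggested workaround (untested against this repo, so treat it as an assumption) is to build from a clone and redirect the gopkg.in path to the module's canonical GitHub path:

```sh
git clone https://github.com/bzz/scholar-alert-digest && cd scholar-alert-digest
# redirect the problematic import path to the canonical module and retry
go mod edit -replace gopkg.in/russross/blackfriday.v2=github.com/russross/blackfriday/v2@v2.0.1
go mod tidy
go run main.go -h
```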
If the emails use http://scholar.google.co.uk/scholar_url?url=..., then because the URL has .co.uk rather than .com, the regex doesn't match the URL and hence the paper is ignored. The adjusted regex fixes it to match any .com and .co.uk domain (and theoretically any other one- or two-part suffix).
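A sketch of what a more permissive prefix check could look like; the exact pattern below is illustrative rather than the project's actual scholarURLPrefix regex, and the helper name is hypothetical:

```go
package papers

import "regexp"

// scholarURLRe accepts scholar.google.com as well as country domains such as
// scholar.google.co.uk or scholar.google.de (one- or two-part suffixes).
var scholarURLRe = regexp.MustCompile(
	`^https?://scholar\.google\.(?:com|[a-z]{2,3}(?:\.[a-z]{2})?)/scholar_url\?url=`,
)

// isScholarURL reports whether a link is a Scholar redirect URL for any such domain.
func isScholarURL(u string) bool {
	return scholarURLRe.MatchString(u)
}
```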
It took me about an hour to get this working, which is pretty silly. Mostly problems with authorization. I would like to update the README based on my experience. Just checking that the repo is maintained before I do this.
One question: I was only able to get the authorization code by copying it from the redirect URL—is this the intended mechanism? When I saw the "localhost refused to connect" page, I assumed something was wrong.
Would it be nice to be able to mark papers as "read" in the web UI?
That seems useful to me, and could trigger different actions:
The idea is to keep the mechanic of interaction as simple as possible for now (only server-side HTTP form submission):
When extracting publications (papers) from emails, a class of papers that in the email look like
https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel
are skipped (14 papers out of 2k+), as ATM we use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf>
pattern, these links look like /scholar?cluster=14905208172666766997&...
and a way to get the URL of an individual PDF (any from the cluster) is not obvious.
One option is to keep those links as-is, so the user will have to choose the PDF from the Scholar page themselves.
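A small sketch of that option (names are illustrative, not taken from papers.go): extract the direct URL from the usual /scholar_url?url=... redirects, and pass /scholar?cluster=... links through unchanged so the user can pick a PDF from the cluster page themselves.

```go
package papers

import (
	"net/url"
	"strings"
)

// paperURL returns the direct URL from a /scholar_url?url=... redirect, or the
// original link unchanged when it is a /scholar?cluster=... result (or anything
// else unexpected).
func paperURL(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return link
	}
	if strings.HasSuffix(u.Path, "/scholar_url") {
		if direct := u.Query().Get("url"); direct != "" {
			return direct
		}
	}
	return link
}
```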
Right now, only the paper title and a URL are included in the report, which looks like this:
It would be nice to include the paper abstract in the report as well, like this:
HT @EgorBu
Right now the backend is completely stateless: every request from the frontend triggers fetching from Gmail.
The idea is to introduce two endpoints:
- /json/messages
- /json/messages/fetch
This will allow experimenting on real data much faster and will be a first step towards introducing proper state management on the backend.
Cf discussion starting with #9 (comment)
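A rough sketch of the split, under the assumption that /json/messages should serve the last fetched payload while /json/messages/fetch refreshes it from Gmail; the handler and cache names are hypothetical and the Gmail call is stubbed out.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

type messageCache struct {
	mu   sync.RWMutex
	data []byte // last JSON-encoded messages payload
}

// serve returns the cached payload without touching Gmail.
func (c *messageCache) serve(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	w.Write(c.data)
}

// fetch refreshes the cache from Gmail and then serves the fresh payload.
func (c *messageCache) fetch(w http.ResponseWriter, r *http.Request) {
	b, _ := json.Marshal(fetchFromGmail())
	c.mu.Lock()
	c.data = b
	c.mu.Unlock()
	c.serve(w, r)
}

// fetchFromGmail is a stub standing in for the existing Gmail fetching code.
func fetchFromGmail() []string { return nil }

func main() {
	c := &messageCache{data: []byte("[]")}
	http.HandleFunc("/json/messages", c.serve)
	http.HandleFunc("/json/messages/fetch", c.fetch)
	http.ListenAndServe(":8080", nil)
}
```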
Right now the report is in GitHub-flavored Markdown format (with some HTML tags), which might be difficult to render properly on a local machine.
We could have a CLI option, say -f, that allows choosing the output format: HTML or Markdown.
HT @m09
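A minimal sketch of such a flag; the -f name and the two formats come from the suggestion above, everything else is illustrative:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	format := flag.String("f", "md", "output format: 'md' (GitHub-flavored Markdown) or 'html'")
	flag.Parse()

	switch *format {
	case "md":
		// render the existing Markdown report
	case "html":
		// render the same report through an HTML template instead
	default:
		log.Fatalf("unsupported format %q, use 'md' or 'html'", *format)
	}
}
```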
Now that the first 🐛 has been found and fixed with #16, which brought in some tests, it would be good to add CI.
The idea is to try GitHub Actions for:
- go test .
The (upcoming, #15) server binary and CD for it are going to be handled in a separate issue.
Hello,
I just tried to generate a report out of a large number (~5000) of emails and scholar-alert-digest crashed with the following backtrace:
panic: template: layout:29666: unexpected "\\" in command
goroutine 1 [running]:
html/template.Must(...)
/usr/lib/go/src/html/template/template.go:372
github.com/bzz/scholar-alert-digest/templates.(*HTMLRenderer).Render(0xc006d3f230, 0xafe5e0, 0xc000010018, 0xc00642d5c0, 0xc0022c1470, 0x0)
/home/simon/git/scholar-alert-digest/templates/templates.go:232 +0x48a
main.main()
/home/simon/git/scholar-alert-digest/main.go:154 +0x542
Unfortunately, I don't know which email actually caused this, but it looks like some sort of escaping problem before things are passed to the Go templating engine.
The command was go run main.go -l scholaralert -authors -html -refs on the current master branch.
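This is only a guess at the failure mode, based on the error message: html/template produces this parse error when a stray backslash appears inside a {{...}} action in the template source. So if email content (e.g. a title or abstract containing something like "{{\alpha}}") gets spliced into the template text rather than passed as data, the parse fails. A self-contained illustration, not the project's actual code:

```go
package main

import (
	"fmt"
	"html/template"
	"os"
)

func main() {
	// Hypothetical paper title containing template-like syntax with a backslash.
	title := `Estimating {{\alpha}}-stable noise parameters`

	// Splicing the content into the template source reproduces the reported error:
	// template: unsafe:1: unexpected "\\" in command
	if _, err := template.New("unsafe").Parse("<li>" + title + "</li>"); err != nil {
		fmt.Println("parse error:", err)
	}

	// Keeping the template static and passing the content as data is safe, because
	// "{{" inside titles/abstracts is then never interpreted as template syntax.
	safe := template.Must(template.New("safe").Parse("<li>{{.}}</li>"))
	_ = safe.Execute(os.Stdout, title)
}
```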
It would be super nice to see not only the count of the references each paper makes, but also a list of the names whose papers it references (the people we are subscribed to). For instance,
Paper's title (3: Author1, Author2, Author3)
To access Gmail through the API we use a client library that seems to be in "maintenance mode": googleapis/google-api-go-client#435 (comment).
ATM it's quite hard to tell what would be a better way or another library for doing that, as it's option 1 from https://github.com/googleapis/googleapis#overview.
But according to https://googleapis.github.io, several other possibilities seem to exist:
Could you please implement marking more than a thousand emails as read?
I had 2067 unread emails with the target label and ran the tool with the '-mark' argument, which resulted in the following message:
failed to batch-delete label UNREAD from 2067 messages: googleapi: Error 400: Number of ids cannot exceed 1000, invalidArgument
The full command in this case was go run main.go -l alerts -mark -html -refs -compact.
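A sketch of chunking the batch call, assuming the google.golang.org/api/gmail/v1 client the tool already uses; the function and variable names are illustrative:

```go
package gmailclient

import gmail "google.golang.org/api/gmail/v1"

// markRead removes the UNREAD label in chunks of at most 1000 ids, the documented
// limit of users.messages.batchModify.
func markRead(srv *gmail.Service, user string, ids []string) error {
	const maxBatch = 1000
	for start := 0; start < len(ids); start += maxBatch {
		end := start + maxBatch
		if end > len(ids) {
			end = len(ids)
		}
		req := &gmail.BatchModifyMessagesRequest{
			Ids:            ids[start:end],
			RemoveLabelIds: []string{"UNREAD"},
		}
		if err := srv.Users.Messages.BatchModify(user, req).Do(); err != nil {
			return err
		}
	}
	return nil
}
```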