bzz / scholar-alert-digest
Aggregate unread emails from Google Scholar alerts
License: Apache License 2.0
Don't use deprecated API from Go gmail lib
gmail.go:91:14 gmail.New is deprecated: please use NewService instead. To provide a custom HTTP client, use option.WithHTTPClient. If you are using google.golang.org/api/googleapis/transport.APIKey, use option.WithAPIKey with NewService instead. (SA1019)
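A minimal sketch of the suggested migration, assuming the usual Quickstart-style setup where an OAuth2-authorized *http.Client is already available (the package and function names below are illustrative, not copied from gmail.go):

```go
package gmailclient

import (
	"context"
	"net/http"

	gmail "google.golang.org/api/gmail/v1"
	"google.golang.org/api/option"
)

// newGmailService replaces the deprecated gmail.New(client) call with
// gmail.NewService, passing the authorized HTTP client via option.WithHTTPClient.
func newGmailService(ctx context.Context, client *http.Client) (*gmail.Service, error) {
	return gmail.NewService(ctx, option.WithHTTPClient(client))
}
```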
I've noticed that any Scholar alert emails configured with 'all results' rather than 'most relevant' result in an error when processed by this tool. This might be because each email starts with:
"Showing less relevant results because there are no great results
Update alert to receive fewer, more relevant results"
Am I correct in this, and if so, would this be an easy fix to implement? Here is my command (note this happens with JSON/HTML output or with just the minimal flags):
go run main.go -l 'GScholar' -read -authors
2022/04/11 10:04:41 searching and fetching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 searching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 14 messages found (took 0 sec)
14 / 14 [-----------------------------------------------------] 100.00% ? p/s 1s
2022/04/11 10:04:42 14 messages fetched (took 0 sec)
2022/04/11 10:04:42 14 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 searching and fetching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 searching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 1 messages found (took 0 sec)
1 / 1 [-------------------------------------------------------] 100.00% ? p/s 0s
2022/04/11 10:04:42 1 messages fetched (took 0 sec)
2022/04/11 10:04:42 1 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 rendering 2 papers
# Google Scholar Alert Digest
**Date**: 2022-04-11T10:04:42+01:00
**Unread emails**: 14
**Paper titles**: 2
**Uniq paper titles**: 2
## New papers
- [Cerebellar Transcranial Magnetic Stimulation (TMS) Impairs Visual Working Memory](https://link.springer.com/article/10.1007/s12311-022-01396-2), <i>N Viñas</i> (1)
<details>
<summary>… As a precaution, the coil was positioned using the Brainsight navigator and the</summary>
<div>experimenter monitored for potential deviation of the target, the “bullseye,” and maintained the coil position targeting the cerebellum targets if needed. Details of this …</div>
</details>
- [Short-term facilitation effects elicited by cortical priming through theta burst stimulation and functional electrical stimulation of upper-limb muscles](https://link.springer.com/article/10.1007/s00221-022-06353-3), <i>Update Alert To Receive Fewer, More Relevant Results</i> (1)
<details>
<summary>… The coil position and orientation were monitored throughout the experiment using a</summary>
<div>neuronavigation system (Brainsight, Rogue Research, Montreal, Canada). Ten TMS stimuli, with approximately 5–7 s inter-stimulus intervals, were delivered for …</div>
</details>
## Old papers
<details id="archive">
<summary>Archive</summary>
</details>
2022/04/11 10:04:42 Errors: 13
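Purely a guess at an easy mitigation (the real fix likely belongs in the email HTML parser, and the names below are hypothetical, not taken from papers.go): detect the 'all results' banner text and skip it before treating a block as a paper title or author line.

```go
package papers

import "strings"

// Text of the banner that 'all results' alerts prepend to the message body.
var lessRelevantBanner = []string{
	"showing less relevant results",
	"update alert to receive fewer, more relevant results",
}

// isAlertBanner reports whether a parsed text block is the 'all results' banner
// rather than an actual paper title or author line, so callers can skip it.
func isAlertBanner(text string) bool {
	t := strings.ToLower(strings.TrimSpace(text))
	for _, b := range lessRelevantBanner {
		if strings.Contains(t, b) {
			return true
		}
	}
	return false
}
```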
When following the instructions, under Ubuntu 20.04 the package (currently) gets downloaded to /home/USER/go/pkg/mod/github.com/bzz/[email protected]
To make it easier to access, I wrote a little script which can be run from anywhere, starts the service, and automatically opens the browser. No magic, but convenient:
export SAD_LABEL='XXX'
export SAD_GOOGLE_ID='YYY'
export SAD_GOOGLE_SECRET='ZZZ'
xdg-open http://localhost:8080
cd /home/USER/go/pkg/mod/github.com/bzz/[email protected]
go run ./cmd/server [-compact]
Not sure if this is the optimal solution, though. Maybe one could add this (or a better/different approach) to the Readme as a starting point for new users? Also, it could be stated that the credentials.json needs to be placed in the root directory, even though this was easy to figure out.
In order for the front-end app to have access to a complete data model, please add the following to the JSON response from the server:
It would be nice to have an option to deploy it somewhere and get reports generated so there is no need to run it locally for every individual user.
This will require a shared Gmail API 3-legged OAuth 2.0 app configuration from #7.
Actual deployment is going to be handled by a separate issue.
Right now, for authorization with the server-side OAuth 2.0 flow from Google, we tell the user to create a new Quickstart project in their own API console (under their own account).
Such a project only has a limited number of API requests, and some permissions/scopes are severely capped (only 100 -modify
calls), e.g. those used to mark emails as read.
A better idea might be a documented option of using a pre-registered, verified app, so that users can skip the API console configuration steps and avoid the hassle of app verification (which can take days :/).
Hi,
Really liking this tool. Going through the papers in my alerts, I would highly appreciate an option to remove papers from the list (for example, a small button next to each paper that makes it disappear). I do not need that to be reflected in the actual emails; I would just like to click the papers away one by one as I go through them. Do you think that would be possible? If you point me in the right direction, I am happy to set that up and make a PR.
Thanks!
Modelling the state of individual papers on the backend (and not only in Gmail) will open up a lot of opportunities, e.g. tagging papers by topic, or using that state as training data for classifiers targeting sub-fields. These go beyond our current use-cases, which, of course, we also want to keep supporting (for background on the current use-cases see #19).
In order to decide how to proceed, we will need to answer the question: how does one mark papers as 'read' and then get back to them later? Our current approach, a single, ever-growing "read" section on the same page, works but does not seem very productive.
I see two main alternative interaction models for managing the state of a paper:
To grasp the idea of the tool in 5 seconds, some screenshots with input/output examples would come in very handy.
Trying out https://scitldr.apps.allenai.org manually on a small number of papers seems to give quite good results.
The code is at https://github.com/allenai/scitldr; inference for one paper seems to take on the order of seconds, but it could quite possibly be batched for efficiency.
In the current implementation, authors are always hidden and are shown only when abstracts are expanded. If someone is interested in the authors much more than in the abstracts, they have to click a lot to see all of the authors.
As a workaround, maybe the authors could be shown between the title and the abstract?
But I guess that makes the normal mode even less compact. :) So another way is to add the authors after the title, like this:
Would this pattern introduce less noise?
Technical things that need to be done before thinking about public deployment.
TODOs:
Sometimes (8 out of ~1200 papers), Scholar returns results that look like http://scholar.google.com/scholar?cluster=.
Right now, they are skipped as not matching papers.scholarURLPrefix.
Right now, when parsing the papers we only count the errors, silently skipping failures and thus hiding their root cause:
scholar-alert-digest/papers/papers.go, lines 46 to 48 in 7d2e4de
This leads to difficulties reproducing bugs like #76, as one needs to modify the source code (by adding log.Print(err) to scholar-alert-digest/papers/papers.go, lines 83 to 84 in 7d2e4de).
Instead, it would be nice to have:
- a -v flag that would print all the errors;
- errors collected (e.g. as map[Subject][]error) during paper parsing, as part of Stats.
That would simplify debugging, as users would be able to identify the offending emails and attach the specific HTML that causes the errors.
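A minimal sketch of what that could look like, assuming a Stats-like struct (the real type in papers.go may be shaped differently): collect parse errors per email Subject instead of only counting them, and print them all when verbose output is requested.

```go
package papers

import "log"

// Stats is a hypothetical stand-in for the existing aggregation stats.
type Stats struct {
	Errs       int
	ErrsBySubj map[string][]error // proposed: email Subject -> parse errors
}

// recordErr counts an error and remembers it under the email subject it came from.
func (s *Stats) recordErr(subject string, err error) {
	if s.ErrsBySubj == nil {
		s.ErrsBySubj = map[string][]error{}
	}
	s.Errs++
	s.ErrsBySubj[subject] = append(s.ErrsBySubj[subject], err)
}

// report prints only the error count by default, and every collected error when
// verbose is true (e.g. driven by a -v flag), so users can identify the offending
// emails and attach the failing HTML.
func (s *Stats) report(verbose bool) {
	log.Printf("Errors: %d", s.Errs)
	if !verbose {
		return
	}
	for subj, errs := range s.ErrsBySubj {
		for _, err := range errs {
			log.Printf("  %q: %v", subj, err)
		}
	}
}
```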
Add a mode that, on each report generation, marks all the emails aggregated into the current report as read in Gmail.
HT @m09
When following the initial instructions (I think this is the go get step), I get
(base) kunal@kunal-hp:/tmp/tmp.84voCCm03m$ go get github.com/bzz/scholar-alert-digest
go: finding github.com/bzz/scholar-alert-digest latest
go: downloading github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go: extracting github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go get: github.com/bzz/[email protected] requires
gitlab.com/golang-commonmark/[email protected] requires
gopkg.in/russross/[email protected]: invalid version: unknown revision 000000000000
How can I debug it?
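Not part of the original report, but this looks like the well-known invalid pseudo-version problem with gopkg.in/russross/blackfriday.v2. A commonly suggested workaround (untested against this repo, so treat it as an assumption) is to build from a clone and redirect the gopkg.in path to the module's canonical GitHub path:

```sh
git clone https://github.com/bzz/scholar-alert-digest && cd scholar-alert-digest
# redirect the problematic import path to the canonical module and retry
go mod edit -replace gopkg.in/russross/blackfriday.v2=github.com/russross/blackfriday/v2@v2.0.1
go mod tidy
go run main.go -h
```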
If the emails use http://scholar.google.co.uk/scholar_url?url=..., then because the URL has .co.uk rather than .com, the regex doesn't match the URL and hence the paper is ignored. The adjusted regex fixes it to match any .com and .co.uk domain (and theoretically any other one- or two-part suffix).
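A sketch of what a more permissive prefix check could look like; the exact pattern below is illustrative rather than the project's actual scholarURLPrefix regex, and the helper name is hypothetical:

```go
package papers

import "regexp"

// scholarURLRe accepts scholar.google.com as well as country domains such as
// scholar.google.co.uk or scholar.google.de (one- or two-part suffixes).
var scholarURLRe = regexp.MustCompile(
	`^https?://scholar\.google\.(?:com|[a-z]{2,3}(?:\.[a-z]{2})?)/scholar_url\?url=`,
)

// isScholarURL reports whether a link is a Scholar redirect URL for any such domain.
func isScholarURL(u string) bool {
	return scholarURLRe.MatchString(u)
}
```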
It took me about an hour to get this working, which is pretty silly. Mostly problems with authorization. I would like to update the README based on my experience. Just checking that the repo is maintained before I do this.
One question: I was only able to get the authorization code by copying it from the redirect URL—is this the intended mechanism? When I saw the "localhost refused to connect" page, I assumed something was wrong.
Would it be nice to be able to mark papers as "read" in the web UI?
That seems useful to me, and could trigger different actions:
The idea is to keep the mechanic of interaction as simple as possible for now (only server-side HTTP form submission):
When extracting publications (papers) from emails, a class of papers that in the email look like
https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel
are skipped (14 papers out of 2k+), as ATM we use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf>
pattern, these links look like /scholar?cluster=14905208172666766997&...
and a way to get the URL of an individual PDF (any from the cluster) is not obvious.
One option is to keep those links as-is, so the user will have to choose the PDF from the Scholar page themselves.
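A small sketch of that option (names are illustrative, not taken from papers.go): extract the direct URL from the usual /scholar_url?url=... redirects, and pass /scholar?cluster=... links through unchanged so the user can pick a PDF from the cluster page themselves.

```go
package papers

import (
	"net/url"
	"strings"
)

// paperURL returns the direct URL from a /scholar_url?url=... redirect, or the
// original link unchanged when it is a /scholar?cluster=... result (or anything
// else unexpected).
func paperURL(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return link
	}
	if strings.HasSuffix(u.Path, "/scholar_url") {
		if direct := u.Query().Get("url"); direct != "" {
			return direct
		}
	}
	return link
}
```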
Right now, only the paper title and a URL are included in the report, which looks like this:
It would be nice to include the paper abstract in the report as well, like this:
HT @EgorBu
Right now the backend is completely stateless: every request from the frontend triggers fetching from Gmail.
The idea is to introduce two endpoints:
- /json/messages
- /json/messages/fetch
This will allow experimenting on real data much faster and will be a first step towards introducing proper state management on the backend.
Cf discussion starting with #9 (comment)
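A rough sketch of the split, under the assumption that /json/messages should serve the last fetched payload while /json/messages/fetch refreshes it from Gmail; the handler and cache names are hypothetical and the Gmail call is stubbed out.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

type messageCache struct {
	mu   sync.RWMutex
	data []byte // last JSON-encoded messages payload
}

// serve returns the cached payload without touching Gmail.
func (c *messageCache) serve(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	w.Write(c.data)
}

// fetch refreshes the cache from Gmail and then serves the fresh payload.
func (c *messageCache) fetch(w http.ResponseWriter, r *http.Request) {
	b, _ := json.Marshal(fetchFromGmail())
	c.mu.Lock()
	c.data = b
	c.mu.Unlock()
	c.serve(w, r)
}

// fetchFromGmail is a stub standing in for the existing Gmail fetching code.
func fetchFromGmail() []string { return nil }

func main() {
	c := &messageCache{data: []byte("[]")}
	http.HandleFunc("/json/messages", c.serve)
	http.HandleFunc("/json/messages/fetch", c.fetch)
	http.ListenAndServe(":8080", nil)
}
```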
Right now the report is in GitHub-flavored Markdown format (with some HTML tags), which might be difficult to render properly on a local machine.
We could have a CLI option, say -f, that allows choosing the output format: HTML or Markdown.
HT @m09
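A minimal sketch of such a flag; the -f name and the two formats come from the suggestion above, everything else is illustrative:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	format := flag.String("f", "md", "output format: 'md' (GitHub-flavored Markdown) or 'html'")
	flag.Parse()

	switch *format {
	case "md":
		// render the existing Markdown report
	case "html":
		// render the same report through an HTML template instead
	default:
		log.Fatalf("unsupported format %q, use 'md' or 'html'", *format)
	}
}
```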
Now that the first 🐛 has been found and fixed with #16, which brought in some tests, it would be good to add CI.
The idea is to try GitHub Actions for:
- go test .
The (upcoming, #15) server binary and CD for it are going to be handled in a separate issue.
Hello,
I just tried to generate a report out of a large number (~5000) of emails and scholar-alert-digest crashed with the following backtrace:
panic: template: layout:29666: unexpected "\\" in command
goroutine 1 [running]:
html/template.Must(...)
/usr/lib/go/src/html/template/template.go:372
github.com/bzz/scholar-alert-digest/templates.(*HTMLRenderer).Render(0xc006d3f230, 0xafe5e0, 0xc000010018, 0xc00642d5c0, 0xc0022c1470, 0x0)
/home/simon/git/scholar-alert-digest/templates/templates.go:232 +0x48a
main.main()
/home/simon/git/scholar-alert-digest/main.go:154 +0x542
Unfortunately, I don't know which email actually caused this, but it looks like some sort of escaping problem before things are passed to the Go templating engine.
The command was go run main.go -l scholaralert -authors -html -refs on the current master branch.
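This is only a guess at the failure mode, based on the error message: html/template produces this parse error when a stray backslash appears inside a {{...}} action in the template source. So if email content (e.g. a title or abstract containing something like "{{\alpha}}") gets spliced into the template text rather than passed as data, the parse fails. A self-contained illustration, not the project's actual code:

```go
package main

import (
	"fmt"
	"html/template"
	"os"
)

func main() {
	// Hypothetical paper title containing template-like syntax with a backslash.
	title := `Estimating {{\alpha}}-stable noise parameters`

	// Splicing the content into the template source reproduces the reported error:
	// template: unsafe:1: unexpected "\\" in command
	if _, err := template.New("unsafe").Parse("<li>" + title + "</li>"); err != nil {
		fmt.Println("parse error:", err)
	}

	// Keeping the template static and passing the content as data is safe, because
	// "{{" inside titles/abstracts is then never interpreted as template syntax.
	safe := template.Must(template.New("safe").Parse("<li>{{.}}</li>"))
	_ = safe.Execute(os.Stdout, title)
}
```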
It would be super nice to see not only the count of the references each paper makes, but also a list of the names whose papers it references (the people we are subscribed to). For instance,
Paper's title (3: Author1, Author2, Author3)
To access Gmail through the API we use a client library that seems to be in "maintenance mode": googleapis/google-api-go-client#435 (comment).
ATM it's quite hard to tell what would be a better way or another library for doing that, as it's option 1 from https://github.com/googleapis/googleapis#overview.
But according to https://googleapis.github.io, several other possibilities seem to exist:
Could you please implement marking more than a thousand emails as read?
I had 2067 unread emails with the target label and ran the tool with the '-mark' argument, which resulted in the following message:
failed to batch-delete label UNREAD from 2067 messages: googleapi: Error 400: Number of ids cannot exceed 1000, invalidArgument
The full command in this case was go run main.go -l alerts -mark -html -refs -compact.
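A sketch of chunking the batch call, assuming the google.golang.org/api/gmail/v1 client the tool already uses; the function and variable names are illustrative:

```go
package gmailclient

import gmail "google.golang.org/api/gmail/v1"

// markRead removes the UNREAD label in chunks of at most 1000 ids, the documented
// limit of users.messages.batchModify.
func markRead(srv *gmail.Service, user string, ids []string) error {
	const maxBatch = 1000
	for start := 0; start < len(ids); start += maxBatch {
		end := start + maxBatch
		if end > len(ids) {
			end = len(ids)
		}
		req := &gmail.BatchModifyMessagesRequest{
			Ids:            ids[start:end],
			RemoveLabelIds: []string{"UNREAD"},
		}
		if err := srv.Users.Messages.BatchModify(user, req).Do(); err != nil {
			return err
		}
	}
	return nil
}
```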