scholar-alert-digest's Introduction

Google Scholar alert digest

aggregate google scholar email alerts by paper

Simplifies scientific paper discovery by aggregating all unread emails under a Gmail label from the Google Scholar alerts, grouping papers by title and producing a report (Markdown/HTML/JSON).
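
The grouping step can be pictured with a minimal sketch. It assumes a simplified, hypothetical Paper type and a groupByTitle helper; neither name comes from the project, and the real tool may group and order papers differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Paper is a hypothetical, simplified record of one paper found in an alert email.
type Paper struct {
	Title string
	URL   string
}

// AggregatedPaper is a paper together with the number of times it was seen.
type AggregatedPaper struct {
	Paper
	Count int
}

// groupByTitle merges papers that share a title and counts their occurrences.
// The result is sorted by descending count, then by title, for a stable report.
func groupByTitle(papers []Paper) []AggregatedPaper {
	index := make(map[string]int) // title -> position in result while building
	var result []AggregatedPaper
	for _, p := range papers {
		if i, ok := index[p.Title]; ok {
			result[i].Count++
			continue
		}
		index[p.Title] = len(result)
		result = append(result, AggregatedPaper{Paper: p, Count: 1})
	}
	sort.SliceStable(result, func(i, j int) bool {
		if result[i].Count != result[j].Count {
			return result[i].Count > result[j].Count
		}
		return result[i].Title < result[j].Title
	})
	return result
}

func main() {
	papers := []Paper{
		{Title: "Paper A", URL: "https://example.org/a"},
		{Title: "Paper B", URL: "https://example.org/b"},
		{Title: "Paper A", URL: "https://example.org/a"},
	}
	for _, p := range groupByTitle(papers) {
		fmt.Printf("- [%s](%s) (%d)\n", p.Title, p.URL, p.Count)
	}
}
```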

How to use

To use this tool for generating a report on new papers from Google Scholar, do the following:

  1. Search on Google Scholar for a paper of an author
  2. Create an Alert (for citations, new or similar publications)
  3. Create a Gmail filter, moving all those emails under a dedicated Label
  4. Run this tool to get an aggregated report (in Markdown or HTML) of all the papers from the unread emails

For more details, please refer to the documentation.

Setup

Make sure you have a recent version of Go. Then clone this repository:

git clone https://github.com/bzz/scholar-alert-digest

Building a binary (optional)

Alternatively, you can try to build a scholar-alert-digest binary and put it under $GOPATH/bin with:

cd "$(mktemp -d)" && go mod init scholar-alert-digest  && go get github.com/bzz/scholar-alert-digest

However, this approach is known to yield errors and is not recommended.

Configure Google Cloud

Enable the Gmail API in a Google Cloud Platform (GCP) project and download credentials.json by following these steps.
That will guide you through creating a new GCP project, enabling the Gmail API and getting an "OAuth client ID": the authorization credentials for a desktop application that are needed to access your email messages in Gmail.

After placing credentials.json in the project directory, you need to authenticate the application. You can do this by running

go run main.go

An accounts.google.com link will be printed (and possibly opened in your browser). Follow the login instructions, selecting the Google account you used for the previous step if you have multiple. You will get a warning that Google has not verified the app; click Continue, and then Continue again.

Oh no! This site can't be reached! You'll get a "refused to connect" message. That is fine! Just go to the URL bar and look for a section like this:

&code=4/0AWtgzh78xyaMnEMdDBL5P-tX66J3Fsb_93XvRCJzmLXDplnByMZmaXZcFjde3hJIt3D1pA

Copy the part following the = sign (importantly, not including the trailing &scope) and paste it into the terminal. Now the app is authenticated. In the future you won't need to repeat this step.
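
If copying by hand feels error-prone, the same extraction can be sketched in Go with the standard net/url package. The helper name authCodeFromRedirect and the example redirect URL (host, state and scope values) are placeholders for illustration, not part of the tool.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// authCodeFromRedirect extracts the value of the "code" query parameter
// from the URL shown in the browser after the failed redirect.
func authCodeFromRedirect(redirect string) (string, error) {
	u, err := url.Parse(redirect)
	if err != nil {
		return "", err
	}
	code := u.Query().Get("code")
	if code == "" {
		return "", errors.New("no code parameter in redirect URL")
	}
	return code, nil
}

func main() {
	// Hypothetical redirect URL; only the code parameter matters here.
	redirect := "http://localhost/?state=state-token&code=4/0AWtgzh78&scope=https://mail.google.com/"
	code, err := authCodeFromRedirect(redirect)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(code)
}
```

Parsing the query string stops at the next & automatically, so the trailing scope parameter is never included.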

CLI

The CLI tool is used to generate one-time Markdown/HTML reports.

To find your specific label name:

go run main.go -labels

To generate the report, either pass the label name through the CLI:

go run main.go -l '<your-label-name>'

Or export it as an env var:

export SAD_LABEL='<your-label-name>'
go run main.go
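
How a flag and an environment variable can feed the same setting is sketched below. The -l flag and SAD_LABEL come from the instructions above; the resolveLabel helper and the rule that the flag wins over the variable are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveLabel returns the flag value if it is set, otherwise SAD_LABEL.
// The precedence (flag over env var) is an assumption of this sketch.
func resolveLabel(flagValue string, getenv func(string) string) string {
	if flagValue != "" {
		return flagValue
	}
	return getenv("SAD_LABEL")
}

func main() {
	label := flag.String("l", "", "name of the Gmail label")
	flag.Parse()

	name := resolveLabel(*label, os.Getenv)
	if name == "" {
		fmt.Println("no label given: pass -l or set SAD_LABEL")
		return
	}
	fmt.Println("using label:", name)
}
```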

Run

To output rendered HTML or JSON instead of the default Markdown, use

go run main.go -html
go run main.go -json

To mark all emails that were aggregated in the current report as read, use

go run main.go -mark

To include read emails in a separate section of the report, do

go run main.go -read

To only aggregate the email subjects do

go run main.go -subj | uniq -c | sort -dr

There is an optional more compact report template that may be useful for a large number of papers:

go run main.go -compact

To include authors in the paper details snippet, use

go run main.go -authors

To include references to the original emails in the report, do:

go run main.go -refs

Web Server

The Web UI exposes HTML report generation to multiple concurrent users.

Test

It is possible to test it locally, without the Gmail app configuration described below, by using emails from ./fixtures:

go run ./cmd/server -test

Configure

The web server does not support the same OAuth client credentials from credentials.json that the CLI uses.

It requires you to:

  • Create new credentials in your API project at https://console.developers.google.com/apis/credentials?project=quickstart-<NNN>
  • Choose "Credentials" -> "Create credentials" -> "Web application" type
  • Add the http://localhost/login/authorized value to the Authorized redirect URIs field
  • Copy the Client ID and Client secret

Pass in the ID and the secret as env vars, e.g. by

export SAD_GOOGLE_ID='<client id>'
export SAD_GOOGLE_SECRET='<client secret>'

You do not need to pass the label name on startup, as it can be chosen at runtime at /labels.

Run

The report generation is exposed through a web server that can be started with

go run ./cmd/server [-compact]

to spin up a server at http://localhost:8080

Start by visiting http://localhost:8080/login to get the user OAuth access token. Visit http://localhost:8080/labels to choose your label name.

License

Apache License, Version 2.0. See LICENSE

scholar-alert-digest's People

Contributors

bzz, dependabot[bot], fredcallaway, heruka-urgyen, marwahaha, robertkirk

scholar-alert-digest's Issues

switch to a better way of accessing gmail API

To access Gmail through the API we use a client library that seems to be in "maintenance mode": googleapis/google-api-go-client#435 (comment)

ATM it's quite hard to tell what would be a better way/another library for doing that, as it's option 1 from https://github.com/googleapis/googleapis#overview.

But according to https://googleapis.github.io several other possibilities seem to exist:

Feature request: add option to generate HTML

Right now the report is in GH-flavored Markdown format (with some HTML tags) that might be difficult to render properly on a local machine.

We could have a CLI option, say, -f that allows choosing the output format between HTML and Markdown.

HT @m09

installation issues

When following the initial instruction (I think this is the go get step), I get

(base) kunal@kunal-hp:/tmp/tmp.84voCCm03m$ go get github.com/bzz/scholar-alert-digest
go: finding github.com/bzz/scholar-alert-digest latest
go: downloading github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go: extracting github.com/bzz/scholar-alert-digest v0.0.0-20210307170856-bfb769eb3f3c
go get: github.com/bzz/[email protected] requires
	gitlab.com/golang-commonmark/[email protected] requires
	gopkg.in/russross/[email protected]: invalid version: unknown revision 000000000000

How can I debug it?

Paper cache + reload button

Right now the backend is completely stateless: every request from the frontend triggers fetching from GMail.

The idea is

  • to cache papers per user session in-memory on the backend and serve only them from /json/messages
  • add explicit "Reload" action to the frontend that will trigger a new /json/messages/fetch

This will allow experimenting on real data much faster and will be a first step towards introducing proper state management on the backend.
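
A minimal sketch of such a per-session cache follows. The PaperCache type, its methods and the simplified Paper type are hypothetical; the fetch callback stands in for whatever code actually talks to Gmail.

```go
package main

import (
	"fmt"
	"sync"
)

// Paper is a simplified, hypothetical stand-in for the project's paper type.
type Paper struct{ Title string }

// PaperCache keeps the last fetched papers per session ID in memory.
type PaperCache struct {
	mu     sync.RWMutex
	papers map[string][]Paper
}

// NewPaperCache returns an empty cache.
func NewPaperCache() *PaperCache {
	return &PaperCache{papers: make(map[string][]Paper)}
}

// Get returns the cached papers for a session and whether any were cached.
func (c *PaperCache) Get(session string) ([]Paper, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.papers[session]
	return p, ok
}

// Reload calls fetch and, on success, replaces the cached papers for the session.
func (c *PaperCache) Reload(session string, fetch func() ([]Paper, error)) ([]Paper, error) {
	p, err := fetch()
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.papers[session] = p
	c.mu.Unlock()
	return p, nil
}

func main() {
	cache := NewPaperCache()
	fetch := func() ([]Paper, error) { return []Paper{{Title: "Paper A"}}, nil }

	if _, ok := cache.Get("session-1"); !ok {
		fmt.Println("cache miss, reloading")
		if _, err := cache.Reload("session-1", fetch); err != nil {
			fmt.Println("reload failed:", err)
			return
		}
	}
	papers, _ := cache.Get("session-1")
	fmt.Println("cached papers:", len(papers))
}
```

In this shape, /json/messages would only call Get, while the explicit "Reload" action would call Reload.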

Feature Request: Make Web View Interactive

Hi,

Really liking this tool. Going through the papers in my alerts, I would highly appreciate an option to remove papers from the list (for example, a small button next to each paper that makes them disappear). I do not need that to be reflected in the actual emails, I would just like to click the papers away one-by-one as I go through them. Do you think that would be possible? If you point me to the right direction, I am happy to set that up and make a PR.

Thanks!

Update/fix usage instructions

It took me about an hour to get this working, which is pretty silly. Mostly problems with authorization. I would like to update the README based on my experience. Just checking that the repo is maintained before I do this.

One question: I was only able to get the authorization code by copying it from the redirect URL—is this the intended mechanism? When I saw the "localhost refused to connect" page, I assumed something was wrong.

choice: create new "Quickstart" project in G Console or use existing pre-authorized app

Right now, for authorization with the server-side OAuth 2.0 flow from Google, we tell the user to create a new Quickstart project in their API console (under their own account).

It has only a limited number of API requests, and some permissions/scopes are severely capped (only 100 -modify calls), e.g. those used to mark email as read.

Maybe a better idea would be to have a documented option of using a pre-registered, verified app, so that the user can skip the API console configuration steps and avoid the hassle of app verification (can take days :/)

Idea: Launch script

When following the instructions, under Ubuntu 20.04 the package (currently) gets downloaded to /home/USER/go/pkg/mod/github.com/bzz/[email protected]

To access it easier, I wrote a little script which can run from anywhere, starts the service and automatically opens the browser. No magic, but convenient:

export SAD_LABEL='XXX'
export SAD_GOOGLE_ID='YYY'
export SAD_GOOGLE_SECRET='ZZZ'

xdg-open http://localhost:8080
cd /home/USER/go/pkg/mod/github.com/bzz/[email protected]

go run ./cmd/server [-compact]

Not sure if this is the optimal solution, though. Maybe one could add this (or a better/different approach) to the Readme as a starting point for new users? Also, it could be stated that the credentials.json needs to be placed in the root directory, even though this was easy to figure out.

Error on extracting papers from email with 'Showing less relevant results'

I've noticed that any scholar alert emails that have been configured with 'all results' rather than 'most relevant' result in an error when processed by this tool. This might be because each email starts with:

"Showing less relevant results because there are no great results

Update alert to receive fewer, more relevant results"

Am I correct in this, and if so would this be an easy fix to implement? Here is my code (note this happens in json/html or with just minimal flags):

go run main.go -l 'GScholar' -read -authors
2022/04/11 10:04:41 searching and fetching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 searching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 14 messages found (took 0 sec)
14 / 14 [-----------------------------------------------------] 100.00% ? p/s 1s
2022/04/11 10:04:42 14 messages fetched (took 0 sec)
2022/04/11 10:04:42 14 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 searching and fetching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 searching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 1 messages found (took 0 sec)
1 / 1 [-------------------------------------------------------] 100.00% ? p/s 0s
2022/04/11 10:04:42 1 messages fetched (took 0 sec)
2022/04/11 10:04:42 1 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 rendering 2 papers
# Google Scholar Alert Digest

**Date**: 2022-04-11T10:04:42&#43;01:00
**Unread emails**: 14
**Paper titles**: 2
**Uniq paper titles**: 2

## New papers

   
 - [Cerebellar Transcranial Magnetic Stimulation (TMS) Impairs Visual Working Memory](https://link.springer.com/article/10.1007/s12311-022-01396-2), <i>N Viñas</i> (1)
   <details>
     <summary>… As a precaution, the coil was positioned using the Brainsight navigator and the</summary>
     <div>experimenter monitored for potential deviation of the target, the “bullseye,” and maintained the coil position targeting the cerebellum targets if needed. Details of this …</div>
   </details>
   

   
 - [Short-term facilitation effects elicited by cortical priming through theta burst stimulation and functional electrical stimulation of upper-limb muscles](https://link.springer.com/article/10.1007/s00221-022-06353-3), <i>Update Alert To Receive Fewer, More Relevant Results</i> (1)
   <details>
     <summary>… The coil position and orientation were monitored throughout the experiment using a</summary>
     <div>neuronavigation system (Brainsight, Rogue Research, Montreal, Canada). Ten TMS stimuli, with approximately 5–7 s inter-stimulus intervals, were delivered for …</div>
   </details>
   

## Old papers

<details id="archive">
  <summary>Archive</summary>


</details>
2022/04/11 10:04:42 Errors: 13

Missing header data in json response from server

In order for the front-end app to have access to a complete data model, please add the following to the json response from server:

  • datetime when the report was generated
  • total number of unread email Messages that were used to generate it
  • total number of Papers (by unique paper titles) cited/references in these unread emails
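
One possible shape for such a response is sketched below. The Report type, its field names and the JSON keys are assumptions chosen for illustration; the issue does not prescribe them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Paper is a simplified, hypothetical stand-in for the project's paper type.
type Paper struct {
	Title string `json:"title"`
}

// Report wraps the list of papers with the header data requested above.
// All field names and JSON keys are assumptions of this sketch.
type Report struct {
	Generated    time.Time `json:"generated"`
	UnreadEmails int       `json:"unread_emails"`
	UniquePapers int       `json:"unique_papers"`
	Papers       []Paper   `json:"papers"`
}

// encode renders the report as indented JSON.
func encode(r Report) (string, error) {
	b, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encode(Report{
		Generated:    time.Now(),
		UnreadEmails: 14,
		UniquePapers: 2,
		Papers:       []Paper{{Title: "Paper A"}, {Title: "Paper B"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```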

Feature request: add referenced people

It would be super nice to see not only the count of the references each paper makes but also a list of names whose papers it references (the people we are subscribed to). For instance,

Paper's title (3: Author1, Author2, Author3)

Add -v CLI flag that prints all paper parsing errors

Right now on parsing the papers we only count all errors, silently skipping failures and thus hiding their root cause.

// Stats is a number of counters with stats on paper extraction from gmail messages.
type Stats struct {
	Msgs, Titles, Errs int
}

This leads to difficulties reproducing bugs like #76 as one needs to modify source code (by adding log.Print(err) to

if err != nil {
st.Errs++
) for debugging.

Instead, it would be nice to have:

  • a -v flag that would print all the errors
  • accumulated (e.g as map[Subject][]error) through paper parsing, as part of Stats

That would simplify debugging, as users would be able to identify the offending emails and attach the specific HTMLs that cause errors.
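
A minimal sketch of the proposal, under assumptions: the ErrsBySubject field and the AddErr method are hypothetical names, and the subject is modelled as a plain string rather than a dedicated Subject type.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Stats mirrors the counters quoted above and adds a hypothetical
// ErrsBySubject field that keeps the errors per email subject.
type Stats struct {
	Msgs, Titles, Errs int
	ErrsBySubject      map[string][]error
}

// AddErr counts the error and remembers which email subject caused it.
func (st *Stats) AddErr(subject string, err error) {
	st.Errs++
	if st.ErrsBySubject == nil {
		st.ErrsBySubject = make(map[string][]error)
	}
	st.ErrsBySubject[subject] = append(st.ErrsBySubject[subject], err)
}

func main() {
	verbose := flag.Bool("v", false, "print all paper parsing errors")
	flag.Parse()

	st := &Stats{}
	st.AddErr("New articles related to X", errors.New("no title found"))

	fmt.Println("Errors:", st.Errs)
	if *verbose {
		for subject, errs := range st.ErrsBySubject {
			for _, err := range errs {
				fmt.Printf("%q: %v\n", subject, err)
			}
		}
	}
}
```

Without -v the output stays a single count, as today; with -v each error is printed next to the subject of the email it came from.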

URL Regex doesn't match .co.uk scholar urls

If the emails use http://scholar.google.co.uk/scholar_url?url=.... then because the url has .co.uk rather than .com the regex doesn't match the url and hence ignores the paper. This adjusted url fixes it to match any .com and .co.uk (and theoretically any other one or two-part suffix).
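
The kind of pattern described can be sketched as follows. The expression below is an illustrative assumption, not the regex used by the project or by the fix.

```go
package main

import (
	"fmt"
	"regexp"
)

// scholarURL matches scholar_url links on scholar.google.<suffix>, where
// <suffix> has one or two parts (for example com, de, co.uk).
// The pattern is an assumption of this sketch.
var scholarURL = regexp.MustCompile(`^https?://scholar\.google\.[a-z]+(\.[a-z]+)?/scholar_url\?url=`)

func main() {
	for _, u := range []string{
		"http://scholar.google.com/scholar_url?url=https://example.org/a.pdf",
		"http://scholar.google.co.uk/scholar_url?url=https://example.org/b.pdf",
		"http://example.org/scholar_url?url=https://example.org/c.pdf",
	} {
		fmt.Println(scholarURL.MatchString(u), u)
	}
}
```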

infra: add CI for tests and release

Now, that the first 🐛 is found and fixed with #16 that brought in some tests :feelsgood: it would be good to add a CI.

The idea is to try Github Actions for:

  • a test profile: run go test .
  • a release profile: build the CLI report generation binary (macOS, linux), create GH release and upload binary in there

An (upcoming #15) server binary and CD for it is going to be handled in a separate issue

feature request: visible authors

In the current implementation, authors are always hidden and are shown only when abstracts are expanded. If someone is interested in authors much more than in abstracts, it requires them to click a lot to see all of the authors.

As a workaround, maybe the authors could be shown in between the title and the abstract?

Screenshot 2019-12-23 at 12 40 40

But I guess it makes the normal mode even less compact. :) So another way is to add the authors after the title like this:

Screenshot 2019-12-23 at 12 55 00

Screenshot 2019-12-23 at 12 49 24

Would this pattern introduce less noise?

Interaction model

Modelling the state of individual paper on backend (and not only in Gmail) will open a lot of opportunities e.g. tagging papers by topics, using that as a source for training data for training classifiers that target sub-fields, etc that go beyond our current use-cases that, of course, we also want to keep supporting (for background on the current use-cases see #19).

In order to decide how to proceed, we will need to answer the question: how does one mark papers as 'read' and then get back to those later? Our current approach, with a single, ever-growing "read" section on the same page, works but does not seem to be very productive.

I see two main alternative interaction models for managing the state of the paper:

  1. The "inbox" model
    Very similar to what Gmail does: individual paper checkboxes (with bulk select) + tags. Then a "Read" section could be modeled by a dedicated tag.
  2. The "report generation" model
    A "generate a report" action that aggregates everything unread to a timestamped report, marking all the papers as "read" in bulk + a new page with the history of all reports for every individual user.

Marking more than 1000 emails as read fails

Could you please implement marking more than a thousand emails as read?

I had 2067 unread emails with the target label and ran the tool with the '-mark' argument, which resulted in the following message:

failed to batch-delete label UNREAD from 2067 messages: googleapi: Error 400: Number of ids cannot exceed 1000, invalidArgument

The full command in this case was go run main.go -l alerts -mark -html -refs -compact.
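
One way around the limit is to split the message IDs into batches of at most 1000 and send one request per batch. The sketch below only shows the splitting; the chunk helper is a hypothetical name and the actual Gmail call is left as a comment.

```go
package main

import "fmt"

// maxBatchSize is the per-request limit reported in the error above.
const maxBatchSize = 1000

// chunk splits ids into consecutive slices of at most size elements.
func chunk(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	ids := make([]string, 2067)
	for i := range ids {
		ids[i] = fmt.Sprintf("msg-%d", i)
	}
	for i, batch := range chunk(ids, maxBatchSize) {
		// A real implementation would issue one batch-modify request per batch here.
		fmt.Printf("batch %d: %d ids\n", i, len(batch))
	}
}
```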

Handle "clusters" on paper extraction

On extracting publications (papers) from emails, a class of papers that in email look like

  • https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel

are skipped (14 papers out of +2k) as ATM we use a regex to extract the pdf URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf> pattern, these links look like /scholar?cluster=14905208172666766997&... and a way to get the URL to an individual pdf (any from the cluster) is not obvious.

One option is to keep those links as-is, so the user will have to choose the PDF from a scholar page themselves.
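
That option can be sketched with the standard net/url package instead of a regex. The paperURL helper is a hypothetical name, and the sketch assumes that a link is either of the /scholar_url?url=... form or should be passed through unchanged.

```go
package main

import (
	"fmt"
	"net/url"
)

// paperURL returns the target of a /scholar_url?url=... link. Any other
// Scholar link, such as /scholar?cluster=..., is returned unchanged so the
// reader can pick a PDF on the Scholar page.
func paperURL(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return link
	}
	if u.Path == "/scholar_url" {
		if target := u.Query().Get("url"); target != "" {
			return target
		}
	}
	return link
}

func main() {
	fmt.Println(paperURL("https://scholar.google.com/scholar_url?url=https://example.org/a.pdf&hl=en"))
	fmt.Println(paperURL("https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en"))
}
```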

Refactor gmail API lib usage

Don't use deprecated API from Go gmail lib

gmail.go:91:14 gmail.New is deprecated: please use NewService instead. To provide a custom HTTP client, use option.WithHTTPClient. If you are using google.golang.org/api/googleapis/transport.APIKey, use option.WithAPIKey with NewService instead.  (SA1019)

Feature request: add a web server

It would be nice to have an option to deploy it somewhere and get reports generated so there is no need to run it locally for every individual user.

This will require a shared Gmail API 3-legged Oauth2.0 app configuration from #7 .

Actual deployment is going to be handled by a separate issue.

HTMLRenderer crashes when special characters appear in digests

Hello,

I just tried to generate a report out of a large number (~5000) of emails and scholar-alert-digest crashed with the following backtrace:

panic: template: layout:29666: unexpected "\\" in command

goroutine 1 [running]:
html/template.Must(...)
        /usr/lib/go/src/html/template/template.go:372
github.com/bzz/scholar-alert-digest/templates.(*HTMLRenderer).Render(0xc006d3f230, 0xafe5e0, 0xc000010018, 0xc00642d5c0, 0xc0022c1470, 0x0)
        /home/simon/git/scholar-alert-digest/templates/templates.go:232 +0x48a
main.main()
        /home/simon/git/scholar-alert-digest/main.go:154 +0x542

Unfortunately I don't know which email actually caused this but it looks like some sort of escaping problem before passing things to the go templating engine.

Command was go run main.go -l scholaralert -authors -html -refs on the current master branch
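
One possible cause, not confirmed by this report, is that text from the emails ends up inside the template source before it is parsed. The sketch below shows the usual way to avoid that class of problem with html/template: keep the layout fixed and pass paper text only as data. The layout and the render helper are hypothetical and much simpler than the project's templates.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// layout only contains markup and actions. Paper text never becomes part of
// the template source, so characters like \ or {{ cannot break parsing.
const layout = `<ul>{{range .}}
  <li>{{.}}</li>{{end}}
</ul>
`

// render executes the fixed layout with the titles passed as data.
func render(titles []string) (string, error) {
	tmpl, err := template.New("layout").Parse(layout)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, titles); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render([]string{
		`A title with a \ backslash`,
		`A title with {{ braces }} and <tags>`,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```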

web server improvements

Technical things that need to be done before thinking about public deployment.

TODOs:

  • redirect of requests to any URL without a valid token to /login (middleware in /json/*)
  • automatic OAuth token refresh on expiration
  • refactoring: add a router
  • refactoring: use html/template for the main page rendering (done in #25)
  • refactoring: extract a gmailutils.MessageFetcher interface
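
The first item, redirecting requests without a token to /login, could be sketched as a net/http middleware. The cookie named "token" and the hasToken check are placeholders; a real check would also validate the token, and the project may store it elsewhere.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// hasToken is a placeholder check; a cookie named "token" is an assumption.
func hasToken(r *http.Request) bool {
	c, err := r.Cookie("token")
	return err == nil && c.Value != ""
}

// requireLogin redirects requests without a token to /login.
func requireLogin(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !hasToken(r) {
			http.Redirect(w, r, "/login", http.StatusFound)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor runs one request through the middleware and returns the status code.
func statusFor(withToken bool) int {
	handler := requireLogin(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "[]")
	}))
	req := httptest.NewRequest(http.MethodGet, "/json/messages", nil)
	if withToken {
		req.AddCookie(&http.Cookie{Name: "token", Value: "abc"})
	}
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("without token:", statusFor(false))
	fmt.Println("with token:", statusFor(true))
}
```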

feature request: "mark as read" interaction for web

Would it be nice to be able to mark papers as "read" in the web UI?

That seems useful to me, and could trigger different actions:

  1. move the paper to "Archive" section of the report immediately
    Although it looks nice at first thought, that would be inconsistent with the "read/unread" gmail message state that is the premise of putting papers in the New/Archive sections of the report in the CLI mode.
  2. keep the paper marked as "read" in the same section of the report
    When all the papers from the same email are "marked" as read, only then trigger actual sync with gmail and mark the email message as read.

The idea is to keep the mechanic of interaction as simple as possible for now (only server-side HTTP form submission):

  • add a checkbox near every paper
  • add a "submit" button that could be clicked any time and would trigger the sync with gmail
