blekhmanlab / rxivist Goto Github PK

View Code? Open in Web Editor NEW

57.0 4.0 11.0 1006 KB

API providing access to papers and authors scraped from biorxiv.org

Home Page: https://rxivist.org

License: GNU Affero General Public License v3.0

Python 96.95% Shell 1.22% Dockerfile 1.83%

scientific-publications scientific-publishing scientific-papers academic-publications academic-publishing

rxivist's People

Contributors

Stargazers

Watchers

Forkers

dineshresearch dbrg77 elifesciences-publications shibin-george koudyk jschoenbachler bioinfonerd-forks srravula1 jefferyustc cu-dbmi

rxivist's Issues

Look into psql indexing to speed up queries

Add flag in each title indicating how recently it was posted

Indicate last week, last month, last 3 months, etc

Add endpoint: most popular papers, last 30 days

Just use the most recent monthly results

Look into getting daily download statistics

Papers get incomplete stats for month they were indexed

When a paper's traffic is recorded, it grabs all the months, including the current (unfinished) month. When we go back to add new months to that list, there's a section in spider.py to make sure we don't re-record any papers:

# make a list that excludes the records we already know about
      to_record = []
      for i, record in enumerate(stats):
        print(record)
        month = record[0]
        year = record[1]
        if year in done.keys() and month in done[year]:
          print("Found, not recording")
        else:
          to_record.append(record)

HOWEVER, this loop should also throw out the stats for the most recent month in the most recent year, because that would have been recorded during whatever that month was, and we have a chance to get a more up-to-date number now that we're revisiting it.

Run spider against entire bioinformatics vertical

Come up with real method of monitoring server process

I put a crazy hacky rc.local thing in "api/startup.sh" and it's a travesty

Add endpoint: most popular papers, all-time

Just whichever ones have the most hits out of all of them

Record which bioRxiv category a paper came from

Add endpoint for single paper details

Add "last updated" field to article entries

Track the last time individual pages were crawled so we can pull the ones that need to be updated the most

Deploy server to AWS

Needed to run API and host frontend files

Add endpoint: papers listed by bounce rate

"Most appealing abstracts"

Add "trending" list that favors recent traffic

Uses data from #41 — "hot" papers with lots of traffic more recently.

Register domain name

Add endpoint: all papers by author

Include rankings of the papers in each category!

Paginate "list all papers" response

Calculate most popular papers

We need a way to rank all papers by traffic, and it would be chaos to run that calculation every time someone asks.

Deploy database to AWS

Postgres looks like the winner

Combine traffic stats from all revisions of paper

If a new version of a paper is released, we should pull in the download stats from all the old versions too.

NOTE: This may already happen in bioRxiv automatically, not sure.

OTHER NOTE: If bioRxiv doesn't combine traffic numbers between versions, should we keep crawling the old versions to get their updated traffic numbers?

Create main page: "Most popular stories"

Have column for "all-time"

Add author pages that include all papers and their ranks

Add indices to improve ranking efficiency

https://www.postgresql.org/docs/10/static/indexes-ordering.html

Evaluate "shape" of traffic for each paper

When does a paper get most of its traffic? Is it right away, and then it tapers off? How quickly? (Could that be used for some kind of "enduring popularity" metric?) Do some papers start slower and pick up steam over time?

Rank papers by bounce rate

Which ones have the closest ratio of "abstract viewed" to "PDF" downloaded? If we set a minimum number of views, then we can have a list for "Most appealing abstracts"

Add "paper details" page that displays all of its ranks

Also link to all of its authors' pages

Record clicks from Rxivist over to bioRxiv

We may end up with a weird problem when this thing gets released—if Rxivist is popular enough, the patterns we're observing will end up being influenced by our site. My guess is that popular papers will end up getting more popular, but it will be hard to measure the impact.

If we can at least track how many visitors click on the "view paper" button for a particular article, we can try to account for our impact.

Add endpoint: general text search (title, abstract, authors)

Check how easy it is to weight results. ie "title" hits are better than "abstract" hits

Combine all sorting metrics into one big table

We don't need separate tables for each metric we're ranking on and what the rank is—we could have a table where the primary key is the article ID, and then each field is a metric we need to sort on. This way we could have queries be more precise (for example, only give the 'bounce rate leaders' for papers with more than 50 downloads or something)

Figure out why it server tries to connect to DB twice on startup

root@server:/app# python main.py 
Connecting. Attempt 1 of 10.
Connected!
Connecting. Attempt 1 of 10.          <-- whyyyy
Connected!
Bottle v0.12.13 server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:80/
Hit Ctrl-C to quit.

Add "Most appealing abstracts" section to homepage

Use the new "papers by bounce rate" endpoint

Long titles overflow their accordion header

When encountering new revision of known paper, update old record

When crawling NEW papers from the list, check to see if we've already crawled an earlier version and update the record's URL, rather than creating a new record.

Make front-end templates more modular

Use more includes for stuff like the "about" modal:
https://bottlepy.org/docs/dev/stpl.html#template-functions

Look at benefits of eliminating server-side rendering

Should we have a separate JS app to call our API? If so, do we add all the complexity of an advanced framework like Aurelia or Angular 6?

Add query endpoint with all features of search page

Add endpoint: "list all papers"

with basic info about each. (NOT all info.)

Add homepage list for "most popular last month"

Incorporate bioRxiv categories into results

From about page:

Articles in bioRxiv are categorized as New Results, Confirmatory Results, or Contradictory Results. New Results describe an advance in a field. Confirmatory Results largely replicate and confirm previously published work, whereas Contradictory Results largely replicate experimental approaches used in previously published work but the results contradict and/or do not
support it.

It would be cool to have separate charts for these available: "Most popular contradictory results" sounds fun

Add more rankings for authors

Most papers? Most downloads?

Stream redirection isn't working right in API supervisor script

Some of the stdout stuff that SHOULD be going to /var/log/rxivist.log is just being written to the console. Hiding things in /var/log/messages is silly, I should figure out how to get this to behave right.

To replicate, go to server:

cd /etc
./rc.local

then refresh the web page

Document all available endpoints

Don't check article uniqueness based on title

Titles (and authors) can change between revisions, so use URL unless bioRxiv has some snazzy unique identifier we don't know about.

Look into creating accounts for people to save stuff

A-la the arxiv-sanity.com login system, where you can bookmark papers (and then it recommends new ones for you!)

Rank trending papers by recent traffic

Some papers at the top of the "all-time" list may have thousands of downloads, all of them years ago. If we assign a decreasing weight to traffic as it gets farther away from the present, we could have a list that incorporates papers with a lot of downloads but favors papers with RECENT downloads.

We may be able to prevent this entire problem from happening if we only do the "re-crawl old papers" step AFTER we've pulled the latest papers, so all the URLs will already be updated.

Add endpoint: Basic text search by title

https://www.postgresql.org/docs/9.1/static/textsearch-controls.html

Add custom user agent

http://www.user-agents.org/