
Note, March 2023: The Rxivist project has been discontinued. This code will remain available indefinitely, but we can't make any assurances about its functionality going forward as dependencies and external data sources change.

rxivist API

The Rxivist project is spread out over three code repositories:

  • This one contains code for the API, which provides programmatic access to the data in the Rxivist database.
  • The Rxivist web application, which consumes data from the API, is stored in the rxivist_web project.
  • The web crawler that indexes bioRxiv and medRxiv preprints is stored in the biorxiv_spider project.

Deployment

The Rxivist API is designed to be run in a Docker container (though it doesn't have to be). Once Docker is installed on whatever server you plan to use, there are only a few commands to run:

docker swarm init
docker build . -t rxivist:latest
docker service create --name rxivist_service --replicas 3 --publish published=80,target=80 --env RX_DBUSER --env RX_DBPASSWORD --env RX_DBHOST rxivist:latest

NOTE: This assumes that the necessary environment variables are set on the host machine. If they aren't, you should set them in the docker service create command: for example, rather than including the flag --env RX_DBUSER, you would use something like --env RX_DBUSER=root. The required variables are:

  • RX_DBHOST: The location of your database server, as it should be passed to Postgres. For example, rxivistdev.12345.us-east-1.rds.amazonaws.com
  • RX_DBUSER: The username with which the database client should connect to the Rxivist database.
  • RX_DBPASSWORD: That user's password.
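The API presumably reads these variables from the environment at startup. A minimal sketch of that pattern (the helper name and its fail-fast behavior are illustrative, not taken from the actual codebase):

```python
import os

def load_db_config(env=os.environ):
    """Collect the required Rxivist database settings, failing fast if any is missing."""
    required = ("RX_DBHOST", "RX_DBUSER", "RX_DBPASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {name: env[name] for name in required}
```

Failing fast at startup makes a missing variable obvious immediately, rather than surfacing later as a confusing database connection error.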

Running the commands above builds a new image from the current code in the repository and deploys three containers, across which all requests are load balanced. If one becomes unhealthy, it's removed and replaced with a fresh container. If you want your server to listen on a port other than 80, you can change the value of the "published" option to whatever you'd like; changing the "target" option, however, will break the default settings of the app. The API listens on port 80 inside the container, but you can map that port to whichever host port you wish.

Note: You'll want to modify the config.py file before you run docker build, not after. This file contains several settings regarding the API's server and basic behavior. For now, the configuration is copied into the container at build time. This may change one day and be much nicer.

Development

Using Docker

For local development, you don't need to rebuild a container image every time you want to test a change: Mounting the repository to the container will allow you to test changes as you make them.

git clone https://github.com/blekhmanlab/rxivist.git
cd rxivist
docker run -it --rm --name rxapi -p 80:80 -v "$(pwd)":/app --env RX_DBUSER --env RX_DBPASSWORD --env RX_DBHOST python:3 bash

# You will now be in a shell within the container:
cd /app
pip install -r requirements.txt
python main.py

Note: To run the container in the background, replace the -it flags in the docker command above with -d.

Because the repository is bind-mounted to the container, editing the files locally using your editor of choice will result in the files also changing within the container. If you change the use_prod_webserver value in config.py to False, the server will reload the application whenever a code modification is detected. (Note that the application will exit if it encounters an uncaught exception, and you'll have to start the application again by hand.)
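As a hypothetical illustration of how a flag like use_prod_webserver might select between server modes (the real config.py and main.py may be structured quite differently):

```python
# Illustrative only: a flag choosing between a production server and a
# development server that reloads on file changes.
use_prod_webserver = False  # set to False for auto-reload during development

def server_options(prod):
    """Return keyword arguments for the web framework's run() call."""
    if prod:
        # a production-grade WSGI server, no auto-reload
        return {"server": "gunicorn", "reloader": False}
    # development mode: restart the app whenever a source file changes
    return {"reloader": True, "debug": True}
```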

rxivist's People

Contributors

dependabot[bot], rabdill

rxivist's Issues

Evaluate "shape" of traffic for each paper

When does a paper get most of its traffic? Is it right away, and then it tapers off? How quickly? (Could that be used for some kind of "enduring popularity" metric?) Do some papers start slower and pick up steam over time?
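One hypothetical way to quantify a paper's traffic "shape" is the fraction of its lifetime downloads that arrived in its first few months (the function and window below are illustrative, not part of the codebase):

```python
def early_traffic_fraction(monthly_downloads, window=3):
    """Fraction of a paper's total downloads that arrived in its first `window` months.

    monthly_downloads: downloads per month, oldest first.
    Values near 1.0 suggest an early spike that tapered off;
    low values suggest the paper picked up steam over time.
    """
    total = sum(monthly_downloads)
    if total == 0:
        return 0.0
    return sum(monthly_downloads[:window]) / total
```

A low early fraction could feed directly into an "enduring popularity" metric.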

Combine traffic stats from all revisions of paper

If a new version of a paper is released, we should pull in the download stats from all the old versions too.

NOTE: This may already happen in bioRxiv automatically, not sure.

OTHER NOTE: If bioRxiv doesn't combine traffic numbers between versions, should we keep crawling the old versions to get their updated traffic numbers?
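If bioRxiv doesn't combine them, merging per-version monthly stats ourselves could look like this sketch (the data shapes are assumptions, not the spider's actual structures):

```python
def combine_revision_stats(versions):
    """Sum monthly download counts across all versions of a paper.

    versions: one dict per revision, mapping (month, year) -> downloads.
    Returns a single dict with per-month totals.
    """
    combined = {}
    for stats in versions:
        for month, downloads in stats.items():
            combined[month] = combined.get(month, 0) + downloads
    return combined
```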

Papers get incomplete stats for month they were indexed

When a paper's traffic is recorded, the crawler grabs all the months, including the current (unfinished) one. When we go back to add new months to that list, there's a section in spider.py that makes sure we don't re-record any months we already have:

# make a list that excludes the records we already know about
to_record = []
for record in stats:
    print(record)
    month = record[0]
    year = record[1]
    if year in done and month in done[year]:
        print("Found, not recording")
    else:
        to_record.append(record)

HOWEVER, this loop should also throw out the stats for the most recent month in the most recent year, because that would have been recorded during whatever that month was, and we have a chance to get a more up-to-date number now that we're revisiting it.
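One way to implement that fix is to treat the chronologically latest (month, year) in done as stale and re-record it. A sketch, reusing the shapes from the snippet above (records as (month, year, count) tuples and done as a year-to-months dict are both assumptions):

```python
def months_to_record(stats, done):
    """Filter out already-recorded (month, year) pairs, except the most
    recent one, whose count was incomplete when it was first recorded."""
    # find the latest month we have on file, so we can refresh it
    latest = None
    for year, months in done.items():
        for month in months:
            if latest is None or (year, month) > latest:
                latest = (year, month)
    to_record = []
    for record in stats:
        month, year = record[0], record[1]
        already_done = year in done and month in done[year]
        if already_done and (year, month) != latest:
            continue  # fully recorded in a past crawl, skip
        to_record.append(record)  # new, or stale and due for a refresh
    return to_record
```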

Rank trending papers by recent traffic

Some papers at the top of the "all-time" list may have thousands of downloads, all of them years ago. If we assign a decreasing weight to traffic as it gets farther away from the present, we could have a list that incorporates papers with a lot of downloads but favors papers with RECENT downloads.
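A sketch of such a decay weighting, using an exponential half-life (the function and its parameters are illustrative, not an existing ranking formula):

```python
def weighted_downloads(monthly, half_life=6):
    """Score a paper by downloads, discounting older months exponentially.

    monthly: list of (months_ago, downloads) pairs.
    half_life: months after which a download counts half as much.
    """
    return sum(d * 0.5 ** (age / half_life) for age, d in monthly)
```

With a six-month half-life, a download from two years ago counts only one-sixteenth as much as one from this month, so recently popular papers rise above old giants.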

Combine all sorting metrics into one big table

We don't need separate tables for each metric we're ranking on and what the rank is: we could have a single table where the primary key is the article ID and each field is a metric we sort on. That way, queries could be more precise (for example, only returning the "bounce rate leaders" for papers with more than 50 downloads).
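A sketch of what that consolidated table and a filtered query could look like (the schema and column names are invented for illustration, and SQLite stands in here for the project's Postgres database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# one row per article, one column per ranking metric
conn.execute("""
    CREATE TABLE article_ranks (
        article_id INTEGER PRIMARY KEY,
        alltime_downloads INTEGER,
        monthly_downloads INTEGER,
        bounce_rate REAL
    )
""")
conn.executemany(
    "INSERT INTO article_ranks VALUES (?, ?, ?, ?)",
    [(1, 30, 5, 0.9), (2, 500, 80, 0.4), (3, 75, 60, 0.2)],
)
# e.g. "bounce rate leaders" restricted to papers with more than 50 downloads
leaders = conn.execute("""
    SELECT article_id FROM article_ranks
    WHERE alltime_downloads > 50
    ORDER BY bounce_rate ASC
""").fetchall()
```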

When re-crawling papers, make sure there isn't a more recent version available

If the link we have for a paper isn't the right one (i.e. there's a newer one available), figure out the new one and update the old entry.

NOTE: Be careful with this one, if we record a new paper in this step, the spider will STOP when it runs into that paper in the "new" results because it thinks we've already seen everything after this one. That won't be the case.

We may be able to prevent this entire problem from happening if we only do the "re-crawl old papers" step AFTER we've pulled the latest papers, so all the URLs will already be updated.

Calculate most popular papers

We need a way to rank all papers by traffic, and it would be chaos to run that calculation every time someone asks.

Rank papers by bounce rate

Which ones have the closest ratio of "abstract viewed" to "PDF downloaded"? If we set a minimum number of views, then we can have a list of "most appealing abstracts".
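A sketch of that conversion metric with a minimum-views floor (the names and the default threshold are illustrative):

```python
def abstract_appeal(abstract_views, pdf_downloads, min_views=50):
    """Ratio of PDF downloads to abstract views, or None below the view floor.

    Higher values suggest the abstract convinced more readers to download.
    The floor keeps papers with a handful of views off the leaderboard."""
    if abstract_views < min_views:
        return None
    return pdf_downloads / abstract_views
```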

Add pauses to crawling

Add a flag to turn on a more polite mode of crawling that pauses between major operations to reduce load on bioRxiv
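A sketch of what such a polite-mode pause could look like (the flag and delay are hypothetical, not existing spider.py settings):

```python
import time

POLITE_MODE = True        # hypothetical flag to enable slower, gentler crawling
POLITE_DELAY_SECONDS = 2  # pause between major operations

def polite_pause(delay=POLITE_DELAY_SECONDS, enabled=POLITE_MODE, sleep=time.sleep):
    """Sleep between major crawl operations when polite mode is on.

    Returns the number of seconds paused (0 if polite mode is off)."""
    if enabled:
        sleep(delay)
        return delay
    return 0
```

Injecting the sleep function makes the pause trivially testable without actually waiting.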

Stream redirection isn't working right in API supervisor script

Some of the stdout output that SHOULD be going to /var/log/rxivist.log is just being written to the console. Hiding things in /var/log/messages is silly; I should figure out how to get this to behave right.

To replicate, go to server:

cd /etc
./rc.local

then refresh the web page

Incorporate bioRxiv categories into results

From about page:

Articles in bioRxiv are categorized as New Results, Confirmatory Results, or Contradictory Results. New Results describe an advance in a field. Confirmatory Results largely replicate and confirm previously published work, whereas Contradictory Results largely replicate experimental approaches used in previously published work but the results contradict and/or do not support it.

It would be cool to have separate charts for these available: "Most popular contradictory results" sounds fun

Record clicks from Rxivist over to bioRxiv

We may end up with a weird problem when this thing gets released—if Rxivist is popular enough, the patterns we're observing will end up being influenced by our site. My guess is that popular papers will end up getting more popular, but it will be hard to measure the impact.

If we can at least track how many visitors click on the "view paper" button for a particular article, we can try to account for our impact.
