api's People

Contributors

creekorful

api's Issues

Add endpoint to save website content

Currently the Persister process interacts directly with the database. I'd like to change that by adding an endpoint to the API process that saves content. After this refactoring the API will be the only process that talks to the database, which will make it easier to change the underlying database technology.

Things to do:

  • add a POST endpoint that takes a website (URL, title, content) and persists it to the database (move the appropriate code from Persister to API)
  • refactor Persister to use the newly created endpoint

Things to think about:

  • Encoding the content as base64 would be great.
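
To make the base64 idea concrete, here is a minimal sketch of how a request body for such an endpoint might be built and decoded. The field names (`url`, `title`, `content`) follow the list above; the helper names `build_payload` and `parse_payload` and the use of JSON are assumptions for illustration, not part of the project.

```python
import base64
import json


def build_payload(url: str, title: str, content: bytes) -> str:
    """Serialize a crawled page into a JSON body (hypothetical format)."""
    return json.dumps({
        "url": url,
        "title": title,
        # Base64 keeps arbitrary bytes safe inside a JSON string.
        "content": base64.b64encode(content).decode("ascii"),
    })


def parse_payload(body: str) -> dict:
    """Decode the JSON body on the API side and restore the raw content bytes."""
    data = json.loads(body)
    data["content"] = base64.b64decode(data["content"])
    return data
```

With this shape the Persister would only need to send the body to the new endpoint, and the API could decode it before writing to the database.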

Update endpoint to retrieve resources

GET /resources

with the following query parameters:

  • url: full URL of the resource (excluding protocol)
  • date: the crawling date

The API will retrieve all content matching the resource URL and then pick the entry closest in time to the requested date (absolute distance will be used).

If no resource is available, or if the closest resource is too far in time (configurable threshold), the API will return nothing.
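
The selection rule described above could be sketched as follows. This is only an illustration: the function name `closest_resource` is hypothetical, and it works on a plain list of crawl dates, whereas the real API would query the database.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def closest_resource(
    crawl_dates: Sequence[datetime],
    requested: datetime,
    threshold: timedelta,
) -> Optional[datetime]:
    """Return the crawl date nearest to `requested`, or None if none is close enough."""
    if not crawl_dates:
        return None
    # Absolute distance: earlier and later crawls are treated the same.
    best = min(crawl_dates, key=lambda d: abs(d - requested))
    if abs(best - requested) > threshold:
        return None
    return best
```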

If the search query parameter is present, a text search will be performed using the Elasticsearch instance instead.

Migrate database technology

Once the Persister is refactored to use the API, it would be great to change the database technology to something better suited.

Cassandra DB looks like a good choice.

  • Column based
  • High availability
  • Scalable
  • Designed to handle large amounts of data

It would be cool to benchmark the results

Add dynamic linking when getting resource

After #4 is merged, the following change should be applied:

When content of type text/html is retrieved, the API must extract all absolute/relative resource URLs and replace them with paths to the API.

For example, requesting example.org at time 20191091551032 with the content

<html>
<img src="/an-image.png">
<a href="https://google.de">
</html>

should be transformed to:

<html>
<img src="<api_uri>/content?date=20191091551032&url=example.org/an-image.png">
<a href="<api_uri>/content?date=20191091551032&url=https://google.de">
</html>

This will allow reading complete websites from trandoshan, including images, stylesheets, etc.
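
A rough sketch of the rewriting step, mirroring the example above: relative URLs are resolved against the requested page and stored without protocol, while absolute URLs are passed through unchanged. The function name `rewrite_links`, the regular expression, and the choice to handle only double-quoted `src`/`href` attributes are assumptions; a real implementation would probably use an HTML parser and percent-encode the `url` value.

```python
import re
from urllib.parse import urljoin, urlsplit

# Matches src="..." or href="..." attributes with double-quoted values only.
ATTR_RE = re.compile(r'\b(src|href)="([^"]*)"')


def rewrite_links(html: str, page_url: str, date: str, api_uri: str) -> str:
    """Point every src/href in `html` at the API's /content endpoint."""

    def replace(match: re.Match) -> str:
        attr, target = match.group(1), match.group(2)
        if urlsplit(target).scheme:
            # Absolute URL: kept as-is, as in the example above.
            resource = target
        else:
            # Resolve against the page using a placeholder scheme, then drop it.
            resource = urljoin("http://" + page_url, target)[len("http://"):]
        return f'{attr}="{api_uri}/content?date={date}&url={resource}"'

    return ATTR_RE.sub(replace, html)
```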
