collect's Introduction

Collect

Collect is a server to collect & archive websites written for NodeJS.

It does not download entire sites, but rather single pages and all content needed to display them. This means that Collect stores a static copy of the website (and its assets) on your disk. It also hosts these pages so you can access them over the network.

Features

  • General
    • Archive web pages and videos
    • View all archived pages and videos
  • Web interface
    • Simply add sites and videos via their URL
    • Browse your archive by domain
    • Manage/Delete downloaded content
    • Any change on the server side will be sent to clients in real time
  • API
    • Get all sites / list sites by domain
    • Get details of saved content
    • Add a site to the archive
    • Delete a site
    • Edit title of a saved page
    • Download all saved pages as an archive (See Backup)
    • For more, see the API documentation

Screenshots

Main Page

Main Page Screenshot

New Page

New Page Screenshot

Details Page

Details Page Screenshot

Installation

Before installing Collect, please make sure that git, node and npm are installed.

Note: This install process has been tested with Node versions 12, 14 and 16. The current test status can be seen from the "Test" badge in the repository; if it is green, everything should work!
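If you are not sure whether these tools are installed, you can print their versions first:

git --version
node --version
npm --version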

Start by cloning the repository to your computer/server:

git clone https://github.com/xarantolus/Collect.git

Switch to the Collect directory:

cd Collect/Collect

Install dependencies:

npm install

Start the server in production mode (recommended):

npm start production

or

node app production

Expected output:

Preparing integrity check...
Checking cookie file...
Checking if folders for ids exist...
All folders exist.
Checking if ids for folders exist...
All entrys exist.
Finished integrity check.
Collect-Server(1.17.0-production) listening on port 80

Now open the website in your browser by visiting http://localhost:80 if running on the same computer or http://yourserver:80, where yourserver is the network name of your server.
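If you prefer the command line, you can also check that the server responds (assuming it runs locally on port 80; the exact response depends on your authentication settings):

curl -i http://localhost:80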

You will notice that you need to authenticate with a username and password. That can be set up as shown in the next section.

Settings

To change settings, edit Collect/config.json. There, you can set a port, username, password, id_length, api_token, allow_public_view and allow_public_all. Note that you need to restart the server to apply changes.
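As a rough sketch, the file might look like the following. The key names follow the list above, while the values shown here are only placeholders; check the config.json that ships with your copy of Collect for the actual structure and defaults.

{
    "port": 80,
    "username": "admin",
    "password": "change-me",
    "id_length": 5,
    "api_token": "some-long-random-string",
    "allow_public_view": false,
    "allow_public_all": false
}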

Settings documentation
Port

The port the server should listen on. If another program uses this port, the server will not be able to start.
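If the server fails to start, you can check whether another process is already listening on the configured port (example for Linux, assuming ss is available; replace 80 with your port):

ss -ltn | grep ':80'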

Username

The username that should be used to log in.

Password

The password for this user. Please don't use a password you use somewhere else.

ID length

The length of the ids the server generates. There are 16^length possible ids per domain, so if you save a very large number of pages from the same domain (roughly a million or more), you should increase this number.

API token

If you'd like to play around with the API, you can set an API token. It is implemented so that integrating apps like Workflow is easy.

If you don't want to use the API, it is recommended to set the token to a long random string.
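One way to generate such a random string (assuming openssl is available):

openssl rand -hex 32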

Allow Public View

Disables authentication for viewing saved sites and enables a /public/list URL.
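With this option enabled, the list should be reachable without logging in, for example (assuming the server runs locally on port 80):

curl http://localhost:80/public/list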

Allow Public All

Completely disables access control. Use at your own risk!

User Guide

After setting up the server, you can read the user guide to find out more about general usage, keyboard shortcuts and download options.

Optional Plugins

There is currently one plugin available for Collect.

The server can use PhantomJS to process websites after downloading. This ensures that dynamically loaded content is also saved.

Note: This is no longer recommended as PhantomJS is not actively maintained. I'm not stopping you though.

To use this, install the node-website-scraper-phantom module.

npm install website-scraper-phantom

This command must be run in the directory that contains the package.json file.

After installing, the server should print "PhantomJS will be used to process websites" when it starts.

If the install fails, you cannot use the module and Collect will fall back to the normal way of saving pages.

If you cannot save any pages after installing, remove the module by running

npm uninstall website-scraper-phantom

Updating

If you already have Collect installed on your computer/server and want to update to the latest version, follow these steps.

Go to the directory where Collect is installed.

cd /path/to/Collect

You might want to back up your settings file.

Windows:

move Collect\config.json ..\

Linux/Unix:

mv Collect/config.json ../config.json

Download the latest version:

git fetch --all

Apply all changes (this usually overwrites your cookies file, but not the directory where your sites are saved):

git reset --hard origin/master

Restore the settings file.

Windows:

move ..\config.json Collect\

Linux/Unix:

mv ../config.json Collect/config.json

Go to the directory that contains package.json.

cd Collect

Install all required packages.

npm install

After restarting your server, the new version should be up & running.

If it doesn't start, delete the node_modules directory and re-run npm install.
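For reference, the complete Linux/Unix update sequence from this section looks like this (adjust /path/to/Collect to your installation):

cd /path/to/Collect
mv Collect/config.json ../config.json
git fetch --all
git reset --hard origin/master
mv ../config.json Collect/config.json
cd Collect
npm install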

Contributing

See the contributing file.

Thanks to everyone who has contributed ❤️

Security considerations

  • The login system uses plain-text credentials. Anyone with access to your server (e.g. via SSH or any malicious program) can read them.
  • Any site you download can read & set cookies. A downloaded website could send your login cookie to another server. If you host this software in your private network without outside access, everything should be fine even if a cookie gets stolen, but don't take my word for it.
  • By default, the connection does not use HTTPS.

Warning

You're using this tool at your own risk. I am not responsible for any lost data like passwords or websites.

Credits

Website Scraper Module: MIT License. This server is mostly a user interface to this module and would never have been possible without their work.

Website Scraper Module PhantomJS Plugin: MIT License. Makes processing dynamic pages as easy as pie.

The UIkit library: Copyright YOOtheme GmbH under the MIT License. I really love this UI framework.

ArchiverJS: MIT License. node-archiver is a nice module for generating all kinds of archive files. It is used to create backups.

Ionicons: MIT License. The icons are really nice. I used the ion-ios-cloudy-outline icon.

Notification Sound: CC0 1.0 Universal License

License

See the License file

collect's People

Contributors

dependabot[bot], teogoddet, xarantolus


collect's Issues

Redirect after login

A user should be redirected to the page they originally wanted to visit after logging in

Modernize

This software is

  1. quite old (especially regarding TypeScript/JS coding conventions)
  2. not well suited for larger collections (no search, no tagging etc.)
  3. barely updated (dependencies are out of date, some of them are no longer supported, etc.)
    • I think this server doesn't work with newer versions of the website-scraper node module, so it's still on an older version

While it still works (I've had it running for 4+ years), it feels like it's time for an overhaul. The following could be good steps:

  1. Creating a better frontend (the code in browser.js works, but is a mess)
  2. Creating a docker (compose) configuration that just works to simplify installation
  3. Maybe switch to a database instead of just storing everything in a JSON file (not sure I'll do this, as it complicates setup etc.)
  4. Updating the downloader code to use newer website-scraper versions

2 & 4 should be possible without major rewrites, the others are somewhat more involved.

Maybe it also makes sense to rewrite the server in Go because I like it better, but I'm not sure if there's a good website-scraper-like module for Go. I actually tried writing something along those lines that started a Chromium browser and used SingleFile to download pages, but it didn't work that well.

I'm not sure when (or if) I'll work on this.

Embed sites in Iframe

When clicking on the title in the table view, redirect to a page with an iframe so that the header is still displayed

Add option for using PhantomJS

Add an option to use the website-scraper-phantom module to download pages.

This might be accomplished by using the module if it is installed (users just have to install the module, and the normal install doesn't fail if their platform is not supported by PhantomJS)

Archive Format

Have you considered using the webarchive format and/or PDF for archiving? .war is pretty standard now

Details page improvements

Improve the "details" page:

  • Link to the original url instead of displaying it in an input box
  • Link to the saved page instead of just displaying it

Socket.IO authentication

Right now, any client can connect to the server to receive events. This should be restricted to people who have an api token or the right cookie.

How to reproduce: register events and enter io.connect() in your browser console

Add an editor

Add an editor (linked to from the details page) for HTML pages so they can be edited from the web interface

Content file not being created

I am not sure what you want to know, so:

I get this:

Error 500
ENOENT: no such file or directory, open 'public/s/content.json'

Have this installed:

npm -v
5.8.0
nodejs -v
v8.11.4

and run this to start

sudo npm start production

Fix mobile layout

The layout of the details page looks horrible in mobile browsers.

Support Video Downloads

As this server is for archiving all types of websites, it should also support downloading only videos. To specify whether to download only the video or the website with the video, the url could be given as video:https://youtube.com/watch?v=..... or with another option on the new page.

This feature would use youtube-dl because it supports a wide range of sites.

Allow public access

It would be great to have an option (either global or per site) to allow unauthenticated access to websites (and potentially also to the list of websites)

The main problem is that this would require moving the auth middleware call into the URL handlers, or another equivalent solution

Set proper content-types

When visiting a downloaded website that has HTML content but a .php or .md extension, the server assumes the wrong content type. Browsers then usually display the file incorrectly.
