matrix-archive's Introduction

Matrix Archive Tools

Import messages from a matrix.org room, for research, archival, and preservation.

Developed at Dinacon 2018, for use by the documentation team.

Use this responsibly and ethically. Don't re-publish people's messages without their knowledge and consent.

Setup

Install Pipenv. Run pipenv install.

Set these environment variables: MATRIX_USER, MATRIX_PASSWORD, and MATRIX_ROOM_IDS. Optionally set MATRIX_HOST if your homeserver is not https://matrix.org.

MATRIX_ROOM_IDS should be a comma-separated list of Matrix room IDs (or a single ID). Run pipenv run list to list the room IDs.

Set MONGODB_URI to a MongoDB connection URL, or install a local MongoDB instance.
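
As a rough illustration of how the Matrix variables above might be read, here is a minimal Python sketch. The function name load_settings and the dictionary keys it returns are hypothetical and not taken from the project's code; only the variable names and the https://matrix.org default come from the text above.

```python
import os


def load_settings(environ=None):
    """Collect the Matrix settings described above from environment variables."""
    env = os.environ if environ is None else environ
    # MATRIX_ROOM_IDS may hold a single ID or several separated by commas.
    room_ids = [
        room_id.strip()
        for room_id in env.get("MATRIX_ROOM_IDS", "").split(",")
        if room_id.strip()
    ]
    return {
        "host": env.get("MATRIX_HOST", "https://matrix.org"),  # optional
        "user": env["MATRIX_USER"],          # required; KeyError if unset
        "password": env["MATRIX_PASSWORD"],  # required; KeyError if unset
        "room_ids": room_ids,
    }
```

Stripping whitespace around each ID means a value such as "!a:matrix.org, !b:matrix.org" is accepted as well as one without spaces.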

Usage

Import Messages

pipenv run import imports the messages into the database.

Export Messages

pipenv run export filename.html exports a text, HTML, JSON, or YAML file, depending on the extension of the file name (here, .html). The file contains links to the image download URLs on the Matrix server.
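
One plausible way to pick the output format from the file name is a lookup on its extension. The sketch below is an assumption about how that could work, not the project's actual code; the FORMATS table and the export_format name are hypothetical, and the set of recognised extensions may differ.

```python
from pathlib import Path

# Hypothetical extension-to-format table; the project may recognise others.
FORMATS = {
    ".txt": "text",
    ".html": "html",
    ".json": "json",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def export_format(filename):
    """Return the export format implied by the extension of `filename`."""
    suffix = Path(filename).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported export extension: {suffix!r}") from None
```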

Download Images

pipenv run download_images.py downloads all the thumbnail images in the database into a download directory (default thumbnails), skipping images that have already been downloaded.

Use the --no-thumbnails option to download full size images instead of thumbnails. In this case, the default directory is images instead of thumbnails.
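
The directory default and the skip-if-present behaviour described above could look roughly like this in Python. Both function names are hypothetical, and the real script may decide what counts as "already downloaded" differently; this sketch simply checks whether a file of the same name exists.

```python
from pathlib import Path


def default_download_dir(thumbnails=True):
    """Return the default directory name: thumbnails, or images for full size."""
    return "thumbnails" if thumbnails else "images"


def pending_downloads(filenames, download_dir):
    """Return the filenames that are not yet present in `download_dir`."""
    directory = Path(download_dir)
    return [name for name in filenames if not (directory / name).exists()]
```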

References

Matrix Client-Server API

License

MIT

matrix-archive's People

Contributors

bodqhrohro, dependabot[bot], jtrees, osteele, xloem, yayayayaka

matrix-archive's Issues

Sort operation used more than the maximum 33554432 bytes of RAM

I encountered this when exporting a large chat (137988 messages):

Traceback (most recent call last):
  File "/media/d/temp/git/matrix-archive/export_messages.py", line 80, in <module>
    export_archive()
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/media/d/temp/git/matrix-archive/export_messages.py", line 66, in export_archive
    print(f"Writing {len(messages)} messages to {filename!r}")
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 62, in __len__
    list(self._iter_results())
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 110, in _iter_results
    self._populate_cache()
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 129, in _populate_cache
    self._result_cache.append(next(self))
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/mongoengine/queryset/base.py", line 1619, in __next__
    raw_doc = next(self._cursor)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/cursor.py", line 1124, in _refresh
    self.__send_message(q)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/cursor.py", line 999, in __send_message
    response = client._run_operation_with_response(
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1368, in _run_operation_with_response
    return self._retryable_read(
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1471, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1360, in _cmd
    return server.run_operation_with_response(
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/server.py", line 135, in run_operation_with_response
    _check_command_response(first, sock_info.max_wire_version)
  File "/home/bodqhrohro/.local/lib/python3.8/site-packages/pymongo/helpers.py", line 160, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.', 'code': 96, 'codeName': 'OperationFailed'}

This can be worked around, though adding an actual index on the sort field would be the better fix.
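
The error message itself suggests adding an index. A minimal sketch of that, assuming the export sorts on a field named timestamp (an assumption; the traceback does not show the sort key) and that `collection` is a pymongo Collection:

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def ensure_sort_index(collection, field="timestamp"):
    """Create an ascending index on `field` so the sort can use it.

    `collection` is expected to behave like a pymongo Collection; the
    default field name is an assumption, not taken from the project.
    """
    return collection.create_index([(field, ASCENDING)])


# Usage against a real database (not run here; the database and
# collection names are placeholders):
#   from pymongo import MongoClient
#   ensure_sort_index(MongoClient(uri)["some_db"]["some_collection"])
```

With an index covering the sort key, MongoDB can return documents in index order instead of sorting them in memory, which is what hits the 33554432-byte limit.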

pipenv run list Traceback

I set up matrix-archive as described in https://github.com/osteele/matrix-archive, set the environment variables (MATRIX_*), and ran pipenv run list, but got this traceback:

Signing into https://my_matrix_server...
Traceback (most recent call last):
  File "list_rooms.py", line 23, in <module>
    list_rooms()
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "list_rooms.py", line 13, in list_rooms
    rooms = matrix_client().get_rooms()
  File "/opt/matrix-archive/matrix_connection.py", line 21, in matrix_client
    password=MATRIX_PASSWORD)
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/matrix_client/client.py", line 249, in login_with_password
    return self.login(username, password, limit, sync=True)
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/matrix_client/client.py", line 280, in login
    self._sync()
  File "/root/.local/share/virtualenvs/matrix-archive-8WJWP9GR/lib/python3.6/site-packages/matrix_client/client.py", line 562, in _sync
    for room_id, invite_room in response['rooms']['invite'].items():
KeyError: 'invite'

I use Python 3.6.5 from a Docker container:

docker run -it python:3.6.5 bash

pip install --upgrade pip
pip install pipenv
cd /opt/
git clone https://github.com/osteele/matrix-archive.git
cd matrix-archive/
pipenv install
export MATRIX_HOST=https://my_matrix_server
export MATRIX_USER=my_username
export MATRIX_PASSWORD=my_password
pipenv run list

The homeserver is matrixdotorg/synapse:latest (2fa3a22129b8). Thanks.
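
The traceback shows the client indexing response['rooms']['invite'] directly, so the sync response here apparently had no invite key. A defensive way to read that part of the response is sketched below; the helper name invited_rooms is hypothetical, and this is not the matrix_client library's own fix.

```python
def invited_rooms(sync_response):
    """Return the invite section of a sync response, or {} if it is absent."""
    return sync_response.get("rooms", {}).get("invite", {})


# A response that has joined rooms but no invite section no longer raises.
response = {"rooms": {"join": {"!abc:example.org": {}}}}
for room_id, invite_room in invited_rooms(response).items():
    print(room_id, invite_room)
```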

Support encrypted rooms

Thanks for sharing this script!

It would be cool if it supported end-to-end-encrypted rooms too. Do you think you could add support?

Download image script aborts as soon as an image cannot be downloaded

When an image cannot be downloaded (for example, when the hosting server no longer exists), the script aborts with an AssertionError without downloading the rest of the images.

Stack Trace:

$ python download_images.py --no-thumbnails
Skipping 872 already-downloaded images
Downloading 73 new images...
Traceback (most recent call last):
  File "download_images.py", line 63, in <module>
    download_images()
  File "/home/naka/.local/share/virtualenvs/matrix-archive-j2607j-8/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/naka/.local/share/virtualenvs/matrix-archive-j2607j-8/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/naka/.local/share/virtualenvs/matrix-archive-j2607j-8/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/naka/.local/share/virtualenvs/matrix-archive-j2607j-8/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "download_images.py", line 59, in download_images
    run_downloads(new_messages, download_dir, prefer_thumbnails=thumbnails)
  File "download_images.py", line 23, in run_downloads
    assert res.status_code == 200
AssertionError

Rather than asserting a 200 response, I'd propose catching requests.exceptions.RequestException and continuing with the next image.
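
A rough sketch of that proposal follows. The function name download_all and its return shape are hypothetical and not the script's actual code. Note that requests does not raise for a non-200 status on its own, so the sketch calls raise_for_status() to turn error statuses into an HTTPError, which is a subclass of RequestException.

```python
import requests


def download_all(urls, session=None):
    """Fetch every URL, recording failures instead of aborting on the first."""
    session = session or requests.Session()
    downloaded, failed = {}, {}
    for url in urls:
        try:
            res = session.get(url, timeout=30)
            res.raise_for_status()  # raises HTTPError on 4xx/5xx responses
        except requests.exceptions.RequestException as exc:
            failed[url] = exc  # remember the failure and move on
            continue
        downloaded[url] = res.content
    return downloaded, failed
```

Returning the failures alongside the successes lets the caller print a summary at the end rather than losing track of which images were skipped.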
