marius's People

Contributors

anzexie, basavaraj29, cthoyt, jasonmoho, rogerwaleffe, ryansun117, sarda-devesh


marius's Issues

Add python version compatibility tests

Is your feature request related to a problem? Please describe.
We have previously observed occasional issues when using Python 3.8 and 3.9, but that behavior wasn't documented.

A user recently reported that they had trouble using python 3.8 and 3.9 with the system. #55 (comment)

Describe the solution you'd like
The tox test suites should be modified to run under Python 3.6-3.9 (see the sketch below).
We should document and fix any compatibility issues with specific Python versions.
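
For reference, here is a minimal sketch of a tox envlist covering these versions (assuming a standard tox setup; the repository's actual tox configuration may differ):

[tox]
envlist = py36,py37,py38,py39

[testenv]
deps = pytest
commands = pytest test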

oom-kill for preprocessing ogb_mag240m

Hi,

I am trying to preprocess ogb_mag240m with marius_preprocess --dataset ogb_mag240m --output_dir datasets/ogb_mag240m/, but the process was killed due to OOM.

The dataset.yaml was only partially generated:

dataset_dir: /marius/datasets/ogb_mag240m/
num_edges: 1297748926
num_nodes: 121751666
num_relations: 1
num_train: 1297748926
num_valid: -1
num_test: -1
node_feature_dim: -1
rel_feature_dim: -1
num_classes: -1
initialized: false

The CPU memory is as high as I am able to get (312 GB). I am wondering if there is any workaround for running ogb_mag240m on this machine. Thank you.

Config generation for preprocessor only outputs to a single directory

Describe the bug
The path "./output_dir/" is hardcoded as the location where configuration files are generated.

If the directory doesn't exist, preprocessing will throw an error.

To Reproduce
Any call to tools/preprocessing.py will hit this issue.

Expected behavior
The configuration files should be written to the requested output directory.

Environment
Affects all environments

marius_preprocess triggers program aborted

Describe the bug
Running marius_preprocess or importing preprocess triggers the following error:

free(): invalid pointer
Aborted

To Reproduce
Steps to reproduce the behavior:

  1. Run the given example 'marius_preprocess output_dir/ --dataset fb15k'
    OR
  2. 'from marius.tools import preprocess' in Python

Environment
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
Python 3.8.5

Add style and logical linting tools for python sources

Is your feature request related to a problem? Please describe.
We currently provide no enforcement of Python style in our Python sources, and our testing of these sources is incomplete.

This means some easy-to-catch bugs are missed during testing: #86

Describe the solution you'd like
Add a set of linting tools which:

  • Enforce a style guide
  • Check for logical issues (e.g. import errors)
  • Are run automatically as part of the test suite

Flake8 seems like a solid tool which can achieve all of the above: https://flake8.pycqa.org/en/latest/index.html

Examples of its usage can be found in many open source libraries. Here are two:
PyKeen: https://github.com/pykeen/pykeen/blob/d9a93b07f85c839169530a3f0a4c8845c306602a/.flake8
Dask: https://github.com/dask/dask/blob/9bdc32a896e35f69770ec291d83655a1fd1a0346/setup.cfg

Describe alternatives you've considered
N/A

Additional context
N/A

Switch to `/src` layout for Python code

Right now, the source code for both Python and C++ is scattered among several folders. It would be most idiomatic to have all Python code under /src/marius and perhaps /src/cpp for the other code. Not sure if this will be a problem with the extensions, though.

The test "test_remap_ids_true" fails unnecessarily

Describe the bug
The test test/python/preprocessing/test_csv_preprocessor.py::TestGeneralParser::test_remap_ids_true occasionally fails with an assertion error when it should pass.

=================================== FAILURES ===================================
____________________ TestGeneralParser.test_remap_ids_true _____________________

self = <test.python.preprocessing.test_csv_preprocessor.TestGeneralParser testMethod=test_remap_ids_true>

    def test_remap_ids_true(self):
        """
        Check if processed data has non-sequential ids if remap_ids is set
            to True
        """
        general_parser([str(Path(input_dir) / Path(train_file)),
                        str(Path(input_dir) / Path(valid_file)),
                        str(Path(input_dir) / Path(test_file))],
                       ["srd"], [output_dir], remap_ids=True)

        internal_node_ids = np.fromfile(str(Path(output_dir)) /
                                        Path("node_mapping.bin"), dtype=int)
        internal_rel_ids = np.fromfile(str(Path(output_dir)) /
                                       Path("rel_mapping.bin"), dtype=int)

        delta_list = []
        for i in range(len(internal_node_ids) - 1):
            delta_list.append(internal_node_ids[i+1] - internal_node_ids[i])
        delta_list_1 = [i - 1 for i in delta_list]
        delta_list_2 = [i + 1 for i in delta_list]
        self.assertNotEqual(sum(delta_list_1), 0)
        self.assertNotEqual(sum(delta_list_2), 0)
        self.assertNotEqual(sum(delta_list), 0)

        delta_list = []
        for i in range(len(internal_rel_ids) - 1):
            delta_list.append(internal_rel_ids[i+1] - internal_rel_ids[i])
        delta_list_1 = [i - 1 for i in delta_list]
        delta_list_2 = [i + 1 for i in delta_list]
>       self.assertNotEqual(sum(delta_list_1), 0)
E       AssertionError: 0 == 0

test/python/preprocessing/test_csv_preprocessor.py:326: AssertionError

To Reproduce
Running the python test suite can cause this error to occur: https://github.com/marius-team/marius/pull/43/checks?check_run_id=2753751019

Expected behavior
The test should verify that the IDs have been remapped, without spurious failures.

Environment
All environments.

Additional context
The issue seems to stem from delta_list_1 and delta_list_2. My guess is that if the IDs have been remapped to a specific set of values, this method of checking will fail even though the remapping is correct.

Is there a different/easier way to check that the IDs have been remapped? I think the old and new ID values can simply be compared directly and asserted to be different; a sketch of that approach follows.
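
For illustration, a minimal sketch of that direct-comparison approach in numpy (assuming the remapped IDs form a permutation of 0..n-1; paths and variable names are taken from the test above):

import numpy as np

# Load the remapped internal node IDs written by the preprocessor.
internal_node_ids = np.fromfile("output_dir/node_mapping.bin", dtype=int)

# Sequential IDs 0..n-1 are what we would see without remapping.
sequential_ids = np.arange(len(internal_node_ids))

# The remapping should cover every ID exactly once (permutation check)
# but must not be the identity mapping.
assert np.array_equal(np.sort(internal_node_ids), sequential_ids)
assert not np.array_equal(internal_node_ids, sequential_ids)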

Populate Documentation

What is the documentation lacking? Please describe.
The documentation is only populated for describing the configuration files. The rest of the documentation needs to be filled out.

Describe the improvement you'd like

Add documentation for:

  • Training embeddings on graphs of varying scale
  • Evaluating embeddings
  • Configuration file usage
  • Python API description and usage
  • Custom Model definition
  • Preprocessing and input file formats
  • Postprocessing, output file formats and downstream inference
  • How Marius works
  • Edge bucket orderings and the partition buffer
  • Pipelining and asynchronous training
  • Structure of the codebase
  • Development information and workflows

Additional context
The documentation should also be built and hosted automatically on the marius-project.org website. This can be put in a separate pull request.

README example not working

Describe the bug

Traceback (most recent call last):
  File "/Users/cthoyt/dev/marius/test.py", line 20, in <module>
    fb15k_example()
  File "/Users/cthoyt/dev/marius/test.py", line 8, in fb15k_example
    train_set, eval_set = m.initializeDatasets(config)
RuntimeError: filesystem error: in copy_file: No such file or directory [training_data/marius/edges/train/edges.bin] [output_dir/train_edges.pt]

To Reproduce

I took the example from the README verbatim besides fixing the config path

import marius as m

def fb15k_example():
    config_path = "/Users/cthoyt/dev/marius/examples/training/configs/kinships_cpu.ini"
    config = m.parseConfig(config_path)

    train_set, eval_set = m.initializeDatasets(config)

    model = m.initializeModel(config.model.encoder_model, config.model.decoder_model)

    trainer = m.SynchronousTrainer(train_set, model)
    evaluator = m.SynchronousEvaluator(eval_set, model)

    trainer.train(1)
    evaluator.evaluate(True)


if __name__ == "__main__":
    fb15k_example()

Expected behavior
The README example should run end-to-end without errors.

Environment
macOS 11.2.3 (Big Sur), Python 3.9.2, pip installed from the latest code on marius

Add documentation for preprocessing

What is the documentation lacking? Please describe.
Documentation is missing on how to use marius_preprocess to download and convert the 21 supported datasets into Marius-trainable form, and on how to convert custom datasets into Marius-trainable form.

Describe the improvement you'd like
Documentation on how to use marius_preprocess with both supported and custom datasets.

Graph embeddings and graph classification

Hi,
I've been reading the documentation for node classification and edge prediction tasks. I have a set of custom graphs I'd like to use for graph classification or graph-level embeddings for additional downstream tasks. Is this possible with the current version of Marius?
Thank you.

Address strip error

Describe the bug
Line 179 of csv_converter.py strips characters from the output_dir path. If the path starts with /, the leading / is also stripped.

To Reproduce
Steps to reproduce the behavior:

  1. When using preprocess.py, pass an output_directory path that starts with /
  2. See error

Expected behavior
The output_directory option should accept any valid path.

Environment
Any environment would have this problem.
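
For illustration, this is the classic str.strip pitfall in Python (a guess at the likely bug, not the actual code in csv_converter.py):

path = "/data/fb15k/output/"

# strip("/") removes the character from BOTH ends, so an absolute
# path silently becomes a relative one.
print(path.strip("/"))   # data/fb15k/output

# rstrip("/") removes only the trailing separator and preserves
# the leading "/".
print(path.rstrip("/"))  # /data/fb15k/output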

Improve configuration file generation

Is your feature request related to a problem? Please describe.
Currently, configuration generation is performed for every call to preprocess.py, and three configuration files are generated: one for CPU training, one for GPU training, and one for multi-GPU training.

Describe the solution you'd like
We should make this generation a separate, optional step that can be included in preprocessing by adding a flag to the preprocessor call.

E.g.

python3 preprocess.py fb15k output_dir/                             // No config generated
python3 preprocess.py fb15k output_dir/ --generate_config           // generates a single-GPU training configuration file by default
python3 preprocess.py fb15k output_dir/ --generate_config=GPU       // generates a single-GPU training configuration file
python3 preprocess.py fb15k output_dir/ --generate_config=CPU       // generates a CPU training configuration file
python3 preprocess.py fb15k output_dir/ --generate_config=Multi-GPU // generates a multi-GPU training configuration file

Adding config arguments should be supported too.

python3 preprocess.py fb15k output_dir/ --generate_config=GPU  <args>
python3 preprocess.py fb15k output_dir/ --generate_config=GPU  --model.embedding_size=400     // generates a single-GPU training configuration file for 400 dimensional embeddings

We should also allow for this configuration generation to be called separately. E.g.

python3 generate_config.py <files> <args> // args should include --embedding_dimension --num_partitions and --config_type (GPU, CPU, multi-GPU)

Describe alternatives you've considered
We could also disable this feature, but providing configuration file generation makes it easier on the users to get up and running on built-in and custom datasets.

Additional context
Eventually we will want to support generating a config based on user system characteristics. E.g. The users system has 64 GB of memory and wants to train 400 dimension embeddings on Freebase86m. We can set the number of partitions and buffer capacity to well utilize the available memory.

Change argument variable output_directory to data_directory

What is the documentation lacking? Please describe.
The variable output_directory in the input arguments is overloaded: the directory is used to hold both the input and output data for a dataset. Imprecise naming leads to misuse of the system.

Describe the improvement you'd like
Rename the variable from output_directory to data_directory.

Initialize testing framework for python code

Is your feature request related to a problem? Please describe.
Currently we have a testing framework initialized for the C++ code with GTest (although the tests aren't written yet). For the Python part of the codebase, we need to initialize a testing framework as well.

Describe the solution you'd like
Tox and Pytest seem to be good candidates for handling python tests.

Describe alternatives you've considered
Hypothesis is an interesting testing framework used by PyTorch: https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md

Additional context
Further work is needed to populate the tests for both the C++ and Python code.

Cannot import marius

Hello,
I installed Marius on my local server using the command 'pip3 install .'.
The installation succeeded, but I cannot import marius.

If I try to import marius or run 'marius_preprocess', I get the error below:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.conda/envs/marius/lib/python3.7/site-packages/marius/__init__.py", line 13, in <module>
    from . import _config as config
ImportError: /home/.conda/envs/marius/lib/python3.7/site-packages/marius/libmarius.so: undefined symbol: _ZNSt12experimental10filesystem2v16statusERKNS1_4pathE

How can I fix this so I can run Marius?

The following is my setting:

cmake:
  version: 3.13.2
cpu_info:
  num_cpus: 64
  total_memory: 754GB
cuda:
  version: '10.2'
gpu_info:
  - memory: 24GB
    name: NVIDIA TITAN RTX
marius:
  bindings_installed: false
  install_path: N/A
  version: N/A
openmp:
  version: '201511'
operating_system:
  platform: Linux-4.15.0-162-generic-x86_64-with-debian-buster-sid
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1011
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  compiler: GCC 7.5.0
  deps:
    breathe_version: 4.33.1
    numpy_version: 1.21.6
    omegaconf_version: 2.2.1
    pandas_version: 1.3.5
    pip_version: 21.2.2
    pyspark_version: 3.2.1
    pytest_version: 7.1.2
    sphinx_rtd_theme_version: 1.0.0
    torch_version: 1.8.1+cu102
    tox_version: 3.25.0
  version: 3.7.13
pytorch:
  install_path: /home/.conda/envs/marius/lib/python3.7/site-packages/torch
  version: 1.8.1+cu102

Thanks

Marius++ code/example request

What is the documentation lacking? Please describe.
A code example accompanying the Marius++ paper

Describe the improvement you'd like
A code example accompanying the Marius++ paper

Additional context
Thank you for releasing this amazing repo! Have you released the code/examples to accompany the Marius++ paper? It'd be great to be able to run the Marius++ code to better understand the system. Thank you

Remove dependency on Boost

Is your feature request related to a problem? Please describe.
We currently use Boost for command line argument parsing and parsing .ini configuration files.

Boost is very heavyweight and complicates the build process. Additionally, the download links to the boost library may fail: See #16.

Describe the solution you'd like
We should remove the dependency on Boost by switching to a lightweight library which can parse .ini files and command line arguments with the same semantics.

The implementation with the new library should match functionality with the current implementation in Boost.

Modifications will be largely contained to src/config.cpp.

One minor dependency on Boost's lockfree queues in src/buffer.cpp can be removed and replaced with a traditional lock + queue data structure.

Describe alternatives you've considered
We can implement our own parsing functionality if no libraries fit our requirements.

Additional context
We might not be able to find a library which does both config parsing and the command line parsing. If we cannot, we should pick one which can do the config parsing and then implement our own command line parser.

Tests and validators for csv_converter.py

Is your feature request related to a problem? Please describe.
The converter for delimited files does not have a set of tests associated with it.

Describe the solution you'd like
We should add tests for each function in csv_converter.py which cover reasonable inputs and possible failure modes.

For example, for the general_parser function https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L118 we should test:

  • Different numbers of files
  • Invalid input files
  • Delimiters
  • Dataset splits
  • Number of partitions
  • Other input arguments (format, dtype, remap_ids, start_col, and num_line_skip)

Part of this testing effort should be to add validators to the input arguments of general_parser to ensure no unreasonable values are passed into it: e.g., a dataset split of (.8, .8), a format of ("sxrd"), etc. A sketch of such validators follows.
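
A minimal sketch of such validators (a hypothetical helper; the argument names are modeled on the general_parser signature and may not match exactly):

def validate_parser_args(dataset_split, fmt):
    # Split fractions must be non-negative and sum to at most 1,
    # so a split like (.8, .8) is rejected.
    if any(f < 0 for f in dataset_split) or sum(dataset_split) > 1:
        raise ValueError(f"Invalid dataset split: {dataset_split}")

    # Format strings may only contain the source/relation/destination
    # markers, so a format like "sxrd" is rejected.
    if not set(fmt) <= set("srd"):
        raise ValueError(f"Invalid format string: {fmt}")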

Describe alternatives you've considered
The alternative is to leave it untested. No thanks.

Additional context
While testing we should note the ways we can improve and simplify the design of the preprocessing code and create a list of changes we will want to make in a future pull request. For example, https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L213, https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L238, and https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L252 should be put into a function and called.

Remove output_directory as a required argument

Is your feature request related to a problem? Please describe.
Currently, we require the user to specify the output_directory.

We want to remove this requirement and set the output directory to match the name of the dataset by default. For custom datasets, we should choose a reasonable default name such as “custom_dataset/”

Users should still be able to specify the output directory if they wish.

Describe the solution you'd like
We remove output_directory as a required argument. For built-in datasets, we set the default base directory name to "_dataset"; for custom datasets, we set the default base directory name to "custom_dataset".

Users are given the option to specify the output directory name if they want.

Additional context
This issue corresponds to MAR-51 (https://marius-project.atlassian.net/browse/MAR-51).

RuntimeError: Expected all tensors to be on the same device

Hi,

I was trying to run Marius on the ogbn_products dataset on a GCP VM with the following specs:
CPU: 16 x Intel Haswell
Memory: 60 GB
Storage: 1 x SSD 100 GB
GPU: 1 x NVIDIA Tesla P100
OS: Ubuntu 20.04

I installed and ran marius using docker following this.

OOM is triggered if all edges, features, and embeddings are placed in DEVICE_MEMORY. I then tried mixed CPU-GPU training following this sample file (where I think there may be some inconsistencies between the sample code and the explanation below it for the mixed CPU-GPU section). However, this triggers a RuntimeError with the following message:

root@7f36de16f004:/marius# marius_train examples/configuration/ogbn_products.yaml
[2022-07-05 15:01:54.234] [info] [marius.cpp:43] Start initialization
[07/05/22 15:02:03.393] Initialization Complete: 9.158s
[07/05/22 15:02:18.924] ################ Starting training epoch 1 ################
Traceback (most recent call last):
  File "/usr/local/bin/marius_train", line 11, in <module>
    load_entry_point('marius==0.0.2', 'console_scripts', 'marius_train')()
  File "/usr/local/lib/python3.6/dist-packages/marius/console_scripts/marius_train.py", line 18, in main
    m.manager.marius_train(config)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

The configuration file is as follows:

model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: ALL
      - type: ALL
      - type: ALL
    layers:
      - - type: FEATURE
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 47
          bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_products/
  edges:
    type: HOST_MEMORY
    options:
      dtype: int
  features:
    type: HOST_MEMORY
    options:
      dtype: float
  embeddings:
    type: HOST_MEMORY
    options:
      dtype: float
training:
  batch_size: 100
  num_epochs: 2
  pipeline:
    sync: true
evaluation:
  batch_size: 100
  pipeline:
    sync: true

I tried different combinations of storage types for the edges, features, and embeddings, but all of them gave the same RuntimeError.

I am wondering how to solve this error so I can get Marius running on ogbn_products using the GPU. Thank you.

Complete and test custom model support with the python bindings

Is your feature request related to a problem? Please describe.
The Python bindings need additional implementation to support custom models.

Describe the solution you'd like
We should be able to support defining a custom model in the python API by doing the following:

import pymarius as m

class customRelationOperator(m.RelationOperator):
    def forward(self, node_embs, rel_embs):
        return node_embs + rel_embs

class customComparator(m.Comparator):
    def forward(self, src_embs, dst_embs):
        return src_embs * dst_embs

class CustomModel(m.Model):
    def __init__(self):
        super().__init__()
        self.decoder = m.LinkPredictionDecoder(customComparator(), customRelationOperator())

We may need to make modifications of the c++ to support these semantics.

Tests should be written for the custom models here: https://github.com/marius-team/marius/tree/main/test/python

We should test:

  • The relation operator and comparator forward methods compute the proper values
  • The model.forward() method matches the expected computation
  • The model is actually invoked during a training loop. Pytest has functionality to tell whether a function has been called, so the tests can ensure that during a single training loop the model.forward() function was called for every batch, and likewise for the relation operator and the comparator (see the sketch below).
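
A minimal sketch of that last check using unittest.mock (a dummy model and loop for illustration; the real test would wrap the bound methods of the custom classes above):

from unittest.mock import patch

class DummyModel:
    def forward(self, batch):
        return batch

def run_one_epoch(model, batches):
    for batch in batches:
        model.forward(batch)

def test_forward_called_per_batch():
    model = DummyModel()
    batches = [1, 2, 3]
    # wraps= preserves the real behavior while recording every call.
    with patch.object(model, "forward", wraps=model.forward) as mock_fwd:
        run_one_epoch(model, batches)
    assert mock_fwd.call_count == len(batches)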

Describe alternatives you've considered
Alternative designs for custom models might require large changes to the core c++ code.

Additional context
For the rest of the bindings, we will add their tests in a future pull request.

Could I run C++ code for Marius?

Hi,
I built the executables (marius_train and marius_eval) using CMakeLists.txt.
However, when I run an executable as in the GitHub example, an error occurs.

$ ./marius_train examples/configuration/fb15k_237.yaml

Result:
Aborted (core dumped)

Are the executables created through CMake not working at the moment? Or should the input be given differently than when running the Marius Python entry points?

Thanks

Add Support for Parquet File Storage Backend

Is your feature request related to a problem? Please describe.
Marius currently supports the following backends for storing parameters and training data:

  • GPU Memory
  • CPU Memory
  • Flat File (essentially a dump of tensor memory)
  • Partition Buffer (backed by a flat file)

Parquet files are commonly used for handling large amounts of data. Currently, if a user has a large amount of training data (edges) stored in a Parquet file, they have to convert it into the flat file format. This conversion is handled as a preprocessing step and will likely require the data to be copied.

Describe the solution you'd like
To avoid unnecessary copies of large amounts of data and expensive preprocessing, we should support a Parquet file backend directly using https://github.com/apache/parquet-cpp (now part of https://github.com/apache/arrow).

Describe alternatives you've considered
A preprocessor step can be written which converts the input Parquet file into the file format required by the FlatFile backend.

Additional context
This will add an additional dependency on Spark to the system (this could be a heavy dependency). We should make this dependency optional, as not all users will be operating with Parquet files.

Parquet-cpp has merged into https://github.com/apache/arrow, so we can use that instead.

Make Marius pip installable

Is your feature request related to a problem? Please describe.
Users shouldn't have to build Marius with cmake to use it. We should provide pip install capabilities to simplify usage.

Describe the solution you'd like
Building and installing Marius from source:

git clone https://github.com/marius-team/marius.git
cd marius
python3 -m pip install .

Installing Marius from PyPi:

python3 -m pip install pymarius 
# or
python3 -m pip install marius 

Describe alternatives you've considered
The alternative is to make people build it themselves from source with the instructions in the docs. Not great...

Additional context
Should we separate out installing the Python API vs. the config based executable?

Example Kinships configuration file has unreasonable hyperparameters.

Describe the bug
The kinships dataset only has 100 edges, yet uses a batch size of 1000. Other parameters are also far too large for this dataset.
https://github.com/marius-team/marius/blob/main/examples/training/configs/kinships_cpu.ini
https://github.com/marius-team/marius/blob/main/examples/training/configs/kinships_gpu.ini
To Reproduce
See #23

Expected behavior
We should be providing reasonable hyperparameters for each dataset.

Environment
All environments

Additional context
Other datasets might also have this issue. We should check each one to make sure the values are at least reasonable. For future work we should tune them to optimal values.

Example Scripts not Up-to-date

Describe the bug
The example training scripts in ./examples/training/scripts/ are not up to date with the latest documentation.

To Reproduce
The scripts in ./examples/training/scripts/ would not work with the current version of marius_preprocess.

Expected behavior
The scripts in ./examples/training/scripts/ work with the current version of marius_preprocess.

Environment
This issue should be present in any environment.

Add contribution documentation

What is the documentation lacking? Please describe.
Developers need to know how to contribute to Marius.

Describe the improvement you'd like
Add a CONTRIBUTING.md file to the repo which describes development instructions and the workflow.

node_ids & rel_ids file path generated by config generator

Describe the bug
The paths of the node_ids.bin and rel_ids.bin files are still generated by the config generator. However, these two files are no longer required by Marius and are not generated during preprocessing.

To Reproduce
Steps to reproduce the behavior:

  1. Run config generator
  2. Check the configuration file generated
  3. The parameters path.node_ids and path.rel_ids are present. This is unwanted.

Expected behavior
There should not be path.node_ids and path.rel_ids entries in the configuration file generated by the config generator.

Environment
Any environment would have this bug.

Enhance the description of the output files from pre-processing and of the input format that Marius requires.

What is the documentation lacking? Please describe.
Please add a clear description of the output of pre-processing; specifically, describe all files, their format, schema, and encoding requirements that are output by pre-processing.

Describe the improvement you'd like
Add this description in the comments of the general_parser function

Additional context
The above enhancement will enable writing custom (scalable) pre-processors that can emit Marius input files without starting from a raw CSV file.

Scope out additional decoder models

Our current functionality is limited: we only support DistMult, ComplEx, and TransE, with double-sided relation embeddings.

We should expand our functionality by adding more models to Marius. The first thing to do is to scope out which models are out there and which can be implemented easily in our current abstractions.

A starting point is to look into the models supported by PyKeen:
List:
https://github.com/pykeen/pykeen#models-26
Implementation:
https://github.com/pykeen/pykeen/blob/master/src/pykeen/nn/functional.py
Documentation:
https://pykeen.readthedocs.io/en/stable/api/pykeen.nn.functional.convkb_interaction.html

Once we get a better handle on which models are out there, we can see in what ways our current abstractions are lacking and how we can improve them.

For each decoder model we should ask the following questions:

  • Does the model need additional information beyond a source, relation and destination embedding for a given edge?
  • Does the model require computation that does not fit with our double-sided relation embedding approach? E.g., score_lhs = comparator(relation_operator(h, r), t) and score_rhs = comparator(relation_operator(t, r'), h) (see the sketch after this list)
  • What modifications would we need to make to LinkPredictionDecoder to support this model?
  • Is there global state we need to maintain?
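
As a concrete reference point for these questions, here is a minimal sketch of how DistMult fits the relation-operator/comparator abstraction in PyTorch (illustrative function names, not the Marius implementation):

import torch

def relation_operator(node_embs, rel_embs):
    # DistMult: elementwise scaling of the node embedding by the relation.
    return node_embs * rel_embs

def comparator(lhs, rhs):
    # Dot-product comparator between transformed and target embeddings.
    return (lhs * rhs).sum(dim=-1)

h, r, t = (torch.randn(5, 100) for _ in range(3))
score_lhs = comparator(relation_operator(h, r), t)  # corrupting destinations
score_rhs = comparator(relation_operator(t, r), h)  # corrupting sources; r' = r since DistMult is symmetric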

error for final make step

Describe the bug

I am trying to build Marius on a Google Cloud VM. The build went smoothly until the final step, at which point I got an error during linking:

[100%] Built target marius
Scanning dependencies of target marius_train
[100%] Building CXX object CMakeFiles/marius_train.dir/src/marius.cpp.o
[100%] Linking CXX executable marius_train
/usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::copy_file(std::filesystem::path const&, std::filesystem::path const&, std::filesystem::copy_options)'
/usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::path::_M_split_cmpts()'
/usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::status(std::filesystem::path const&)'
/usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::rename(std::filesystem::path const&, std::filesystem::path const&)'
collect2: error: ld returned 1 exit status
make[3]: *** [CMakeFiles/marius_train.dir/build.make:100: marius_train] Error 1
make[2]: *** [CMakeFiles/Makefile2:113: CMakeFiles/marius_train.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:125: CMakeFiles/marius_train.dir/rule] Error 2
make: *** [Makefile:188: marius_train] Error 2

To Reproduce
Steps to reproduce the behavior:

Follow the installation instruction from github README:

git clone https://github.com/marius-team/marius.git
cd marius
python3 -m pip install -r requirements.txt
mkdir build
cd build
cmake ../ -DUSE_CUDA=1
make marius_train -j

Expected behavior
The final step of the build should create marius executables, presumably in the build/bin directory.

Environment
This build was attempted on a google cloud vm with these parameters:

boot disk = tensorflow-2-4-20210414-140511-boot
Environment version = M65

I can provide more environment details after sshing into the instance, but I am not sure what is relevant beyond the above.

Structure of mapping files not up-to-date

Describe the bug
The contents of node_mapping.txt and rel_mapping.txt are not up to date. The mapping between original and remapped IDs is still represented by both *_mapping.txt and *.bin files. According to the documentation, only the *_mapping.txt files are required, with the original and remapped IDs as the first and second columns of those files.

To Reproduce
Run marius_preprocess. This issue can be found by checking files included in the output directory and contents of *_mapping.txt files.

Expected behavior
According to the documentation, only the *_mapping.txt files are required, with the original and remapped IDs as the first and second columns.

Environment
All versions would have this problem.

Additional context
marius_preprocess and marius_postprocess should be updated according to the documentation.

ERROR during the pip installation process

I was trying to install Marius with pip, and the following error keeps popping up:

ERROR: Could not find a version that satisfies the requirement torch (from marius==0.0.2) (from versions: none)
ERROR: No matching distribution found for torch (from marius==0.0.2)

I tried pip install torchvision==0.1.8 on the command line and it reported "Successfully installed torch-1.11.0 torchvision-0.1.8". Then, when I tried pip3 install . again, the same error appeared. I am wondering how to get past this. Thank you.

Configuration file modification script

Is your feature request related to a problem? Please describe.
As mentioned in PR #45, we need a script that can convert the example configuration files to new versions when the configuration options change.

Describe the solution you'd like
We need a script that can perform the following basic operations:

  • remove section
  • remove option
  • add section
  • add option

Describe alternatives you've considered
We can add additional functions in the future.

Add support for [source, destination] prediction in marius_predict

Is your feature request related to a problem? Please describe.
The current version of marius_predict doesn't support prediction for datasets with only 1 relation type (no relation column).

Describe the solution you'd like
Enable marius_predict to perform link prediction for datasets with only 1 relation type.

marius_predict & marius_postprocess having bad assumption on value of general.experiment_name

Describe the bug
Currently, marius_predict and marius_postprocess assume "marius" is the value of general.experiment_name. If another value is used, these two tools do not work.

To Reproduce
Steps to reproduce the behavior:

  1. Run an experiment with the configuration parameter general.experiment_name set to a value other than "marius"
  2. Run marius_predict or marius_postprocess
  3. See error

Expected behavior
Multiple experiments with different general.experiment_name values can have separate directories for trained data under the base directory (defaulting to data/). marius_predict and marius_postprocess should be able to handle directories created by different experiments.

Environment
Any environment will have this issue.

Expand system debug logs and environment information

Is your feature request related to a problem? Please describe.
Debugging information from the system is currently limited, and when Marius is installed with pip, the full C++ stack trace is hidden from users and only the error message is output, as in #55.

We currently have a debug log level, which needs to be better utilized to print out useful debug information at reasonable spots in the code.

Also, we should make it easier for users to report their environment, like PyTorch does with this script: https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py

Describe the solution you'd like
Detailed debug logs should be created for major checkpoints in the code: reading the config, creating storage objects, initializing the model, ...

A Python script based on the above torch script should be added as a separate Python tool, marius_environment; a sketch follows.
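
A minimal sketch of what such a marius_environment tool could report (hypothetical; the torch collect_env script gathers far more detail):

import platform
import sys

def collect_env():
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = "not installed"
    return info

if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")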

CUDA error: device-side assert triggered when trying to execute example scripts

Describe the bug
I successfully installed the program and it passed test/cpp/end_to_end. However, when I tried to execute examples/training/scripts/fb15k_gpu.sh (and some other configs with GPU enabled), it triggered an nll_loss_backward_reduce_cuda_kernel_2d assertion failure.

To Reproduce
Steps to reproduce the behavior:

  1. I execute bash examples/training/scripts/fb15k_gpu.sh
  2. marius_preprocess step is able to be executed without any problems
  3. When marius_train proceeds to the backward pass for the first batch of the first epoch, the following error occurs:
nfp@node19:~/marius$ bash examples/training/scripts/fb15k_gpu.sh 
fb15k
Downloading fb15k.tgz to output_dir/fb15k.tgz
Extracting
Extraction completed
Detected delimiter: ~   ~
Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
Number of instance per file:[483142, 50000, 59071]
Number of nodes: 14951
Number of edges: 592213
Number of relations: 1345
Delimiter: ~    ~
['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_gpu.ini']
[info] [10/28/21 22:12:59.865] Start preprocessing
[debug] [10/28/21 22:12:59.866] Initializing Model
[debug] [10/28/21 22:12:59.866] Empty Encoder
[debug] [10/28/21 22:12:59.866] DistMult Decoder
[debug] [10/28/21 22:12:59.867] data/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/embeddings/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/relations/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/edges/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/edges/train/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/edges/evaluation/ directory already exists
[debug] [10/28/21 22:12:59.867] data/marius/edges/test/ directory already exists
[debug] [10/28/21 22:12:59.880] Edges: DeviceMemory storage initialized
[debug] [10/28/21 22:12:59.894] Edges shuffled
[debug] [10/28/21 22:12:59.894] Edge storage initialized. Train: 483142, Valid: 50000, Test: 59071
[debug] [10/28/21 22:13:00.004] Node embeddings: DeviceMemory storage initialized
[debug] [10/28/21 22:13:00.004] Node embeddings state: DeviceMemory storage initialized
[debug] [10/28/21 22:13:00.004] Node embeddings initialized: 14951
[debug] [10/28/21 22:13:00.014] Relation embeddings: DeviceMemory storage initialized
[debug] [10/28/21 22:13:00.014] Relation embeddings state: DeviceMemory storage initialized
[debug] [10/28/21 22:13:00.014] Relation embeddings initialized: 1345
[debug] [10/28/21 22:13:00.014] Getting batches from edge list
[info] [10/28/21 22:13:00.014] Training set initialized
[debug] [10/28/21 22:13:00.014] Getting batches from edge list
[debug] [10/28/21 22:13:00.014] Batches initialized
[info] [10/28/21 22:13:00.015] Evaluation set initialized
[info] [10/28/21 22:13:00.015] Preprocessing Complete: 0.149s
[debug] [10/28/21 22:13:00.032] Loaded training set
[info] [10/28/21 22:13:00.032] ################ Starting training epoch 1 ################
[trace] [10/28/21 22:13:00.032] Starting Batch. ID 0, Starting Index 0, Batch Size 10000 
[trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 11109 unique embeddings
[trace] [10/28/21 22:13:00.034] Batch: 0 Accumulated 640 unique relations
[trace] [10/28/21 22:13:00.034] Batch: 0 Indices sent to device
[trace] [10/28/21 22:13:00.034] Batch: 0 Node Embeddings read
[trace] [10/28/21 22:13:00.034] Batch: 0 Node State read
[trace] [10/28/21 22:13:00.034] Batch: 0 Relation Embeddings read
[trace] [10/28/21 22:13:00.034] Batch: 0 Relation State read
[trace] [10/28/21 22:13:00.035] Batch: 0 prepared for compute
[debug] [10/28/21 22:13:00.040] Loss: 124804.266, Regularization loss: 0.012812799
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
(the same assertion failure repeats for threads [2,0,0] through [31,0,0])
Traceback (most recent call last):
  File "/home/nfp/.local/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "/home/nfp/.local/lib/python3.6/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from launch_unrolled_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:132 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f95645bcd62 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0xb37 (0x7f95665b2f27 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #2: void at::native::gpu_kernel<at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float> > const&) + 0x113 (0x7f95665bf333 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #3: void at::native::opmath_gpu_kernel_with_scalars<float, float, float, at::native::AddFunctor<float> >(at::TensorIteratorBase&, at::native::AddFunctor<float> const&) + 0xa9 (0x7f95665bf4c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #4: <unknown function> + 0xe5d953 (0x7f9566592953 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #5: at::native::add_kernel_cuda(at::TensorIteratorBase&, c10::Scalar const&) + 0x15 (0x7f95665930a5 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #6: <unknown function> + 0xe5e0cf (0x7f95665930cf in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #7: at::native::structured_sub_out::impl(at::Tensor const&, at::Tensor const&, c10::Scalar const&, at::Tensor const&) + 0x40 (0x7f95a9f1ef00 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x25e52ab (0x7f9567d1a2ab in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #9: <unknown function> + 0x25e5372 (0x7f9567d1a372 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: at::_ops::sub_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0xb9 (0x7f95aa55d3f9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x34be046 (0x7f95ac03c046 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x34be655 (0x7f95ac03c655 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::sub_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x13f (0x7f95aa5b5b2f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x3f299b0 (0x7f95acaa79b0 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::LogsumexpBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1dc (0x7f95abd1447c in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x3896817 (0x7f95ac414817 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x145b (0x7f95ac40fa7b in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x57a (0x7f95ac4107aa in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f95ac4081c9 in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0xc71f (0x7f962b3ad71f in /home/nfp/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #21: <unknown function> + 0x76db (0x7f962d01f6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #22: clone + 0x3f (0x7f962d35871f in /lib/x86_64-linux-gnu/libc.so.6)

Expected behavior
The program works well for CPU configs:

nfp@node19:~/marius$ bash examples/training/scripts/fb15k_cpu.sh 
fb15k
Downloading fb15k.tgz to output_dir/fb15k.tgz
Extracting
Extraction completed
Detected delimiter: ~   ~
Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
Number of instance per file:[483142, 50000, 59071]
Number of nodes: 14951
Number of edges: 592213
Number of relations: 1345
Delimiter: ~    ~
['/home/nfp/.local/bin/marius_train', 'examples/training/configs/fb15k_cpu.ini']
[info] [10/28/21 22:19:07.259] Start preprocessing
[info] [10/28/21 22:19:08.397] Training set initialized
[info] [10/28/21 22:19:08.397] Evaluation set initialized
[info] [10/28/21 22:19:08.397] Preprocessing Complete: 1.137s
[info] [10/28/21 22:19:08.410] ################ Starting training epoch 1 ################
[info] [10/28/21 22:19:08.904] Total Edges Processed: 50000, Percent Complete: 0.099
[info] [10/28/21 22:19:09.252] Total Edges Processed: 95000, Percent Complete: 0.198
[info] [10/28/21 22:19:09.700] Total Edges Processed: 152000, Percent Complete: 0.298
[info] [10/28/21 22:19:09.998] Total Edges Processed: 190000, Percent Complete: 0.397
[info] [10/28/21 22:19:10.418] Total Edges Processed: 237000, Percent Complete: 0.496
[info] [10/28/21 22:19:10.809] Total Edges Processed: 286000, Percent Complete: 0.595
[info] [10/28/21 22:19:11.211] Total Edges Processed: 336000, Percent Complete: 0.694
[info] [10/28/21 22:19:11.567] Total Edges Processed: 383000, Percent Complete: 0.793
[info] [10/28/21 22:19:11.958] Total Edges Processed: 432000, Percent Complete: 0.893
[info] [10/28/21 22:19:12.320] Total Edges Processed: 478000, Percent Complete: 0.992
[info] [10/28/21 22:19:12.357] ################ Finished training epoch 1 ################
[info] [10/28/21 22:19:12.357] Epoch Runtime (Before shuffle/sync): 3946ms
[info] [10/28/21 22:19:12.357] Edges per Second (Before shuffle/sync): 122438.414
[info] [10/28/21 22:19:12.358] Pipeline flush complete
[info] [10/28/21 22:19:12.374] Edges Shuffled
[info] [10/28/21 22:19:12.374] Epoch Runtime (Including shuffle/sync): 3963ms
[info] [10/28/21 22:19:12.374] Edges per Second (Including shuffle/sync): 121913.195
[info] [10/28/21 22:19:12.389] Starting evaluating
[info] [10/28/21 22:19:12.709] Pipeline flush complete
[info] [10/28/21 22:19:15.909] Num Eval Edges: 50000
[info] [10/28/21 22:19:15.909] Num Eval Batches: 50
[info] [10/28/21 22:19:15.909] Auc: 0.941, Avg Ranks: 40.139, MRR: 0.336, Hits@1: 0.212, Hits@5: 0.476, Hits@10: 0.600, Hits@20: 0.707, Hits@50: 0.827, Hits@100: 0.895
[info] [10/28/21 22:19:15.920] Evaluation complete: 3531ms
[info] [10/28/21 22:19:15.931] ################ Starting training epoch 2 ################
[info] [10/28/21 22:19:16.361] Total Edges Processed: 46000, Percent Complete: 0.099
[info] [10/28/21 22:19:16.900] Total Edges Processed: 97000, Percent Complete: 0.198
[info] [10/28/21 22:19:17.424] Total Edges Processed: 156000, Percent Complete: 0.298
[info] [10/28/21 22:19:17.697] Total Edges Processed: 189000, Percent Complete: 0.397
[info] [10/28/21 22:19:18.078] Total Edges Processed: 238000, Percent Complete: 0.496
[info] [10/28/21 22:19:18.466] Total Edges Processed: 288000, Percent Complete: 0.595
[info] [10/28/21 22:19:18.825] Total Edges Processed: 336000, Percent Complete: 0.694
[info] [10/28/21 22:19:19.160] Total Edges Processed: 381000, Percent Complete: 0.793
[info] [10/28/21 22:19:19.584] Total Edges Processed: 436000, Percent Complete: 0.893
[info] [10/28/21 22:19:19.909] Total Edges Processed: 481000, Percent Complete: 0.992
[info] [10/28/21 22:19:19.928] ################ Finished training epoch 2 ################
[info] [10/28/21 22:19:19.928] Epoch Runtime (Before shuffle/sync): 3997ms
[info] [10/28/21 22:19:19.928] Edges per Second (Before shuffle/sync): 120876.16
[info] [10/28/21 22:19:19.929] Pipeline flush complete
[info] [10/28/21 22:19:19.947] Edges Shuffled
[info] [10/28/21 22:19:19.948] Epoch Runtime (Including shuffle/sync): 4016ms
[info] [10/28/21 22:19:19.948] Edges per Second (Including shuffle/sync): 120304.29
[info] [10/28/21 22:19:19.961] Starting evaluating
[info] [10/28/21 22:19:20.246] Pipeline flush complete
[info] [10/28/21 22:19:20.255] Num Eval Edges: 50000
[info] [10/28/21 22:19:20.255] Num Eval Batches: 50
[info] [10/28/21 22:19:20.255] Auc: 0.972, Avg Ranks: 21.458, MRR: 0.431, Hits@1: 0.294, Hits@5: 0.595, Hits@10: 0.719, Hits@20: 0.812, Hits@50: 0.906, Hits@100: 0.949
[info] [10/28/21 22:19:20.271] Evaluation complete: 309ms
[info] [10/28/21 22:19:20.282] ################ Starting training epoch 3 ################
[info] [10/28/21 22:19:20.694] Total Edges Processed: 47000, Percent Complete: 0.099
[info] [10/28/21 22:19:21.042] Total Edges Processed: 95000, Percent Complete: 0.198
[info] [10/28/21 22:19:21.425] Total Edges Processed: 143000, Percent Complete: 0.298
[info] [10/28/21 22:19:21.872] Total Edges Processed: 203000, Percent Complete: 0.397
^C[info] [10/28/21 22:19:22.195] Total Edges Processed: 244000, Percent Complete: 0.496
[info] [10/28/21 22:19:22.561] Total Edges Processed: 288000, Percent Complete: 0.595
[info] [10/28/21 22:19:22.971] Total Edges Processed: 342000, Percent Complete: 0.694
[info] [10/28/21 22:19:23.266] Total Edges Processed: 380000, Percent Complete: 0.793
[info] [10/28/21 22:19:23.747] Total Edges Processed: 438000, Percent Complete: 0.893
[info] [10/28/21 22:19:24.101] Total Edges Processed: 479142, Percent Complete: 0.992
...

Environment
I tried this on two machines and got the same error.
Platform: linux (Ubuntu 18.04 LTS)
Python version: 3.6.9
Pytorch version: 1.10.0+cu102; 1.10.0+cu113

Preprocessing outputs incomplete training edge files

Describe the bug
The training edge list file output by the preprocessor is incomplete: only a single chunk of the input edges is written instead of the full edge list (a sketch of the bug pattern is included at the end of this issue).

To Reproduce
Any call to tools/preprocessor.py will hit this issue.

Expected behavior
The full input training file should be output into the binary format.

Environment
All environments will hit this issue.
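
For context, a minimal Python sketch of the chunked-conversion pattern this bug class comes from; the function and file handling are illustrative, not the actual preprocessor code. The write must happen once per chunk, against an output handle opened a single time:

    import numpy as np

    def convert(in_path, out_path, chunk_rows=1_000_000):
        # Convert a whitespace-delimited edge list of integer triples into
        # a flat binary file, one chunk at a time.
        with open(in_path) as src, open(out_path, "wb") as dst:
            while True:
                rows = [line.split() for _, line in zip(range(chunk_rows), src)]
                rows = [r for r in rows if r]  # skip blank lines
                if not rows:
                    break
                # Writing inside the loop keeps every chunk; reopening the
                # output per chunk, or writing only after the loop, persists
                # just one chunk -- the reported bug.
                np.asarray(rows, dtype=np.int32).tofile(dst)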

Training Wikidata embedding

I'm trying to create embeddings for Wikidata, using this conf file
[general]
device=CPU
num_train=611058458
num_nodes=91580024
num_valid=612283
num_test=612283
experiment_name=wikidata
num_relations=1390
...

However, I am getting the error:

ValueError: cannot create std::vector larger than max_size()

Looking for any workaround, thanks.
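
Not a confirmed fix, but one possible workaround, sketched in Python under the assumption that the overflow happens while the full 611M-edge list is materialized at once: split the raw edge file into smaller shards and preprocess or debug them one at a time. The file names and shard size below are hypothetical.

    CHUNK_LINES = 100_000_000  # tune to what fits comfortably in memory

    with open("wikidata_edges.tsv") as src:
        shard, written = 0, 0
        dst = open(f"wikidata_edges.{shard}.tsv", "w")
        for line in src:
            if written == CHUNK_LINES:
                dst.close()
                shard, written = shard + 1, 0
                dst = open(f"wikidata_edges.{shard}.tsv", "w")
            dst.write(line)
            written += 1
        dst.close()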

Store and use custom dataset statistics for config_generator

Is your feature request related to a problem? Please describe.
Users currently need to input custom dataset statistics manually when using config_generator.

Describe the solution you'd like
It would be good to have the preprocessor store a JSON file containing all the required dataset statistics, and to add an option to config_generator so that users can load the stored statistics when generating configuration files.
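
As a rough illustration, a minimal Python sketch of what the stored statistics and their reuse could look like; the file name dataset_stats.json and the key names are hypothetical, not an actual Marius format. The values are the FB15k statistics printed by the preprocessor in the logs further down this page.

    import json

    # Hypothetical stats file written by marius_preprocess alongside the
    # converted dataset; keys and file name are illustrative only.
    stats = {
        "num_nodes": 14951,
        "num_edges": 592213,
        "num_relations": 1345,
        "num_train": 483142,
        "num_valid": 50000,
        "num_test": 59071,
    }
    with open("output_dir/dataset_stats.json", "w") as f:
        json.dump(stats, f, indent=2)

    # config_generator could then load the stored statistics instead of
    # prompting the user to type them in.
    with open("output_dir/dataset_stats.json") as f:
        stats = json.load(f)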

GIL issue thrown when testing pip install on macOS workflow

Describe the bug
MacOS pip install test throwing GIL error even though all tests pass: https://github.com/marius-team/marius/runs/2401968116

Could be an issue with Python 3.9 since the linux workflow passes but uses Python 3.8. Possibly related to pytorch/pytorch#49370

Output:

2021-04-21T16:01:51.7260960Z ##[group]Run python3 -c "import marius as m"
2021-04-21T16:01:51.7261640Z python3 -c "import marius as m"
2021-04-21T16:01:51.7262230Z python3 -c "from marius.tools import preprocess"
2021-04-21T16:01:51.7262850Z marius_preprocess fb15k output_dir/
2021-04-21T16:01:51.7263760Z pytest test
2021-04-21T16:01:51.8917040Z shell: /bin/bash --noprofile --norc -e -o pipefail {0}
2021-04-21T16:01:51.8917570Z env:
2021-04-21T16:01:51.8918010Z   BUILD_TYPE: Release
2021-04-21T16:01:51.8918430Z ##[endgroup]
2021-04-21T16:02:03.4541320Z fb15k
2021-04-21T16:02:03.4642510Z Downloading fb15k.tgz to output_dir/fb15k.tgz
2021-04-21T16:02:03.4658930Z Extracting
2021-04-21T16:02:03.4659870Z Extraction completed
2021-04-21T16:02:03.4660660Z Detected delimiter: 	
2021-04-21T16:02:03.4662650Z Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
2021-04-21T16:02:03.4664160Z Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
2021-04-21T16:02:03.4665790Z Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
2021-04-21T16:02:03.4666760Z Number of instance per file:[483142, 50000, 59071]
2021-04-21T16:02:03.4667560Z Number of nodes: 14951
2021-04-21T16:02:03.4668370Z Number of edges: 592213
2021-04-21T16:02:03.4669180Z Number of relations: 1345
2021-04-21T16:02:03.4670000Z Delimiter: ~	~
2021-04-21T16:02:05.0357020Z ============================= test session starts ==============================
2021-04-21T16:02:05.0358980Z platform darwin -- Python 3.9.4, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
2021-04-21T16:02:05.0360090Z rootdir: /Users/runner/work/marius/marius
2021-04-21T16:02:05.0360930Z collected 29 items
2021-04-21T16:02:05.0361460Z 
2021-04-21T16:04:46.3756720Z test/python/bindings/test_fb15k.py .                                     [  3%]
2021-04-21T16:04:46.4321450Z test/python/preprocessing/test_config_generator_cmd_opt_parsing.py ..... [ 20%]
2021-04-21T16:04:47.7820700Z .........                                                                [ 51%]
2021-04-21T16:04:47.8108760Z test/python/preprocessing/test_csv_preprocessor.py .                     [ 55%]
2021-04-21T16:04:59.0886020Z test/python/preprocessing/test_preprocess_cmd_opt_parsing.py ........... [ 93%]
2021-04-21T16:04:59.1086690Z ..                                                                       [100%]
2021-04-21T16:04:59.1171760Z 
2021-04-21T16:04:59.1204890Z ======================== 29 passed in 175.06s (0:02:55) ========================
2021-04-21T16:04:59.2552200Z Fatal Python error: PyEval_SaveThread: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
2021-04-21T16:04:59.2652700Z Python runtime state: finalizing (tstate=0x7fe41c409b50)
2021-04-21T16:04:59.2754080Z 
2021-04-21T16:04:59.2856250Z /Users/runner/work/_temp/511be060-bb2e-418a-ac5e-2e0f5d09f4d7.sh: line 4:  5232 Abort trap: 6           pytest test

To Reproduce
Run the macOS pip install test workflow

Expected behavior
The pip install works fine on linux:

2021-04-21T15:50:21.1538556Z python3 -c "import marius as m"
2021-04-21T15:50:21.1539213Z python3 -c "from marius.tools import preprocess"
2021-04-21T15:50:21.1539916Z marius_preprocess fb15k output_dir/
2021-04-21T15:50:21.1540448Z pytest test
2021-04-21T15:50:21.1584496Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2021-04-21T15:50:21.1585040Z env:
2021-04-21T15:50:21.1585484Z   BUILD_TYPE: Release
2021-04-21T15:50:21.1586287Z ##[endgroup]
2021-04-21T15:50:26.6729316Z fb15k
2021-04-21T15:50:26.6730578Z Downloading fb15k.tgz to output_dir/fb15k.tgz
2021-04-21T15:50:26.6731334Z Extracting
2021-04-21T15:50:26.6731982Z Extraction completed
2021-04-21T15:50:26.6732836Z Detected delimiter: 	
2021-04-21T15:50:26.6734284Z Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
2021-04-21T15:50:26.6735973Z Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
2021-04-21T15:50:26.6738109Z Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
2021-04-21T15:50:26.6739043Z Number of instance per file:[483142, 50000, 59071]
2021-04-21T15:50:26.6739918Z Number of nodes: 14951
2021-04-21T15:50:26.6740497Z Number of edges: 592213
2021-04-21T15:50:26.6741087Z Number of relations: 1345
2021-04-21T15:50:26.6741661Z Delimiter: ~	~
2021-04-21T15:50:27.8808863Z ============================= test session starts ==============================
2021-04-21T15:50:27.8811125Z platform linux -- Python 3.8.5, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
2021-04-21T15:50:27.8812170Z rootdir: /home/runner/work/marius/marius
2021-04-21T15:50:27.8812954Z collected 29 items
2021-04-21T15:50:27.8813617Z 
2021-04-21T15:50:50.9462537Z test/python/bindings/test_fb15k.py .                                     [  3%]
2021-04-21T15:50:50.9827642Z test/python/preprocessing/test_config_generator_cmd_opt_parsing.py ..... [ 20%]
2021-04-21T15:50:51.6762691Z .........                                                                [ 51%]
2021-04-21T15:50:51.6988451Z test/python/preprocessing/test_csv_preprocessor.py .                     [ 55%]
2021-04-21T15:50:57.6109717Z test/python/preprocessing/test_preprocess_cmd_opt_parsing.py ........... [ 93%]
2021-04-21T15:50:57.6234674Z ..                                                                       [100%]
2021-04-21T15:50:57.6235430Z 
2021-04-21T15:50:57.6236116Z ============================= 29 passed in 30.61s ==============================

Environment
MacOS: platform darwin -- Python 3.9.4, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
Linux: platform linux -- Python 3.8.5, pytest-6.2.3, py-1.10.0, pluggy-0.13.1

Additional context
test/python/bindings/test_fb15k.py is the likely culprit, since it is the only test that exercises the bindings. It is unclear why the suite is still reported as passed before the crash.
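
One way to keep the crash from failing the whole workflow while the root cause is investigated, sketched below, is to run the binding test in a fresh interpreter so a fault during finalization cannot abort the parent pytest process. The wrapper test is hypothetical; only the test file path is taken from the suite above.

    import subprocess
    import sys

    def test_fb15k_isolated():
        # Run the binding test in its own interpreter; a crash during
        # interpreter finalization then shows up as a nonzero return code
        # instead of killing the main pytest run.
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "test/python/bindings/test_fb15k.py"],
            capture_output=True,
            text=True,
        )
        assert result.returncode == 0, result.stdout + result.stderr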

Boost download fails

Describe the bug
The Boost download link is failing. See the excerpt below.

This issue pops up from time to time with Boost:
boostorg/boost#299
Orphis/boost-cmake#88

 [ 22%] Performing download step (download, verify and extract) for 'boost-populate'
    -- Downloading...
       dst='/tmp/pip-k4qygwbv-build/build/temp.linux-x86_64-3.6/_deps/boost-subbuild/boost-populate-prefix/src/boost_1_71_0.tar.bz2'
       timeout='none'
       inactivity timeout='none'
    -- Using src='https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2'
    -- [download 0% complete]
    CMake Error at boost-subbuild/boost-populate-prefix/src/boost-populate-stamp/download-boost-populate.cmake:170 (message):
      Each download failed!
    
        error: downloading 'https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2' failed
              status_code: 22
              status_string: "HTTP response code said error"
              log:
              --- LOG BEGIN ---
                Trying 34.214.135.19:443...
    
      Connected to dl.bintray.com (34.214.135.19) port 443 (#0)
    
      ALPN, offering h2
    
      ALPN, offering http/1.1
    
      successfully set certificate verify locations:
    
       CAfile: /etc/ssl/certs/ca-certificates.crt
       CApath: /etc/ssl/certs
    
      [5 bytes data]
    
      TLSv1.3 (OUT), TLS handshake, Client hello (1):
    
      [512 bytes data]
    
      [5 bytes data]
    
      TLSv1.3 (IN), TLS handshake, Server hello (2):
    
      [102 bytes data]
    
      NPN, negotiated HTTP1.1
    
      [5 bytes data]
    
      TLSv1.2 (IN), TLS handshake, Certificate (11):
    
      [2765 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    
      [333 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (IN), TLS handshake, Server finished (14):
    
      [4 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    
      [70 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    
      [1 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (OUT), TLS handshake, Next protocol (67):
    
      [36 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (OUT), TLS handshake, Finished (20):
    
      [16 bytes data]
    
      [5 bytes data]
    
      [5 bytes data]
    
      TLSv1.2 (IN), TLS handshake, Finished (20):
    
      [16 bytes data]
    
      SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    
      ALPN, server did not agree to a protocol
    
      Server certificate:
    
       subject: CN=*.bintray.com
       start date: Sep 26 00:00:00 2019 GMT
       expire date: Nov  9 12:00:00 2021 GMT
       subjectAltName: host "dl.bintray.com" matched cert's "*.bintray.com"
       issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust RSA CA 2018
       SSL certificate verify ok.
    
      [5 bytes data]
    
      GET /boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2 HTTP/1.1
    
      Host: dl.bintray.com
    
      User-Agent: curl/7.75.0
    
      Accept: */*
    
    
    
      [5 bytes data]
    
      Mark bundle as not supporting multiuse
    
      HTTP/1.1 403 Forbidden
    
      Server: nginx
    
      Date: Mon, 12 Apr 2021 15:10:54 GMT
    
      Content-Type: text/plain
    
      Content-Length: 10
    
      Connection: keep-alive
    
      ETag: "5c3b2e0c-a"
    
      The requested URL returned error: 403
    
      Closing connection 0
    
    
    
              --- LOG END ---
    
    
    
    
    CMakeFiles/boost-populate.dir/build.make:98: recipe for target 'boost-populate-prefix/src/boost-populate-stamp/boost-populate-download' failed
    make[2]: *** [boost-populate-prefix/src/boost-populate-stamp/boost-populate-download] Error 1
    CMakeFiles/Makefile2:82: recipe for target 'CMakeFiles/boost-populate.dir/all' failed
    make[1]: *** [CMakeFiles/boost-populate.dir/all] Error 2
    Makefile:90: recipe for target 'all' failed
    make: *** [all] Error 2
    
    CMake Error at /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1012 (message):
      Build step for boost failed: 2
    Call Stack (most recent call first):
      /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1141:EVAL:2 (__FetchContent_directPopulate)
      /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1141 (cmake_language)
      third_party/boost-cmake/CMakeLists.txt:19 (FetchContent_Populate)

To Reproduce
Building Marius will encounter this issue if the boost servers are acting up.

Expected behavior
The download of boost should succeed.

Environment
Affects all environments

Additional context
We should remove the dependency on Boost: we only use it to parse .ini configuration files and command-line options. The 403 from dl.bintray.com is also consistent with Bintray being sunset during 2021, so the download likely needs to move off Bintray regardless of transient server issues.
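
For the configuration half of that dependency, the .ini format needs very little machinery. As a rough illustration (Python standard library only, reading the [general] block quoted in the Wikidata issue above; the file name is hypothetical):

    import configparser

    # Minimal sketch: parse a Marius-style .ini config with no third-party
    # dependency. Option names follow the [general] block quoted above.
    config = configparser.ConfigParser()
    config.read("wikidata_cpu.ini")

    device = config.get("general", "device", fallback="CPU")
    num_nodes = config.getint("general", "num_nodes")
    num_train = config.getint("general", "num_train")
    print(device, num_nodes, num_train)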

Inconsistent results with CPU and GPU configs on the dataset ogbl-ppa

Describe the bug
I got really weird results when evaluating on the dataset ogbl-ppa with the CPU config versus the GPU config. I had to change the storage backend to HostDevice for the GPU version due to its overwhelming GPU memory consumption (I thought the run would fit in 16 GB, but it eventually exceeded 24 GB).

To Reproduce
Steps to reproduce the behavior:
Run Marius with the configs ogbl_ppa_cpu.ini and ogbl_ppa_gpu.ini; the epoch-3 logs below show the results (CPU run first, then GPU):

[2021-12-12 02:47:01.554] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-12 02:49:36.904] [info] [trainer.cpp:94] Total Edges Processed: 44586862, Percent Complete: 0.100
[2021-12-12 02:52:19.113] [info] [trainer.cpp:94] Total Edges Processed: 46709862, Percent Complete: 0.200
[2021-12-12 02:55:00.754] [info] [trainer.cpp:94] Total Edges Processed: 48832862, Percent Complete: 0.300
[2021-12-12 02:57:44.074] [info] [trainer.cpp:94] Total Edges Processed: 50955862, Percent Complete: 0.400
[2021-12-12 03:00:25.467] [info] [trainer.cpp:94] Total Edges Processed: 53078862, Percent Complete: 0.500
[2021-12-12 03:03:09.531] [info] [trainer.cpp:94] Total Edges Processed: 55201862, Percent Complete: 0.600
[2021-12-12 03:06:03.269] [info] [trainer.cpp:94] Total Edges Processed: 57324862, Percent Complete: 0.700
[2021-12-12 03:08:51.169] [info] [trainer.cpp:94] Total Edges Processed: 59447862, Percent Complete: 0.800
[2021-12-12 03:11:32.560] [info] [trainer.cpp:94] Total Edges Processed: 61570862, Percent Complete: 0.900
[2021-12-12 03:14:13.438] [info] [trainer.cpp:94] Total Edges Processed: 63693862, Percent Complete: 1.000
[2021-12-12 03:14:13.558] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-12 03:14:13.558] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 1632004ms
[2021-12-12 03:14:13.558] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 13009.73
[2021-12-12 03:14:14.870] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-12 03:14:14.870] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 1633315ms
[2021-12-12 03:14:14.870] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 12999.288
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:97] Auc: 0.508, Avg Ranks: 490.966, MRR: 0.008, Hits@1: 0.006, Hits@5: 0.007, Hits@10: 0.007, Hits@20: 0.008, Hits@50: 0.008, Hits@100: 0.009

[2021-12-13 01:53:58.848] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-13 01:54:03.413] [info] [trainer.cpp:94] Total Edges Processed: 44583862, Percent Complete: 0.100
[2021-12-13 01:54:07.270] [info] [trainer.cpp:94] Total Edges Processed: 46703862, Percent Complete: 0.200
[2021-12-13 01:54:11.005] [info] [trainer.cpp:94] Total Edges Processed: 48823862, Percent Complete: 0.299
[2021-12-13 01:54:15.259] [info] [trainer.cpp:94] Total Edges Processed: 50943862, Percent Complete: 0.399
[2021-12-13 01:54:19.315] [info] [trainer.cpp:94] Total Edges Processed: 53063862, Percent Complete: 0.499
[2021-12-13 01:54:23.355] [info] [trainer.cpp:94] Total Edges Processed: 55183862, Percent Complete: 0.599
[2021-12-13 01:54:27.633] [info] [trainer.cpp:94] Total Edges Processed: 57303862, Percent Complete: 0.699
[2021-12-13 01:54:31.465] [info] [trainer.cpp:94] Total Edges Processed: 59423862, Percent Complete: 0.798
[2021-12-13 01:54:35.505] [info] [trainer.cpp:94] Total Edges Processed: 61543862, Percent Complete: 0.898
[2021-12-13 01:54:39.482] [info] [trainer.cpp:94] Total Edges Processed: 63663862, Percent Complete: 0.998
[2021-12-13 01:54:39.547] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-13 01:54:39.547] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 40698ms
[2021-12-13 01:54:39.547] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 521694.72
[2021-12-13 01:54:40.847] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-13 01:54:40.847] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 41998ms
[2021-12-13 01:54:40.847] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 505546.25
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:97] Auc: 0.992, Avg Ranks: 2.925, MRR: 0.991, Hits@1: 0.990, Hits@5: 0.991, Hits@10: 0.991, Hits@20: 0.992, Hits@50: 0.993, Hits@100: 0.995

Environment
List your operating system, and dependency versions
Python 3.7.10
pytorch 1.7.1 (py3.7_cuda10.1.243_cudnn7.6.3_0)
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake version 3.16.3
GNU Make 4.2.1

Scope out additional loss functions

Is your feature request related to a problem? Please describe.
Currently, the only loss functions Marius supports are SoftMax and RankingLoss.

Describe the solution you'd like
We can expand the set of supported loss functions by implementing additional ones in Marius.
The new losses can be implemented in two new source files: loss.cpp and loss.h.
We can also add a new section to the configuration for loss options; a sketch of the candidate loss semantics follows the references below.

We can use the loss functions implemented by PyKeen as a reference:
List:
https://github.com/pykeen/pykeen#losses-7
Implementation:
https://github.com/pykeen/pykeen/blob/master/src/pykeen/losses.py
Documentation:
https://pykeen.readthedocs.io/en/stable/reference/losses.html
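
As a reference for the semantics only, a minimal PyTorch sketch of two candidate losses over edge scores; the tensor shapes are assumptions (pos_scores of shape (batch,), neg_scores of shape (batch, num_negatives)), and this is not the Marius implementation.

    import torch
    import torch.nn.functional as F

    def softmax_loss(pos_scores, neg_scores):
        # Cross entropy where each positive edge competes against its
        # sampled negatives; the positive sits at index 0 of the logits.
        logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
        targets = torch.zeros(logits.size(0), dtype=torch.long)
        return F.cross_entropy(logits, targets)

    def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
        # Hinge on the gap between each positive score and its negatives;
        # the loss is zero once the positive leads by at least the margin.
        gap = margin - pos_scores.unsqueeze(1) + neg_scores
        return torch.clamp(gap, min=0.0).mean()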
