paperswithcode / axcell

Tools for extracting tables and results from Machine Learning papers

License: Apache License 2.0

Shell 0.43% Python 54.98% TeX 0.21% Jupyter Notebook 44.37%

axcell's Issues

permission denied of latex2html.sh

When I run extraction.ipynb, I get the following error:

docker.errors.APIError: 400 Client Error: Bad Request ("OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/files/latex2html.sh\": permission denied": unknown")

I can run docker without sudo and can successfully run the following sample using docker-py

import docker
client = docker.from_env()

>>> client.containers.run("ubuntu:latest", "echo hello world")
'hello world\n'

Could you help? Thanks.
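
One common cause of an `exec ... permission denied` error like this is that the script mounted into the container lacks the execute bit on the host. A minimal sketch, assuming that is the cause here, of checking and setting the bit from Python. `ensure_executable` is a hypothetical helper, not part of axcell, and the path to latex2html.sh depends on your checkout:

```python
import os
import stat


def ensure_executable(path):
    """Add the execute bits to `path` if any are missing.

    Hypothetical helper for illustration; returns True if the mode changed.
    """
    mode = os.stat(path).st_mode
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if mode & exec_bits == exec_bits:
        return False  # already executable for user, group and others
    os.chmod(path, mode | exec_bits)
    return True
```

Calling it on the host copy of latex2html.sh before re-running the notebook would show whether the mode was the problem; if it returns False, the cause is likely elsewhere.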

SourceChangeWarning & WeightDropout error

Hi!
I am reproducing the results and ran into the following two problems:
1. SourceChangeWarning
When using ResultsExtractor in evaluation.ipynb, the following warning was raised: 453: SourceChangeWarning: source code of class 'torch.nn.modules.loss.BCEWithLogitsLoss' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
Could you please advise how to retrieve the original source?

2. WeightDropout
When extracting the results in evaluation.ipynb, the following error was raised: AttributeError: 'WeightDropout' object has no attribute 'idxs'

Looking forward to your response :) Thank you very much!

Metric for Table Segmentation

It looks like the following code in nbsvm.py is used to compute precision and recall for the cell type classification task:

def metrics(preds, true_y):
    
    y = true_y
    p = preds
    acc = (p == y).mean()
    tp = ((y != 0) & (p == y)).sum()
    fp = ((p != 0) & (p != y)).sum()
    fn = ((y != 0) & (p == 0)).sum()
    prec = tp / (fp + tp)
    reca = tp / (fn + tp)
    return {
        "precision": prec,
        "accuracy": acc,
        "recall": reca,
        "TP": tp,
        "FP": fp,
    }

My understanding is that you are trying to exclude OTHER. If so, why is fp not calculated as fp = ((y != 0) & (p != y)).sum()? Also, why not use the standard approach that treats all classes identically?
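
For comparison, the "standard way" mentioned above can be sketched as one-vs-rest precision and recall per class, followed by a macro average. This is an illustration of that alternative, not the code used in nbsvm.py. `per_class_precision_recall` is a hypothetical name, and the labels below are made-up toy data. It is written in plain Python so it runs without NumPy:

```python
def per_class_precision_recall(preds, true_y):
    """One-vs-rest (precision, recall) for every label seen in either list."""
    pairs = list(zip(preds, true_y))
    scores = {}
    for label in sorted(set(preds) | set(true_y)):
        tp = sum(1 for p, y in pairs if p == label and y == label)
        fp = sum(1 for p, y in pairs if p == label and y != label)
        fn = sum(1 for p, y in pairs if p != label and y == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels for illustration only; 0 plays the role of OTHER.
true_y = [0, 1, 1, 2, 2, 0]
preds = [0, 1, 2, 2, 0, 1]
scores = per_class_precision_recall(preds, true_y)
macro_precision = sum(p for p, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r in scores.values()) / len(scores)
```

Here every class, including 0, contributes its own false positives and false negatives, whereas the quoted function pools all non-zero classes into a single TP/FP/FN count.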

Missing file

Thanks for sharing your code. I have a quick question for you guys:

It seems that the file ("pwc/papers-with-abstracts.json") required to run axcell/scripts/download_arxiv_s3_papers.sh is missing. How can I fetch this file?

Does this step (https://github.com/ymohit/axcell/blob/master/scripts/download_arxiv_s3_papers.sh#L10) list the arXiv papers mentioned in the CSV at https://github.com/paperswithcode/axcell/releases/download/v1.0/arxiv-papers.csv.xz, or is it a different list?

Thanks

Inquiry about file path in notebooks

Hello! I am trying to reproduce the results in the Jupyter notebooks on macOS, and I have some questions about the file paths in the following code:
ROOT_PATH = Path('data')
PWC_LEADERBOARDS_ROOT_PATH = Path('pwc-laderboards')
How should I handle these paths on macOS? I cannot find the directory 'data' under /axcell-master/notebooks. Where can I download the required files? Should I create a directory named data under /axcell-master/notebooks myself?
Looking forward to your reply.

Dataset

Hi, regarding the three datasets you are using (ArxivPapers, SegmentedTables & LinkedResults, PWCLeaderboards): are the three data files used in the notebooks (arxiv-papers.csv.xz, segmented-tables.json.xz, pwc-leaderboards.json.xz) all that we need? Do we need to manually download each paper using the get_eprint_link(paper) function in the datasets notebook? If so, it would be great if a zip file with all the papers could be provided.

In addition, I have looked into paper_collection.py, which needs many .json files that are not provided either. Could you give some guidance on how to get those files as well?

ConnectionError when calling arxiv-vanity/engrafo

Hi, I really appreciate this work and am trying to reproduce the results.
However, I failed to perform the extraction, even from a single e-print archive for paper 1903.11816v1. To be more specific, when the LatexConverter is calling the arxiv-vanity/engrafo api (at line 65 of latex_converter.py), it reports:
Exception has occurred: ConnectionError
('Connection aborted.', PermissionError(13, 'Permission denied'))
File "/mnt/zr/axcell/axcell/helpers/latex_converter.py", line 65, in latex2html
self.client.containers.run("arxivvanity/engrafo:b3db888fefa118eacf4f13566204b68ce100b3a6", command, remove=True, volumes=volumes)
File "/mnt/zr/axcell/axcell/helpers/latex_converter.py", line 84, in to_html
self.latex2html(source_dir, output_dir)
File "/mnt/zr/axcell/axcell/helpers/paper_extractor.py", line 41, in __call__
html = self.latex.to_html(unpack_path)

Could you share some ways to resolve this problem? Thanks very much!

'LSTM' object has no attribute '_flat_weights_names'

Hi!

After successfully setting up the conda environment, I tried to run the result-extraction.ipynb notebook. Unfortunately, I got the error 'LSTM' object has no attribute '_flat_weights_names' while loading the 'ResultsExtractor'.

Have I done something wrong or missed some requirements/dependencies?

ConnectionRefusedError

Hi,

When I tried to run this line in the evaluation notebook,

results = Parallel(backend='multiprocessing', n_jobs=-1)(delayed(process_single)(index) for index in range(len(pc)))

I encountered the following error:

07/03/2020 13:32:37 - WARNING - elasticsearch - PUT http://127.0.0.1:9200/paper-fragments/_doc/1207.4708v2_1000 [status:N/A request:0.001s]
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/opt/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/opt/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

Do you have any idea how to resolve this issue?
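
The log shows a PUT to 127.0.0.1:9200, and "Connection refused" generally means nothing is accepting connections at that address. A minimal sketch, assuming the notebook expects an Elasticsearch instance on that port, for checking whether the port is reachable before starting the parallel job. `is_port_open` is a hypothetical helper that uses only the standard library:

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port taken from the log line above.
    print(is_port_open("127.0.0.1", 9200))
```

If this reports that the port is closed, the service is probably not running (or is listening on a different address), which would explain the traceback.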

llvm module

I have encountered the following error while reproducing the result from evaluation.ipynb:
ModuleNotFoundError: No module named 'llvm'
More errors were generated while finishing the installation and setup, and I'm currently stuck at this step. After a thorough search, I also found that llvmpy is a relatively old module. Are there any required settings for this part, or do you have any idea how to resolve this issue?

file not found in AWS S3

I have configured my AWS CLI and I'm able to download the arxiv_src_manifest.xml file. The code creates tars.txt but then fails as shown below --

How do I resolve this?

Steps to replicate the full code on Windows

Hi - thanks for the great work and sharing the code.
I am new to Docker and I'm finding it difficult to understand exactly how to reproduce the code and results on Windows.
Could you please provide a more detailed README with step-by-step instructions for recreating the work on Windows?

Docker Requirement?

Hi all,

Super cool tool; thanks for making this! I was wondering if I could get a little more information about how Docker is used and whether it is possible to get around the Docker requirement (e.g., to run axcell in Colab). My best guess is that it is creating extra nodes for Elasticsearch?

Thanks for your time,
Bernie

Estimated Cost for Using AWS API

Hi,

Have you estimated what it costs to download all the related paper resources through the AWS S3 API? AWS seems to charge for downloads by data size, so I have no idea what the approximate cost would be.
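
A rough way to get such a number is to sum the archive sizes listed in arXiv_src_manifest.xml and multiply by the transfer price per GB. The sketch below assumes that each entry in the manifest carries a `<size>` element giving the size in bytes; that layout is an assumption and should be checked against your copy of the manifest. `estimate_download_cost` is a hypothetical helper, and `price_per_gb` is a placeholder you would need to look up on the AWS pricing page:

```python
import xml.etree.ElementTree as ET


def estimate_download_cost(manifest_xml, price_per_gb):
    """Return (total_gb, estimated_cost) for the archives in the manifest.

    Assumes every archive entry has a <size> element holding bytes.
    """
    root = ET.fromstring(manifest_xml)
    total_bytes = sum(int(el.text) for el in root.iter("size"))
    total_gb = total_bytes / 1e9
    return total_gb, total_gb * price_per_gb


# Made-up two-entry manifest, for illustration only.
sample_manifest = """
<arXivSRC>
  <file><filename>src/a.tar</filename><size>500000000</size></file>
  <file><filename>src/b.tar</filename><size>1500000000</size></file>
</arXivSRC>
"""
total_gb, cost = estimate_download_cost(sample_manifest, price_per_gb=0.1)
```

With the real manifest you would read the file contents and pass them in. The result covers only the listed archives at the price you supply, and does not include per-request charges.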

AxCell CondaEnv. creation fails on Windows

Hey,

I am planning to extend AxCell and implement new features. Unfortunately, I do not own a Linux or macOS machine, and while trying to install AxCell on Windows 10 I ran into some issues. When I try to create the conda environment (using Anaconda with Python 3.8) from the provided file, "magic_python" and "docker-compose" seem to be available only for Linux/macOS.

Is it possible to run AxCell on Windows?

How to use the API

Hi,

I'm trying to run the notebooks and use your pre-trained model to test it on a paper. I use the extraction.ipynb and results-extraction.ipynb notebooks but get some errors. It seems that some data needs to be downloaded first. For example, in the extraction.ipynb notebook I get this:

Which folders and directories should I create, and what data should I download? It's not clear to me.

I also downloaded the three dataset CSV and JSON files, put them in the scripts folder, and ran download_arxiv_s3_papers.sh, but I got another error:

fatal error: Unable to locate credentials
warning: failed to load external entity "arXiv_src_manifest.xml"

I cannot find the arXiv_src_manifest.xml file.

Thanks

OSError: [E053] Could not read config.cfg

When using the result_extraction notebook on a paper, it crashes at
ResultExtractor(...)
with the following error:

[PID 496779] Load model table-structure-classifier.pth
/home/vivoli/miniconda3/envs/arxiv-manipulation/lib/python3.7/site-packages/spacy/util.py:715: UserWarning: [W094] Model 'en_core_sci_sm' (0.2.4) specifies an under-constrained spaCy version requirement: >=2.2.1. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.0.5,<3.1.0
  warnings.warn(warn_msg)

OSError: [E053] Could not read config.cfg from /home/vivoli/miniconda3/envs/axcell/lib/python3.7/site-packages/en_core_sci_sm/en_core_sci_sm-0.2.4/config.cfg

Do you have any idea how to solve it?
Thanks