neuml / paperai
Semantic search and workflows for medical/scientific papers
License: Apache License 2.0
I work in the field of materials science. Can you give some instructions on how to use it with PDF papers?
Currently, the vectors process defaults the output location to ~/.cord19/vectors/cord19-300d.magnitude
Add optional command line parameters for:
I want to create an index and vector file over a custom SQLite articles database. I have created an articles.sqlite database of medical papers using paperetl, but I did not find any instructions on how to process it. Can you please give instructions on this?
It was reported that paperai can't be installed in a Windows environment due to the following error:
ValueError: path 'src/python/' cannot end with '/'
When I run the command python -m paperai.vectors cord19/models in Docker, the error output is "sqlite3.OperationalError: no such table: sections"
When passing a nested input directory along with a desired output directory, the input file tree is not mirrored in the output directory; the tei.xml files just get dumped flat into the output directory. This can lead to issues when multiple PDF files in the nested input directory share the same name: if the client is told to force overwrite xml files, identically named files are overwritten. To fix this, the nested file tree needs to be copied to the output directory. At your discretion, I have created a PR that solves this issue. I've also included directory creation in case the output directory hasn't been created yet.
Remove any references to ~/.cord19 as this is no longer relevant
Add unit tests to paperai to help with quality assurance
The current index method makes a number of calls that force creating copies of the working embeddings index array. For larger datasources, this can cause out of memory errors. Make the following improvements:
When streaming embeddings back from disk, write into an empty NumPy array pre-initialized to the size of the embeddings index, since appending NumPy arrays to a list forces a copy when the final NumPy embeddings array is created.
Modify the removePC method to operate directly on the input array vs returning a copy
Modify the normalize method to operate directly on the input array vs returning a copy
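The three changes above can be sketched with NumPy; the function names are illustrative, not paperai's exact index code:

```python
import numpy as np

def load_embeddings(stream, count, dims):
    """Stream vectors into a preallocated array instead of appending to a
    list, which would force a full copy when building the final array."""
    embeddings = np.empty((count, dims), dtype=np.float32)
    for x, vector in enumerate(stream):
        embeddings[x] = vector
    return embeddings

def normalize(embeddings):
    """L2-normalize rows in place rather than returning a copy."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings /= norms

def remove_pc(embeddings, pc):
    """Remove the principal component in place (no copy returned)."""
    embeddings -= embeddings @ pc.T @ pc
```

Since both `normalize` and `remove_pc` mutate their input array, peak memory stays at one copy of the embeddings index instead of two.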
Hi,
I get the following error when running python -m paperai.index
raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\Users\x\.cord19\vectors\cord19-300d.magnitude'
PS. I am quite new to all this; so, apologies if the mistake is on my end.
When trying to download cord19-300d.magnitude from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude, I get the error: "Too many requests"
When I run the command python -m paperai.report tasks/risk-factors.yml 50 md cord19/models, I can't find the file risk-factors.yml, and I don't understand the argument "50".
Currently, all columns are generated independently. The extraction process for each column can be time-consuming. The framework should be able to generate new columns from existing columns to apply formatting to the output. This change adds a new column type derived from a reference column. A prior column can be referenced in the query section of the new column.
Add a check to the sections query to filter out sections that don't have tokens.
Add the following new column parameters:
Add a Streamlit version of the query command line
When building word vectors, add parameter to specify the output file for word vectors
Currently, reports are hardcoded to the "NeuML/bert-small-cord19qa" model for qa extraction. Add a path parameter to allow Hugging Face Hub models and local models. If no qa model is set, fall back to "NeuML/bert-small-cord19qa".
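A sketch of the fallback logic; the function and parameter names are assumptions, not paperai's actual signature:

```python
DEFAULT_QA_MODEL = "NeuML/bert-small-cord19qa"

def resolve_qa_model(path=None):
    """Use a Hugging Face Hub id or local path when provided,
    otherwise fall back to the default CORD-19 QA model."""
    return path if path else DEFAULT_QA_MODEL
```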
Update deprecated methods in txtai 3.3
Currently, the Embeddings index is hardcoded to BM25 + fastText. Allow passing in embeddings configuration to the indexing process, so that any txtai supported index can be created.
txtai 4.x had a regression that causes issues with building word vector indexes. The dependency should be updated to require >= 4.3.1
After successfully installing paperai in Linux (Ubuntu 20.04.1 LTS), I tried to run it by using the pre-trained vectors option to build the model, as follows:
(1) I downloaded the vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
(2) My Downloads folder in my computer ended up with a Zip file containing the vectors.
(3) I created a directory ~/.cord19/vectors/ and moved the downloaded Zip file into this directory (see yellow folder in the figure below).
(4) I extracted the Zip file, which resulted in the grey folder shown below, which contained the file cord19-300d.magnitude
(5) I moved the cord19-300d.magnitude file outside of the grey folder and thus into the ~/.cord19/vectors/ directory (see figure below)
(6) I executed the following command to build the embeddings index with the above pre-trained vectors:
python -m paperai.index
Upon performing the above I got the following error message (see below)
Am I getting this error because the above steps are not the correct ones?
If so, what would be the correct steps?
Otherwise, what other things should I try to eliminate the issue?
Currently, the indexing process runs on all articles in the input database. Add a parameter to specify the maximum number of documents. This parameter will pull the most recent N documents by entry date and only index those.
Fix breaking changes and add txtai 2.0 as a dependency.
Python 3.6 reaches end of life in a matter of days. Update scripts and requirements to 3.7.
Run all extractor queries in parallel to help improve performance
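One way to parallelize, sketched with the standard library; `extractor` stands in for whatever callable evaluates a single query:

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries(extractor, queries):
    """Run extractor queries concurrently instead of one at a time."""
    with ThreadPoolExecutor() as pool:
        # map preserves input order, so results line up with queries
        return list(pool.map(extractor, queries))
```

A thread pool is a reasonable starting point since extractive qa is typically dominated by model inference; a process pool is an alternative if the workload turns out to be CPU-bound in Python code.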
Add functionality to require a term as part of a query column "query" filter field. This would mirror the functionality already present in general searches.
Identify and migrate common functionality for building embeddings index into a separate project that can be used for multiple use cases.
What are the minimum memory requirements for paperai? When running on an Nvidia V100 with 32 GB DDRAM, I got: RuntimeError: CUDA error: out of memory, even though GPU memory seems to be completely free.
Is there a way to run it without the GPU, or can I run it exclusively on TPUs?
from txtai.embeddings import Embeddings
import numpy as np
import torch

torch.cuda.empty_cache()

# MEMORY
id = 1
t = torch.cuda.get_device_properties(id).total_memory
c = torch.cuda.memory_cached(id)
a = torch.cuda.memory_allocated(id)
f = c - a  # free inside cache
print("TOTAL", t / 1024 / 1024 / 1024, " GB")
print("ALLOCATED", a)

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

query = "health"
uid = np.argmax(embeddings.similarity(query, sections))
print("%-20s %s" % (query, sections[uid]))
TOTAL 31.74853515625 GB
ALLOCATED 0
Traceback (most recent call last):
File "pokus2.py", line 32, in <module>
uid = np.argmax(embeddings.similarity(query, sections))
File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/user/.local/lib/python3.8/site-packages/txtai/vectors.py", line 264, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
Currently, paperai uses the standard txtai API when running searches. This method returns the matching index ids.
The API should be enriched similar to what is being done in tldrstory to return more information.
I ran DeepSource analysis on my fork of this repository and found some code quality issues. Have a look at the issues DeepSource caught in this repository here.
DeepSource is a code review automation tool that detects code quality issues and helps you fix some of them automatically. In addition to detecting issues in code, you can use DeepSource to track test coverage, detect problems in Dockerfiles, etc.
PR #24 fixed some of the issues caught by DeepSource.
All the features of DeepSource are mentioned here.
I'd suggest you integrate DeepSource since it is free for open source projects forever.
Integrating DeepSource to continuously analyze your repository:
Create a .deepsource.toml configuration specific to this repo, or use the configuration below, which I used to run the analysis on my fork of this repo:

version = 1

test_patterns = ["/test/python/*.py"]

[[analyzers]]
name = "python"
enabled = true

  [analyzers.meta]
  runtime_version = "3.x.x"
Currently, Index.run takes configuration as either a vectors path or a path to a YAML file. This should be expanded to also accept configuration as a dictionary.
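The expanded dispatch could look like this sketch (YAML parsing elided; the return values are placeholders, not paperai's actual code):

```python
def load_config(config):
    """Accept configuration as a dict, a YAML file path, or a vectors path."""
    if isinstance(config, dict):
        # Already a configuration dictionary, use as-is
        return config
    if isinstance(config, str) and config.endswith((".yml", ".yaml")):
        # Parse the YAML file here (e.g. yaml.safe_load); elided in this
        # sketch, a marker dict is returned instead
        return {"yaml": config}
    # Otherwise treat the string as a path to a word vectors model
    return {"path": config}
```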
Add a new report column format parameter (dtype) that allows applying formatting rules to the column output. Examples include converting the result to a number, formatting a duration field and converting the text to a categorical value.
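A sketch of how dtype rules might be applied; the two rules shown are just the examples from this issue, not paperai's full rule set:

```python
def format_value(value, dtype):
    """Apply a dtype formatting rule to a report column value."""
    if dtype == "int":
        # Convert the result to a number by keeping its digits
        digits = "".join(c for c in str(value) if c.isdigit())
        return int(digits) if digits else None
    if dtype == "days":
        # Format a duration field by normalizing it to days
        text = str(value).lower()
        number = int("".join(c for c in text if c.isdigit()) or 0)
        return number * 7 if "week" in text else number
    # Unknown dtype: pass the value through unchanged
    return value
```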
Add .pre-commit-config.yaml file to enable checks for code quality.
NLTK is no longer a necessary dependency
Add a new report type that annotates the answers over input PDF documents. This source requires a path to the original PDFs.
Use the txtmarker library.
Currently, reports are built off search results. For smaller sources, it should be possible to build a report off a full database without a driving search. Add support for wildcard query reports.
neuml/paperetl#34 removes the study design columns as discussed in the issue. paperai needs to remove references to those columns. The same functionality can be provided with extraction queries.
Currently, extractive qa filters support both + and - modifiers to include/not include results when pre-filtering for qa candidate rows. Also add this logic to query filters.
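A sketch of the +/- modifier logic, mirroring the general search syntax (an illustrative parser, not paperai's implementation):

```python
def parse_filter(expression):
    """Split a filter expression into required (+), excluded (-)
    and plain terms."""
    required, excluded, terms = [], [], []
    for token in expression.split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            terms.append(token)
    return required, excluded, terms

def matches(text, expression):
    """Keep a candidate row only if all + terms are present and no - term is."""
    required, excluded, _ = parse_filter(expression)
    low = text.lower()
    return all(t in low for t in required) and not any(t in low for t in excluded)
```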
Currently, queries with a match score of less than 0.6 are filtered from result lists. Make this setting configurable.
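Making the cutoff configurable could be as simple as this sketch, with 0.6 kept as the default:

```python
def filter_results(results, threshold=0.6):
    """Drop (uid, score) matches below the score threshold; 0.6 is
    the currently hardcoded value, now exposed as a parameter."""
    return [(uid, score) for uid, score in results if score >= threshold]
```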
Currently, the pre-trained vector model is stored on Kaggle. Put a copy of this file on the next GitHub release. This will allow automation/docker builds.
# Can optionally use pre-trained vectors
# https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
# Default location: ~/.cord19/vectors/cord19-300d.magnitude
Model it after this: https://github.com/neuml/txtai/blob/master/.github/workflows/build.yml
Add a Dockerfile for running paperai. The Dockerfile should have a way to integrate with paperetl so both can run within the same image.
Currently, an embeddings index is required for reports. But with all queries, the full database can be processed without the need for an embeddings query. Modify the model loading logic to allow this use case.
Currently, report options must be passed on the command line. This is cumbersome as the number of arguments grow. This change will add support for report options directly in the report yml file. The current command line options will still work and take precedence but moving forward new options will be primarily defined in the report yml.
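The precedence rule could be sketched as a simple merge, where command line options override yml options only when actually set:

```python
def merge_options(yml_options, cli_options):
    """Combine report options: values from the report yml are defaults,
    command line options take precedence when provided."""
    merged = dict(yml_options)
    # Only override with CLI options that were actually set
    merged.update({k: v for k, v in cli_options.items() if v is not None})
    return merged
```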
Update to txtai 3.2
The original mdv project is no longer maintained and there is an active fork: https://pypi.org/project/mdv3/
Switch to mdv3
Support extractor context setting added in neuml/txtai#137