Comments (10)
Hi David,
Thank you for the detailed report. It looks like you don't have a database file to index (the default location is ~/.cord19/models/articles.sqlite).
The paperetl project is used to build the articles database. That project has instructions on indexing CORD-19 and/or custom PDF files.
If you want to test out CORD-19, you can use a pre-built articles.sqlite database found on Kaggle.
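Before indexing, it can help to confirm that the database file is actually where paperai looks for it. The following is a minimal sketch, not part of paperai itself: the path is the default location quoted above, and the helper name `database_ready` is hypothetical.

```python
from pathlib import Path

# Default location quoted in this thread; adjust if your setup differs.
DEFAULT_DB = Path.home() / ".cord19" / "models" / "articles.sqlite"


def database_ready(path: Path = DEFAULT_DB) -> bool:
    """Return True if `path` is an existing, non-empty file (hypothetical helper)."""
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if database_ready():
        print(f"Found database at {DEFAULT_DB}")
    else:
        print(f"No database at {DEFAULT_DB}; build one with paperetl or download a pre-built copy")
```

This only checks that a file is present; it does not validate the SQLite schema.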
from paperai.
Hi David,
Thank you for your help above.
paperai is installed on my local Linux computer. To test it out, I proceeded as follows:
- used pre-built articles.sqlite database from https://www.kaggle.com/davidmezzetti/cord-19-etl/output and placed it at ~/.cord19/models
- used pre-trained vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude and placed them at ~/.cord19/vectors/cord19-300d.magnitude
- Then, in order to build the model I entered the following command for building embeddings index:
python -m paperai.index
Subsequently, the terminal displayed the following output sequence:
Building new model
streamed XXXXXX documents *
Iterated over 3377117 total rows
streamed XXXXXX documents **
Iterated over 3377117 total rows
* where XXXXXX increased progressively from 0 to 3377117 in about 5 minutes, and then the line disappeared from the terminal
** where XXXXXX increased progressively from 0 to 3377117 over the course of 6 hours, and then the line disappeared from the terminal. After the line disappeared, nothing else was displayed on the terminal, but my computer’s CPU kept working at full capacity for 7+ more hours until I decided to terminate the processing by unplugging my computer.
Furthermore, no model or file was stored in ~/.cord19
Do you know what could be the problem?
(also, do the above pre-trained vectors correspond to the above pre-built articles.sqlite database? Was it OK to use both for this test?)
from paperai.
Thank you for the continued attempts to get this installed. Sorry it's not going as smoothly as we would hope.
The first question I have is: what version of the code are you using? I notice you have a fork of paperai. Are you using that fork or going off the main codebase? paperai has had a number of major changes over the last few weeks, and I think the issues may stem from there.
from paperai.
Thank you for pointing out that detail. I currently have paperai 1.0.0 installed, and it was installed going off the main codebase (from your GitHub repo). I'll try your latest version, with just a small portion of your pre-built articles.sqlite database, for testing purposes.
from paperai.
For testing purposes, I have a very small version that might help with debugging - https://www.kaggle.com/davidmezzetti/cord-19-slim/output
from paperai.
For debugging purposes, I reduced the pre-built articles.sqlite database (https://www.kaggle.com/davidmezzetti/cord-19-etl/output) by retaining only the first 500,000 rows of its sections table while keeping the articles and other tables unchanged.
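One way to do this kind of reduction is sketched below. It is an illustration only, not the exact procedure used here: it assumes the table is an ordinary SQLite rowid table and that "first" means lowest rowid, the helper name `trim_table` is hypothetical, and it should be run on a copy of the database because it deletes rows in place.

```python
import sqlite3


def trim_table(db_path: str, table: str = "sections", keep: int = 500_000) -> int:
    """Delete all but the first `keep` rows (by rowid) of `table`; return rows remaining.

    Hypothetical helper. Assumes `table` is a regular rowid table.
    """
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")

    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                f"DELETE FROM {table} WHERE rowid NOT IN "
                f"(SELECT rowid FROM {table} ORDER BY rowid LIMIT ?)",
                (keep,),
            )
        remaining = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        con.close()
    return remaining
```

Deleting rows does not shrink the file on disk; running `VACUUM` afterwards would reclaim the space if file size matters.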
Subsequently, I installed paperai version 1.2.1 from PyPI and ran the following commands, which created the files or directories listed under each:
$ python -m paperai.vectors
created in ~/.cord19/vectors: cord19-300d.magnitude and cord19-300d.txt
$ python -m paperai.index
created in ~/.cord19/models: config, embeddings, lsa, and scoring
~/.cord19: remained unchanged, with only 2 folders (models and vectors) and no files
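A quick way to confirm that the index step produced its output is to check for the four names listed above. This is a rough sketch: the names are the ones observed in this thread with paperai 1.2.1 and may differ in other versions, and `missing_index_files` is a hypothetical helper.

```python
from pathlib import Path

# Names observed in ~/.cord19/models in this thread (paperai 1.2.1); an assumption elsewhere.
EXPECTED = ("config", "embeddings", "lsa", "scoring")


def missing_index_files(models_dir, expected=EXPECTED):
    """Return the expected names that are absent from `models_dir` (hypothetical helper)."""
    base = Path(models_dir).expanduser()
    return [name for name in expected if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_index_files("~/.cord19/models")
    print("Index looks complete" if not missing else f"Missing: {missing}")
```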
As shown in the following screenshot, 2 different attempts to run queries resulted in the same error
Suggestions about solving this problem would be appreciated.
from paperai.
Looks like it's almost there. I will see if I can update the install scripts to do this automatically. In the meantime, so you can continue testing, run the following at the command prompt shown above:
import nltk
nltk.download("stopwords")
from paperai.
While reviewing the code that uses the stopwords methods, I found that it is no longer needed in the current code base. It has been removed for future versions (the change is now in the master branch).
from paperai.
Thank you for fixing it so quickly. It works great! We'll keep working on your model.
from paperai.
Glad it worked, thank you for the diligence in getting this to work.
from paperai.