rufuspollock-okfn / bibserver
BibServer is open-source software that makes it easy to publish, manage and find bibliographies. BibServer is RESTful and web-friendly.
License: MIT License
The PyES queries are defaulting to only returning the most common 10 values - how do we pass through the "size" parameter to ES through PyES?
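For reference, in the Elasticsearch facet API the number of values returned by a terms facet is set by a `size` key inside the `terms` block (it defaults to 10). Below is a minimal sketch of the raw request body only. The helper name `terms_facet_query` is hypothetical, the field `author.exact` is borrowed from the facet list used elsewhere in these issues, and how PyES forwards this body is not shown here.

```python
import json


def terms_facet_query(field, size=100):
    """Build a raw ES search body with one terms facet (hypothetical helper)."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # no hits needed, only the facet counts
        "facets": {
            # "size" here controls how many facet values come back
            field: {"terms": {"field": field, "size": size}}
        },
    }


body = terms_facet_query("author.exact", size=500)
print(json.dumps(body, indent=2))
```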
This method needs fixing - quite a few different things cause it to trip and throw a UnicodeEncodeError
There are a large number of small templates, many of which are included in only one other template (usually solreyes). To simplify development and keep the code cleaner, suggest consolidating them into the main template and removing them as needed.
Est cost: 2h.
With http://bibsoup.net/collection/chung_test
when the page first loads, there flashes by an alternate paging scheme, like
153 Results [1-10] [11-20] ....
then this is replaced by
Results of 1-10 153. Show 10 per page. (with dropdowns).
Fine to experiment with different pager options, but the flashing of one before the other is disturbing.
In the present pager, need [Next] and [Previous] buttons.
Have added a config option and controls in web.py and index template to hide the upload page and remove upload functionality when allow_upload is set to NO.
This, in combination with using bulk_upload from the command line, allows a department to run a bibserver with only the content an administrator pushes to it.
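A rough sketch of what such a check could look like. Only the option name `allow_upload` and the value NO come from the description above; the function name, the dict-style config and the default of allowing uploads when the option is missing are assumptions.

```python
def upload_enabled(config):
    """Return False only when allow_upload is set to NO (case-insensitive).

    Assumption: a missing option means uploads stay enabled.
    """
    value = str(config.get("allow_upload", "YES"))
    return value.strip().upper() != "NO"


print(upload_enabled({"allow_upload": "NO"}))
print(upload_enabled({}))
```

The upload route and the index template would both consult this one function, so the page and the functionality are hidden together.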
Search for "Pitman" in the aldous data
http://bibsoup.net/collection/aldous9?q=Pitman&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search
returns as top entry
Weak Convergence of Random
Probab. Th. Rel. Fields
D.J. Aldous, G. Miermont, J. Pitman
http://xxx.arXiv.org/abs/math.PR/0401115
But search for title words "Continuum Random Trees"
http://bibsoup.net/collection/aldous9?q=Continuum+Random+Trees&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search
Returns nothing.
Search
returns 4 entries, but not the one above. These entries have "continuum random tree" in "subjects" field, but not in title.
It's a serious defect to miss words in title. Maybe the title words never got into the index?
login / signup, identify collection tag as belonging to me.
When a user imports a file, it is stored locally in store/raw. When a file is imported again, we should check whether it differs from the stored copy. Presumably it will, but the same check will be required once we enable scheduled automated checks of web urls, so we need to do this anyway.
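One simple way to do the comparison is to hash the incoming bytes and the stored copy. This is only a sketch: the function names are hypothetical, SHA-256 is an arbitrary choice, and a temporary directory stands in for store/raw.

```python
import hashlib
import os
import tempfile


def digest(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def has_changed(new_bytes, stored_path):
    """True if there is no stored copy or its bytes differ from new_bytes."""
    if not os.path.exists(stored_path):
        return True
    with open(stored_path, "rb") as f:
        return digest(f.read()) != digest(new_bytes)


# Usage: a temporary directory stands in for store/raw
store = tempfile.mkdtemp()
path = os.path.join(store, "hartley.bib")
first = has_changed(b"@article{a}", path)      # no stored copy yet
with open(path, "wb") as f:
    f.write(b"@article{a}")
same = has_changed(b"@article{a}", path)       # identical re-import
different = has_changed(b"@article{b}", path)  # source was edited
```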
From the display
http://bibsoup.net/collection/hartley
it is not apparent from what url the data was uploaded. The url, in this instance http://rsise.anu.edu.au/~hartley/hartley.bib
should be apparent, as it is in
http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http://rsise.anu.edu.au/~hartley/hartley.bib
See text "Display created by BibServer from this bibtex source file http://rsise.anu.edu.au/~hartley/hartley.bib" at bottom of page
with hyperlink to the source file.
Providers of source data should be encouraged to provide metadata about their collection which can then be made part of the default BibServer display of that collection. Formatting and handling of that metadata is another issue. First let's ensure that the
use case with no metadata besides a source url is accommodated.
Allowing uploaders to name their collection uploads complicates the issue raised here. The legacy BibServer names its caches by source url, so the issue of what happens if the same url is uploaded with two different collection names does not arise. It does not seem a good idea to be making copies of the data from the same url even with different collection names, but this may require further discussion.
To summarize.
Manager was initially envisaged as managing async uploads, but this is not required, so rename manager to importer; it just handles importing of content to the index.
Update tests too.
Turns out flask does not support JSONP... will have to fix this in flask in order to make the query endpoint useful again.
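Until that is sorted out, JSONP can be handled at the application level by wrapping the JSON body in the requested callback. A framework-independent sketch follows; the function name `jsonp_body` and the callback whitelist pattern are assumptions, and wiring the result into a Flask response is left out.

```python
import json
import re

# Only accept plain (optionally dotted) JavaScript identifiers as callbacks
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def jsonp_body(data, callback=None):
    """Return (body, mimetype); wrap the JSON in callback(...) if one is given."""
    body = json.dumps(data)
    if callback is None:
        return body, "application/json"
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name: %r" % callback)
    return "%s(%s);" % (callback, body), "application/javascript"


print(jsonp_body({"a": 1}, "cb"))
```

Validating the callback name matters because it is echoed back into executable JavaScript.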
bibsoup.net still running against SOLR. Move it.
A number of sources in various fields have stabilized on NLM XML for their biblio standard.
Including
EuDML
PKP Citation Markup Assistant
We should provide a converter to BibJSON
The refactored version runs but fails to upload on a specific dataset because upsert trips pyes when mapping.
Find error, add failing example test, and then fix.
need to enable user logins so that users can own and edit collections
The View record or search [source] for [text] [Go]
is potentially a great feature! It looks like this has been scripted from some search resource list I had lying around.
I'd like to see how this is controlled, and work on improving the selection of search resources which should be provided
in a config file.
Note that the [text] would be better entered in a text box.
A tricky point is that best construction of the query from data will depend on the source searched.
Probably some curated searches should be offered, then the user can mix/match with others.
As an example, if the record has a DOI, e.g. 10.1214/aoms/1177705069 then the search
http://scholar.google.com/scholar?q=10.1214/aoms/1177705069
typically returns an exact match. So if an item has a DOI, this link should always be offered. Without the DOI, something like
http://scholar.google.com/scholar?hl=en&q=author%3A%22Chung%22+The+ergodic+theorem+of+information+theory
gets the item as number 3 in a list of 85. The list from this softer search is very useful, but it is good to be able to provide both soft and exact matches into Google Scholar and other resources. Where exact matches are obtained, we should try to harvest them,
and register the remote ids. e.g.
http://scholar.google.com/scholar?cluster=17976357296721002721
I expect that a tool for providing high quality searches and links from a particular record is something we should develop as a separate module. This needs to be customized a lot depending on the type of record (article, book, person, ... ) and the
information already available. I have had many attempts at this. A general framework for managing such searches and links would be desirable.
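The DOI rule described above (always offer the exact DOI link when a DOI is present, plus a softer author/title search) could be sketched as follows. The record keys `doi`, `title` and `author` and the function name are assumptions, not the actual BibServer data model; the two URL shapes are the ones quoted above.

```python
from urllib.parse import quote, urlencode

SCHOLAR = "http://scholar.google.com/scholar"


def scholar_links(record):
    """Return (label, url) pairs for a record dict (key names are assumptions)."""
    links = []
    doi = record.get("doi")
    if doi:
        # Exact match: search on the DOI itself
        links.append(("exact", SCHOLAR + "?q=" + quote(doi, safe="/")))
    title = record.get("title")
    if title:
        # Soft match: title words, narrowed by author surname when known
        terms = title
        author = record.get("author")
        if author:
            terms = 'author:"%s" %s' % (author, title)
        links.append(("soft", SCHOLAR + "?" + urlencode({"hl": "en", "q": terms})))
    return links


links = scholar_links({
    "doi": "10.1214/aoms/1177705069",
    "title": "The ergodic theorem of information theory",
    "author": "Chung",
})
for label, url in links:
    print(label, url)
```

Other search resources would need their own query construction, which is why a per-source config entry seems the right place for these templates.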
DAO has an init_db method. I added a call to put_mapping, and put the mapping in the config. However put_mapping fails. Will need to find how to get pyes to put a dynamic mapping. Doing this would save having to manually create the index and put the mapping during install.
add in content negotiation so that JSON or HTML can be easily returned. Richard to drop in the code from SSS.
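As a placeholder until that code is dropped in, here is a minimal sketch of picking JSON or HTML from the Accept header. It is not the SSS code; the function name is hypothetical, wildcards such as `*/*` are deliberately ignored, and HTML is assumed to be the default.

```python
def preferred_type(accept_header, supported=("text/html", "application/json")):
    """Pick the supported media type with the highest q-value.

    Falls back to the first supported type when nothing matches.
    """
    best, best_q = supported[0], 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        mtype, q = fields[0], 1.0
        for f in fields[1:]:
            if f.startswith("q="):
                try:
                    q = float(f[2:])
                except ValueError:
                    q = 0.0
        if mtype in supported and q > best_q:
            best, best_q = mtype, q
    return best


print(preferred_type("application/json"))
print(preferred_type("text/html,application/json;q=0.9"))
```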
update website and check that ryan can install a bibserver
currently, when the manager is called, it just executes an upload. need to change this to queue an upload.
RP: Suggest using celery for this ...
enable existence of a user account that can do anything
exact seems like a more sensible term instead of raw, as the unanalysed field is stored exactly as it is found.
The refactored version is not combining text search value with facets, so when a facet filter is applied the text search value is deleted. I will check this out and repair.
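The intended behaviour is that the text search and the facet filters end up in the same query. A sketch of one way to build such a body with a `bool` query is below; the function name is hypothetical, the facet value is made up for illustration, and the actual fix may well use a different query shape.

```python
def build_query(text, facet_filters):
    """Combine free text with facet selections in a single ES query body."""
    must = []
    if text:
        must.append({"query_string": {"query": text}})
    # Each selected facet value narrows the result set further
    for field, value in sorted(facet_filters.items()):
        must.append({"term": {field: value}})
    if not must:
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"must": must}}}


q = build_query("Pitman", {"year.exact": "2004"})
print(q)
```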
whilst refactoring solreyes, it is necessary to change some of the tests we have, because solreyes will not exist any more.
solreyes code was built up from stuff originally started in a different project. it quickly got us off the ground, but needs factoring out. will separate it into resultmanager and urlmanager and setconfig.
subsequent tickets might see refactorings of those individual items.
NB: RP has already done quite a bit of wrapping elasticsearch in python in https://github.com/okfn/hypernotes
there are currently no tests to run - we have been pulling together various bits and pieces. We now need some tests to check that changes do not break functionality
Bulk import json file:
{ "collections": [ { "url": ..., "name": ... } ] }
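A sketch of reading a file of this shape. The function name is hypothetical and the sample values below are made up for illustration, reusing a source url and collection name mentioned in other issues.

```python
import json


def read_bulk_file(text):
    """Parse a bulk-import JSON document and return (url, name) pairs."""
    doc = json.loads(text)
    return [(c["url"], c["name"]) for c in doc.get("collections", [])]


sample = """
{ "collections": [
    { "url": "http://bibserver.berkeley.edu/tmp/erdos.bib", "name": "erdos" }
] }
"""
pairs = read_bulk_file(sample)
print(pairs)
```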
Repeated uploads from the same source are producing multiple entries.
e.g. http://bibsoup.net/collection/hartley
has been uploaded 3 times, so there are now 600 entries instead of 200, with each entry repeated 3 times, e.g.
http://bibsoup.net/collection/hartley?q=lifting+group
This is not the desired behavior. The simplest behaviour implemented by the legacy bibserver is that
each new upload from the same url should overwrite any previous upload. That should be implemented first.
Note that this overwriting should not just be entry by entry. The entire collection should be replaced.
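To make the desired semantics concrete, here is a toy in-memory sketch in which each upload from a url replaces the whole collection. The class is a stand-in for the real index, and keying by source url is an assumption carried over from the legacy behaviour.

```python
class CollectionStore:
    """In-memory stand-in for the index, keyed by source URL."""

    def __init__(self):
        self._by_source = {}

    def upload(self, source_url, records):
        # Replace the whole collection rather than merging entry by entry
        self._by_source[source_url] = list(records)

    def count(self, source_url):
        return len(self._by_source.get(source_url, []))


store = CollectionStore()
url = "http://rsise.anu.edu.au/~hartley/hartley.bib"
for _ in range(3):
    store.upload(url, [{"id": i} for i in range(200)])
print(store.count(url))
```

Against a real index the same effect needs a delete of everything tagged with that source before the new records are inserted.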
install will be easier once we have a setup script.
This is a breakout of item 10) of #14
to which Mark replied:
I agree with "we need to control how people will manage their own config file"
I do not agree this implies "we need to know who people are in some way".
I think we should regard BibServer as a webservice which takes two sorts of input data
Of course, both biblio datasets and config files should be checked in some way to see they are not malicious. I hope this
is adequate as an alternative to login gates.
Replication of the old bibserver capability on the benchmark Schramm dataset http://research.microsoft.com/~schramm/bibserver.bib is a first milestone. Following are a few issues I see between here and there. Probably these should be broken out into several issues, but I try to collect them here for completeness.
upload of the dataset. The upload failed for me.
The upload from a url is a post request. It should be a get request, so it can be easily bookmarked, and the data should be saved with a filename or id which is a suitably sanitized form of the url. e.g.
User supplied titles should not be used as ids, as they will clash eventually.
First upload should create a cache. Thereafter, subsequent calls for the same url should pull from cache, except with an indication e.g. "refresh=yes" in the get string to refresh from source.
There should be no redirect required for this upload procedure from a url. Perhaps optionally, but not required.
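The url-derived id and the cache-with-refresh behaviour could look roughly like this. Everything here is an assumption: the sanitizing scheme is just one possibility, and `UrlCache` and `fake_fetch` are hypothetical names; a real version would write to disk and fetch over HTTP.

```python
import re


def cache_key(url):
    """Turn a source URL into a filesystem-safe id (one possible scheme)."""
    key = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url.strip())
    return re.sub(r"[^A-Za-z0-9]+", "_", key).strip("_")


class UrlCache:
    def __init__(self, fetch):
        self._fetch = fetch  # callable url -> bytes, supplied by the caller
        self._data = {}

    def get(self, url, refresh=False):
        """Return cached data, fetching on first use or when refresh is set."""
        key = cache_key(url)
        if refresh or key not in self._data:
            self._data[key] = self._fetch(url)
        return self._data[key]


# Usage with a fake fetcher that records how often the source is hit
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"bibtex bytes"


src = "http://research.microsoft.com/~schramm/bibserver.bib"
cache = UrlCache(fake_fetch)
cache.get(src)                # first call fetches from source
cache.get(src)                # second call is served from cache
cache.get(src, refresh=True)  # like refresh=yes in the get string
print(cache_key(src), len(calls))
```

Note that a scheme like this can map two different urls to the same id, so a hash suffix may be needed in practice.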
Listings of Authors should be alphabetical, with author links, like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=authors or at least such a complete listing should be available.
Format in .bib or .json to produce such displays is negotiable. It's good to have a simple dropdown list of authors in the
left nav bar, but this does not replace a comprehensive author listing page.
Capability for subjects listing like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=subjects
should be supported, both in the data model and in display.
Similar capability for journals is also desirable.
Generally, we want for any facet (except perhaps types and years), to have both a simple dropdown in the nav, and a more
comprehensive full page listing which allows external links based on attributes in suitable entity tables for journals/subjects/people/...
Need the footer e.g. "Display created by BibServer from this bibtex source file http://research.microsoft.com/~schramm/bibserver.bib" or ".... from file uploaded by user ... now cached at .... " if it's a user upload. Generally, need to display the provenance of the data, so the user knows where it is coming from.
Provide "Edit source" input at the bottom.
Demonstrate display template to replicate the item display as closely as possible, and give maintainer control of details in display, e.g. what things are linked, in what order, ....
Imagine this involves:
An as yet unidentified memory leak has been introduced during the refactor. This could be due to pyes, flask or a change in our code. Bibserver appears to eat about 4gb per day. Have not looked into the cause yet.
I need to improve the documentation for the code we have now. We have stuff that does something useful at this point, so it is time to scale up dev efforts. I will update the docs and provide diagrams of how the code fits together etc.
Would like to carry some operations from command line e.g.:
urlmanager came from solreyes. It now needs a test.
I just tested
http://bibsoup.net/upload?source=http://bibserver.berkeley.edu/tmp/erdos.bib&collection=erdos
and it returns an Internal Server Error
This was a previous issue (#15) but I don't seem to have permission to reopen issues so I am making it a new one.
There is a big ES instance running on dev.okfn.org.
Move bibsoup.net there, particularly to try running the medline index.
Check if anyone will be upset if I kill the ES service by trying to facet the author field on the medline index...
Upload from http://bibserver.berkeley.edu/tmp/erdos.bib produces
<type 'exceptions.IndexError'> at /upload
list index out of range
Python /opt/bibserver/parsers/BibTexParser.py in read_bibitem, line 87
Web POST http://bibsoup.net/upload
Compare with http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http%3A%2F%2Fbibserver.berkeley.edu%2Ftmp%2Ferdos.bib
Note that the author listing for Erdos is an excellent test of unicode conversion capabilities from tex accents, and also of
handling a long author list (568 authors). The old bibserver does this by a crude mapping to html entities. For the new one, best to use the NUMDAM tex to unicode converter.
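To illustrate the kind of conversion involved, here is a small sketch that maps a handful of TeX accent commands to Unicode by appending a combining character and normalizing. This is not the NUMDAM converter; the function name is hypothetical, only simple one-letter accents are covered, and outer BibTeX grouping braces and forms such as dotless i are left alone.

```python
import re
import unicodedata

# TeX accent command -> Unicode combining character
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
    "H": "\u030b",  # double acute, as in Erd\H{o}s
    "c": "\u0327",  # cedilla
    "v": "\u030c",  # caron
}

# Symbol accents: \'e or \'{e}; letter accents: \H{o} or \H o
_SYMBOL = re.compile(r"""\\(['`^"~])(?:\{([A-Za-z])\}|([A-Za-z]))""")
_LETTER = re.compile(r"\\([Hcv])(?:\{([A-Za-z])\}|\s+([A-Za-z]))")


def tex_to_unicode(text):
    """Replace simple TeX accent commands with precomposed Unicode letters."""
    def repl(match):
        base = match.group(2) or match.group(3)
        return unicodedata.normalize("NFC", base + COMBINING[match.group(1)])

    text = _SYMBOL.sub(repl, text)
    return _LETTER.sub(repl, text)


print(tex_to_unicode(r"P. Erd\H{o}s and A. R\'enyi"))
```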
Current theme is good but we could make it quite a bit nicer without too much work (e.g. by reusing an existing OKFN theme).
The dataset class is improperly named for function now, rename to parser.
This class now manages parsing of files, and uses whichever parser is requested.
Flask bundles Jinja2 by default and it seems to be preferred in the general community. However, we have working mako templates and others may prefer them. So this is a very low priority (and perhaps is a wontfix?)
Importer currently only indexes one record at a time. These should be batched and bulk inserted.
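The batching part can be sketched independently of the index client. The helper name and the default batch size are assumptions, and the actual bulk insert call is left out because it depends on the client in use.

```python
from itertools import islice


def batches(records, size=500):
    """Yield lists of up to `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Usage: each chunk would be handed to one bulk insert call
sizes = [len(chunk) for chunk in batches(range(1050), size=500)]
print(sizes)
```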
Also suggest renaming BibTexParser.py to bibtex.py
this is not too hard, but requires some form of user control first. then allow people to edit records and collections.