artfl-project / philologic4 Goto Github PK

PhiloLogic4

License: GNU General Public License v3.0

JavaScript 1.21% Python 50.99% HTML 0.25% C 19.81% Makefile 0.68% Shell 0.18% SCSS 0.84% Vue 25.92% Dockerfile 0.12%

xml search-engine python tei-xml tei concordances collocations

philologic4's Introduction

4.7

PhiloLogic is an XML database/search engine/web app that is designed for the particular difficulties of TEI XML. For a more theoretical description, you can refer to our research publications or our blog.

Note that as of version 4.7.3, PhiloLogic can now parse plain text files. See documentation for more details.

See documentation

IMPORTANT

PhiloLogic 4.7 will only work on Unix-based systems (Linux, *BSD) though MacOS is not supported and guaranteed to work.
PhiloLogic 4.7 will only run on the Apache Webserver
PhiloLogic 4.7 has only been tested on Python 3.8 and up. For a Python 2 version, use the latest PhiloLogic 4.5 release.
The PhiloLogic 4.7 Web App will only work on recent versions of web browsers: Chrome, Firefox, Safari, Opera, Edge. No support for Internet Explorer.

philologic4's People

Contributors

Stargazers

Watchers

Forkers

rwhaling pleonard212 ffzg muranava mbwolff brown-ccv patrickdrouin broadwell pajusmar thawsitt clovis jsaizm

philologic4's Issues

Integrate ranked relevance into the Python library

This is much needed for various reasons:

it would support the same query syntax as the other reports
it would make it more stable
it would make the ranking code reusable for other reports

Math domain error in ranked relevance

This was identified by iIchard on an Encyclopédie build. This is probably due to the logarithm function in TF-IDF. This is easily fixable.

Turn off searches when more than one word is selected

This wouldn't apply to ranked relevance.

Result Count Contextualization

3/18: for the search "esprit saint" in Concordance report. Two possibly related issues, one a bug and one a possible confusion.

A possible confusing path or at least it confused me for a while. Select an author, say Mallet, from the Count by Author,and get 8 hits. Selecting Display frequency by Class of Knowledge works on the current search results and not the first results.
Using the "Display Frequency by" pull down, getting contexts from the counts works for some and not for others. Search esprit saint, Display frequency by headword, click on a headword it gets no results. No results for Class of knowledge. Works for author. ObjectID is NULL, to be eliminated.
Can't seem to get it to work for Volume from the pulldown.
collocate counts are a little odd. for the same search, select "fils" as the collocate. has a count of 4, returns two hits. Might want to check this collocation with collocation tables.

Dirty XML Parsing

right now, the Loader is hardwired to use my StrictParser, which is itself hardwired to expat. For Frantext support, etc., I'll need to re-instate dirty parsing. We can probably re-use the LXML driver class from FragmentParser

collocation-concordance dropping context for some results

Tim discovered some situations in which the conc-from-colloc displays no context; investigate further.

suppress Indexing in headwords, headers, etc.

The Parsers needs a list of elements to not perform indexing in at all. At one point, I was disabling indexing of all metadata extractions, but given that they can badly misfire with an unclosed element, e.g., Frantext, that was unacceptable.

object-to-page mapping

Clovis and I disable some buggy/heuristic object/page mapping code. Let's revisit it and get it right in the next build. One option is to fill in page links for the toms at load time; another is to do a run-time search by byte offset.

lxml object formatter

The current shlax-based object formatter is inadequate for the encyclopedie, and many other collections. We should implement a proper lxml-based error-tolerant tree transformation formatter instead.

Concordance to collocation: make the highlighting of collocate smarter

The highlighting of the collocate is done with a regex, unlike the highlighting of the original query word which is based on its byte offset. I should write a fancier regex to only match collocates which are within the determined parameters, no farther than x number of words from the original query.

Catch nonexistent metadata columns in loader

metadata columns declared in the loader, but not extant in the data, screw up all sorts of things. catch this, and create at least a placeholder column.

Frequency by Headword vs by Object

Another damnable detail. Do a search for homme* and then get Display Frequency by headword. The totals very close to what I get in Philo3 and the articles counts are usually fairly close. Discours Preliminaire 151 vs 153, Evolutions, Encyclopedie ~143 and so on. Except for "Homme". Philo4 reports 156 while P3 has split out several subarticles, all "Homme" with their specific frequencies, totaling 149.

Homme, (Morale.) [Class: Morale] [Author: Le Roy (Charles Georges)] {Machine Class: Morale} (Page 8:274)
Homme, (Hist. nat.) [Class: Histoire naturelle] [Author: Diderot] {Machine Class: Histoire naturelle} (Page 8:257)

It appears to me that P4 counts by string match on headword, thus merging objects with the same title, while P3 counts by object ID, keeping them distinct.

This is another one of those things that I think are both reasonable and valid ways. But we probably want to make sure that we all agree on which is to be adopted, why, and document it.

Too many SQL variables

This one's a mess. SQLite has a compiled maximum of 999 parameter-substitutions in a statement, and our metadata-expander can push way beyond that in certain situations. We should look at either splitting these up somehow, or else recompiling SQLite to increase the limit, although that might not jibe well with Python. http://www.sqlite.org/limits.html

default_object_level in db.locals

Some db's should return an object type other than doc for empty queries--allow for dic3's in encyc, for example. should be an optional db.locals var.

clear up default_object_level

there are a variety of redundant "default behavior" mechanisms. The wsgi_handler sets a no_q flag, the DB object will try to detect null queries, and the conc report will try to do the same and hand off to the biblio report, which itself tests for no_q. This needs some tidying. I would reccommend moving all this into the library, at the DB object and the default_object_level flag.

new report for page images

Robert suggests that we use proper templates for page images, with prev/next buttons, rather than linking directly to the file. Not necessary for internal testing, but absolutely must-do before public testing.

ValueError with loader.py

Hi there,
I managed to compile and install PhiloLogic 4 but when loading a TEI file (or any other file) with 'python loader.py database1 lg.xml' I get the following errors:

The database1 database already exists
Do you want to delete this database? Yes/No
yes
Traceback (most recent call last):
File "loader.py", line 204, in
debug=debug)
File "/usr/local/lib/python2.7/site-packages/philologic/Loader.py", line 94, in init
for o_type, path, param in self.metadata_xpaths:
ValueError: too many values to unpack

How shall I proceed?
Thank you all,
Peter

Replace all tabs in xml by a whitespace

The loader splits on tabs, so does the query parser to handle normalized forms. Which means we need to remove all tabs in the XML at load time.

Concordance Report, Truncated entries

Search "terme" and there are truncated entries in the default concordance report:

Eidous - ABAISSE - Abaissé, page 1:7. More
e son Ordre, & depuis Bailli de Lyon. D'or
d'Alembert - ABAISSER - Abaisser, page 1:8. More
it alors ravaler. Voyez Ravaler. (K)
unknown - ACOUTREUR, page 1:111. More
habitude que se sont formées les régles du goût pour l'art de bâtir selon l'

The More link seems not to work in truncated entries.

Seems to be a general bug, now that I am looking for it. search "passion"

d'Alembert - ANTIPATHIE, page 1:510. More
d, &c. Voyez d

Does not seem to be related to length of article, but may be related to the position of the word in the smallest object ... such as a subarticle.

Preserve inline formatting in concordance

Currently, we're removing all tags in concordance results, just like in KWIC. We need to use lxml to keep inline formatting.

Inaccurate results count in concordance searches when fewer than the results per page setting

Here's Tim's description of the issue:

I've noticed one little issue that occasionally pops up when I run concordance searches. I'm seeing reports that return the message 'Hits 1-50 of 33...' (See attached screen shot.) Can that message be re-jiggered to say something like '33 hits returned so far. Still working. Refresh for a more accurate count of results.'?

tag mangling in chunkifier()

while working on a more flexible ObjectFormatter, I found out that the get_text_obj() function, which is responsible for retrieving the content of the object, is mangling the beginning and end tags, I believe somewhere around the call to chunkifier(). In the common case, the first 5 or 6 bytes of the tag are lopped off, which is disastrous for consistent formatting--the tag that ends the div is likewise wrecked. I'm going to re-implement highlighting in the ObjectFormatter, so this concordance chunkifier won't need to be re-used there; however, we should really avoid mangling tags like this anywhere.

Collocation span

http://robespierre.ci.uchicago.edu/philo4/encycmarch12_4/

3/14/2013: Search "esprit*" in 5 word collocation. Get 132 hits with
“nos”. First hit:

Ainsi plusieurs Sciences ont été, pour ainsi dire, contemporaines;
mais dans l'ordre 'historique des progrès de l'esprit, on ne peut les
embrasser que successivement. Il n'en est pas de même de l'ordre
encyclopédique de nos connoissances.

esprit and nos seem pretty far removed. Many seem reasonable, with a
significant portion rather far away (for 5 words), so it might be good
to check how are collocations being counted and/or searched from the
collocation tables?

philo_id queries broken, badly

philo_id queries don't work because I can't infer a type for them. I need to have special handling of philo_id's, and also think about other "universal" parameters.

Frequency table to concordance: mismatch in number of hits

When searching for frequencies by headword of "esprit saint" within 3 words, you get 7 hits for "esprit", but you click on the link, you only get 3 hits.

word attribute indexing

We need a single, standard word attribute indexing post-filter for expanding word metadata attributes. Ideally, this would run after the word frequency generators, etc., so as not to pollute them--although creating separate pos and lemma tables would be nice also. The index and query systems should then pick it up automagically.

Tokenization rules -- very minor

I ran a search for = and came back with lots of hit (excellent by the way :-). I'm not sure if this should be a word breaker or not, when numbers are involved (I'm not that concerned). There are instances of things like ombre=230, horisontale=56, parametre=b, signe=ou, etc. which you probably want to tokenize.

There are also 19,000 instances of ° .... which I think is fine, but we probably want to note someplace.

I see » and « are treated as characters and words (if stand alone).
Thus: bourgades« and république»
I think you may want these as word breakers, at least when they are tokenized as part of a word. But I'm in tow minds on whether or not they should be search characters. Very handy in some cases -- find me all the direct quotations marks, but it will be somewhat counter intuitive for context searching -- find me word1 and word2 in 3 words.

If you have a moment, please do put this build on pantagruel so I can poke around the word lists, etc.

No highlighting in pages and other text objects

As the title indicates, highlighting is not working when clicking on page links or text object links in a results page.

Put the javascript global variable in the header rather than instantiating them from philologic.js

Frequency Table click does not go to results

3/18: for the search esprit saint (with three words) for frequency report by head_norm, clicking on the headword does not fetch either the article or the results. Works as expected for Author and Volume, not for class of knowledge. One can select "Object ID", which returns a count for NULL (probably want to eliminate that report option).

This is either a question from me or a bug report. I don't know if Frequency Table is something we should be looking at right now. May be moot given search results available in Concordance report.

Group of hits fetch goes beyond end of hitlist

http://robespierre.ci.uchicago.edu/philo4/encycmarch12_4/dispatcher.py/

03/13: 5 word collocation of esprit, click on homme, then go to Last and get

"230 occurences of collocate "homme" in the vicinity of "esprit":
Hits 9241 - 220 of 230"

and obviously no results.

replace remaining ElementTree references with LXML

check metadata/egrep parser

Charlie observed that multi-term metadata searches are croaking. the logs indicate that "Diderot Vandenesse" is getting translated to one egrep for "diderot vandenesse", not into one command for each term. NOT terms may have similar problems; check the parser.

page object api

we need slightly more rigor around page object access. in particular, moving from toms to pages is tricky. add this to ObjectWrapper, then re-visit Clovis object-pager code.

possible issue in Query.py and process ID?

I've loaded a TEI doc, and am pretty sure at least some parts of philo4 are happy with it, viz:

http://kartor.32by32.com/philologic/douglass/scripts/term_list.py?term=nat

The problem occurs when trying to execute a search. Python jumps to 100% CPU and sticks there, eventually causing a timeout. Debug output from error_log (below) suggests that Query.py couldn't change directories to (the nonexistant) /var/lib/philologic/hitlists/. Which it should only do if it cannot find the current process id of the search...?

Environment is OS X 10.8.4, python 2.7.2...

Traceback (most recent call last):, referer: http://kartor.32by32.com/philologic/douglass/
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/wsgiref/handlers.py", line 86, in run, referer: http://kartor.32by32.com/philologic/douglass/
    self.finish_response(), referer: http://kartor.32by32.com/philologic/douglass/
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/wsgiref/handlers.py", line 126, in finish_response, referer: http://kartor.32by32.com/philologic/douglass/
    for data in self.result:, referer: http://kartor.32by32.com/philologic/douglass/
  File "/Library/WebServer/Documents/philologic/douglass/dispatcher.py", line 23, in philo_dispatcher, referer: http://kartor.32by32.com/philologic/douglass/
    yield getattr(reports, report or "concordance")(environ,start_response), referer: http://kartor.32by32.com/philologic/douglass/
  File "/Library/WebServer/Documents/philologic/douglass/reports/concordance.py", line 20, in concordance, referer: http://kartor.32by32.com/philologic/douglass/
    hits = db.query(q["q"],q["method"],q["arg"],**q["metadata"]), referer: http://kartor.32by32.com/philologic/douglass/
  File "/Library/Python/2.7/site-packages/philologic/DB.py", line 139, in query, referer: http://kartor.32by32.com/philologic/douglass/
    return Query.query(self,qs,corpus_file,self.width,method,method_arg,limit,filename=search_file), referer: http://kartor.32by32.com/philologic/douglass/
  File "/Library/Python/2.7/site-packages/philologic/Query.py", line 26, in query, referer: http://kartor.32by32.com/philologic/douglass/
    os.chdir(dir), referer: http://kartor.32by32.com/philologic/douglass/
OSError: [Errno 2] No such file or directory: '/var/lib/philologic/hitlists/', referer: http://kartor.32by32.com/philologic/douglass/

Mismatch between "page" field in toms table and "n" field in pages table

While working some more on the paging system, I noticed that for the Encyclopédie build, pages in the toms table are just the page numbers, whereas in the pages table there the volume number + the page number. For instance, page 35 in the toms table might be 3:35 in the pages table.
This causes issues with my paging code which fetches the page number by philo_id from the toms table, and then the start_byte for that page from the pages table. I suppose I could get the byte start from the toms table...

Template/JS cleanup

The Mako templates are quite disorganized, frequently misnamed, and several are completely obsolete and unused--we should go through and rationalize the design. We should also modify the javascript code to detect as much structure as possible at run-time, so that we don't have to edit JS when we modify a template, and clearly document what JS data structures can and should be edited by users.

make absolute object links for object titles

Key Error in ranked relevance: rare issue

When running a search for "drachmes" in the Encyclopédie, I get the following error message:

Error !

KeyError: u'\xe6'

${f.cite.make_div_cite(i)} % endif % if hit_num: <% metadata = {} for m in db.locals['metadata_fields']: metadata[m] = i[m] url = f.link.make_query_link(q["q"],q["method"],q["arg"],*metadata) %>
/usr/lib64/python2.6/urllib.py, line 1236:
res = map(safe_map.getitem, s)
/usr/lib64/python2.6/urllib.py, line 1244:
return quote(s, safe)
/var/www/html/philo4/encycmarch12_4_clovis/functions/link.py, line 43:
encoded_str.append(quote_plus(k, safe='/') + '=' + quote_plus(v, safe='/'))
/var/www/html/philo4/encycmarch12_4_clovis/functions/link.py, line 36:
return "./?" + url_encode(q_params)
templates/relevance.mako, line 33:
<%
/usr/lib/python2.6/site-packages/mako/runtime.py, line 824:
callable(context, *args, *_kwargs)
/usr/lib/python2.6/site-packages/mako/runtime.py, line 798:
_exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
/usr/lib/python2.6/site-packages/mako/runtime.py, line 766:
**kwargs_for_callable(callable, data))
/usr/lib/python2.6/site-packages/mako/template.py, line 412:
return runtime.render(self, self.callable, args, data)
/var/www/html/philo4/encycmarch12_4_clovis/reports/render_template.py, line 17:
return template.render(_args, *_data).encode("UTF-8", "ignore")

check out range queries

range queries work okay-ish on their own, but may crash in non-first position on complex queries. look closely at this once we have some good numeric metadata loaded. for now, we're quote-escaping the autocomplete on any metadata value with a hyphen, just like the vertical bar.

text search oddity due to sentence bounding?

Here is an odd one and low priority IMHO.

Text search for "M de Garenjeot". 3 hits which don't seem right, the hit hi-lite is in the middle of these lins:

la troisieme race: on voit que dès le tems de Henri premier il signoit les chartes de
Novogrodeck, de Brestia, de Kiovie, de Mscislau, de Vitepsk, & de Poloczk. La Lithuanie
de bienfaisance & d'humanité dans le tems de leur pauvreté que dans le tems

I've run it several times and got the same.

"m de Garenjeot" produces a number of hits, mostly "m de"

"m de garenjeot": produces nothing.

There is one hit in the ENC:
http://artflx.uchicago.edu/cgi-bin/philologic/getobject.pl?c.3:190.encyclopedie0113.1318406.1318409.1318412

I'm wondering if the M. is throwing off because it being handled as a sentence terminator? Just a guess, but I'm thinking sentence bounding, since "m de" gets 89 in this and 4,200 in P3. "paroles c v restitutor" also shows same.

Most of the sentence bounding looks good. And man this reminds me just how much I hate ".". Almost as bad as "'". :-)

Check accented characters in headword search

http://robespierre.ci.uchicago.edu/philo4/encycmarch12_4/

3/15

"ecartele" works
"ECARTELE" works

unknown - CONTRÉCARTELÉ, Volume 4
Eidous - CONTRÉCARTELER, Volume 4
unknown - ECARTELÉ, Volume 5
unknown - ECARTELER, Volume 5

"ecartelé" fails
"écartelé" fails
"ÉCARTELÉ" fails
ECARTELÉ fails

Also, for text search, "ÉCARTELÉ" fails ... probably need to lower case input accented capitals since "écartelé" and .... opps

text search for "ECARTELE" gets 1 hit "Il y a des écus contr'écartelés qui ont vingt " ...

100 in P3.

Question: Are we using the OLD convention of upper case letters match all forms of letter?

Booleans in headword search and text search?

Status of boolean operators? I don't know if they are supposed to work, but I thought I would mention 'em

Headword:
"tradition mythologique" works

"tradition NOT mythologique: fails
"tradition AND mythologique: fails
"tradition OR mythologique": files

in text search
"esprit NOT vin" fails.
"esprit AND vin" fails
"esprit OR vin" fails.

Now, the implied AND as a space " ", so "esprit vin" in a set span works. The "|" also appears to work. We put the keywords "AND" and "OR" as a way to try to standardize metadata and full text searching.

NOT is really helpful ... eg find all of the instances of esprit NOT followed by vin.

proper noun flag

For some languages, we have lemmatization or other resources in markup to distinguish capitalized common nouns from proper nouns. Greek, for example, almost never capitalizes common nouns. It would be nice to have a flag in the word object to disable accent-flattening. This cuts across a lot of different mechanisms, but it would basically only work for exact quote matching.

Make table of contents loadable via ajax

Avoid long load times

multi-word phrases in ObjectWrapper

Currently, the HitWrapper / ObjectWrapper tests whether the length of Word objects is 7 or greater. Having specific modeling and access to individual word objects in a multi-word phrase is going to be critical.
::
RW

3+ word phrases dying in templates

1 and 2 word phrases work in concordance, but throw a template error with 3+. Probably formatting code. Investigate immediately.

Frequency By Object

Right now, we can perform frequency by title, which operates on the title string, which may or may not be unique, and thus may correspond to more than one object. For proper bibliography-style results, we need to be able to compute and report frequency by object. However, this would itself require the ability to directly query Philo-id, which is a separate issue

tokens getting split intermittently

Robert noted that searches for "homm" also contained what appeared to be the tokens "homme" and "hommes"--investigation shows that all those records were getting returned from "homm e" and "homm es". This suggests a tokenization problem; I've checked the source code; the most likely explanation is that the ExpatWrapper is calling multiple consecutive char_data events. I plan to address this by buffering char_data, same fix as non-word-breaking tags.

word counts / word frequency tables

I'm unsure whether this belongs in the main frequency table, or else in its own report, but we'd like to be able to access the total word frequency files, and query them vs. metadata via the ranked-relevancy tables; These would also be the primitive form of PhiloMine feature vector web services.