alan-turing-institute / defoe

Code to analyse books and newspapers data using Apache Spark.

License: MIT License
Combine the many duplicated query-specific helper functions (e.g. tuple unpackers) into a single helper module.
defoe/spark_utils.py open_stream creates and connects to a new instance of azure.storage.blob.BlobService for every "blob". This should be refactored so it's only done once per Azure container within which the blobs reside.
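A minimal sketch of the kind of caching the refactoring could use, assuming the legacy azure.storage.blob.BlobService API that defoe currently imports; the helper name and credential handling are illustrative, not existing defoe code. One BlobService can serve every blob in a container, so it only needs to be created once:
# Cache one BlobService per storage account rather than creating a new
# connection for every blob (names below are illustrative only).
from azure.storage.blob import BlobService

_services = {}  # account_name -> BlobService

def get_blob_service(account_name, account_key):
    """Return a cached BlobService for this account, creating it on first use."""
    if account_name not in _services:
        _services[account_name] = BlobService(account_name=account_name,
                                              account_key=account_key)
    return _services[account_name]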
defoe/alto|books|fmp/archive.py each define Archive, though Archive is not part of the data model (books/pages) but part of how the data is bundled. It should be possible, for example, to run queries over ALTO-compliant books that are not in ZIP files too. Complementarily, it should be possible to run queries over British Library Newspapers which are in ZIP files.
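One way to decouple the two concerns, sketched below with assumed class names rather than defoe's current API: the bundling layer only needs to list documents and open them as file-like objects, whether they sit in a ZIP file or loose on disk, and the books/pages data model is then built on top of whichever source is appropriate:
import os
import zipfile

class ZipArchive(object):
    """Documents bundled in a ZIP file."""
    def __init__(self, path):
        self.zip = zipfile.ZipFile(path)

    def names(self):
        return self.zip.namelist()

    def open(self, name):
        return self.zip.open(name)

class DirectoryArchive(object):
    """Documents stored loose in a directory."""
    def __init__(self, path):
        self.path = path

    def names(self):
        return os.listdir(self.path)

    def open(self, name):
        return open(os.path.join(self.path, name), 'rb')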
On CentOS 7:
$ python --version
Python 2.7.15 :: Anaconda, Inc.
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...
Remove azure packages and reinstall, using conda-forge:
$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService
OK
$ pip freeze | grep azure
azure==1.0.3
azure-common==1.1.16
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==3.0.2
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.20.3
On Urika:
$ python --version
Python 2.7.14 :: Anaconda custom (64-bit)
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...
$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService
ImportError: No module named azure.storage.blob
$ conda install -c conda-forge azure-storage
$ python
> from azure.storage.blob import BlobService
ImportError: cannot import name BlobService
$ pip freeze | grep azure
azure-common==1.0.0
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==1.0.0
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.36.0
See also #12
Clean up and standardise logging.
Relates to #5.
Allow credentials for multiple containers to be provided, both in code and in documentation.
The following queries use PreprocessWordType.LEMMATIZE by default:
Changing to another preprocessing type currently requires editing the source code. Allow the preprocessing type to be specified via the query configuration file instead.
Extend support for configurable preprocessing to all other queries across all data models.
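A sketch of how a query might read the preprocessing type from its configuration, assuming the configuration is handed to the query as a dict; the "preprocess" key, the import path and the lookup-by-name are assumptions, not confirmed defoe behaviour:
from defoe.query_utils import PreprocessWordType  # import path assumed

def get_preprocess_type(config):
    """Return the configured PreprocessWordType, defaulting to LEMMATIZE."""
    name = config.get("preprocess", "lemmatize").upper()
    return PreprocessWordType[name]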
Decide how errors are to be caught and handled e.g.
Identify opportunities to abstract a common data model out of books and newspapers. A model that, at minimum, allows the same query code to be run across both books and newspapers for a set of common queries would be a good start.
Currently defoe/papers/queries/target_concordance_collocation_by_date.py returns:
{
<YEAR>:
[
[<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
[<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
...
],
<YEAR>:
...
}
Turn the lists into dicts with filename, keyword, concordance and ocr keys.
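A sketch of the proposed structure, using the same placeholders as above:
{
    <YEAR>:
    [
        {"filename": <FILENAME>, "keyword": <WORD>,
         "concordance": <CONCORDANCE>, "ocr": <OCR>},
        ...
    ],
    <YEAR>:
    ...
}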
Some queries create a single regexp containing every word being searched for and try to match this regexp against a string containing every word in the article/page.
defoe.papers.queries.articles_containing_words creates a single regexp with all the words being searched for:
interesting_words = [re.compile(r'\b' + word.strip() + r'\b', re.I | re.U)
                     for word in list(open(interesting_words_file))]
This regexp is then matched against all the words in the article:
interest = articles.flatMap(lambda (year, article):
    [((year, regex.pattern), 1)
     for regex in interesting_words
     if regex.findall(article.words_string)])
Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave results that diverged from running the non-Spark versions of these queries in alan-turing-institute/cluster-code (epcc-master), for example diseases.py. Inspecting the raw XML data and manually searching for words showed matches that the above approach misses, e.g. Small-Pox and Small-pox.
diseases.py uses a function:
lowalpha = re.compile('[^a-z]')

def normalize(word):
    return re.sub(lowalpha, '', word.lower())
and matches against each word, rather than the entire text of a page:
for page, word in book.scan_words():
    normalised = normalize(word)
    if normalised in diseases:
        finds[normalised].append(page.code)
Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave the same results as for cluster-code (epcc-sparkrods).
Update word searchers to match on individual normalized words.
Move defoe.books.alto.utils, which has normalize, to defoe.utils.
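A sketch of the per-word approach applied to a defoe query; the article.words attribute and the function names are illustrative assumptions based on the excerpts above, not defoe's confirmed API:
import re

lowalpha = re.compile('[^a-z]')

def normalize(word):
    """Lower-case a word and strip everything outside a-z."""
    return re.sub(lowalpha, '', word.lower())

def matches(year, article, interesting_words):
    """Return ((year, word), 1) for each interesting word found in the article."""
    found = set()
    for word in article.words:  # attribute name assumed
        normalised = normalize(word)
        if normalised in interesting_words:
            found.add(normalised)
    return [((year, word), 1) for word in found]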
Run 2to3 to get recommendations for changes.
The key challenge will be handling Unicode. For example:
from cStringIO import StringIO
can be replaced by either of:
from io import BytesIO # for handling byte strings
from io import StringIO # for handling unicode strings
2to3 recommends changing:
try:
    return str(result[0])
except UnicodeEncodeError:
    return unicode(result[0])
to:
try:
    return str(result[0])
except UnicodeEncodeError:
    return str(result[0])
and
self.bwords = list(map(unicode, self.query(Page.words_path)))
to:
self.bwords = list(map(str, self.query(Page.words_path)))
https://python-future.org/compatible_idioms.html#unicode suggests something like:
from builtins import str as text
try:
    return str(result[0])
except UnicodeEncodeError:
    return text(result[0])
Returning concordance within a window of N words around each match would provide more user control over the volume of data returned for each match.
Update:
defoe/alto/queries/keyword_concordance_by_word.py
defoe/alto/queries/keyword_concordance_by_year.py
defoe/papers/queries/keyword_concordance_by_date.py
defoe/nzpp/queries/keyword_concordance_by_date.py
The following query:
defoe/papers/queries/target_concordance_collocation_by_date.py
provides a more complex example of returning concordance within a window of N words. It could be copied and simplified (to remove the use of a target word) and then ported to alto too.
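A sketch of extracting a window of words around a match, which is the core of what would need to be ported; WINDOW_SIZE and the argument names are illustrative:
WINDOW_SIZE = 5

def concordance(words, match_index, window=WINDOW_SIZE):
    """Return the matched word plus up to `window` words on either side."""
    start = max(0, match_index - window)
    end = min(len(words), match_index + window + 1)
    return words[start:end]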
Review all XPath queries and Python attribute names in object models.
Check that the Python developers' interpretation of what the XML elements and attributes mean is correct.
defoe/papers/article.py has:
@property
def words_string(self):
    """
    Return the full text of the article as a string. Remove all hyphens.
    This merges hyphenated words but may cause problems with subordinate
    clauses (The sheep - the really loud one - had just entered my office).
    """
    return ' '.join(self.words).replace(' - ', '')
Investigate whether it is possible to differentiate hyphenated words from dashes marking subordinate clauses.
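One possible heuristic to investigate, sketched here as an assumption rather than a confirmed fix: only rejoin fragments where the hyphen is attached to the preceding word, and leave free-standing " - " dashes, which usually mark subordinate clauses, untouched:
import re

# 'exam- ple' -> 'example', but 'sheep - the' is left alone.
HYPHEN_BREAK = re.compile(r'(\w)- (\w)')

def join_hyphenated(text):
    """Rejoin words split across a line break without touching dashes."""
    return HYPHEN_BREAK.sub(r'\1\2', text)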
Improve commenting in, and documentation about, the "lda_topics" query.
defoe.papers.queries.target_concordance_collocation_by_date has a hard-coded WINDOW_SIZE = 5 by default. Changing to another window size currently requires editing the source code. Allow the window size to be specified via the query configuration file instead.
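In the same spirit as the preprocessing sketch above, the window size could be read from the query configuration (the "window_size" key is an assumption):
WINDOW_SIZE_DEFAULT = 5

def get_window_size(config):
    """Return the configured window size, falling back to the default of 5."""
    return int(config.get("window_size", WINDOW_SIZE_DEFAULT))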
If defoe.papers.issue.Issue fails to parse an XML document, an object is still built with empty strings, lists etc. as fields and datetime.now() as the date. This can give misleading results when running queries over datasets containing such documents, e.g.
nohup spark-submit --py-files defoe.zip defoe/run_query.py ~/data/papers/files.all.txt papers defoe.papers.queries.articles_per_year -n 144 > log.txt &
cat results.yml
{1714: 115, 1715: 302, ..., 1950: 7155, 2019: 0}
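One possible direction, sketched with a hypothetical parse_ok flag that is not an existing defoe attribute: record whether parsing succeeded and filter failed issues out before counting, instead of letting the datetime.now() default surface as a bogus year such as 2019 above:
def parsed_ok(issue):
    """Keep only issues that parsed cleanly; 'parse_ok' is hypothetical."""
    return getattr(issue, 'parse_ok', True)

# e.g. issues = issues.filter(parsed_ok) before counting articles per year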
Notify developers of UCL-dataspring/cluster-code and UCL/i_newspaper_rods of "defoe".
Ingest stranger_danger_group.py