defoe's People

Contributors

akrause2014, mialondon, mikej888, rosafilgueira, thobson88

defoe's Issues

Reduce number/overhead of Azure authentication calls

open_stream in defoe/spark_utils.py creates, and connects to, a new instance of azure.storage.blob.BlobService for every "blob". This should be refactored so that the service is created only once per Azure container within which the blobs reside.
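One possible refactoring is to memoize the service object per container, so that repeated calls to open_stream reuse an existing connection. A minimal sketch of the caching pattern, with the BlobService construction abstracted behind a caller-supplied factory (the factory argument is a placeholder for the real azure.storage.blob.BlobService setup, not existing defoe code):

```python
# Cache of service objects, keyed by container name, so that a
# connection is created at most once per Azure container.
_service_cache = {}

def get_service(container, factory):
    """Return a cached service object for the given container,
    creating it via factory(container) only on first use."""
    if container not in _service_cache:
        _service_cache[container] = factory(container)
    return _service_cache[container]
```

All blobs in the same container then share one service object; only the first blob pays the connection cost.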

Decouple object model from storage format

defoe/alto|books|fmp/archive.py each define an Archive class, though Archive is not part of the data model (books/pages) but part of how the data is bundled. It should be possible, for example, to run queries over ALTO-compliant books that are not in ZIP files. Conversely, it should also be possible to run queries over British Library Newspapers which are in ZIP files.

Identify canonical commands to install "azure" package with "azure.storage.blob.BlobService"

On CentOS 7:

$ python --version
Python 2.7.15 :: Anaconda, Inc.
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...

Remove azure packages and reinstall, using conda-forge:

$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService

OK

$ pip freeze | grep azure
azure==1.0.3
azure-common==1.1.16
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==3.0.2
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.20.3

On Urika:

$ python --version
Python 2.7.14 :: Anaconda custom (64-bit)
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...

$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService
ImportError: No module named azure.storage.blob

$ conda install -c conda-forge azure-storage
$ python
> from azure.storage.blob import BlobService
ImportError: cannot import name BlobService

$ pip freeze | grep azure
azure-common==1.0.0
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==1.0.0
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.36.0

Support configuration of word preprocessing type

The following queries use PreprocessWordType.LEMMATIZE by default:

  • defoe.papers.queries.keysentence_by_year
  • defoe.papers.queries.target_and_keywords_count_by_year
  • defoe.papers.queries.target_and_keywords_by_year
  • defoe.papers.queries.target_concordance_collocation_by_date

Changing to another preprocessing type currently requires editing the source code. Update the query configuration file to be:

  • Either a YAML file that includes the preprocessing type and list of words/sentences.
  • Or a YAML file that includes the preprocessing type and the path (relative to the current file or absolute) to a plaintext list of words/sentences.

Extend support for preprocessing to all other queries across all data models.
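For example, the configuration file might look like one of the following (key names are suggestions, not part of the current code):

```yaml
# Hypothetical query configuration: preprocessing type plus inline words.
preprocess: normalize        # e.g. none | normalize | stem | lemmatize
data:
  - heart
  - hearts

# Or: preprocessing type plus a path (relative or absolute) to a
# plain-text word list.
# preprocess: lemmatize
# data_file: diseases.txt
```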

Clean up and standardise error handling

Decide how errors are to be caught and handled e.g.

  • Problem reading a data file.
  • Problem reading a query configuration file.
  • Problem loading an object model or query module.
  • Parsing errors.
  • Unicode errors.
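One option is a small exception hierarchy, so each failure mode listed above can be caught and reported uniformly while still being distinguishable. A sketch (class names are suggestions, not existing defoe code):

```python
class DefoeError(Exception):
    """Base class for all defoe errors."""

class DataFileError(DefoeError):
    """Raised when a data file cannot be read or parsed."""

class ConfigError(DefoeError):
    """Raised when a query configuration file is missing or invalid."""

class ModuleLoadError(DefoeError):
    """Raised when an object model or query module cannot be loaded."""
```

Top-level code can then catch DefoeError in one place, while query code raises the specific subclass.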

Abstract out common data and query model

Identify opportunities to abstract out a common data model from books and/or newspapers. At a minimum, a model that allows the same query code to be run across books and newspapers, for a set of common queries, would be valuable.

Structure target_concordance_collocation_by_date.py query results

Currently defoe/papers/queries/target_concordance_collocation_by_date.py returns:

        {
            <YEAR>:
            [
                [<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
                [<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
                ...
            ],
            <YEAR>:
            ...
        }

Turn the lists into dicts with filename, keyword, concordance, ocr keys.
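The conversion itself is straightforward; a sketch of the intended result shape (helper name is hypothetical):

```python
def rows_to_dicts(results):
    """Convert {year: [[filename, word, concordance, ocr], ...]} into
    {year: [{"filename": ..., "keyword": ..., "concordance": ...,
             "ocr": ...}, ...]}."""
    keys = ("filename", "keyword", "concordance", "ocr")
    return {year: [dict(zip(keys, row)) for row in rows]
            for year, rows in results.items()}
```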

Update word searches to match on individual normalized words

Some queries create a single regexp including every word being searched for and try to match this regexp against a string containing every word in the article/page.

defoe.papers.queries.articles_containing_words creates a single regexp with all the words being searched for:

interesting_words = [re.compile(r'\b' + word.strip() + r'\b', re.I | re.U)
    for word in list(open(interesting_words_file))]

This regexp is then matched against all the words in the article:

interest = articles.flatMap(lambda (year, article):
    [((year, regex.pattern), 1) for regex in interesting_words if regex.findall(article.words_string)])

Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave results that diverged from running the non-Spark versions of these queries using alan-turing-institute/cluster-code (epcc-master), for example, diseases.py. Inspecting the raw XML data and manually searching for words showed matches not picked up by the above approach, e.g. Small-Pox and Small-pox.

diseases.py uses a function:

import re

lowalpha = re.compile('[^a-z]')

def normalize(word):
    return re.sub(lowalpha, '', word.lower())

and matches against each word, rather than the entire text of a page:

for page, word in book.scan_words():
    normalised=normalize(word)
    if normalised in diseases:
        finds[normalised].append(page.code)

Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave the same results as for cluster-code (epcc-sparkrods).

Update word searches to match on individual normalized words.

Move defoe.books.alto.utils, which has normalize, to defoe.utils.

Upgrade to be Python 3-compliant

Run 2to3 to get recommendations for changes.

The key challenge will be handling Unicode. For example:

from cStringIO import StringIO

can be replaced by either of:

from io import BytesIO     # for handling byte strings
from io import StringIO    # for handling unicode strings

2to3 recommends changing:

try:
    return str(result[0])
except UnicodeEncodeError:
    return unicode(result[0])

to:

try:
    return str(result[0])
except UnicodeEncodeError:
    return str(result[0])

and

self.bwords = list(map(unicode, self.query(Page.words_path)))

to:

self.bwords = list(map(str, self.query(Page.words_path)))

https://python-future.org/compatible_idioms.html#unicode suggests something like:

from builtins import str as text
try:
    return str(result[0])
except UnicodeEncodeError:
    return text(result[0])

Update concordance queries to return window of surrounding text not full article/page text

This provides more user control over the volume of data returned for each match.

Update:

defoe/alto/queries/keyword_concordance_by_word.py
defoe/alto/queries/keyword_concordance_by_year.py
defoe/papers/queries/keyword_concordance_by_date.py
defoe/nzpp/queries/keyword_concordance_by_date.py

The following query:

defoe/papers/queries/target_concordance_collocation_by_date.py

provides a more complex example of returning concordance within a window of N words and could be copied and simplified (to remove the use of a target word) and then ported to alto too.
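The core of such a change is extracting a window of N words either side of each match. A minimal sketch of that extraction (not the existing query code):

```python
def concordance_window(words, index, window=5):
    """Return the words up to `window` positions either side of
    words[index], inclusive of the matched word itself."""
    start = max(0, index - window)
    return words[start:index + window + 1]
```

The query would then return concordance_window(article_words, match_index, window) instead of the full article/page text.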

Differentiate hyphens from subordinate clauses

defoe/papers/article.py has:

@property
def words_string(self):
    """ 
    Return the full text of the article as a string. Remove all hyphens.
    This merges hyphenated words but may cause problems with subordinate
    clauses (The sheep - the really loud one - had just entered my office).
    """
    return ' '.join(self.words).replace(' - ', '')

Investigate if it is possible to differentiate hyphens from subordinate clauses.
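One heuristic is to treat a bare "-" token (spaces on both sides) as clause punctuation and only merge when the hyphen is attached to the end of a word, as it is at a line break. A sketch, assuming the tokenised words preserve trailing hyphens (this is an illustration of the heuristic, not the existing words_string code):

```python
def join_words(words):
    """Join tokens, merging a trailing-hyphen token with the next word
    (line-break hyphenation, e.g. 'dis-' + 'ease'), but keeping a bare
    '-' token (likely a subordinate clause) as punctuation."""
    out = []
    pending_merge = False
    for w in words:
        if pending_merge:
            out[-1] = out[-1][:-1] + w   # drop the hyphen, merge words
            pending_merge = False
        elif w.endswith('-') and len(w) > 1:
            out.append(w)                # hyphenated word fragment
            pending_merge = True
        else:
            out.append(w)                # ordinary word or bare dash
    return ' '.join(out)
```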

Comment and document lda_topics

Improve commenting in and documentation about "lda_topics" query:

  • defoe/papers/queries/lda_topics.py
  • docs/papers/lda_topics.md

Support configuration of window size

defoe.papers.queries.target_concordance_collocation_by_date has a hard-coded WINDOW_SIZE = 5 by default. Changing to another window size currently requires editing the source code. Update the query configuration file to be:

  • Either a YAML file that includes the window size and list of words/sentences.
  • Or a YAML file that includes the window size and the path (relative to the current file or absolute) to a plaintext list of words/sentences.
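For example (key names are suggestions, not part of the current code):

```yaml
# Hypothetical configuration: window size plus inline words.
window_size: 10
data:
  - plague

# Or: window size plus a path to a plain-text word list.
# window_size: 10
# data_file: diseases.txt
```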

Fix Issue false positives

If defoe.papers.issue.Issue fails to parse an XML document, an object is still built, with empty strings, lists etc. as fields and datetime.now() as its date. This can give misleading results when running queries in which such documents are encountered e.g.

nohup spark-submit --py-files defoe.zip  defoe/run_query.py ~/data/papers/files.all.txt papers defoe.papers.queries.articles_per_year -n 144 > log.txt &
cat results.yml
{1714: 115, 1715: 302, ....,1950: 7155, 2019: 0}
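One fix is to make parsing return None (or raise) on failure, and have query runners filter such documents out rather than fabricate default fields. A sketch of that filtering (function names are hypothetical, and the parser argument stands in for the real XML parsing):

```python
def parse_issue(xml_path, parser):
    """Return a parsed issue, or None if parsing fails, instead of an
    object populated with empty fields and datetime.now()."""
    try:
        return parser(xml_path)
    except Exception:
        return None

def valid_issues(paths, parser):
    """Parse all paths, dropping (and optionally logging) failures."""
    return [issue for issue in (parse_issue(p, parser) for p in paths)
            if issue is not None]
```

With this in place, a spurious year such as 2019: 0 would never appear in the results, because unparseable documents contribute nothing.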
