aiu's People

Contributors

himarshaj, machawk1, shawnmjones

aiu's Issues

The convert_LinkTimeMap_to_dict function does not handle unquoted relations

There are cases when non-memento relations show up in the Link header. The convert_LinkTimeMap_to_dict function does not need to handle these relations, and currently ignores them.

In the case of https://ianmilligan.ca/, there is an item in the link header that causes this function to raise an exception because it is unquoted. The issue is present in https://homerfan.wordpress.com as well. Both of these appear to be WordPress sites.

In the example below, the relation "shortlink" is unquoted, which breaks the parser in this function.

# curl -I https://ianmilligan.ca/
HTTP/2 200
server: nginx
date: Thu, 06 Sep 2018 22:11:55 GMT
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=86400
vary: Accept-Encoding
vary: Cookie
x-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
link: <https://wp.me/4cEB>; rel=shortlink
x-ac: 3.bur _bur
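
One way to tolerate unquoted relations is to accept either a quoted string or a bare token as a parameter value. The following is a minimal sketch of that idea, not the actual convert_LinkTimeMap_to_dict code; the function name parse_link_header and both regular expressions are hypothetical, and the memento URI in the usage example is a placeholder.

```python
import re

# Hypothetical helper, not part of aiu: matches one "<uri>; name=value; ..." entry.
# A value may be a quoted string (which may contain commas, as in datetime="...")
# or an unquoted token such as rel=shortlink.
_LINK_VALUE = re.compile(
    r'<(?P<uri>[^>]*)>'
    r'(?P<params>(?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)'
)
_PARAM = re.compile(
    r';\s*(?P<name>[\w*-]+)\s*=\s*(?:"(?P<quoted>[^"]*)"|(?P<token>[^\s";,]+))'
)


def parse_link_header(header_value):
    """Return a list of (uri, params) tuples from a Link header value."""
    links = []
    for link in _LINK_VALUE.finditer(header_value):
        params = {}
        for param in _PARAM.finditer(link.group("params")):
            quoted = param.group("quoted")
            # Use the quoted form when present, otherwise the bare token.
            value = quoted if quoted is not None else param.group("token")
            params[param.group("name").lower()] = value
        links.append((link.group("uri"), params))
    return links


if __name__ == "__main__":
    header = (
        '<https://wp.me/4cEB>; rel=shortlink, '
        '<http://example.org/m1>; rel="memento"; '
        'datetime="Thu, 06 Sep 2018 22:11:55 GMT"'
    )
    for uri, params in parse_link_header(header):
        print(uri, params)
```

Calling code could then keep only the entries whose rel value it recognizes and skip the rest, such as shortlink.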

URIs returned from some PandoraCollection and PandoraSubject methods have an extra slash

There is an extra slash before the identifier in some URIs returned from the code.

For subjects:

In [1]: ps = PandoraSubject(83)

In [2]: ps.subject_uri
Out[2]: 'http://pandora.nla.gov.au/subject//83'

For collections:

In [1]: pc = PandoraCollection(12022)

In [2]: pc.collection_uri
Out[2]: 'http://pandora.nla.gov.au/col//12022'

Some web clients and web servers will not redirect these URIs correctly, and calling code should not have to create a workaround.
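
A possible fix is to normalize the slashes wherever the prefix and identifier are joined. This is only a sketch under the assumption that the URIs are built by concatenating a prefix and an identifier; join_uri is a hypothetical helper, not an existing aiu function.

```python
def join_uri(prefix, identifier):
    """Join a URI prefix and an identifier with exactly one slash between them.

    Hypothetical helper: strips any trailing slash from the prefix and any
    leading slash from the identifier before joining.
    """
    return prefix.rstrip("/") + "/" + str(identifier).lstrip("/")


if __name__ == "__main__":
    print(join_uri("http://pandora.nla.gov.au/subject/", 83))
    print(join_uri("http://pandora.nla.gov.au/col", "/12022"))
```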

get_seed_metadata fails if called in the wrong order

The following function assumes that a user will have called function load_seed_metadata before calling get_seed_metadata.

https://github.com/oduwsdl/archiveit_utilities/blob/bdaa349623705dc2c70e1dc17316d8b433327e5b/aiu/archiveit_collection.py#L507-L513

If called in the wrong order, then get_seed_metadata throws the following exception.

Traceback (most recent call last):
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/services/errors.py", line 27, in handle_errors
    return function_name(urim, preferences)
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/services/memento.py", line 195, in seeddata
    output['metadata'] = sr.seed_metadata()
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/seedresource.py", line 152, in seed_metadata
    metadata = self.aic.get_seed_metadata(self.urir)['collection_web_pages']
  File "/Users/smj/.virtualenvs/MementoEmbed/lib/python3.7/site-packages/aiu/archiveit_collection.py", line 506, in get_seed_metadata
    d = self.seed_metadata["seeds"][uri]
KeyError: 'seeds'
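
One way to remove the ordering requirement is to have the getter trigger the load itself. The class below is a simplified, hypothetical stand-in that illustrates the pattern; it is not the ArchiveItCollection implementation, and the loader callable and the metadata_loaded flag are assumptions made for this sketch.

```python
class LazySeedMetadata:
    """Hypothetical stand-in showing a getter that loads on first use."""

    def __init__(self, loader):
        # loader: a callable (assumed for this sketch) returning {uri: metadata}
        self._loader = loader
        self.seed_metadata = {}
        self.metadata_loaded = False

    def load_seed_metadata(self):
        """Load the seed metadata once; later calls do nothing."""
        if not self.metadata_loaded:
            self.seed_metadata["seeds"] = self._loader()
            self.metadata_loaded = True

    def get_seed_metadata(self, uri):
        """Return metadata for a seed, loading it first if necessary."""
        self.load_seed_metadata()
        return self.seed_metadata["seeds"][uri]


if __name__ == "__main__":
    store = LazySeedMetadata(lambda: {"http://example.com/": {"title": "Example"}})
    # No explicit load_seed_metadata() call is needed before the getter.
    print(store.get_seed_metadata("http://example.com/"))
```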

Update AIU to provide NLA collection metadata

Collections at NLA are stored at https://webarchive.nla.gov.au/collection

They can contain mementos or sub-collections. This makes getting metadata more complicated than getting it from Archive-It.

Each collection has a numeric identifier, such as 15003. The collection identifier exists at the end of the URL for that collection, e.g., https://webarchive.nla.gov.au/collection/15003. NLA collection 15003 has nine sub-collections.

Collection 13742 contains mementos instead of sub-collections. Our solution needs to be able to paginate and load additional content from the collection page. The "Show 10 More" button loads more content, but, to save on resources like RAM, we want to access the content without having to use a headless browser.

Fortunately, after some analysis with Chrome's developer tools, I've discovered that we can acquire a JSON representation of the collection via a URL like https://webarchive.nla.gov.au/bamboo-service/collection/13742 where we replace the last part of the path with the collection identifier.

Using ArchiveItCollection as an example, we will need to create another class that allows anyone to acquire this content via Python. Once this is done, it can be called from Hypercane or MementoEmbed, as needed.
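
As a starting point, the JSON endpoint described above can be reached without a browser. The sketch below only builds the URI and decodes the response body; the function names are hypothetical, the standard library's urllib is used here for self-containment, and nothing is assumed about the structure of the returned JSON.

```python
import json
import urllib.request

# Prefix taken from the URL pattern described in this issue.
BAMBOO_COLLECTION_PREFIX = "https://webarchive.nla.gov.au/bamboo-service/collection/"


def collection_json_uri(collection_id):
    """Build the JSON URI for an NLA collection identifier."""
    return BAMBOO_COLLECTION_PREFIX + str(collection_id)


def fetch_collection_json(collection_id, opener=urllib.request.urlopen):
    """Fetch and decode the JSON representation of a collection.

    The opener parameter is an assumption of this sketch; it lets callers
    substitute a session-aware or fake opener, for example in tests.
    """
    with opener(collection_json_uri(collection_id)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(collection_json_uri(13742))
```

A class modeled on ArchiveItCollection could wrap fetch_collection_json and expose the decoded data through its own methods.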

Implement get_collectedby for PandoraCollection and PandoraSubject

I did not see a get_collectedby (or equivalent) for PandoraCollection or PandoraSubject. Such a function would need to descend into TEPs and acquire the names of the organizations that created these collections. This way, a resulting story would have access to this metadata for visualization.

AIU cannot process Pandora Subject 22

Any loop that examines TEP pages terminates if any page fails. It should instead recover gracefully when one of them fails. This occurs whenever get_metadata_from_tep is called. Because of this, Hypercane cannot process some Pandora Subjects, such as 22.

Here is a minimal working example:

In [1]: from aiu import PandoraSubject

In [2]: ps = PandoraSubject(22)

In [3]: memento_list = ps.list_memento_urims()
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in get_metadata_from_tep(res, data)
    106     try:
--> 107         json_data = json.loads(res.text)
    108         #print(json_data)

/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:

/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py in decode(self, s, _w)
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()

/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py in raw_decode(self, s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-69a1a780eaf3> in <module>
----> 1 memento_list = ps.list_memento_urims()

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in list_memento_urims(self)
    390         """Lists the memento URIMs of an NLA Trove collection."""
    391
--> 392         self.load_subject_metadata()
    393         #print(self.metadata["main"]["urims"])
    394         return self.metadata["main"]["urims"]

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in load_subject_metadata(self)
    358         if not self.metadata_loaded:
    359             soup = BeautifulSoup(self.firstpage_response.text, 'html5lib')
--> 360             self.metadata["main"] = extract_main_subject_data(soup,self.subject_id)
    361             #self.metadata["optional"] = extract_optional_collection_data(self.session.get(self.collection_json_uri))
    362             self.metadata_loaded = True

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in extract_main_subject_data(soup, subject_id)
    307         #print(tep_id)
    308         tep_json_uri = tep_json_prefix + tep_id
--> 309         tep_dic = get_metadata_from_tep(requests.get(tep_json_uri),data)
    310         #print(tep_json_uri)
    311         #Some tep urls give server error subject 12, https://webarchive.nla.gov.au/bamboo-service/tep/75101

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in get_metadata_from_tep(res, data)
    115                 return tep
    116             else:
--> 117                 raise ValueError("Could not find 'Problem accessing /bamboo-service/collection/' string in response")
    118
    119         except IndexError as e:

ValueError: Could not find 'Problem accessing /bamboo-service/collection/' string in response
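
A loop that recovers gracefully could catch the failure per TEP, record it, and move on. The following sketch shows that pattern in isolation; collect_tep_metadata and its fetch_tep argument are hypothetical and stand in for the loop around get_metadata_from_tep, whose real signature differs.

```python
import logging

logger = logging.getLogger(__name__)


def collect_tep_metadata(tep_ids, fetch_tep):
    """Gather metadata for each TEP, skipping any TEP whose fetch fails.

    fetch_tep is a hypothetical callable taking a TEP identifier and
    returning its metadata. json.JSONDecodeError is a subclass of
    ValueError, so both errors in the traceback above are caught here.
    """
    results = {}
    failed = []
    for tep_id in tep_ids:
        try:
            results[tep_id] = fetch_tep(tep_id)
        except (ValueError, KeyError, IndexError) as exc:
            # Record the failure and continue with the remaining TEPs.
            logger.warning("Skipping TEP %s: %s", tep_id, exc)
            failed.append(tep_id)
    return results, failed
```

Returning the list of failed identifiers lets calling code, such as Hypercane, decide whether to retry or to report partial results.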
