
portal-leading's People

Contributors

50jonesh, cozettec, hikingpanda68, vphill


Forkers

cozettec, brimwats

portal-leading's Issues

Investigate and implement Frictionless Data standard to describe datasets

This is a ticket to investigate and then start with a sample implementation of Frictionless Data for the datasets that we have been creating so far for this project. This will take a bit of digging to understand what this is and how it can be used for the project, but there is a ton of documentation and examples on GitHub.

https://frictionlessdata.io/

https://www.youtube.com/watch?v=lWHKVXxuci0

https://carpentries-incubator.github.io/frictionless-data-agriculture/

https://github.com/Swiss-Polar-Institute/frictionless-data-packages

Those are a few links that are probably a good start. Feel free to put additional links in the comments of this ticket if you find some that are useful.

Create Python script to automate running keyword count queries and exporting the results

What I need is, for each UNTL-BS term, a range of keyword counts across all Portal records that share the browse term (e.g. for all the records that share the term "Agriculture," how many have 1 keyword, how many have 2 keywords, etc.). I have two possible avenues in mind for accomplishing this:

  1. The first is to run a series of queries like the one below, each of which would capture the keyword counts for a single UNTL-BS term.

https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)+AND+dc_rights_access:public
&fq=dc_subject.UNTL-BS_facet:%22Agriculture%22
&facet=true
&facet.field=dc_subject.UNTL-BS_count
&facet.field=dc_subject.KWD_count
&facet.limit=-1
&facet.mincount=1
&rows=0
&wt=xml

In this case, a script would be needed to iterate through all 1,076 UNTL-BS terms, and extract and export the relevant count ranges.
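Here is a minimal sketch of that script, assuming the 1,076 terms live in a plain text file (one per line; the file name untl_bs_terms.txt is just a placeholder) and that the solrparse endpoint will also return JSON when asked with wt=json:

import csv
import json
import time
import urllib.parse
import urllib.request

SOLR_URL = "https://digital.library.unt.edu/solrparse/raw/"

def keyword_counts_for_term(term):
    """Return (keyword_count, record_count) pairs for one UNTL-BS term."""
    params = {
        "q": "*:*",
        "fq": [
            "(aubrey_system:PTH OR untl_institution:UNTA) AND dc_rights_access:public",
            'dc_subject.UNTL-BS_facet:"%s"' % term,
        ],
        "facet": "true",
        "facet.field": "dc_subject.KWD_count",
        "facet.limit": "-1",
        "facet.mincount": "1",
        "rows": "0",
        "wt": "json",  # assumes the endpoint honors wt=json like stock Solr
    }
    url = SOLR_URL + "?" + urllib.parse.urlencode(params, doseq=True)
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # Solr returns facet_fields as a flat [value, count, value, count, ...] list.
    flat = data["facet_counts"]["facet_fields"]["dc_subject.KWD_count"]
    return list(zip(flat[0::2], flat[1::2]))

with open("untl_bs_terms.txt") as infile:  # placeholder: one term per line
    terms = [line.strip() for line in infile if line.strip()]

with open("keyword_counts_by_term.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["untl_bs_term", "keyword_count", "record_count"])
    for term in terms:
        for kwd_count, record_count in keyword_counts_for_term(term):
            writer.writerow([term, kwd_count, record_count])
        time.sleep(1)  # be polite: one request per second across the 1,076 terms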

  2. The second is to run a single query that captures the keyword count ranges for every UNTL-BS term at the same time:

https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)
&fq=dc_rights_access:public
&facet.pivot=dc_subject.UNTL-BS_facet,dc_subject.KWD_count
&facet=true
&facet.field=dc_subject.UNTL-BS_facet
&rows=0
&wt=xml
&indent=true

In this case, a script would be needed only to extract and export the relevant information, but it might be more difficult because the layout of the data in this query's output is more complicated than that of the single-term query.
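If we go with this option, the parsing is a nested loop over Solr's facet_pivot structure. A rough sketch, assuming the response was requested with wt=json and saved locally (the file name pivot_output.json is a placeholder):

import csv
import json

# Load a saved copy of the pivot-facet response (requested with wt=json).
with open("pivot_output.json") as infile:
    data = json.load(infile)

pivot_key = "dc_subject.UNTL-BS_facet,dc_subject.KWD_count"
buckets = data["facet_counts"]["facet_pivot"][pivot_key]

with open("keyword_counts_by_term.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["untl_bs_term", "keyword_count", "record_count"])
    for term_bucket in buckets:
        # Each top-level bucket is one UNTL-BS term; its nested "pivot" list
        # breaks that term's records down by keyword count.
        for kwd_bucket in term_bucket.get("pivot", []):
            writer.writerow([term_bucket["value"],
                             kwd_bucket["value"],
                             kwd_bucket["count"]])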

Texas County dataset

Should be pretty straightforward.

We need a tabular (tsv or csv) file listing the 254 counties in Texas, along with a few other bits of data.

I suggest we include the following

  • County Name (with 'County' added to the end, like Denton County)
  • FIPS Code
  • geonames ID
  • wikidata ID
  • UNTL Place Name String
  • UNTL County String (see note)

Note: You can verify against this page - https://texashistory.unt.edu/search/?q=&t=fulltext&sort=added_d&fq= - then go to the county facet and type "TX". That should show you how each Texas county is represented in this system (Aubrey).

This can be added to the GitHub repository.

Create a data directory in the untl-bs folder.

In the untl-bs folder, let's go ahead and create a data directory and move the JSON files we are creating into that directory.

This way we will have a code and a data directory to better sort things.

You can do this in the web-based GitHub interface, but it might be easier with GitHub Desktop or the command line.

If you haven't used git on the command line, I suggest working through this Software Carpentry Lesson - https://swcarpentry.github.io/git-novice/

Capture and export all collocated subject values for one UNTL-BS term

Create a test Solr query that grabs all subject terms (with whatever qualifier) collocated with one specific UNTL-BS term.

Create a script (or modify an existing script?) to run the Solr query and export the resulting terms as individual documents in a corpus.

[prep for clustering/topic modeling, Question 3 Approach C]
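A starting point for the export script might look like the sketch below; the fl field list, the output layout, and the single hard-coded term are all assumptions to adjust once we see real results:

import json
import os
import urllib.parse
import urllib.request

SOLR_URL = "https://digital.library.unt.edu/solrparse/raw/"
TERM = "Agriculture"                      # one UNTL-BS term of interest
CORPUS_DIR = "corpus_" + TERM.lower()     # one text file per record

params = {
    "q": "*:*",
    "fq": [
        "(aubrey_system:PTH OR untl_institution:UNTA) AND dc_rights_access:public",
        'dc_subject.UNTL-BS_facet:"%s"' % TERM,
    ],
    "fl": "aubrey_identifier,dc_subject",  # assumed returnable fields
    "rows": "10000",                       # paging may be needed for big terms
    "wt": "json",
}
url = SOLR_URL + "?" + urllib.parse.urlencode(params, doseq=True)
with urllib.request.urlopen(url) as response:
    docs = json.load(response)["response"]["docs"]

os.makedirs(CORPUS_DIR, exist_ok=True)
for doc in docs:
    ident = doc.get("aubrey_identifier", "")
    if isinstance(ident, list):
        ident = ident[0]
    subjects = doc.get("dc_subject", [])
    if isinstance(subjects, str):
        subjects = [subjects]
    # One corpus document per record: each collocated subject value on its own line.
    with open(os.path.join(CORPUS_DIR, ident + ".txt"), "w") as out:
        out.write("\n".join(subjects))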

Metadata Variability Analysis

I should be able to use pivot facets in Solr to examine the range of unique values in several key fields (i.e. dc_type, untl_collection, and untl_institution) across records that share specific UNTL-BS terms of interest. This will help me identify any terms that have a high actual or proportional occurrence count because of improper or imbalanced usage.

Texas Library List

I think it makes sense to use the 2019-All dataset for our work.

This is from the TSLAC Public Library Accreditation and Statistics program - https://www.tsl.texas.gov/ldn/statistics

I suggest we create a thinner tabular dataset (tsv or csv) for use with R or Pandas.

With a quick look I could see the following fields:

  • Library Name
  • Population of the Legal Service Area
  • Legal Establishment
  • Region?
  • County

all being the most important, with County being the field that we will use to match this against our other datasets.

This can go into this repository with a citation to the datasets that we used.
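A minimal pandas sketch of that thinning step; the input file name and the column headers are placeholders to swap for whatever the 2019-All download actually uses:

import pandas as pd

# Placeholder file name for the TSLAC 2019-All download (CSV export of the spreadsheet).
libraries = pd.read_csv("tslac_2019_all.csv")

# Placeholder column names; swap in the actual headers from the TSLAC file.
keep = {
    "Library Name": "library_name",
    "Population of the Legal Service Area": "legal_service_area_population",
    "Legal Establishment": "legal_establishment",
    "Region": "region",
    "County": "county",
}

thin = libraries[list(keep)].rename(columns=keep)

# Normalize county values so they line up with our other county-level datasets,
# e.g. "DENTON" -> "Denton County" (adjust if the source already includes "County").
thin["county"] = thin["county"].str.strip().str.title() + " County"

thin.to_csv("texas_libraries_2019.tsv", sep="\t", index=False)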

Portal to Texas History Photographic Dataset - Items by County and Decade

Similar to #5, this URL provides the Solr output from the Portal related to photographs held.

https://digital.library.unt.edu/solrparse/raw/?q=(aubrey_system:PTH%20OR%20untl_institution:UNTA)%20AND%20dc_rights_access:public&fq=dc_type:image_photo&&facet=true&rows=0&facet.pivot.mincount=1&facet.pivot=str_location_county,untl_decade&facet.limit=-1&&fq=str_location_state:Texas

This should be converted into a tabular data format that can be used for subsequent analysis.

It should be noted that there are going to be non-Texas counties represented in this data so they will need to be removed for the final dataset because we are focusing on Texas.

Turn collocation counts into percentages

Is there a way for me to generate collocation counts as percentages of a term's total occurrence count, rather than as straight numbers? That would help me identify the top most-collocated pairs across the entire Portal much more easily.

(Does the fact that, when querying in Solr, the UNTL-BS term being queried always appears as the top value in the facet list help at all? Is there a way of automatically dividing the collocation count for each other term by the count on the first term in the list?)
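One option, building on that observation: since the queried term matches every record in the result set, its count is the largest and (barring ties) sits first in the count-sorted facet list, so it can serve as the denominator. A rough sketch over a saved facet response, with placeholder file and field names:

import json

# Saved JSON output of a single-term collocation query (placeholder file name).
with open("collocation_agriculture.json") as infile:
    data = json.load(infile)

# Solr facet_fields come back as a flat [value, count, value, count, ...] list,
# sorted by count, so the queried term should be the first (largest) entry.
flat = data["facet_counts"]["facet_fields"]["dc_subject.UNTL-BS_facet"]
pairs = list(zip(flat[0::2], flat[1::2]))

queried_term, total = pairs[0]
for term, count in pairs[1:]:
    print("%s\t%s\t%.2f%%" % (queried_term, term, 100.0 * count / total))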

Texas Newspapers from Chronicling America

Chronicling America maintains a list of a large number of newspaper titles from around the US:

https://chroniclingamerica.loc.gov/search/titles/

You can limit to the 5,863 titles from Texas that they have records for - https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas

By adding format=json to the URL, you can get a JSON dataset that we can try to use for our work.
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&format=json

We can also adjust the number of rows in the results and the page that we are on.
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=1
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=2

One of the things to keep in mind when working with these kinds of systems is to "not be greedy." For example, we have 5,863 titles to grab from this. A greedy option would be to set rows=5863 and try to grab everything in one request. That may be just fine, but it also may be an unplanned sort of request for their system.

I will generally try to break this kind of request into a series of smaller requests, say 100 or maybe 250 per request, and then just do a sequence of requests with a few seconds of waiting between them. It might take a few minutes on my end, but it usually won't upset the data providers by hammering on a system when we don't need to.
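Something along those lines might look like this sketch, which assumes the JSON response carries an "items" list and a "totalItems" count (worth confirming against one saved page):

import json
import math
import time
import urllib.request

BASE = "https://chroniclingamerica.loc.gov/search/titles/results/"
ROWS = 100     # titles per request, kept small on purpose
DELAY = 3      # seconds to wait between requests

def fetch_page(page):
    url = "%s?state=Texas&rows=%d&format=json&page=%d" % (BASE, ROWS, page)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Grab page 1 first to find out how many pages there are in total.
first = fetch_page(1)
total_items = first["totalItems"]          # assumed key; confirm in a saved response
pages = math.ceil(total_items / ROWS)

for page in range(1, pages + 1):
    data = first if page == 1 else fetch_page(page)
    # Save each raw page so later scripts work from files, not live requests.
    with open("chronam_tx_page_%03d.json" % page, "w") as out:
        json.dump(data, out)
    time.sleep(DELAY)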

It will be good to have the output finally end up in a tabular (tsv or csv) form that we can use in R or Pandas. We will need to do a bit of processing to get the info we want. From a quick look, I think we might want:

  • Title
  • LCCN (this is the unique identifier for these titles in this system, very useful)
  • start_year
  • end_year
  • frequency
  • language
  • place

You will also want to do some work with the place field in the records to try to extract the "County" for each title so that we can have things align.

I suggest creating a directory in the repository called newspaper-titles and include both the dataset as well as any scripts used in making the final tabular dataset from the json.

One final suggestion: download the data as .json files and then write any tools/scripts for working with the data to work off of those files as input, instead of having the script make requests against the live system each time. Just a suggestion.
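A sketch of that processing step, reading the saved .json pages and writing the thinner tabular file; the item keys and the "Texas--County--City" shape of the place values are assumptions to verify against a real record:

import csv
import glob
import json
import os

def as_list(value):
    """Treat missing values as empty and single strings as one-item lists."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

def county_from_places(places):
    """Pull a county out of place values assumed to look like 'Texas--Denton--Denton'."""
    for place in places:
        parts = place.split("--")
        if len(parts) >= 2 and parts[0] == "Texas":
            return parts[1] + " County"
    return ""

os.makedirs("newspaper-titles", exist_ok=True)
with open("newspaper-titles/chronam_texas_titles.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["title", "lccn", "start_year", "end_year",
                     "frequency", "language", "place", "county"])
    for path in sorted(glob.glob("chronam_tx_page_*.json")):
        with open(path) as infile:
            page = json.load(infile)
        for item in page.get("items", []):
            places = as_list(item.get("place"))
            writer.writerow([
                item.get("title", ""),
                item.get("lccn", ""),
                item.get("start_year", ""),
                item.get("end_year", ""),
                item.get("frequency", ""),
                "; ".join(as_list(item.get("language"))),
                "; ".join(places),
                county_from_places(places),
            ])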

Texas Digital Newspaper Program - Issues by County and Decade dataset

The following URL shows the Solr output of the Texas Digital Newspaper Program (TDNP):

https://digital.library.unt.edu/solrparse/raw/?q=*:*&fq=untl_collection:TDNP&fq=str_location_state:Texas&facet=true&rows=0&facet.pivot.mincount=1&facet.pivot=str_location_county,untl_decade&facet.limit=-1

This can be parsed into a tabular dataset that we can use for further analysis.

While most of the values are going to be Texas counties, they should be checked because there is also the possibility that non-Texas county information could be included.

Basic statistical analysis of occurrence values

Once I've got the list of occurrence counts from issue #13 I can run a basic statistical analysis: min/max, mean, median and mode, standard deviation, interquartile range, and/or a 5 number summary displayed as a box and whisker plot.

[Question 1, Approach A]
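A quick sketch of those summary statistics with pandas and matplotlib, assuming the counts from #13 end up in a one-column tab-delimited file (the file and column names here are placeholders):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file/column names for the occurrence-count list from issue #13.
counts = pd.read_csv("occurrence_counts.tsv", sep="\t")["count"]

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
summary = {
    "min": counts.min(),
    "max": counts.max(),
    "mean": counts.mean(),
    "median": counts.median(),
    "mode": counts.mode().tolist(),
    "std": counts.std(),
    "IQR": q3 - q1,
}
for name, value in summary.items():
    print(name, value)

# Five-number summary displayed as a box-and-whisker plot.
counts.plot.box()
plt.savefig("occurrence_counts_boxplot.png", dpi=150)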

Texas choropleth map notebook.

Starting off with Jupyter Notebooks as a framework for building/documenting/sharing Python code, let's take a look at building some choropleth maps of the counties in Texas using the different datasets we have created.

Some example Python for this:
https://docs.bokeh.org/en/latest/docs/gallery/texas.html

This might be useful when we get to sharing - https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/01-sharing-github/ - the first part is about setting up an account/repo, which we have already.
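As one possible starting point for the notebook - an alternative to the Bokeh gallery example linked above - here is a rough GeoPandas sketch; the boundary file, TSV name, and column names are all placeholders for whatever county boundary data and dataset we end up using:

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder inputs: a Texas county boundary file (e.g. a Census TIGER/Line export)
# and one of the county-level TSV datasets from the other tickets.
counties = gpd.read_file("tx_counties.shp")
data = pd.read_csv("photos_by_county.tsv", sep="\t")

# Join on a shared county-name column; both column names here are placeholders.
merged = counties.merge(data, left_on="NAME", right_on="county", how="left")

ax = merged.plot(column="item_count", cmap="Blues", legend=True)
ax.set_axis_off()
ax.set_title("Portal items by Texas county")
plt.savefig("texas_choropleth.png", dpi=150)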

Parent/Child Collocation

For each of the 16 main headings, and for any additional terms on the terms of interest list, run a Solr query to determine how many of these terms owe their high occurrence counts to collocation with broader or narrower terms within the same branch. This may help eliminate some terms from later analysis.

Redo "spread" vs. usage analysis using a Solr query as starting point

Once I've got the list of counts from issue #13, I can use that to redo my original spread vs. usage analysis.

For each of the 14 main UNTL-BS categories:
Calculate the occurrence count and number of narrower terms as a percentage of the total and compare the two.

Starting on the first level of narrower terms:
Determine which "branches" of the UNTL Browse Structure are "top-heavy" (wider terms have higher occurrence counts than narrower terms) or "off-balance" (one narrower term has a high occurrence count relative to its siblings).

For now, I'll have to do the analysis in Excel.

[Question 1, Approach A]

Topic Model Dataset Creation

We need to create datasets for the following UNTL-BS fields (supplied below by @50jonesh).

We want the output tab-delimited, with:

aubrey_identifier
display_title
dc_description
dc_subject

We know that there will be an uneven number of columns per row across the dataset, and that shouldn't cause an issue for the topic modeling.

We will give Mallet a shot first for topic modeling - https://mimno.github.io/Mallet/topics.html
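A sketch of writing that tab-delimited file from a saved Solr JSON response; the query itself isn't shown and the single- vs. multi-valued shape of the stored fields is an assumption to check, but the uneven row lengths are expected:

import csv
import json

def as_list(value):
    """Treat missing values as empty and single strings as one-item lists."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

# Saved Solr response (wt=json) for the records of interest; placeholder file name.
with open("topic_model_source.json") as infile:
    docs = json.load(infile)["response"]["docs"]

with open("topic_model_dataset.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    for doc in docs:
        idents = as_list(doc.get("aubrey_identifier"))
        titles = as_list(doc.get("display_title"))
        row = ([idents[0] if idents else "", titles[0] if titles else ""]
               + as_list(doc.get("dc_description"))
               + as_list(doc.get("dc_subject")))
        # Row lengths vary with how many descriptions/subjects a record has;
        # that unevenness is expected for this dataset.
        writer.writerow(row)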
