unt-libraries / portal-leading

The Portal to Texas History projects to support the IMLS-funded project, LIS Education and Data Science Integrated Network Group (LEADING)
This is a ticket to investigate and then start on a sample implementation of Frictionless Data for the datasets that we have been creating so far for this project. This will take a bit of digging to understand what this is and how it can be used for the project, but there is a ton of documentation and examples on GitHub.
https://www.youtube.com/watch?v=lWHKVXxuci0
https://carpentries-incubator.github.io/frictionless-data-agriculture/
https://github.com/Swiss-Polar-Institute/frictionless-data-packages
Those are a few links that are probably a good start. Feel free to put additional links in the comments of this ticket if you find some that are useful.
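If it looks promising, the frictionless-py package gives a quick way to try it against one of our existing files. A minimal sketch, assuming frictionless-py is installed and using a placeholder file name:

```python
# A minimal sketch using frictionless-py (pip install frictionless).
# "counties.csv" is a placeholder for one of our project datasets.
from frictionless import describe, validate

# Infer a schema (field names and types) and save it as a descriptor file
resource = describe("counties.csv")
resource.to_json("counties.resource.json")

# Validate the file against the inferred schema and report structural errors
report = validate("counties.csv")
print(report.valid)
```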
What I need is, for each UNTL-BS term, a range of keyword counts across all Portal records that share the browse term (e.g. for all the records that share the term "Agriculture," how many have 1 keyword, how many have 2 keywords, etc.). I have two possible avenues in mind for accomplishing this:
https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)+AND+dc_rights_access:public
&fq=dc_subject.UNTL-BS_facet:%22Agriculture%22
&facet=true
&facet.field=dc_subject.UNTL-BS_count
&facet.field=dc_subject.KWD_count
&facet.limit=-1
&facet.mincount=1
&rows=0
&wt=xml
In this case, a script would be needed to iterate through all 1,076 UNTL-BS terms, and extract and export the relevant count ranges.
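A rough sketch of that script, assuming the endpoint will return JSON when asked (wt=json; the query above uses xml) and with a placeholder term list standing in for the full set:

```python
# Sketch of option 1: one facet query per UNTL-BS term, collecting the
# keyword-count distribution for each. Assumes the endpoint honors wt=json;
# the term list here is a placeholder for all 1,076 UNTL-BS terms.
import csv
import time
import requests

SOLR = "https://digital.library.unt.edu/solrparse/raw/"
terms = ["Agriculture"]  # placeholder; load the full list from the untlbs json

def kwd_counts_for_term(term):
    params = {
        "q": "*:*",
        "fq": [
            "(aubrey_system:PTH OR untl_institution:UNTA) AND dc_rights_access:public",
            'dc_subject.UNTL-BS_facet:"%s"' % term,
        ],
        "facet": "true",
        "facet.field": "dc_subject.KWD_count",
        "facet.limit": -1,
        "facet.mincount": 1,
        "rows": 0,
        "wt": "json",
    }
    data = requests.get(SOLR, params=params).json()
    # Solr returns facet values as a flat [value, count, value, count, ...] list
    flat = data["facet_counts"]["facet_fields"]["dc_subject.KWD_count"]
    return dict(zip(flat[::2], flat[1::2]))

with open("untlbs-kwd-counts.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["untlbs_term", "kwd_count", "records"])
    for term in terms:
        for kwd_count, records in sorted(kwd_counts_for_term(term).items()):
            writer.writerow([term, kwd_count, records])
        time.sleep(1)  # pause between requests to be polite to the server
```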
https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)
&fq=dc_rights_access:public
&facet.pivot=dc_subject.UNTL-BS_facet,dc_subject.KWD_count
&facet=true
&facet.field=dc_subject.UNTL-BS_facet
&rows=0
&wt=xml
&indent=true
In this case, a script would be needed only to extract and export the relevant information - but it might be more difficult because the layout of the data in this query's output is more complicated than the single-term query.
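To get a feel for the extraction, here is a sketch of flattening the pivot output, assuming the query above was rerun with wt=json and the response saved to a file:

```python
# Sketch of flattening option 2's pivot facets into rows. Assumes the query
# above was run with wt=json and saved to pivot.json; the pivot key mirrors
# the facet.pivot parameter.
import csv
import json

with open("pivot.json") as f:
    data = json.load(f)

pivots = data["facet_counts"]["facet_pivot"]["dc_subject.UNTL-BS_facet,dc_subject.KWD_count"]

with open("untlbs-kwd-pivot.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["untlbs_term", "kwd_count", "records"])
    for term_node in pivots:
        for kwd_node in term_node.get("pivot", []):
            writer.writerow([term_node["value"], kwd_node["value"], kwd_node["count"]])
```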
Should be pretty straightforward.
We need a tabular (tsv or csv) file listing the 254 counties in Texas, along with a few other bits of data.
I suggest we include the following:
Note: You can verify against this page - https://texashistory.unt.edu/search/?q=&t=fulltext&sort=added_d&fq= then go to the county facet and type "TX". That should show you how Texas counties are represented in this system (Aubrey).
This can be added to the GitHub repository.
In the untl-bs folder let's go ahead and create a data directory and move the json files we are creating into that directory. This way we will have a code and a data directory to better sort things.
You can do this in the web-based GitHub interface, but it might be easier with GitHub Desktop or the command line.
If you haven't used git on the command line, I suggest working through this Software Carpentry Lesson - https://swcarpentry.github.io/git-novice/
Would be interesting to take a look at this tool and see if it could be helpful for visualizing some of this collection data in different ways.
Converting this pdf into a tabular (csv or tsv) dataset that can be used in something like R or Pandas would be useful.
Texas County Population every ten years - https://texasalmanac.com/sites/default/files/images/topics/ctypophistweb2010.pdf
It can be added to this github repository and we will need to add a citation to the source of the data.
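One possible route is the tabula-py wrapper (an assumption on my part; camelot or hand transcription would also work), sketched here:

```python
# Sketch: extract the population tables from the PDF with tabula-py
# (pip install tabula-py; needs Java). Whether the tables extract cleanly
# is an assumption; the output will likely need manual cleanup.
import pandas as pd
import tabula

tables = tabula.read_pdf("ctypophistweb2010.pdf", pages="all")

# Stitch the per-page tables together and write a single tsv for R/Pandas
combined = pd.concat(tables, ignore_index=True)
combined.to_csv("texas-county-population.tsv", sep="\t", index=False)
```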
Create a test Solr query that grabs all subject terms (with whatever qualifier) collocated with one specific UNTL-BS term.
Create a script (or modify an existing script?) to run the Solr query and export the resulting terms as individual documents in a corpus.
[prep for clustering/topic modeling, Question 3 Approach C]
I've already done this through edit.texashistory.unt.edu, but the numbers were not entirely accurate (the Portal filter excludes some records that do display on the Portal). I should replicate that same query using Solr.
I should be able to use pivot facets in Solr to examine the range of unique values in several key fields (i.e. dc_type, untl_collection, and untl_institution) across records that share specific UNTL-BS terms of interest. This will help me identify any terms that have a high actual or proportional occurrence count because of improper or imbalanced usage.
I think it makes sense to use the 2019-All dataset for our work.
This is from the TSLAC Public Library Accreditation and Statistics program - https://www.tsl.texas.gov/ldn/statistics
I suggest we create a thinner tabular dataset (tsv or csv) for use with R or Pandas.
From a quick look, a few fields stood out as the most important, with County being the field that we will use to match this against our other datasets.
This can go into this repository with a citation to the datasets that we used.
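A sketch of that thinning step; the file and column names are guesses, so check them against the actual 2019-All download first:

```python
# Sketch: trim the TSLAC 2019-All data down to the columns we need.
# File name and column names are assumptions; verify against the real
# download from https://www.tsl.texas.gov/ldn/statistics before running.
import pandas as pd

df = pd.read_excel("2019-All.xlsx")

keep = ["Library Name", "County", "Population Served"]  # hypothetical columns
thin = df[keep].copy()
thin["County"] = thin["County"].str.strip().str.title()  # normalize the join key
thin.to_csv("tslac-2019-thin.tsv", sep="\t", index=False)
```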
Review the json structure and load a copy into a repository folder called untl-bs, in a file with the date in the format untlbs-yyyy-mm-dd.json.
You can download the existing json data from this url.
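A small sketch of that step; since the url isn't shown here, the one in the code is a placeholder:

```python
# Sketch: save a dated copy of the untlbs json into the untl-bs folder.
# UNTLBS_URL is a placeholder for the url mentioned above.
from datetime import date
import requests

UNTLBS_URL = "https://example.org/untlbs.json"  # placeholder url

resp = requests.get(UNTLBS_URL)
resp.raise_for_status()
with open("untl-bs/untlbs-%s.json" % date.today().isoformat(), "wb") as f:
    f.write(resp.content)
```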
Similar to #5, this URL provides the Solr output from the Portal related to photographs held.
This should be converted into a tabular data format that can be used for subsequent analysis.
It should be noted that there will be non-Texas counties represented in this data, so they will need to be removed from the final dataset because we are focusing on Texas.
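A sketch of that cleanup, assuming the 254-county tsv from the earlier issue is available to check against (file and column names are placeholders):

```python
# Sketch: drop non-Texas counties from the parsed photographs data, using
# the 254-county list as the authority. File/column names are placeholders.
import pandas as pd

counties = pd.read_csv("texas-counties.tsv", sep="\t")
photos = pd.read_csv("photographs-by-county.tsv", sep="\t")

texas_only = photos[photos["county"].isin(set(counties["county"]))]
texas_only.to_csv("photographs-texas-only.tsv", sep="\t", index=False)
```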
Is there a way for me to generate collocation counts as percentages of a term's total occurrence count, rather than as straight numbers? That would help me identify the top most-collocated pairs across the entire Portal much more easily.
(Does the fact that, when querying in Solr, the UNTL-BS term being queried always appears as the top value in the facet list help at all? Is there a way of automatically dividing the collocation count for each other term by the count on the first term in the list?)
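I don't believe Solr will do the division itself, but it is trivial downstream; a sketch, assuming the facet list has been flattened to (term, count) pairs with the queried term first, as described above:

```python
# Sketch: turn collocation counts into percentages of the queried term's
# total occurrence count. The pairs below are made-up sample values; the
# queried term's own count always tops the facet list.
facets = [("Agriculture", 5200), ("Farms", 1300), ("Ranching", 910)]  # sample

total = facets[0][1]  # the queried term's own count
for term, count in facets[1:]:
    print("%s\t%.1f%%" % (term, count / total * 100))
```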
Chronicling America maintains a list of a large number of newspaper titles from around the US:
https://chroniclingamerica.loc.gov/search/titles/
You can limit to the 5,863 titles that they have records of from Texas - https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas
By adding format=json to the URL, you can get a json dataset that we can try to use for our work:
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&format=json
We can also adjust the number of rows in the results and the page that we are on.
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=1
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=2
One of the things to keep in mind when working with these kinds of systems is to "not be greedy." For example, we have 5,863 titles to grab here. A greedy option would be to set rows=5863 and try to grab everything in one request. That might be just fine, but it may also be an unplanned sort of request for their system.
I will generally break this kind of request into a shorter series of requests, say 100 or maybe 250 per request, with a few seconds of waiting between them. It might take a few minutes on my end, but it usually won't make the data providers upset because someone is hammering on their system when they don't need to.
It will be good to have the output finally end up in a tabular (tsv or csv) form that we can use in R or Pandas. We will need to do a bit of processing to get the info we want. From a quick look, I think we might want
You will also want to do some work with the place field in the records to try to extract the "County" for each title so that things align.
I suggest creating a directory in the repository called newspaper-titles and including both the dataset and any scripts used in making the final tabular dataset from the json.
One final suggestion: download the data as .json files and then write any tools/scripts so they work off of those files as input, instead of having the script make the requests to the database each time.
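Putting those suggestions together, here is a sketch of the polite, paged download that saves each page as its own .json file; I'm assuming the json response reports the full result count as totalItems:

```python
# Sketch: page through the Chronicling America Texas titles search politely,
# 100 rows at a time with a pause between requests, saving each raw page
# to its own .json file for later processing.
import time
import requests

BASE = "https://chroniclingamerica.loc.gov/search/titles/results/"
ROWS = 100
page = 1

while True:
    resp = requests.get(BASE, params={"state": "Texas", "rows": ROWS,
                                      "format": "json", "page": page})
    resp.raise_for_status()
    with open("newspaper-titles/titles-page-%03d.json" % page, "w") as f:
        f.write(resp.text)
    # totalItems is assumed to be the full result count in the response
    if page * ROWS >= resp.json()["totalItems"]:
        break
    page += 1
    time.sleep(3)  # a few seconds between requests so we don't hammer the API
```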
The following URL shows the Solr output of the Texas Digital Newspaper Program (TDNP)
This can be parsed into a tabular dataset that we can use for further analysis.
While most of the values are going to be Texas counties, they should be checked because there is also the possibility that non-Texas county information could be included.
Once I've got the list of occurrence counts from issue #13 I can run a basic statistical analysis: min/max, mean, median, and mode; standard deviation; interquartile range; and/or a five-number summary displayed as a box-and-whisker plot.
[Question 1, Approach A]
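A quick sketch of those summary statistics over the issue #13 counts, assuming they were exported one number per line:

```python
# Sketch: basic statistics over the occurrence counts from issue #13,
# assuming they were exported one count per line to occurrence-counts.txt.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("occurrence-counts.txt", header=None)[0]

print(counts.describe())  # count, mean, std, min, quartiles, max
print("mode:", counts.mode().tolist())
print("IQR:", counts.quantile(0.75) - counts.quantile(0.25))

counts.plot.box()  # five-number summary as a box-and-whisker plot
plt.savefig("occurrence-counts-boxplot.png")
```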
Starting off with Jupyter Notebooks as a framework for building, documenting, and sharing Python code, let's take a look at building some choropleth maps of the counties in Texas using the different datasets we have created.
Some example Python for this:
https://docs.bokeh.org/en/latest/docs/gallery/texas.html
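A sketch patterned on that gallery example; it needs bokeh plus its sample county shapes (installed via the bokeh sampledata command), and the values dict is a placeholder for one of our per-county datasets:

```python
# Sketch of a Texas county choropleth, patterned on the linked Bokeh gallery
# example. Requires bokeh's sample data for the county outlines; the values
# dict is a placeholder to be replaced by one of our real datasets.
from bokeh.models import LinearColorMapper
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show
from bokeh.sampledata.us_counties import data as counties

tx = [c for c in counties.values() if c["state"] == "tx"]
values = {c["name"]: 1.0 for c in tx}  # placeholder per-county values

source = dict(
    x=[c["lons"] for c in tx],
    y=[c["lats"] for c in tx],
    name=[c["name"] for c in tx],
    value=[values.get(c["name"], 0.0) for c in tx],
)

p = figure(title="Texas counties", tools="hover",
           tooltips=[("County", "@name"), ("Value", "@value")])
p.patches("x", "y", source=source, line_color="white", line_width=0.5,
          fill_color={"field": "value",
                      "transform": LinearColorMapper(palette=Viridis6)})
show(p)
```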
Might be useful when we get to sharing: https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/01-sharing-github/ (the first part is about setting up an account/repo, which we have already done).
For each of the 16 main headings, and for any additional terms on the terms of interest list, run a Solr query to determine how many of these terms owe their high occurrence counts to collocation with broader or narrower terms within the same branch. This may help eliminate some terms from later analysis.
Once I've got the list of counts from issue #13, I can use that to redo my original spread vs. usage analysis:
For each of the 14 main UNTL-BS categories:
Calculate the occurrence count and number of narrower terms as a percentage of the total and compare the two
Starting on the first level of narrower terms:
Determine which "branches" of the UNTL Browse Structure are "top-heavy" (wider terms have higher occurrence counts than narrower terms) or "off-balance" (one narrower term has a high occurrence count relative to its siblings).
For now, I'll have to do the analysis in Excel.
[Question 1, Approach A]
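The percentage comparison itself is simple enough to script once we move off Excel; a sketch with made-up placeholder counts:

```python
# Sketch of the category-level comparison. The counts here are made-up
# placeholders, not real Portal numbers.
categories = {
    # category: (occurrence_count, narrower_term_count)
    "Agriculture": (5200, 48),
    "Education": (3100, 61),
}

total_occ = sum(o for o, _ in categories.values())
total_nar = sum(n for _, n in categories.values())

for name, (occ, nar) in categories.items():
    # A big gap between the two percentages flags a top-heavy branch
    print("%s\toccurrence %.1f%%\tnarrower terms %.1f%%"
          % (name, occ / total_occ * 100, nar / total_nar * 100))
```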
We need to create datasets for the following UNTL-BS fields (supplied below by @50jonesh).
We want the output tab delimited with:
aubrey_identifier
display_title
dc_description
dc_subject
We know that there will be an uneven number of columns per row across the dataset, and that shouldn't cause an issue for the topic modeling.
We will give Mallet a shot first for topic modeling - https://mimno.github.io/Mallet/topics.html
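A sketch of producing that file from a saved Solr JSON response; the input file name is a placeholder, and multi-valued dc_subject entries become the extra trailing columns noted above:

```python
# Sketch: write the four fields above to a tab-delimited file for Mallet.
# solr-docs.json is a placeholder for a saved wt=json Solr response; rows
# get one extra column per dc_subject value, hence the uneven widths.
import csv
import json

with open("solr-docs.json") as f:
    docs = json.load(f)["response"]["docs"]

with open("untlbs-topics.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for doc in docs:
        subjects = doc.get("dc_subject", [])
        if isinstance(subjects, str):
            subjects = [subjects]
        writer.writerow([doc.get("aubrey_identifier", ""),
                         doc.get("display_title", ""),
                         doc.get("dc_description", "")] + subjects)
```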
Write and run a Solr query to capture a list of all values visible on the Portal coming from a field we think is likely to have a lot of term redundancy with UNTL-BS terms (e.g. Coverage). Export the list of terms for comparison with the UNTL-BS terms.