

mag_sample

This repository is an extract from a project that uses data from Microsoft Academic Graph (MAG) and ProQuest. The programs run on a remote computer where the data are stored outside the repository. The paths to the data are defined in src/analysis/setup.R and src/dataprep/helpers/variables.py.

Directory structure

  • src/: data preparation and record linking; analysis of the publication careers of scientists.
  • output/: destination for tables and figures generated in src/.
  • snapshots/: contains files to reproduce the environments (for now, yml files for conda).

Contents

directory src/dataprep

The pipeline.sh script consecutively calls the scripts for

  • setting up the sqlite database with MAG data
  • preparing additional tables for analysis
  • loading the ProQuest data into the database
  • linking records between MAG and ProQuest
  • making some reports about the data and linking quality

directory src/analysis

The pipeline.sh script consecutively calls the scripts for

  • comparing graduation trends in ProQuest data with official statistics
  • assessing the quality of current links
  • plotting publication dynamics over the career
  • analyzing career duration and becoming an advisor later in the career

Supporting scripts are in setup.R and in helpers/.

Notes

See open issues for some features that are currently lacking.

  • The Python helpers are stored in a different location; we need a clean and simple way to make them accessible across the repository.
  • There are some redundancies between the R scripts in analysis and dataprep because the scripts originally come from two different repositories.

mag_sample's People

Contributors: f-hafner, chrished, monadap

Forkers: qcx201

mag_sample's Issues

improve first/last name split by using language-specific rules

For example, for Spanish we currently have:

firstname : juan
lastname : gonzalez
middlename : eugenio iglesias

However, the main last name is "iglesias" (the first last name).

Proposal: use https://nationalize.io to predict which country/language a name is from and implement specific rules for those.

Caveat: for Spanish names, sometimes people give just the first last name and sometimes both, so it is not obvious how to handle this automatically.
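A minimal sketch of what such a rule could look like once a name is predicted to be Spanish. The function name and the assumption that the last two tokens are the paternal and maternal surnames are mine, not from the codebase:

```python
def split_spanish_name(full_name):
    """Split a Spanish name, treating the first surname as the main last name.

    Assumes the final two tokens are the paternal and maternal surnames;
    real data would first need the language prediction step described above.
    """
    parts = full_name.lower().split()
    if len(parts) < 3:
        return {"firstname": parts[0], "lastname": parts[-1], "middlename": ""}
    return {
        "firstname": parts[0],
        "lastname": parts[-2],  # first (paternal) surname is the main last name
        "middlename": " ".join(parts[1:-2] + [parts[-1]]),
    }

print(split_spanish_name("Juan Eugenio Iglesias Gonzalez"))
```

For the example above this yields lastname "iglesias" instead of "gonzalez". It does not solve the caveat that some people report only one surname.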

Adapt to new modular structure

pipeline.sh will not work as-is: calling a Python script that sits in a subdirectory but imports a module outside that subdirectory will fail.
To fix this, execute the scripts as modules:

  1. add a file __init__.py to all directories where there is a python script
  2. in the pipeline, adjust the code for calling scripts from "python3 /some/path/script.py" to "python3 -m some.path.script"

See also https://peps.python.org/pep-0338/
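The steps above can be demonstrated end to end; this sketch builds a throwaway package (all names made up) and runs a nested script with `-m`, which is what makes the top-level imports resolve:

```python
import os
import subprocess
import sys
import tempfile

# Build a minimal package layout: pkg/__init__.py, pkg/helpers.py,
# pkg/sub/__init__.py, pkg/sub/script.py (names are illustrative only)
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "pkg", "sub"))
for d in ("pkg", os.path.join("pkg", "sub")):
    open(os.path.join(tmp, d, "__init__.py"), "w").close()
with open(os.path.join(tmp, "pkg", "helpers.py"), "w") as f:
    f.write("GREETING = 'hello from helpers'\n")
with open(os.path.join(tmp, "pkg", "sub", "script.py"), "w") as f:
    f.write("from pkg.helpers import GREETING\nprint(GREETING)\n")

# Run as a module from the project root: "python3 -m pkg.sub.script" puts the
# root on sys.path, so the cross-directory import works
result = subprocess.run([sys.executable, "-m", "pkg.sub.script"],
                        cwd=tmp, capture_output=True, text=True)
print(result.stdout.strip())
```

Running the same file as `python3 pkg/sub/script.py` would instead fail with `ModuleNotFoundError`, which is the failure mode pipeline.sh currently hits.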

Affiliation info: cites, pubs and fields

I think we can build the tables for paper/citation counts at the affiliation-year-field0 level, as well as the keyword list at the same level, in one go:

  1. Create temporary table with unique PaperId, Year, AffiliationId
  2. summarise paper outcomes at AffiliationId-Field0-Year level
  3. summarise paper keywords at AffiliationId-Field0-Year level

To consider/note

  • We will replace the table affiliation_outcomes, which is only at the AffiliationId level, with the output from step 2 above. I don't know offhand where we use this old table as an input; we should check.
  • There are two ways of assigning papers to "departments" at institutions: from the author's main field, or from the paper's main field. The latter is probably more accurate.
  • The output from step 3 additionally contains the columns FieldOfStudyId and Score. We could use the floor of the score as a frequency weight to calculate tf-idf (but it is not clear how exactly to implement frequency weights).
  • Not sure how to add the count of researchers in this query.

Here are the queries, which we can use to replace the query in affiliation_outcomes.py.

-- ## 1. create temp table 
CREATE TEMP TABLE paper_affiliation_year AS 
SELECT DISTINCT AffiliationId, Year, PaperId
FROM (
    SELECT a.AuthorId, a.AffiliationId, a.Year, b.PaperId
    FROM AuthorAffiliation a -- ## if an author has 2 main affiliations in the same year, we count their papers at both institutions
    INNER JOIN (
        SELECT PaperId, AuthorId, Year
        FROM PaperAuthorUnique
        INNER JOIN (
            SELECT PaperId, Year
            FROM Papers
        ) USING(PaperId)
    ) b
    ON a.AuthorId=b.AuthorId AND a.Year=b.Year
    -- reduces size of the data set 
    INNER JOIN (
        SELECT PaperId
        FROM paper_outcomes
    ) USING(PaperId)
);

CREATE INDEX idx_paper_temp ON paper_affiliation_year (PaperId);

-- ## 2. create table with citation/paper counts
CREATE TABLE affiliation_outcomes AS  -- this is already defined, where? can we replace it? where do we use it?? 
SELECT COUNT(PaperId) AS PaperCount
    , SUM(CitationCount_y10) AS CitationCount_y10
    , AffiliationId
    , Year 
    , Field0
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, CitationCount_y10 
    FROM paper_outcomes 
) USING(PaperId)
INNER JOIN ( -- each field is unique per paper, so it is ok to join only here 
    SELECT PaperId, Field0 
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year;

CREATE UNIQUE INDEX idx_affo_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0);


-- ## 3. table with keywords
CREATE TABLE affiliation_fields AS 
SELECT SUM(Score) AS Score
    , FieldOfStudyId
    , AffiliationId 
    , Field0
    , Year 
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, FieldOfStudyId, Score
    FROM PaperFieldsOfStudy 
    INNER JOIN (
        SELECT FieldOfStudyId 
        FROM FieldsOfStudy 
        WHERE level < 2 -- choose appropriate level 
    ) USING(FieldOfStudyId)
) USING(PaperId)
INNER JOIN ( -- Field0 from the paper's main field, as in step 2
    SELECT PaperId, Field0
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year, FieldOfStudyId;

-- uniqueness must include FieldOfStudyId: the table has one row per field
CREATE UNIQUE INDEX idx_afff_AffilYearField ON affiliation_fields (AffiliationId, Year, Field0, FieldOfStudyId);

improve advisor linking

  • use comparator functions as for grants
  • restrict the MAG sample to be linked to people at US institutions

Try one field first, assess the difference in performance, then do the rest.

understand low advisor links in some fields, and fix

see output/quality_linking_advisors.pdf

For instance, goid 2449294030: its first advisor only has an affiliation in 2009; among others, he has a science paper without any affiliation reported.

Check some more examples of non-linked theses in these fields.

Package src/helpers

(This is a copy from chrished/MAG/#2)

The helpers/ directory is used in most of the scripts above. helpers/ is easy to access when it is a subdirectory, but once the scripts live in separate subdirectories, the proper solution is to package the scripts in helpers/. There are two options:

  • use PyPI, but then the package is public
  • host a private PyPI server, for instance with devpi. devpi seems straightforward to use, but I don't know whether sharing packages across contributors will be easy.

Main benefits:

  • tidier code, cleanly separated and probably easier to maintain
  • the packaged helpers could be used as well in other projects (e.g., researcher_gender)

Clean MAG FOS for NA etc

Some abstracts contain text like "abstract not available", and currently a MAG field of study is predicted for those. We need to filter them out on the fly, or make a separate list of the records whose field is NA.
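One way to do the filtering is a predicate applied before field prediction. A sketch; the exact placeholder phrases are an assumption, not an exhaustive list from the data:

```python
import re

# Placeholder phrases seen in abstracts; extend this pattern as more show up
PLACEHOLDER = re.compile(r"abstract\s+not\s+available|no\s+abstract",
                         re.IGNORECASE)

def is_missing_abstract(text):
    """True if the abstract is empty or only a placeholder phrase."""
    return text is None or not text.strip() or bool(PLACEHOLDER.search(text))
```

Records flagged by this predicate would be skipped for field prediction and collected in the NA list instead.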

Correct type of AuthorId in links_nsf_mag

GrantID|AuthorId|Position|firstname_similarity|lastname_similarity
0000050|2440992798.0|0|1.0|1.0
0000135|686086669.0|1|1.0|1.0
0000433|327948395.0|1|1.0|1.0

AuthorId has a decimal point. This is wrong; it should be a (big) integer.

I will now check where this happens and fix it.
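The trailing `.0` typically appears when a merge introduces missing values and pandas upcasts the integer column to float. If that is the cause here, a sketch of the fix with pandas' nullable integer dtype (values taken from the snippet above):

```python
import pandas as pd

# AuthorId became float (note the .0), most likely because a merge introduced
# NaN; the nullable Int64 dtype restores big integers while keeping missing values
df = pd.DataFrame({"AuthorId": [2440992798.0, 686086669.0, None]})
df["AuthorId"] = df["AuthorId"].astype("Int64")
print(df["AuthorId"].tolist())  # [2440992798, 686086669, <NA>]
```

Writing this column back to SQLite then stores proper integers instead of floats.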

main field per author-year?

Do we need to calculate the main field1 and field0 for each year an author is active?

It may be helpful for #27, but I am not sure it is worth the time right now.

rename "institution" column in author_info_linking

d1c3076 changes the column names in table author_info_linking (created by the script of the same name in main/prep_mag) to the various meanings of institution. We should also rename the "institution" column to "first_main_institution" for consistency, and change this in the linking scripts.

fix DeprecationWarning: escape sequences for normalize_string()

DeprecationWarning for normalize_string():

src/dataprep/helpers/functions.py:41
  /home/flavio/projects/mag_sample/src/dataprep/helpers/functions.py:41: DeprecationWarning: invalid escape sequence \w
    s = s.replace("[^\w\d]", " ", regex = True)

src/dataprep/helpers/functions.py:42
  /home/flavio/projects/mag_sample/src/dataprep/helpers/functions.py:42: DeprecationWarning: invalid escape sequence \s
    s = s.replace("\s+", " ", regex = True)

these should be replaced with r"[^\w\d]" and with r"\s+" respectively.
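A sketch of the corrected lines, assuming `s` is a pandas Series; the surrounding function body is reconstructed from the warning output, not copied from functions.py:

```python
import pandas as pd

def normalize_string(s):
    """Lowercase, strip punctuation, and collapse whitespace in a Series."""
    s = s.str.lower()
    s = s.replace(r"[^\w\d]", " ", regex=True)  # raw string: no invalid \w escape
    s = s.replace(r"\s+", " ", regex=True)      # raw string: no invalid \s escape
    return s.str.strip()

print(normalize_string(pd.Series(["Foo--Bar  Baz"])).iloc[0])  # foo bar baz
```

The behavior is identical; the raw-string prefix only stops Python from interpreting `\w` and `\s` as (invalid) string escapes before the regex engine sees them.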


in proquest, use new mag fields at level 0

Currently we select based on pq_fields_mag, which is imputed in load_proquest/correspond_fieldofstudy.py. Now that we have the predicted MAG fields, we could use them for selecting on field level 0.

This query computes a table similar to pq_fields_mag:

create table pq_fields_mag_future as 
select goid, ParentFieldOfStudyId AS FieldOfStudyId, sum(score) as score  
from (
    select goid, fieldofstudyid, ParentFieldOfStudyId, score 
    from pq_magfos 
    -- join parents at level 0
    inner join (
        select childfieldofstudyid, ParentFieldOfStudyId
        from crosswalk_fields
        where parentlevel = 0
    ) as crosswalk 
    on (pq_magfos.fieldofstudyid = crosswalk.childfieldofstudyid)
)
group by goid, parentfieldofstudyid

It could replace pq_fields_mag if we use the rank by score in place of the position column in the current table.
It would break some things: setup_linking may need to be adjusted, as well as link/topic_similarity_functions.py.

data loading from db for dedupe: tuples not as expected?

Not all records have their year_range as the expected tuple. This gives errors in the training and linking steps for NSF-MAG: we expect the year_range from NSF to be (year, ) and the year_range from MAG to be (startyear, endyear). But sometimes the tuples do not have the expected length, and sometimes other types are passed (see the traceback below).

Traceback (most recent call last):
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/core.py", line 137, in __call__
    self.fieldDistance(record_pairs)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/core.py", line 148, in fieldDistance
    distances = self.data_model.distances(records)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/datamodel.py", line 84, in distances
    distances[i, start:stop] = compare(record_1[field],
  File "/home/christoph/mag_sample/src/dataprep/helpers/comparator_functions.py", line 167, in compare_range_from_tuple
    elif len(b) == 1:
TypeError: object of type 'int' has no len()

Most probably this is a problem with loading the data from the database into the form dedupe expects.
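One defensive option is to normalize year_range at load time, before records reach the comparator. A sketch; the helper name is mine, and the expected shapes are the ones described above:

```python
def as_year_range(value):
    """Coerce a DB value into the tuple shape the comparator expects.

    NSF records should yield (year,), MAG records (startyear, endyear);
    bare ints and lists coming from the DB layer are normalized here.
    """
    if value is None:
        return None
    if isinstance(value, int):
        return (value,)      # a bare int caused the TypeError in the traceback
    return tuple(value)      # lists and tuples pass through as tuples
```

Applying this to every year_range during data loading would prevent the `object of type 'int' has no len()` failure in compare_range_from_tuple.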

compute topic similarity more efficiently?

Currently we iterate over each graduation year, and in each iteration we load a window of data spanning +/- 5 years into memory. If we computed the similarity for two or more neighboring graduation years at once, we would only have to add the data for two additional years. This could speed up the calculations; the trade-off is higher memory use.
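The batching idea can be sketched as follows; the generator groups graduation years in pairs and returns the single year range that covers both +/- 5 year windows (the function and its signature are illustrative, not from the codebase):

```python
def windows_in_pairs(years, half_width=5):
    """Yield (pair_of_years, years_to_load) so two neighboring graduation
    years share one in-memory window instead of loading two overlapping ones."""
    for i in range(0, len(years), 2):
        pair = years[i:i + 2]                 # one or two graduation years
        lo = pair[0] - half_width
        hi = pair[-1] + half_width
        yield pair, range(lo, hi + 1)         # covers both windows at once
```

For consecutive years the combined window is only two years wider than a single one (12 vs. 11 years of data for half_width=5), which is the memory cost of halving the number of loads.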

update conda env

on main branch:

  • patch yml file from flavio/link_grants
  • add the following dependencies: geopy, xlrd, ipykernel, openpyxl, pandoc

Biology write links process stops: split sample automation

Biology linking stopped during the create-links script; to solve this, we split the data into two subsamples. This works mechanically, but it seems to create a break in the linking rate.

  • Check the cause of the break in the linking rate (compare splits at different years), and check whether selecting the highest link score is the issue.
  • Tidy up the biology split and make the whole process automatic.

improve graduates linking

  1. extend the time range up to 2015
  2. compare the similarity of document titles: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents. Idea: by definition, the people we link publish something, and it is likely related to the topic of their dissertation.
  3. condition the MAG sample on people employed in the US at some point / at the beginning of their career? This is faster and likely more precise (less noise from unrelated entities). It may reduce the size of the linked sample, but we are interested in the US anyway.
  4. save the training and settings files in a folder specific to graduates
  5. create benchmarking as described below
  6. keywords comparator: count the number of overlaps / dummy for at least one overlap
  7. title comparator: larger n-grams? stemming before tf-idf? -- use the benchmark for intuition
  8. extend the list of first papers: currently those in the first 7 years of publication; at most 10
  9. increase the sampled fields on the MAG side for people whose field can fall into different major fields (biology vs. chemistry)
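For point 2, a toy version of the tf-idf plus cosine-similarity approach from the linked answer, kept pure-Python so it is self-contained (in the pipeline one would use a library such as scikit-learn):

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Toy tf-idf cosine similarity between two titles (illustrative only)."""
    def tokens(doc):
        return doc.lower().split()

    n = len(corpus)
    df = Counter()                    # document frequency of each token
    for doc in corpus:
        df.update(set(tokens(doc)))

    def vector(doc):
        tf = Counter(tokens(doc))     # smoothed idf avoids division by zero
        return {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}

    va, vb = vector(doc_a), vector(doc_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0
```

A dissertation title and the title of a linked candidate's first papers sharing topical words would score above unrelated pairs, which is the signal point 2 wants to exploit.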

PaperCount in author_fields_detailed

Something I came across when calculating topic similarity: what is PaperCount in author_fields_detailed created and used for? Should it be dropped because it is calculated elsewhere?
