eche-api's People

Contributors

karniewski, tiagosimoes-euf

eche-api's Issues

Add key at top level to indicate existing verified data

Motivation

The API includes fields for verified data but most entries do not have any content there. As such, exposing verified data for all entries is not desirable at this point.

However, it may be useful to indicate whether a given entry has verified data without requiring the inclusion of verified data for all entries. This way, client applications can perform an additional scoped request when desirable.

Data processing and API structure changes

A new processed API field hasVerifiedData should return true when a given entry has at least one non-empty verified data field, or false otherwise.
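
A minimal sketch of how the flag could be computed, assuming each entry is a dict and that the verified fields use the example key names proposed in the verified data issue (verifiedName, verifiedNameLang, verifiedCity, verifiedCityLang). The function name and the rule that blank strings count as empty are assumptions, not part of the current code base.

```python
# Hypothetical list of verified data keys; the final set may differ.
VERIFIED_FIELDS = (
    "verifiedName",
    "verifiedNameLang",
    "verifiedCity",
    "verifiedCityLang",
)


def has_verified_data(entry: dict) -> bool:
    """Return True when at least one verified field holds a non-empty value."""
    for field in VERIFIED_FIELDS:
        value = entry.get(field)
        # Missing keys, None and whitespace-only strings all count as empty.
        if value is not None and str(value).strip():
            return True
    return False
```

A client could then request the verified fields only for entries where the flag is true.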

Further work

Update OpenAPI specification and API documentation.

Add country related processed keys to the API

Motivation

As of March 2023 significant changes have been introduced in the source ECHE list, especially at the level of the country data format.

Before: country values were country names.

After: country values are country codes based on Annex A6 of the Interinstitutional Style Guide (ISG) of the Publications Office of the European Union.

Given the differences between the ISG country codes and the ISO 3166-1 alpha-2 standard, the API should provide both, as well as the country names as per the ISG.

Data processing for current and new API fields

  • The canonical API field country contains the raw value provided in the ECHE list.
  • Data processing should identify whether the country value contains an ISG country code or country name.
  • The processed API field countryCode should henceforth contain the ISG country code by:
    • either copying and correcting an ISG country code
    • or correcting and matching an ISG country name to an ISG country code.
  • A new processed API field countryName should contain the ISG country name by:
    • either copying and correcting an ISG country name
    • or correcting and matching an ISG country code to an ISG country name.
  • A new processed API field countryCodeIso should contain the corresponding ISO 3166-1 alpha-2 country code.
  • The processed API field erasmusCodeCountryCode should henceforth contain the ISG country code.
  • A new processed API field erasmusCodeCountryCodeIso should contain the ISO 3166-1 alpha-2 country code.
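
The processing steps above can be sketched roughly as follows. The lookup tables are a tiny illustrative subset (a real table would cover every country in Annex A6), and the function name process_country is an assumption. The two ISG codes shown in ISG_TO_ISO, EL for Greece and UK for the United Kingdom, are the well-known cases where the ISG differs from ISO 3166-1 alpha-2 (GR and GB).

```python
# Illustrative subset of ISG codes and names; not the full Annex A6 table.
ISG_NAMES = {"EL": "Greece", "UK": "United Kingdom", "PT": "Portugal"}

# Only the ISG codes that differ from ISO 3166-1 alpha-2 need an entry here.
ISG_TO_ISO = {"EL": "GR", "UK": "GB"}

# Reverse lookup from a case-folded ISG country name to its ISG code.
NAME_TO_ISG = {name.casefold(): code for code, name in ISG_NAMES.items()}


def process_country(raw):
    """Derive countryCode, countryName and countryCodeIso from the raw value."""
    value = (raw or "").strip()
    if value.upper() in ISG_NAMES:
        # The raw value is (a possibly mis-cased) ISG country code.
        code = value.upper()
    else:
        # Otherwise try to match it as an ISG country name.
        code = NAME_TO_ISG.get(value.casefold())
    if code is None:
        # Unknown value: keep the canonical field, leave processed fields empty.
        return {
            "country": raw,
            "countryCode": None,
            "countryName": None,
            "countryCodeIso": None,
        }
    return {
        "country": raw,
        "countryCode": code,
        "countryName": ISG_NAMES[code],
        "countryCodeIso": ISG_TO_ISO.get(code, code),
    }
```

The same ISG to ISO mapping could be reused for erasmusCodeCountryCode and erasmusCodeCountryCodeIso.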

Further tasks

Update OpenAPI specification and API documentation.

Adding verified data to the application

Problem

Currently, there are issues with the data found in the ECHE list related to HEI names and city names:

  • the organisationLegalName refers to the ECHE holder and not necessarily the HEI; in some cases, the legal entity that holds the ECHE is the owner of the HEI, not the HEI itself;
  • the organisationLegalName sometimes appears as UPPERCASE or with wrong capitalisation when language is taken into consideration;
  • the city is, in fact, part of the postal address, which means it often carries additional information, such as district numbers and other postal-related words (e.g. CEDEX in France);
  • the city sometimes appears in the native language and sometimes in a different language (e.g. a regional language or English), so the same city may appear in more than one form;
  • the ECHE list does not contain enough data so that these issues may be sorted out without using other sources.

Proposed solution

Add verified data sources to the application, attach the verified data when available and expose it in the API.

Obtaining verified data

Given the known limitations of the data in the ECHE list and the foreseeable difficulties in collecting information at the individual HEI level, the best option would be to source verified data from either National Agencies or the relevant Ministries.

The verified data should include the HEI name as presented to the public, not necessarily the legal name of the ECHE holder, and the correct spelling of the city name without additional postal indications. If possible, these should be accompanied by an ISO 639-1 language code or even a complete IETF language tag when relevant.

Attaching the verified data

Besides the data points to be attached, the verified data must also include unique identifiers for each HEI so that data can be correctly matched. Ideally, both erasmusCode and pic (and oid when available) would be included, so that data can be matched between sources correctly and with redundancy. For faster results, a country code can be included as well, so that the matching may occur on a subset of the ECHE list.

The verified data should be attached to the ECHE list data after the existing cleaning operations (so that normalized identifiers are available for matching) and before the database is populated.
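
As a rough sketch of the matching step, assuming both sources are available as lists of dicts with already normalized identifiers: the function name attach_verified is hypothetical, and the sketch tries erasmusCode first and falls back to pic for redundancy. Matching on oid or narrowing by country code is left out for brevity.

```python
def attach_verified(eche_rows, verified_rows):
    """Attach verified fields to ECHE rows, matching on erasmusCode, then pic."""
    # Index the verified data by each identifier that is actually present.
    by_erasmus = {v["erasmusCode"]: v for v in verified_rows if v.get("erasmusCode")}
    by_pic = {v["pic"]: v for v in verified_rows if v.get("pic")}

    merged = []
    for row in eche_rows:
        match = by_erasmus.get(row.get("erasmusCode")) or by_pic.get(row.get("pic"))
        extra = {}
        if match:
            # Copy only the verified* keys so identifiers are never overwritten.
            extra = {k: val for k, val in match.items() if k.startswith("verified")}
        merged.append({**row, **extra})
    return merged
```

Rows without a match are returned unchanged, which is consistent with the verified fields being optional.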

Exposing the verified data

In order to expose the verified data in addition to the ECHE list data, new API keys will be required, for example:

  • verifiedName
  • verifiedNameLang
  • verifiedCity
  • verifiedCityLang

The new API keys must be added to the specification as non-required, since it is not guaranteed that such data will exist.

Verified data should include Display Name (w/ lang)

There should be a distinction between the Organisation Legal Name, as published in the ECHE list and used in legally binding documents, and the Display Name, which is a much more useful data point for user facing applications.

For example, while an IIA may be established by THE PROVOST, FELLOWS, FOUNDATION SCHOLARS & THE OTHER MEMBERS OF BOARD, OF THE COLLEGE OF THE HOLY & UNDIVIDED TRINITY OF QUEEN ELIZABETH NEAR DUBLIN, one could argue that Trinity College Dublin is easier to identify in a colloquial setting.

Other examples include cases where an Erasmus Charter is awarded to a legal entity that owns an educational institution of a different name, just like EIA - ENSINO E INVESTIGACAO E ADMINISTRACAO SA owns Atlântica - Instituto Universitário.

Because this data cannot be drawn from the ECHE list, it should be an optional component of verified data.

Coding standards

This project could use a cleanup, so the following packages are suggested:

  • flake8
  • isort

Flatten processed fields to be on par with canonical fields

Motivation

Since the API processed fields are named by appending descriptions to the canonical field names, it is safe to present both canonical and processed fields at the same level. This reduces the complexity of API calls when dealing with canonical and processed fields.

The bandwidth impact is negligible: at the time of this writing, the unfiltered output with canonical fields was 389 kB and the unfiltered output with both canonical and processed fields was 470 kB.
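
The flattening could look roughly like the sketch below. It assumes processed fields currently sit under a nested key, here hypothetically called "processed", and it raises on a name collision rather than silently overwriting a canonical field, since the proposal relies on the names being distinct.

```python
def flatten_entry(entry: dict, processed_key: str = "processed") -> dict:
    """Lift nested processed fields to the same level as canonical fields."""
    # Start from the canonical fields, i.e. everything but the nested block.
    flat = {k: v for k, v in entry.items() if k != processed_key}
    for key, value in entry.get(processed_key, {}).items():
        if key in flat:
            # Processed names are expected to extend canonical names, never equal them.
            raise ValueError(f"processed field {key!r} collides with a canonical field")
        flat[key] = value
    return flat
```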

Separation of concerns: verified data

All ECHE list entries are processed, but only a subset will have verified data attached. As such, this flattening proposal does not include verified data fields in the API.

Further work

Update OpenAPI specification and API documentation.

NaN values in date columns

The ECHE List with filename 20231220_List_of_Accredited_HEIs_within_the_Erasmus+_Programme_2021-2027_0.xlsx contains incorrect values in the ECHE Start Date column (where the ECHE End Date is 31-07-2024).

This causes the current processing to produce NaN entries, which in turn are not being correctly output in JSON, resulting in invalid API output.
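
One possible mitigation, sketched with the standard library only and assuming the NaN values reach the output stage as Python floats: replace them with None before serialising, and pass allow_nan=False so that json.dumps raises instead of emitting the non-standard NaN token. The helper names and the echeStartDate key in the usage note are hypothetical.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None in dicts and lists."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value


def to_strict_json(payload) -> str:
    """Serialise to JSON, failing loudly on any remaining non-finite float."""
    # allow_nan=False makes json.dumps raise ValueError rather than output NaN.
    return json.dumps(nan_to_none(payload), allow_nan=False)
```

For instance, to_strict_json({"echeStartDate": float("nan")}) yields a null value for that key instead of invalid output.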
