Code Monkey home page Code Monkey logo

solr_search's Introduction

Introduction

The Data and Analytics section (D&A) of UNICEF collects, integrates, analyzes, models, and publishes hundreds of indicators on the state of children and women at the global, regional, national, and subnational levels. Owing to the wide variety of its data that span many thematic areas it has historically found it challenging to house these data in a central location that is easy to discover, query, and consume.

The BMGF-funded Helix project enhances UNICEF’s ability to co-locate and make these public indicators available using a single set of services that conform to the UN-preferred Statistical Data and Metadata eXchange standard (SDMX). In addition, the Helix project works to integrate the business process to deliver consistent and coordinated data.

This project aims at making searches of indicator values convenient, removing some of the query limitations imposed by SDMX, the standard used to exchange statistical data within the Helix project. You can access an Apache Solr API to answer many additional questions, like retrieving all values of indicators with a specific value on a dimension, filtering results by a subset of domains, retrieving all values of indicators related to a specific data source, etc.

The Solr index is available here

Guests can query the index using the credentials:

Username: DAPMRead

Password: solrread!

This user can query the index, but it has no write permissions.

Structure of the index

The index is composed of a set of objects, each one representing a specific indicator of a country in a year. The structure is hierarchical since an indicator is composed of a set of Disaggregations, each one composed of Attributes and Dimensions. There are properties (in Solr they are called fields; “property” is a word coming from object-oriented programming) at the level of the ROOT of the indicator (like country, country code, year, total value, data source of total value, etc.) and properties at the level of each disaggregation (like value and data source for that specific disaggregation). Dimensions and Attributes are highly flexible, and they are represented as triples Name, Value, Code, so they are not hardcoded in the index. Still, we can always have new dimensions or attributes and the schema does not change.

About the data source, in SDMX sometimes it is modeled as a dimension, sometimes as an attribute. In Solr, the data source is a property of disaggregation to make things easier and to enable more powerful queries. In addition, since the ROOT element also contains properties related to the total value of the indicator, the data source related to the total value is repeated there (when the total value is available).

A clarification: the "total value" is the value when all dimensions are “Total”.

As an example of a record (in JSON):

         "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014",
         "HelixCode": "PT_ADLS_10-14_LBR_HC",
         "Indicator": "Percentage of adolescents (aged 10-14 years) engaged in household chores",
         "Year": "2014",
         "CountryCode": "AFG",
         "Country": "Afghanistan",
         "content_type": "indicator",
         "TotalValue": null,
         "DATA_SOURCE": null,
         "Disaggregations": [
             {
             "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T",
             "TotalValue": "9.3",
             "DATA_SOURCE": "Living Conditions Survey 2013-14",
             "Dimensions": [
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|SEX",
                     "Name": "SEX",
                     "Value": "Male",
                     "Code": "M"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|AGE",
                     "Name": "AGE",
                     "Value": "10 to 14 years old",
                     "Code": "Y10T14"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|EDUCATION_LEVEL",
                     "Name": "EDUCATION_LEVEL",
                     "Value": "Total",
                     "Code": "_T"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|WEALTH_QUINTILE",
                     "Name": "WEALTH_QUINTILE",
                     "Value": "Total",
                     "Code": "_T"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|RESIDENCE",
                     "Name": "RESIDENCE",
                     "Value": "Total",
                     "Code": "_T"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|NUMBER_CHLD",
                     "Name": "NUMBER_CHLD",
                     "Value": "Total",
                     "Code": "_T"
                 }
             ],
             "Attributes": [
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|UnitOfMeasure",
                     "Name": "UnitOfMeasure",
                     "Value": "%",
                     "Code": "PCNT"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|TimePeriodMethod",
                     "Name": "TimePeriodMethod",
                     "Value": "End of fieldwork"
                 },
                 {
                     "ID": "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|Coverage",
                     "Name": "Coverage",
                     "Value": "2013-14"
                 }
             ]
         }]

How to query Solr (examples)

Apache Solr has strong support for nested documents. To enable nested documents, all children must have a unique identifier. In our case, children's identifiers are an extension of the parent's identifiers. So, for example, the attribute with ID:
"UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T|UnitOfMeasure"
belongs to the disaggregation with ID:
"UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014|M|Y10T14|_T|_T|_T|_T"
which belongs to the indicator:
"UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014".

Solr keeps track of the hierarchy in a field called _nestpath. In our index, the hierarchy is expressed in this way:

 _nest_path_:"/Disaggregations"  
 _nest_path_:"/Disaggregations/Attributes"  
 _nest_path_:"/Disaggregations/Dimensions"

When you query Solr, you can use the field q to type your query. In addition, the field fl can be used to retrieve specific fields of documents that match the query. For example, if you are matching a ROOT document and you want to see the whole hierarchy, fl must be *,[child]. By default, only 10 children are displayed. If your document has more than 10 children, you can change this parameter using the "limit" option: *,[child limit=100] Other combinations are possible, also filtering on fields of children. Solr has a very powerful tool that allows doing a lot of things with hierarchies, for example matching children and retrieving parents. It’s the Block Join Query Parser. You can read this article if you want to learn more about the queries that you can build.

The rest of this document provides some examples of queries:

Retrieve an indicator by ID with all the disaggregations

Back to all examples

The ID of an indicator is the indicator identifier available in Solr, e.g. "UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|AFG|2014". It is not the indicator Helix Code.

q

ID:"UNICEF|PT|1.0|PT_ADLS_10-14_LBR_HC|COD|2014"

fl

*,[child limit=50]

Remove the fl filter to retrieve only the ROOT, so the indicator with total value and without disaggregations.

Link

Retrieve all indicators from a given country in a given year

Back to all examples

q

+content_type:"indicator" +CountryCode:"COD" +Year:2013

fl

HelixCode

Link

Filter by one or more domains

Back to all examples

Domains are encoded in indicators' IDs. Thus, whatever query you are writing, you can add a filter that limits indicators to specific domains. For example, to filter only by Nutrition and Child Protection:
Existing Query AND (ID:(UNICEF|PT*) OR ID:(UNICEF|NT*)) AND content_type: "indicator"

Content type filter is needed if you want to match ROOT documents, i.e. indicators.

Get all the disaggregations whose dimension WEALTH_QUINTILE is LOWEST

Back to all examples

q

{!parent which='_nest_path_:"/Disaggregations"'}+Value:"Lowest" +Name:"WEALTH_QUINTILE"

fl

*,[child limit=50]

link

This query searches the Value “Lowest” in all Dimensions with name “WEALTH_QUINTILE” and returns all Disaggregation (the parent filter with the specification of the nest path of the hierarchy) with all Dimensions and Attributes (the fl filter). If you prefer to search by code, you can search for "Q1", which is the code for "Lowest":

q

{!parent which='_nest_path_:"/Disaggregations"'}+Code:"Q1" +Name:"WEALTH_QUINTILE"

fl

*,[child limit=50]

link

Limit the output to some children's fields

Back to all examples

If we want to limit the returned hierarchy only to the dimension wealth quintile, we can express FL as:

fl

*,[child childFilter=Name:WEALTH_QUINTILE]

In this case, the output contains all the disaggregations matching the query with all the properties (indicator value and data source), but without attributes and with the only dimension "wealth quintile".

link

Limit the output to a subset of domains

Back to all examples

q

{!parent which='_nest_path_:"/Disaggregations" AND (ID:(UNICEF|PT*) OR ID:(UNICEF|NT*))'}+Value:"Lowest" +Name:"WEALTH_QUINTILE"

[link](https://unicef.searchstax.com/solr/IndicatorsData/select?fl=*%2C[child]&q={!parent which%3D'nest_path%3A%22%2FDisaggregations%22 AND (ID%3A(UNICEF|PT*) OR ID%3A(UNICEF|NT*))'}%2BValue%3A%22Lowest%22 %2BName%3A%22WEALTH_QUINTILE%22)

Limit the output to specific parent fields

Back to all examples

If in output we want to see only the total value and the disaggregation ID, we can use FL to filter only on that attribute:

fl

ID,TotalValue

link

Query on the code of a dimension (not on the value)

Back to all examples

In case we don’t want to use the value “Lowest”, but the code “Q1”, the query becomes:

q

{!parent which='_nest_path_:"/Disaggregations"'}+Code:"Q1" +Name:"WEALTH_QUINTILE"

fl

*,[child limit=50]

link

Get all WEALTH_QUINTILE dimensions available in the index

Back to all examples

q

+_nest_path_:"/Disaggregations/Dimensions" +Name:"WEALTH_QUINTILE"

link

Filter on data sources

Back to all examples

The Data Source is available at the level of disaggregations. This is important because, if two data sources provide a different value for the same combination of dimensions of an indicator, we maintain both values in the index. To get all disaggregations (with all Attributes and Dimensions) of a specific data source:

q

+_nest_path_:"/Disaggregations" +DATA_SOURCE:"Living Conditions Survey 2013-14"

fl

*,[child limit=50]

link

When disaggregations are not available but the total value is available, the data source is represented as an attribute of the ROOT element, together with the value. To discover all indicators having a data source for their total value, we can query Solr in two ways:

q

-_nest_path_:* +DATA_SOURCE:*

q

+content_type:"indicator" +DATA_SOURCE:*

And then always:

fl

*,[child limit=50]

link

If we want to match a specific data source about indicators' total values:

q

+content_type:"indicator" +DATA_SOURCE:"TERCE 2013"

fl

*,[child]

If a data source provides values both at disaggregation level and non-disaggregated values, we can combine two queries:

q

(+content_type:"indicator" +DATA_SOURCE:"TERCE 2013") OR (+_nest_path_:"/Disaggregations" +DATA_SOURCE:"TERCE 2013")

fl

*,[child limit=50]

link

The previous query is just an example since in the current index there are no similar situations at this time. In case there will be, results will be a mix of disaggregations (when the data source provides values at disaggregation level) and ROOT elements (when the data source does not provide any disaggregation)

Get the most recent version of an indicator

Back to all examples

This query is very useful to access the latest value of an indicator in a country. Thus, in Q we filter by Helix Code and Country Code. Then we sort by year descending and we take only one element in the output:

q

+CountryCode:"COD" +HelixCode:"PT_ADLS_10-14_LBR_HC"

fl

*,[child limit=50]

sort

Year desc

rows

1

link

solr_search's People

Contributors

fcproj avatar

Watchers

Yves Jaques avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.