
public-person's Introduction

datasets-br

Describing the datasets-br directives and using this project as a place for general discussion.

Datasets-BR directives

  1. To publish qualified datasets on Datahub.io;
  2. To unify, through a curation process, a set of Wikidata fragments of items, or common instances of an item;
  3. To unify the terminology used for CSV column names and for table and column semantics (SchemaOrg conventions when possible);
  4. Digital preservation (CSV files and data dumps from the original sources) of the curated datasets;
  5. Monitoring/auditing Wikidata and OpenStreetMap changes, in the context of the curated datasets.

Use as an ecosystem of datasets

Example of use with two Brazilian (BR) datasets, state-codes and city-codes.

Operating with pure SQL or with SQL-unifier, it is easy to merge with other datasets... With PostgreSQL you can offer the datasets through a standard API with PostgREST (or its descendants pREST and PostGraphile), or plug-and-play with SchemaOrg standards, Frictionless Data standards (and tools), etc.
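
A minimal sketch of such a merge, assuming the two CSVs were loaded into hypothetical tables dataset.state and dataset.city with the column names shown below (these are illustrative assumptions, not the official package schema):

-- Sketch only: join city-codes to state-codes after loading both CSVs.
SELECT c.name        AS city,
       s.subdivision AS state,
       c.wikidata_id
FROM dataset.city  c
JOIN dataset.state s ON s.id = c.state_id;

Exposing the same tables through PostgREST is then mostly a matter of granting the API role read access to the schema.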

Documentation

... under construction

Conventions for data provenance and preparation.


  Contents and data of this project are dedicated to


public-person's Issues

normalize and canonize names

Rules to normalize:

  • same name → same ASCII representation, via the unaccent(name) function (see the sketch after these rules).
  • similar name → same Metaphone-pt representation.

Rules to canonize (choosing the official form):

  1. WHEN same CPF and similar name (or same first names)
    1.1. check official name with a CPF-resolver
    1.2. (if not possible) use the most recent

  2. WHEN no CPF and same name (and same birthDate)
    2.1. use the most recent when the difference is more than 6 years
    2.2. (otherwise) use the "most accented" version or the "most standard pt-BR" one (e.g. prefer i instead of y)

... Use a log to record messages about conflict resolutions ...
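
A minimal sketch of the normalization step (the staging table tmp.candidate and its columns are assumptions, and the Metaphone-pt comparison is omitted here):

-- Sketch only: group name variants by their unaccented (ASCII) form.
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT unaccent(name) AS ascii_name,
       birthdate,
       array_agg(DISTINCT name)   AS variants,
       array_agg(DISTINCT source) AS sources
FROM tmp.candidate
GROUP BY unaccent(name), birthdate;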

Example:

               name               | birthdate  |           source         
----------------------------------+------------+---------------------------
 ANTONIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2016
 ANTÔNIO SETUBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2008
 ANTONIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2004
 ANTÔNIO SETÚBAL SILVESTRE        | 1963-01-04 | br:tse;ce:candidatos:2012
 FABRICIO JOSE SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2010
 FABRICIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2004
 FABRÍCIO JOSÉ SATIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2000
 FABRICIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2008
 FABRÍCIO JOSÉ SÁTIRO DE OLIVEIRA | 1975-07-01 | br:tse;sc:candidatos:2012

The most accented form is "ANTÔNIO SETÚBAL SILVESTRE" (2012); the most recent is "ANTONIO SETUBAL SILVESTRE" (2016)...

After canonizing, delete the redundant records and register all variants in the info JSON:

{
 "etc":"etc",
 "synonymous":[
     {"name":"ANTONIO SETUBAL SILVESTRE", "source":123},
     {"name":"ANTÔNIO SETUBAL SILVESTRE", "source":456}
  ]
}
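
A minimal sketch of that step, assuming the canonical rows live in a table person(id, name, birthdate, info jsonb) and the discarded variants in a table name_variant(person_id, name, source_id); both names are illustrative, not the project schema:

-- Sketch only: copy the discarded variants into the canonical row's
-- info->'synonymous' array, then delete them.
UPDATE person p
SET info = jsonb_set(
      coalesce(p.info, '{}'::jsonb),
      '{synonymous}',
      (SELECT jsonb_agg(jsonb_build_object('name', v.name, 'source', v.source_id))
       FROM name_variant v
       WHERE v.person_id = p.id)
    )
WHERE EXISTS (SELECT 1 FROM name_variant v WHERE v.person_id = p.id);

DELETE FROM name_variant v
USING person p
WHERE v.person_id = p.id;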

check best performance after the CSV algorithms are stable

Small and medium files

CREATE FUNCTION lib.get_csvfile(
  file text,
  delim_char char(1) = ',',
  quote_char char(1) = '"'
) RETURNS setof text[]  AS $f$
  import csv
  return csv.reader(
     open(file, 'rb'), 
     quotechar=quote_char, 
     delimiter=delim_char, 
     skipinitialspace=True, 
     escapechar='\\'
  )
$f$ immutable language PLpythonU;

CREATE FUNCTION lib.get_csvline(
    _line text,
    _delim_char char(1) = ',',  
    _quote_char char(1) = '"'
) RETURNS text[] AS $f$
    import csv
    return [row for row in csv.reader(
        [_line], 
        quotechar = _quote_char, 
        delimiter = _delim_char, 
        skipinitialspace = True, 
        escapechar = '\\'
    )][0]
$f$ immutable language PLpythonU;
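
Illustrative calls (the file path and its contents are assumptions; the file must be readable by the PostgreSQL server process):

-- Whole file as a set of text[] rows:
SELECT t.c[1] AS code, t.c[2] AS name
FROM lib.get_csvfile('/tmp/states.csv') AS t(c);

-- A single line:
SELECT lib.get_csvline('SP,"São Paulo",35');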

Big CSV files

For big CSV files, the best option is the file_fdw extension. It is more complex, but works fine and needs only one command once implemented.

    CREATE SCHEMA csv; -- a library, easy to drop 

    CREATE FUNCTION csv.split_csv2(
      line text,                  -- the input CSV UTF-8 line
      delim_char char(1) = ',',   -- can be also E'\t', ';', '|', etc.
      quote_char char(1) = '"'   -- field quotation
    ) RETURNS setof text[] AS $f$
      import csv
      return csv.reader(
          line.splitlines(), 
          quotechar=quote_char, 
          delimiter=delim_char, 
          skipinitialspace=True, 
          escapechar='\\'
      )
    $f$ IMMUTABLE language PLpythonU;

    CREATE FUNCTION csv.split_csv_line2(
      text, char(1) DEFAULT ',', char(1) DEFAULT '"'
    ) RETURNS text[] AS $f$
      SELECT x FROM csv.split_csv2($1,$2,$3) x LIMIT 1;
    $f$ language SQL IMMUTABLE;

   -- not reliable but fast? http://stackoverflow.com/a/42926265/287948
   -- compare performance with csv.split_csv_line2()
   -- ideally a C implementation
    CREATE OR REPLACE FUNCTION csv.split_csv_line(
       line text,                 -- the input CSV string
       delim_char char(1) = ',',  -- can be also E'\t', ';', '|', etc.
       quote_char char(1) = '"',  -- field quotation
       OUT result text[]
    ) LANGUAGE plpgsql AS
    $f$DECLARE
       i integer;
       t text := '';
       c char(1);
       /*
        * 0 means unquoted
        * 1 means quoted
        * 2 means quoted, and we just read a quote
        */
       q integer := 0;
    BEGIN
       /* loop through the characters */
       FOR i IN 1..length(line) LOOP
          /* get i-th character */
          c := substring(line FROM i FOR 1);
          /* end of string is at an unquoted delimiter */
          IF c = delim_char AND q <> 1 THEN
             result := result || t;
             t := '';
             q := 0;
          ELSIF c = quote_char THEN
             CASE q
                WHEN 0 THEN
                   q := 1;
                WHEN 1 THEN
                   IF c = quote_char THEN
                      q := 2;
                   ELSE
                      t := t || c;
                   END IF;
                WHEN 2 THEN
                   q := 1;
                   t := t || quote_char;
             END CASE;
          ELSE
             IF q = 2 THEN
                q := 0;
             END IF;
             t := t || c;
          END IF;
       END LOOP;
       /* add the last string */
       result := result || t;
    END;$f$;

    CREATE EXTENSION file_fdw WITH SCHEMA csv;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;

    CREATE or replace FUNCTION csv.ftable_drop(tname text) RETURNS void AS $f$
      BEGIN
      EXECUTE format('DROP VIEW IF EXISTS %s', 'csv.'||$1);
      EXECUTE format('DROP FOREIGN TABLE IF EXISTS %s CASCADE', 'csv.fgn_'||$1);
      END
    $f$ language PLpgSQL;
    
    CREATE or replace FUNCTION csv.ftable_create(
         tname text,  -- base name: creates foreign table csv.fgn_<tname> and view csv.<tname>
         fpath text,  -- file path to the input CSV
         delim_char char(1) = ',',   -- can be also E'\t', ';', '|', etc.
         quote_char char(1) = '"',   -- field quotation
         p_check boolean default true  -- false to not check-and-drop.
    ) RETURNS text AS $f$
      DECLARE
        fgn_tname text;
        csv_tname text;
      BEGIN
      fgn_tname := 'csv.fgn_'||$1;
      csv_tname := 'csv.'||$1;
      IF p_check THEN PERFORM csv.ftable_drop($1); END IF;
      EXECUTE format(E'CREATE FOREIGN TABLE %s (line text) SERVER files OPTIONS ( filename %L, format \'text\')', fgn_tname, $2);
      EXECUTE format('CREATE VIEW %s AS SELECT csv.split_csv_line(line,%L,%L) as c FROM %s', csv_tname, delim_char,quote_char, fgn_tname);
      RETURN format(E'\ntables %I and %I created\n',fgn_tname,csv_tname);
      END
    $f$ language PLpgSQL;

    ---- TESTING ----
    SELECT csv.ftable_create('my','/tmp/TSE2/lix.csv',';');
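
    An illustrative follow-up query over the csv.my view created by the test call above (the column positions depend on the input file):

    -- Illustrative only: read parsed fields from the generated view.
    SELECT c[1] AS field1, c[2] AS field2
    FROM csv.my
    LIMIT 10;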
