
ungegn_pdf_parser's Introduction

United Nations Group of Experts on Geographical Names (UN GEGN) (UNGEGN)

This is a failed attempt at parsing E/CONF.105/13. First I converted the PDF to text with pdftotext -layout. I then did minimal processing on the resulting text with vim. After that, I thought I could write a parser for it to get the data into a database.
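The conversion step can be sketched in Python as below. This is only an illustration, not the parser from this repository: the function names and the rule of splitting on runs of two or more spaces are assumptions, and it assumes the pdftotext binary (from poppler or xpdf) is on the PATH and emits UTF-8.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def convert(pdf_path):
    """Run pdftotext and return the layout-preserved text (assumed UTF-8)."""
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(line):
    """Split one layout-preserved line into cells.

    Assumption: columns are separated by runs of two or more spaces,
    while single spaces belong to the cell text itself.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
```

For example, split_columns("Albania    Shqipëri    Albanian") yields three cells. The split throws away horizontal position, which is part of why continuation lines are hard to assign to a column.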

This effort was abandoned because the Unicode formatting is very poor, and some of the text lacks Unicode entirely; some of it is just pictures of text. This throws off the parser (a problem I may be able to fix and Q/A), and it makes the text in the screenshots impossible to extract.

  • Sri Lanka, Myanmar and Comoros are examples of screenshot text.
  • Albania / Shqipëri has no text for the first language, and the "short name" is actually the language itself.
  • All of the names are in ASCII, with the exception of "CÔTE D’IVOIRE".
  • When a column extends down into the next row, there is no way to tell whether it was the first or the second column. All we know is that only one column extended down, so the other column is empty.
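Two of the problems above can at least be flagged for manual review. The sketch below is hypothetical and not part of this repository: has_non_ascii and is_continuation are illustrative names, and the two-or-more-spaces column rule is an assumption about the pdftotext -layout output. It detects a continuation line, but it cannot say which column the line belongs to.

```python
import re


def has_non_ascii(name: str) -> bool:
    """True if the name contains any character outside ASCII."""
    return not name.isascii()


def is_continuation(line: str) -> bool:
    """True if the line holds exactly one cell.

    Assumption: cells are separated by runs of two or more spaces.
    A single-cell line is probably a column that wrapped onto the next
    row, but nothing here tells us whether it was the first or second.
    """
    cells = [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
    return len(cells) == 1


names = ["ALBANIA", "CÔTE D’IVOIRE"]
flagged = [name for name in names if has_non_ascii(name)]
```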

You can find more information here:

	* https://opendata.stackexchange.com/questions/13693/does-the-ungegn-release-their-country-names-localized-in-a-format-thats-not-a-p
	* https://opendata.stackexchange.com/questions/13692/where-does-iso-3166-get-the-names-and-translations-of-the-countries
	* https://dba.stackexchange.com/questions/225996/unicode-storage-of-u202b-rle-and-u202c-pde-in-a-unicode-aware-database
	* [UNGEGN](https://unstats.un.org/unsd/geoinfo/UNGEGN/)
	* [Working Group on Country Names](https://unstats.un.org/unsd/geoinfo/UNGEGN/wg1.html)

ungegn_pdf_parser's People

Contributors

evancarroll
