operations's Issues

Fileformat in ogg?

In lingua-libre/operations/create_datasets.sh, lines 37, 43 and 52 request the ogg file format, whereas the format chosen for Lingua Libre files on Commons is wav.
Is there a reason for using ogg in the datasets, and is anything preventing us from switching to wav?

All the best

Add programmatic query of existing languages

Replace the hard-coded line #48 with a programmatic query.

Quick query to get the relevant info from LLQS:

SELECT ?lang ?langLabel ?code ( count(DISTINCT ?record) as ?nb ) WHERE {
  ?lang prop:P2 entity:Q4 ; rdfs:label ?langLabel . FILTER (lang(?langLabel) = "en").
  OPTIONAL { ?record prop:P4 ?lang ; prop:P2 entity:Q2 . }
  OPTIONAL { ?lang prop:P13 ?code }
}
GROUP BY ?lang ?langLabel ?code
ORDER BY DESC(?nb)
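
As a rough sketch of how the script could run this query itself rather than rely on a hard-coded line: the snippet below stores the query verbatim and sends it with curl. The names `llqs_query` and `run_llqs_query` are hypothetical, and the endpoint URL is an assumption that must be supplied by the caller. It also assumes the endpoint accepts a GET request with a `query` parameter and can return CSV.

```shell
#!/bin/sh
# Sketch only. llqs_query and run_llqs_query are hypothetical helper names.

# Print the SPARQL query from this issue, verbatim.
# The quoted heredoc delimiter ('EOF') prevents any shell expansion.
llqs_query() {
  cat <<'EOF'
SELECT ?lang ?langLabel ?code ( count(DISTINCT ?record) as ?nb ) WHERE {
  ?lang prop:P2 entity:Q4 ; rdfs:label ?langLabel . FILTER (lang(?langLabel) = "en").
  OPTIONAL { ?record prop:P4 ?lang ; prop:P2 entity:Q2 . }
  OPTIONAL { ?lang prop:P13 ?code }
}
GROUP BY ?lang ?langLabel ?code
ORDER BY DESC(?nb)
EOF
}

# Send the query to a SPARQL endpoint and print the raw response.
# $1: endpoint URL (assumption: supports GET with a "query" parameter
#     and honours "Accept: text/csv").
run_llqs_query() {
  curl -sS -G "$1" \
    -H 'Accept: text/csv' \
    --data-urlencode "query=$(llqs_query)"
}
```

Usage would look like `run_llqs_query "$LLQS_ENDPOINT"`, where `LLQS_ENDPOINT` is a hypothetical variable holding the LLQS SPARQL URL. The CSV output would still need to be mapped to whatever format line #48 expects.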

See also :

  • lingua-libre/CommonsDownloadTool#2

Test CommonsDownloadTool on target languages sorted by ascending number of recordings

Commit: f3e93bb

Source: MediaWiki:LanguagesGalleryData.js, https://jsfiddle.net/gxjqunbr/1/ (2022.01.22).

To do for @mickeybarber :

  • git pull operations
  • run create_datasets.sh on its server 🕺🏼

Explanation:
I came back to the question of dataset generation.
Following Mickey's analysis, I realized that a hard-coded change would let us test further.
So I modified the script to hard-code the list of languages to download in part 3.
The languages are ordered from the smallest to the largest number of recordings, which will let us:

  • update most of our datasets (the 140+ smallest out of 147).
  • see the breaking point of CommonsDownloadTool and its SPARQL query.
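
The "smallest first, find where it breaks" idea can be sketched as a loop that stops at the first failing language. This is a minimal illustration, not the code in create_datasets.sh. `run_until_failure` is a hypothetical helper, and the download command is passed in as an argument because the real CommonsDownloadTool invocation is not shown here.

```shell
#!/bin/sh
# Sketch only. run_until_failure is a hypothetical helper name.

# Read language QIDs from stdin (expected order: fewest recordings first),
# run the given command once per QID, and stop at the first failure so the
# breaking point is visible in the output.
# "$@": the command to run; the QID is appended as its last argument.
run_until_failure() {
  while IFS= read -r qid; do
    # </dev/null keeps the command from consuming the QID list on stdin.
    if "$@" "$qid" </dev/null; then
      printf 'ok: %s\n' "$qid"
    else
      printf 'failed at: %s\n' "$qid"
      return 1
    fi
  done
}
```

For example, `run_until_failure download_language < languages_asc.txt`, where both `download_language` and `languages_asc.txt` are placeholders for the real download step and the sorted language list.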

Migrate `crontab` from lingualibre.fr to lingualibre.org?

@Jitrixis, @Poslovitch: can either of you explain what ./crontab is for? Which skills are needed?

I suspect it requires MediaWiki / PHP / backend expertise.

/crontab - understanding

  • STUDY: Inspect and define what this script is for
# Run maintenance scripts on the production instance
00 4 * * * /usr/bin/php7.0 /home/www/lingualibre.fr/maintenance/cleanupUploadStash.php > /dev/null 2>&1
00 5 * * * /usr/bin/php7.0 /home/www/lingualibre.fr/maintenance/rebuildLocalisationCache.php > /dev/null 2>&1
# Run maintenance scripts on the testing instance
15 4 * * * /usr/bin/php7.0 /home/www/v2.lingualibre.fr/maintenance/cleanupUploadStash.php > /dev/null 2>&1
00 5 * * * /usr/bin/php7.0 /home/www/v2.lingualibre.fr/maintenance/rebuildLocalisationCache.php > /dev/null 2>&1
# Other stuff
30 2 * * 1 /opt/letsencrypt/letsencrypt-auto renew >> /var/log/le-renew.log
45 2 * * 1 /bin/systemctl reload nginx
30 4 * * * logrotate /etc/logrotate.conf
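
One way to check whether the paths in this crontab still exist on the server is to extract them and test each one. The sketch below is written for that purpose under two assumptions: each entry has the usual five schedule fields, and redirection operators are separated from their targets by spaces. `list_cron_paths` and `report_missing_paths` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only. list_cron_paths and report_missing_paths are hypothetical.

# Read crontab text on stdin and print the absolute paths used in commands.
# Comment lines are skipped; targets of ">" and ">>" (such as /dev/null or
# log files) are ignored.
list_cron_paths() {
  awk '
    /^[ \t]*#/ || NF < 6 { next }
    {
      for (i = 6; i <= NF; i++) {
        if ($(i - 1) == ">" || $(i - 1) == ">>") continue
        if ($i ~ /^\//) print $i
      }
    }
  ' | LC_ALL=C sort -u
}

# Read paths on stdin and print only the ones that do not exist.
report_missing_paths() {
  while IFS= read -r path; do
    [ -e "$path" ] || printf 'missing: %s\n' "$path"
  done
}
```

Run on the server as `list_cron_paths < crontab | report_missing_paths`. Any `missing:` line points to a path that needs updating before the file is migrated.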

/home/www/ actual folder structure and /crontab file's paths

  • @mickeybarber: What are the actual folders and paths inside /home/www/? Please provide the /home/www/ folder's tree structure for 1, 2 or 3 levels, as needed to see whether it matches the various paths in the ./crontab file.

Interpretation of results

  • Does the server folder tree match the paths within the ./crontab file?
  • Should we edit ./crontab and migrate it from lingualibre.fr to the new path values?

Others

@mickeybarber, during the same server exploration you should come across the following items, which we would benefit from documenting better:

  • V2: does the /home/www/v2.lingualibre.fr/ path still exist on the server? I think this LL version 2 is the current lingualibre.org, so I expect this v2 path and folder to be missing because it was renamed to a .org path and folder. Can you see anything like that?
  • dev: https://dev.lingualibre.org/ (dev version) is online and working. Which actual folder and path does it correspond to?

Note: the 5 points above are nearly the same question: share the directory structure with us so we can spot possibly broken paths. This will help us know which URLs are outdated and what to replace them with.

Zip names split by space

Statement

@pamputt reported :

In the name of the archive, the language name is cut when it contains a space. For example, before we had "Q115107-bcl-Central Bikol.zip" and now "Q115107-bcl-Central.zip" (Bikol has disappeared). Is it possible to fix that quickly, or should I open a bug report on Phabricator?

See https://lingualibre.org/datasets/

Pointer

The bug likely comes from this section, create_datasets.sh#L53-L56 (the pseudo regex).
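
The referenced lines are not reproduced here, so the following is only an illustration of one common way a name gets cut at the first space in a shell script: an unquoted variable expansion that is word-split before being used. `archive_name` is a hypothetical helper, not a function from create_datasets.sh.

```shell
#!/bin/sh
# Sketch only. archive_name is a hypothetical helper name.

# Build an archive name from a QID, a language code and a language label.
archive_name() {
  printf '%s-%s-%s.zip\n' "$1" "$2" "$3"
}

label='Central Bikol'

# Unquoted: the shell splits the label into two words, so only "Central"
# arrives as the third argument.
archive_name Q115107 bcl $label      # prints Q115107-bcl-Central.zip

# Quoted: the label is passed as a single argument, space included.
archive_name Q115107 bcl "$label"    # prints Q115107-bcl-Central Bikol.zip
```

If the actual cause is a pattern that only matches up to the first whitespace, the fix is different, but the symptom would be the same truncated file name.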
