jp-tools's Introduction

May be required for some projects:
- Wakan
- SQLite3.pas
- SQLite3Dataset.pas

At runtime:
- sqlite3.dll
- EDICT2
- kanjidic
- radkfile
- ewarodai.txt
- yarxi.db

jp-tools's People

Contributors

himselfv

jp-tools's Issues

Look in EDICT to determine if an example is an example or an expression

Warodai has two types of examples:
- Expressions, e.g. 気になる, which should be converted to EDICT articles.
- Actual examples, e.g. 彼を殺した時に私は安心した (Warodai doesn't actually 
have this line), which should eventually go to a Tanaka Corpus format file.

To separate the two, the only option I can think of for now is to look the 
expression up in the English EDICT.
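
A minimal sketch of that lookup, assuming EDICT lines of the form `KANJI [KANA] /gloss/.../`. The function names `load_edict_keys` and `classify` are hypothetical and not part of jp-tools; anything found as an EDICT headword is treated as an expression, everything else as an example.

```python
def load_edict_keys(lines):
    """Collect kanji and kana headwords from EDICT-style lines."""
    keys = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the first "/" is the headword part.
        head = line.split("/", 1)[0].strip()
        if "[" in head:
            kanji, rest = head.split("[", 1)
            keys.add(kanji.strip())
            keys.add(rest.rstrip("] ").strip())
        else:
            keys.add(head)
    return keys


def classify(phrase, edict_keys):
    """Assumption: a phrase present in EDICT is an expression, not an example."""
    return "expression" if phrase in edict_keys else "example"


keys = load_edict_keys(["気になる [きになる] /(exp,v5r) to weigh on one's mind/"])
print(classify("気になる", keys))
print(classify("彼を殺した時に私は安心した", keys))
```

An exact-match lookup like this will miss expressions that EDICT lists in a different form, so it should be read as a first approximation only.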

Original issue reported on code.google.com by [email protected] on 12 Jan 2013 at 2:12

Convert some templates to tags

Some templates, such as ~suru, should not be applied to the word but replaced 
with the appropriate grammar tags, such as suru-verb. This is the convention 
used by JMDict.

We should perhaps also check all expressions against the same set of rules, 
because an article with only one template may have it precompiled, i.e.

> SOUZOUsuru .... translation

We should notice that SOUZOUsuru matches ~suru and convert it accordingly.
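
As a rough illustration, the check could look like the sketch below. `TEMPLATE_TAGS` and `split_template` are hypothetical names, and the mapping of ~suru to the `vs` tag (the EDICT/JMDict marker for suru-verbs) is the only rule shown; a real rule set would be larger.

```python
# Hypothetical rule table: template suffix -> grammar tag.
TEMPLATE_TAGS = {"する": "vs"}


def split_template(expression, base):
    """If expression is base + a known template suffix, return (base, tag).

    Otherwise return the expression unchanged with no tag.
    """
    for suffix, tag in TEMPLATE_TAGS.items():
        if expression == base + suffix:
            return base, tag
    return expression, None


print(split_template("想像する", "想像"))
print(split_template("想像的", "想像"))
```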

Original issue reported on code.google.com by [email protected] on 9 Apr 2013 at 1:58

Extract grammar/topic markers from the article

There are a lot of informal markers in the articles:

> кн. рыбы.
> спорт. судья; рефери (в борьбе сумо́).
> бот. спора.
> энт. тли, Aphididae.

> ономат.:
> 1) со стуком (падать, ударяться и т. п.);
> 2) жадно, давясь (пить, есть);
> 3) одиноко, потерянно.

We should find these and, in each case, determine whether the marker can be 
safely removed and replaced with the appropriate grammar/topic tags.
If not, we should ban the article in strict mode.
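
One possible shape for this step is sketched below. The `MARKER_TAGS` table and the tag names in it are illustrative assumptions, not an agreed mapping, and `extract_marker` is a hypothetical helper. It only looks at a marker at the very start of the translation.

```python
import re

# Hypothetical mapping from Warodai markers to EDICT-style tags.
MARKER_TAGS = {"спорт.": "sports", "бот.": "bot", "ономат.": "on-mim"}

# An abbreviation ending in "." that is followed by ":" or by more text.
MARKER_RE = re.compile(r"^([а-яё]+\.)(?::\s*|\s+(?=\S))")


def extract_marker(translation, strict=False):
    """Return (tags, remaining_text), or None if the article is banned."""
    m = MARKER_RE.match(translation)
    if not m:
        return [], translation
    tag = MARKER_TAGS.get(m.group(1))
    if tag is None:
        # Unknown marker: ban in strict mode, otherwise leave text untouched.
        return None if strict else ([], translation)
    return [tag], translation[m.end():]


print(extract_marker("спорт. судья; рефери"))
print(extract_marker("кн. рыбы.", strict=True))
```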

Original issue reported on code.google.com by [email protected] on 9 Apr 2013 at 2:04

Tags from EDICT entries are merged together

This is a problem with the Wakan dictionary format, which merges together all 
grammar tags from a list of entries.

But even when this is fixed in Wakan (by converting to EDICT2-style 
dictionaries), we'll have no way of knowing, for the same kana-kanji pair, 
which translation entry in EDICT relates to which translation entry in Warodai.

So we'll have to skip the cases where the grammar flags do not apply to the 
whole article.

At most we can try to find the common flags, the ones repeated in every entry, 
and apply those to the translation as a whole.
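
Finding the common flags amounts to a set intersection. The sketch below assumes each EDICT entry for the kana-kanji pair has already been reduced to a list of its tags; `common_flags` is a hypothetical name.

```python
def common_flags(entries):
    """Return the tags that appear in every entry's tag list."""
    sets = [set(tags) for tags in entries]
    return set.intersection(*sets) if sets else set()


# Two senses of the same kana-kanji pair: only "n" is shared.
print(common_flags([["n", "vs"], ["n", "adj-no"]]))
```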

Original issue reported on code.google.com by [email protected] on 4 Jan 2013 at 2:19

Square brackets sometimes appear in kana-kanji

Ex.:

第一[の] [だいいち[の]] /(adv,n) первый; перен. наилучший/(P)/
第一[に] [だいいち[に]] /(adv,n) во-первых; прежде всего; первым делом/(P)/

This confuses EDICT parsers and they treat [の] as a reading.
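
One way to avoid this, sketched under the assumption that the bracketed part is optional and that emitting both variants is acceptable (the issue does not say which fix is intended), is to expand the brackets before writing the headword. `expand_optional` is a hypothetical helper.

```python
import re


def expand_optional(text):
    """Expand "[...]" optional parts into variants without and with them."""
    m = re.search(r"\[([^\[\]]*)\]", text)
    if not m:
        return [text]
    before, after = text[:m.start()], text[m.end():]
    # Recurse so that several optional parts in one headword are all expanded.
    return (expand_optional(before + after)
            + expand_optional(before + m.group(1) + after))


print(expand_optional("第一[の]"))
print(expand_optional("だいいち[の]"))
```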

Original issue reported on code.google.com by [email protected] on 12 Jan 2013 at 2:08

Parse source language information

Articles sometimes have source language information (examples are not 
exhaustive):

> (фр. adieu) до свидания!, прощай[те]!

> (ит. alto)
> 1) альт, контральто (голос);
> 2) альт (инструмент).

> (кит. цзяоцза) пельмени.

We should extract it and store it in JMDict's <lsource> tag and in the 
appropriate EDICT/2 forms:

> <sense>
> <lsource xml:lang="dut">ontembaar</lsource>
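
A sketch of the extraction for the JMDict side, assuming the source note always comes first in the translation in the form `(abbr. word)`. The `LANG_CODES` table is an assumed mapping from Warodai abbreviations to three-letter language codes, and `extract_lsource` is a hypothetical name.

```python
import re
from xml.sax.saxutils import escape

# Assumed mapping: Warodai language abbreviation -> three-letter code.
LANG_CODES = {"фр.": "fre", "ит.": "ita", "кит.": "chi"}

SOURCE_RE = re.compile(r"^\(([а-яё]+\.)\s+([^)]+)\)\s*")


def extract_lsource(translation):
    """Return (lsource_xml, remaining_text); lsource_xml is None if absent."""
    m = SOURCE_RE.match(translation)
    if not m or m.group(1) not in LANG_CODES:
        return None, translation
    lang = LANG_CODES[m.group(1)]
    xml = '<lsource xml:lang="%s">%s</lsource>' % (lang, escape(m.group(2)))
    return xml, translation[m.end():]


print(extract_lsource("(фр. adieu) до свидания!, прощай[те]!"))
```

Unknown abbreviations are left in the text rather than guessed at, so they can be reviewed by hand.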

Original issue reported on code.google.com by [email protected] on 9 Apr 2013 at 1:53
