
lingtrain-aligner's Introduction

Hi there, I'm Sergei 👋


  • 🚀 Working in the field of ML and MLOps.
  • 🌱 My main interest in this area is NLP.
  • 😄 Besides my work I like to learn languages (Chinese, Russian, English, German, Czech, Hungarian, Japanese).
  • 💬 Ask me how to pronounce "Köszönöm" in Hungarian and what 侍 means.
  • 🖋️ I also write articles from time to time.

Channels

Habr

Medium

lingtrain-aligner's People

Contributors

antimirov, averkij

lingtrain-aligner's Issues

Unintended retention of '%%%%%' term after forced paragraph separation

I am using the "%%%%%." term to force lines into separate paragraphs. Here is an example:

Neil%%%%%h5.
We’ll discuss that more in a moment and find out if chatbots really think for themselves. But first I have a question for you, Rob. The first computer program that allowed some kind of plausible conversation between humans and machines was invented in 1966, but what was it called? Was it:

a) ALEXA %%%%%.

b) ELIZA %%%%%.

c) PARRY %%%%%.

While this process successfully separates the items into different paragraphs, it doesn't remove the "%%%%%" term, which is consequently retained in the final document.
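Until the marks are removed by the tool itself, one possible workaround is to strip them from the exported text in a post-processing step. The sketch below is only an illustration: the helper name `strip_marks` is hypothetical, and the pattern assumes a mark is always five percent signs, an optional tag such as `h5`, and a closing period, as in the example above.

```python
import re

# Assumed mark shape: optional whitespace, "%%%%%", optional tag (e.g. "h5"), then "."
MARK = re.compile(r"\s*%{5}\w*\.")


def strip_marks(line):
    """Remove every assumed '%%%%%' mark from a line of exported text."""
    return MARK.sub("", line).rstrip()


for raw in ["Neil%%%%%h5.", "a) ALEXA %%%%%.", "b) ELIZA %%%%%."]:
    print(strip_marks(raw))
```

Lines without a mark pass through unchanged, so the helper can be applied to the whole exported document.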


File Already Exists

I run
docker pull lingtrain/aligner:v4
I upload a text file and...

image

After a warning like this one, nothing happens.
Moreover, it pops up for any text file.

An error when I use "splitter.split_by_sentences_wrapper", please help check it

When I call "splitted_from = splitter.split_by_sentences_wrapper(text1_prepared, lang_from)", it returns a list.

However, inserting into SQLite then fails with the following error:

File "ling_test.py", line 36, in
aligner.fill_db(db_path, splitted_from, splitted_to)
File "lingtrain_aligner/aligner.py", line 498, in fill_db
db.executemany("insert into languages(key, val) values(?,?)", [("from", lang_from), ("to", lang_to)])
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
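One plausible reading of this traceback is that a non-string value, such as the list returned by the splitter, ended up in the `val` position of the insert. The standalone sketch below reproduces that kind of binding failure with the standard `sqlite3` module only. The table schema and the helper `insert_languages` are assumptions made for illustration and are not taken from lingtrain-aligner; the exact exception class may also differ between Python versions, so the sketch catches the common base class `sqlite3.Error`.

```python
import sqlite3


def insert_languages(db, lang_from, lang_to):
    """Run the same insert as in the traceback; return None on success, else the error."""
    try:
        db.executemany(
            "insert into languages(key, val) values(?,?)",
            [("from", lang_from), ("to", lang_to)],
        )
        return None
    except sqlite3.Error as exc:
        return exc


db = sqlite3.connect(":memory:")
# Hypothetical schema, just enough for the insert statement to run.
db.execute("create table languages(key text, val text)")

# Plain strings (language codes) bind without problems.
ok = insert_languages(db, "en", "ru")

# A list in place of a language code cannot be bound as a parameter.
bad = insert_languages(db, ["Sentence one.", "Sentence two."], ["Sentence three."])

print("strings:", ok)
print("lists:", type(bad).__name__)
```

If this is what happens here, checking the types of the arguments passed to `aligner.fill_db` before the call would be a reasonable first step.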

Add text splitting into small parts

The current version ignores the H1-H5 headers added by the user. But when a book is translated, the text of chapter 1 is translated as the text of chapter 1 in the other language.
You can use this fact to split a big text into small parts.

Next idea: try to split a big text into small blocks automatically.
Select a few sentences from the original text (for example, 10 sentences) and, in a loop, try to find the matching block in the translated text.

You can use the following pseudocode:

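# original_sentences / translated_sentences are assumed to hold sentence embeddings
left_array = original_sentences[100:110]
scores = {}
for i in range(50, 150):
    right_array_candidate = translated_sentences[i:i + 10]
    # sum of cosine similarities between sentences at the same position in both blocks
    scores[i] = sum(cosine_similarity(l, r) for l, r in zip(left_array, right_array_candidate))

# start index of the candidate window with the highest score
right_index = max(scores, key=scores.get)

left_text_split_index = 100           # start of left_array
right_text_split_index = right_index  # start of the best-matching window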
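The idea can also be tried end to end on toy data. The sketch below is not part of lingtrain-aligner: `cosine_similarity`, `find_matching_block` and `toy_embedding` are hypothetical helpers, and the two-dimensional vectors stand in for real sentence embeddings, which would normally come from a multilingual embedding model. It also assumes that sentences at the same position inside a block correspond to each other.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_matching_block(left_block, translated, search_start, search_end):
    """Return the start index of the window in `translated` most similar to `left_block`."""
    size = len(left_block)
    best_index, best_score = None, float("-inf")
    # Do not let the window run past the end of the translated text.
    last_start = min(search_end, len(translated) - size + 1)
    for i in range(search_start, last_start):
        window = translated[i:i + size]
        score = sum(cosine_similarity(l, r) for l, r in zip(left_block, window))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def toy_embedding(k):
    """Stand-in for a real sentence embedding: a unit vector that rotates with k."""
    return (math.cos(0.3 * k), math.sin(0.3 * k))


original = [toy_embedding(k) for k in range(20)]
# The toy "translation" has three extra sentences at the start,
# so translated[j] corresponds to original[j - 3].
translated = [toy_embedding(k - 3) for k in range(23)]

left_start = 5
right_start = find_matching_block(original[left_start:left_start + 3], translated, 0, len(translated))
print(left_start, right_start)
```

The pair of start indices gives one candidate split point in each text. Restricting the search range, as the pseudocode does with 50 to 150, keeps the loop cheap on long books.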
