epub-translator's Introduction

epub-translator is a tool that translates EPUB files using large language models. It attempts to convert HTML markup into plain text, which is then sent to an LLM for translation.

"Kusamakura" by Natsume Sōseki, translated into English with gpt-3.5-turbo. Licensed under CC-BY-SA 3.0.

Getting Started

  1. Populate config.json with your API key and modify the prompts to suit your needs.

  2. Unzip an EPUB file into the input folder.

  3. Run the executable, or run from source:

         npm install
         npm run start

  4. Follow the on-screen instructions.

  5. The translated files should appear in the output folder.

Config

Key                    Type    Description
OPENAI_API_KEY         string  OpenAI API key.
PROMPTS.TEXT           string  Prompt for translating paragraphs that are free of HTML tags.
PROMPTS.HTML           string  Prompt for translating paragraphs that are mixed with HTML tags.
PROMPTS.PASSTHROUGH    string  Prompt for translating raw HTML with no pre-processing.
PROMPTS.SENTENCE       string  Prompt for translating a sentence that is free of HTML tags.
PROMPTS.SENTENCE_HTML  string  Prompt for translating a sentence that is mixed with HTML tags.
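
As a rough illustration of the keys above, the sketch below builds an example configuration object and checks that every key is present. It assumes that the dotted names (such as PROMPTS.TEXT) correspond to a nested PROMPTS object in config.json; the prompt wording, the placeholder API key, and the findConfigProblems helper are all hypothetical and not part of the tool.

```javascript
// Hypothetical example of the config.json contents, assuming the dotted keys
// in the table map to a nested PROMPTS object. Prompt texts are placeholders.
const exampleConfig = {
  OPENAI_API_KEY: "sk-your-key-here",
  PROMPTS: {
    TEXT: "Translate the following text into English.",
    HTML: "Translate the following text into English. Keep the HTML tags intact.",
    PASSTHROUGH: "Translate the following HTML into English. Keep the markup intact.",
    SENTENCE: "Translate the following sentence into English.",
    SENTENCE_HTML: "Translate the following sentence into English. Keep the HTML tags intact.",
  },
};

const PROMPT_KEYS = ["TEXT", "HTML", "PASSTHROUGH", "SENTENCE", "SENTENCE_HTML"];

// Returns the names of keys that are missing or not non-empty strings.
// An empty array means every key from the table is filled in.
function findConfigProblems(config) {
  const cfg = config || {};
  const problems = [];
  const isNonEmptyString = (v) => typeof v === "string" && v.trim().length > 0;

  if (!isNonEmptyString(cfg.OPENAI_API_KEY)) {
    problems.push("OPENAI_API_KEY");
  }
  const prompts = cfg.PROMPTS || {};
  for (const key of PROMPT_KEYS) {
    if (!isNonEmptyString(prompts[key])) {
      problems.push(`PROMPTS.${key}`);
    }
  }
  return problems;
}

console.log(findConfigProblems(exampleConfig)); // []
```

Running a check like this before starting a long translation can catch a forgotten prompt early, before any API calls are made.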

Options

Side by Side Mode

Default: No

Display the original and translated text side by side. This is a naive line printer: it prints each original line first and the corresponding translated line second. Because the translation may contain a different number of lines than the source, the pairs can fall out of alignment. Lowering the prompt character limit may help if you want the lines to stay strictly in order.
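
The behaviour described above can be sketched as a simple line interleaver. This is a minimal illustration, not the tool's actual code; the interleaveLines function is a hypothetical name, and it assumes both texts are split on newline characters.

```javascript
// Naive interleaving: original line i, then translated line i.
// Lines without a counterpart are emitted on their own.
function interleaveLines(originalText, translatedText) {
  const original = originalText.split("\n");
  const translated = translatedText.split("\n");
  const out = [];
  const count = Math.max(original.length, translated.length);
  for (let i = 0; i < count; i++) {
    if (i < original.length) out.push(original[i]);
    if (i < translated.length) out.push(translated[i]);
  }
  return out.join("\n");
}

// The model merged source lines 2 and 3 into one translated line,
// so source line 3 ends up without a partner.
console.log(interleaveLines("一行目\n二行目\n三行目", "Line one\nLines two and three"));
// 一行目
// Line one
// 二行目
// Lines two and three
// 三行目
```

The example shows why alignment drifts: the pairing is purely positional, so any merged or split line shifts every pair after it within the same batch.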

Remove Ruby Annotations

Default: Yes

Remove <ruby> and <rt> tags from the source material before sending it for translation. This might affect the quality of the translation.
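
A minimal sketch of this kind of pre-processing is shown below. It is a regex-based approximation rather than the tool's actual implementation: the stripRuby name is hypothetical, and it assumes that the reading inside <rt> is discarded along with the tag, that <rp> fallback parentheses are dropped too, and that every tag is explicitly closed, as is required in EPUB XHTML.

```javascript
// Remove ruby annotations while keeping the base text.
function stripRuby(html) {
  return html
    // Drop the reading (<rt>...</rt>) and fallback parentheses (<rp>...</rp>).
    .replace(/<rt\b[^>]*>[\s\S]*?<\/rt>/gi, "")
    .replace(/<rp\b[^>]*>[\s\S]*?<\/rp>/gi, "")
    // Unwrap <ruby> and <rb>, leaving their contents in place.
    .replace(/<\/?(?:ruby|rb)\b[^>]*>/gi, "");
}

console.log(stripRuby("<p><ruby>草枕<rt>くさまくら</rt></ruby>を読む</p>"));
// <p>草枕を読む</p>
```

Regular expressions are adequate here only because ruby markup is shallow; a real HTML parser would be the safer choice for arbitrary input.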

Model

Default: gpt-3.5-turbo

Possible values: gpt-3.5-turbo, gpt-3.5-turbo-0613, gpt-4, gpt-4-0314 or gpt-4-0613

Recommended: gpt-3.5-turbo or gpt-4-0314

gpt-3.5-turbo offers reasonable translation quality at a low cost. gpt-4-0314 provides better translation quality and behaves better overall, especially when handling text that contains HTML markup. gpt-4 and gpt-4-0613 are not recommended, as their translation results appear to be worse than those of the older GPT-4 model.

Prompt Character Limit

Default: 1024

NOTE: This limit is measured in characters, not tokens. As a rule of thumb, assume that each Japanese character equals 1 to 1.2 tokens and that 750 English words equal roughly 1000 tokens.
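
The rule of thumb above can be turned into a quick estimate. The two functions below are hypothetical helpers that only apply the ratios from the note; they are not a real tokenizer, and actual token counts will vary by model and text.

```javascript
// Estimate tokens for Japanese text: 1 to 1.2 tokens per character.
function estimateJapaneseTokens(text) {
  const chars = Array.from(text).length; // count code points, not UTF-16 units
  return { low: chars, high: Math.ceil((chars * 12) / 10) };
}

// Estimate tokens for English text: 750 words are roughly 1000 tokens.
function estimateEnglishTokens(text) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil((words * 1000) / 750);
}

// A prompt filled to the default limit of 1024 Japanese characters:
console.log(estimateJapaneseTokens("あ".repeat(1024))); // { low: 1024, high: 1229 }
```

By this estimate, a prompt at the default limit costs somewhere between 1024 and 1229 input tokens, before counting the instructions in the prompt itself and the translated output.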

This setting limits how many characters are accumulated before being sent to the LLM API for translation. Each model has its own token limit, and that limit is shared between input and output. Therefore, even though GPT-4 supports 8k tokens, you should not fill the prompt to its full capacity; otherwise, no space is left for the response. Although a longer prompt might be expected to give better results, GPT seems to lose focus when processing long messages and tends to repeat itself. The limit may be exceeded if the source material contains HTML blocks that cannot be split.
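
One way to picture the accumulation described above is a greedy batcher. This is a sketch under assumptions, not the tool's code: batchBlocks is a hypothetical name, blocks stand for paragraphs or unsplittable HTML fragments, and length is measured with JavaScript's string length (UTF-16 units).

```javascript
// Group blocks into batches whose combined length stays within `limit`.
// A single block longer than the limit is kept whole, so its batch
// exceeds the limit, mirroring the unsplittable-HTML caveat.
function batchBlocks(blocks, limit = 1024) {
  const batches = [];
  let current = [];
  let currentLength = 0;
  for (const block of blocks) {
    // Flush the current batch if adding this block would overflow it.
    if (current.length > 0 && currentLength + block.length > limit) {
      batches.push(current);
      current = [];
      currentLength = 0;
    }
    current.push(block);
    currentLength += block.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

console.log(batchBlocks(["aaaa", "bbbb", "cc"], 8));
// [ [ 'aaaa', 'bbbb' ], [ 'cc' ] ]
```

Each batch would then become one request, so a lower limit means more requests with shorter prompts, which is the trade-off the paragraph above describes.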

To Do

  • Write tests
  • Parse <nav> element

Contributors

slyh2, slyh
