
turjuman's Introduction





Turjuman is a neural machine translation toolkit. It translates from 20 languages into Modern Standard Arabic (MSA). Turjuman is described in this paper: TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation.

Turjuman builds on our AraT5 model, which gives it a strong ability to decode into Arabic. The toolkit supports several decoding methods (greedy search, beam search, and sampling), which also makes it suitable for generating paraphrases of the MSA translations.
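To make the difference between the decoding methods concrete, here is a minimal, self-contained sketch of greedy search, beam search, and sampling over a toy next-token table. It is not Turjuman's implementation: the table `TOY_MODEL`, its tokens, and its probabilities are invented for illustration, and the function names are hypothetical.

```python
import math
import random

# Hypothetical toy "model": probability of the next token given the previous one.
# All tokens and numbers are invented for illustration only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.45, "dog": 0.55},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def strip_special(tokens):
    """Drop the start and end markers from a token sequence."""
    return [t for t in tokens if t not in ("<s>", "</s>")]


def greedy_decode(model, max_len=10):
    """Pick the single most probable next token at every step."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = model[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
    return strip_special(tokens)


def beam_decode(model, beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences; return them best first."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            if toks[-1] == "</s>":
                candidates.append((logp, toks))  # finished hypothesis, keep as is
                continue
            for tok, p in model[toks[-1]].items():
                candidates.append((logp + math.log(p), toks + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(toks[-1] == "</s>" for _, toks in beams):
            break
    return [strip_special(toks) for _, toks in beams]


def sample_decode(model, rng, max_len=10):
    """Draw each next token at random according to its probability."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = model[tokens[-1]]
        tokens.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return strip_special(tokens)


greedy = greedy_decode(TOY_MODEL)
n_best = beam_decode(TOY_MODEL, beam_width=2)
sampled = sample_decode(TOY_MODEL, random.Random(0))
print(greedy, n_best, sampled)
```

In this toy table, greedy search commits to the locally best first token and ends with a less probable sentence than the one beam search finds. Beam search also returns several ranked hypotheses, and sampling can return a different sentence on each run, which is why these methods are useful for collecting alternative phrasings.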


Requirements and Installation

  • To install turjuman using pip:
    pip install -U turjuman
  • To install turjuman directly from the GitHub repo using pip:
    pip install -U git+https://github.com/UBC-NLP/turjuman.git
  • To install turjuman from a local clone (for development):
    git clone https://github.com/UBC-NLP/turjuman.git
    cd turjuman
    pip install .
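After any of the installation routes above, it can be useful to confirm that the package is visible to the current Python environment. The following is a small sketch using only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution name `turjuman` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("turjuman")
if version is None:
    print("turjuman is not installed in this environment")
else:
    print("turjuman version:", version)
```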

Getting Started

The full documentation contains instructions for getting started, translating with different methods, and integrating Turjuman with your code, and it provides more examples.

Colab Examples

(1) Command Line Interface

Command: turjuman_translate
  • Usage and Arguments
  • Translate using greedy search
  • Translate using beam search (default)
  • Translate using sampling search
  • Read and translate text from file
  Colab link: colab

Command: turjuman_interactive
  • Usage and Arguments
  • Examples
  Colab link: colab

Command: turjuman_score
  • Usage and Arguments
  • Input files format
  • Example
  Colab link: colab
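The command-line tools can also be driven from a script. The sketch below wraps `turjuman_translate` with Python's `subprocess` module. It uses only the `--text` flag, which is the one that appears in the Colab example on this page; other options are covered under "Usage and Arguments" and are not guessed at here. The helper names `build_translate_command` and `run_translate` are hypothetical.

```python
import shutil
import subprocess


def build_translate_command(text):
    """Build the argument list for a single-sentence translation."""
    # Only --text is used; it is the flag shown in the Colab example.
    return ["turjuman_translate", "--text", text]


def run_translate(text):
    """Run turjuman_translate and return whatever it prints to stdout."""
    if shutil.which("turjuman_translate") is None:
        raise FileNotFoundError(
            "turjuman_translate is not on PATH; install turjuman first"
        )
    completed = subprocess.run(
        build_translate_command(text),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


command = build_translate_command("How are you?")
print(command)
```

Passing the arguments as a list avoids shell quoting problems with text that contains quotes or question marks.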

(2) Integrate Turjuman with your Python code

Functions: translate, translate_from_file
  • Install Turjuman
  • Initialize the turjuman object
  • Translate using greedy search
  • Translate using beam search (default)
  • Translate using sampling search
  • Read and translate text from file
  Colab link: colab

License

turjuman(-py) is Apache-2.0 licensed. The license applies to the pre-trained models as well.

Citation

If you use the TURJUMAN toolkit or the pre-trained models in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):

@inproceedings{nagoudi-osact5-2022-turjuman,
  title = {TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation},
  author = {Nagoudi, El Moatez Billah and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad},
  booktitle = {Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5)},
  month = {June},
  year = {2022},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
}

Acknowledgments

We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), ComputeCanada (CC), UBC ARC-Sockeye and Advanced Micro Devices, Inc. (AMD). Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSERC, SSHRC, CFI, CC, AMD, or UBC ARC-Sockeye.

turjuman's People

Contributors

elmadany, nagoudi


turjuman's Issues

Token ids generated instead of translation

Hey there, I hope you're doing fine.
When I run turj.translate, it returns the token ids instead of the actual translation (see the output below):
2022-07-07 10:41:43 | INFO | turjuman.translate | Using beam search
tensor([[ 0, 6538, 2, 76, 6380, 1]])
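For context, and not as a confirmed diagnosis: the tensor above looks like raw generation output that has not yet been mapped back to text. With Hugging Face tokenizers that mapping is usually done with `tokenizer.decode(ids, skip_special_tokens=True)`. The sketch below imitates that step with a toy vocabulary so it runs without any model. The special ids follow the T5 convention (0 = pad, 1 = end of sequence, 2 = unknown), which is assumed here to carry over to AraT5; the pieces in `TOY_VOCAB` are placeholders, not the real tokens behind those ids.

```python
# T5-style special ids; assumed (not verified) to match the model in question.
SPECIAL_IDS = {0: "<pad>", 1: "</s>", 2: "<unk>"}

# Placeholder vocabulary: the real pieces for these ids are unknown here.
TOY_VOCAB = {6538: "piece_a", 76: "piece_b", 6380: "piece_c"}


def decode_ids(ids, id_to_piece, skip_special=True):
    """Map token ids to text pieces, optionally dropping special tokens."""
    pieces = []
    for token_id in ids:
        if token_id in SPECIAL_IDS:
            if not skip_special:
                pieces.append(SPECIAL_IDS[token_id])
            continue
        # Unknown ids are kept visible instead of being silently dropped.
        pieces.append(id_to_piece.get(token_id, f"<id:{token_id}>"))
    return " ".join(pieces)


decoded = decode_ids([0, 6538, 2, 76, 6380, 1], TOY_VOCAB)
print(decoded)
```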

Error in Colab Example

When running the following command, the error presented below is raised:

# Beam search is the default generation method on Turjuman
!turjuman_translate --text "As US reaches one million COVID deaths, how are Americans coping?"

IndexError: too many indices for tensor of dimension 2

/usr/local/bin/turjuman_translate:8 in <module>

    5 from turjuman_cli.translate import translate_cli
    6 if __name__ == '__main__':
    7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 8     sys.exit(translate_cli())
    9

/usr/local/lib/python3.9/dist-packages/turjuman_cli/translate.py:76 in translate_cli

    73
    74     torj = turjuman(logger, args.cache_dir)
    75     if input_source=="text":
  ❱ 76         torj.translate_from_text (args.text, args.search_method, args.s
    77     elif input_source=="file":
    78         torj.translate_from_file (args.input_file, args.search_method,
    79

/usr/local/lib/python3.9/dist-packages/turjuman/turjuman.py:93 in translate_from_text

    90         outputs = self.translate(sources, search_method, seq_length, m
    91
    92         if max_outputs==1:
  ❱ 93             targets = outputs['target'][0]
    94         else:
    95             targets = outputs[str(max_outputs)+'_targets'][0]
    96         if type(targets) == list:
