
IMK_E-legal_Intern

Live Demo: https://lexiapp.netlify.app/

Introduction Video:

(intro video)

Front-end

Features

  • Modern Vibe All the Way Around 💖
  • Customizable Bot 🤖
  • Auto-Save Your Chat Locally 🔐

Previews

(preview screenshots)

To Run

  • clone the repo and cd into the project folder
  • create a .env file under the e-legal-intern folder with the content:
      REACT_APP_OPENAI_API_KEY=your-key
    
  • run npm install
  • run npm start

Next...

  • Export chat history to PDF
  • Integrate with the real engine instead of OpenAI
  • Get accurate answers from the engine
  • Deploy to the cloud

Artificial Intelligence part

Data scraping

We use https://njt.hu/ as the source for scraping. We scraped the Well-known Legislation section (all of its pages) and extracted laws published from 1950 to 2010.
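
A minimal sketch of what that scraping step could look like with requests and BeautifulSoup (the CSS selector and helper name are illustrative assumptions, not the actual markup of njt.hu):

    import requests
    from bs4 import BeautifulSoup

    def scrape_law_page(url: str) -> str:
        """Download one legislation page from njt.hu and return its plain text."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Hypothetical container class; the real page structure differs
        body = soup.find("div", class_="law-text")
        return body.get_text(separator="\n") if body else soup.get_text(separator="\n")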

Data cleaning

Once the data has been collected, it needs to be preprocessed by cleaning, tokenizing, and formatting it so that it is in a suitable form for training. This step typically involves removing special characters, stop words, and other noise, as well as converting the text into a numerical format that can be fed into the model.
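
A rough sketch of this cleaning step (the regular expression and the stop-word list are illustrative; the project's actual preprocessing rules are not documented here):

    import re

    # Illustrative stop words; a full Hungarian stop-word list would be used in practice
    STOP_WORDS = {"a", "az", "és", "hogy", "nem", "is"}

    def clean_text(raw: str) -> str:
        """Lowercase, strip special characters, collapse whitespace, drop stop words."""
        text = raw.lower()
        text = re.sub(r"[^a-záéíóöőúüű0-9\s]", " ", text)  # remove special characters
        text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
        return " ".join(word for word in text.split() if word not in STOP_WORDS)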

Tokenization

We divided the data into 6 files, each containing 10,000 words, and built a custom tokenization model for the Hungarian well-known laws.
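
A sketch of how such a custom tokenizer can be trained on those files with the Hugging Face Tokenizers library (the file names, vocabulary size, and output directory are assumptions):

    from tokenizers import ByteLevelBPETokenizer

    # The six 10,000-word files produced above (names assumed)
    files = [f"data/laws_{i}.txt" for i in range(1, 7)]

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=files,
        vocab_size=52_000,  # assumed vocabulary size
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.save_model("hungarian-law-tokenizer")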

Model Architecture

RobertaTokenizerFast is a fast and efficient tokenizer designed specifically for the RoBERTa model, a variant of the popular BERT (Bidirectional Encoder Representations from Transformers) model. The tokenizer is implemented in Python using the Hugging Face Tokenizers library and is optimized for speed and memory efficiency. It works by splitting the input text into words or subwords and mapping each word or subword to a corresponding integer ID. This allows the input text to be represented as a sequence of integers that can be fed into the RoBERTa model for processing. Compared to other tokenizers, RobertaTokenizerFast can tokenize large amounts of text quickly and efficiently; it also supports a wide range of languages and handles complex input text with ease.

A masked attention model is a type of neural network architecture commonly used in natural language processing tasks such as machine translation or language modeling. It relies on a self-attention mechanism, which allows the model to weigh the importance of different parts of an input sequence when generating an output. The term "masked" refers to a technique used to prevent the model from attending to certain parts of the input sequence during training or inference. This is typically done to handle sequences of variable length, such as sentences of different lengths in a language model or source and target sequences of different lengths in a machine translation task.

In a masked attention model, a mask is applied to the attention weights in the self-attention mechanism to prevent the model from attending to certain positions in the input sequence. For example, in a language modeling task the model may be trained to predict the next word in a sentence given the previous words; during training it should not have access to future words in the sequence, so a mask is applied to the attention weights to block attention to future positions. Masked attention has been shown to improve the performance of neural network models on natural language processing tasks, particularly those involving variable-length input sequences. Popular implementations include the Transformer model and its variants.

The training parameters (a training sketch using these values follows the list):

  1. Epochs=50
  2. Learning_rate=0.04
  3. per_device_train_batch_size=100
  4. per_device_eval_batch_size=100
  5. save_steps=8192
  6. eval_steps=4096
  7. save_total_limit=1
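
A minimal sketch of how these hyperparameters could be plugged into a Hugging Face Trainer for masked-language-model pretraining of a RoBERTa-style model (model size, file paths, and dataset handling are assumptions; only the listed hyperparameter values come from the project):

    from transformers import (
        DataCollatorForLanguageModeling,
        LineByLineTextDataset,
        RobertaConfig,
        RobertaForMaskedLM,
        RobertaTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    # Load the custom tokenizer trained on the Hungarian law corpus (path assumed)
    tokenizer = RobertaTokenizerFast.from_pretrained("hungarian-law-tokenizer", model_max_length=512)

    # A RoBERTa configuration built from scratch; the actual model size is not stated here
    config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=514)
    model = RobertaForMaskedLM(config=config)

    # Corpus file path assumed; 15% of tokens are masked for the MLM objective
    dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="data/laws_all.txt", block_size=128)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    training_args = TrainingArguments(
        output_dir="./hungarian-law-roberta",
        num_train_epochs=50,
        learning_rate=0.04,
        per_device_train_batch_size=100,
        per_device_eval_batch_size=100,
        save_steps=8192,
        eval_steps=4096,
        save_total_limit=1,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
    trainer.train()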

Future

  1. Having an up-to-date training dataset is very important for model quality and results. We are planning to deploy the model to the cloud (AWS, Azure, ...) and implement a Celery task that scrapes the website for new data, so the model can be retrained on each new year's laws; a sketch of such a task follows.
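
One possible shape for that scheduled job with Celery beat (the broker URL, module name, schedule, and the scrape/retrain helpers are placeholder assumptions for illustration):

    from celery import Celery
    from celery.schedules import crontab

    # Broker URL is an assumption; any Celery-supported broker would work
    app = Celery("tasks", broker="redis://localhost:6379/0")

    def scrape_new_laws() -> list[str]:
        """Placeholder: would reuse the njt.hu scraper above to collect newly published laws."""
        return []

    def retrain_model(texts: list[str]) -> None:
        """Placeholder: would rebuild the dataset and rerun the Trainer setup shown earlier."""

    @app.task
    def scrape_and_retrain():
        new_laws = scrape_new_laws()
        if new_laws:
            retrain_model(new_laws)

    # Hypothetical schedule: refresh once a year, after the new year's laws are published
    app.conf.beat_schedule = {
        "yearly-law-refresh": {
            "task": "tasks.scrape_and_retrain",  # assumes this module is named tasks.py
            "schedule": crontab(minute=0, hour=3, day_of_month=1, month_of_year=1),
        },
    }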

Contributors

  • cagyj
  • sarahelmasryy
  • abdulrahimiliasu
