Code Monkey home page Code Monkey logo

himanshumahajan138 / similar_document_template_matching_algorithm_flask Goto Github PK

View Code? Open in Web Editor NEW
1.0 2.0 0.0 174.67 MB

This system utilizes Optical Character Recognition (OCR) extracts text, while computer vision techniques map document layout. Then, SIFT (Scale-Invariant Feature Transform) cleverly matches documents to pre-defined templates, even with variations. This intelligent matching helps identify potential fraud for further investigation.

Python 57.51% HTML 17.01% JavaScript 12.98% CSS 12.50%
document-matching fraud-detection nlp-machine-learning ocr-recognition opencv python similar-document automation flask machienlearning

similar_document_template_matching_algorithm_flask's Introduction

Similar Document Template Matching for Fraud Detection Using Flask

Introduction

This repository contains the source code for a document template matching algorithm designed to automate document processing and enhance fraud detection capabilities in the healthcare domain. The project addresses the challenge faced by organizations like Bajaj Finserv Health Ltd., which manage a high volume of medical invoices, prescriptions, and lab reports from various providers and customers.

Project Goals

  • Automate document processing tasks (text extraction, layout analysis)
  • Identify similar document templates across diverse formats
  • Analyze discrepancies within similar templates for potential fraud detection

Technical Stack

  • Front-End (Optional):
    • HTML, CSS, JavaScript (for user interface - optional)
  • Back-End:
    • Python (core scripting language)
    • Flask (web framework for handling requests and responses)
    • Jinja Templating (for dynamic web pages)
  • Computer Vision Libraries:
    • OpenCV (cv2): Image manipulation, SIFT-based template matching, layout analysis
    • EasyOCR: Optical Character Recognition (OCR) for text extraction
  • Multithreading (Optional): Improves processing efficiency for handling multiple documents concurrently

Project Functionality

  1. User Interface (Optional): Users can upload documents and select document types (invoice, prescription, lab report).
  2. Input Validation: The system validates the uploaded file format and size.
  3. Feature Extraction:
    • Text Extraction: EasyOCR extracts text content from the uploaded document using OCR.
    • Layout Analysis: OpenCV (cv2) analyzes the document layout, potentially identifying visual elements.
  4. Feature Vector Creation: Extracted text, layout information, and visual elements (if applicable) are combined into a feature vector representing the document.
  5. Template Matching (SIFT): The document's feature vector is compared against pre-defined template representations for each document type using SIFT in OpenCV. A similarity score is calculated based on the comparison.
  6. Match Evaluation:
    • High Similarity: Documents with a high similarity score exceeding a predefined threshold proceed to fraud analysis.
    • Low Similarity: Documents with low similarity scores are likely not matches and the user is notified.
  7. Fraud Analysis (High Similarity): Critical regions specific to the document type (e.g., patient details in prescriptions) are analyzed for discrepancies.
  8. Fraud Detection: Significant discrepancies flag the document for further manual review as potential fraud.
  9. Display Results: The user is shown the outcome:
    • Successful match with the specific template type (if applicable).
    • No match found.
    • Flagged for potential fraud review (if applicable).

Benefits

  • Reduced Manual Processing Time: Automates document processing tasks, freeing up staff resources.
  • Improved Accuracy: Consistent and automated template matching reduces human error in document classification.
  • Enhanced Fraud Detection: Identifies potential fraudulent claims based on template discrepancies.
  • Scalability and Adaptability: Designed for diverse document formats and can be expanded to include machine learning models for automatic classification or handling complex variations.

Requirements

This project requires the following Python libraries:

  • Flask
  • Jinja2
  • OpenCV-python
  • easyocr

Installation

  1. Make sure you have Python 3.x installed.

  2. Open a terminal or command prompt and navigate to the root directory of your project.

  3. Install the required libraries using pip:

    pip install -r requirements.txt

Running the Application (Using Flask)

  1. Create a file named requirements.txt in your project directory and add the following lines:

    Flask
    Jinja2
    OpenCV-python
    easyocr
    
  2. Run the following command in your terminal:

    pip install -r requirements.txt
  3. Assuming your main application script is named flask_app.py, run the Flask development server using:

    flask run

    This will start the server and make the application accessible by default at http://127.0.0.1:5000/.

Further Development

  • Integrate machine learning models for automatic template classification.
  • Enhance the system's ability to learn complex variations in document formats.
  • Explore cloud deployment options for scalability.

Contributing

We welcome contributions to this project! Please refer to the CONTRIBUTING.md file (if available) for guidelines on how to contribute code, report issues, and participate in the development process.

License

This project is licensed under the MIT License.

Disclaimer

This project is for educational purposes only and is not intended for production use without further testing and validation.

similar_document_template_matching_algorithm_flask's People

Contributors

himanshumahajan138 avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.