TextractAI is a Python-based project that extracts text from PDF files, processes the extracted text using the OpenAI API, and generates a final processed document. The project utilizes various libraries and tools to perform PDF-to-image conversion, image-to-text extraction, and text processing.
Despite advancements in OCR technology, achieving perfect accuracy remains a challenge. The quality of the extracted text often depends on factors such as image resolution, font variations, layout complexities, and noise in the document.
TextractAI aims to address these limitations and improve the accuracy of text recognition from documents. It combines OCR with a large language model (via the OpenAI API) to post-process the raw OCR output, aiming for more reliable text extraction than OCR alone.
- Converts PDF files to images using pdf2image
- Extracts text from images using pytesseract
- Processes the extracted text using the OpenAI API
- Handles rate limiting and retries for the OpenAI API calls
- Supports parallel processing for faster execution
- Allows customization of the OpenAI prompt and API key
- Provides a command-line interface for easy usage
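The rate-limit handling and parallel processing listed above can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual implementation: `RateLimitError`, `call_with_retries`, and `process_pages` are hypothetical names, and the retry counts and delays are assumed values.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error raised by an API client."""


def call_with_retries(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))


def process_pages(pages, process_one, max_workers=4):
    """Process pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: call_with_retries(process_one, page), pages))
```

Threads suit this kind of workload because each call spends most of its time waiting on the network; `pool.map` keeps the results in page order so the final document can be assembled directly.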
- Python 3.6 or higher
- OpenAI API key
- Tesseract OCR (required by pytesseract)
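A quick way to verify these prerequisites before running the pipeline is sketched below. `check_prerequisites` is a hypothetical helper, not part of the project. It assumes the Tesseract executable is named `tesseract` and is discoverable on `PATH`, which may not hold if pytesseract has been pointed at a custom location.

```python
import os
import shutil
import sys


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or higher is required")
    # pytesseract is a wrapper and needs the tesseract executable to be installed.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("Tesseract OCR executable not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (pass --api-key instead)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```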
- Clone the repository:
git clone https://github.com/your-username/textractai.git
- Change to the project directory:
cd textractai
- Create a virtual environment:
python3 -m venv venv
- Activate the virtual environment:
source venv/bin/activate
- Install required dependencies:
pip install -r requirements.txt
- Set up the OpenAI API key:
  - Either set the OPENAI_API_KEY environment variable with your API key, or
  - Provide the API key using the --api-key or -k command-line argument when running the script
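The two ways of supplying the key can be reconciled with a small lookup like the one below. This is a sketch under assumptions: `resolve_api_key` is a hypothetical name, and giving the command-line value precedence over the environment variable is an assumed design choice, not confirmed behavior of the script.

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Prefer the command-line value, then fall back to OPENAI_API_KEY."""
    env = os.environ if env is None else env
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No OpenAI API key: set OPENAI_API_KEY or pass --api-key / -k")
    return key
```

Passing the environment in as a parameter keeps the function easy to test without modifying the real process environment.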
To run the TextractAI script, use the following command:
python src/main.py --file <path_to_pdf_file> [--output <output_folder>] [--text-output <text_output_folder>] [--api-key <openai_api_key>] [--prompt <custom_prompt>] [--force]
Options:
- --file or -f: Path to the PDF file to process (required)
- --output or -o: Path to the output folder for generated images (default: ./output/images_output)
- --text-output or -t: Path to the output folder for extracted text files (default: ./output/text_output)
- --api-key or -k: OpenAI API key (required if the OPENAI_API_KEY environment variable is not set)
- --prompt or -p: Custom OpenAI prompt (optional; the default prompt is used if not provided)
- --force: Force overwrite of existing outputs without prompting
python src/main.py --file data/example.pdf --output output/images --text-output output/text --api-key your_api_key --prompt "Custom prompt for processing the text"
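For readers curious how such an interface is typically declared, the options above map onto `argparse` roughly as follows. This is a sketch only: `build_parser` is a hypothetical name, the help strings are paraphrased, and the project's argument_parse_handler.py may be organized differently.

```python
import argparse


def build_parser():
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract text from a PDF and process it with the OpenAI API.",
    )
    parser.add_argument("--file", "-f", required=True,
                        help="Path to the PDF file to process")
    parser.add_argument("--output", "-o", default="./output/images_output",
                        help="Output folder for generated images")
    parser.add_argument("--text-output", "-t", default="./output/text_output",
                        help="Output folder for extracted text files")
    parser.add_argument("--api-key", "-k", default=None,
                        help="OpenAI API key (falls back to OPENAI_API_KEY)")
    parser.add_argument("--prompt", "-p", default=None,
                        help="Custom OpenAI prompt")
    parser.add_argument("--force", action="store_true",
                        help="Overwrite existing outputs without prompting")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that `argparse` exposes `--text-output` and `--api-key` as the attributes `text_output` and `api_key` on the parsed namespace.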
- data/: Directory to store input PDF files
- output/: Directory to store output files
  - images_output/: Generated images from PDF conversion
  - text_output/: Extracted text files from images
- src/: Source code directory
  - main.py: Main script to run the TextractAI pipeline
  - argument_parse_handler.py: Handles command-line argument parsing
  - constants.py: Contains constant values used across the project
  - convert_jpg_to_png.py: Converts JPEG images to PNG format
  - convert_pdf_to_images.py: Converts PDF files to images
  - custom_openai_handler.py: Handles OpenAI API calls and processing
  - extract_text_from_images.py: Extracts text from images using OCR
  - utils.py: Utility functions used in the project
- requirements.txt: Lists the required Python dependencies
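The module sources are not reproduced here, but the conversion and OCR stages they describe can be approximated in a few lines. The sketch below uses `pdf2image.convert_from_path` and `pytesseract.image_to_string`; the function names `page_image_path` and `pdf_to_text`, the file-naming scheme, and the 300 DPI setting are all assumptions made for illustration. Running it requires pdf2image (with its Poppler dependency), pytesseract, and Tesseract to be installed.

```python
from pathlib import Path


def page_image_path(output_dir, pdf_path, page_number):
    """Build a hypothetical per-page file name such as example_page_001.png."""
    return Path(output_dir) / f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_text(pdf_path, images_dir, dpi=300):
    """Render each PDF page to a PNG, then OCR it; returns one string per page."""
    # Imported here so this module loads even when the OCR stack is not installed.
    from pdf2image import convert_from_path
    import pytesseract

    Path(images_dir).mkdir(parents=True, exist_ok=True)
    texts = []
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        image.save(page_image_path(images_dir, pdf_path, number))
        texts.append(pytesseract.image_to_string(image))
    return texts
```

The per-page strings returned by `pdf_to_text` would then be handed to the language-model stage for cleanup.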
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License.
While TextractAI aims to enhance the accuracy and quality of the extracted text by leveraging the OpenAI API, it's important to note that the results may not be perfect. The output quality depends on various factors, such as the quality of the input PDF files, the complexity of the document layout, and the limitations of the OCR technology and the language model.
TextractAI is designed to assist in the text extraction process and provide improved results compared to raw OCR output. However, it is not a substitute for human review and validation. It is recommended to review the generated output, especially for critical or sensitive information, to ensure accuracy and correctness.
Legal and Confidential Information
Exercise caution when processing legal documents or confidential information. TextractAI does not provide any guarantees regarding the accuracy, completeness, or reliability of the extracted text for legal or sensitive purposes.
It is the responsibility of the user to review, verify, and validate the output generated by TextractAI before relying on it for any critical decisions or applications. The developers and contributors of TextractAI shall not be held liable for any errors, inaccuracies, or consequences arising from the use of this tool.
TextractAI is an independent open-source project and is not affiliated with, endorsed by, or associated with any companies or services that may have similar names or offer similar functionalities. The use of the name "TextractAI" is intended to describe the functionality of the project and does not imply any connection or affiliation with any existing trademarks or services.
The TextractAI project aims to provide a distinct and unique solution for extracting and processing text from PDF files using the OpenAI API. While the project name may share some similarity with existing trademarks or services, it is not intended to infringe upon or violate any intellectual property rights.
If you have any concerns regarding the use of the name "TextractAI" or potential trademark infringement, please contact the project maintainers to discuss a resolution.
By using TextractAI, you acknowledge and accept these disclaimers and limitations. Always exercise prudence and human judgment when working with automatically extracted and processed text.
If you have any concerns or encounter any issues with the output, please refer to the original PDF files and consult with the appropriate subject matter experts or legal professionals as needed.