Code Monkey home page Code Monkey logo

magic-doc's Introduction

license issue resolution open issues

👋 join us on Discord and WeChat

English | 简体中文

Install

Prerequisites: python3.10

Install Dependencies

linux/osx

apt-get/yum/brew install libreoffice

windows

install libreoffice 
append "install_dir\LibreOffice\program" to ENVIRONMENT PATH

Install Magic-Doc

pip install fairy-doc[cpu] # cpu version
or
pip install fairy-doc[gpu] # gpu version

Introduction

Magic-Doc is a lightweight open-source tool that allows users to convert multiple file type (PPT/PPTX/DOC/DOCX/PDF) to markdown. It supports both local file and S3 file.

Example

# for local file
from magic_doc.docconv import DocConverter, S3Config
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
# for remote file located in aws s3
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)

Performance

ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, Centos 7

File Type Speed
PDF (digital) 347 (page/s)
PDF (ocr) 2.7 (page/s)
PPT 20 (page/s)
PPTX 149 (page/s)
DOC 600 (page/s)
DOCX 1482 (page/s)

All Thanks To Our Contributors:

image

Acknowledgments

🖊️ Citation

@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}

License

This project is released under the Apache 2.0 license.

🔼 Back to top

magic-doc's People

Contributors

dependabot[bot] avatar drunkpig avatar dt-yy avatar icecraft avatar lollipopsandwine avatar myhloli avatar wufan-tb avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

magic-doc's Issues

Building wheel for cchardet (setup.py) ... error

Building wheels for collected packages: cchardet
Building wheel for cchardet (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-311
creating build\lib.win-amd64-cpython-311\cchardet
copying src\cchardet\version.py -> build\lib.win-amd64-cpython-311\cchardet
copying src\cchardet_init_.py -> build\lib.win-amd64-cpython-311\cchardet
running build_ext
building 'cchardet._cchardet' extension
creating build\temp.win-amd64-cpython-311
creating build\temp.win-amd64-cpython-311\Release
creating build\temp.win-amd64-cpython-311\Release\src
creating build\temp.win-amd64-cpython-311\Release\src\cchardet
creating build\temp.win-amd64-cpython-311\Release\src\ext
creating build\temp.win-amd64-cpython-311\Release\src\ext\uchardet
creating build\temp.win-amd64-cpython-311\Release\src\ext\uchardet\src
creating build\temp.win-amd64-cpython-311\Release\src\ext\uchardet\src\LangModels
D:\VS2022\Community\VC\Tools\MSVC\14.37.32822\bin\HostX86\x64\cl.exe /c /nologo /O2 /W3 /GL /DNDEBUG /MD -Isrc/ext/uchardet/src -ID:\Anconda2
024\include -ID:\Anconda2024\Include -ID:\VS2022\Community\VC\Tools\MSVC\14.37.32822\include -ID:\VS2022\Community\VC\Tools\MSVC\14.37.32822\ATLMFC
\include -ID:\VS2022\Community\VC\Auxiliary\VS\include "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x8
6)\Windows Kits\10\include\10.0.22621.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\shared" "-IC:\Program Files (x86)\Wi
ndows Kits\10\include\10.0.22621.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\cppwinrt" /EHsc /Tpsrc/cchardet_cchardet.cpp /Fobuild\temp.win-amd64-cpython-311\Release\src/cchardet_cchardet.obj
_cchardet.cpp
src/cchardet_cchardet.cpp(196): fatal error C1083: 无法打开包括文件: “longintrepr.h”: No such file or directory
error: command 'D:\VS2022\Community\VC\Tools\MSVC\14.37.32822\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for cchardet
Running setup.py clean for cchardet
Failed to build cchardet
ERROR: Could not build wheels for cchardet, which is required to install pyproject.toml-based projects

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.