text-cleaner: a simple text preprocessing tool

Introduction

  • Supports Python 2.7, 3.3, 3.4, 3.5.
  • Simple interfaces.
  • Easy to extend.

Install

pip install text-cleaner

WARNING FOR PYTHON 2.7 USERS: Only UCS-4 builds (--enable-unicode=ucs4) are supported. UCS-2 builds are NOT SUPPORTED in the latest version.

Usage

from text_cleaner import remove, keep

from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION
from text_cleaner.processor.misc import RESTRICT_URL

# remove url and ascii characters.
# return: u'点击  查看 '
remove(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [RESTRICT_URL, ASCII],
)

# remove only Chinese punctuation.
# return: u'点击 http://t.cn/RtU0mZ1  查看,123456,test '
remove(
    '点击:http://t.cn/RtU0mZ1, 查看,123456,test。!?',
    [CHINESE_SYMBOLS_AND_PUNCTUATION],
)

# keep chinese characters and url.
# return: u'点击 http://t.cn/RtU0mZ1 查看'
keep(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [CHINESE, RESTRICT_URL],
)

# use processor directly.
# return: u'点击  查看'
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击http://t.cn/RtU0mZ1 查看')

Interfaces

text_cleaner.remove(text, processors):

  • text: str or bytes (unicode or str for Python 2).
  • processors: iterable of processors. remove calls the remove method of each processor on text.

text_cleaner.keep(text, processors):

  • same as remove, but invokes the keep method of each processor instead.
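
The chaining described for remove can be pictured with a small stand-alone sketch. It does not import text-cleaner; SketchProcessor and remove_all are hypothetical names used only for illustration, and applying the processors one after another in order is an assumption consistent with the usage examples above, not a statement about the library's internals.

```python
import re


class SketchProcessor(object):
    """Hypothetical stand-in for a processor; not the library's class."""

    def __init__(self, pattern, replace_text=' '):
        self._regex = re.compile(pattern)
        self._replace_text = replace_text

    def remove(self, text):
        # Replace every match with the replacement text.
        return self._regex.sub(self._replace_text, text)


def remove_all(text, processors):
    # Assumption: each processor's remove() runs on the output of the previous one.
    for processor in processors:
        text = processor.remove(text)
    return text


DIGITS = SketchProcessor(r'[0-9]+')
LOWERCASE = SketchProcessor(r'[a-z]+')

# 'AB12cd' -> 'AB cd' after DIGITS, then 'AB  ' after LOWERCASE.
result = remove_all('AB12cd', [DIGITS, LOWERCASE])
```

Note that keep cannot be a plain sequential chain of this kind: in the usage example above, keep with CHINESE and RESTRICT_URL retains text matched by either processor.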

Processors

DEFAULT_REPLACE_TEXT: ' ', single space.

RegexProcessor(regex, replace_text=DEFAULT_REPLACE_TEXT)

  • construct a regex processor for regex; replace_text is the text substituted for the components that get removed.
  • replace(self, new_replace_text): create a new processor with replace_text set to new_replace_text.
  • remove(self, text): remove all occurrences of regex from text.
  • keep(self, text): keep only the occurrences of regex, and remove all unmatched components from text.
  • verify(self, text): return True if text matches regex, otherwise return False.
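
To make the four methods concrete, here is a minimal sketch written against the standard re module. SketchRegexProcessor is a hypothetical class, not the library's RegexProcessor. Two details are assumptions: that verify requires the whole text to match, and that keep emits a single replace_text for each unmatched run.

```python
import re

DEFAULT_REPLACE_TEXT = ' '


class SketchRegexProcessor(object):
    """Hypothetical model of the documented interface, built on re."""

    def __init__(self, regex, replace_text=DEFAULT_REPLACE_TEXT):
        self.regex = regex
        self.replace_text = replace_text
        self._pattern = re.compile(regex)

    def replace(self, new_replace_text):
        # Return a new processor; the current one is left unchanged.
        return SketchRegexProcessor(self.regex, new_replace_text)

    def remove(self, text):
        # Substitute every match with replace_text.
        return self._pattern.sub(self.replace_text, text)

    def keep(self, text):
        # Keep the matches; each unmatched run becomes one replace_text.
        pieces = []
        last_end = 0
        for match in self._pattern.finditer(text):
            if match.start() > last_end:
                pieces.append(self.replace_text)
            pieces.append(match.group())
            last_end = match.end()
        if last_end < len(text):
            pieces.append(self.replace_text)
        return ''.join(pieces)

    def verify(self, text):
        # Assumption: the entire text must match the regex.
        return self._pattern.fullmatch(text) is not None


DIGITS = SketchRegexProcessor(r'[0-9]+')
removed = DIGITS.remove('a1b22c')
kept = DIGITS.keep('a1b22c')
marked = DIGITS.replace('#').remove('a1b22c')
```

Returning a new object from replace means a shared processor can be reused with a different replacement text without side effects.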

UnicodeRange(begin, end):

  • begin: int, the start of the Unicode range.
  • end: int, the end of the Unicode range.

UnicodeRangeProcessor(ranges, replace_text=DEFAULT_REPLACE_TEXT)

  • subclass of RegexProcessor.
  • ranges: iterable of instances of UnicodeRange.
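
One way to picture how a list of code point ranges becomes a regex is the sketch below. SketchRange and ranges_to_pattern are hypothetical names, and treating both begin and end as inclusive is an assumption; check the source for the library's actual behaviour.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for UnicodeRange; both ends assumed inclusive.
SketchRange = namedtuple('SketchRange', ['begin', 'end'])


def ranges_to_pattern(ranges):
    # Build one character class such as [\U00000030-\U00000039...]+
    parts = []
    for unicode_range in ranges:
        parts.append('\\U{:08X}-\\U{:08X}'.format(
            unicode_range.begin, unicode_range.end))
    return '[' + ''.join(parts) + ']+'


# ASCII digits plus the CJK Unified Ideographs block.
pattern = ranges_to_pattern([
    SketchRange(0x30, 0x39),
    SketchRange(0x4E00, 0x9FFF),
])

cleaned = re.sub(pattern, ' ', 'ab12点击cd')
found = re.findall(pattern, 'x1y中z')
```

Because the character class is followed by +, a run of adjacent matching characters is replaced by a single replacement text rather than one per character.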

Built-in Processors

The following processors are defined with UnicodeRange and regular expressions. Read the source code if you are not sure what a processor matches.

text_cleaner.processor.common, for common usage:

  • ALPHA
  • DIGIT
  • SYMBOLS_AND_PUNCTUATION
  • ASCII
  • ALPHA_EXTENSION
  • DIGIT_EXTENSION
  • SYMBOLS_AND_PUNCTUATION_EXTENSION
  • GENERAL_PUNCTUATION

text_cleaner.processor.misc, miscellaneous processors:

  • URL
  • RESTRICT_URL
  • ESCAPED_WHITESPACE
  • WECHAT_EMOJI_EN
  • WECHAT_EMOJI_ZHCN
  • WECHAT_EMOJI

text_cleaner.processor.chinese, Chinese processing:

  • CHINESE_CHARACTER: only common characters.
  • CHINESE: common characters + symbols and punctuation.
  • CHINESE_ALL: all CJK characters.
  • CHINESE_EXTENSION
  • CHINESE_COMPATIBILITY
  • CHINESE_SYMBOLS_AND_PUNCTUATION

URL vs. RESTRICT_URL

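Deciding where a URL ends is a complex problem, so we provide two choices.

  • URL: a URL extends until the next whitespace character.
  • RESTRICT_URL: a URL extends only over non-whitespace ASCII characters ([!-~] in the ASCII table) and ends at the first character outside that range.

For Chinese users, we recommend RESTRICT_URL, because Chinese text often follows a URL without any whitespace in between.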

from text_cleaner.processor.misc import RESTRICT_URL, URL

URL.remove('点击http://t.cn/RtU0mZ1 查看')
# '点击 查看'

URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 '

RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 查看'
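
The difference between the two rules can be reproduced with plain regular expressions. LOOSE_URL and STRICT_URL below are hypothetical patterns written for this illustration; the patterns the library actually uses for URL and RESTRICT_URL may differ, for example in how the scheme is matched.

```python
import re

# Assumption: a URL starts with http:// or https://.
# Loose rule: the URL runs until the next whitespace character.
LOOSE_URL = re.compile(r'https?://\S+')
# Strict rule: the URL may only contain non-whitespace ASCII, [!-~].
STRICT_URL = re.compile(r'https?://[!-~]+')

text = '点击http://t.cn/RtU0mZ1查看'

# The loose rule swallows the Chinese text that follows the URL.
loose_result = LOOSE_URL.sub(' ', text)
# The strict rule stops at the first non-ASCII character.
strict_result = STRICT_URL.sub(' ', text)
```

In Python 3, \S is Unicode-aware for str patterns, so Chinese characters count as non-whitespace and are absorbed by the loose pattern.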
