est / cx-extractor

Automatically exported from code.google.com/p/cx-extractor

Java 5.12% Shell 0.07% PHP 1.98% HTML 76.93% C# 1.55% C++ 4.30% Perl 10.05%

cx-extractor's Introduction

Hi there 👋

🔭 Backend developer
💬 Read my blog here https://blog.est.im/
📫 Email me? i@ with TLD above.
⚡ Fun fact: My github account was hacked once lol

cx-extractor's People

Forkers

buzzingpig iaceob baotong netsafer surfingit guomin

cx-extractor's Issues

A time-consuming step

source = links.matcher(source).replaceAll("");

Sample: http://news.itxinwen.com/2013/0802/515691.shtml

This step alone takes more than 90 s.

Suggestion: could all tags simply be removed with source = source.replaceAll("<[^>]+>", ""); instead?
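
The suggestion above can be sketched as follows. The definition of `links` is not shown in this issue, so this does not explain the 90 s run time; it only illustrates the proposed single-pass tag removal. The class name `TagStripper` and the method `stripTags` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Compiled once and reused. String.replaceAll compiles its pattern on every call.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Removes everything that looks like an HTML tag and keeps the text between tags. */
    static String stripTags(String source) {
        return ANY_TAG.matcher(source).replaceAll("");
    }
}
```

Note that this keeps the anchor text of links and the contents of script and style elements, which may differ from what the original `links` step removes.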

Original issue reported on code.google.com by [email protected] on 2 Aug 2013 at 8:01

Garbled characters in the files

As the title says.

Original issue reported on code.google.com by [email protected] on 16 Oct 2013 at 3:12

Garbled file names

In the Perl version, the names of the generated txt files are all garbled, although the file contents are fine.

Original issue reported on code.google.com by [email protected] on 21 May 2013 at 7:55

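Poor matching on blogs and on pages with a lot of text, images, or script code

The cx-extractor algorithm is good and offers a new way of approaching the problem. My earlier approach was to extract all the TABLE and DIV blocks in a page and judge each block by how much text it contained.

I implemented the cx-extractor algorithm in C# and ran into the following problems:

1. preProcess cannot filter out tags that contain script, such as the IMG tag on this page:

   http://developer.51cto.com/art/201012/236066.htm

2. Would you consider improving the following two aspects, that is, applying the two extra filtering passes below after the first match fails?
   1. The main text is usually wrapped in DIV or TABLE (TR/TD) elements. Replace these tags with special markers and use the markers as reference boundaries when merging lines and blocks.
   2. In pages like the one below, the main text uses <p> heavily. Tags inside each P could be stripped and runs of consecutive P tags counted.

   http://hi.baidu.com/jrckkyy/blog/item/a0c70a995e3579196f068c4e.html

3. Blogs are still not handled well:

   http://www.cnblogs.com/zhoujg/archive/2010/12/04/1895887.html 
   http://sarin.javaeye.com/blog/830831
   http://blog.sina.com.cn/s/blog_4c4fd3070100nbvt.html?tj=1

4. This news article also seems to have a problem:

   http://news.sina.com.cn/c/2010-12-04/100718432475s.shtml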
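
Items 1 and 2 in the list above can be sketched roughly as follows, written in Java to match the snippet quoted in the first issue. Two assumptions are baked in: that the IMG problem comes from a `>` character inside a quoted attribute value (for example an inline event handler), which this issue does not confirm, and that "consecutive P tags" means `<p>` elements separated only by whitespace. `PreprocessSketch`, `stripTags`, and `longestParagraphRun` are hypothetical names, not part of cx-extractor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreprocessSketch {
    // '<', then any mix of plain characters, "double-quoted" values or
    // 'single-quoted' values, then '>'. Quoted values may contain '>',
    // which the simpler <[^>]+> pattern would treat as the end of the tag.
    private static final Pattern TAG =
        Pattern.compile("<(?:[^>\"']|\"[^\"]*\"|'[^']*')*+>");

    // One <p>...</p> element; group 1 is its inner markup.
    private static final Pattern P_BLOCK = Pattern.compile(
        "<p\\b[^>]*>(.*?)</p\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Removes tags, tolerating '>' inside quoted attribute values. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    /** Returns the text of the longest run of adjacent <p> elements, one paragraph per line. */
    static String longestParagraphRun(String html) {
        Matcher m = P_BLOCK.matcher(html);
        String best = "";
        StringBuilder run = new StringBuilder();
        int prevEnd = -1;
        while (m.find()) {
            // The run continues only if nothing but whitespace separates
            // this <p> from the previous one.
            boolean adjacent = prevEnd >= 0
                && html.substring(prevEnd, m.start()).trim().isEmpty();
            if (!adjacent) {
                if (run.length() > best.length()) best = run.toString();
                run.setLength(0);
            }
            String text = stripTags(m.group(1)).trim();
            if (run.length() > 0) run.append('\n');
            run.append(text);
            prevEnd = m.end();
        }
        if (run.length() > best.length()) best = run.toString();
        return best;
    }
}
```

Measuring a run by character count is only one possible scoring choice; counting paragraphs, or feeding the run boundaries into the existing line-block scoring, would be alternatives. Regular expressions remain a heuristic here, and a real HTML parser would be more robust on malformed pages.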

Original issue reported on code.google.com by [email protected] on 4 Dec 2010 at 3:07

Spaces in English text are removed

If a page contains English text, the spaces between words should not be removed.
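
One possible way to address this is sketched below. How cx-extractor currently removes whitespace is not shown in this issue, so this is only an illustration of the requested behavior: runs of whitespace are collapsed to a single space, and a space is dropped only when it sits directly between two Han characters. `WhitespaceNormalizer` and `normalize` are hypothetical names.

```java
import java.util.regex.Pattern;

final class WhitespaceNormalizer {
    // Any run of whitespace (spaces, tabs, line breaks).
    private static final Pattern WS_RUN = Pattern.compile("\\s+");

    // A single space with a Han character immediately before and after it.
    private static final Pattern SPACE_BETWEEN_HAN =
        Pattern.compile("(?<=\\p{IsHan}) (?=\\p{IsHan})");

    /** Collapses whitespace but keeps the spaces that separate non-Han words. */
    static String normalize(String line) {
        String collapsed = WS_RUN.matcher(line).replaceAll(" ").trim();
        return SPACE_BETWEEN_HAN.matcher(collapsed).replaceAll("");
    }
}
```

By default `\s` in Java does not match the full-width space (U+3000) or the no-break space, so pages that use those characters would need extra handling.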

Original issue reported on code.google.com by [email protected] on 21 May 2013 at 7:56
