dotnetcore / dotnetspider

DotnetSpider is a .NET Standard web crawling library: a lightweight, efficient, and fast high-level web crawling and scraping framework.

License: MIT License

C# 50.56% HTML 43.39% Shell 0.17% JavaScript 1.28% CSS 3.69% Dockerfile 0.03% TSQL 0.88%

crawler cross-platform csharp distributed dotnetcore

dotnetspider's People

dewdad alexlcc walterwhatwater inuyasha-monster zzms sevenboy2012 ayzhanglei sundebin smartfire dingdou lovewitty hlzfr hbei luchaoshuai yigedakoudaiya jetzfly yaozd zhoujunshao wzh880801 edgevagrant sharptogether jiangpan qcjxberin reinhardhsu zhangmeiliang5200 feng2012 charygao 522592 benbenlijie iraychen hefnernew ly2099 lhd24 jackswei kongyazhou lifekiller haoljp fred-lee iericzheng lizongshen yhhno tiger2014 yigexys kenchen1101 littletao08 debugoftheroad chinafather l1183479157 cloupid nonempty snoways mslycn freeboygirl chantysothy naishengbiao andyshao wuyou201400 chenruoyun tinkerc luciferaaa lmllouk lxh023 sky-gu neavers wfind lanyur kq2011 tigerphz applog czfei09 daohunliwei cdzhoubin itisi00 yalunwang skyder2008 jackwangcumt ithanshui xjt927 szp11 vebin expansion perrypal dut3062796s hihayden super-rain zpzgone yuntianming026 staymo fengxing666 wangnai1116 uponmoon aurpple jesse1205 smallred1024 modulexcite fydexx jasonwang1109 windygu kiss96803 ahweb

dotnetspider's Issues

Could you please add some comments

Could you please add some comments on classes, methods, and properties, so that in Visual Studio I can hover over them to see the comments and get a rough idea of what each one is used for?

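AddTargetUrlExtractor from the NuGet package does not work

AddTargetUrlExtractor from the NuGet package does not work, but the same code works when it references a build from source.

Dynamically added URLs are not parsed unless AddResultItem is called in the PageProcessor

By design, when the ResultItem is empty, dynamically added URLs are not added to the parsing queue by default; to change this, set the spider's SkipTargetRequestsWhenResultIsEmpty.
Example: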

Spider spider = Spider.Create(
	new QueueDuplicateRemovedScheduler(),
	new xxxProcessor(),
	new yyyProcessor()).
	AddPipeline(new MyPipeline());
// Add the initial request URL
spider.AddRequests("xxxxx");
// Do not skip target requests when the ResultItem is empty
spider.SkipTargetRequestsWhenResultIsEmpty = false; // default is true
// Start the spider
spider.Run();

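Could you publish Releases?

Could you publish a release whenever a NuGet package is published, so that the code for each version can be seen? Right now it is not easy to find the source for a specific NuGet version.

How can I run this successfully on Linux and macOS using VS Code and the dotnet command line?

I found some references in the project's .shproj files that depend on Visual Studio.
My knowledge is limited, and I don't know how to remove those dependencies.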

Two bugs in DbRequestBuilder

I ran into two bugs while using the DbRequestBuilder class; could the author confirm them?

In the QueryDatas method, the cast var dataItem = item as Dictionary<string, dynamic> fails and returns null. Fix: change it to var dataItem = (item as IDictionary<string, dynamic>).ToDictionary(kvp => kvp.Key, kvp => kvp.Value)
After the Build method is called, the generated Requests are not added to _requests.
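
The first bug can be reproduced without a database. The sketch below assumes the query rows are dynamic objects that implement IDictionary<string, object> without being Dictionary instances; ExpandoObject stands in for such a row, and the RowConversion class and its method names are hypothetical, not part of DotnetSpider.

```csharp
using System.Collections.Generic;
using System.Dynamic;
using System.Linq;

public static class RowConversion
{
    // Hypothetical sample row: an ExpandoObject implements
    // IDictionary<string, object> but is not a Dictionary<string, object>.
    public static object MakeSampleRow()
    {
        IDictionary<string, object> row = new ExpandoObject();
        row["Url"] = "http://example.com";
        return row;
    }

    // Mirrors the failing cast: yields null unless the row is literally
    // a Dictionary<string, object>.
    public static Dictionary<string, object> DirectCast(object row)
        => row as Dictionary<string, object>;

    // Mirrors the proposed fix: cast to the interface, then copy the
    // entries into a new Dictionary.
    public static Dictionary<string, object> ViaInterface(object row)
    {
        var dict = row as IDictionary<string, object>;
        return dict?.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    }
}
```

With the sample row, DirectCast returns null while ViaInterface returns a dictionary containing the Url entry, which matches the behaviour described in the report.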

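Is resuming after an interruption supported? If the program stops unexpectedly midway, is there currently any mechanism for handling a spider restart?

Will the spider automatically skip the URLs which have already been downloaded?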

Can this scrape javascript? Also what is this based on?

I've a few questions - can dotnetspider scrape:

Javascript? How does this compare to scrapy?
Whole tables in one line?

408 errors start appearing at 100 threads

A 100M dedicated China Telecom line,
each request is 2.6kb

With 100 threads, requests already time out.

Left 0 Success 6100 Error 0 Total 5973 Dowload 109 Extract 0 Pipeline 15367

How can I schedule the spider to run a daily job at 1 a.m., and is there a duplicate content check?

Hi,
I need to run the spider every day at 1 a.m. or at some other specific time. Is there a scheduler available for this?

Another question: is there a duplicate content check? For example, I crawl www.abc.com/aa.html every day for its XPath '/html/body/div[3]/div/div[2]/section'. If the content of '/html/body/div[3]/div/div[2]/section' is exactly the same as in my last crawl, I would like to just ignore it.

Thank you.
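
Both parts of this question can be approximated outside the library. The sketch below is not a DotnetSpider feature: it assumes the caller drives the spider from its own timer, and that a fingerprint of the last extracted content is stored somewhere between runs. DailyCrawlHelpers and its method names are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DailyCrawlHelpers
{
    // Returns the next occurrence of timeOfDay strictly after 'now'.
    public static DateTime NextRun(DateTime now, TimeSpan timeOfDay)
    {
        var candidate = now.Date + timeOfDay;
        return candidate > now ? candidate : candidate.AddDays(1);
    }

    // SHA-256 fingerprint of the extracted content, as an uppercase hex string.
    public static string Fingerprint(string content)
    {
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(content ?? string.Empty));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // True when the content differs from what produced the stored fingerprint.
    public static bool HasChanged(string previousFingerprint, string content)
        => !string.Equals(previousFingerprint, Fingerprint(content), StringComparison.Ordinal);
}
```

A caller could wait for NextRun(DateTime.Now, TimeSpan.FromHours(1)) - DateTime.Now before starting the spider, and skip the pipeline when HasChanged returns false for the section's text.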

Please write some wiki pages

It is hard to get started without a wiki.

A cookie added to Headers is never sent in the request headers

var site = new Site
{
    CycleRetryTimes = 1,
    SleepTime = 200,
    Headers = new Dictionary<string, string>()
    {
        { "Accept","/" },
        { "Referer", "https://ad.tt.com/login/"},
        { "Cookie","tt_webid=6582711285758166536" },
        { "Connection","keep-alive" },
        { "Content-Type","application/x-www-form-urlencoded" },
        { "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
    }
};

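When multiple Spiders are started asynchronously, the first Spider's Task status is always Running

Starting the spiders:
foreach (DotnetSpider.Core.Spider s in SpiderList) { s.OnClosed += s_OnClosed; SpiderTaskList.Add(s.RunAsync()); }
In s_OnClosed I loop over SpiderTaskList and check SpiderTaskList[i].Status; the first Task is always Running.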
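
One possible explanation, stated here as an assumption rather than confirmed library behaviour, is that OnClosed is raised from inside the task body, so the task has not completed yet when the handler reads its status. The sketch below uses a hypothetical stand-in for a spider (RunFakeSpiderAsync) to show that effect with plain Task.Run, and that awaiting Task.WhenAll gives a reliable completion point.

```csharp
using System;
using System.Threading.Tasks;

public static class SpiderTaskDemo
{
    // Hypothetical stand-in for a spider: raises onClosed from inside
    // the task body, before the task itself has completed.
    public static Task RunFakeSpiderAsync(Action onClosed)
    {
        return Task.Run(() =>
        {
            // ... crawling would happen here ...
            onClosed();
        });
    }

    // Records the task status seen inside the callback and after awaiting.
    public static async Task<(TaskStatus insideCallback, TaskStatus afterAwait)> ObserveAsync()
    {
        Task task = null;
        var seen = TaskStatus.Created;
        var assigned = new TaskCompletionSource<bool>();

        task = RunFakeSpiderAsync(() =>
        {
            assigned.Task.Wait();   // wait until 'task' has been assigned
            seen = task.Status;     // read from inside the running task
        });
        assigned.SetResult(true);

        await Task.WhenAll(task);   // completion is only certain after this
        return (seen, task.Status);
    }
}
```

If this assumption holds for the real Spider, checking the statuses after await Task.WhenAll(SpiderTaskList) rather than inside s_OnClosed would avoid reading a status that is still in flight.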

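Request for documentation and examples

There is no documentation, which makes the library painful to use, and the examples seem incomplete as well. Could the maintainers sort this out?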

Typo in file name "SpiderExceptoin.cs"

Path: src/DotnetSpider.Core/SpiderExceptoin.cs
Should be "SpiderException.cs"

There is a bug in HttpClientDownloader

In the DowloadContent method of HttpClientDownloader
code:
// TODO: in proxy mode, reconsider request.DownloaderGroup
var proxy = spider.Site.HttpProxyPool.GetProxy();
request.Proxy = proxy;
httpClientItem = HttpClientPool.GetHttpClient(spider, this, CookieContainer, proxy?.GetHashCode(), CookieInjector);
httpClientItem.Handler.Proxy = httpClientItem.Handler.Proxy ?? proxy;

There is an issue at this line: "httpClientItem.Handler.Proxy=httpClientItem.Handler.Proxy ?? proxy;"

If you reuse an HttpClient instance, httpClientItem.Handler.Proxy can no longer be modified.
It will throw an exception: This instance has already started one or more requests. Properties can only be modified before sending the first request.
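
A common way around this restriction is to configure the handler once, before its first request, and keep one client per proxy. The following is a minimal sketch using only System.Net.Http; it is not the library's HttpClientPool, and PerProxyHttpClients is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

public static class PerProxyHttpClients
{
    private static readonly ConcurrentDictionary<string, HttpClient> Clients =
        new ConcurrentDictionary<string, HttpClient>();

    // Returns a cached client for the given proxy (null means no proxy).
    // The handler's Proxy is set only while the handler is being created,
    // so it is never modified after a request has been sent.
    public static HttpClient GetClient(Uri proxyUri)
    {
        var key = proxyUri?.AbsoluteUri ?? "direct";
        return Clients.GetOrAdd(key, _ =>
        {
            var handler = new HttpClientHandler();
            if (proxyUri != null)
            {
                handler.Proxy = new WebProxy(proxyUri);
                handler.UseProxy = true;
            }
            return new HttpClient(handler);
        });
    }
}
```

Under contention, GetOrAdd may run the factory more than once and discard the extra client; a Lazy<HttpClient> value would avoid that if it matters.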

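Is it possible to get a page after its AJAX content has loaded?

The page only has its content data after the AJAX calls finish, and I need to crawl it after the AJAX rendering completes. Is this supported, and if so, how exactly should it be configured?

Cannot generate new URLs from the start page

If no ResultItem is obtained from a page,
do the pages added with AddTargetRequest have no effect? Is this a bug?

A page does not necessarily produce results, but it can still produce new URLs.

Can we switch the main descriptive language to Chinese?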

Can we talk in Chinese?

How to support multiple starturls and multiple page processors

Hi,
For example,
www.aaa.com and www.bbb.com require different page processors, processorX and processorY.
I only see the AddStartUrls() method. How do I assign a different page processor to each start URL?

Thanks,

How should login be done from v3.0 onward?

In 1.x you could use this:
spider.Downloader = new WebDriverDownloader(Browser.Chrome, new Option()
{
    Login = new LoginHandler()
    {
        Url = "https://wstj.bjchfp.gov.cn/apex/f?p=701:LOGIN_DESKTOP:5475297590252",
        UserSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='userID']" },
        PassSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='password']" },
        SubmitSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[type='submit']" },
        User = "400820454-C",
        Password = "XWZY0454-"
    }
});
What about 3.x and later?
How is the WebDriverCommonCookieInjector interface used? Could you provide documentation that explains it clearly?

Uri error when crawling a whole site

The cause of the error is in the request; I hope the author can take a look.

Please tell me the development environment requirements

Please tell me the development environment requirements, such as the exact Visual Studio version.

The bundled Sample fails to run

Is it because the DotnetSpider library is being updated so quickly that even the bundled Samples cannot keep up?

When running the AfterDownloadCompleteHandlerSpider Sample:
protected override void OnInit(params string[] arguments)
{
AddRequest($"http://api.search.sina.com.cn/?c=news&t=&q=赵丽颖&pf=2136012948&ps=2130770082&page=0&stime={DateTime.Now.AddYears(-7).AddDays(-1).ToString("yyyy-MM-dd")}&etime={DateTime.Now.AddDays(1).ToString("yyyy-MM-dd")}&sort=rel&highlight=1&num=10&ie=utf-8&callback=jQuery1720001955628746606708_1508996230766&_=1508996681484", new Dictionary<string, dynamic> { { "keyword", "赵丽颖" } });
AddPipeline(new ConsoleEntityPipeline());
Downloader.AddAfterDownloadCompleteHandler(new ReplaceHandler());
AddEntityType();
}

Downloader is null, which causes a runtime error. Could you update the Sample?

Improve the data update mode so that it is not conditioned on the primary key

In InsertNewAndUpdateOld mode, MySqlEntityPipeline generates an SQL statement that fails with a syntax error

MySqlEntityPipeline/GenerateInsertNewAndUpdateOldSql()
var sql =
$"INSERT INTO {adapter.Table.Database}.{tableName} ({cols}) {colsParams} ON DUPLICATE KEY UPDATE {setParams};";
Changing it to
var sql =
$"INSERT INTO {adapter.Table.Database}.{tableName} ({cols}) VALUES ({colsParams}) ON DUPLICATE KEY UPDATE {setParams};";
fixes the problem.
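
To show the corrected statement shape in isolation, here is a small standalone builder. It is a sketch, not the pipeline's actual code: the UpsertSqlBuilder name, the @column parameter naming, and the backtick quoting are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UpsertSqlBuilder
{
    // Builds a MySQL "insert, or update on duplicate key" statement.
    // The VALUES keyword before the parameter list is the part that
    // was missing in the reported statement.
    public static string Build(string database, string table, IReadOnlyList<string> columns)
    {
        if (columns == null || columns.Count == 0)
            throw new ArgumentException("At least one column is required.", nameof(columns));

        var cols = string.Join(", ", columns.Select(c => $"`{c}`"));
        var colsParams = string.Join(", ", columns.Select(c => $"@{c}"));
        var setParams = string.Join(", ", columns.Select(c => $"`{c}` = @{c}"));

        return $"INSERT INTO `{database}`.`{table}` ({cols}) VALUES ({colsParams}) " +
               $"ON DUPLICATE KEY UPDATE {setParams};";
    }
}
```

For example, Build("db", "t", new[] { "id", "name" }) produces a single statement with the VALUES clause between the column list and ON DUPLICATE KEY UPDATE.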

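Are there any plans to add a Cookie pool and a UserAgent pool?

Could the documentation be updated for DotnetSpider.Core?

First of all, thank you for open-sourcing this library. After reading the issues I found that DotnetSpider.Core2 is not the latest version, so I switched to DotnetSpider.Core. However, all of the documentation is for DotnetSpider.Core2, and the comments in DotnetSpider.Core are less complete than those in DotnetSpider.Core2.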

Source file '\netfx45\..\RuthSpider.cs' could not be found

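Hello, after downloading the project, compilation fails with an error saying the RuthSpider.cs file cannot be found.

Using a proxy reports: This instance has already started one or more requests. Properties can only be modified before sending the first request

When the next link is fetched through a proxy, this error is reported: "This instance has already started one or more requests. Properties can only be modified before sending the first request." I see that the proxy pool class implements the IDisposable interface. Could the cause be that resources are not being released? Where should they be released? My proxy class is as follows: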

public class HttpProxyPool : IHttpProxyPool
    {
        public void Dispose()
        {           
        }

        public UseSpecifiedUriWebProxy GetProxy()
        {
            var uri = new Uri("http://125.126.162.105:45504");
            return new UseSpecifiedUriWebProxy(uri);           
        }
        public void ReturnProxy(UseSpecifiedUriWebProxy proxy, HttpStatusCode statusCode)
        {            
        }
    }

Code that sets the proxy:
site.HttpProxyPool = new HttpProxyPool();

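The "first simple crawler" example fails

1. The example at https://github.com/dotnetcore/DotnetSpider/wiki/1.-第一个简单的爬虫 fails with: Download https://github.com/zlzforever failed: One or more errors occurred.
Environment: net45, DotnetSpider2.Core.2.4.4.
2. How do I crawl pages that take search parameters? For example, on the page http://list.youku.com/category/show/c_96_s_1_d_1_p_{i}.html, with the search box set to "后来的我们", "少林足球", and "羞羞的铁拳" respectively, how can I crawl these three pages elegantly?
3. Does this component switch IPs automatically while crawling? How do I switch IPs?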

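MySQL database

I found that running the project's examples requires a database. Could you upload the database?

A problem with DefaultProxyValidator

I crawl free proxies from the web. When validation executes var host = Dns.GetHostEntry(httpProxy.Host), the vast majority of proxies throw an exception, yet most of them are actually usable. Consider changing the validation method, for example by accessing http://httpbin.org/ip directly through the proxy.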

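Best practices

Hello, I would like to ask about best practices for using the DotnetSpider framework:

Scenario: I need to collect the first-level links for every category from a website's home page, build a new second-level link from each first-level link, and then crawl data from all of the second-level links.

Problem: I created a Spider with a Processor that extracts all first-level links from the home page. How can I take those links and feed them directly into a new Spider that runs the new crawl?

    I would appreciate any guidance or best-practice ideas. Thank you.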

More details about installation.

Cannot get the complete URL

ISelectable.Links()
When the URL is https://img.alicdn.com/bao/uploaded/i4/143391948/TB2jv7vaC1I.eBjy0FjXXabfXXa_!!143391948.jpg,
which contains exclamation marks, it is truncated and the complete URL cannot be obtained.

NuGet-Packages

Hi
I have a question regarding NuGet packages. Which projects are intended to be NuGet packages? Only Dotnetspider.Core?
Best
f

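Two Pipelines fail to read data

At line 66 of JsonFileEntityPipeline, calling entry.ToString() directly does not return the data.

At line 95 of ExcelEntityPipeline, data[column] does not return the data, because data is an entity class instance.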

Sample not working

When trying to run the BaseUsage sample, I get an error in the CheckIfSettingsCorrect() method in Spider.cs.

if (Site.RemoveOutboundLinks && (Site.Domains == null || Site.Domains.Length == 0)) {
throw new SpiderException($"When you want remove outbound links, the domains should not be null or empty."); }

I guess the Domains property is required if RemoveOutboundLinks is true, but I don't know what is the purpose of that property.

NullReferenceException thrown in Spider.cs/OnComplete() when running the Sample

windows 10 enterprise
visual studio 2017
.NETCoreApp 1.1

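Running DotnetSpider.Smple/BaseUsage.CustmizeProcessorAndPipeline();:

I then searched globally for the OnComplete field and found that nothing is ever registered to it:

Would you consider changing it to: OnCompleted?.Invoke()?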

NuGet Package?

Any plan to publish to NuGet?

Duplicate rows are inserted when extracting data

Using AddEntityType();
to get the content of all p elements under busstoplist
http://wapapp.dy4g.cn/bus/auto/test.php?t=linhtml&busline=1

        /// <summary>
        /// Bus station information
        /// </summary>
        [Schema("dybus", "BusStation")]
        [Entity(Expression = ".//div[@class='busstoplist']/div//p", Type = SelectorType.XPath)]
        class BusStation : BaseEntity
        {
            /// <summary>
            /// Bus line information
            /// </summary>
            [Column]
            [Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
            public string Keyword { get; set; }

            /// <summary>
            /// Unique station ID
            /// </summary>
            [Column]
            [Field(Expression = "./@Id")]
            public string BusStationId { get; set; }


            /// <summary>
            /// Bus route number
            /// </summary>
            [Column]
            [Field(Expression = "./strong/text()")]
            public string StationNumber { get; set; }

            /// <summary>
            /// Station name
            /// </summary>
            [Column]
            [Field(Expression = "./span/text()")]
            public string Name { get; set; }

            /// <summary>
            /// Station direction
            /// </summary>
            [Column]
            [Field(Expression = "../@class")]
            public string BusDirection { get; set; }

        }

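The data is extracted correctly,
but each record is inserted twice.
Could you take a look and tell me whether I am using it incorrectly?

waitCount causes the Spider to finish early

The Spider runs two threads and has one entry URL. The first thread takes the URL and processes it while the second thread loops, waiting. If that URL happens to take a long time to process, the second thread sets the Spider's status to Finished after waiting waitCount times, even though the first URL is still being processed.

So shouldn't the check be: only once all threads are idle, wait waitCount times and then treat the crawl as finished?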
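
The proposed rule can be sketched as a small counter that only counts idle polls while no worker is busy. This is an illustration of the suggestion, not the Spider's actual implementation; IdleTracker and its members are hypothetical names.

```csharp
using System;
using System.Threading;

public sealed class IdleTracker
{
    private readonly int _maxIdleRounds;  // plays the role of waitCount
    private int _busyWorkers;
    private int _idleRounds;

    public IdleTracker(int maxIdleRounds)
    {
        if (maxIdleRounds < 1) throw new ArgumentOutOfRangeException(nameof(maxIdleRounds));
        _maxIdleRounds = maxIdleRounds;
    }

    // Call when a worker takes a request from the queue.
    public void WorkerStarted() => Interlocked.Increment(ref _busyWorkers);

    // Call when that worker has finished processing the request.
    public void WorkerFinished() => Interlocked.Decrement(ref _busyWorkers);

    // Call on each idle poll. Idle rounds are counted only while the queue
    // is empty and no worker is busy; any activity resets the count.
    public bool ShouldFinish(bool queueEmpty)
    {
        if (!queueEmpty || Volatile.Read(ref _busyWorkers) > 0)
        {
            Interlocked.Exchange(ref _idleRounds, 0);
            return false;
        }
        return Interlocked.Increment(ref _idleRounds) >= _maxIdleRounds;
    }
}
```

With this rule, a second thread polling an empty queue never ends the crawl while the first thread is still processing a slow URL that may yet add new requests.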

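Is the logic of IsAutoIncrementPrimary in TableInfo flawed?

In TableInfo:

internal bool IsAutoIncrementPrimary => Primary.Count == 1 && Columns.Count(f => f.DataType == DataType.Int || f.DataType == DataType.Long) == 1;

This decides whether the database table should use an auto-increment primary key. The current logic is:

If there is exactly one primary key and the table has exactly one integer column, that primary key is auto-increment.
This has the following problems:

When the user does not want an auto-increment primary key, should it still be forced?
If my only primary key is of type string, does it still need to be auto-increment? (MySQL reports an error outright.)
The original design intent was clearly to set auto-increment only when there is a single primary key and that key is an integer.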
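
The intended rule, as the reporter reads it, can be written so that the type check applies to the primary key column itself rather than to the table's integer column count. The types below are simplified, hypothetical stand-ins for the library's TableInfo and column classes.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for the library's types.
public enum DataType { Int, Long, String }

public sealed class Column
{
    public string Name;
    public DataType DataType;
    public bool IsPrimary;
}

public static class TableRules
{
    // Auto-increment only when there is exactly one primary key column
    // and that column itself is an integer type.
    public static bool IsAutoIncrementPrimary(IEnumerable<Column> columns)
    {
        var primary = columns.Where(c => c.IsPrimary).ToList();
        return primary.Count == 1
            && (primary[0].DataType == DataType.Int || primary[0].DataType == DataType.Long);
    }
}
```

Under this rule a table with a single string primary key and one unrelated integer column is no longer treated as auto-increment. Letting users opt out of auto-increment entirely would still need a separate setting.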

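It is not netcoreapp1.0

The class library's project.json should not target netcoreapp1.0; it should target the netstandard1.6 platform standard.

How does EntitySpider handle one-to-many relationships?

Suppose an article has several authors and each author has a picture. How should the field types be defined in EntitySpider?

The current demos all seem to use single-valued types.
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public string Name { get; set; }

Can something like this be supported?
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public List<string> Name { get; set; }

Alternatively, a callback function that can be defined for each field after it is parsed successfully would also work.
Thank you!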

New Package for last commits (NLog/Serilog)

Hi. First of all: cool project!
When is the next NuGet package update planned? I would really love to use my own logger, which is not possible with the current version but should be with the next one (commit ae9bb7e) :)
thanks