dotnetcore / dotnetspider
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.
License: MIT License
Could you please add comments on classes, methods, and properties so that, when using Visual Studio, I can hover over them to see the comments and get a rough idea of what each one is used for?
AddTargetUrlExtractor from the NuGet package does not work; the same code works when I reference the library built from source.
By design, when ResultItem is empty, dynamically added URLs are not put into the parse queue by default. You need to configure the spider's SkipTargetRequestsWhenResultIsEmpty.
Example:
Spider spider = Spider.Create(
    new QueueDuplicateRemovedScheduler(),
    new xxxProcessor(),
    new yyyProcessor())
    .AddPipeline(new MyPipeline());
// Add the initial URL to crawl
spider.AddRequests("xxxxx");
// Do not skip target requests when ResultItem is empty
spider.SkipTargetRequestsWhenResultIsEmpty = false; // defaults to true
// Start the spider
spider.Run();
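Could you publish a release whenever a NuGet package is published, so that the code for each version can be seen? Right now it is not easy to find the source for a particular NuGet version.
I found some reference items in the project's .shproj file that depend on Visual Studio.
My skills are limited, and I do not know how to remove these dependencies.
How can I get the cookies returned by a request when parsing a page?
Will the spider automatically skip URLs that have already been downloaded?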
I've a few questions - can dotnetspider scrape:
100 Mbps dedicated China Telecom line,
each request is about 2.6 KB,
and with 100 threads it times out.
Left 0 Success 6100 Error 0 Total 5973 Dowload 109 Extract 0 Pipeline 15367
As the title says: how can the framework simulate a user login?
Hi,
I need to run the spider every day at 1 a.m. or at some other specific time. Is there a scheduler available for this?
Another question: is there any duplicate-content check? For example, I crawl www.abc.com/aa.html every day for the XPath '/html/body/div[3]/div/div[2]/section'. If the content at that XPath is exactly the same as in my last crawl, I want to ignore it.
Thank you.
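A minimal sketch of both ideas in plain .NET, independent of DotnetSpider: computing how long to wait until the next run at a given hour, and fingerprinting extracted content so an unchanged page can be skipped. The names CrawlHelpers and ContentChangeTracker are hypothetical and not part of the library. The tracker here keeps fingerprints in memory only, so a daily job would need to persist them (for example in a file or database) between runs.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helpers; none of these names come from DotnetSpider.
public static class CrawlHelpers
{
    // Time to wait from `now` until the next occurrence of `hour`:00.
    public static TimeSpan DelayUntilNext(DateTime now, int hour)
    {
        DateTime next = now.Date.AddHours(hour);
        if (next <= now)
        {
            next = next.AddDays(1); // today's slot has passed, use tomorrow's
        }
        return next - now;
    }

    // SHA-256 fingerprint of the extracted content, as a hex string.
    public static string Fingerprint(string content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}

public class ContentChangeTracker
{
    // Maps a key (for example URL + XPath) to the last fingerprint seen.
    private readonly Dictionary<string, string> _last = new Dictionary<string, string>();

    // Returns true when the content is new or differs from the last crawl,
    // and records the new fingerprint. Returns false when it is unchanged.
    public bool HasChanged(string key, string content)
    {
        string fingerprint = CrawlHelpers.Fingerprint(content);
        if (_last.TryGetValue(key, out string previous) && previous == fingerprint)
        {
            return false;
        }
        _last[key] = fingerprint;
        return true;
    }
}
```

A pipeline could call HasChanged with the URL and XPath as the key and drop the result when it returns false.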
Without a wiki, it is hard to get started.
var site = new Site
{
CycleRetryTimes = 1,
SleepTime = 200,
Headers = new Dictionary<string, string>()
{
{ "Accept","*/*" },
{ "Referer", "https://ad.tt.com/login/"},
{ "Cookie","tt_webid=6582711285758166536" },
{ "Connection","keep-alive" },
{ "Content-Type","application/x-www-form-urlencoded" },
{ "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
}
};
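Starting the spiders:
foreach (DotnetSpider.Core.Spider s in SpiderList) { s.OnClosed += s_OnClosed; SpiderTaskList.Add(s.RunAsync()); }
In s_OnClosed I loop over SpiderTaskList[i].Status to check each state, and the first Task always stays Running.
There is no documentation, which makes the library painful to use, and the examples do not seem complete either. Could the maintainers please sort this out?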
Path: src/DotnetSpider.Core/SpiderExceptoin.cs
Should be "SpiderException.cs"
In the DowloadContent method of HttpClientDownloader:
Code:
// TODO: in proxy mode, reconsider request.DownloaderGroup
var proxy = spider.Site.HttpProxyPool.GetProxy();
request.Proxy = proxy;
httpClientItem = HttpClientPool.GetHttpClient(spider, this, CookieContainer, proxy?.GetHashCode(), CookieInjector);
httpClientItem.Handler.Proxy = httpClientItem.Handler.Proxy ?? proxy;
There is an issue at this line: "httpClientItem.Handler.Proxy = httpClientItem.Handler.Proxy ?? proxy;"
If the HttpClient instance is reused, httpClientItem.Handler.Proxy can no longer be modified.
It throws the exception: This instance has already started one or more requests. Properties can only be modified before sending the first request.
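One way to avoid that exception, sketched in plain .NET rather than with DotnetSpider's own HttpClientPool: keep one HttpClient per proxy and set the handler's Proxy only when the handler is created, before any request is sent. PerProxyHttpClientCache is a hypothetical name, and this is an illustration of the idea under those assumptions, not the library's fix.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical cache: one HttpClient per proxy URI (null means "no proxy").
public sealed class PerProxyHttpClientCache : IDisposable
{
    private readonly ConcurrentDictionary<string, Lazy<HttpClient>> _clients =
        new ConcurrentDictionary<string, Lazy<HttpClient>>();

    public HttpClient GetClient(Uri proxyUri)
    {
        string key = proxyUri == null ? string.Empty : proxyUri.AbsoluteUri;
        // Lazy ensures only one client is actually created per key,
        // even if several threads race on GetOrAdd.
        return _clients.GetOrAdd(
            key,
            _ => new Lazy<HttpClient>(() => Create(proxyUri))).Value;
    }

    private static HttpClient Create(Uri proxyUri)
    {
        var handler = new HttpClientHandler();
        if (proxyUri != null)
        {
            // The proxy is assigned once, before the first request,
            // and never modified afterwards.
            handler.Proxy = new WebProxy(proxyUri);
            handler.UseProxy = true;
        }
        return new HttpClient(handler, disposeHandler: true);
    }

    public void Dispose()
    {
        foreach (Lazy<HttpClient> client in _clients.Values)
        {
            if (client.IsValueCreated)
            {
                client.Value.Dispose();
            }
        }
        _clients.Clear();
    }
}
```

With this shape, a request that needs a different proxy gets a different client instead of mutating a handler that has already sent requests.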
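The page only has its content after the AJAX calls finish, and I need to crawl it after the AJAX rendering is complete. Is this supported? If so, how exactly should it be configured?
If no ResultItem is obtained from the page,
do the pages added with AddTargetRequest have no effect? Is this a bug?
A page does not necessarily produce a result, but it can still yield new URLs.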
Can we talk in Chinese?
Hi,
For example,
www.aaa.com and www.bbb.com require different page processors, processorX and processorY.
I only see the AddStartUrls() method. How do I assign a different page processor to each start URL?
Thanks,
In 1.x you could use this:
spider.Downloader = new WebDriverDownloader(Browser.Chrome, new Option()
{
Login =
new LoginHandler()
{
Url = "https://wstj.bjchfp.gov.cn/apex/f?p=701:LOGIN_DESKTOP:5475297590252",
UserSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='userID']" },
PassSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='password']" },
SubmitSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[type='submit']" },
User = "400820454-C",
Password = "XWZY0454-"
}
});
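What about 3.x and later?
How is the WebDriverCommonCookieInjector interface used? Could you provide documentation that explains it clearly?
Please tell me the development environment requirements, such as the exact Visual Studio version.
Is the DotnetSpider library being updated so quickly that the bundled samples cannot keep up?
After running the AfterDownloadCompleteHandlerSpider sample: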
protected override void OnInit(params string[] arguments)
{
AddRequest($"http://api.search.sina.com.cn/?c=news&t=&q=赵丽颖&pf=2136012948&ps=2130770082&page=0&stime={DateTime.Now.AddYears(-7).AddDays(-1).ToString("yyyy-MM-dd")}&etime={DateTime.Now.AddDays(1).ToString("yyyy-MM-dd")}&sort=rel&highlight=1&num=10&ie=utf-8&callback=jQuery1720001955628746606708_1508996230766&_=1508996681484", new Dictionary<string, dynamic> { { "keyword", "赵丽颖" } });
AddPipeline(new ConsoleEntityPipeline());
Downloader.AddAfterDownloadCompleteHandler(new ReplaceHandler());
AddEntityType();
}
Downloader is null, which causes a runtime error. Could you update the sample?
MySqlEntityPipeline/GenerateInsertNewAndUpdateOldSql()
var sql = $"INSERT INTO {adapter.Table.Database}.{tableName} ({cols}) {colsParams} ON DUPLICATE KEY UPDATE {setParams};";
Change it to:
var sql = $"INSERT INTO {adapter.Table.Database}.{tableName} ({cols}) VALUES ({colsParams}) ON DUPLICATE KEY UPDATE {setParams};";
and the problem goes away.
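To make the shape of the corrected statement concrete, here is a small standalone sketch that builds a MySQL upsert string with the VALUES clause in place. UpsertSqlBuilder and its parameters are hypothetical, and the backtick quoting and @name parameter style are assumptions; the library's actual pipeline code may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not DotnetSpider's MySqlEntityPipeline.
public static class UpsertSqlBuilder
{
    public static string Build(
        string database,
        string table,
        IReadOnlyList<string> columns,
        IReadOnlyList<string> updateColumns)
    {
        // Column list, e.g. `id`, `name`
        string cols = string.Join(", ", columns.Select(c => $"`{c}`"));
        // Parameter placeholders, e.g. @id, @name
        string colsParams = string.Join(", ", columns.Select(c => $"@{c}"));
        // Assignments applied when the key already exists
        string setParams = string.Join(", ", updateColumns.Select(c => $"`{c}` = @{c}"));

        // Without "VALUES (...)" MySQL rejects the statement,
        // which is the defect described in the report above.
        return $"INSERT INTO `{database}`.`{table}` ({cols}) VALUES ({colsParams}) " +
               $"ON DUPLICATE KEY UPDATE {setParams};";
    }
}
```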
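Are there plans to add a cookie pool and a User-Agent pool?
First of all, thank you for open-sourcing this library. After reading the issues I found that DotnetSpider.Core2 is not the latest version, so I switched to DotnetSpider.Core. However, all the documentation is for DotnetSpider.Core2, and the comments in DotnetSpider.Core are not as complete as those in DotnetSpider.Core2.
Hello, the downloaded project fails to compile: the file RuthSpider.cs cannot be found.
When crawling the next link through a proxy, I get the error "This instance has already started one or more requests. Properties can only be modified before sending the first request." I see that the proxy pool class implements the IDisposable interface. Is this because resources were not released? Where should they be released? My proxy pool class is as follows: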
public class HttpProxyPool : IHttpProxyPool
{
public void Dispose()
{
}
public UseSpecifiedUriWebProxy GetProxy()
{
var uri = new Uri("http://125.126.162.105:45504");
return new UseSpecifiedUriWebProxy(uri);
}
public void ReturnProxy(UseSpecifiedUriWebProxy proxy, HttpStatusCode statusCode)
{
}
}
Code that sets the proxy:
site.HttpProxyPool = new HttpProxyPool();
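1. The example at https://github.com/dotnetcore/DotnetSpider/wiki/1.-第一个简单的爬虫 fails with the error: Download https://github.com/zlzforever failed: One or more errors occurred.
Environment: net45, DotnetSpider2.Core.2.4.4.
2. How do I crawl pages that take search parameters? For example, on the page http://list.youku.com/category/show/c_96_s_1_d_1_p_{i}.html, with the search terms "后来的我们", "少林足球", and "羞羞的铁拳", how can I crawl these three pages elegantly?
3. Does this component switch IP addresses automatically while crawling? How do I switch IPs?
I found that running the project's examples requires a database. Could you upload the database?
Hello, I would like to ask about best practices for using the DotnetSpider framework:
Scenario: I need to get the first-level links for all categories from a website's home page, build new second-level links from those first-level links, and then crawl data from all of the second-level links.
Problem: I created a Spider and, within it, a Processor that captures all first-level links from the home page. How do I take those captured links and feed them directly into a new Spider to run a new crawl?
I would appreciate any advice or best-practice ideas. Thank you.
ISelectable.Links()
When the URL is https://img.alicdn.com/bao/uploaded/i4/143391948/TB2jv7vaC1I.eBjy0FjXXabfXXa_!!143391948.jpg,
which contains exclamation marks, it is truncated at the exclamation mark and the full URL cannot be obtained.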
Hi
I have a question regarding NuGet packages. Which projects are intended to be NuGet packages? Only DotnetSpider.Core?
Best
f
When trying to run the BaseUsage sample, I get an error in Spider.cs in the CheckIfSettingsCorrect() method.
if (Site.RemoveOutboundLinks && (Site.Domains == null || Site.Domains.Length == 0)) {
throw new SpiderException($"When you want remove outbound links, the domains should not be null or empty."); }
I guess the Domains property is required if RemoveOutboundLinks is true, but I don't know what the purpose of that property is.
Any plan to publish to NuGet?
Using AddEntityType();
to get the content of all p elements under busstoplist:
http://wapapp.dy4g.cn/bus/auto/test.php?t=linhtml&busline=1
`
/// <summary>
/// Bus station information
/// </summary>
[Schema("dybus", "BusStation")]
[Entity(Expression = ".//div[@class='busstoplist']/div//p", Type = SelectorType.XPath)]
class BusStation : BaseEntity
{
    /// <summary>
    /// Bus line information
    /// </summary>
    [Column]
    [Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
    public string Keyword { get; set; }
    /// <summary>
    /// Unique station ID
    /// </summary>
    [Column]
    [Field(Expression = "./@Id")]
    public string BusStationId { get; set; }
    /// <summary>
    /// Route number of the bus line
    /// </summary>
    [Column]
    [Field(Expression = "./strong/text()")]
    public string StationNumber { get; set; }
    /// <summary>
    /// Station name
    /// </summary>
    [Column]
    [Field(Expression = "./span/text()")]
    public string Name { get; set; }
    /// <summary>
    /// Station direction
    /// </summary>
    [Column]
    [Field(Expression = "../@class")]
    public string BusDirection { get; set; }
}
`
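The data is retrieved correctly,
but each record is inserted twice.
Could you take a look and tell me whether I am using it incorrectly?
The Spider runs two threads and has a single entry URL. The first thread takes the URL and processes it while the second thread loops waiting. If that URL happens to take a long time to process, the second thread waits waitCount times and then sets the Spider status to Finished, although the first URL is actually still being processed.
So shouldn't the check be: only once all threads are idle, wait waitCount times and then treat the crawl as finished?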
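The proposed check could be sketched roughly as follows, in plain .NET and not tied to DotnetSpider's internals. WorkerActivityTracker and its members are hypothetical names; the idea is simply that the idle countdown only leads to Finished when no worker is still processing a request.

```csharp
using System.Threading;

// Hypothetical tracker for how many worker threads are busy.
public sealed class WorkerActivityTracker
{
    private int _active;

    // Call when a worker starts processing a request.
    public void Begin() => Interlocked.Increment(ref _active);

    // Call when a worker finishes processing a request (ideally in a finally block).
    public void End() => Interlocked.Decrement(ref _active);

    public int ActiveCount => Volatile.Read(ref _active);

    // The spider may finish only when the queue is empty, no worker is busy,
    // and the idle wait counter has reached waitCount.
    public bool ShouldFinish(bool queueEmpty, int idleWaits, int waitCount)
    {
        return queueEmpty && ActiveCount == 0 && idleWaits >= waitCount;
    }
}
```

In the scenario described, the second thread would see ActiveCount equal to 1 while the first URL is still being processed, so it would keep waiting instead of marking the Spider as Finished.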
internal bool IsAutoIncrementPrimary => Primary.Count == 1 && Columns.Count(f => f.DataType == DataType.Int || f.DataType == DataType.Long) == 1;
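This determines whether the database table should use an auto-increment primary key. The current logic is:
if there is exactly one primary key and the table has exactly one integer-typed column, then that primary key is treated as auto-increment.
This has the following problems:
The class library's project.json should not target netcoreapp1.0; it should target the netstandard1.6 platform standard.
Suppose an article has multiple authors, and each author has a corresponding image. How should the field types be defined in EntitySpider?
The demos currently seem to use only single-valued types.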
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public string Name { get; set; }
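Could something like this be supported?
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public List<string> Name { get; set; }
Alternatively, being able to define a callback function that runs after each field is parsed successfully would also work.
Thanks!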
Hi. First of all: cool project!
When is the next NuGet package update planned? I would really love to use my own logger, which is not possible with the current version but should be with the next one (commit ae9bb7e) :)
thanks