oa414 / appcrawler

Spider for extracting Android app information from app markets

Home Page: https://github.com/oa414/AppCrawler

Python 100.00%

appcrawler's Introduction

App Crawler

Introduction and discussion (in Chinese)

Collects app information from the Baidu and Google Play app markets using Python, Scrapy, and MongoDB.

For Scrapy 1.0+, change `ITEM_PIPELINES` in `app/settings.py` to:

ITEM_PIPELINES = {
  'scrapy_mongodb.MongoDBPipeline': 100
}

appcrawler's Issues

ImportError: No module named 'sgmllib'

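Hello. Your project is very useful, and I have the same need.
I am a Python beginner and have some questions about the code.
I installed Python 3.5. I have not yet figured out how to start MongoDB, so for now I am saving the results to a CSV file:
scrapy crawl google -o test.csv JOBDIR=app/jobs

But I get the following error:
ImportError: No module named 'sgmllib'

I searched online and learned that both SgmlLinkExtractor and LinkExtractor depend on sgmllib, and that Python 3.0 does not support sgmllib. Do I need to install a Python 2.7 environment instead? Is there any alternative?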
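
On the question of alternatives: `sgmllib` was removed from the standard library in Python 3, and `html.parser` is the standard-library parser that remains. Newer Scrapy releases also ship an lxml-based `scrapy.linkextractors.LinkExtractor` that does not need `sgmllib`. The sketch below is not this project's code. It only illustrates, with the standard library alone, what a link extractor with an `allow` filter does. The names `PrefixLinkExtractor` and `extract_links` are hypothetical, and matching by URL prefix is a simplification of Scrapy's regular-expression matching.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrefixLinkExtractor(HTMLParser):
    """Collect absolute <a href> links that start with a given prefix.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """

    def __init__(self, base_url, allow_prefix):
        super().__init__()
        self.base_url = base_url
        self.allow_prefix = allow_prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before filtering.
        url = urljoin(self.base_url, href)
        if url.startswith(self.allow_prefix) and url not in self.links:
            self.links.append(url)


def extract_links(html, base_url, allow_prefix):
    """Return the de-duplicated, allowed links found in an HTML string."""
    parser = PrefixLinkExtractor(base_url, allow_prefix)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_html, "https://play.google.com/", "https://play.google.com/store/apps/details")` would keep only app detail links and drop everything else on the page.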

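I am also curious: on the Google Play page for "Viber", after the spider gets the app id and download count, how does it move on to crawl the next app? How is this loop implemented?
rules = [
    Rule(LinkExtractor(allow=("https://play.google.com/store/apps/details", )), callback='parse_app', follow=True),
]  # CrawlSpider crawls pages according to the rules and calls the callback to process them
I do not understand this piece of code.

I can use BeautifulSoup + Request to scrape the name and download count of a single app, but I cannot build a loop that crawls all apps, nor can I make the task resume after an interruption instead of starting over.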
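
The loop and the resume behaviour asked about here can be pictured as a queue of pending URLs plus a set of URLs already seen. With `follow=True`, every link the extractor allows on a processed page is scheduled in turn, so one app page leads to the next. The sketch below is a simplified, standard-library model of that idea, not Scrapy's actual scheduler. The function `crawl`, its parameters, and the JSON state file are assumptions made for illustration. Scrapy itself persists scheduler state through the `JOBDIR` setting, normally passed as `-s JOBDIR=...`.

```python
import json
import os
from collections import deque


def crawl(start_urls, fetch_links, handle_page, state_path, max_pages=None):
    """Breadth-first crawl that saves its state after every page.

    fetch_links(url) returns the URLs found on that page (caller supplied).
    handle_page(url) is called once per page, like a Rule callback.
    Returns the number of pages processed in this run.
    """
    if os.path.exists(state_path):
        # Resume: reload the pending queue and the set of known URLs.
        with open(state_path, encoding="utf-8") as f:
            state = json.load(f)
        queue, seen = deque(state["queue"]), set(state["seen"])
    else:
        queue, seen = deque(start_urls), set(start_urls)

    processed = 0
    while queue and (max_pages is None or processed < max_pages):
        url = queue.popleft()
        handle_page(url)
        for link in fetch_links(url):
            if link not in seen:  # never schedule the same URL twice
                seen.add(link)
                queue.append(link)
        processed += 1
        # Persist progress so an interrupted run can continue later.
        with open(state_path, "w", encoding="utf-8") as f:
            json.dump({"queue": list(queue), "seen": sorted(seen)}, f)
    return processed
```

Running `crawl` again with the same `state_path` continues from the saved queue rather than from `start_urls`. A page whose processing was interrupted before the state was written is handled again on the next run.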

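A question

Does crawling Google Play not require a proxy to get around the firewall? Google Play loads its content dynamically; can this be done with Scrapy alone?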
