appcrawler's Introduction

App Crawler

Introduction and discussion (in Chinese)

Collect app information from the Baidu and Google Play app markets using Python, Scrapy, and MongoDB.

For Scrapy 1.0+, change `ITEM_PIPELINES` in `app/settings.py` to:

ITEM_PIPELINES = {
  'scrapy_mongodb.MongoDBPipeline': 100
}
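
Since Scrapy 1.0, `ITEM_PIPELINES` is a dict that maps each pipeline's import path to an integer order (conventionally 0 to 1000, lower values run first). As a rough sketch, a fuller settings fragment might look like the one below. The `MONGODB_*` names are the ones scrapy-mongodb documents as far as I know, so check them against the version you install; the database and collection names are hypothetical placeholders, not values taken from this project.

```python
# app/settings.py (fragment)

# Scrapy 1.0+ expects a dict: pipeline path -> order (lower runs first).
ITEM_PIPELINES = {
    'scrapy_mongodb.MongoDBPipeline': 100,
}

# Assumed scrapy-mongodb settings; the values are placeholders.
MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'appcrawler'  # hypothetical database name
MONGODB_COLLECTION = 'apps'      # hypothetical collection name
```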

appcrawler's People

Contributors

oa414

appcrawler's Issues

ImportError: No module named 'sgmllib'

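Hello. Your project is very worthwhile, and I have the same need.
I am a Python beginner and have some questions about the code.
I installed Python 3.5. I have not yet figured out how to start MongoDB, so for now I am saving the results to a CSV file: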
scrapy crawl google -o test.csv JOBDIR=app/jobs

But I get the following error:
ImportError: No module named 'sgmllib'

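Searching online, I learned that SgmlLinkExtractor and LinkExtractor both depend on sgmllib, and that Python 3.0 does not support sgmllib. Do I need to install a Python 2.7 environment instead? Is there any alternative?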

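I am also curious: on the Google Play "Viber" page, after the spider gets the app id and download count, how does it move on to the next app? How is this loop implemented?
rules = [
    Rule(LinkExtractor(allow=("https://play.google.com/store/apps/details", )), callback='parse_app', follow=True),
]  # CrawlSpider crawls pages according to the rules and calls the callback to process them
I do not understand this piece of code.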
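
For readers with the same question: a CrawlSpider applies each Rule's LinkExtractor to every response it downloads, schedules requests for the links that match `allow`, passes those pages to the named callback, and, with `follow=True`, applies the rules again to them. The scheduler filters out requests it has already seen, so the crawl ends once no new matching links turn up. The following is only a standard-library sketch of that idea, not Scrapy's implementation: `FAKE_SITE`, `crawl`, and `fetch_links` are hypothetical names, the "site" is an in-memory dict rather than real HTTP, the start page is handed to the callback for simplicity, and a FIFO queue stands in for Scrapy's scheduler.

```python
import re
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
BASE = "https://play.google.com/store/apps/"
FAKE_SITE = {
    BASE + "details?id=a": [BASE + "details?id=b", BASE + "category/GAME"],
    BASE + "details?id=b": [BASE + "details?id=a", BASE + "details?id=c"],
    BASE + "details?id=c": [],
}

# Same pattern as the allow= argument in the rule above.
ALLOW = re.compile(r"https://play\.google\.com/store/apps/details")


def crawl(start_url, fetch_links, parse_app):
    """Visit a page, call the callback, then queue every new matching link."""
    seen = {start_url}          # duplicate filter
    queue = deque([start_url])  # pending pages
    while queue:
        url = queue.popleft()
        parse_app(url)          # plays the role of callback='parse_app'
        for link in fetch_links(url):
            # Only links matching the allow pattern are followed, once each.
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


visited = []
crawl(BASE + "details?id=a", lambda url: FAKE_SITE.get(url, []), visited.append)
print(visited)
```

Here the category link is never visited because it does not match the pattern, and page `a` is not visited twice even though page `b` links back to it.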

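I can use BeautifulSoup + Requests to scrape the name and download count of a single app, but I cannot build a loop that crawls all apps, nor can I make an interrupted job resume where it left off instead of starting over.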
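
On resuming: Scrapy can persist its scheduler queue and duplicate filter to disk when a job directory is set, for example `scrapy crawl google -s JOBDIR=app/jobs`, and reuses that state when started again with the same directory. For a hand-written BeautifulSoup + Requests loop, the same idea can be sketched by saving the pending queue and the seen set after each page. This is a minimal illustration under stated assumptions, not Scrapy's mechanism: `load_state`, `save_state`, `crawl_resumable`, the JSON layout, and the toy `SITE` dict are all hypothetical.

```python
import json
import os
import tempfile
from collections import deque


def load_state(path, start_url):
    """Return (queue, seen) saved by a previous run, or a fresh state."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return deque(state["queue"]), set(state["seen"])
    return deque([start_url]), {start_url}


def save_state(path, queue, seen):
    """Write the state to a temp file, then swap it in with os.replace."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"queue": list(queue), "seen": sorted(seen)}, f)
    os.replace(tmp, path)


def crawl_resumable(path, start_url, fetch_links, handle, max_pages=None):
    """Crawl until the queue is empty or max_pages pages were handled."""
    queue, seen = load_state(path, start_url)
    processed = 0
    while queue and (max_pages is None or processed < max_pages):
        url = queue.popleft()
        handle(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        processed += 1
        save_state(path, queue, seen)  # checkpoint after every page
    return processed


# Toy demo: stop after two pages, then resume from the saved state.
SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
visited = []
with tempfile.TemporaryDirectory() as job_dir:
    state_file = os.path.join(job_dir, "state.json")
    first = crawl_resumable(state_file, "a", SITE.get, visited.append, max_pages=2)
    second = crawl_resumable(state_file, "a", SITE.get, visited.append)
print(visited)
```

Because the checkpoint is written only after a page has been handled, an interruption in the middle of a page means that page is handled again on the next run, so the handler should tolerate duplicates.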

A question

Doesn't crawling Google Play require a proxy to get around the firewall? Google Play pages are loaded dynamically; can this be done with Scrapy alone?
