
spider_smooc's Introduction

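A while ago I installed the imooc (慕课网) app and found that its videos can be watched online without registering, so I wanted to scrape them and study on my computer. I decided to spend two days building this with Python, which I had been learning for a while.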

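My new book, 《Python爬虫开发与项目实战》 (Python Crawler Development and Project Practice), has been published. If you are interested, take a look at the sample chapter.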

I developed this in PyCharm and used the BeautifulSoup module to parse HTML. The code is commented in some detail. Project structure:

----entity

--------__init__.py

--------fileinfor.py describes the video file information

----filedeal

--------__init__.py

--------file_downloader.py downloads the video files

----spider the core of the crawler

--------__init__.py

--------html_downloader.py HTML downloader

--------html_parser.py HTML parser

--------spiderman.py core crawler logic

----test holds test cases only and takes no part in running the program

----conf.py global variables

----index.py program entry point
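The division of work in the structure above can be sketched roughly as follows. This is only an illustration of how a downloader, a parser, and a file-info entity might fit together, not the project's actual code: the names `FileInfo`, `LinkParser`, and `crawl` are hypothetical, the choice of `<a href>` links as the thing to extract is an assumption, and the sketch uses the standard library's HTML parser instead of BeautifulSoup so that it runs without extra dependencies.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2.7, the version the project targets
    from HTMLParser import HTMLParser


class FileInfo(object):
    """Hypothetical stand-in for entity/fileinfor.py: describes one video file."""

    def __init__(self, name, url):
        self.name = name
        self.url = url


class LinkParser(HTMLParser):
    """Hypothetical stand-in for spider/html_parser.py: collects <a href> links."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.files = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.files.append(FileInfo(name, self._href))
            self._href = None


def crawl(url, download_html):
    """Hypothetical stand-in for spider/spiderman.py.

    download_html plays the role of spider/html_downloader.py: it takes a URL
    and returns the page as a string.
    """
    parser = LinkParser()
    parser.feed(download_html(url))
    return parser.files
```

Passing the downloader in as a function keeps the parsing step testable with a canned HTML string; the resulting `FileInfo` list is what a file downloader would then consume.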

Runtime environment: Python 2.7.x
Required modules:
BeautifulSoup (install with pip, or download the source package and run setup.py)
Download link: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

To run: on Windows, double-click start.bat. Not tested on Linux.
My WeChat official account: qiye_python

Blogs: http://blog.csdn.net/qiye_/ and http://www.cnblogs.com/qiyeboy/
Screenshot of the tool in use:



-----------------------Update, October 31, 2016------------------------
Switched to a new parsing method to get around imooc's blocking, and added screenshots of the usage instructions.

spider_smooc's People

Contributors

qiyeboy


spider_smooc's Issues

Video downloads only fetch a small number of bytes

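Hello. Although your project uses one thread per video, each video raises an exception after only part of its binary data has been fetched. Is there a good way to download every video completely? Some of the exception output follows: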

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 520384 out of 47830612 bytes

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532059 out of 13004076 bytes

Exception in thread Thread-18:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 585460 out of 6128527 bytes

当前下载进度:---------------->>>>>>>> 6.47%Exception in thread Thread-48:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 582540 out of 24403607 bytes

Exception in thread Thread-36:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 10005207 bytes

Exception in thread Thread-35:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532058 out of 49727052 bytes

Exception in thread Thread-40:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 62159002 bytes

当前下载进度:---------------->>>>>>>> 6.50%Exception in thread Thread-7:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532063 out of 20505701 bytes

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 61492854 bytes

Exception in thread Thread-46:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 527684 out of 14292045 bytes

当前下载进度:---------------->>>>>>>> 6.53%Exception in thread Thread-2:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 10502982 bytes

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586087 out of 9053251 bytes
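One possible mitigation for the `ContentTooShortError` failures above, offered as a sketch rather than a confirmed fix: retry each download a few times and cap how many downloads run at once. The function name `retrieve_with_retry`, the `MAX_PARALLEL` limit of 4, and the `retries` and `delay` defaults are all assumptions for illustration. Whether the server is throttling parallel connections is not established by the traceback, so the cap is a guess worth testing.

```python
import threading
import time

try:  # Python 3
    from urllib.request import urlretrieve
    from urllib.error import ContentTooShortError
except ImportError:  # Python 2.7, the version the project targets
    from urllib import urlretrieve, ContentTooShortError

MAX_PARALLEL = 4  # assumed limit, not a value taken from the project
_download_slots = threading.BoundedSemaphore(MAX_PARALLEL)


def retrieve_with_retry(fileurl, filepath, reporthook=None,
                        retries=3, delay=1.0, fetch=urlretrieve):
    """Call fetch(fileurl, filepath, reporthook), retrying on short reads.

    fetch defaults to urlretrieve; it is a parameter only so the retry logic
    can be exercised without a network connection.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # Hold a slot only while transferring, so at most MAX_PARALLEL
            # downloads are in flight across all threads.
            with _download_slots:
                return fetch(fileurl, filepath, reporthook)
        except ContentTooShortError as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise last_error
```

In the thread's `run` method, the call `urllib.urlretrieve(fileurl, filepath, self.Schedule)` could then become `retrieve_with_retry(fileurl, filepath, self.Schedule)`. Note that each retry restarts the file from the beginning. If the connection is cut at roughly the same offset every time, retrying alone will not help and a resumable download using HTTP `Range` requests would be needed instead.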

In conf.py, PERLIST is defined as an empty list

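However, file_downloader.py uses it as conf.PERLIST[self.__id]= per, which makes Python raise an error, because assigning to an index that does not exist in an empty list raises IndexError. Using the list's own methods is more appropriate, for example: conf.PERLIST.append(per)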
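The difference can be shown with a small sketch. The helper names below are hypothetical and the list stands in for `conf.PERLIST`. It also shows a third option, pre-sizing the list, on the assumption that the original code wants one progress slot per thread id. With `append`, repeated progress updates would keep growing the list.

```python
def record_by_index(perlist, thread_id, per):
    """Mirror conf.PERLIST[self.__id] = per: needs the slot to exist already."""
    perlist[thread_id] = per


def record_by_append(perlist, per):
    """Mirror conf.PERLIST.append(per): always succeeds, grows the list by one."""
    perlist.append(per)


def make_slots(thread_count):
    """Pre-size the list so each thread id has its own slot (assumed intent)."""
    return [0.0] * thread_count


empty = []
try:
    record_by_index(empty, 0, 6.47)   # empty list: no index 0 yet
    index_failed = False
except IndexError:
    index_failed = True

appended = []
record_by_append(appended, 6.47)

slots = make_slots(3)
record_by_index(slots, 2, 6.47)       # works: slot 2 was pre-allocated
```

A dictionary keyed by thread id would be another way to keep one value per thread without knowing the thread count in advance.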

Python project development and practice

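Hello qiye, I bought your book. In Chapter 5, when scraping the content of 盗墓笔记 (Daomu Biji) and saving it in JSON format, I get the error "invalid error".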
