
immospider's People

Contributors

asmaier, dependabot[bot]


immospider's Issues

Locations

The URL for the search request contains Bundesland (state) and Stadt (city), but this often does not work. Examples:

Bayern/Ebersberg
Bayern/Erding
Bayern/Landsberg

For these, Immoscout returns HTTP error 410 (Gone).

Do you have any hint as to what is wrong with these cities?
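
One way to narrow this down, independent of the spider, is to request each failing URL directly and compare status codes. A minimal sketch, assuming the requests package and that these are the literal search paths:

import requests

# Hypothetical check: these are the search paths reported as failing.
BASE = "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete"
for location in ["Bayern/Ebersberg", "Bayern/Erding", "Bayern/Landsberg"]:
    resp = requests.get(BASE + "/" + location,
                        headers={"User-Agent": "Mozilla/5.0"})
    # 410 (Gone) suggests the path itself is no longer valid on the site,
    # while 200 would point at a problem in the spider instead.
    print(location, resp.status_code)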

scrapy crawl immoscout -o apartments.csv produces an empty file

I followed the instructions in the Readme. The "Simple scraping" step says that the output of the following command should list the apartments in Berlin in apartments.csv, but the output file is empty. Did I miss something?
I have copied the log of the command below for better understanding.

Thanks
Sajjad

$ scrapy crawl immoscout -o apartments.csv -a url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 -L INFO
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: immospider)
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (default, Sep 25 2020, 09:36:53) - [GCC 10.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 3.0, Platform Linux-5.8.0-36-generic-x86_64-with-glibc2.32
2021-01-19 13:41:20 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'immospider',
'LOG_LEVEL': 'INFO',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'immospider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['immospider.spiders']}
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet Password: 4a1f6f3d22013ab8
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'immospider.extensions.SendMail']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled item pipelines:
['immospider.pipelines.GooglemapsPipeline',
'immospider.pipelines.DuplicatesPipeline']
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider opened
2021-01-19 13:41:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-19 13:41:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00>: HTTP status code is not handled or not allowed
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-19 13:41:20 [immospider.extensions] INFO: No new items found. No email sent.
2021-01-19 13:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 910,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 18012,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/405': 1,
'elapsed_time_seconds': 0.256602,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 624024),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/405': 1,
'log_count/INFO': 12,
'memusage/max': 57483264,
'memusage/startup': 57483264,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 367422)}
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider closed (finished)
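
The log shows the actual search URL answered with HTTP 405, which HttpErrorMiddleware then ignores, so no items are scraped and the CSV stays empty. A first thing to try (a sketch, not a guaranteed fix) is overriding Scrapy's default user agent, which the site may be rejecting, and quoting the URL:

$ scrapy crawl immoscout -o apartments.csv -a url="https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00" -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64)" -L INFO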

Issue? Copying a URL from Immoscout doesn't work

(I hope it's fine to post this as an issue. Very likely I just don't understand the script and this is just a question.)

Hi @asmaier,

Your ImmoSpider script looks super useful and I am trying to get the basic scraping working. In your Readme you provide a basic example URL that looks like this:

url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00

I can run the script with that URL and manage to generate the CSV output. But once I specify my own search through the Immoscout web interface, the resulting URL looks very different from the one you provide:

url=https://www.immobilienscout24.de/Suche/radius/wohnung-kaufen?centerofsearchaddress=Hamburg;;;1276006001005;;Altona-Altstadt&numberofrooms=3.0-&price=-800000.0&livingspace=50.0-&geocoordinates=53.54883;9.9477;2.0&enteredFrom=one_step_search

Plugging such a URL into the scrapy crawl immoscout CLI command obviously doesn't work. I tried to reconstruct a URL similar to yours, but that failed as well.

Could you explain how or where you get that url from so that it is compatible with your script?

Thanks a lot,
mo


Follow up:
I noticed I can specify the URL within the spider file immoscout.py, constructing a URL that uses & to chain search filters such as &numberofrooms=4.0-.

url = "https://www.immobilienscout24.de/Suche/de/hamburg/hamburg/eppendorf/wohnung-kaufen?numberofrooms=4.0-&sorting=2&pagenumber=0"

But it doesn't work via the CLI. So I am fine either way, since I can use the script, but it would still be great to get the CLI command running. Am I missing something?

Thanks,
Jan
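
A likely reason the CLI fails with such a URL while the hard-coded one works: the unquoted & characters are interpreted by the shell as command separators, so Scrapy only ever sees the URL up to the first &. Quoting the -a argument should pass the full query string through, e.g.:

$ scrapy crawl immoscout -o apartments.csv -a url="https://www.immobilienscout24.de/Suche/de/hamburg/hamburg/eppendorf/wohnung-kaufen?numberofrooms=4.0-&sorting=2&pagenumber=0"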

Not working anymore

Hi,

This is a really nice script! It seems that due to the newest restrictions by IS it does not work anymore. Is this right? I've started the script, but it doesn't find any items.

Thanks!

Scrapy cannot handle the 405 error

Hi everyone,

Until two days ago I used the immoscout spider with Docker almost every day. Now I get a 405 error. Scrapy can scrape the landing page with a 200 response, but not a search URL as described in the tutorial. Immobilienscout detects Scrapy as a bot and redirects to a reCAPTCHA site...

The question is: what is the simplest way to circumvent the 405 error?

I am not the best Python expert yet, so I would be happy about any help 👍

best regards
Gettlar
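
There is no reliable way around server-side bot detection, but the usual Scrapy settings for making a crawler look less bot-like are a browser-like user agent and a crawl delay. A sketch for settings.py (values are illustrative; whether this gets past Immobilienscout's detection is not guaranteed):

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
DOWNLOAD_DELAY = 5               # seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x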

feat/add-similar-objects

Hi @asmaier

I'd like to create a pull request to add "similar objects" in the parsing section of immoscout.py.
Can you give me the rights so I can do so? Currently I don't seem to have permission.

Thanks a lot!
Jan
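
For reference, opening a pull request on GitHub does not require write access to the repository. The usual flow is to fork the repo via the GitHub web UI and push a branch to your fork (<your-username> is a placeholder):

$ git clone https://github.com/<your-username>/immospider.git
$ cd immospider
$ git checkout -b feat/add-similar-objects
# ... edit immoscout.py ...
$ git commit -am "Parse similar objects"
$ git push origin feat/add-similar-objects

Then open the pull request from the fork on github.com.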

# -*- coding: utf-8 -*-
import scrapy
import json
from immospider.items import ImmoscoutItem


class ImmoscoutSpider(scrapy.Spider):
    name = "immoscout"
    allowed_domains = ["immobilienscout24.de"]
    # start_urls = ['https://www.immobilienscout24.de/Suche/S-2/Wohnung-Miete/Berlin/Berlin']
    # start_urls = ['https://www.immobilienscout24.de/Suche/S-2/Wohnung-Miete/Berlin/Berlin/Lichterfelde-Steglitz_Nikolassee-Zehlendorf_Dahlem-Zehlendorf_Zehlendorf-Zehlendorf/2,50-/60,00-/EURO--800,00/-/-/']

    # The immoscout search results are stored as json inside their javascript. This makes the parsing very easy.
    # I learned this trick from https://github.com/balzer82/immoscraper/blob/master/immoscraper.ipynb .
    script_xpath = './/script[contains(., "IS24.resultList")]'
    next_xpath = '//div[@id = "pager"]/div/a/@href'

    def start_requests(self):
        # The start URL is passed on the command line via "-a url=...".
        yield scrapy.Request(self.url)

    def parse(self, response):

        print(response.url)

        script = response.xpath(self.script_xpath).extract_first()
        if not script:
            return

        for line in script.split('\n'):
            if line.strip().startswith('resultListModel'):
                immo_json = line.strip()
                # Strip the "resultListModel: " prefix and the trailing separator.
                immo_json = json.loads(immo_json[17:-1])

                entries = immo_json["searchResponseModel"]["resultlist.resultlist"]["resultlistEntries"][0]["resultlistEntry"]
                # On pages with a single result, resultlistEntry is a dict
                # instead of a list, so normalize it before iterating.
                if isinstance(entries, dict):
                    entries = [entries]
                for result in entries:
                    item = self.parse_result(result, response)
                    yield item

                    # check for and parse "similar objects" with additional matching results in json body
                    if "similarObjects" in result:
                        for i in result["similarObjects"][0]["similarObject"]:
                            item = self.parse_data_object(i, response)
                            yield item

        # Follow pagination: the last link in the pager points to the next page.
        next_page_list = response.xpath(self.next_xpath).extract()
        if next_page_list:
            next_page = next_page_list[-1]
            print("Scraping next page", next_page)
            if next_page:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

    def parse_result(self, result, response):
        """Map one JSON result entry to an ImmoscoutItem.

        :param result: a single resultlistEntry dict from the search results JSON
        :param response: the scrapy response, used to build absolute expose URLs
        """
        item = ImmoscoutItem()
        data = result["resultlist.realEstate"]

        item["immo_id"] = data["@id"]
        item["url"] = response.urljoin("/expose/" + str(data["@id"]))
        item["title"] = data["title"]
        address = data["address"]
        try:
            item["address"] = address["street"] + " " + address["houseNumber"]
        except (KeyError, TypeError):
            item["address"] = None
        item["city"] = address["city"]
        item["zip_code"] = address["postcode"]
        item["district"] = address["quarter"]

        item["rent"] = data["price"]["value"]
        item["sqm"] = data["livingSpace"]
        item["rooms"] = data["numberOfRooms"]

        if "calculatedPrice" in data:
            item["extra_costs"] = (
                data["calculatedPrice"]["value"] - data["price"]["value"]
            )
        if "builtInKitchen" in data:
            item["kitchen"] = data["builtInKitchen"]
        if "balcony" in data:
            item["balcony"] = data["balcony"]
        if "garden" in data:
            item["garden"] = data["garden"]
        if "privateOffer" in data:
            item["private"] = data["privateOffer"]
        if "plotArea" in data:
            item["area"] = data["plotArea"]
        if "cellar" in data:
            item["cellar"] = data["cellar"]

        try:
            contact = data["contactDetails"]
            item["contact_name"] = contact["firstname"] + " " + contact["lastname"]
        except (KeyError, TypeError):
            item["contact_name"] = None

        try:
            item["media_count"] = len(data["galleryAttachments"]["attachment"])
        except (KeyError, TypeError):
            item["media_count"] = 0

        try:
            item["lat"] = address["wgs84Coordinate"]["latitude"]
            item["lng"] = address["wgs84Coordinate"]["longitude"]
        except KeyError:
            item["lat"] = None
            item["lng"] = None

        print(item)

        return item
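
The parse method above calls self.parse_data_object for "similar objects", but that method is not shown. A minimal sketch, assuming each similarObject entry has the same shape as the "resultlist.realEstate" payload (an assumption, not confirmed here), would simply delegate to parse_result:

    def parse_data_object(self, data, response):
        # Assumption: a similarObject entry looks like the
        # "resultlist.realEstate" payload, so wrap it and reuse parse_result.
        return self.parse_result({"resultlist.realEstate": data}, response)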
