
immospider's People

Contributors

asmaier, dependabot[bot]


immospider's Issues

Locations

The URL for the search request contains Bundesland (state) and Stadt (city), but this often does not work. Examples:

Bayern/Ebersberg
Bayern/Erding
Bayern/Landsberg

For these, Immoscout returns HTTP error 410 (Gone).

Do you have any hint as to what is wrong with these cities?
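
One way to narrow this down, independent of the spider, is to request each failing URL directly and compare status codes. A minimal sketch, assuming the requests package and that these are the literal search paths:

import requests

# Hypothetical check: these are the search paths reported as failing.
BASE = "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete"
for location in ["Bayern/Ebersberg", "Bayern/Erding", "Bayern/Landsberg"]:
    resp = requests.get(BASE + "/" + location,
                        headers={"User-Agent": "Mozilla/5.0"})
    # 410 (Gone) suggests the path itself is no longer valid on the site,
    # while 200 would point at a problem in the spider instead.
    print(location, resp.status_code)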

scrapy crawl immoscout -o apartments.csv produces an empty file

I followed the instructions in the Readme. The "Simple scraping" step says that the output of the following command should list the apartments in Berlin in apartments.csv, but the output file is empty. Did I miss something?
I have copied the log of the command below for better understanding.

Thanks
Sajjad

$ scrapy crawl immoscout -o apartments.csv -a url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 -L INFO
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: immospider)
2021-01-19 13:41:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 (default, Sep 25 2020, 09:36:53) - [GCC 10.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 3.0, Platform Linux-5.8.0-36-generic-x86_64-with-glibc2.32
2021-01-19 13:41:20 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'immospider',
'LOG_LEVEL': 'INFO',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'immospider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['immospider.spiders']}
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet Password: 4a1f6f3d22013ab8
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'immospider.extensions.SendMail']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-01-19 13:41:20 [scrapy.middleware] INFO: Enabled item pipelines:
['immospider.pipelines.GooglemapsPipeline',
'immospider.pipelines.DuplicatesPipeline']
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider opened
2021-01-19 13:41:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-19 13:41:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-19 13:41:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00>: HTTP status code is not handled or not allowed
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-19 13:41:20 [immospider.extensions] INFO: No new items found. No email sent.
2021-01-19 13:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 910,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 18012,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/405': 1,
'elapsed_time_seconds': 0.256602,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 624024),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/405': 1,
'log_count/INFO': 12,
'memusage/max': 57483264,
'memusage/startup': 57483264,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 1, 19, 12, 41, 20, 367422)}
2021-01-19 13:41:20 [scrapy.core.engine] INFO: Spider closed (finished)
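
The log shows the actual search URL answered with HTTP 405, which HttpErrorMiddleware then ignores, so no items are scraped and the CSV stays empty. A first thing to try (a sketch, not a guaranteed fix) is overriding Scrapy's default user agent, which the site may be rejecting, and quoting the URL:

$ scrapy crawl immoscout -o apartments.csv -a url="https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00" -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64)" -L INFO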

Issue? Copying a URL from Immoscout doesn't work

(I hope it's fine to post this as an issue. Very likely I just don't understand the script and this is just a question.)

Hi @asmaier,

Your ImmoSpider script looks super useful and I am trying to get the basic scraping working. In your Readme you provide a basic example URL that looks like this:

url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00

I can run the script with that URL and manage to generate the CSV output. But once I specify my own search through the Immoscout web interface, the resulting URL looks very different from the one you provide:

url=https://www.immobilienscout24.de/Suche/radius/wohnung-kaufen?centerofsearchaddress=Hamburg;;;1276006001005;;Altona-Altstadt&numberofrooms=3.0-&price=-800000.0&livingspace=50.0-&geocoordinates=53.54883;9.9477;2.0&enteredFrom=one_step_search

Plugging such a URL into the scrapy crawl immoscout CLI command obviously doesn't work. I tried to reconstruct a URL similar to yours, but that failed as well.

Could you explain how or where you get that url from so that it is compatible with your script?

Thanks a lot,
mo


Follow up:
I noticed I can specify the URL within the spider file immoscout.py, constructing a URL that uses & to chain search filters such as &numberofrooms=4.0-.

url = "https://www.immobilienscout24.de/Suche/de/hamburg/hamburg/eppendorf/wohnung-kaufen?numberofrooms=4.0-&sorting=2&pagenumber=0"

But it doesn't work via the CLI. So I am fine either way, since I can use the script, but it would still be great to get the CLI command running. Am I missing something?

Thanks,
Jan
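
A likely reason the CLI fails with such a URL while the hard-coded one works: the unquoted & characters are interpreted by the shell as command separators, so Scrapy only ever sees the URL up to the first &. Quoting the -a argument should pass the full query string through, e.g.:

$ scrapy crawl immoscout -o apartments.csv -a url="https://www.immobilienscout24.de/Suche/de/hamburg/hamburg/eppendorf/wohnung-kaufen?numberofrooms=4.0-&sorting=2&pagenumber=0"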

Not working anymore

Hi,

This is a really nice script! It seems that due to the newest restrictions by IS it does not work anymore. Is this right? I've started the script, but it doesn't find any items.

Thanks!

Scrapy cannot handle the 405 error

Hi everyone,

Until two days ago I used the immoscout spider with Docker almost every day. Now I get a 405 error. Scrapy can scrape the landing page with a 200 response, but not a search URL as described in the tutorial. Immobilienscout detects Scrapy as a bot and redirects to a reCAPTCHA site...

The question is: what is the simplest way to circumvent the 405 error?

I am not the best Python expert yet, so I would be happy about any help 👍

best regards
Gettlar
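
There is no reliable way around server-side bot detection, but the usual Scrapy settings for making a crawler look less bot-like are a browser-like user agent and a crawl delay. A sketch for settings.py (values are illustrative; whether this gets past Immobilienscout's detection is not guaranteed):

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
DOWNLOAD_DELAY = 5               # seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x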

feat/add-similar-objects

Hi @asmaier

I'd like to create a pull request to add "similar objects" in the parsing section of immoscout.py.
Can you give me the rights so I can do so? Currently I don't seem to have permission.

Thanks a lot!
Jan
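
For reference, opening a pull request on GitHub does not require write access to the repository. The usual flow is to fork the repo via the GitHub web UI and push a branch to your fork (<your-username> is a placeholder):

$ git clone https://github.com/<your-username>/immospider.git
$ cd immospider
$ git checkout -b feat/add-similar-objects
# ... edit immoscout.py ...
$ git commit -am "Parse similar objects"
$ git push origin feat/add-similar-objects

Then open the pull request from the fork on github.com.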

# -*- coding: utf-8 -*-
import scrapy
import json
from immospider.items import ImmoscoutItem


class ImmoscoutSpider(scrapy.Spider):
    name = "immoscout"
    allowed_domains = ["immobilienscout24.de"]
    # start_urls = ['https://www.immobilienscout24.de/Suche/S-2/Wohnung-Miete/Berlin/Berlin']
    # start_urls = ['https://www.immobilienscout24.de/Suche/S-2/Wohnung-Miete/Berlin/Berlin/Lichterfelde-Steglitz_Nikolassee-Zehlendorf_Dahlem-Zehlendorf_Zehlendorf-Zehlendorf/2,50-/60,00-/EURO--800,00/-/-/']

    # The immoscout search results are stored as json inside their javascript. This makes the parsing very easy.
    # I learned this trick from https://github.com/balzer82/immoscraper/blob/master/immoscraper.ipynb .
    script_xpath = './/script[contains(., "IS24.resultList")]'
    next_xpath = '//div[@id = "pager"]/div/a/@href'

    def start_requests(self):
        # The start URL is passed on the command line via "-a url=...".
        yield scrapy.Request(self.url)

    def parse(self, response):

        print(response.url)

        script = response.xpath(self.script_xpath).extract_first()
        if not script:
            return

        for line in script.split('\n'):
            if line.strip().startswith('resultListModel'):
                immo_json = line.strip()
                # Strip the "resultListModel: " prefix and the trailing separator.
                immo_json = json.loads(immo_json[17:-1])

                entries = immo_json["searchResponseModel"]["resultlist.resultlist"]["resultlistEntries"][0]["resultlistEntry"]
                # On pages with a single result, resultlistEntry is a dict
                # instead of a list, so normalize it before iterating.
                if isinstance(entries, dict):
                    entries = [entries]
                for result in entries:
                    item = self.parse_result(result, response)
                    yield item

                    # check for and parse "similar objects" with additional matching results in json body
                    if "similarObjects" in result:
                        for i in result["similarObjects"][0]["similarObject"]:
                            item = self.parse_data_object(i, response)
                            yield item

        # Follow pagination: the last link in the pager points to the next page.
        next_page_list = response.xpath(self.next_xpath).extract()
        if next_page_list:
            next_page = next_page_list[-1]
            print("Scraping next page", next_page)
            if next_page:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

    def parse_result(self, result, response):
        """Map one JSON result entry to an ImmoscoutItem.

        :param result: a single resultlistEntry dict from the search results JSON
        :param response: the scrapy response, used to build absolute expose URLs
        """
        item = ImmoscoutItem()
        data = result["resultlist.realEstate"]

        item["immo_id"] = data["@id"]
        item["url"] = response.urljoin("/expose/" + str(data["@id"]))
        item["title"] = data["title"]
        address = data["address"]
        try:
            item["address"] = address["street"] + " " + address["houseNumber"]
        except (KeyError, TypeError):
            item["address"] = None
        item["city"] = address["city"]
        item["zip_code"] = address["postcode"]
        item["district"] = address["quarter"]

        item["rent"] = data["price"]["value"]
        item["sqm"] = data["livingSpace"]
        item["rooms"] = data["numberOfRooms"]

        if "calculatedPrice" in data:
            item["extra_costs"] = (
                data["calculatedPrice"]["value"] - data["price"]["value"]
            )
        if "builtInKitchen" in data:
            item["kitchen"] = data["builtInKitchen"]
        if "balcony" in data:
            item["balcony"] = data["balcony"]
        if "garden" in data:
            item["garden"] = data["garden"]
        if "privateOffer" in data:
            item["private"] = data["privateOffer"]
        if "plotArea" in data:
            item["area"] = data["plotArea"]
        if "cellar" in data:
            item["cellar"] = data["cellar"]

        try:
            contact = data["contactDetails"]
            item["contact_name"] = contact["firstname"] + " " + contact["lastname"]
        except (KeyError, TypeError):
            item["contact_name"] = None

        try:
            item["media_count"] = len(data["galleryAttachments"]["attachment"])
        except (KeyError, TypeError):
            item["media_count"] = 0

        try:
            item["lat"] = address["wgs84Coordinate"]["latitude"]
            item["lng"] = address["wgs84Coordinate"]["longitude"]
        except KeyError:
            item["lat"] = None
            item["lng"] = None

        print(item)

        return item
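
The parse method above calls self.parse_data_object for "similar objects", but that method is not shown. A minimal sketch, assuming each similarObject entry has the same shape as the "resultlist.realEstate" payload (an assumption, not confirmed here), would simply delegate to parse_result:

    def parse_data_object(self, data, response):
        # Assumption: a similarObject entry looks like the
        # "resultlist.realEstate" payload, so wrap it and reuse parse_result.
        return self.parse_result({"resultlist.realEstate": data}, response)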
