
okcubot2's Issues

Parsing error

Most scrapes do not produce errors, but I found one that does:

2016-03-29 14:59:00 [scrapy] ERROR: Spider error processing <GET https://www.okcupid.com/profile/LukaBGesserit?cf=regular> (referer: https://www.okcupid.com/home)
Traceback (most recent call last):
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 192, in parse_user_info
    target_info = self.parse_misc(response.meta["user_name"], target_info, response)
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 403, in parse_misc
    misc = response.xpath("//table[contains(@class, 'misc')]//td[2]/text()")[0].extract().strip().split(",")
IndexError: list index out of range

From looking at the profile https://www.okcupid.com/profile/LukaBGesserit, it looks fairly normal to me. The only thing I noticed is that the user has not filled out his information regarding smoking etc. (i.e. the third block in the profile, top right area).
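A minimal guard for this case (a Python 3 style sketch; `parse_misc_fields` is a hypothetical helper that takes the list of strings the xpath `.extract()` call would return):

```python
def parse_misc_fields(extracted):
    """Split the misc <td> text into fields, tolerating profiles where
    the block is absent (e.g. smoking info not filled in)."""
    if not extracted:  # xpath matched nothing: user left the section blank
        return []
    return [field.strip() for field in extracted[0].split(",")]
```

In parse_misc this would replace the bare `[0]` index, so profiles without the smoking/drinking block simply yield no misc fields instead of raising IndexError.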

Error on start

You have introduced a bug. Please test the program before you push to the repository.

2016-03-31 00:33:25 [scrapy] ERROR: Spider error processing <GET https://www.okcupid.com/profile/LukaBGesserit/questions> (referer: https://www.okcupid.com/home)
Traceback (most recent call last):
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 188, in check_skip
    with open(path, "rb") as f:
IOError: [Errno 2] No such file or directory: 'data//users.csv'
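A sketch of a guard against the missing file (Python 3 style; `ensure_user_file` is a hypothetical helper, and the `data/` location is taken from the error above):

```python
import os

def ensure_user_file(path):
    """Create the users CSV (and its parent directory) if missing, so the
    first run does not crash with IOError/ENOENT."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    if not os.path.isfile(path):
        open(path, "a").close()  # touch an empty file
    return path
```

Calling this before `open(path, "rb")` in check_skip would make the scraper start cleanly on a fresh checkout.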

Commandline arguments not interpreted as Unicode

This is kind of a meta-bug. Because this project uses Python 2.x, it has ... complex Unicode support. The command-line arguments are interpreted as ASCII while they are actually given as Unicode (on Linux). This causes problems when using --u to test bugs!

The best solution is to switch to Python 3. However, this may cause a lot of problems and I do not have time to sort them out today.

The next best solution is to make it understand that the inputs are given in Unicode. I used:

type=lambda s: unicode(s, sys.getfilesystemencoding())

Source: http://stackoverflow.com/questions/24552854/python-argparse-unicode-argument-issue
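For reference, a version-agnostic sketch of the same idea (`as_unicode` is a hypothetical name; on Python 2 it decodes the byte-string argv entries, on Python 3 it is a pass-through):

```python
import sys

def as_unicode(s):
    """Decode a command-line argument with the filesystem encoding.
    Python 2 passes argv entries as byte strings; Python 3 already
    gives decoded text, so those are returned unchanged."""
    if isinstance(s, bytes):
        return s.decode(sys.getfilesystemencoding())
    return s
```

It would be used the same way: `parser.add_argument("--u", type=as_unicode)`.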

Unicode error

Whenever I run the script, I get:

2016-03-31 12:57:41 [py.warnings] WARNING: /home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py:194: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if usr_info[0][1:-1] == response.meta["user_name"] and usr_info[1][1:-1] == answer_num
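A sketch of a fix: decode the CSV field before comparing, so both sides are text and Python 2 does not fall back to its silent "interpret as unequal" behavior (`same_user` is a hypothetical helper; the quote stripping stands in for the `[1:-1]` slicing in the spider):

```python
def same_user(csv_field, user_name, encoding="utf-8"):
    """Compare a quoted CSV field against a user name, decoding byte
    strings first so the comparison never mixes bytes and text."""
    if isinstance(csv_field, bytes):
        csv_field = csv_field.decode(encoding)
    return csv_field.strip('"') == user_name
```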

User file/avoid scraping the same user twice if not necessary

Currently, the scraper picks users semi-randomly and scrapes them, then picks some more, and so on. Although there are hundreds of thousands of users, doing it this way makes it possible that the same user will get scraped twice. This potentially wastes the scraper's time.

To avoid this problem, one can create a user file that keeps track of which users were scraped and number of questions answered and when they were scraped.

Creating the users.csv will be easy enough. Just a little change to save_as_csv. Then a change must be made to the scraper function (get_target I think?) so that it skips users that have not changed their data.
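A sketch of the skip check (Python 3 style; `should_skip` is a hypothetical name, and the column layout — name, question count, scrape date — is an assumption based on the datafile excerpts in this tracker):

```python
import csv

def should_skip(path, user_name, answer_count):
    """Return True when users.csv already records this user with the
    same number of answered questions, i.e. nothing new to scrape."""
    try:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if row and row[0] == user_name and row[1] == str(answer_count):
                    return True
    except (IOError, OSError):  # no users.csv yet: nothing was scraped
        pass
    return False
```

get_target could call this right after reading a candidate's question count and move on to the next candidate when it returns True.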

Scraper is possibly missing a few questions

The switch from using the question categories to using the list of all questions helped capture more questions. However, I scraped some users and the scraper may still be missing some questions.
E.g. user https://www.okcupid.com/profile/storming/questions has 682 questions in common with the scraper profile, but the datafile has 677 of them, so 5 are missing. This may be a bug in reading the last page of questions, or it may be that the user has hidden their answers to 5 questions.

I also scraped user https://www.okcupid.com/profile/casjgoon666/questions. The scraper gets 86 and the user has 86, so there is no error there.

Images saved are thumbnails

The current version saves images, but there are two problems:

The images saved are thumbnails (225x225), not the real size images.

The profile picture is saved twice. The second saving is an extra small thumbnail (82x82).

Parsing error

I found another parsing error.

2016-03-31 13:28:23 [scrapy] ERROR: Spider error processing <GET https://www.okcupid.com/profile/Lkngfrlov/questions> (referer: https://www.okcupid.com/profile/Lkngfrlov/personality)
Traceback (most recent call last):
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 303, in parse_questions
    target_info = self.get_question_res(response.meta['user_name'], response.meta['target_info'], response)
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 324, in get_question_res
    answer = question.xpath(".//p[contains(@class, 'target')]//span[1]/text()")[0].extract().strip()
IndexError: list index out of range
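This is the same failure mode as the earlier misc-table error: indexing an empty xpath result. A generic guard (hypothetical helper, operating on the extracted string list) could cover both places:

```python
def first_or_none(extracted, default=None):
    """Return the first extracted xpath result, stripped, or `default`
    when the selector matched nothing (e.g. a question whose answer
    span is absent)."""
    return extracted[0].strip() if extracted else default
```

In get_question_res, a None result would then let the loop skip that question instead of crashing the whole profile.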

Does not skip users with special characters in their names when it should

From looking at the datafiles, I can see that users with special characters in their names are not skipped even though the question number matches.

Here's the last couple of scraped users:

"Le_Phlâneur","331","2016-04-11"
"värmeflaska","1103","2016-04-11"
"Blumenstück","1436","2016-04-11"
"Le_Phlâneur","331","2016-04-11"
"BOØWY","744","2016-04-11"
"entzückt","136","2016-04-11"
"ErøsVon","238","2016-04-11"
"värmeflaska","1103","2016-04-11"
"entzückt","136","2016-04-11"
"AlmostNotKnowing","503","2016-04-11"
"entzückt","136","2016-04-11"
"Đoko85","917","2016-04-11"
"Anaïs_Wilde","430","2016-04-11"
"Le_Phlâneur","331","2016-04-11"
"JëStraße","331","2016-04-11"
"Đoko85","917","2016-04-11"
"BOØWY","744","2016-04-11"
"entzückt","136","2016-04-11"
"ErøsVon","238","2016-04-11"
"Le_Phlâneur","331","2016-04-11"
"Đoko85","917","2016-04-11"
"mldstriker9","691","2016-04-11"
"burningtc","921","2016-04-11"
"BOØWY","744","2016-04-11"
"ErøsVon","238","2016-04-11"
"Eisbärlicious","99","2016-04-11"
"Le_Phlâneur","331","2016-04-11"
"Blumenstück","1436","2016-04-11"

So, we see that the same users tend to be repeated and the question number matches, so they should be skipped. For instance, Le_Phlâneur has 331 questions and was scraped 5 times above. This wastes a considerable amount of space and time.

My guess is that it is a string comparison that fails due to incorrect Unicode handling.
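If that guess is right, decoding plus Unicode normalization should make the comparison robust (a sketch; `names_equal` is a hypothetical helper, and the NFC normalization is a guess at why accented names could mismatch even after decoding):

```python
import unicodedata

def names_equal(a, b, encoding="utf-8"):
    """Compare user names robustly: decode any byte strings read from
    the CSV and NFC-normalize both sides, so 'Le_Phlâneur' matches even
    if one copy uses a combining accent."""
    if isinstance(a, bytes):
        a = a.decode(encoding)
    if isinstance(b, bytes):
        b = b.decode(encoding)
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```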

Image order

I note that the primary picture (the one shown in the profile) is saved with the last number. It should be saved with the first (i.e. 1). This makes it easier to do analyses of the primary pictures.

E.g. https://www.okcupid.com/profile/Cloudy7_taco, her primary picture is saved as number 10.
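A sketch of the proposed ordering (`ordered_image_names` is a hypothetical helper; it assumes we know the primary picture's URL at save time):

```python
def ordered_image_names(image_urls, primary_url):
    """Return (index, url) pairs with the primary picture first, so it
    is always saved as number 1 and the rest keep their relative order."""
    rest = [u for u in image_urls if u != primary_url]
    return list(enumerate([primary_url] + rest, start=1))
```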

Scrape a specific user

For debugging purposes, it is useful if one can specify a specific user to scrape. I propose to add a command line argument (--u) to make this possible.

For instance, bugs like #5 are rare, so testing whether the code works for these cases may take some time.
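A minimal sketch of the proposed argument (argparse shown for illustration; the actual spider may take arguments differently, and `parse_cli` is a hypothetical name):

```python
import argparse

def parse_cli(argv):
    """Parse the scraper's command line; --u names one specific user to
    scrape instead of sampling at random, for debugging."""
    parser = argparse.ArgumentParser(prog="okcubot")
    parser.add_argument("--u", dest="user", default=None,
                        help="scrape only this user name")
    return parser.parse_args(argv)
```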

The location of the data output folder depends on the calling location

I noticed that the location of the data folder depends on where one calls the scraper from. If I call it from within the folder, it gets placed in OKCubot2/OKcu/data. However, if I call it from e.g. the OKCubot2 folder, the data gets saved in OKCubot2/data.

The data should always be saved in the same place by default. I propose that we save it in OKCubot2/data.

One could add a call parameter, so that it gets saved into a specified location. I will look into adding another calling parameter.
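A sketch of resolving the folder relative to the script file instead of the current working directory (`default_data_dir` and the `override` parameter are hypothetical):

```python
import os

def default_data_dir(script_file, override=None):
    """Resolve the data folder relative to the script's own location, so
    output lands in the same place regardless of where the scraper is
    called from; `override` models the proposed call parameter."""
    if override:
        return os.path.abspath(override)
    return os.path.join(os.path.dirname(os.path.abspath(script_file)), "data")
```

Inside the spider this would be called with `__file__`, pinning the output to the repository folder.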

Sampling is not random

From looking at the scraped data and watching the script while it's running, it is clear that the sampling is not random. On the first day, many users were scraped; on the second day, only a few, and lots of those were duplicates due to an error (#13).

Currently the script uses these two URLs to get users:

https://www.okcupid.com/match?filter1=0,16&filter2=2,18,99&filter3=1,1&locid=0&timekey=1&fromWhoOnline=0&mygender=&update_prefs=1&sort_type=0&sa=1&count=50
https://www.okcupid.com/match?filter1=0,32&filter2=2,18,99&filter3=1,1&locid=0&timekey=1&fromWhoOnline=0&mygender=&update_prefs=1&sort_type=0&sa=1&count=50

Aside from the gender, they look fairly random to me. But perhaps OKCupid uses the requester's IP to match locations despite using the "everywhere" setting. I can test this by scraping while setting my VPN to somewhere new each day.

Scraper is missing various profile data

Information that is currently not scraped:

  • height (d_height*)
  • the other languages than the first (d_languages)
  • astrological sign (d_astrology_sign)
  • astrological importance (d_astrology_seriosity)
  • religious type (d_religion_type)
  • religious importance (d_religion_seriosity)
  • eating habit (d_eat*)

For an example of the more than one language not being scraped, see user värmeflaska.

The * indicates that this field was not mentioned in the fields.csv file. However, height was explicitly mentioned in the instructions file. Eating habit (e.g. vegan) was somehow missed by me.

This information is important to scrape.

Parsing error (no profile text)

2016-04-11 08:47:46 [scrapy] ERROR: Spider error processing <GET ?cf=regular&subject_id=16625519237492337449&picid=13997124/questions> (referer: https://www.okcupid.com/profile/PictoUK?cf=regular&subject_id=16625519237492337449&picid=13997124/personality)
Traceback (most recent call last):
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/env/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/mint/Desktop/OKCubot2/OKCubot2/OKcu/spiders/okcubot_spider.py", line 311, in parse_questions
    page_enable = response.xpath("//li[contains(@class, 'next')]/@class")[0].extract()
IndexError: list index out of range

Looking at the profile (http://www.okcupid.com/profile/PictoUK), the error seems to be due to the user not having any profile text, so the headers fail to match (e.g. nothing to match for "My self summary").
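A sketch of a guard for the missing pager (`has_next_page` is a hypothetical helper taking the extracted class strings; the "disabled" class check is an assumption about the markup):

```python
def has_next_page(extracted_classes):
    """True when the 'next' pager element exists and is not disabled.
    Profiles with no profile text can lack the pager entirely, which
    previously raised IndexError on the bare [0] index."""
    if not extracted_classes:  # no pager element on the page at all
        return False
    return "disabled" not in extracted_classes[0]
```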

