
Comments (16)

tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Sounds good!

On Tue, Mar 29, 2016 at 5:22 AM, Emil Kirkegaard [email protected] wrote:

Currently, the scraper picks users at semi-random and scrapes them, then
picks some more, and so on. Although there are hundreds of thousands of users,
doing it this way makes it possible for the same user to be scraped twice,
which potentially wastes the scraper's time.

To avoid this problem, one can create a user file that keeps track of which
users were scraped, how many questions they had answered, and when they were
scraped. This should be simple enough, so I will try.



Deleetdk avatar Deleetdk commented on September 4, 2024

I have implemented saving the user info to users.csv in 0450111

It works on my end, but I see that some settings were changed incorrectly.
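
For illustration, a row in such a users.csv might look something like this (the column names and values are only an assumption about the layout, not taken from the actual file):

username,questions_answered,scraped_at
example_user,250,2016-03-29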


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

So you want me to remove my changes to .gitignore and settings.py?


Deleetdk avatar Deleetdk commented on September 4, 2024

Just use git pull before you use git commit. This updates your local version with the server's, so that there are no conflicts. Make sure that your .gitignore file is correct because otherwise you are uploading temporary files (those ending with ~) and data files (those in data/) to the repository. If you look at https://github.com/Deleetdk/OKCubot2/blob/master/.gitignore#L64 you will see that I have excluded these.
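
For reference, the relevant exclusions boil down to patterns roughly like these (an illustrative sketch, not a copy of the repository's actual .gitignore):

# editor backup files ending with ~
*~
# scraped data files
data/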


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I pushed a new version.

Looking forward to your response.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I pulled your original project,
so this issue should be solved.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I think this issue is already solved.


Deleetdk avatar Deleetdk commented on September 4, 2024

We need to test it. I tried testing it with the --u option. However, it still scrapes the user twice.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Twice?

It scrapes the user only once.


Deleetdk avatar Deleetdk commented on September 4, 2024

If I run the scraper twice with the same user in --u, the profile is scraped twice. It should be skipped the second time if the number of questions answered in the profile is the same as the number in users.csv. Make sure that this skipping feature can be disabled with a command-line argument, e.g. --noskip.
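
In Python terms, the requested behaviour is roughly the following sketch (assuming the saved counts from users.csv are available as a dict; none of these names are the actual okcubot2 code):

# Decide whether a profile can be skipped on a repeat run.
def should_skip(username, answered_on_profile, saved_counts, noskip=False):
    if noskip:
        # --noskip turns the skipping feature off entirely
        return False
    previous = saved_counts.get(username)
    # Skip only if the user was scraped before and the answered count is unchanged.
    return previous is not None and previous == answered_on_profile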


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Hi, Emil.

I pushed a new version.
This issue is solved in that version.


Deleetdk avatar Deleetdk commented on September 4, 2024

I tested this with user mama_crossasaur. This user has not answered any hidden questions. The code correctly skips her. I also tried the --noskip argument; that worked as well.

python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur
python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur --noskip

There is a problem with users who have answered some questions privately. The scraper cannot scrape those answers, so they are not included in the saved count, while the comparison uses the number shown in the profile; the two figures never match, so such users are re-scraped every time. For example, if a profile shows 300 answered questions and 20 of them are private, only 280 are scraped and saved, and 280 never equals 300.

The solution is to use the number shown in the profile on both sides of the comparison. I will make this change myself.


Deleetdk avatar Deleetdk commented on September 4, 2024

Can you add the number of questions answered to target_info? Call it m_numberanswered. That way I can get the information I need in the save_as_csv function.
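
As a sketch, the request amounts to something like this (target_info, m_numberanswered and save_as_csv are the names from the discussion; the dict literal and example value below are placeholders, not the actual scraper code):

# While scraping a profile, record the answered count shown on the profile page.
target_info = {}
target_info['m_numberanswered'] = 250  # hypothetical value

# Inside save_as_csv, the value can then be read back, e.g.:
number_answered = target_info.get('m_numberanswered', 0)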


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I sent you a new version.


Deleetdk avatar Deleetdk commented on September 4, 2024

Good. I am currently doing a large test of the scraper (scraping 1000 users). I am looking to see if there are more bugs that we just haven't seen yet.

I will try your new version after I'm done with that.


Deleetdk avatar Deleetdk commented on September 4, 2024

Fixed in 4b9a67a
