fs-crawler's People

Contributors

rappdw

fs-crawler's Issues

HTTP status 301 & 410 encountered with some relationships

This is likely expected, since the graph is being modified while the crawl runs. Still, we should record each instance in the DB so that it can be examined after the crawl, rather than only logging it to stderr.
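A minimal sketch of what recording these responses could look like, assuming a SQLite database. The table name `http_anomaly`, its columns, and the helper functions are hypothetical and are not taken from fs-crawler's actual schema or code.

```python
import sqlite3


def init_anomaly_table(conn):
    """Create the (hypothetical) table that stores unexpected HTTP statuses."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS http_anomaly (
            relationship_id TEXT NOT NULL,
            status INTEGER NOT NULL,
            iteration INTEGER NOT NULL,
            seen_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def record_anomaly(conn, relationship_id, status, iteration):
    """Persist one unexpected response instead of only writing it to stderr."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO http_anomaly (relationship_id, status, iteration) "
            "VALUES (?, ?, ?)",
            (relationship_id, status, iteration),
        )


def anomalies_by_status(conn):
    """Return {status: count} so the instances can be reviewed after the crawl."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM http_anomaly GROUP BY status"
    )
    return dict(rows)
```

After the crawl, a call such as `anomalies_by_status(conn)` would give a quick count of 301 and 410 responses, and the individual rows remain available for closer inspection.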

Inconsistency in counts between crawl-fs and rbgcf

Logs from crawl-fs run:

Login to FamilySearch...
Downloading hop: 0... (3 individuals in hop)
Downloading hop: 1... (27 individuals in hop)
Downloading hop: 2... (164 individuals in hop)
Downloading hop: 3... (804 individuals in hop)
Downloading hop: 4... (3499 individuals in hop)
Downloading hop: 5... (10718 individuals in hop)
Downloading hop: 6... (34646 individuals in hop)

Logs from rbgcf run (4 hops):

2020-02-26 16:31:11,722 Reading graph input files
2020-02-26 16:31:11,937 4086 vertices in graph. 81 vetices were removed from the graph as they had no edges.

I would expect rbgcf to report 3 + 27 + 164 + 804 + 3499 = 4497 vertices (hops 0 through 4) rather than 4086 + 81 = 4167.
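The size of the discrepancy can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes, as the sum above does, that the per-hop counts are distinct individuals and that a 4-hop rbgcf run should cover hops 0 through 4. The function name is made up for illustration.

```python
# Per-hop counts copied from the crawl-fs log above (hop 0 through hop 6).
hop_sizes = [3, 27, 164, 804, 3499, 10718, 34646]


def cumulative_individuals(sizes, max_hop):
    """Total individuals downloaded in hops 0..max_hop inclusive."""
    return sum(sizes[: max_hop + 1])


expected = cumulative_individuals(hop_sizes, 4)  # what crawl-fs downloaded
reported = 4086 + 81  # vertices kept plus vertices removed, per the rbgcf log
gap = expected - reported

print(f"expected={expected} reported={reported} gap={gap}")
```

Under those assumptions, 330 individuals are unaccounted for between the two tools.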

When running a subsequent crawl, the histogram of invalid relationships by iteration is incorrect

See the following log:

2022-02-01 16:08:00,235 712 invalid relationships remain after resolution: 
Iteration 1: 1
Iteration 2: 6
Iteration 3: 22
Iteration 4: 87
Iteration 5: 349
Iteration 6: 247
2022-02-01 16:08:00,236 Crawl complete.
root@rbg-collect:~# crawl-fs -b all -i ... -o /data/rbg/feb-01-2022/ -u rappdw -h 1 --save-living
2022-02-01 16:08:10,706 Login to FamilySearch...
2022-02-01 16:08:15,624 Loaded graph for restart: 91,341 vertices, 231,695 frontier, 137,909 edges, 229,493 spanning edges, 124396 frontier edges. Running iterations 7 through 8.
2022-02-01 16:08:15,696 Starting iteration: 7... (231,695 individuals to process)
2022-02-01 16:14:08,585         Finished iteration: 7. Duration: 351.79 s. Graph stats: 323,036 vertices, 735,050 frontier, 501,006 edges, 740,906 spanning edges, 387865 frontier edges
2022-02-01 16:14:21,389 Downloaded 323,036 vertices, 735,050 frontier, 501,006 edges, 740,906 spanning edges, 387865 frontier edges, duration: 371 seconds, HTTP Requests: 1,160.
2022-02-01 16:14:33,334 Resolving 73834 relationships.
2022-02-01 16:55:29,799 Moved 1269 relationships to 'auxiliary'.
2022-02-01 16:55:46,019 69397 invalid relationships remain after resolution: 
Iteration 0: 5
Iteration 1: 46
Iteration 2: 224
Iteration 3: 1183
Iteration 4: 4920
Iteration 5: 17483
Iteration 6: 44707
Iteration 7: 829
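As a sanity check on the two histograms above, the sketch below parses the `Iteration N: count` lines and compares their sum with the total in the preceding "invalid relationships remain" line. The parsing helper is hypothetical and is not part of crawl-fs. It shows that each histogram adds up to its reported total, so the discrepancy lies in how counts are attributed to iterations rather than in the totals.

```python
import re

# Matches histogram lines such as "Iteration 3: 22".
ITERATION_LINE = re.compile(r"^Iteration (\d+): (\d+)$")


def parse_histogram(text):
    """Return {iteration: count} for every histogram line found in text."""
    histogram = {}
    for line in text.splitlines():
        match = ITERATION_LINE.match(line.strip())
        if match:
            histogram[int(match.group(1))] = int(match.group(2))
    return histogram


first_run = parse_histogram("""
Iteration 1: 1
Iteration 2: 6
Iteration 3: 22
Iteration 4: 87
Iteration 5: 349
Iteration 6: 247
""")

second_run = parse_histogram("""
Iteration 0: 5
Iteration 1: 46
Iteration 2: 224
Iteration 3: 1183
Iteration 4: 4920
Iteration 5: 17483
Iteration 6: 44707
Iteration 7: 829
""")

print(sum(first_run.values()), sum(second_run.values()))
```

The restarted run also reports an entry for iteration 0 that the first run lacks, and only 829 entries for iteration 7, the one iteration it actually executed.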
