d1_synchronization's People

Contributors

csjx, datadavev, leinfelder, mbjones, nahf, taojing2002

Forkers

csjx

d1_synchronization's Issues

Sync on production can crash ungracefully

We noticed an out-of-sync state between the production CN and urn:node:ARCTIC the other day and found that the CN thought it was completely in sync when it wasn't. In this particular case, the CN had failed to pick up tens of System Metadata updates from urn:node:ARCTIC that we were expecting to see, and it may have missed many more. I messaged @taojing2002 for help and we found that sync had crashed after running out of memory (OOM). Our fix was to set the last harvest timestamp back a day and allow processing to run. My immediate thoughts are:

  • Sync shouldn't go OOM and crash
  • If sync does crash, it shouldn't update the last sync (last harvest?) timestamp, because this causes an out-of-sync state that's very hard to detect
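
To make the second point concrete, here is a minimal sketch of a harvest loop that only moves the timestamp forward past items it has actually processed. It is not the d1_synchronization implementation: `harvest_once`, `list_changes`, and `process` are hypothetical names, and the sketch assumes that `list_changes(since)` returns every change with a modification time at or after `since` and that reprocessing an item is harmless.

```python
from datetime import datetime
from typing import Callable, Iterable, Tuple


def harvest_once(
    last_harvest: datetime,
    list_changes: Callable[[datetime], Iterable[Tuple[str, datetime]]],
    process: Callable[[str], None],
) -> datetime:
    """Process changes since last_harvest and return the new checkpoint.

    Hypothetical helper: the checkpoint only advances past items that were
    processed successfully, so a failure part-way through leaves it at the
    last good item instead of skipping the rest.
    """
    checkpoint = last_harvest
    # Oldest first, so the checkpoint never jumps over unprocessed work.
    for pid, modified in sorted(list_changes(last_harvest), key=lambda c: c[1]):
        try:
            process(pid)
        except Exception:
            # Stop here and keep the checkpoint where it is; the next run
            # will pick up this item again (assumes inclusive listing).
            break
        checkpoint = max(checkpoint, modified)
    return checkpoint
```

The caller would persist the returned value only after `harvest_once` returns. If the whole process dies (as with a JVM OOM), nothing is written and the next run restarts from the previous timestamp.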

We talked about possible next steps on our dev call this week and came up with:

  1. Bump max heap (Xmx) on the process. This might not be possible due to limited resources on cn-ucsb-1.
  2. Move sync (and processing?) over to another host with more resources
  3. We might consider making MNs responsible for auditing (Note: Bryce thinks this is not quite the route to go but it's an idea that came up nonetheless)
  4. In the meantime, before a fix, we could consider auditing sync on some of our more active member nodes (ARCTIC, ESS-DIVE, RW)
  5. Set up monitoring on our logs to detect crashes like this
  6. Work on figuring out the bugs at the top of this post
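
The audit idea in item 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: `audit_sync` is a hypothetical function, and it assumes we have already collected, for the MN and for the CN, a mapping from identifier to the System Metadata modification time each side holds. How those mappings are gathered is left out.

```python
from datetime import datetime
from typing import Dict, List


def audit_sync(
    mn_records: Dict[str, datetime],
    cn_records: Dict[str, datetime],
) -> Dict[str, List[str]]:
    """Compare MN and CN views of System Metadata (hypothetical audit).

    Returns identifiers the CN has never seen ("missing") and identifiers
    whose CN copy is older than the MN copy ("stale").
    """
    missing = sorted(pid for pid in mn_records if pid not in cn_records)
    stale = sorted(
        pid
        for pid, mn_modified in mn_records.items()
        if pid in cn_records and cn_records[pid] < mn_modified
    )
    return {"missing": missing, "stale": stale}
```

A non-empty "stale" list while the CN reports itself as in sync would correspond to the situation described above, where System Metadata updates from urn:node:ARCTIC never reached the CN.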

For now, @taojing2002 is going to look into this and coordinate with @datadavev and we can go from there.

[Note: This might be on the wrong repo, since I can't see our logs on cn-ucsb-1 to see what actually crashed. Feel free to move.]

Some objects from the RW member node weren't harvested

The operator of the RW member node reported that some pids hadn't been harvested even though they were uploaded a couple of days ago. I grepped the harvest log file for them and didn't find any information. So we need to monitor the situation to see if this happens again.
