dotnet-reliability's Introduction

the dumpling service

http://aka.ms/dumpling

dotnet-reliability

Contains the tooling and infrastructure used for .NET stress testing and reliability investigation. More specifically, this includes tools for (but not limited to):

  • authoring component specific stress and load tests
  • generating stress tests from existing framework and runtime tests
  • unified dump collection, storage, and bucketing across all .NET supported platforms
  • investigating and diagnosing reliability failures

dotnet-reliability's People

Contributors

adityamandaleeka, brianrob, cshung, davmason, jkotas, leculver, msftgits, schaabs, weshaggard


dotnet-reliability's Issues

analysis.py incorrectly parses stack frame with only module + offset

When a stack frame has only a module and an offset, with no method name, analysis.py includes the offset in the module name. This causes frames such as

libstdc++.so.6 + -1

to be displayed as:

libstdc++.so.6 + -1!UNKNOWN

and this also causes bucketing problems for module-based rules.
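A minimal sketch of the intended parsing, assuming a hypothetical `parse_frame` helper (analysis.py's actual structure may differ): strip a trailing `+ offset` from the module before pairing it with the `UNKNOWN` method placeholder.

```python
import re

def parse_frame(frame):
    """Split a raw stack frame into (module, method).

    Hypothetical sketch: when a frame is only 'module + offset'
    with no '!' separator, strip the offset from the module name
    instead of letting it leak into the bucket key.
    """
    if '!' in frame:
        module, method = frame.split('!', 1)
    else:
        module, method = frame, 'UNKNOWN'
    # drop a trailing ' + <offset>' (e.g. ' + -1') from the module
    module = re.sub(r'\s*\+\s*-?\d+$', '', module).strip()
    return module, method
```

With this, `libstdc++.so.6 + -1` yields the module `libstdc++.so.6` rather than `libstdc++.so.6 + -1`.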

OOM inducing tests need to be filtered from the stress mix candidates

We are including a number of tests in our stress mixes which induce OOM situations, either because of large allocations or because of allocations rooted by static fields. We need to investigate which tests in our stress mix candidates are at fault for this and filter them from our stress mixes. The following buckets illustrate failures of this nature and can be used as a starting point to identify tests which need to be filtered from the mix.

SIGABRT_System.OutOfMemoryException_libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler
SIGABRT_System.OutOfMemoryException_libcoreclr.so + -1!UNKNOWN
SIGABRT_System.OutOfMemoryException_libcoreclr.so!JIT_New

error/log/verbose script

todo fill out

Filled it out but forgot to save it :| It is sitting on my work machine. I'll refresh it Monday.

Report the Originating OS

At the moment the state table has a place for the originating OS; however, we need to add this to the SQL back-end AND the web view.

Issues with loading libsos on analysis machines

There seems to be a disparity between the experience of using libsos interactively versus through the scripting front-end, and it is complicating the analysis process. This issue is to track the repairs and fixes for this.

MISBUCKET: SIGABRT_libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler

The entire module libstdc++.so.6 should be ignored. It is in a common managed code failure path, plus we can assume that the C++ standard library is clean for our stress bucketing.

libc.so.6!gsignal
libc.so.6!abort
libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler()
libstdc++.so.6!???
libstdc++.so.6!std::terminate()
libstdc++.so.6!__cxa_allocate_exception
libcoreclr.so!RtlpRaiseException(_EXCEPTION_RECORD*)
libcoreclr.so!RaiseException
libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int)
libcoreclr.so!IL_Throw(Object*)
...
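A hedged sketch of how module-level ignoring could work during bucketing (the `IGNORED_MODULES` set and `bucketing_frame` helper are illustrative, not the service's actual code): pick the first frame whose module is not on the ignore list, so common failure-path frames never name a bucket.

```python
# Hypothetical ignore list; real rules live in the analysis service.
IGNORED_MODULES = {'libc.so.6', 'libstdc++.so.6', 'libpthread.so.0'}

def bucketing_frame(frames):
    """Return the first frame whose module is not ignored.

    Falls back to the top frame if every module is ignored,
    and to None for an empty stack.
    """
    for frame in frames:
        module = frame.split('!', 1)[0]
        if module not in IGNORED_MODULES:
            return frame
    return frames[0] if frames else None
```

Applied to the stack above, the bucket would be named by the first libcoreclr.so frame instead of `libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler`.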

Linux machines defaulting to windows

        if args.distro == None:
            if platform.platform().lower() == 'linux':
                args.distro = platform.dist()[0].lower()
            else:
                args.distro = 'win'

The result of platform.platform().lower() is not 'linux'.
@aditya suggested that platform.system() may be the right call.
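A sketch of the suggested fix, assuming the `default_distro` wrapper and `system` parameter are hypothetical (they exist here only to make the logic testable off-box): `platform.system()` returns exactly `'Linux'` on Linux, whereas `platform.platform()` returns a long string such as `'Linux-4.4.0-...-x86_64'`, so the original equality test against `'linux'` never matched and Linux machines fell through to `'win'`.

```python
import platform

def default_distro(distro=None, system=None):
    """Pick a default distro name when none was supplied on the CLI."""
    if distro is not None:
        return distro
    if (system or platform.system()).lower() == 'linux':
        # The original code called platform.dist(); that API was
        # removed in Python 3.8, so a real fix would read
        # /etc/os-release instead. A generic value stands in here.
        return 'linux'
    return 'win'
```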

Use symbols.exe to manage dump contents

At the moment we're saving uploaded dump files to our own blob storage. That's okay for a user base of two, and to connect the dots, but there are going to be issues with this in the long run.

Namely,

  • Non-unique file names just stomp on each other.
  • Duplicate files, even with different names, waste space. Symbols.exe will calculate a hash and store files using that, as well as managing ref counts.
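The hash-based storage scheme can be sketched as content addressing: name each blob by a digest of its bytes, so identical dumps dedupe automatically and distinct dumps with the same file name no longer collide. This is an illustration of the idea, not symbols.exe's actual hashing scheme, and `content_key` is a hypothetical helper.

```python
import hashlib

def content_key(path):
    """Return a content-addressed blob name for a dump file.

    Streams the file in 1 MiB chunks so large core dumps
    never have to fit in memory at once.
    """
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()
```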

WEB UI Changes

Small changes:

  • Add bucket count
  • order buckets by hits, descending
  • order dumps by dumpling id, descending
  • Improve dump properties readability
  • Use display names
  • Fix line wrappings in call stack
  • Add a table cell top border above hits
  • Use font 'Numans' for hit count
  • Allow ReAnalyzing of buckets or individual dumps
  • Last Analyzed Timestamp
  • Place behind Azure Active Directory
  • Use PST timestamps, instead of UTC.

Add contextual information to dumps

This is a feature request

On the CoreCLR repo, we're using dumpling to upload dumps when our Jenkins CI runs encounter segfaults or other crashes. This is great, but sometimes these jobs are running against a PR which has obvious bugs (for example, because it's a work in progress).

Today, this means we need to download and extract the dump, inspect it to understand what's going on, check the directory structure to guess what type of job it came from (e.g. "checked_ubuntu_tst_prtest"), go to Jenkins and look at the instances of that job that failed to find the one that generated this dump, and only then realize that the job was kicked off by a faulty PR.

It would be nice if the information about the job and/or PR were exposed in a more direct way somehow so the time spent looking into these dumps is minimized. Depending on how it's implemented, this may involve work on the CI side and also in dumpling itself.

cc: @bryanAR

MISBUCKET: SIGABRT_libcoreclr.so!CallDescrWorkerInternal: bucketing fails due to SOS failure to load on triage service machines

The bucket SIGABRT_libcoreclr.so!CallDescrWorkerInternal is a misbucketed set of failures, due to the fact that SOS is failing to load, so no managed frame data is available.

libc.so.6!gsignal
libc.so.6!abort
libcoreclr.so!PROCEndProcess(void*, unsigned int, int)
libcoreclr.so!UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*)
libcoreclr.so!DispatchManagedException(PAL_SEHException&)
libcoreclr.so!IL_Rethrow()
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
libcoreclr.so!CallDescrWorkerInternal
libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*)
libcoreclr.so!AppDomainTimerCallback_Worker(void*)
libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)
libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*)
libcoreclr.so!AppDomainTimerCallback(void*, unsigned char)
libcoreclr.so!ThreadpoolMgr::AsyncTimerCallbackCompletion(void*)
libcoreclr.so!UnManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*)
libcoreclr.so!ThreadpoolMgr::WorkerThreadStart(void*)
libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*)
libpthread.so.0!start_thread
libc.so.6!clone

Also, the following frames should be added to the ignore list for .NET Core on Linux:

libcoreclr.so!CallDescrWorker
libcoreclr.so!MethodDescCallSite::CallTargetWorker

dumpling cli planning

At the moment we have two downzip functions. Ideally we should unify them and agree on what downzip does. Currently, corezip is the Python function that zips up the contents on the lab machine and then uploads them to blob storage, and downzip does the downloading and unzipping of the contents.

@schaabs, does corezip take care of uploading? Perhaps downzip could just be split into two: one for the downloading piece, and one for the unzipping + locating core files logic?

In other words, convert downzip into:
def DownloadZip(...)
def CoreUnzip(...)

Updates
We riffed on this a bit.

In addition to the two APIs listed above, we'd also have:

def UploadZip(...)
def CoreZip(...)

We'd place them into dumpling.py, which lays a foundation for the dumpling CLI.
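The planned split can be sketched as a dumpling.py skeleton. Only the four names and the upload/download + zip/unzip division come from the discussion above; the signatures and docstrings here are assumptions.

```python
# dumpling.py -- hypothetical skeleton of the four planned entry points.

def CoreZip(core_dir):
    """Zip up the core dump and its contents on the lab machine."""
    raise NotImplementedError

def UploadZip(zip_path, blob_container):
    """Upload a produced zip to blob storage."""
    raise NotImplementedError

def DownloadZip(dumpling_id, dest_dir):
    """Download the zip for a given dumpling id into dest_dir."""
    raise NotImplementedError

def CoreUnzip(zip_path, dest_dir):
    """Unzip an archive and locate the core file(s) inside it."""
    raise NotImplementedError
```

Keeping upload/download separate from zip/unzip lets the CLI compose them (e.g. download without extracting) instead of baking both steps into one function.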
