dotnet-reliability's Introduction

the dumpling service

http://aka.ms/dumpling

dotnet-reliability

Contains the tooling and infrastructure used for .NET stress testing and reliability investigation. More specifically, this includes tools for (but not limited to):

  • authoring component specific stress and load tests
  • generating stress tests from existing framework and runtime tests
  • unified dump collection, storage, and bucketing across all .NET supported platforms
  • investigating and diagnosing reliability failures

dotnet-reliability's People

Contributors

adityamandaleeka, brianrob, cshung, davmason, jkotas, leculver, msftgits, schaabs, weshaggard


dotnet-reliability's Issues

analysis.py incorrectly parses stack frame with only module + offset

When a stack frame has only a module and an offset, with no method name, analysis.py includes the offset in the module name. This causes frames such as

libstdc++.so.6 + -1

to be displayed as:

libstdc++.so.6 + -1!UNKNOWN

and this also causes bucketing problems for module-based rules.
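A minimal sketch of the intended parsing, assuming a hypothetical `parse_frame` helper (analysis.py's actual structure may differ): strip a trailing `+ offset` from the module before pairing it with the `UNKNOWN` method placeholder.

```python
import re

def parse_frame(frame):
    """Split a raw stack frame into (module, method).

    Hypothetical sketch: when a frame is only 'module + offset'
    with no '!' separator, strip the offset from the module name
    instead of letting it leak into the bucket key.
    """
    if '!' in frame:
        module, method = frame.split('!', 1)
    else:
        module, method = frame, 'UNKNOWN'
    # drop a trailing ' + <offset>' (e.g. ' + -1') from the module
    module = re.sub(r'\s*\+\s*-?\d+$', '', module).strip()
    return module, method
```

With this, `libstdc++.so.6 + -1` yields the module `libstdc++.so.6` rather than `libstdc++.so.6 + -1`.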

OOM inducing tests need to be filtered from the stress mix candidates

We are including a number of tests in our stress mixes which induce OOM situations, either because of large allocations or because of allocations rooted by static fields. We need to investigate which tests in our stress mix candidates are at fault for this and filter them from our stress mixes. The following buckets illustrate failures of this nature and can be used as a starting point to identify tests which need to be filtered from the mix.

SIGABRT_System.OutOfMemoryException_libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler
SIGABRT_System.OutOfMemoryException_libcoreclr.so + -1!UNKNOWN
SIGABRT_System.OutOfMemoryException_libcoreclr.so!JIT_New

error/log/verbose script

todo fill out

Filled it out but forgot to save it :| It is sitting on my work machine. I'll refresh it Monday.

Report the Originating OS

At the moment the state table has a place for the originating OS; however, we need to add this to the SQL back-end AND the web view.

Issues with loading libsos on analysis machines

There seems to be a disparity between the experience of using libsos interactively versus through the scripting front-end, and it is complicating the analysis process. This issue is to track the repairs and fixes for this.

MISBUCKET: SIGABRT_libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler

The entire module libstdc++.so.6 should be ignored. It is in a common managed code failure path, plus we can assume that the C++ standard library is clean for our stress bucketing.

libc.so.6!gsignal
libc.so.6!abort
libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler()
libstdc++.so.6!???
libstdc++.so.6!std::terminate()
libstdc++.so.6!__cxa_allocate_exception
libcoreclr.so!RtlpRaiseException(_EXCEPTION_RECORD*)
libcoreclr.so!RaiseException
libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int)
libcoreclr.so!IL_Throw(Object*)
...
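A hedged sketch of how module-level ignoring could work during bucketing (the `IGNORED_MODULES` set and `bucketing_frame` helper are illustrative, not the service's actual code): pick the first frame whose module is not on the ignore list, so common failure-path frames never name a bucket.

```python
# Hypothetical ignore list; real rules live in the analysis service.
IGNORED_MODULES = {'libc.so.6', 'libstdc++.so.6', 'libpthread.so.0'}

def bucketing_frame(frames):
    """Return the first frame whose module is not ignored.

    Falls back to the top frame if every module is ignored,
    and to None for an empty stack.
    """
    for frame in frames:
        module = frame.split('!', 1)[0]
        if module not in IGNORED_MODULES:
            return frame
    return frames[0] if frames else None
```

Applied to the stack above, the bucket would be named by the first libcoreclr.so frame instead of `libstdc++.so.6!__gnu_cxx::__verbose_terminate_handler`.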

Linux machines defaulting to windows

        if args.distro == None:
            if platform.platform().lower() == 'linux':
                args.distro = platform.dist()[0].lower()
            else:
                args.distro = 'win'

The result of platform.platform().lower() is not 'linux'.
@aditya suggested that platform.system() may be the right call.
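A sketch of the suggested fix, assuming the `default_distro` wrapper and `system` parameter are hypothetical (they exist here only to make the logic testable off-box): `platform.system()` returns exactly `'Linux'` on Linux, whereas `platform.platform()` returns a long string such as `'Linux-4.4.0-...-x86_64'`, so the original equality test against `'linux'` never matched and Linux machines fell through to `'win'`.

```python
import platform

def default_distro(distro=None, system=None):
    """Pick a default distro name when none was supplied on the CLI."""
    if distro is not None:
        return distro
    if (system or platform.system()).lower() == 'linux':
        # The original code called platform.dist(); that API was
        # removed in Python 3.8, so a real fix would read
        # /etc/os-release instead. A generic value stands in here.
        return 'linux'
    return 'win'
```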

Use symbols.exe to manage dump contents

At the moment we're saving uploaded dump files to our own blob storage. That's okay for a user base of two, and to connect the dots, but there are going to be issues with this in the long run.

Namely,

  • Non-unique file names just stomp on each other.
  • Duplicate files, even with different names, waste space. Symbols.exe will calculate a hash and store files using that, as well as managing ref counts.
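The hash-based storage scheme can be sketched as content addressing: name each blob by a digest of its bytes, so identical dumps dedupe automatically and distinct dumps with the same file name no longer collide. This is an illustration of the idea, not symbols.exe's actual hashing scheme, and `content_key` is a hypothetical helper.

```python
import hashlib

def content_key(path):
    """Return a content-addressed blob name for a dump file.

    Streams the file in 1 MiB chunks so large core dumps
    never have to fit in memory at once.
    """
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()
```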

WEB UI Changes

Small changes:

  • Add bucket count
  • order buckets by hits, descending
  • order dumps by dumpling id, descending
  • Improve dump properties readability
  • Use display names
  • Fix line wrappings in call stack
  • Add a table cell top border above hits
  • Use font 'Numans' for hit count
  • Allow ReAnalyzing of buckets or individual dumps
  • Last Analyzed Timestamp
  • Place behind Azure Active Directory
  • Use PST timestamps, instead of UTC.

Add contextual information to dumps

This is a feature request

On the CoreCLR repo, we're using dumpling to upload dumps when our Jenkins CI runs encounter segfaults or other crashes. This is great, but sometimes these jobs are running against a PR which has obvious bugs (for example, because it's a work in progress).

Today, this means we need to download and extract the dump, inspect it to understand what's going on, check the directory structure to guess what type of job it came from (e.g. "checked_ubuntu_tst_prtest"), go to Jenkins and look at the instances of that job that failed to find the one that generated this dump, and only then realize that the job was kicked off by a faulty PR.

It would be nice if the information about the job and/or PR were exposed in a more direct way somehow so the time spent looking into these dumps is minimized. Depending on how it's implemented, this may involve work on the CI side and also in dumpling itself.

cc: @bryanAR

MISBUCKET: SIGABRT_libcoreclr.so!CallDescrWorkerInternal: bucketing fails due to SOS failure to load on triage service machines

The bucket SIGABRT_libcoreclr.so!CallDescrWorkerInternal is a misbucketed set of failures, due to the fact that SOS is failing to load, so no managed frame data is available.

libc.so.6!gsignal
libc.so.6!abort
libcoreclr.so!PROCEndProcess(void*, unsigned int, int)
libcoreclr.so!UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*)
libcoreclr.so!DispatchManagedException(PAL_SEHException&)
libcoreclr.so!IL_Rethrow()
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
--> UNKNOWN!UNKNOWN
libcoreclr.so!CallDescrWorkerInternal
libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*)
libcoreclr.so!AppDomainTimerCallback_Worker(void*)
libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)
libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*)
libcoreclr.so!AppDomainTimerCallback(void*, unsigned char)
libcoreclr.so!ThreadpoolMgr::AsyncTimerCallbackCompletion(void*)
libcoreclr.so!UnManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*)
libcoreclr.so!ThreadpoolMgr::WorkerThreadStart(void*)
libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*)
libpthread.so.0!start_thread
libc.so.6!clone

Also, the following frames should be added to the ignore list for .NET Core on Linux:

libcoreclr.so!CallDescrWorker
libcoreclr.so!MethodDescCallSite::CallTargetWorker

dumpling cli planning

At the moment we have two downzip functions. Ideally we should unify them and agree on what downzip does. Currently, corezip is the Python function that zips up the contents on the lab machine and then uploads them to blob storage, and downzip does the downloading and unzipping of the contents.

@schaabs, does corezip take care of uploading? Perhaps downzip could just be split into two: one for the downloading piece, and one for the unzipping + locating core files logic?

In other words, convert downzip into:
def DownloadZip(...)
def CoreUnzip(...)

Updates
We riffed on this a bit.

In addition to the two APIs listed above, we'd also have:

def UploadZip(...)
def CoreZip(...)

We'd place them into dumpling.py, which lays a foundation for the dumpling CLI.
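The planned split can be sketched as a dumpling.py skeleton. Only the four names and the upload/download + zip/unzip division come from the discussion above; the signatures and docstrings here are assumptions.

```python
# dumpling.py -- hypothetical skeleton of the four planned entry points.

def CoreZip(core_dir):
    """Zip up the core dump and its contents on the lab machine."""
    raise NotImplementedError

def UploadZip(zip_path, blob_container):
    """Upload a produced zip to blob storage."""
    raise NotImplementedError

def DownloadZip(dumpling_id, dest_dir):
    """Download the zip for a given dumpling id into dest_dir."""
    raise NotImplementedError

def CoreUnzip(zip_path, dest_dir):
    """Unzip an archive and locate the core file(s) inside it."""
    raise NotImplementedError
```

Keeping upload/download separate from zip/unzip lets the CLI compose them (e.g. download without extracting) instead of baking both steps into one function.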
