bachelor_thesis

Machine learning techniques for glitch detection in Planck/HFI data

Results

See here (file results.pdf).

Description and roadmap

See here (file thesis.pdf).

Some information about the equations and the procedures followed is included in the notebooks' comments.

  • DATA CLEANING, folder cleaning: clean data from various effects.

    Since the purpose of this thesis is to detect glitches, not to separate the galactic signal and other signals from the RAW signal, all points that lie on the galactic plane or coincide with a point source can be ignored without consequences.

    The effects to be cleaned up are:

    • Galactic dipole using the theoretical equation.

    • Galactic plane signal and point sources using a mask extracted from the flags in SCI data.

    The steps to be performed are:

    • Mask preview. Since the SCI data follow the order of the satellite's data collection, the total mask cannot be previewed starting from those data. However, the Planck Legacy Archive (PLA) provides the masks that were used, called COM_Mask_PCCS-143-zoneMask_2048_R2.01 and HFI_Mask_PointSrc_2048_R2.00: with these masks, which are in HEALPix format, it's possible to get a global view of the total mask.

    • Clean data by removing two effects:

      • The galactic dipole, using the theoretical equation reported here (section 3.1, point 1).

      • The galactic plane signal and point sources, using the flags in SCI data. The SCI data, taken from the Planck Legacy Archive (PLA), are the so-called scientific data (already cleaned and calibrated), and each sample carries a flag that marks a peculiarity, e.g. a point object, a planet or the galactic plane. The flags of interest are those for the galactic plane and for point sources:

        bit 4: StrongSignal; 1 = In Galactic plane
        bit 5: StrongSource; 1 = On point source
        

        Data with these flags must be discarded.

      Cleaned data are saved in HDF5 format: it's fast, lightweight and can store attributes such as the title and the version of the code used.
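
As a rough illustration of the flag-based cut described above, the sketch below builds a boolean mask with NumPy and keeps only the samples whose flag has neither bit 4 nor bit 5 set. It assumes bits are numbered from 0 starting at the least significant bit, which this README does not state. The function name, array names and toy values are hypothetical; this is not the code in the cleaning folder.

```python
import numpy as np

# Assumption: bits are numbered from 0, least significant bit first.
STRONG_SIGNAL = 1 << 4  # bit 4: sample lies in the Galactic plane
STRONG_SOURCE = 1 << 5  # bit 5: sample lies on a point source
DISCARD = STRONG_SIGNAL | STRONG_SOURCE


def keep_mask(flags: np.ndarray) -> np.ndarray:
    """Return True for samples with neither bit 4 nor bit 5 set."""
    return (flags & DISCARD) == 0


# Hypothetical toy data: five samples and their flag words.
signal = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
flags = np.array([0, 16, 32, 48, 1], dtype=np.uint8)

mask = keep_mask(flags)
cleaned = signal[mask]  # only the first and last samples survive
print(mask)
print(cleaned)
```

Other bits (here the value 1 in the last flag word) are left alone, so samples flagged for unrelated reasons are not discarded by this cut.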

  • DATA CLASSIFICATION, folder classification: classify data for the machine learning algorithm training.

    • Create code; features:

      • Load and save the status in a TOML-formatted file, so you don't have to classify all the data in one session.

      • Save beautiful examples.

      • Reset everything.

      Like the cleaned data, classified data are saved in HDF5 format, together with attributes such as the OD and detector, the date of classification and the git commit of the script.

    • Classify data; number of samples to classify: 2000 (1000 with a glitch, 1000 without).

  • BUILD MACHINE LEARNING MODELS, folder ml_models: train and test various machine learning algorithms.

    The PCA dimensionality reduction technique is used to see, in an intuitive way, whether the data cluster in well-delimited groups or mix. In the graphs, for both normal and sorted data, glitches (both single and multi) and non-glitches cluster in different, well-defined areas, while single glitches and multi-glitches overlap. This means that a machine learning model can distinguish well between glitches (both single and multi) and non-glitches, but is unlikely to distinguish single glitches from multi-glitches. It is therefore possible to avoid multiclass classifiers and focus on binary classifiers only. This was also tested with the SVC model, which confirmed the deduction. So, apart from the SVC model, none of the algorithms makes the no-multi-glitch (nmg) / multi-glitch (mg) distinction.

    Candidate algorithms are:

    • C-Support Vector Classifier (from scikit-learn), folder ml_models/SVC; in-depth descriptions of the algorithms used and why they were used are in notebooks in the model's main folder.

      Best scores:

      • Normal data (with mg): 0.98054 +- 0.00627 | 0.98980 +- 0.00187 (data aug, bagging)

      • Sorted data (with mg): 0.99932 +- 0.00124

      State: finished.

    • Random Forest Classifier (from scikit-learn), folder ml_models/RFC; in-depth descriptions of the algorithms used and why they were used are in notebooks in the model's main folder.

      Best scores:

      • Normal data (with mg): 0.91433 +- 0.01130 | 0.99608 +- 0.00118 (data aug)

      • Sorted data (with mg): 0.98992 +- 0.00518

      State: finished.

    • K-Nearest Neighbors Classifier (from scikit-learn), folder ml_models/KNC; in-depth descriptions of the algorithms used and why they were used are in notebooks in the model's main folder.

      Best scores:

      • Normal data (with mg): 0.90033 +- 0.01501 | 0.98917 +- 0.00224 (data aug)

      • Sorted data (with mg): 0.99842 +- 0.00177

      State: finished.

    • Light Gradient Boosting Machine (from lightgbm, Microsoft), folder ml_models/LGB; in-depth descriptions of the algorithms used and why they were used are in notebooks in the model's main folder.

      Best scores:

      • Normal data (with mg): 0.91316 +- 0.01207 | 0.95430 +- 0.00372 (data aug)

      • Sorted data (with mg): 0.99617 +- 0.00184

      State: finished.
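
The PCA check described at the start of the ml_models item can be sketched as follows. This is an illustration under stated assumptions, not the notebooks' code: it computes the projection with a plain NumPy SVD (scikit-learn's `PCA` should give the same projection up to the sign of each component), and the data are synthetic stand-ins, with a "glitch" modelled crudely as a spike added to Gaussian noise.

```python
import numpy as np


def pca_project(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of X onto their first principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical stand-in data: 100-sample windows of noise, half with a spike.
rng = np.random.default_rng(0)
no_glitch = rng.normal(0.0, 1.0, size=(200, 100))
glitch = rng.normal(0.0, 1.0, size=(200, 100))
glitch[:, 50] += 20.0  # a crude "glitch" at the centre of the window

X = np.vstack([no_glitch, glitch])
labels = np.array([0] * 200 + [1] * 200)
proj = pca_project(X)

# If the classes separate, their centroids sit far apart in the projected plane.
gap = np.linalg.norm(proj[labels == 0].mean(axis=0) - proj[labels == 1].mean(axis=0))
print(proj.shape, gap)
```

A scatter plot of `proj` coloured by `labels` gives the kind of graph the paragraph refers to: well-separated clouds suggest a binary classifier has a good chance, while overlapping clouds suggest the classes are hard to tell apart.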

Contributors

paolo97gll
