Code Monkey home page Code Monkey logo

abstractive-extractive-summarization's Introduction

Financial Text Summarization using RL policies

This repository contains the code for our Financial Texts Summarization task, realized for Deep Natural Language Process class(2021-2022). Our task was to re-produce the work made by Zmander who managed to rank 2nd in FNS2021 competition. In this work, we will exploit long texts' summarization task combining an extraction and abstraction approaches by a Reinforcement Learning policy. In addition to that, we propose a distributional analysis to understand which are the most salient parts of the documents and to cut them according to the distribution we found. Furthemore, we extend the task to CNN's headline generation and we use the model to FNS2022's dataset which is composed by three different languages: English, Spanish and Greek. The pipeline we propose, and you can reproduce, is the following:

  1. Data preprocessing, comprising documents'cut, corpus generation, etc.
  2. Extractor, Abstractor and RL models training
  3. Model evaluation

Dependencies

  • Python 3 (tested on python 3.6)
  • PyTorch 0.4.0
    • with GPU and CUDA enabled installation (though the code is runnable on CPU, it would be way too slow)
  • gensim
  • cytoolz
  • tensorboardX
  • pyrouge (for evaluation)

Datasets Download

You can download the datasets we used from the following links

  1. FNS 2022 Dataset
  2. CNN Headlines Generation Dataset

Execution guide

In the following, we will give you a suggestion on how to re-run our work. In this way you can check our results and change the training settings if you like. If you like, a colab notebook with the most salient steps is provided at Open In Colab. Make a copy and play with it as you prefer!
Moreover we provide you with the preprocessed files and pretrained models:

  1. FNS2022
  2. CNN

Download the whole directories and copy them in your main Google Drive folder. Enjoy!

Distribution

This step can be avoided. In fact, if it is not run, the cut is performed using to first 1000 sentences. If you would like to have a look at the distribution of importance in your documents you should run:

%run preprocess/distribution_analysis.py --data <DATASET> --language <language> --stage <stage you want to start from> --top_M <top sentences to compute the rouge with> --jit <True if you want to use jit, False otherwise>

Note that if the parameter --jit is not specified, the code is run using it.

Preprocessing pipeline

The following sketch shows an idea of how our preprocessing pipeline works. Note that the documents are cut according to their distribution

Preprocessing

First of all, you will need to pre-process your data. To do that, you can use the script pipeline.py which is inside the folder called "preprocess". In this way, you will transform you data in ordert to be feasible for the models.

!python preprocess/pipeline.py --data <DATASET> --language <language> --max_len <maximum length. Set this parameter even if you want to use distribution.> --stage <stage_you_want_to_start_from> --jit <True if you want to use jit, False otherwise> --use_distribution <stores true if you want to cut documents according to their distribution>

Note: if the parameter --jit is not specified, the code is run using it.

Train extractor

Next step is to train extractor. The image displays the main idea of its architecture.

Alt text

To train it run the following cell:

!python train_extractor_ml.py --data <DATASET> --language <language> --lstm_layer <number of lstms> --lstm_hidden <number of lstm hidden layers> --batch <batch size> --ckpt_freq <checkpoint frequency> --max_word <words in a sentence are cut according to this parameter> --max_sent <sentences in a document are cut according to this parameter>

Train abstractor

If you want to train abstractor, following line needs to be executed:

!python train_abstractor.py --data <DATASET> --language <language> --n_layer <number of layers> --n_hidden <number of hidden layers> --batch <batch size> --ckpt_freq <checkpoint_frequnce>

Train RL model

Last, but not the least, model to be trained is the Reinforcement Learning's agent. The scheme shows the main steps of its functioning idea.

Alt text

Concerning RL, the parameter --abs_dir passes the abstraction directory to RL model. If you like performing our ablation study, do not pass the parameter --abs dir. If you would like to train RL part, run the following cell:

!python train_full_rl.py --data <DATASET> --language <language> --batch <batch size> --abs_dir <directory of the abstractor (use "model\abs"). If you want to perform ablation study, do not pass this argument>

Evaluate the model

In the end, evaluate the model using the script decode_full_model.py:

!python decode_full_model.py --data <DATASET> --language <language> --batch <batch size>

Results

You should get the following results if you cut the documents according to their distribution.

Language HypPar R-type R-1 R-2 R-L
English n_hidden=128 n_LSTM=1 batch_size=4 F1 0.332 0.118 0.326
Recall 0.315 0.117 0.309
Greek n_hidden=256 n_LSTM=2 batch_size=4 F1 0.489 0.311 0.479
Recall 0.227 0.140 0.224
Spanish n_hidden=128 n_LSTM=1 batch_size=4 F1 0.340 0.094 0.334
Recall 0.292 0.081 0.286

abstractive-extractive-summarization's People

Contributors

chenrocks avatar leomaggio avatar vitopalmis avatar z216z avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.