
master_thesis's Introduction

Modeling Reading Behavior on Online News Portals Using Deep Learning (original title: Modellierung von Leseverhalten auf Online-Zeitungsportalen anhand von Deep Learning)

Master's thesis by Susanna Rücker at FSU Jena in cooperation with INWT Statistics GmbH

English abstract

This study examines user engagement with German online news articles. Implicit feedback measures, or Key Performance Indicators (KPIs), such as pageviews or dwell time are commonly used as proxies for user engagement on websites. After a thorough discussion of implicit user feedback and related fields of research, the work focuses on building different kinds of models that predict average user dwell time from the text of each article alone. A large corpus of news articles and their respective KPI measures was created specifically for this study; a part of it (36,383 articles from the German daily newspaper [censored]) was chosen for the prediction task. The models include several baselines, two of them using Bag-of-Words (BOW) features, but rely mostly on the pretrained transformer model BERT embedded in several Deep Learning architectures. All models are evaluated and compared on unseen test data using various evaluation metrics. The work also addresses the problem of applying BERT to longer documents, given its limit on input sequence length. Most of the models simply truncate the article and use only the first part, which turns out to be a valid approach that yields good dwell time predictions. Two of the more complex models, however, take a hierarchical approach, splitting the article into several smaller sections and combining the outputs of the sections. A further analysis uses the tool SHAP to interpret the dwell time predictions of two models (one BOW baseline and one model that includes BERT).
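The two strategies for long articles mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the repository: the function names, the 512-token limit, and the choice to average per-section outputs are assumptions made for illustration (the abstract does not say how the section outputs are combined), and `section_model` is a hypothetical stand-in for BERT plus a regression head.

```python
from typing import Callable, List, Sequence

# Assumption: the usual BERT input limit of 512 tokens (incl. special tokens).
MAX_LEN = 512


def truncate(token_ids: Sequence[int], max_len: int = MAX_LEN) -> List[int]:
    """Truncation strategy: keep only the first max_len tokens of the article."""
    return list(token_ids[:max_len])


def split_into_sections(
    token_ids: Sequence[int], max_len: int = MAX_LEN
) -> List[List[int]]:
    """Split an article into consecutive, non-overlapping sections."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        list(token_ids[start:start + max_len])
        for start in range(0, len(token_ids), max_len)
    ]


def hierarchical_predict(
    token_ids: Sequence[int],
    section_model: Callable[[List[int]], float],
    max_len: int = MAX_LEN,
) -> float:
    """Hierarchical strategy: score every section, then combine the outputs.

    Combining by arithmetic mean is an illustrative assumption, not
    necessarily what the thesis models do.
    """
    sections = split_into_sections(token_ids, max_len)
    if not sections:
        raise ValueError("cannot predict on an empty document")
    scores = [section_model(section) for section in sections]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    article = list(range(1200))  # dummy token ids for a 1200-token article
    toy_model = lambda section: float(len(section))  # hypothetical stand-in
    print(len(truncate(article)))                          # 512
    print([len(s) for s in split_into_sections(article)])  # [512, 512, 176]
    print(hierarchical_predict(article, toy_model))        # 400.0
```

Truncation discards everything after the first section, whereas the hierarchical variant sees the whole article at the cost of one model pass per section.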

Submitted written version of the thesis (updated 10.09.21):

Repository disclaimer (updated 10.09.21)

This working repository is anything but tidy, documented, or fit for public viewing... The code is unchanged; the README was adjusted slightly and the PDF was added.

Overview of the repository contents:

in /master_thesis:

  • /src: model architectures (models.py), generally important helper functions (utils.py), data loading (read_data.py), and Dataset/DataLoader-related code (data.py).

  • /experiments: scripts for training the different models. The relevant ones are mainly in `/regression`; the other folders contain additional experiments that were not pursued further (reformulating the task as a binary classification problem, first attempts at emotion analysis).

  • /notebooks: (very unsystematic) Jupyter notebooks with first exploratory modeling attempts, dataset inspection, merging of the different datasets, ...

  • /outputs: (not tracked by git) saved trained models, saved features, predictions, TensorBoard logdirs

  • /deprecated: outdated material that might still be of interest and was therefore not deleted

First notes, helpful links, etc. (not continued)

Venelin Valkov (good tutorials, clearly explained)

On the problem of LONG texts:
