

New_German_Sten_Protokollen

Review of fetching German Stenographic Protocols using the Bundestag API

File: 0_retrieve_text
Create the general_df that will receive all the documents. This DataFrame is larger than 25 MB, so I could not upload it to the repository.

File: 001_new_try_slicing_filtered_texts:

  1. Look for sections that have any of the search terms and save them in the folder filtered_sections:
# Define the search terms and forbidden terms

search_terms_holz = [r'\bholz\w*', r'\bforst\w*', r'\bwald\w*', r'\bbioökonomie', r'\bbioenergie', r'\bmöbel\w*']

forbidden_terms = ['Holzner', 'Holzleitner', 'Waldneukirchen', 'Holzmann', 'Waldegg', 'Waldheims', 'Waldorf', 'Waldvier', 'Waldei',
                   'Holzkreuze', 'Holzweg', 'Wald4tler', 'Walddorf', 
                  'Waldkirchen', 'Waldburg', 'Waldorfschule', 'Waldspaziergang', 'Holzfuß', 'Waldzell', 'Holzleit',
                  'Waldbe', 'Waldneukirchen', 'Holzgau', 'waldorfmathematisch', 'Waldwürfe', 'Waldviertelbahn', 'Waldheim', 'Walding', 'Waldenstein', 
                  'Waldviertlerin', 'Holzlärm', 'Holzchild', 'Waldviertelautobahn', 'Holzinger', 'Waldviertler', 'Waldbrand', 'Waldbrände', 'Land- und forstwirtschaftliche Landeslehrpersonen-Dienstrechtsgesetz',
                  'Land- und forstwirtschaftliche Landesvertrags­lehr­personengesetz']
  2. I manually marked where each Tagesordnungspunkt (TOP, agenda item) starts and ends in the files, and saved a copy of each section in filtered_sections_marked
  3. I divided each section into TOPs and searched for our search terms in each TOP. I saved the TOPs separately in the folder filtered_tops
  4. Then I identified in the files where a president's speech and where a deputy's speech begins, and automatically split the text at these points
  5. I further sliced the TOPs into Reden (speeches)
  6. I searched for the Reden that contain our search terms
  7. I combined the filtered speeches with the plenary information, saving the speeches in filtered_reden and the speeches with their information as reden_df_final.csv
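
The same term search is applied to sections, TOPs, and Reden in the steps above. Below is a minimal sketch of one way such a check could work. It is not the notebook's actual code: the README does not say how the forbidden terms are applied, so this sketch assumes they are masked as whole words (case-insensitively) before the search terms are matched. `find_hits` and `filter_texts` are hypothetical helper names, and the forbidden list is abbreviated.

```python
import re

# Search terms copied from the README.
search_terms_holz = [r'\bholz\w*', r'\bforst\w*', r'\bwald\w*',
                     r'\bbioökonomie', r'\bbioenergie', r'\bmöbel\w*']

# Abbreviated forbidden list; the full list is given in the README.
forbidden_terms = ['Holzner', 'Holzweg', 'Waldorf', 'Waldviertler',
                   'Waldbrand', 'Waldbrände']

# One case-insensitive pattern that matches any search term.
SEARCH_RE = re.compile('|'.join(search_terms_holz), re.IGNORECASE)

# Assumption: forbidden terms count only as whole words or phrases.
# Longer terms come first so they win over shorter alternatives.
FORBIDDEN_RE = re.compile(
    r'\b(?:' + '|'.join(
        re.escape(term)
        for term in sorted(forbidden_terms, key=len, reverse=True)
    ) + r')\b',
    re.IGNORECASE,
)


def find_hits(text):
    """Return the search-term matches left after masking forbidden terms."""
    masked = FORBIDDEN_RE.sub(' ', text)
    return [match.group(0) for match in SEARCH_RE.finditer(masked)]


def filter_texts(texts):
    """Keep only the texts that contain at least one valid hit."""
    return [text for text in texts if find_hits(text)]
```

With this sketch, a text that mentions only a forbidden name such as "Holzner" is dropped, while a text that also contains a word like "Waldbestand" is kept. Masking whole words means a forbidden fragment such as 'Waldbe' from the full list would only suppress that exact token, which may or may not match the original intent.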

new_german_sten_protokollen's People

Contributors

voigtjessica
