regulatorycomplexity's People

Contributors

alimonm, cogeorg, sabineschaller

regulatorycomplexity's Issues

Dash Error

For the Halstead Approach, the dashes that prevented the words before them from being classified have now been removed, but those words still cannot be classified.

Clean DODDFRANK.txt

There is a bunch of text in DODDFRANK.txt that I don't think should be there. Take for example the string:
anorris on DSK5R6SHH1PROD with PUBLIC LAWS SEC. 327. IMPLEMENTATION PLAN AND REPORTS. Consultation. 12 USC 5437.
and
21:17 Aug 02, 2010 Jkt 089139 PO 00203 Frm 00163 Fmt 6580 Sfmt 6581 E:\PUBLAW\PUBL203.111 PUBL203
which I think come from page breaks. They are not exactly identical for every page break, which makes it tricky to find them. But perhaps you can find a good regexp to get rid of them.
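
A possible starting point, as a minimal sketch: the two patterns below are derived only from the two example strings above, so they are assumptions about what the page-break residue looks like elsewhere in the file. The names `OPERATOR_STAMP`, `JOB_FOOTER` and `strip_page_break_residue` are hypothetical. The sketch deliberately leaves section headings and margin notes such as "12 USC 5437." untouched, since it is not clear from the examples whether those should go.

```python
import re

# Operator stamp, e.g. "anorris on DSK5R6SHH1PROD with PUBLIC LAWS".
# Assumption: always "<user> on <machine> with PUBLIC LAWS".
OPERATOR_STAMP = re.compile(r"\b\w+\s+on\s+\w+\s+with\s+PUBLIC\s+LAWS\b")

# Print-job footer, e.g.
# "21:17 Aug 02, 2010 Jkt 089139 PO 00203 Frm 00163 Fmt 6580 Sfmt 6581 E:\PUBLAW\PUBL203.111 PUBL203".
# Assumption: the field names (Jkt, PO, Frm, Fmt, Sfmt) are stable and only the values vary.
JOB_FOOTER = re.compile(
    r"\d{1,2}:\d{2}\s+[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4}\s+"
    r"Jkt\s+\d+\s+PO\s+\d+\s+Frm\s+\d+\s+Fmt\s+\d+\s+Sfmt\s+\d+\s+\S+\s+\S+"
)


def strip_page_break_residue(text):
    """Remove the two kinds of page-break residue and tidy the leftover spaces."""
    text = OPERATOR_STAMP.sub(" ", text)
    text = JOB_FOOTER.sub(" ", text)
    # Collapse runs of spaces/tabs left behind by the removals; newlines are kept.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

It would be worth printing every match before deleting anything, to check that the patterns do not catch legitimate text.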

Improve 030_create_visuals.py

The current ./100_code/python/030_create_visuals.py creates a simple visualization of the Dodd-Frank act using a single Title and a set of files which contain different types of words.

  1. Improve the visualization so that the structure of the regulation is kept (e.g. using your xml script). The result should be displayed as .html.
  2. Then, allow people to add words to different categories (the files in 020_auxiliary_data/Sections/619/). Ideally, this would be possible by someone selecting a word and then right-clicking on it to get a menu with each file name as entry so that the word can directly be added (the script should add the word to the respective file).

Refactor xlm_parser

  • Rename it properly to xml_parser
  • Have the file_name be passed as a command-line argument
  • Write a .sh bash script that passes the respective command-line arguments
  • Have the output file be passed as a command-line argument
  • Turn the Readme.rtf into a proper .txt file. There are a lot of special characters in the file right now.
  • Make sure to use 'html' throughout, not 'htm'
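
The two command-line items above could look roughly like this with Python's standard `argparse` module. This is only a sketch: the argument names `input_file` and `output_file` and the functions `build_arg_parser` and `main` are hypothetical, and the actual parsing step is left as a placeholder because it depends on the existing script.

```python
import argparse


def build_arg_parser():
    """Build a parser with two positional arguments (names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Parse a regulation file and write the result to a file."
    )
    parser.add_argument("input_file", help="path of the file to parse")
    parser.add_argument("output_file", help="path of the file to write")
    return parser


def main(argv=None):
    """Read the arguments; argv=None means 'use sys.argv', as argparse does by default."""
    args = build_arg_parser().parse_args(argv)
    # Placeholder: the existing parsing logic would be called here with
    # args.input_file and args.output_file instead of hard-coded names.
    return args.input_file, args.output_file
```

The .sh script would then only need to call the Python file with the two paths as arguments.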

Proper identification of string operands

Identify and highlight the proper string for each operand. There are no nested operands.
For example:

  • Advisers Act of 1940 - Legal Operand, True
  • Advisers Act -of- Grammar operand 1940 - Legal Operand, False

Balance sheet fixes

  • Get rid of extra space on the right;
  • Use a larger font;
  • Use the same font everywhere;
  • Write below balance sheet "EUR is the domestic currency and USD is a foreign currency"
  • Delete "(national currency)" and "(foreign currency)" everywhere
  • Show subcategories (see excel I forwarded, this is how it should look on the website as well, including horizontal line to better differentiate the various balance sheet positions)

Fix parser shell script

The shell script is not working since you forgot to update the link. Always run code once before submitting it to make sure it works.

c836f8f

Standardize strings

Delete all punctuation and convert newline markers into spaces.
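
A minimal sketch of this step, assuming "punctuation" means the ASCII characters in Python's `string.punctuation`. The function name `standardize` is hypothetical, and collapsing repeated spaces at the end is an extra assumption, not part of the request. Note that removing the hyphen turns "Dodd-Frank" into "DoddFrank"; if hyphenated terms should survive, the hyphen would have to be excluded.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def standardize(text):
    """Replace newline markers with spaces, then delete punctuation."""
    # Handle Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    text = text.translate(_DELETE_PUNCTUATION)
    # Assumption: collapse the runs of spaces that the steps above leave behind.
    return " ".join(text.split())
```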

Apply to Dodd-Frank

Compute Halstead measures, several variations

  • Operands can be: "economic operands" / everything that is not an operator / the terms with a separate definition, as identified by Ali
  • Operators can be function words / regulatory operators / legal operators / logical operators / combinations of the last three categories
  • We can compute volume, length, level, repetition of operands, unnecessary operators, difficulty, effort
  • Apply to Dodd-Frank, section by section. Correlation matrix between the different variations on operators and operands.
  • To be done once the counting of all word categories is done.
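
For reference, a sketch of the textbook Halstead quantities for one choice of operators. It assumes the simplest variation listed above, where everything that is not an operator counts as an operand, and it uses the standard definitions (volume = length × log2(vocabulary), difficulty = (n1 / 2) × (N2 / n2), level = 1 / difficulty, effort = difficulty × volume). The function name `halstead_measures` is hypothetical, and "repetition of operands" and "unnecessary operators" are not covered because their definitions are not given here.

```python
import math


def halstead_measures(tokens, operators):
    """tokens: list of words; operators: set of words treated as operators.

    Every token that is not in `operators` is counted as an operand.
    """
    operator_tokens = [t for t in tokens if t in operators]
    operand_tokens = [t for t in tokens if t not in operators]

    n1, n2 = len(set(operator_tokens)), len(set(operand_tokens))  # distinct counts
    N1, N2 = len(operator_tokens), len(operand_tokens)            # total counts

    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    level = 1 / difficulty if difficulty > 0 else 0.0
    effort = difficulty * volume

    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "level": level,
        "effort": effort,
    }
```

Running this once per section and per operator/operand variation would give the table from which the correlation matrix can be computed.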

Recognizing logical structures

Once we have categorized the different words, could we go further and define and recognize patterns in the text?

For instance sequences such as: "economic operand" - "regulation operator" - "economic operand" - "attribute/operand" - "logical operator" - "economic operand" etc.

e.g.: "a bank" - "should not" - "engage in" - "proprietary trading" - "unless" - "the regulator" - "is okay with it"

More like a long-run thing.
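
As a first idea of what this could look like, here is a sketch that assumes the text has already been reduced to a list of category labels, one per classified expression. The function name `find_pattern` and the short labels in the usage note are hypothetical; it only finds exact, contiguous sequences.

```python
def find_pattern(categories, pattern):
    """Return the start indices where `pattern` occurs as a contiguous run.

    categories: list of category labels, one per classified expression.
    pattern: list of labels to look for, in order.
    """
    width = len(pattern)
    if width == 0:
        return []
    return [
        i
        for i in range(len(categories) - width + 1)
        if categories[i:i + width] == pattern
    ]
```

For example, with the made-up labels `["ECON", "REG", "ECON", "LOGIC", "ECON", "REG", "ECON"]`, searching for `["ECON", "REG", "ECON"]` finds the pattern at positions 0 and 4.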

Give feedback at the end

Give feedback at the end: number of correct answers, total time taken, how many people did better
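
The "how many people did better" part could be computed along these lines. This is a sketch under two assumptions: results are available as a mapping from user name to (number of correct answers, time in seconds), and "better" means more correct answers, with ties broken by a shorter time. The function name `count_better` and the data layout are hypothetical.

```python
def count_better(results, user):
    """Count the users who did better than `user`.

    results: dict mapping user name -> (correct_answers, seconds_taken).
    Assumption: more correct answers is better; on a tie, less time is better.
    """
    correct, seconds = results[user]
    return sum(
        1
        for name, (other_correct, other_seconds) in results.items()
        if name != user
        and (
            other_correct > correct
            or (other_correct == correct and other_seconds < seconds)
        )
    )
```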

Questions on code

In RegulatoryComplexity/100_code/python:

  • What is the difference between 010_split_sections-DF.py and 011_analyze_sections-DF.py?
  • What happens in 020_compute_statistics_sections-DF.py?

In RegulatoryComplexity/100_code/shells:

  • Why does 003_parser_xml.sh use HTML input?

In RegulatoryComplexity/050_results/DoddFrank/Visuals/VIsualizer_Versions/V8_visualizer/app:

  • What does tabledef.py create? Database with registered users?

Generally:

  • Please use more comments and documentation.

Expressions within existing expressions

At the moment, if someone classifies e.g. 'of', all instances of 'The Banking Act of 1956' etc. are changed in the visualization. This can be quite confusing. Perhaps the visualizer could be changed such that only unclassified words can be changed in an update. This would also solve the issue of longer versus shorter classified words.
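
One way to express the "only unclassified words can be changed" rule, as a sketch: it assumes the text is held as a list of words with a parallel list of labels, where `None` marks an unclassified word. The function name `classify_unclassified` and this data layout are hypothetical and may not match how the visualizer stores its state.

```python
def classify_unclassified(tokens, labels, phrase, category):
    """Label occurrences of `phrase`, but only where every word is still unclassified.

    tokens: list of words in the text.
    labels: parallel list; None means the word has no category yet.
    phrase: list of words to classify (one or more).
    category: label to assign.
    Returns a new label list; the input lists are not modified.
    """
    new_labels = list(labels)
    width = len(phrase)
    if width == 0:
        return new_labels
    i = 0
    while i <= len(tokens) - width:
        window_free = all(label is None for label in new_labels[i:i + width])
        if tokens[i:i + width] == phrase and window_free:
            new_labels[i:i + width] = [category] * width
            i += width  # skip past the words just classified
        else:
            i += 1
    return new_labels
```

With this rule, classifying 'of' would leave an already classified 'The Banking Act of 1956' untouched and only label the free-standing occurrences.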

Include a hall of fame

Create a simple hall of fame where user times and the number of correct answers are shown. This is not super trivial, I think, and we will likely have to have a call about it. For now, it would be fine if you can simply figure out exactly where the results from individual users are kept and prepare a very basic hall of fame mockup that eventually should read this data.

Ask subjects to solve a first example before they start.

If they don't come up with the right solution, explain the likely problem and let them compute again until they find the right solution. You're not expected to create the example, of course, but we need one /experiment page that is slightly different from the others in that it shows a specific example, which for now can be any of the examples that are being loaded. We want users to demonstrate that they understand what they need to do, so the page should check whether a user has provided the correct answer (e.g. "42") and only advance them to the experiment stage if they got it right.

Standardize Literature

Ensure all literature in the Dropbox folder ./500_literature/ is named in the same way, i.e. Author1Author2YYYY-ShortTitle-journal.pdf. You can use abbreviations, such as JF (Journal of Finance), AER (American Economic Review), ECTA (Econometrica), etc.
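
To find files that do not follow the convention, a check like the following might help. It is a sketch that reads the convention narrowly: capitalized author surnames written together, a four-digit year, a short title without spaces or hyphens, and a journal abbreviation made of letters. These restrictions, the names `LITERATURE_NAME` and `is_standard_name`, and the file names in the usage note are all assumptions.

```python
import re

# Author1Author2YYYY-ShortTitle-journal.pdf, read narrowly (see assumptions above).
LITERATURE_NAME = re.compile(
    r"(?:[A-Z][a-z]+)+"   # one or more capitalized surnames, written together
    r"\d{4}"              # publication year
    r"-[A-Za-z0-9]+"      # short title without spaces or hyphens
    r"-[A-Za-z]+"         # journal abbreviation, e.g. JF, AER, ECTA
    r"\.pdf"
)


def is_standard_name(file_name):
    """Return True if `file_name` matches the naming convention."""
    return LITERATURE_NAME.fullmatch(file_name) is not None
```

For instance, a made-up name like "SmithJones2015-BankRules-JF.pdf" passes, while "smith_2015_bank rules.pdf" does not.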

Different types of word classifiers

We should have different types of word classifiers. One is a 'protected' list: the approved list with which a user starts a new session. The other is the user-specific list. Users should be able to switch between these two lists, e.g. by selecting a switch on the right somewhere.

Find other regulation documents

Starting with the US, identify key regulation documents and find their sources. Download them and convert them into .txt or .xml files.

Certain words not classifying

In the Halstead Approach, particular words such as 'financial', 'term', 'dealer', 'providing' cannot be classified. This was observed in Title 8, but the problem exists for all titles. When we classify one of these words it gets highlighted, but once the document is updated the word is no longer classified.

Also, when the document is updated, the number of classified words seems to change (i.e. sometimes a particular word will be classified and other times not).
