cogeorg / regulatorycomplexity
Research project to help measure the complexity of legal documents.
License: GNU General Public License v3.0
Title by title
Bootstrap should take care of that, but we are not sure it really does. Would be great if you could test this somehow.
Add the missing words due to nested operands in the txt files
For the Halstead approach, the dashes that prevented the words preceding them from being classified have now been removed, but those words still cannot be classified.
There is a bunch of text in DODDFRANK.txt that I don't think should be there. Take for example the string:
anorris on DSK5R6SHH1PROD with PUBLIC LAWS SEC. 327. IMPLEMENTATION PLAN AND REPORTS. Consultation. 12 USC 5437.
and
21:17 Aug 02, 2010 Jkt 089139 PO 00203 Frm 00163 Fmt 6580 Sfmt 6581 E:\PUBLAW\PUBL203.111 PUBL203
which I think come from page breaks. They are not exactly identical for every page break, which makes it tricky to find them. But perhaps you can find a good regexp to get rid of them.
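A possible starting point for such a regexp, as a minimal sketch only. The two patterns are assumptions generalized from the two sample strings above; the names `FOOTER`, `STAMP` and `strip_page_breaks` are hypothetical, and the patterns would need checking against the full DODDFRANK.txt. Marginal notes such as "Consultation. 12 USC 5437." are not touched.

```python
import re

# Assumed shape of the print footer, generalized from the one sample above:
# time, date, then "Jkt/PO/Frm/Fmt/Sfmt" counters, a file path and a job name.
FOOTER = re.compile(
    r"\d{1,2}:\d{2} [A-Z][a-z]{2} \d{2}, \d{4} "
    r"Jkt \d+ PO \d+ Frm \d+ Fmt \d+ Sfmt \d+ "
    r"\S+ \S+"
)

# Assumed shape of the operator stamp, e.g. "anorris on DSK5R6SHH1PROD with PUBLIC LAWS".
STAMP = re.compile(r"\b[a-z]+ on [A-Z0-9]+ with PUBLIC LAWS\b")


def strip_page_breaks(text):
    """Remove both kinds of page-break residue and collapse leftover spaces."""
    text = FOOTER.sub(" ", text)
    text = STAMP.sub(" ", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

Because the residue is not identical at every page break, it is probably worth printing everything the patterns remove and reviewing it by hand before overwriting the file.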
For instance "imply, implies, implying, implied" or "bank, banks", "agency, agencies" etc.
Is there a way to automate this? Thesaurus crawling?
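One option that avoids a thesaurus is to reduce each word to a common stem and group on that. The sketch below uses a deliberately crude set of suffix rules, chosen only to cover the examples above; `crude_stem` and `group_variants` are hypothetical names, and an established stemmer (for example the Porter stemmer) handles far more cases.

```python
from collections import defaultdict

# Suffix rules are an assumption made for this sketch: (suffix, replacement).
RULES = (("ies", "y"), ("ied", "y"), ("ing", ""), ("s", ""))


def crude_stem(word):
    """Map simple inflected forms onto one key, e.g. 'agencies' -> 'agency'."""
    w = word.lower()
    if w.endswith("ss"):  # leave words like 'business' alone
        return w
    for suffix, repl in RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + repl
    return w


def group_variants(words):
    """Group words that share the same crude stem."""
    groups = defaultdict(set)
    for word in words:
        groups[crude_stem(word)].add(word.lower())
    return {stem: sorted(forms) for stem, forms in groups.items()}
```

With such a grouping, classifying one form could propagate the category to the other forms in its group, subject to manual review of the groups.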
The current ./100_code/python/030_create_visuals.py creates a simple visualization of the Dodd-Frank act using a single Title and a set of files which contain different types of words.
Download the raw data for various regulatory texts from:
https://sites.google.com/site/unsharedtask2014/
and add it to the Dropbox in ./001_raw_data/ in a structured way. Manually inspect the texts, starting with the Dodd-Frank Act, to get an idea of what regulatory documents can look like.
One color for each category
Apply to Dodd-Frank
From landing page and /experiment page
Identify and highlight the proper string for each operand. There are no nested operands.
For example:
Use the Halstead (1976) measures of complexity and apply them to the Dodd-Frank regulation text.
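As a rough sketch, the standard Halstead measures can be computed from two token lists once each word has been classified as an operator or an operand. How regulatory words map onto these two classes is the project's own classification and is simply assumed as input here; the function name `halstead` is hypothetical.

```python
import math


def halstead(operators, operands):
    """Compute the basic Halstead measures from classified token lists.

    operators / operands: lists of tokens in text order, repeats included.
    """
    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": difficulty * volume,
    }
```

For example, `halstead(["shall", "unless"], ["bank", "regulator"])` gives a vocabulary of 4, a length of 4 and a volume of 8.0.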
Halstead (1976)
McCabe (1977)
Haldane (2012)
Li and Azar (2015)
Celerier and Vallee (2015)
The shell script is not working since you forgot to update the link. Always run code once before submitting it to make sure it works.
Delete all the punctuation, convert new line markers into spaces
Apply to Dodd-Frank
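The cleaning step could look roughly like this. It is a sketch under two assumptions: "punctuation" means the ASCII characters in `string.punctuation` (curly quotes and similar characters would survive), and runs of spaces are collapsed afterwards, which goes slightly beyond what is asked. The name `normalize` is hypothetical.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Turn line breaks into spaces, delete punctuation, collapse spaces."""
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    text = text.translate(_DROP_PUNCT)
    return re.sub(r" {2,}", " ", text).strip()
```

Note that deleting (rather than replacing) punctuation fuses hyphenated words, so "Dodd-Frank" becomes "DoddFrank"; replacing hyphens with spaces first may be preferable.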
For each section of Dodd-Frank, count the number of words/expressions in each category.
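A minimal sketch of the counting step, assuming the text has already been split into sections and that each category is available as a set of lowercase single words. Multi-word expressions are not handled, and the function name and the category names in the usage note are hypothetical.

```python
from collections import Counter


def count_categories(sections, word_lists):
    """Count classified words per category for each section.

    sections:   {section_id: section_text}
    word_lists: {category_name: set of lowercase words}
    """
    # Invert the word lists into a word -> category lookup.
    lookup = {w: cat for cat, words in word_lists.items() for w in words}
    result = {}
    for section_id, text in sections.items():
        tokens = text.lower().split()
        result[section_id] = dict(Counter(lookup[t] for t in tokens if t in lookup))
    return result
```

Calling it with `{"SEC. 327": "the bank shall report unless the regulator objects"}` and word lists for "economic operand", "regulation operator" and "logical operator" yields one count dictionary per section, which can then be written out as a table.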
Create a python program to get all the Legal References from the Dodd-Frank xml file.
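A possible first pass is sketched below. Since the structure of the Dodd-Frank XML file is not described here, the sketch makes no assumption about tag names: it flattens all text in the document and searches for U.S. Code citations of the form "12 USC 5437". The pattern and the names `USC_CITATION` and `legal_references` are assumptions; other kinds of legal references (public laws, named acts) would need their own patterns.

```python
import re
import xml.etree.ElementTree as ET

# Assumed citation shape: "<title> USC <section>", with optional dots in "U.S.C.".
USC_CITATION = re.compile(r"\b\d+ U\.?S\.?C\.? \d+[A-Za-z]*")


def legal_references(xml_string):
    """Return every U.S. Code citation found anywhere in the XML text."""
    root = ET.fromstring(xml_string)
    # Join all text nodes and normalize whitespace before matching.
    text = " ".join(" ".join(root.itertext()).split())
    return [m.group(0) for m in USC_CITATION.finditer(text)]
```

If the XML turns out to mark references with dedicated elements, reading those elements directly would be more reliable than pattern matching.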
Once we have categorized the different words, could we go further and define and recognize patterns in the text?
For instance sequences such as: "economic operand" - "regulation operator" - "economic operand" - "attribute/operand" - "logical operator" - "economic operand" etc.
e.g.: "a bank" - "should not" - "engage in" - " proprietary trading" - "unless" - "the regulator" - "is okay with it"
More like a long-run thing.
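As a first illustration of the idea, a fixed category pattern can be matched against an already-classified text by sliding a window over the sequence of category labels. This is only a sketch: it assumes the text is available as a list of (phrase, category) pairs, and the tagging used in the usage note is made up for the example.

```python
def find_pattern(tagged, pattern):
    """Return the phrases of every window whose categories equal `pattern`.

    tagged:  list of (phrase, category) pairs in text order
    pattern: sequence of category names to look for
    """
    categories = [category for _, category in tagged]
    pattern = list(pattern)
    n = len(pattern)
    hits = []
    for i in range(len(categories) - n + 1):
        if categories[i:i + n] == pattern:
            hits.append([phrase for phrase, _ in tagged[i:i + n]])
    return hits
```

For instance, with "a bank" tagged as an economic operand and "should not" as a regulation operator, searching for `("economic operand", "regulation operator")` returns `[["a bank", "should not"]]`. Counting the most frequent category n-grams would be a natural next step for discovering patterns rather than specifying them in advance.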
Give feedback at the end: number of correct answers, total time taken, how many people did better
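The feedback numbers could be computed along these lines. The sketch assumes each result is stored as a dictionary with `correct` and `seconds` keys (hypothetical field names), and it defines "did better" as more correct answers, or the same number of correct answers in less time; that ranking rule is an assumption and should be agreed on.

```python
def feedback(user, others):
    """Summarize one user's result relative to all other stored results.

    user / each entry of others: {"correct": int, "seconds": float}
    """
    def score(result):
        # Higher is better: more correct answers first, then less time.
        return (result["correct"], -result["seconds"])

    did_better = sum(1 for other in others if score(other) > score(user))
    return {
        "correct": user["correct"],
        "seconds": user["seconds"],
        "did_better": did_better,
    }
```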
In RegulatoryComplexity/100_code/python:
In RegulatoryComplexity/100_code/shells:
In RegulatoryComplexity/050_results/DoddFrank/Visuals/VIsualizer_Versions/V8_visualizer/app:
Generally:
Requires retrieving the different versions, obviously.
At the moment, if someone classifies e.g. 'of', all instances of 'The Banking Act of 1956' etc. are changed in the visualization. This can be quite confusing. Perhaps the visualizer could be changed such that only unclassified words can be changed in an update. This would also solve the issue of longer versus shorter classified words.
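The proposed rule can be sketched as follows, assuming the visualizer keeps one label per token with `None` for unclassified tokens. That data layout, the function name and the category names in the usage note are assumptions, not the visualizer's actual code.

```python
def apply_update(tokens, labels, word, category):
    """Label occurrences of `word`, but only where no label exists yet.

    tokens: list of words in text order
    labels: list of the same length; None marks an unclassified token
    """
    target = word.lower()
    return [
        category if label is None and token.lower() == target else label
        for token, label in zip(tokens, labels)
    ]
```

With the tokens of "The Banking Act of 1956 consists of rules" and the first five tokens already labelled as a legal reference, classifying "of" changes only the second "of" and leaves the act's name intact.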
Also by section, e.g. compare Basel I to capital regulation in Basel III
Create a simple hall of fame where user times and numbers of correct answers are shown. This is not super trivial, I think, and we will likely have to have a call about it. For now, it would be fine if you can simply figure out exactly where the results from individual users are kept and prepare a very basic hall of fame mockup that eventually should read this data.
Requires retrieving the original acts.
If they don't come up with the right solution explain the likely problem and let them compute again, until they find the right solution. You're not expected to create the example, of course, but we need one /experiment page that is slightly different from the others in that it shows a specific example which for now you can take as any of the examples that are being loaded. We want users to demonstrate that they understand what they need to do and hence the page should check whether a user has provided the correct answer (e.g. "42") and only advance them to the experiment stage if they got it right.
Ensure all literature in the Dropbox folder ./500_literature/ is named in the same way, i.e. Author1Author2YYYY-ShortTitle-journal.pdf . You can use abbreviations, such as JF (Journal of Finance), AER (American Economic Review), ECTA (Econometrica), etc.
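A small check like the following could list the files that do not follow the convention. The regular expression is one possible reading of "Author1Author2YYYY-ShortTitle-journal.pdf" (author names run together, four-digit year, then two hyphen-separated parts without spaces), and the file names in the usage note are made up.

```python
import re

# Assumed reading of the convention; ASCII letters only, no spaces.
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z]+\d{4}-[A-Za-z0-9]+-[A-Za-z]+\.pdf$")


def badly_named(filenames):
    """Return the file names that do not match the naming convention."""
    return [name for name in filenames if not NAME_PATTERN.match(name)]
```

For example, `badly_named(["SmithJones2015-ShortTitle-JF.pdf", "paper final.pdf"])` returns `["paper final.pdf"]`. In practice the input could come from `os.listdir` on the literature folder.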
We should have different types of word classifiers. One is a 'protected' list: the approved list with which a user starts a new session. The other is the user-specific list. Users should be able to switch between these two lists, e.g. by selecting a switch on the right somewhere.
Starting with the US, identify key regulation documents and find their sources. Download them and translate into .txt or .xml files.
In the Halstead approach, particular words such as 'financial', 'term', 'dealer' and 'providing' cannot be classified. This was observed in Title 8, but the problem exists for all titles. When we classify one of these words it gets highlighted, but once the document is updated the word is no longer classified.
Also, when the document is updated, the number of classified words seems to change (i.e. a particular word is sometimes classified and sometimes not).