philterd / phileas
The PII and PHI redaction engine
Home Page: https://www.philterd.io
License: Apache License 2.0
Street addresses should support suite and apartment numbers.
Validate filter profiles when loading them. If a strategy has "CRYPTO_REPLACE" make sure there is a Crypto object, etc.
The goal is to reduce zip code false positives by including a lookup when text matches a potential zip code. Because zip codes change, the lookup should not be definitive but should be an additional factor when determining whether a match is a true positive.
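A minimal sketch of treating the lookup as one factor rather than a gate. The class name `ZipCodeConfidence` and the boost and penalty values are assumptions for illustration and are not part of Phileas:

```java
import java.util.Set;

// Hypothetical helper, not part of the Phileas API.
final class ZipCodeConfidence {

    // Assumed tuning values, chosen only for illustration.
    static final double KNOWN_ZIP_BOOST = 0.2;
    static final double UNKNOWN_ZIP_PENALTY = 0.1;

    private final Set<String> knownZipCodes;

    ZipCodeConfidence(Set<String> knownZipCodes) {
        this.knownZipCodes = knownZipCodes;
    }

    // Adjusts a regex-derived confidence using the lookup as one extra factor.
    double adjust(String candidate, double baseConfidence) {
        // Compare only the five-digit part so ZIP+4 values such as 90210-1234 still match.
        String fiveDigit = candidate.length() >= 5 ? candidate.substring(0, 5) : candidate;
        double adjusted = knownZipCodes.contains(fiveDigit)
                ? baseConfidence + KNOWN_ZIP_BOOST
                : baseConfidence - UNKNOWN_ZIP_PENALTY;
        // Clamp to [0, 1]. An unknown zip code is penalized, never rejected outright,
        // because the lookup data may be out of date.
        return Math.max(0.0, Math.min(1.0, adjusted));
    }
}
```

Because the lookup only nudges the confidence, a newly issued zip code that is missing from the list can still be reported if the other signals are strong.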
Rename CRYPTO_REPLACE to AESSIV_REPLACE. This also means updating the encryption algorithm to use AES-SIV instead.
Given the input:
"George Washington was president and his ssn was 123-45-6789 and he lived at 90210."
The POS filter fails because the tokens are "George" and "Washington" individually rather than "George Washington". The filter needs to be changed to allow for multi-word tokens.
Read list of ignored words from an external database. Allow the user to specify the SQL query to return a list of words.
This can be used for more than ignored words, such as dictionary terms, etc.
Street addresses with suite numbers
445 Minnesota Street, Suite 175
CityFilterTest test cases are using sensitivity level that does not match the function name.
Shift dates by context (a person's name) in the text. This is so each document about a specific person has its dates shifted consistently.
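One possible way to get a consistent shift per person is to derive the number of days from a hash of the context. The sketch below assumes a hypothetical `ContextDateShifter` class, a secret salt, and a maximum shift; none of these names come from Phileas:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical helper, not part of the Phileas API.
final class ContextDateShifter {

    private final String salt;
    private final int maxShiftDays;

    ContextDateShifter(String salt, int maxShiftDays) {
        if (maxShiftDays < 1) {
            throw new IllegalArgumentException("maxShiftDays must be at least 1");
        }
        this.salt = salt;
        this.maxShiftDays = maxShiftDays;
    }

    // Returns the same shift (1..maxShiftDays) every time for the same context.
    int shiftFor(String context) {
        try {
            String key = salt + ":" + context.trim().toLowerCase(Locale.ROOT);
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Fold the first four bytes of the hash into an int.
            int value = ((hash[0] & 0xFF) << 24) | ((hash[1] & 0xFF) << 16)
                    | ((hash[2] & 0xFF) << 8) | (hash[3] & 0xFF);
            // floorMod keeps the result non-negative even when value is negative.
            return Math.floorMod(value, maxShiftDays) + 1;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    LocalDate shift(LocalDate date, String context) {
        return date.plusDays(shiftFor(context));
    }
}
```

Since the shift depends only on the salt and the context, every document about the same person moves its dates by the same number of days, and the intervals between dates are preserved.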
PersonsV2FilterTest fails due to ONNX Runtime on M2.
[ERROR] Errors:
[ERROR] PersonsV2FilterTest.filter1:64 » UnsatisfiedLink no onnxruntime in java.librar...
[ERROR] PersonsV2FilterTest.filter2:96 » NoClassDefFound Could not initialize class ai...
[ERROR] PersonsV2FilterTest.filter3:135 » NoClassDefFound Could not initialize class a...
[ERROR] PersonsV2FilterTest.filter4:172 » NoClassDefFound Could not initialize class a...
[ERROR] PersonsV2FilterTest.filter5:205 » NoClassDefFound Could not initialize class a...
[ERROR] PersonsV2FilterTest.filter6:240 » NoClassDefFound Could not initialize class a...
Use stop words to shorten physician names. Instead of taking the entire n-gram, see if we can use stop words to shorten the span by cutting it based on the location of the stop words.
Look at each token in the physician name span, working inward from both ends, to see if it is a stop word. If it is, condense the span.
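The outside-in trimming could be sketched as follows. `StopWordSpanTrimmer` is a hypothetical name, and the sketch works on a token list only; a real filter would also need to adjust the span's character offsets:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the Phileas API.
final class StopWordSpanTrimmer {

    // Drops stop words from both ends of the token list; interior tokens are kept.
    static List<String> trim(List<String> tokens, Set<String> stopWords) {
        int start = 0;
        int end = tokens.size();
        // Walk in from the left while the outermost token is a stop word.
        while (start < end && stopWords.contains(tokens.get(start).toLowerCase(Locale.ROOT))) {
            start++;
        }
        // Walk in from the right in the same way.
        while (end > start && stopWords.contains(tokens.get(end - 1).toLowerCase(Locale.ROOT))) {
            end--;
        }
        return new ArrayList<>(tokens.subList(start, end));
    }
}
```

The stop word set is expected to be lowercase. A stop word in the middle of the span is left alone, so only the edges are condensed.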
Add AWS access/secret key detection support.
Macie includes it: Using managed data identifiers in Amazon Macie - Amazon Macie
Add Phileas ONNX documentation.
Add an option to the Persons filter to also look for titles and suffixes.
Remove unnecessary bounding box property on Identifiers. Bounding boxes are set through the Graphical property on Identifiers.
Hi,
I am not very knowledgeable about Java, but much to my surprise I did manage to write a simple client using your instructions and get it to compile and run using Maven. However, I have not been able to figure out how to launch the Phineas service it expects at https://127.0.0.1:8080. I was wondering how to do that?
Cheers,
Andrew
Revisit intelligent NER filtering based on confidence values.
Support non-USD currencies. Need to add options to the filter strategy to designate the type of currency (or none for all types).
Ignore cities when they appear as part of a court name, e.g. District Court of Baltimore City.
This requires consideration about where to implement the feature. If we are looking for city names then it seems to be a function of the CITY filter. So that would require a flag in the CITY filter strategy to ignore the city if it is given as part of a court name.
Court names seem to be either:
… Court of … - Supreme Court of West Virginia
… Court of the … - Supreme Court of the United States
… Court - Wisconsin Supreme Court
… Court for the … - United States District Court for the Eastern District of Wisconsin
The Restriction class could probably be used as a means of doing a lookup.
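As a rough illustration of the court name patterns listed above, the check below looks at the text immediately before and after a city span. `CourtNameCheck` and its regular expressions are assumptions for this sketch and would need tuning against real documents:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Phileas API.
final class CourtNameCheck {

    // Text right before the city: "Court of ", "Court of the ", "Court for the ".
    private static final Pattern BEFORE =
            Pattern.compile("\\bCourt\\s+(?:of|for)(?:\\s+the)?\\s+$");

    // Text right after the city: up to two capitalized words, then "Court",
    // which covers the "... Court" form such as "<city> Municipal Court".
    private static final Pattern AFTER =
            Pattern.compile("^\\s+(?:[A-Z][a-z]+\\s+){0,2}Court\\b");

    // start and end are the character offsets of the city span within text.
    static boolean isPartOfCourtName(String text, int start, int end) {
        return BEFORE.matcher(text.substring(0, start)).find()
                || AFTER.matcher(text.substring(end)).find();
    }
}
```

A flag on the CITY filter strategy could then decide whether spans for which this check returns true are dropped.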
Allow individual filter regex to be enabled/disabled. The purpose is to allow only a set of regexes to be enabled.
There could be magic environment variables that can be set/unset to enable/disable the regex patterns. (Or some other method.)
Add an option to the Tracking Number filter to allow for enabling/disabling individual shippers.
Use the contextual terms in a regex filter to set the confidence.
How to handle spelled out numbers in place of digits in the filters that use regular expressions with digits?
Identifier filters should be able to specify the group number in the filter profile.
Add license headers to the source files.
Consider adding a priority to filters. When two spans are completely identical, the priority would be used to determine which span is selected.
This needs to be tested well. Test getFiltersForFilterProfile to ensure the filters are returned in the order given by the priorities (high to low). This was coded in 1.10.0 but not tested or added to the documentation.
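The ordering that such a test would check can be sketched with a comparator. `PrioritizedFilter` and `FilterOrdering` are hypothetical stand-ins and do not reflect how getFiltersForFilterProfile is actually implemented:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a filter that carries a priority.
final class PrioritizedFilter {
    final String name;
    final int priority;

    PrioritizedFilter(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }
}

final class FilterOrdering {

    // Returns a copy sorted from highest to lowest priority.
    // List.sort is stable, so filters with equal priority keep their original order.
    static List<PrioritizedFilter> byPriority(List<PrioritizedFilter> filters) {
        List<PrioritizedFilter> sorted = new ArrayList<>(filters);
        sorted.sort(Comparator.comparingInt((PrioritizedFilter f) -> f.priority).reversed());
        return sorted;
    }
}
```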
Disable this logging:
Jan 16 15:31:14 ip-10-0-2-32.ec2.internal bash[3348]: 2021-01-16 15:31:14.544 ERROR 3363 --- [nio-8080-exec-6] c.m.p.s.validators.DateSpanValidator : Text '3/2018' could not be parsed: Unable to obtain LocalDate from TemporalAccessor: {MonthOfYear=3, Year=2018},ISO of type java.time.format.Parsed
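The message arises because '3/2018' carries a month and a year but no day, so it cannot be resolved to a LocalDate. Besides silencing the log, one option is to fall back to YearMonth for such values. The sketch below uses a hypothetical `LenientDateParser`, and the two patterns are assumptions rather than the formats DateSpanValidator actually uses:

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of the Phileas API.
final class LenientDateParser {

    // Assumed formats for illustration.
    private static final DateTimeFormatter FULL = DateTimeFormatter.ofPattern("M/d/uuuu");
    private static final DateTimeFormatter MONTH_YEAR = DateTimeFormatter.ofPattern("M/uuuu");

    // Tries a full date first, then month/year (mapped to the first day of the month).
    static Optional<LocalDate> parse(String text) {
        try {
            return Optional.of(LocalDate.parse(text, FULL));
        } catch (DateTimeParseException e) {
            // Not a full date; try the month/year form next.
        }
        try {
            return Optional.of(YearMonth.parse(text, MONTH_YEAR).atDay(1));
        } catch (DateTimeParseException e) {
            // Expected for non-date text, so no error-level logging here.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional lets the caller decide whether an unparseable value deserves a log entry at all.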
Add OR boolean operator to grammar.
Currently, OR can be accomplished to some degree by using multiple filter strategies.
It would be ideal to allow expressions like:
context == 'test' and confidence > 1.0 or token == 'asdf'
Add text preprocessing options for NER filter. The options should go in Philter's configuration file since they apply to the model and not to a filter profile.
Adaptive confidence threshold calculations should be distributed. The DescriptiveStatistics is local, so when multiple instances are running, each instance will have its own calculations, which is not ideal.
Fix index out of bounds when redacting PDF. It happens when the identifier is the only text on a given line.
Add an optional parameter to the FirstName filter that requires a Surname immediately after.
Likewise, add an optional parameter to the Surname filter that requires a FirstName immediately preceding it.
Both options can be set independently, and both should default to false.
When either option is set to true, that filter should only report a span when it is preceded/followed by a span from the other filter.
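The adjacency rule described above might look like the following for the FirstName side. `SimpleSpan` and `AdjacentNameFilter` are hypothetical names, and "immediately after" is assumed here to mean separated only by whitespace:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical minimal span: character offsets [start, end) into the text.
final class SimpleSpan {
    final int start;
    final int end;

    SimpleSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class AdjacentNameFilter {

    // Keeps only the first-name spans that are immediately followed by a surname span.
    static List<SimpleSpan> firstNamesFollowedBySurname(
            String text, List<SimpleSpan> firstNames, List<SimpleSpan> surnames) {
        return firstNames.stream()
                .filter(first -> surnames.stream().anyMatch(sur -> isAdjacent(text, first, sur)))
                .collect(Collectors.toList());
    }

    // True when 'after' starts past the end of 'before' with only whitespace in between.
    static boolean isAdjacent(String text, SimpleSpan before, SimpleSpan after) {
        return after.start > before.end
                && text.substring(before.end, after.start).trim().isEmpty();
    }
}
```

The Surname option would be the mirror image: keep a surname span only when some first-name span satisfies `isAdjacent(text, firstName, surname)`.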
Person's name filter should be able to use Apache OpenNLP maxent models.
Allow NerFilterStrategy to apply to specific types, such as PER or LOC.
Create medical abbreviation filter to identify false positives.
One possible list of abbreviations: The BioText Project
// Convert all caps words to just first letter capitalized.
// This changes things like JAMES SMITH to James Smith which the model likes better.
input = WordUtils.capitalizeFully(input);
returns
George Washington Was President
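If the concern is that capitalizeFully also capitalizes words that were already lowercase ("was president" becoming "Was President"), one alternative is to rewrite only the words that are entirely uppercase. The sketch below uses plain java.util.regex and a hypothetical `AllCapsNormalizer` class; note that it would also turn acronyms such as SSN into Ssn, which may or may not be acceptable:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Phileas API.
final class AllCapsNormalizer {

    // A word made of two or more uppercase letters, e.g. JAMES or SMITH.
    private static final Pattern ALL_CAPS_WORD = Pattern.compile("\\b\\p{Lu}{2,}\\b");

    // Converts all-caps words to first-letter-capitalized and leaves other words untouched.
    static String normalize(String input) {
        Matcher matcher = ALL_CAPS_WORD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String word = matcher.group();
            String fixed = word.charAt(0) + word.substring(1).toLowerCase(Locale.ROOT);
            matcher.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```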
Add option for possessives to custom dictionary and NER filters.
Condition should be a list of strings instead of just a string. As written now, there is a one-to-one between condition and filter strategy.
This allows for multiple conditions for a given filter strategy. This is how it was done in Philter Studio before it was discovered that "condition" is just a single string in the filter condition.
Add capitalization property to Lucene index filters. When set to true, the first letter of the term has to be capitalized.
OpenNLP SentenceModel usage should include a sentence model.
In PhileasFilterServiceTest.endToEnd15(), the name “David O’Brien” is not being identified. “David O '” is being found, but the space between the O and the apostrophe is causing findByRegex to return -1.
Allow filter profiles to be written in YAML.
The goal is to be able to reuse filters instead of creating new ones each time.
Look at https://github.com/Netflix/derand for potential for a false-positive detector.
Ignore span based on whitespace --
Ignore "George Washington"
what about
"George Washington"
Allow for spaces in tracking numbers.