The phileas from philterd

Street addresses should support suite and apartment numbers

Street addresses should support suite and apartment numbers.

Validate filter profiles when loading them

Validate filter profiles when loading them. If a strategy has "CRYPTO_REPLACE" make sure there is a Crypto object, etc.
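
A minimal sketch of what such a load-time check could look like. The StrategyConfig and ProfileValidator classes are hypothetical stand-ins, not the actual Phileas filter profile model, and the hasCrypto flag simply stands for "a Crypto object is present in the profile".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of one filter strategy in a profile.
final class StrategyConfig {
    final String filterName;
    final String strategy;
    final boolean hasCrypto;

    StrategyConfig(String filterName, String strategy, boolean hasCrypto) {
        this.filterName = filterName;
        this.strategy = strategy;
        this.hasCrypto = hasCrypto;
    }
}

// Hypothetical validator that collects every problem instead of failing on the first one.
final class ProfileValidator {

    static List<String> validate(List<StrategyConfig> strategies) {
        List<String> problems = new ArrayList<>();
        for (StrategyConfig config : strategies) {
            // CRYPTO_REPLACE cannot work without key material, so flag it at load time.
            if ("CRYPTO_REPLACE".equals(config.strategy) && !config.hasCrypto) {
                problems.add(config.filterName + ": CRYPTO_REPLACE requires a Crypto object");
            }
        }
        return problems;
    }
}
```

Returning a list of problems lets the loader report all issues in a profile at once; further rules would be added as extra checks inside the loop.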

Incorporate zip code database

The goal is to reduce zip code false positives by including a look up when text matches a potential zip code. Because zip codes change, the lookup should not be definitive but should be an additional factor when determining if it is a true positive.
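
One way to treat the lookup as an additional factor rather than a hard rule is to nudge the confidence up or down. The sketch below assumes a hypothetical ZipCodeConfidence helper, an in-memory set of known zip codes, and arbitrary adjustment sizes; none of these are part of Phileas.

```java
import java.util.Set;

// Hypothetical helper; not part of the Phileas API.
final class ZipCodeConfidence {

    // Assumed adjustment sizes; real values would need tuning.
    private static final double KNOWN_BOOST = 0.25;
    private static final double UNKNOWN_PENALTY = 0.15;

    private final Set<String> knownZipCodes;

    ZipCodeConfidence(Set<String> knownZipCodes) {
        this.knownZipCodes = knownZipCodes;
    }

    // Returns the regex confidence nudged by the lookup and clamped to [0, 1].
    double adjust(String candidate, double regexConfidence) {
        // Compare on the five-digit prefix so ZIP+4 values such as 90210-1234 still match.
        String fiveDigit = candidate.length() >= 5 ? candidate.substring(0, 5) : candidate;
        double adjusted = knownZipCodes.contains(fiveDigit)
                ? regexConfidence + KNOWN_BOOST
                : regexConfidence - UNKNOWN_PENALTY;
        return Math.max(0.0, Math.min(1.0, adjusted));
    }
}
```

Because an unknown code only lowers the confidence instead of rejecting the span, a newly issued zip code that is missing from the database can still be reported.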

Rename CRYPTO_REPLACE to AESSIV_REPLACE

Rename CRYPTO_REPLACE to AESSIV_REPLACE. This also means updating the encryption algorithm to use AES-SIV instead.

POS post filter does not handle multi-word tokens

Given the input:

"George Washington was president and his ssn was 123-45-6789 and he lived at 90210."

The POS filter fails because the tokens are "George" and "Washington" individually, not "George Washington". The filter needs to be changed to allow for multi-word tokens.

Read list of ignored words from an external database

Read list of ignored words from an external database. Allow the user to specify the SQL query to return a list of words.

This can be used for more than ignored words, such as dictionary terms, etc.

Change to Java 17

Street addresses with suite numbers

445 Minnesota Street, Suite 175

CityFilterTest test cases are using sensitivity level that does not match the function name

CityFilterTest test cases are using sensitivity level that does not match the function name.

Shift dates by context (a person's name) in the text

Shift dates by context (a person's name) in the text. This is so each document about a specific person has its dates shifted consistently.
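
A possible approach, sketched under assumptions: derive the shift from a hash of the context so that no per-person state has to be stored. The ContextDateShifter class, the salt, and the maximum shift are all hypothetical and not existing Phileas features.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical helper that shifts every date for the same context by the same amount.
final class ContextDateShifter {

    private final String salt;
    private final int maxShiftDays;

    ContextDateShifter(String salt, int maxShiftDays) {
        this.salt = salt;
        this.maxShiftDays = maxShiftDays;
    }

    // Maps a context (e.g. a person's name) to a stable offset in [-maxShiftDays, maxShiftDays].
    int offsetFor(String context) {
        try {
            String normalized = context.trim().toLowerCase(Locale.ROOT);
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((salt + ":" + normalized).getBytes(StandardCharsets.UTF_8));
            // Fold the first four bytes of the hash into an int.
            int value = ((hash[0] & 0xFF) << 24) | ((hash[1] & 0xFF) << 16)
                    | ((hash[2] & 0xFF) << 8) | (hash[3] & 0xFF);
            return Math.floorMod(value, 2 * maxShiftDays + 1) - maxShiftDays;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    LocalDate shift(LocalDate date, String context) {
        return date.plusDays(offsetFor(context));
    }
}
```

Since every date for a given person moves by the same number of days, the intervals between that person's dates are preserved across documents. Note that the derived offset can be zero for some contexts.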

Failing tests on OSX M2

Due to ONNX Runtime on M2.

[ERROR] Errors:
[ERROR]   PersonsV2FilterTest.filter1:64 » UnsatisfiedLink no onnxruntime in java.librar...
[ERROR]   PersonsV2FilterTest.filter2:96 » NoClassDefFound Could not initialize class ai...
[ERROR]   PersonsV2FilterTest.filter3:135 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter4:172 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter5:205 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter6:240 » NoClassDefFound Could not initialize class a...

Use stop words to shorten physician names

Use stop words to shorten physician names. Instead of taking the entire n-gram, see if we can use stop words to shorten the span by cutting it based on the location of the stop words.

Look at each token in the physician name span, working inward from both ends, to see whether it is a stop word. If it is, condense the span.
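
The outside-in trimming could be sketched as follows. StopWordSpanTrimmer is a hypothetical helper that works on a list of tokens; a real implementation would also have to adjust the span's character offsets.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; operates on tokens rather than on Phileas span objects.
final class StopWordSpanTrimmer {

    // Drops stop words from both ends of the token list; interior stop words are kept.
    static List<String> trim(List<String> tokens, Set<String> stopWords) {
        int start = 0;
        int end = tokens.size();
        // Walk inward from the left edge.
        while (start < end && stopWords.contains(tokens.get(start).toLowerCase(Locale.ROOT))) {
            start++;
        }
        // Walk inward from the right edge.
        while (end > start && stopWords.contains(tokens.get(end - 1).toLowerCase(Locale.ROOT))) {
            end--;
        }
        return List.copyOf(tokens.subList(start, end));
    }
}
```

The stop words are assumed to be stored in lower case. Stopping at the first non-stop word on each side keeps names that legitimately contain a stop word in the middle.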

Add AWS access/secret key detection support

Add AWS access/secret key detection support.

Amazon Macie includes it; see "Using managed data identifiers in Amazon Macie" in the Amazon Macie documentation.

Add Phileas ONNX documentation

Add Phileas ONNX documentation.

Add an option to the Persons filter to also look for titles and suffixes

Add an option to the Persons filter to also look for titles and suffixes.

Remove unnecessary bounding box property on Identifier

Remove unnecessary bounding box property on Identifiers. Bounding boxes are set through the Graphical property on Identifiers.

How to launch Phileas?

Hi,

I am not very knowledgeable about Java, but much to my surprise I did manage to write a simple client using your instructions and get it to compile and run using Maven. However, I have not been able to figure out how to launch the Phileas service it expects at https://127.0.0.1:8080. I was wondering how to do that?

Cheers,
Andrew

Revisit intelligent NER filtering based on confidence values

Revisit intelligent NER filtering based on confidence values.

Support non-USD currencies

Support non-USD currencies. Need to add options to the filter strategy to designate the type of currency (or none for all types).

Ignore cities in court names

Ignore cities when they appear as part of a court name, e.g. District Court of Baltimore City.

This requires consideration about where to implement the feature. If we are looking for city names then it seems to be a function of the CITY filter. So that would require a flag in the CITY filter strategy to ignore the city if it is given as part of a court name.

Court names seem to be either:

… Court of … - Supreme Court of West Virginia
… Court of the … - Supreme Court of the United States
… Court - Wisconsin Supreme Court
… Court for the … - United States District Court for the Eastern District of Wisconsin

The Restriction class could probably be used as a means of doing a lookup.
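
As an alternative to a lookup, the four shapes listed above can be approximated with a single regular expression. This is only a sketch: CourtNameCheck is a hypothetical helper, and the pattern assumes court names are written with capitalized words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an approximation of the listed court name shapes.
final class CourtNameCheck {

    private static final Pattern COURT_NAME = Pattern.compile(
            // Optional capitalized words before "Court", e.g. "Wisconsin Supreme".
            "\\b(?:[A-Z][a-z]+\\s+)*Court"
            // Optional "of ...", "of the ...", or "for the ..." followed by capitalized words.
            + "(?:\\s+(?:of|for)(?:\\s+the)?(?:\\s+[A-Z][a-z]+)+"
            // Optional trailing "of ..." as in "Eastern District of Wisconsin".
            + "(?:\\s+of(?:\\s+[A-Z][a-z]+)+)*)?");

    // Returns true when the span [spanStart, spanEnd) lies entirely inside a court name match.
    static boolean isInsideCourtName(String text, int spanStart, int spanEnd) {
        Matcher matcher = COURT_NAME.matcher(text);
        while (matcher.find()) {
            if (matcher.start() <= spanStart && spanEnd <= matcher.end()) {
                return true;
            }
        }
        return false;
    }
}
```

A CITY filter with the proposed flag enabled could call such a check on each city span and drop the spans for which it returns true.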

Allow individual filter regex to be enabled/disabled

Allow individual filter regex to be enabled/disabled. The purpose is to allow only a set of regexes to be enabled.

There could be magic environment variables that can be set/unset to enable/disable the regex patterns. (Or some other method.)

Street address filter should use the street address filter strategies

Allow tracking numbers to be individually enabled

Add an option to the Tracking Number filter to allow for enabling/disabling individual shippers.

Use the contextual terms in a regex filter to set the confidence

Use the contextual terms in a regex filter to set the confidence.

Spelled out numbers in place of digits

How to handle spelled out numbers in place of digits in the filters that use regular expressions with digits?

Identifier filters should be able to specify the group number in the filter profile

Identifier filters should be able to specify the group number in the filter profile.

Add license headers to the source files

Add license headers to the source files.

Add a priority to each filter

Consider adding a priority to each filter. When two spans are completely identical, the priority would determine which span is selected.

This needs to be tested well. Will need to test:

getFiltersForFilterProfile - to ensure the filters are in the order given by the priorities (high to low).
Identical spans found by different filters only return the span having the highest priority.
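
The second test case could be exercised against logic along these lines. PrioritySpan and SpanPriorityResolver are hypothetical names used for illustration and do not reflect how 1.10.0 implements priorities.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified span carrying the priority of the filter that found it.
final class PrioritySpan {
    final int start;
    final int end;
    final String filterType;
    final int priority;

    PrioritySpan(int start, int end, String filterType, int priority) {
        this.start = start;
        this.end = end;
        this.filterType = filterType;
        this.priority = priority;
    }
}

final class SpanPriorityResolver {

    // For spans with identical offsets, keeps only the one with the highest priority.
    static List<PrioritySpan> dropLowerPriorityDuplicates(List<PrioritySpan> spans) {
        Map<String, PrioritySpan> best = new LinkedHashMap<>();
        for (PrioritySpan span : spans) {
            String key = span.start + ":" + span.end;
            // On a tie the span seen first is kept.
            best.merge(key, span,
                    (current, candidate) -> candidate.priority > current.priority ? candidate : current);
        }
        return new ArrayList<>(best.values());
    }
}
```

Only spans with exactly the same start and end offsets compete here; overlapping but non-identical spans are left untouched.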

Was coded in 1.10.0 but not tested or added to documentation.

Disable dependency logging

Disable this logging:

Jan 16 15:31:14 ip-10-0-2-32.ec2.internal bash[3348]: 2021-01-16 15:31:14.544 ERROR 3363 — [nio-8080-exec-6] c.m.p.s.validators.DateSpanValidator : Text '3/2018' could not be parsed: Unable to obtain LocalDate from TemporalAccessor: {MonthOfYear=3, Year=2018},ISO of type java.time.format.Parsed

Add OR boolean operator to grammar

Add OR boolean operator to grammar.

Currently, OR can be accomplished to some degree by using multiple filter strategies.

It would be ideal to allow expressions like:

context == 'test' and confidence > 1.0 or token == 'asdf'

Add text preprocessing options for NER filter

Add text preprocessing options for NER filter. The options should go in Philter's configuration file since they apply to the model and not to a filter profile.

Adaptive confidence threshold calculations should be distributed

Adaptive confidence threshold calculations should be distributed. The DescriptiveStatistics object is local to each instance. When multiple instances are running, each one keeps its own calculations, which is not ideal.

Fix index out of bounds when redacting PDF

Fix index out of bounds when redacting PDF. It happens when the identifier is the only text on a given line.

Add options to make first names and surnames be adjacent

Add an optional parameter to the FirstName filter that requires a Surname immediately after.

Likewise, add an optional parameter to the Surname filter that requires a FirstName immediately preceding it.

Both options can be set independently, and both should default to false.

When either option is set to true, that filter should only report a span when it is preceded/followed by a span from the other filter.
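
A minimal sketch of the FirstName side of this check, assuming spans are plain start/end offsets into the text. NameSpan and AdjacentNameFilter are hypothetical names; the Surname option would be the mirror image.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified span: character offsets with an exclusive end.
final class NameSpan {
    final int start;
    final int end;

    NameSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class AdjacentNameFilter {

    // Keeps only the first-name spans that are followed by a surname span
    // with nothing but whitespace in between.
    static List<NameSpan> requireFollowingSurname(
            String text, List<NameSpan> firstNames, List<NameSpan> surnames) {
        List<NameSpan> kept = new ArrayList<>();
        for (NameSpan first : firstNames) {
            for (NameSpan surname : surnames) {
                boolean follows = surname.start > first.end
                        && text.substring(first.end, surname.start).trim().isEmpty();
                if (follows) {
                    kept.add(first);
                    break;
                }
            }
        }
        return kept;
    }
}
```

With the option left at its default of false, the filter would skip this step and report every first-name span as it does today.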

Person's name filter should be able to use OpenNLP maxent models

Person's name filter should be able to use Apache OpenNLP maxent models.

Allow NerFilterStrategy to apply to specific types

Allow NerFilterStrategy to apply to specific types, such as PER or LOC.

Create medical abbreviation filter

Create medical abbreviation filter to identify false positives.

One possible list of abbreviations: The BioText Project

Implement document analyzer for subpoenas

Implement document analyzer for subpoenas.

Only change the capitalization of words that are fully capitalized

        // Convert all caps words to just first letter capitalized.
        // This changes things like JAMES SMITH to James Smith which the model likes better.
        input = WordUtils.capitalizeFully(input);

returns

George Washington Was President
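
To avoid capitalizing words such as "was" and "president", only the words that are fully upper case could be rewritten. The sketch below uses only the JDK; AllCapsNormalizer is a hypothetical helper and not a drop-in replacement for WordUtils.capitalizeFully.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that leaves mixed-case and lower-case words untouched.
final class AllCapsNormalizer {

    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher matcher = WORD.matcher(input);
        int last = 0;
        while (matcher.find()) {
            // Copy whatever sits between words (spaces, digits, punctuation) unchanged.
            out.append(input, last, matcher.start());
            String word = matcher.group();
            // Single letters such as "I" or "A" are left alone.
            boolean allCaps = word.length() > 1 && word.equals(word.toUpperCase(Locale.ROOT));
            out.append(allCaps
                    ? word.charAt(0) + word.substring(1).toLowerCase(Locale.ROOT)
                    : word);
            last = matcher.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With this approach "JAMES SMITH was president" becomes "James Smith was president". One limitation is that acronyms such as "SSN" are also rewritten, so a real implementation might need an exception list.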

What format will the API return when retrieving filter profiles?
When saving filter profiles through the API, how to set the format? Content-type header?
The .json extension is used extensively through the filter profile services to find filter profiles on disk.

what about

"George Washington"

Allow for spaces in tracking numbers

Allow for spaces in tracking numbers.
