dishmint / lexicalcases

Extract substrings matching a lexical pattern

Home Page: https://www.paclets.com/FaizonZaman/LexicalCases

License: MIT License

Mathematica 100.00%

text-mining text pattern-matching text-analaysis wolfram-language wolfram-mathematica linguistics text-search

lexicalcases's Introduction

Hey! I'm Fez.

Check out what I've been up to!

Wolfram Language

Paclets

Paclet	Description
	Extract substrings matching a lexical pattern
	Experimental Wolfram Language functions

Functions

Function	Description
	Map a function over elements at specified positions and another function to the rest
	Map functions to specified parts of an expression

lexicalcases's Issues

Implement patterns as StringExpressions

In the above example I'm getting the verbs, not the quantities.

I'm wondering if I can use native pattern syntax, which might be much easier. I wouldn't have to specify a bunch of utility functions to get the same behavior the pattern functions already offer. If anything, I just need to keep the TextType heads.

Strings with whitespace need to be PatternSequences

https://github.com/dishmint/TextSequenceCases/blob/6100ad93e2a7255d89382f55b3042772ebac03cd/TextSequenceCases.wl#L47-L54

Strings with whitespace will not match in SequenceCases, so results like "Elon Musk" probably need to become PatternSequence["Elon", "Musk"].

Code to test:

tp3 = TextPattern[TextType["Adjective"], "books", OptionalTextPattern["from" | "by"], TextType["Person"]];
TextSequenceCases["I've been reading some cool books by Elon Musk.", tp3]
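
The conversion could be sketched as follows. This uses plain SequenceCases over a whitespace-split token list rather than TextSequenceCases itself, and `toTokenPattern` is a hypothetical helper name, not part of the package.

```wolfram
(* Hypothetical helper: split a multi-word string into a PatternSequence of tokens. *)
toTokenPattern[s_String] := PatternSequence @@ StringSplit[s]

tokens = StringSplit["reading some cool books by Elon Musk"];

(* The literal "Elon Musk" can never equal a single token, so nothing is found. *)
SequenceCases[tokens, {"by", "Elon Musk"}]

(* The PatternSequence form matches the two adjacent tokens. *)
SequenceCases[tokens, {"by", toTokenPattern["Elon Musk"]}]
```

Single-word strings pass through the same helper unchanged in effect, since a PatternSequence with one element matches that one token.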

Deploy standalone app

InputFields for LexicalPattern and search query
Select Service by drop down

[LexicalCases] Support a list of strings or files as input

A list of strings or files should be processed like multiple Wikipedia articles.

LexicalCases[{string1, string2, ...}, LexicalPattern[...]]

[Documentation] Add docs folder for GitHub pages documentation

There is currently no /docs directory. Add one so that GitHub Pages will work.

Empty matches are counted giving incorrect match count

Add Dashboard property to TextPatternSummary

The Dashboard property features:

A Dataset of max entries
A DateListPlot showing the WordFrequencyData for each of the max entries
A FeatureSpacePlot of the max entries

Rename TextPatternCases to LexicalCases

I think LexicalCases is sleeker and more descriptive. TextTypes are essentially Lexical categories, so I think having Lexical in the function name is nice.

Nothing could be replaced with a user specified default

https://github.com/dishmint/TextSequenceCases/blob/3de98f90091af05e2411d514e78c49dc3f3a846a/TextSequenceCases.wl#L47

The Nothing on this line could be an OptionValue["OptionalDefault"] instead.
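
A minimal sketch of the idea, assuming the optional element is resolved by a small function that receives the list of candidates found. `resolveOptional` is a hypothetical stand-in for the code on the linked line; only the "OptionalDefault" option name comes from the issue.

```wolfram
(* Hypothetical stand-in for the step that resolves an optional element. *)
Options[resolveOptional] = {"OptionalDefault" -> Nothing};

resolveOptional[found_List, OptionsPattern[]] :=
  If[found === {}, OptionValue["OptionalDefault"], First[found]]

(* Default: the empty case vanishes from the surrounding list. *)
{"books", resolveOptional[{}], "Musk"}

(* User-specified default: the empty case leaves a visible placeholder. *)
{"books", resolveOptional[{}, "OptionalDefault" -> Missing["NotPresent"]], "Musk"}
```

Keeping Nothing as the default value preserves the current behavior for callers who do not set the option.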

Update ConvertToWikipediaSearchQuery to handle updated TextPattern syntax

TextPattern[TextType["Determiner"], "king" | "queen"];

resulted in the string "king,queen", whereas

TextPattern[TextType["Determiner"], " ", "king" | "queen"];

resulted in the string "king queen".

Convert package to Paclet

The functionality would benefit from being packaged up into a paclet.

[LexicalCasesOnString] StringPosition of list-matches will give incorrect results

LexicalCases/LexicalCases.wl

Lines 204 to 208 in bc7d238

    
Map[AssociationThread[{"Match", "Position"} -> #] &]@With[
	{cases = MatchTrim[OptionValue["StringTrim"]]@DeleteDuplicates@StringCases[source, RX]},
	Thread[{cases, Map[StringPosition[source, #] &][cases]}]
	]
]

This needs to change because matches returned as lists will not produce correct results from StringPosition. I implemented it this way because the threading returned a same-length error. Instead, StringPosition needs to search for the pattern itself, and then I need to combine the matches and positions appropriately.

Example pattern:

LexicalPattern[adv : TextType["Adverb"], adj : TextType["Adjective"], "music"] :> {adv, adj}
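
One possible shape for the fix, sketched with an ordinary string pattern standing in for the compiled LexicalPattern. The sample text and the `patt` definition are assumptions for illustration. The point is that StringPosition receives the pattern, not the transformed matches.

```wolfram
source = "very loud music and quite soft music";

(* Stand-in string pattern; the real one would come from the LexicalPattern. *)
patt = (adv : ("very" | "quite")) ~~ " " ~~ (adj : ("loud" | "soft")) ~~ " music";

(* Search with the pattern itself, not with the list-valued matches. *)
positions = StringPosition[source, patt, Overlaps -> False];
matches = StringCases[source, patt :> {adv, adj}, Overlaps -> False];

(* Both scans run left to right without overlaps, so they pair up one-to-one. *)
pairs = MapThread[<|"Match" -> #1, "Position" -> #2|> &, {matches, positions}]
```

This sketch skips DeleteDuplicates, because removing duplicate matches before pairing would break the one-to-one correspondence with positions.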

Returning position of subsequence

It might be useful to return the match and its position.

[LexicalCases] TextType match may be different from the word type

In some cases the text that matches a TextType is not of that type in context. For example, here, defined is a verb, when it matched as an adjective.

So I need to figure out how to maintain/represent the grammatical structure in the LexicalPattern.

[OptionalLexicalPattern] Need to consider implication of surrounding patterns in 0 instance case

OptionalLexicalPattern needs to be resolved differently. Note the location of the optional in the pattern below. An OptionalLexicalPattern matches either its arguments or an empty string, so when the optional argument is absent, the pattern requires a sequence of two whitespace characters where there should be only one.

LexicalPattern["Alice ", TextType["Verb"], " ", TextType["Preposition"], " ", OptionalLexicalPattern["the"], " ", TextType["Noun"], WordBoundary]
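
The problem and one possible resolution can be sketched with plain string patterns. Folding the trailing space into the optional element is an assumption about how the fix could work, not the package's current behavior.

```wolfram
(* Naive form: the optional sits between two literal spaces. *)
naive = "over " ~~ ("the" | "") ~~ " " ~~ "moon";

(* Sketch of a fix: the optional word carries its own trailing space. *)
fixed = "over " ~~ ("the " | "") ~~ "moon";

StringMatchQ["over moon", naive]      (* naive demands two spaces here *)
StringMatchQ["over moon", fixed]
StringMatchQ["over the moon", fixed]
```

With the naive form, only "over the moon" and the double-spaced "over  moon" match. The fixed form accepts both natural spellings.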

Improve Text Tokenization

TextWords doesn't respect sentence boundaries.

An option is compiling the text pattern to a RegularExpression, that way the source text doesn't need to be 'tokenized' into a list of words.

[WikipediaSearch] Content keywords only return articles including the keywords

This returns articles with all the keywords,

WikipediaSearch["Content" -> {"marathon", "race", "hike"}]

but what if I want to consider the keywords individually? LexicalCases should have an input form that provides this feature.

Add note for which version of M is supported

This functionality was developed in 12.3. I should note that in the documentation.

Add Properties to extract data displayed in the Dashboard

"PartOfSpeechGroups"
"PercentDataset"

ConvertToWikipediaSearchQuery needs refactoring

https://github.com/dishmint/TextSequenceCases/blob/2df7a87349f99ee5a7b16970d63c98e132ffd28a/TextSequenceCases.wl#L97-L103

Calling ConvertToWikipediaSearchQuery on this pattern produces an empty string, which WikipediaSearch can't handle, due to deletion of OrderlessTextPattern objects.

TextPattern[TextType["Adjective"], OrderlessTextPattern["movie" | "movies", OptionalTextPattern["from" | "by"], TextType["Person"]]]

OrderlessTextPattern objects shouldn't be deleted. Instead the TextPattern should expand, covering all orderings, or just one, since these queries are meant as keywords for WikipediaSearch (the function name should reflect that: ConvertToWikipediaSearchKeywords).

So then, what happens with TextPatterns whose arguments dissolve, thereby producing " "? A random sample of Wikipedia articles would suffice.

[LexicalCases] Singular and Plural alternatives only matching the Singular case

I noticed that the pattern below only matches the shorter string "machine":

LexicalCases[$SampleStringLong, LexicalPattern[TextType["Adjective" | "Noun"], " ", "machine" | "machines"]]
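
A likely cause, stated as an assumption since the issue does not confirm it, is that alternatives are tried left to right, so "machine" succeeds inside "machines" before the longer alternative is attempted. Two possible workarounds are sketched below with plain StringCases. The sample text and the `longestFirst` helper are hypothetical.

```wolfram
text = "a washing machine and two sewing machines";

(* Option 1: anchor the alternatives so "machine" cannot stop inside "machines". *)
StringCases[text, ("machine" | "machines") ~~ WordBoundary]

(* Option 2: hypothetical helper that tries longer alternatives first. *)
longestFirst[alts_List] := Alternatives @@ SortBy[alts, -StringLength[#] &]
StringCases[text, longestFirst[{"machine", "machines"}]]
```

Option 2 has the advantage of not changing what the pattern means at word edges, which matters if partial-word matches are otherwise allowed.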

Support String Pattern Symbols

Except doesn't work on words, but works on character types above.

Except

The explicit nature of LexicalPatterns doesn't warrant the use of Longest or Shortest.

Longest
Shortest

Support option to return a TextType Association

https://github.com/dishmint/TextSequenceCases/blob/fca402def6ac23cef589908277048a7c705259a5/TextSequenceCases.wl#L115-L118

It would be useful to have an association returned where each element in the result has a key corresponding to its TextType.

So, instead of this:

{{"generally", "extinct", "species"}, {"aboriginally", "distinct", "species"}, ...}

You'd get something like:

{
    <| "Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>,
    <| "Adverb" -> "aboriginally", "Adjective" -> "distinct", "Text" -> "species"|>,
    ...
    }

This would avoid a retroactive tagging step on the user's part when performing analysis on the results.
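
A minimal sketch of the tagging step, assuming the keys are already known. Here they are written by hand, whereas in the package they would presumably be derived from the TextTypes in the pattern.

```wolfram
results = {{"generally", "extinct", "species"}, {"aboriginally", "distinct", "species"}};

(* Hand-written keys; in the package these would come from the pattern's TextTypes. *)
keys = {"Adverb", "Adjective", "Text"};

(* Pair each match's elements with the keys, position by position. *)
tagged = Map[AssociationThread[keys -> #] &, results]
```

A pattern that uses the same TextType twice would need distinct keys (for example, an index suffix), because an association cannot hold duplicate keys.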

TextSequenceSummary

The result should be a TextSequenceSummary object with accessors for:

Data
Relative Counts
MatchFrequencyPerSentence (?)
- can't check the sentence unless the source-text tokenization is by WordBoundary instead of Whitespace.
- Would there be performance loss from tokenizing by WordBoundary (therein preserving punctuation etc.?)

Support Replacement Rules in TextPattern

Replacement rules should work:

TextPatternCases[sourcetext, TextPattern["this is a", adj:TextType["Adjective"], TextType["Noun"]] :> <|"Adjective" -> adj |>]

[LexicalCases] Add File support

File specs should be valid input, that is, expressions with the head File:

LexicalCases[File["path/to/file"], LexicalPattern[...]]

[LexicalCases] Cache results for articles and update when they've been changed

Article text doesn't need to be cached, but results could be, and whichever articles have been updated can be re-scanned.

[LexicalCases] Incorrect Matches

calculatin did not appear in the second match. This may be because StringCases is looking for matching substrings, so g was considered appropriate. It could also be that the text type expansion picked up g as a noun or adjective.

Call LexicalCases from WCL Python

I'm considering implementing a version based on the Wolfram Client Library for Python to make LexicalCases available to Python users.

Rename the Repo to TextPatternCases

Reflect the name change in the Repo name

[TextType] Spanning content types cause memory errors

Spanning TextTypes (VerbPhrase,AdjectivePhrase,...) may cause Java errors because they match large patterns.

[Wikipedia] Match and Missing counts are incorrect

When searching 5000 articles, a Missings count of 9000 was returned. This is impossible. There should only be one Missing["NoMatchFound"] per article without matches.

A similar issue occurred for match counts, where the count given was effectively the number of articles a match appeared in, not the number of occurrences of the word.

Special characters in source text need escaping

This is a problem... VerbPhrase uses up a lot of memory. I can try it on a small text to see if the same issue occurs. This query should be possible, but the scope of fixing it might be beyond the development of this function.

Define Format or TextString for TextPatterns

https://github.com/dishmint/TextSequenceCases/blob/797e2f4fe035203d8bf959c306368256b4390fe9/TextSequenceCases.wl#L39-L47

Instead of the custom TextPatternFormat I could use Format or TextString

Format

Format[TextPattern[args__], OutputForm] := StringForm["(> `1` <)", StringRiffle[Map[ToString, {args}], " "]]

TextString

TextPattern /: TextString[TextPattern[args__]] := "(>"<>StringJoin[Map[TextString, {args}]]<>"<)"

Support ctrl+= entities in addition to (or instead of) TextType

The entities should be supported in the text pattern. So long as I can turn them into a form suitable for TextCases to search for their examples.

Adding whitespace to patterns makes the TextPatternString look fluffy

The string form of the TextPattern looks fluffy because whitespace is now added by the user.

[Documentation] Add Notes doc for comments on best practices

Suppressing output for speed increase
Increase MaxItems option for more match opportunities
~~Partial matches for TextTypes (explicit WordBoundary)~~ supported by default
~~Ensure full word matches by adding WordBoundary~~ supported with BoundedString function

[LexicalCasesWikipedia] ProgressIndicator when one article is searched is not useful

Example of problem:

Note how found is at 0, which at the moment can't be avoided. Maybe it would suffice to not have that data show up if there is only one article being searched?

[Documentation] Needs updating given new properties

Some properties were renamed and a few were added. The changes should be reflected in the docs.

[ToTextElementStructure] Support Pattern Symbols

Pattern symbols not hooked up in ToTextElementStructure.

[ToTextElementStructure] Formatting latter arguments of functions like Repeated

I'm not sure {2,3} should be rendered this way.

[LexicalCases] Delegate service definitions to separate files

The service functionality should be split up. A separate package file for each supported service, for example, LexicalCasesWikipedia.wl, and LexicalCasesArXiv.wl. Each file would contain code for query parsing consistent with that service. This would clean up LexicalCases.wl and make it easier to read.

Originally posted by @dishmint in #1 (comment)

[LexicalCasesOnString] Is it more performant to convert LP to SE for all source text before searching

Right now I'm calling LexicalPatternToStringExpression per source at the same step of searching. I'm wondering if I should have all the string expressions generated before searching. Then I could use MapThread:

MapThread[LexicalCasesOnString, {{source1, source2}, {pattern1, pattern2}}]

Or use MapIndexed

texts = {text1, text2, ...};
MapIndexed[LexicalCasesOnString[texts[[First[#2]]], #1]&, {pattern1, pattern2, ...}]

(parallelized?)

I'll also need to do some profiling of the code before coming to any conclusions. All the more reason to pacletize, so I can profile the code from Workbench.

[LexicalSummary] Add WordStem tab

Add WordStem tab where All matches are WordStemmed and grouped by stem.
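
The grouping behind such a tab could look roughly like this. The sample matches are made up, and how the result would be wired into LexicalSummary is left open.

```wolfram
matches = {"running", "runs", "run", "walked", "walking"};

(* Group the raw matches under their stem, using the built-in WordStem. *)
byStem = GroupBy[matches, WordStem]

(* Number of matches per stem, e.g. for a summary column. *)
stemCounts = Length /@ byStem
```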

[LexicalCases] Support list of strings as input

Support a first argument list of strings in LexicalCases and have it work like LexicalCasesFromWikipedia, that is, instead of associating matches with an article, associate them with a file name. Though, I suppose the question is, would you want a separate result for each text, or an aggregate result?
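
Both answers to that question are cheap to provide. The sketch below uses StringCases as a stand-in for the per-text search, and the file names and sample texts are hypothetical.

```wolfram
texts = <|"a.txt" -> "loud music, soft music", "b.txt" -> "soft music"|>;
patt = ("loud" | "soft") ~~ " music";

(* Separate result per text, keyed by file name. *)
perText = Map[StringCases[#, patt] &, texts]

(* Aggregate result: pool every match, then count occurrences. *)
aggregate = Counts[Catenate[Values[perText]]]
```

Since the aggregate can always be derived from the per-text result but not the other way around, returning the per-text form and exposing the aggregate as a property may be the more flexible choice.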

Add VerificationTests

Implement VerificationTests as there currently are none.
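
A first test might take the following shape. StringCases stands in for LexicalCases so the sketch runs without the paclet loaded, and the TestID is a made-up label.

```wolfram
(* StringCases stands in for LexicalCases so this runs without the paclet. *)
test = VerificationTest[
  StringCases["loud music and soft music", ("loud" | "soft") ~~ " music"],
  {"loud music", "soft music"},
  TestID -> "StringCases-AdjectiveMusic"
];

(* "Success" when the actual output matches the expected output. *)
test["Outcome"]
```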

Add second argument to "MatchCountGroups" to limit the number of results returned

For example, the code below would return the 5 most occurring matches

TPSO["MatchCountGroups", 5]
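
Under the hood, the limit could be applied roughly as follows. `topMatchCounts` is a hypothetical helper and the sample matches are made up. The summary object's actual internals are not shown in the issue.

```wolfram
matches = {"loud music", "soft music", "loud music", "new music", "loud music", "soft music"};

(* Hypothetical helper: count the matches and keep at most n of the most frequent. *)
topMatchCounts[m_List, n_Integer?Positive] := Take[Reverse[Sort[Counts[m]]], UpTo[n]]

top2 = topMatchCounts[matches, 2]
```

UpTo keeps the call safe when fewer than n distinct matches exist.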
