lexicalcases's Introduction

Hey! I'm Fez.

Check out what I've been up to!


Wolfram Language

Paclets

  • Lexical Cases: Extract substrings matching a lexical pattern
  • Macro Tools: Experimental Wolfram Language functions

Functions

  • SetComplementMap: Map a function over elements at specified positions and another function to the rest
  • MapAtPart: Map functions to specified parts of an expression

lexicalcases's People

Contributors

dishmint

lexicalcases's Issues

Implement patterns as StringExpressions

[Screenshot: Screen Shot 2021-11-25 at 12.39.16 AM]

In the above example I'm getting the verbs, not the quantities.

I'm wondering if I can use native pattern syntax; this might be much easier. I wouldn't have to specify a bunch of utility functions to get the same behavior the pattern functions already offer. If anything, I just need to keep the TextType heads.

Strings with whitespace need to be PatternSequences

https://github.com/dishmint/TextSequenceCases/blob/6100ad93e2a7255d89382f55b3042772ebac03cd/TextSequenceCases.wl#L47-L54

[Screenshot: Screen Shot 2021-08-28 at 6.01.57 PM]

Strings containing whitespace will not match in SequenceCases, because the source text is split into single-word tokens. Results like "Elon Musk" probably need to become PatternSequence["Elon", "Musk"].

Code to test:

tp3 = TextPattern[TextType["Adjective"], "books", OptionalTextPattern["from" | "by"], TextType["Person"]];
TextSequenceCases["I've been reading some cool books by Elon Musk.", tp3]
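
The whitespace problem can be reproduced without the package. The following is a minimal sketch using only built-in functions; toWordSequence is a hypothetical helper name, not something defined in TextSequenceCases.wl, and the word list is a hand-tokenized stand-in for the source text.

```wolfram
(* toWordSequence is a hypothetical helper, not part of the package *)
toWordSequence[s_String] := PatternSequence @@ StringSplit[s]

(* Hand-tokenized stand-in for the source text *)
words = {"cool", "books", "by", "Elon", "Musk"};

(* A single element "Elon Musk" can never match two separate tokens *)
withString = SequenceCases[words, {"by", "Elon Musk"}];

(* Splitting on whitespace lets the name match token by token *)
withSequence = SequenceCases[words, {"by", toWordSequence["Elon Musk"]}];
```

Here withString comes back empty, while withSequence contains the three-token match.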

Rename TextPatternCases to LexicalCases

I think LexicalCases is sleeker and more descriptive. TextTypes are essentially lexical categories, so I think having Lexical in the function name is nice.

[LexicalCasesOnString] StringPosition of list-matches will give incorrect results

Map[AssociationThread[{"Match", "Position"} -> #] &]@With[
  {cases = MatchTrim[OptionValue["StringTrim"]]@DeleteDuplicates@StringCases[source, RX]},
  Thread[{cases, Map[StringPosition[source, #] &][cases]}]
]

This needs to change because matches returned as lists (for example, from a replacement rule) are not substrings of the source, so StringPosition cannot find them and returns incorrect results. I implemented it this way because threading matches against positions raised an unequal-length error. Instead, StringPosition needs to search for the pattern itself, and then I need to combine the matches and positions appropriately.

Example pattern:

LexicalPattern[adv : TextType["Adverb"], adj : TextType["Adjective"], "music"] :> {adv, adj}
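
One possible shape for the fix, sketched with built-in functions only: search for the pattern (not the match) with StringPosition, and pair the positions with the StringCases results. The literal alternatives below are stand-ins for TextType["Adverb"] and TextType["Adjective"], and the source string is invented for illustration.

```wolfram
source = "very loud music and quite soft music";

(* Literal alternatives stand in for TextType["Adverb"] and TextType["Adjective"] *)
pattern = (adv : ("very" | "quite")) ~~ " " ~~ (adj : ("loud" | "soft")) ~~ " music";

(* StringPosition defaults to Overlaps -> True and StringCases to Overlaps -> False,
   so the option is set explicitly to keep the two result lists aligned *)
positions = StringPosition[source, pattern, Overlaps -> False];
matches = StringCases[source, pattern :> {adv, adj}, Overlaps -> False];

(* One association per match, pairing the rule output with where the pattern matched *)
result = MapThread[<|"Match" -> #1, "Position" -> #2|> &, {matches, positions}];
```

Because both calls scan the same pattern with the same overlap setting, the lists have equal length and the threading no longer fails. Deduplicating matches, as the current code does, would have to happen after this pairing.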

[OptionalLexicalPattern] Need to consider implication of surrounding patterns in 0 instance case

OptionalLexicalPattern needs to be resolved differently. Note the location of the optional in the pattern below. An OptionalLexicalPattern matches either its arguments or an empty string, so when its argument is absent, the pattern expects a sequence of two whitespace characters where there should be only one.

LexicalPattern["Alice ", TextType["Verb"], " ", TextType["Preposition"], " ", OptionalLexicalPattern["the"], " ", TextType["Noun"], WordBoundary]
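
One way to avoid the doubled space, sketched here as an assumption and not as the package's actual resolution, is to let the optional element carry its own trailing separator. optionalWord is a hypothetical helper name.

```wolfram
(* optionalWord is a hypothetical helper: the optional word owns its trailing space,
   so the zero-instance case contributes nothing at all *)
optionalWord[w_String] := (w ~~ " ") | ""

pattern = "to " ~~ optionalWord["the"] ~~ "market";

(* Tested against: word present, word absent, word absent with a doubled space *)
checks = StringMatchQ[{"to the market", "to market", "to  market"}, pattern];
```

With this arrangement the first two strings match and the doubled-space string does not.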

Improve Text Tokenization

  • TextWords doesn't respect sentence boundaries

One option is to compile the text pattern to a RegularExpression, so that the source text doesn't need to be 'tokenized' into a list of words.

ConvertToWikipediaSearchQuery needs refactoring

https://github.com/dishmint/TextSequenceCases/blob/2df7a87349f99ee5a7b16970d63c98e132ffd28a/TextSequenceCases.wl#L97-L103

Calling ConvertToWikipediaSearchQuery on this pattern produces an empty string, which WikipediaSearch can't handle, due to deletion of OrderlessTextPattern objects.

TextPattern[TextType["Adjective"], OrderlessTextPattern["movie" | "movies", OptionalTextPattern["from" | "by"], TextType["Person"]]]

[Screenshot: Screen Shot 2021-09-02 at 1.29.50 AM]

OrderlessTextPattern objects shouldn't be deleted. Instead, the TextPattern should expand to cover all orderings, or just one, since these queries are meant as keywords for WikipediaSearch (the function name should reflect that: ConvertToWikipediaSearchKeywords).

So then, what happens with TextPatterns whose arguments dissolve, thereby producing " "? A random sample of Wikipedia articles would suffice.
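
As a rough sketch of the "all orderings" expansion, assuming the orderless arguments have already been reduced to literal keyword strings: expandOrderings is a hypothetical helper, and the two keywords are taken from the example pattern above.

```wolfram
(* expandOrderings is a hypothetical helper: one keyword string per ordering *)
expandOrderings[terms : {__String}] := StringRiffle[#, " "] & /@ Permutations[terms]

keywords = expandOrderings[{"movies", "by"}];
```

Taking only First[keywords] would correspond to the "just one ordering" option.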

Support String Pattern Symbols

  • DigitCharacter
  • LetterCharacter
  • WhitespaceCharacter
  • WordCharacter
  • WordBoundary

Except doesn't work on words, but works on character types above.

  • Except

The explicit nature of LexicalPatterns doesn't warrant the use of Longest or Shortest.

  • Longest
  • Shortest

Support option to return a TextType Association

https://github.com/dishmint/TextSequenceCases/blob/fca402def6ac23cef589908277048a7c705259a5/TextSequenceCases.wl#L115-L118

It would be useful to have an association returned where each element in the result has a key corresponding to its TextType.

So, instead of this:

{{"generally", "extinct", "species"}, {"aboriginally", "distinct", "species"}, ...}

You'd get something like:

{
    <| "Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>,
    <| "Adverb" -> "aboriginally", "Adjective" -> "distinct", "Text" -> "species"|>,
    ...
    }

[Screenshot: Screen Shot 2021-08-28 at 7.52.22 PM]

This would avoid a retroactive tagging step on the user's part when performing analysis on the results.
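
A minimal sketch of the tagging step, assuming the TextType names are known in pattern order (the types list here is written out by hand, not derived from a pattern):

```wolfram
(* Keys in pattern order; "Text" labels the literal string element *)
types = {"Adverb", "Adjective", "Text"};

matches = {
  {"generally", "extinct", "species"},
  {"aboriginally", "distinct", "species"}
};

(* Pair each match with the type names, position by position *)
tagged = AssociationThread[types -> #] & /@ matches;
```

A pattern with two elements of the same TextType would need distinct keys, since AssociationThread keeps only the last value for a repeated key.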

TextSequenceSummary

The result should be a TextSequenceSummary object with accessors for:

  • Data
  • Relative Counts
  • MatchFrequencyPerSentence (?)
    • The sentence can't be checked unless the source text is tokenized by WordBoundary instead of Whitespace.
    • Would tokenizing by WordBoundary (thereby preserving punctuation, etc.) cause a performance loss?

Support Replacement Rules in TextPattern

Replacement rules should work:

TextPatternCases[sourcetext, TextPattern["this is a", adj:TextType["Adjective"], TextType["Noun"]] :> <|"Adjective" -> adj |>]

[LexicalCases] Add File support

File specs should be valid input, that is, expressions with the head File:

LexicalCases[File["path/to/file"], LexicalPattern[...]]
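
One way this could be wired up is an extra definition that imports the file as text and delegates to the string case. In the sketch below, casesFrom is a hypothetical stand-in for LexicalCases that simply wraps StringCases, and the temporary file name is made up for the demonstration.

```wolfram
(* casesFrom is a hypothetical stand-in for LexicalCases *)
casesFrom[source_String, patt_] := StringCases[source, patt]

(* File input: read the file as plain text, then reuse the string definition *)
casesFrom[File[path_String], patt_] := casesFrom[Import[path, "Text"], patt]

(* Round trip through a temporary file *)
tmp = Export[
   FileNameJoin[{$TemporaryDirectory, "lexical-cases-demo.txt"}],
   "red fish, blue fish", "Text"];
found = casesFrom[File[tmp], "fish"];
```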

[LexicalCases] Incorrect Matches

[Screenshot: Screen Shot 2021-11-29 at 10.11.22 PM]

"calculatin" did not appear in the second match. This may be because StringCases looks for matching substrings, so "g" was considered an appropriate match. It could also be that the text type expansion picked up "g" as a noun or adjective.

[Wikipedia] Match and Missing counts are incorrect

When searching 5000 articles, a Missing count of 9000 was returned. This is impossible: there should be at most one Missing["NoMatchFound"] per article without matches, so the count can never exceed the number of articles.

A similar issue occurred for match counts, where the count given was effectively the number of articles a match appeared in, not the number of occurrences of the word.
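
The difference between the two counts can be sketched on made-up data. The association below is invented for illustration and assumes matches are already grouped by article, with an empty list for an article without matches.

```wolfram
(* Invented sample: article name -> list of matches found in it *)
matchesByArticle = <|"A" -> {"cat", "cat", "dog"}, "B" -> {"cat"}, "C" -> {}|>;

(* Number of occurrences of each match across all articles *)
occurrences = Counts[Catenate[Values[matchesByArticle]]];

(* Number of articles each match appears in (what was being reported) *)
articlesContaining = Counts[Catenate[DeleteDuplicates /@ Values[matchesByArticle]]];

(* One "missing" per article without matches, never more than the article count *)
missing = Count[Values[matchesByArticle], {}];
```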

Special characters in source text need escaping

This is a problem... VerbPhrase uses up a lot of memory. I can try it on a small text to see if the same issue occurs. This query should be possible, but the scope of fixing it might be beyond the development of this function.

[Screenshot: Screen Shot 2021-09-10 at 10.21.11 PM]

[Documentation] Add Notes doc for comments on best practices

  • Suppress output for a speed increase
  • Increase the MaxItems option for more match opportunities
  • Partial matches for TextTypes are supported by default (no explicit WordBoundary)
  • Ensure full-word matches by adding WordBoundary, supported via the BoundedString function

[LexicalCases] Delegate service definitions to separate files

The service functionality should be split up. A separate package file for each supported service, for example, LexicalCasesWikipedia.wl, and LexicalCasesArXiv.wl. Each file would contain code for query parsing consistent with that service. This would clean up LexicalCases.wl and make it easier to read.

Originally posted by @dishmint in #1 (comment)

[LexicalCasesOnString] Is it more performant to convert LP to SE for all source text before searching

Right now I'm calling LexicalPatternToStringExpression per source, at the same step as searching. I'm wondering if I should have all the string expressions generated before searching. Then I could use MapThread, which takes one list of sources and one list of patterns:

MapThread[LexicalCasesOnString[#1, #2] &, {{source1, source2}, {pattern1, pattern2}}]

Or use MapIndexed, where #2 is the position as a list, so it has to be unwrapped with First:

texts = {text1, text2, ...};
MapIndexed[LexicalCasesOnString[texts[[First[#2]]], #1] &, {pattern1, pattern2, ...}]

(parallelized?)

I'll also need to do some profiling of the code before coming to any conclusions. All the more reason to pacletize, so I can profile the code from Workbench.

[LexicalCases] Support list of strings as input

Support a first argument list of strings in LexicalCases and have it work like LexicalCasesFromWikipedia, that is, instead of associating matches with an article, associate them with a file name. Though, I suppose the question is, would you want a separate result for each text, or an aggregate result?
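
Both result shapes are cheap to derive from a per-text result, which suggests returning that and letting the user aggregate. A sketch under that assumption, with StringCases standing in for LexicalCases and invented file names and texts:

```wolfram
(* Invented sample: file name -> text *)
texts = <|"a.txt" -> "red fish, blue fish", "b.txt" -> "one fish"|>;

(* Separate result per text, keyed by file name *)
perText = StringCases[#, "fish"] & /@ texts;

(* Aggregate result, derived from the per-text one *)
aggregate = Catenate[Values[perText]];
```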
