Comments (11)

jsvine commented on June 5, 2024

Hi @nezetimesthree, and thanks for your interest in markovify. When you get a chance, please provide code and text that reproduces the problem. Without that, it will unfortunately be quite hard to debug.

from markovify.

nezetimesthree commented on June 5, 2024

of course. here's the code and text file.

from transformers import pipeline
import random
import markovify

model_link = "IProject-10/bert-base-uncased-finetuned-squad2"
question_answerer = pipeline("question-answering", model=model_link)

with open('mayakovsky.txt', 'r') as file:
  f = file.readlines()
  poems = []
  poem = ''
  dataset = ''
  for line in f:
    dataset += line.strip() + '. '
    if line != '\n':
      poem += line.strip() + ' '
    else:  # a blank line ends the current poem
      poems.append(poem)
      poem = ''

context = random.choice(poems)
question = input()

answer = question_answerer(question=question, context=context)['answer']

print(answer, '->', ' '.join(answer.split()[-2:]))

text_model = markovify.Text(' '.join(poems))

if len(answer.split()) > 1:
  print(text_model.make_sentence_with_start(' '.join(answer.split()[-2:]), strict=False, tries=100), end='\n')
else:
  print(text_model.make_sentence_with_start(answer, strict=False, tries=100), end='\n')
for i in range(5):
  print(text_model.make_short_sentence(200, min_length=100, tries=100), end='\n')

mayakovsky.txt

jsvine commented on June 5, 2024

Thanks for sharing this, @nezetimesthree.

It seems that you're passing make_sentence_with_start a "start" that was generated by an LLM, so it isn't guaranteed to exist in your corpus. That is a requirement for markovify, and for this type of Markov chain generally. Is that correct? If so, this is expected behavior of markovify and I would not consider it a bug.

If I've misunderstood, could you share a simpler code example that doesn't depend on other libraries, yet still reproduces the problem? In this example, the logic that uses IProject-10/bert-base-uncased-finetuned-squad2 is fairly intertwined here with the logic that uses markovify, and there are several different calls to markovify, making it difficult to debug.
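
To illustrate the requirement that a start must exist in the corpus, here is a minimal sketch of a word-level chain built with the standard library only. It is not markovify's implementation: `build_chain`, `sentence_from`, and the `END` marker are hypothetical names for this example, and it uses a state size of one word rather than markovify's default of two.

```python
import random

END = "___END__"  # hypothetical end-of-sentence marker for this sketch


def build_chain(sentences):
    """Map every word to the list of tokens seen directly after it."""
    chain = {}
    for sentence in sentences:
        words = sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain.setdefault(current, []).append(following)
    return chain


def sentence_from(chain, start, max_words=50):
    """Walk the chain from `start`; return None if `start` was never seen."""
    if start not in chain:
        return None  # nothing is known to follow this word
    words = [start]
    while len(words) < max_words:
        following = random.choice(chain[words[-1]])
        if following == END:
            break
        words.append(following)
    return " ".join(words)


chain = build_chain(["the cat sat", "a dog ran"])
print(sentence_from(chain, "cat"))   # cat sat
print(sentence_from(chain, "bird"))  # None
```

A word that was filtered out before the chain was built behaves exactly like "bird" here: the chain has no entry for it, so no sentence can begin with it.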

nezetimesthree commented on June 5, 2024

thanks for taking a look, @jsvine, but i think there's a misunderstanding: the LLM only gives answers taken from the given context, which in this case is one of the poems from the file. i've checked the failing cases against the poem dataset, and the words were always there. for some reason, NewlineText didn't see them as a start for sentences. maybe it's because some of the lines consist of only one word? could this be the issue?

jsvine commented on June 5, 2024

Thank you for the helpful clarification, @nezetimesthree. Could you share a start that the code fails on but that is definitely a start in the corpus?

nezetimesthree commented on June 5, 2024

hello again, @jsvine. sorry i didn't answer yesterday, but here's the example, the error, and the proof that it's clearly there.

[two screenshots]

jsvine commented on June 5, 2024

Thanks; can you share that as copy-pasteable text?

nezetimesthree commented on June 5, 2024

addition: here's what happens when it receives only one word
[two screenshots]

can you clarify what you mean by "copy-pasteable text", though? if i understand you correctly, then the words are "ладно слажен" and "Наоборот"

jsvine commented on June 5, 2024

Great, thanks; that's what I was looking for, indeed.

jsvine commented on June 5, 2024

Thanks again for the helpful example. Taking a closer look, the issue seems to lie not with make_sentence_with_start but with the sentence parser, much earlier in the processing pipeline.

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read())


def test_presence(fragment):
    return any(
        any(fragment == token for token in sentence)
        for sentence in model.parsed_sentences
    )


print(test_presence("Послушайте!"))
print(test_presence("слажен"))

Prints:

True
False

The default Text model uses a regex-powered filter to remove sentences that could cause problems, mostly regarding apostrophes and quotation marks. It also invokes unidecode, which seems to be causing the problem here. Because it's a generally useful approach, I don't want to remove that step from the library, but there are two ways you should be able to handle this on your end:

  • Calling markovify.Text(..., well_formed=False), which skips the filtering step
  • Extending markovify.Text (documented here) to behave in a way better suited to your corpus.

Using well_formed=False seems to work well, although you'll have to contend with the punctuation (or strip it out in a pre-processing step), as you'll see with the comma below:

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read(), well_formed=False)

print(model.make_sentence_with_start("ладно слажен,"))

Prints: ладно слажен, — и все обвыл.
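
For the pre-processing route, a minimal sketch of stripping punctuation before the text reaches the model could look like the following. It uses only the standard library; `strip_punctuation` and `PUNCTUATION` are hypothetical names for this example, and the choice to treat every non-word, non-space character as punctuation is an assumption that may be too aggressive for some corpora.

```python
import re

# Any character that is neither a word character nor whitespace.
# For str patterns in Python 3, \w is Unicode-aware, so Cyrillic letters
# are kept. Underscores also count as word characters and are kept.
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_punctuation(text):
    """Remove punctuation line by line, preserving the line breaks."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace punctuation with spaces, then collapse repeated spaces.
        words = PUNCTUATION.sub(" ", line).split()
        cleaned_lines.append(" ".join(words))
    return "\n".join(cleaned_lines)


print(strip_punctuation("Послушайте!"))  # Послушайте
```

Once sentence-ending punctuation is gone, sentence boundaries have to come from somewhere else. Since this sketch keeps the line breaks, the cleaned text would probably suit markovify.NewlineText, which treats each line as a sentence, better than markovify.Text.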

nezetimesthree commented on June 5, 2024

thank you very much, @jsvine. i will test it and return with the result next week. sorry for making you wait for it, but i just won't have a chance this week. thank you again, and we'll see if this works.
