Comments (11)
Hi @nezetimesthree, and thanks for your interest in markovify. When you get a chance, please provide code and text that reproduces the problem. Without that, it will unfortunately be quite hard to debug.
from markovify.
of course. here's the code and text file.
from transformers import pipeline
import random
import markovify

model_link = "IProject-10/bert-base-uncased-finetuned-squad2"
question_answerer = pipeline("question-answering", model=model_link)

with open('mayakovsky.txt', 'r') as file:
    f = file.readlines()

poems = []
poem = ''
dataset = ''
for line in f:
    dataset += line.strip() + '. '
    if line != '\n':
        poem += line.strip() + ' '
    else:
        poems.append(poem)
        poem = ''

context = random.choice(poems)
question = input()
answer = question_answerer(question=question, context=context)['answer']
print(answer, '->', ' '.join(answer.split()[-2:]))

text_model = markovify.Text(' '.join(poems))
if len(answer.split()) > 1:
    print(text_model.make_sentence_with_start(' '.join(answer.split()[-2:]), strict=False, tries=100), end='\n')
else:
    print(text_model.make_sentence_with_start(answer, strict=False, tries=100), end='\n')
for i in range(5):
    print(text_model.make_short_sentence(200, min_length=100, tries=100), end='\n')
Thanks for sharing this, @nezetimesthree.
It seems that you're passing to make_sentence_with_start a "start" that was generated by an LLM, which is not guaranteed to be a start that actually exists in your corpus. That is a requirement for markovify and for this type of Markov chain generally. Is that correct? If so, this is expected behavior of markovify and I would not consider it a bug.
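To illustrate the requirement described above, here is a minimal pure-Python sketch of a word-level chain with a state size of 2. It is not markovify's implementation; the `build_chain` and `has_start` helpers and the toy corpus are hypothetical, chosen only to show that a start is usable only if that exact state was observed in the training text.

```python
from collections import defaultdict


def build_chain(sentences, state_size=2):
    """Map each run of `state_size` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def has_start(chain, start):
    """Return True if the words in `start` were seen as a state in the corpus."""
    return tuple(start.split()) in chain


# Toy corpus, assumed for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
chain = build_chain(corpus)

print(has_start(chain, "sat on"))      # this state occurs in the corpus
print(has_start(chain, "cat barked"))  # this state never occurs, so nothing can follow it
```

A start that never appears as a state has no recorded successors, so the chain has nothing to continue from.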
If I've misunderstood, could you share a simpler code example that doesn't depend on other libraries, yet still reproduces the problem? In this example, the logic that uses IProject-10/bert-base-uncased-finetuned-squad2 is fairly intertwined here with the logic that uses markovify, and there are several different calls to markovify, making it difficult to debug.
thanks for taking a look, @jsvine. but you're misunderstanding this: LLM gives answers only from the given context, which, in this case, is one of the poems from the file. i've checked the errors in poem dataset, and the words were there always. for some reason, NewlineText didn't see them as a start for sentences. maybe it's because some of the lines consist only of one word? could this be the issue?
Thank you for the helpful clarification, @nezetimesthree. Could you share a start that the code fails on but that is definitely a start in the corpus?
hello again, @jsvine. sorry i didn't answer yesterday, but here's the example, the error, and the proof that it's clearly there.
Thanks; can you share that as copy-pasteable text?
addition: here's what happens when it receives only one word
can you clarify what you mean by "copy-pastable text", though? if i understand you correctly, then the words are "ладно слажен" and "Наоборот"
Great, thanks; that's what I was looking for, indeed.
Thanks again for the helpful example. Taking a closer look, the issue seems not to be with make_sentence_with_start, but rather the sentence parser much earlier in the processing pipeline.
import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read())

def test_presence(fragment):
    return any(
        any(fragment == token for token in sentence)
        for sentence in model.parsed_sentences
    )

print(test_presence("Послушайте!"))
print(test_presence("слажен"))
Prints:
True
False
The default Text model uses a regex-powered filter to remove sentences that could cause problems, mostly re. apostrophes and quotation marks. It also invokes unidecode, which seems to be causing the problem here. Because it's a generally useful approach, I don't want to remove that step from the library, but there are two ways you should be able to handle it on your end:
- Calling markovify.Text(..., well_formed=False), which skips the filtering step
- Extending markovify.Text (documented here) to behave in a way better suited to your corpus.
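The effect of a sentence filter on later lookups can be sketched in plain Python. The pattern below is a hypothetical stand-in, not markovify's actual rule, and `parse` and `is_present` are illustrative helpers. The point is only that every word of a rejected sentence vanishes from the model, so a fragment can be present in the file yet absent from the parsed sentences.

```python
import re

# Hypothetical rejection rule: any sentence containing a quote or bracket.
# This is an assumption for illustration, not markovify's real pattern.
REJECT = re.compile(r"[\"'()\[\]]")


def parse(sentences, well_formed=True):
    """Tokenize sentences, optionally dropping those that match REJECT."""
    kept = []
    for sentence in sentences:
        if well_formed and REJECT.search(sentence):
            continue  # the whole sentence is discarded, words and all
        kept.append(sentence.split())
    return kept


def is_present(parsed, fragment):
    """Check whether `fragment` is a token in any sentence that was kept."""
    return any(fragment in tokens for tokens in parsed)


raw = ["Listen to this!", 'He said "well built" and left.']

print(is_present(parse(raw), "said"))                     # filtered out with its sentence
print(is_present(parse(raw, well_formed=False), "said"))  # kept when filtering is skipped
```

With filtering on, "said" is gone even though it contains no problematic character itself, because the filter works on whole sentences.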
Using well_formed=False seems to work well, although you'll have to contend with the punctuation (or strip it out in a pre-processing step), as you'll see with the comma below:
import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read(), well_formed=False)

print(model.make_sentence_with_start("ладно слажен,"))
Prints: ладно слажен, — и все обвыл.
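One possible shape for the pre-processing step mentioned above is sketched here using only the standard library. The `strip_punctuation` and `normalize` helpers are hypothetical names, and removing every Unicode punctuation character is an assumption that may be too aggressive for some corpora.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


def normalize(text):
    """Strip punctuation, then collapse the whitespace left behind."""
    return " ".join(strip_punctuation(text).split())


print(normalize("ладно слажен,"))
print(normalize("Послушайте!"))
```

Applying the same normalization to both the corpus and the requested start keeps the two comparable, so a start like "ладно слажен" no longer needs its trailing comma. Note that this also removes sentence-ending punctuation, so it suits corpora whose sentences are separated by line breaks rather than by periods.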
thank you very much, @jsvine. i will test it and return with the result next week. sorry for making you wait for it, but i just won't have a chance this week. thank you again, and we'll see if this works.