Admittedly, that title was clickbait. It seems like you are interested in scenes or wizards, and other programming patterns that allow you to define conversational interfaces as if they were a finite-state machine (FSM, wiki).
THIS IS A LOT OF TEXT. Please still read everything before you comment. Let's try to keep the signal-to-noise ratio high on this one :)
What Is This Issue
One of the most frequently requested features is scenes. This issue shall:
- Bring everyone onto the same page about what people mean when they say scenes.
- Explain why there is not going to be a traditional implementation of scenes for grammY, and how we're trying to do better.
- Update you about the current progress.
- Introduce two novel competing concepts that could both turn out to be better than scenes.
- Serve as a forum to discuss where we want to take this library regarding grammY conversations/scenes.
1. What are scenes?
A chat is a conversational interface. This means that the chat between the user and the bot evolves over time. Old messages stay relevant when processing current ones, as they provide the context of the conversation that determines how to interpret messages.
< /start
>>> How old are you?
< 42
>>> Cool, how old is your mother?
< 70
>>> Alright, she was 28 when you were born!
Note how the user sends two messages, and both are numbers. We only know that those two numbers mean two different things because we can follow the flow of the conversation. The two age numbers answer two different questions. Hence, in order to provide a natural conversational flow, we must store the history of the chat and take it into account when interpreting messages.
Note that Telegram does not store the chat history for bots, so you have to store it yourself. This is often done via sessions, but you can also use your own database.
In fact, we often don't need to know the entire chat history. The few most recent messages are usually enough to remember, as we likely don't have to care about what the user sent back in 2018. It is therefore common to construct state, i.e. a small bit of data that stores where in the conversation we are. In our example, we would only need to store whether the last question was about the age of the user or about the age of their mother.
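To make this idea of state concrete, here is a minimal, framework-free sketch in TypeScript (the names `Step`, `SessionData`, and `handleNumber` are invented for illustration). The same kind of input, a plain number, gets interpreted differently depending on the stored state:

```ts
// The only state we persist is which question the bot asked last,
// plus the first answer once we have it.
type Step = "ask-own-age" | "ask-mother-age" | "done";

interface SessionData {
  step: Step;
  ownAge?: number;
}

// Interpret an incoming number based on the stored state.
function handleNumber(session: SessionData, n: number): string {
  switch (session.step) {
    case "ask-own-age":
      session.ownAge = n;
      session.step = "ask-mother-age";
      return "Cool, how old is your mother?";
    case "ask-mother-age":
      session.step = "done";
      return `Alright, she was ${n - session.ownAge!} when you were born!`;
    default:
      return "We are done already!";
  }
}

const session: SessionData = { step: "ask-own-age" };
console.log(handleNumber(session, 42)); // asks about the mother
console.log(handleNumber(session, 70)); // 70 - 42 = 28
```

Everything scenes do boils down to bookkeeping of this kind, just generalized and hidden behind a nicer API.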
Scenes are a way to express this conversational style by allowing you to define a finite-state machine. Please look this up if it is new to you, as it is essential for the following discussion. The state is usually stored in the session data. Scenes achieve this by isolating a part of the middleware into a block that can be entered and left.
Different bot frameworks have different syntax for this, but it typically works roughly like this (explanatory code, do not try to run):
```ts
// Define a separate part of the middleware handling.
const scene = new Scene('my-scene')
scene.command('start', ctx => ctx.reply('/start command from inside scene'))
scene.command('leave', ctx => ctx.scene.leave()) // leave scene

// Define regular bot.
const bot = new Bot('secret-token')
bot.use(session())
bot.use(scene)
bot.command('start', ctx => ctx.reply('/start command outside of scene'))
bot.command('enter', ctx => ctx.scene.enter('my-scene')) // enter scene
bot.start()
```
This could result in the following conversation.
< /start
>>> /start command outside of scene
< /enter
< /start
>>> /start command from inside scene
< /leave
< /start
>>> /start command outside of scene
In a way, every scene defines one step of the conversation. As you can define arbitrarily many of these scenes, you can define a conversational interface by creating a new instance of `Scene` for every step, and hence define the message handling for it.
Scenes are a good idea. They are a huge step forward from only defining dozens of handlers on the same middleware tree. Bots that do not use scenes (or a similar form of state management) effectively forget everything that happened in the chat immediately after they're done handling a message. (If they seem to remember their context, then this is more or less a workaround which relies on a message that you reply to, inline menus, or other information in order to avoid state management.)
2. Cool! So what is the problem?
Scenes effectively reduce the flow of a conversation to being in a state, and then transitioning into another state (`ctx.scene.enter('goto')`). This can be illustrated by translating scenes into routers:
```ts
const scene = new Router(ctx => ctx.session.scene)

// Define a separate part of the middleware handling.
const handler = new Composer()
scene.route('my-scene', handler)
handler.lazy(ctx => {
  const c = new Composer()
  c.command('start', ctx => ctx.reply('/start command from inside scene'))
  c.command('leave', ctx => ctx.session.scene = undefined) // leave scene
  return c
})

// Define regular bot.
const bot = new Bot('secret-token')
bot.use(session())
bot.use(scene)
bot.command('start', ctx => ctx.reply('/start command outside of scene'))
bot.command('enter', ctx => ctx.session.scene = 'my-scene') // enter scene
bot.start()
```
Instead of creating new `Scene` objects, we simply create new routes, and obtain the same behaviour with minimally more code.
This may work if you have two states. It may also work for three. However, the more often you instantiate `Scene`, the more states you add to your global pool of states, between which you're jumping around arbitrarily. This quickly becomes messy. It takes you back to the old days of writing one huge file of code without indentation, and using GOTO to move around. That, too, works at a small scale, but considering GOTO harmful led to a paradigm shift that substantially advanced programming as a discipline.
In Telegraf, there are some ways to mitigate the problem. For example, one could add a way to group some scenes together into a namespace. Telegraf does this by registering several `Scene` instances on a `Stage`. It also allows you to force certain scenes into a linear sequence of steps, and calls this a wizard, in analogy to multi-step UI forms.
With grammY, we try to rethink the state of the art, and to come up with original solutions to long-standing problems. Admitting that `Update` objects are actually pretty complex led us to giving powerful tools to bot developers: filter queries and the middleware tree were born, and they are widely used in almost all bots. Admitting that sending requests is more than just a plain HTTP call (at least when you're working with Telegram) led us to developing API transformer functions: a core primitive that drastically changes how we think about plugins and what they can do. Admitting that long polling at scale is quite hard led us to grammY runner: the fastest long polling implementation that exists, outperforming all other JS frameworks by far.
Regarding conversational interfaces, the best we could come up with so far is GOTO. That was an okay first step a few years ago. Now, it is time to admit that this is harmful, and that we can do better.
3. So what have we done about this so far?
Not too much, which is why this issue exists. So far, we've been recommending that people combine routers and sessions rather than using scenes, as this does not take much more code, and providing the same plain old scenes for grammY is not ambitious enough.
There is a branch in this repository that contains some experiments with a possible future syntax; however, the feedback on it was mixed. It does improve the situation by providing structure between the different steps of the conversation. Unfortunately, the resulting code is not very readable, and things that belong together end up in different places in the code. It is always nice when things that are semantically linked can be written close to each other.
As a consequence of this lack of progress, we need to have a proper discussion with everyone in the community in order to develop a more mature approach. The next section suggests two ideas; one of them is the aforementioned one. Your feedback and ideas will shape the next step in developing conversational interfaces. Please speak up.
4. Some suggestions
Approach A: “Conversation Nodes”
This suggestion is the one we've mentioned above. Its main contribution is a more implicit way of defining scenes. Instead of creating a new instance of a class for every step, you can just call `conversation.wait()`. This will internally create the class for you. As a result, you can express the conversation in a more natural way. The `wait` calls make it clear where a message from the user is expected.
Here is the example from the top again. Handling invalid input is omitted intentionally for brevity.
```ts
const conversation = new Conversation('age-at-birth')
conversation.command('start', async ctx => {
  await ctx.reply('How old are you?')
  ctx.conversation.forward()
})
conversation.wait()
conversation.on('message:text', async ctx => {
  ctx.session.age = parseInt(ctx.msg.text, 10)
  await ctx.reply('Cool, how old is your mother?')
  ctx.conversation.forward()
})
conversation.wait()
conversation.on('message:text', async ctx => {
  const age = parseInt(ctx.msg.text, 10)
  await ctx.reply(`Alright, she was ${age - ctx.session.age} when you were born!`)
  ctx.conversation.leave()
})
```
This provides a simple linear flow.
We can jump back and forth using `ctx.conversation.forward(3)` or `ctx.conversation.backward(5)`. The `wait` calls optionally take string identifiers if you want to jump to a specific point, rather than giving a relative number of steps.
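To illustrate the bookkeeping this syntax implies, here is a toy model (this is not real grammY API; the class and all method semantics are hypothetical). Each `wait()` call closes the current step, handlers attach to the most recently opened step, and labels map to step indices:

```ts
// A handler returns a reply, or undefined if it does not match.
type Handler = (text: string) => string | undefined;

class ConversationModel {
  private steps: Handler[][] = [[]];
  private labels = new Map<string, number>();

  // Attach a handler to the step that is currently being defined.
  on(handler: Handler): void {
    this.steps[this.steps.length - 1].push(handler);
  }

  // Close the current step; an optional label names the next one.
  wait(label?: string): void {
    this.steps.push([]);
    if (label !== undefined) this.labels.set(label, this.steps.length - 1);
  }

  // Run the handlers of the step the chat is currently in.
  handle(position: number, text: string): string | undefined {
    for (const h of this.steps[position] ?? []) {
      const reply = h(text);
      if (reply !== undefined) return reply;
    }
    return undefined;
  }

  // Translate a label into a step index (defaulting to the start).
  resolve(label: string): number {
    return this.labels.get(label) ?? 0;
  }
}

const conv = new ConversationModel();
conv.on(text => (text === "/start" ? "How old are you?" : undefined));
conv.wait("own-age");
conv.on(() => "Cool, how old is your mother?");

console.log(conv.handle(0, "/start"));                   // step 0
console.log(conv.handle(conv.resolve("own-age"), "42")); // step 1
```

In this model, `forward(n)` and `backward(n)` would simply add to or subtract from the per-chat `position`, which is exactly the small bit of session state the FSM needs.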
Next, let us see how we can branch out, and have an alternative way of continuing the conversation.
```ts
const conversation = new Conversation('age-at-birth')
conversation.command('start', async ctx => {
  await ctx.reply('How old are you?')
  ctx.conversation.forward()
})
conversation.wait()

// Start a new sub-conversation.
const invalidConversation = conversation
  .filter(ctx => isNaN(parseInt(ctx.msg.text, 10)))
  .diveIn()
invalidConversation.on('message', ctx =>
  ctx.reply('That is not a number, so I will assume you sent me the name of your pet'))
invalidConversation.wait()
// TODO: continue conversation about pets here

// Go on with the regular conversation about age:
conversation.on('message:text', async ctx => {
  ctx.session.age = parseInt(ctx.msg.text, 10)
  await ctx.reply('Cool, how old is your mother?')
  ctx.conversation.forward()
})
conversation.wait()
conversation.on('message:text', async ctx => {
  const age = parseInt(ctx.msg.text, 10)
  await ctx.reply(`Alright, she was ${age - ctx.session.age} when you were born!`)
  ctx.conversation.leave()
})
```
We have now defined a conversation that branches: non-numeric input diverts into a sub-conversation about pets, while numeric input continues the questions about ages. That way, we can define entire conversation flows.
There are a number of improvements that could be made to this. If you have any concrete suggestions, please leave them below.
Approach B: “Nested Handlers”
Newcomers commonly try out something like this.
```ts
bot.command('start', async ctx => {
  await ctx.reply('How old are you?')
  bot.on('message', ctx => { /* ... */ })
})
```
grammY has a protection against this because it would lead to a memory leak, and eventually OOM the server: every received `/start` command would add another handler that is installed globally and persistently. All but the first are unreachable code, given that `next` isn't called inside the nested handler.
It would be worth investigating if we can write a different middleware system that allows this.
```ts
const conversation = new Conversation()
conversation.command('start', async ctx => {
  await ctx.reply('How old are you?')
  conversation.on('message', ctx => { /* ... */ })
})
```
This would probably lead to deeply nested callback functions, i.e. bring us back to callback hell, something that could be called the GOTO statement of asynchronous programming.
What could we do to mitigate this?
Either way, this concept is still tempting. It is very intuitive to use. It obviously cannot be implemented with exactly the above syntax (because we are unable to reconstruct the current listeners on the next update, and we obviously cannot store the listeners in a database), but we could try to figure out if small adjustments could make this possible. Internally, we would still have to convert this into something like an FSM, but maybe one that is generated on the fly. The dynamic ranges of the menu plugin could be used as inspiration here.
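As a thought experiment for how such an FSM could be generated on the fly, here is a sketch (purely hypothetical, not a proposed API) that writes the conversation as a generator function. Each `yield` hands out the next question, and on every new update we replay the generator from the start, feeding it all answers recorded so far. The persisted per-chat state is just the array of answers:

```ts
// The conversation, written top to bottom: each `yield` asks a
// question, and the return value is the final reply.
function* ageConversation(): Generator<string, string, string> {
  const own = Number(yield "How old are you?");
  const mother = Number(yield "Cool, how old is your mother?");
  return `Alright, she was ${mother - own} when you were born!`;
}

// Replay the conversation with the answers received so far, and
// report either the next question to send or the final result.
function run(
  fn: () => Generator<string, string, string>,
  answers: string[],
): { done: boolean; question?: string; result?: string } {
  const it = fn();
  let r = it.next();
  for (const a of answers) {
    if (r.done) break;
    r = it.next(a); // feed a recorded answer into the paused generator
  }
  return r.done
    ? { done: true, result: r.value }
    : { done: false, question: r.value };
}

console.log(run(ageConversation, []).question);         // first question
console.log(run(ageConversation, ["42"]).question);     // second question
console.log(run(ageConversation, ["42", "70"]).result); // final reply
```

Note a real implementation would have to deal with side effects that happen between two waits, since those would be repeated on every replay; that is exactly the kind of detail worth discussing here.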
5. We need your feedback
Do you have a third idea? Can we combine approaches A and B? How would you change them? Do you think the examples are completely missing the point? Any constructive feedback is welcome, and so are questions and concerns.
It would be amazing if we could find the right abstraction for this. It exists somewhere out there, we just have to find it.
Thank you!