Repository for the SongsBot Telegram bot.
@author: Enrico Collu
Project for the 2021 AI-NLP course of Università degli Studi di Cagliari.
Introduction
SongsBot is a Telegram bot that takes a phrase entered by the user (for example: "I want canadian indie rock songs") and returns YouTube links to the songs that best match the request.
Main NLP features
The system mainly uses two NLP features:
- TF-IDF (term frequency / inverse document frequency)
- Knowledge graph queries with SPARQL and DBpedia
TF-IDF is used to score the most relevant words in the user's input. The weights are computed from the review corpus available in the dataset; once a sentence is received from the user, the system returns its most characteristic words, sorted by decreasing score.
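The keyword ranking described above can be sketched as follows. This is a minimal illustration that uses only the standard library, not the project's actual code (which lists sklearn.feature_extraction for this step). The three review strings, the stopword list, and the function names are assumptions made for the example. The IDF formula is the smoothed variant that scikit-learn's TfidfVectorizer applies by default, without its normalization step.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); the project lists nltk.corpus for this.
STOPWORDS = {"i", "want", "a", "an", "the", "of", "and", "with", "some", "songs"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def idf_table(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def rank_keywords(sentence, idf):
    """Score the sentence's known terms by tf * idf, highest score first."""
    counts = Counter(t for t in tokenize(sentence) if t in idf)
    scored = [(term, tf * idf[term]) for term, tf in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for the scraped review texts.
reviews = [
    "a canadian indie rock band with shimmering guitars",
    "an indie rock record with lo-fi production",
    "a rock album built on heavy blues riffs",
]
idf = idf_table(reviews)
keywords = rank_keywords("I want canadian indie rock songs", idf)
print([term for term, _ in keywords])  # ['canadian', 'indie', 'rock']
```

In this toy corpus "rock" appears in every review, so it receives the lowest weight, while "canadian" appears in only one and ranks first.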
SPARQL is the query language used to interrogate DBpedia, a knowledge base of structured data extracted from Wikipedia and published as RDF. Given the name of an artist, this part of the project retrieves information about the songs that artist has released.
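A query of this kind might look like the sketch below. The choice of the dbo:artist and rdfs:label properties, the function names, and the example artist are assumptions for illustration; the queries in queriesSparQL.py may be shaped differently. Note that dbo:artist also links albums to their artist, so the results are musical works in general rather than songs only. SPARQLWrapper is imported inside the function so that the query builder can be tried without the library or a network connection.

```python
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_songs_query(artist_name, limit=10):
    """Return a SPARQL query listing English titles of works by the artist."""
    # Escape backslashes and double quotes so the name stays a valid literal.
    safe_name = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?title WHERE {{
  ?artist rdfs:label "{safe_name}"@en .
  ?song dbo:artist ?artist ;
        rdfs:label ?title .
  FILTER (lang(?title) = "en")
}}
LIMIT {int(limit)}
""".strip()


def fetch_song_titles(artist_name, limit=10):
    """Run the query against DBpedia (needs network access and SPARQLWrapper)."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # imported lazily

    client = SPARQLWrapper(DBPEDIA_ENDPOINT)
    client.setQuery(build_songs_query(artist_name, limit))
    client.setReturnFormat(JSON)
    results = client.query().convert()
    return [row["title"]["value"] for row in results["results"]["bindings"]]


query = build_songs_query("Arcade Fire", limit=5)
print(query)
```

Calling fetch_song_titles("Arcade Fire") would send the query to the public endpoint; what it returns depends on the current state of DBpedia.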
Dataset composition
The starting dataset consists of approximately 18,000 links to reviews on the Pitchfork website (https://pitchfork.com/). These links were scraped to build a new database containing the full text of each review, which is a more useful dataset for the task at hand.
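The scraping step can be pictured with the sketch below. It is a simplified stand-in, not the project's scraper: it uses the standard library's html.parser instead of bs4 and works on an inline HTML snippet instead of a page fetched with requests. The table name, column names, sample HTML, and review URL are all placeholders invented for the example.

```python
import sqlite3
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.paragraphs[-1] += data


def extract_review_text(html):
    """Return the paragraph text of a page as one whitespace-normalized string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())


def save_review(conn, url, text):
    """Store one review, keyed by its URL (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (url TEXT PRIMARY KEY, content TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO reviews (url, content) VALUES (?, ?)", (url, text)
    )
    conn.commit()


# Placeholder page content and URL, used only to exercise the functions.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>A <em>bright</em> indie record.</p><p>Worth a listen.</p>"
    "</body></html>"
)
conn = sqlite3.connect(":memory:")
save_review(
    conn,
    "https://pitchfork.com/reviews/albums/example/",
    extract_review_text(sample_html),
)
stored = conn.execute("SELECT content FROM reviews").fetchone()[0]
print(stored)  # A bright indie record. Worth a listen.
```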
Python libraries
- sqlite3 -> database connection
- pandas -> DataFrame management
- bs4 (BeautifulSoup) -> web scraping
- requests -> HTTP requests
- telepot -> Telegram Bot management
- joblib -> parallel execution
- scipy -> managing NLP tools
- numpy -> powerful tool for matrix operations
- nltk.corpus -> useful for preprocessing tasks
- sklearn.feature_extraction -> compute TF-IDF
- SPARQLWrapper -> SPARQL query environment
- urllib.request -> retrieving YouTube links
- gensim -> Word2Vec module (tried but not used in the final version of the project)
File and project structure
- main.py is the main project file. It sets up the bot, manages the running service, and ties together all the other modules of the project.
- preprocessing.py contains the tools for the main text preprocessing operations (tokenization, lowercase conversion, stopword removal, lemmatization, etc.).
- datamanager.py creates the connection to the initial links database and manages the creation of the new textual database. It calls methods defined in the scraper.py module, which retrieves the textual content from the pages.
- queriesSparQL.py contains the method that performs the SPARQL queries against DBpedia.
- responsebuilder.py assigns scores to the artists based on how well the keywords match the text of their reviews.
- youtube_module.py contains the method that returns the link to the YouTube video for the requested artist (the first useful result).
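As an illustration of the last module, retrieving the first result for a search could be sketched as below. This is an assumption about the approach, not the contents of youtube_module.py: the function names are hypothetical, and the regular expression relies on the search results page containing watch?v= links, which is page markup rather than a stable API and may change. The video ID in the sample string is a placeholder.

```python
import re
from urllib.parse import urlencode

# A YouTube video ID is 11 characters from this alphabet.
VIDEO_ID = re.compile(r"watch\?v=([A-Za-z0-9_-]{11})")


def build_search_url(query):
    """Return the YouTube search URL for a free-text query."""
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})


def first_video_link(results_html):
    """Return a link to the first video found in the page source, or None."""
    match = VIDEO_ID.search(results_html)
    return f"https://www.youtube.com/watch?v={match.group(1)}" if match else None


def find_song_link(query):
    """Fetch the search page and return its first video link (needs network)."""
    from urllib.request import urlopen  # imported lazily

    with urlopen(build_search_url(query)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return first_video_link(html)


# Placeholder page fragment, used to show the extraction without a request.
sample_page = '{"url":"/watch?v=abcDEF12345\\u0026pp=x"}'
print(build_search_url("Arcade Fire"))
print(first_video_link(sample_page))
```

Keeping the URL building and the parsing separate from the network call makes both easy to check offline.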
Usage of the bot
- Start the service by running the main.py file.
- Once the service is active, search for the bot (@Songs20Bot) on Telegram and start a chat.
- To get started, type /start and send the message. The bot will respond with instructions for use.
- Send a request, and the bot will return YouTube videos of the songs it deems appropriate.
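The conversation flow above could be wired up with telepot roughly as follows. This is a hedged sketch, not the code in main.py: the instruction text, the fallback message, the build_reply and run_bot names, and the recommend callback (standing in for the whole keyword, scoring, and YouTube pipeline) are assumptions. telepot is imported inside run_bot so the reply logic can be tried on its own.

```python
INSTRUCTIONS = (
    "Send me a sentence describing the music you want, "
    'for example: "I want canadian indie rock songs".'
)


def build_reply(text, recommend):
    """Return the list of messages to send back for one incoming text."""
    text = text.strip()
    if text == "/start":
        return [INSTRUCTIONS]
    links = recommend(text)
    return links or ["Sorry, I could not find any matching songs."]


def run_bot(token, recommend):
    """Start polling Telegram and answer every text message (needs telepot)."""
    import time

    import telepot  # imported lazily
    from telepot.loop import MessageLoop

    bot = telepot.Bot(token)

    def handle(msg):
        content_type, _chat_type, chat_id = telepot.glance(msg)
        if content_type != "text":
            return  # ignore photos, stickers, and other non-text messages
        for message in build_reply(msg["text"], recommend):
            bot.sendMessage(chat_id, message)

    MessageLoop(bot, handle).run_as_thread()
    while True:  # keep the main thread alive while the loop runs
        time.sleep(10)


def fake_recommend(text):
    """Placeholder recommender that returns a single made-up link."""
    return ["https://www.youtube.com/watch?v=abcDEF12345"]


print(build_reply("/start", fake_recommend))
print(build_reply("I want canadian indie rock songs", fake_recommend))
```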