Code Monkey home page Code Monkey logo

wdm_exp5's Introduction

EX5 Information Retrieval Using Boolean Model in Python

DATE: 21.03.2024

AIM: To implement Information Retrieval Using Boolean Model in Python.

Description:

The Boolean model in Information Retrieval (IR) is a fundamental model used for searching and retrieving information from a collection of documents. It operates on the principles of set theory and logic, where documents are represented as sets of terms or words, and queries are expressed as Boolean expressions using logical operators such as AND, OR, and NOT.

Procedure:

  1. Initialize the BooleanRetrieval class: The BooleanRetrieval class is defined to manage the indexing and searching of documents.
  2. Constructor and Index Initialization: The class constructor initializes an empty index to store the inverted index mapping terms to documents.
  3. Indexing Documents:

    a) The index_document method is responsible for indexing documents.

    b) Tokenize the text content of documents, converting them into lowercase terms.

    c) For each term in the document, it adds an entry in the index, associating the term with the document ID.

  4. Fetch Web Page Text:

    a) The fetch_webpage_text method uses the requests library to fetch content from a given URL.

    b) Extract text content from the fetched HTML using BeautifulSoup.

    c) The extracted text is returned for further processing.

  5. Boolean Search:

    a) The boolean_search method performs Boolean searches on the indexed documents.

    b) Tokenize the input query and iterates through its terms.

    c) For each term in the query, it retrieves documents containing that term and performs Boolean operations (AND, OR, NOT) based on the query's structure.

Program:

import numpy as np
import pandas as pd

class BooleanRetrieval:
    def __init__(self):
        self.index = {}
        self.documents_matrix = None

    def index_document(self, doc_id, text):
        terms = text.lower().split()
        print("Document -", doc_id, terms)

        for term in terms:
            if term not in self.index:
                self.index[term] = set()
            self.index[term].add(doc_id)

    def create_documents_matrix(self, documents):
        terms = list(self.index.keys())
        num_docs = len(documents)
        num_terms = len(terms)

        self.documents_matrix = np.zeros((num_docs, num_terms), dtype=int)

        for i, (doc_id, text) in enumerate(documents.items()):
            doc_terms = text.lower().split()
            for term in doc_terms:
                if term in self.index:
                    term_id = terms.index(term)
                    self.documents_matrix[i, term_id] = 1

    def print_documents_matrix_table(self):
        df = pd.DataFrame(self.documents_matrix, columns=self.index.keys())
        print(df)

    def print_all_terms(self):
        print("All terms in the documents:")
        print(list(self.index.keys()))

    def boolean_search(self, query):
        query_terms = query.lower().split()
        results = None

        for term in query_terms:
            doc_ids = self.index.get(term, set())
            if results is None:
                results = doc_ids.copy()
            else:
                if term.startswith('not'):
                    results.difference_update(doc_ids)
                elif term == 'or':
                    results.update(doc_ids)
                elif term == 'and':
                    results.intersection_update(doc_ids)

        return list(results) if results else []

if __name__ == "__main__":
    indexer = BooleanRetrieval()


    documents = {
        1: "Python is a programming language",
        2: "Information retrieval deals with finding information",
        3: "Boolean models are used in information retrieval"
    }

    for doc_id, text in documents.items():
        indexer.index_document(doc_id, text)


    indexer.create_documents_matrix(documents)
    indexer.print_documents_matrix_table()


    indexer.print_all_terms()


    query1 = input("Enter your boolean query: ")
    results = indexer.boolean_search(query1)
    if results:
        print(f"Results for '{query1}': {results}")
    else:
        print("No results found for the query.")

Output:

WDM5

Result:

Thus the implementation of Information Retrieval Using Boolean Model in Python is successfully completed.

wdm_exp5's People

Contributors

varalakshmi1084 avatar kadinsamson avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.