
Satarupa Guha's Projects

analysis-wikipedia-entities

Goal: To understand the Wikipedia dataset, especially the entity infoboxes.

Task: We have taken the Wikipedia dump. Our aim is to extract information about various entity types. The steps for this task are as follows (a sketch of steps 1-3 appears after the list):

1. Given the Wikipedia dump, gather all the pages from Wikipedia with infoboxes on them.
2. Find the set of all possible entity types on Wikipedia.
3. Find the set of all possible attributes that can be associated with any entity type on Wikipedia.
4. From a few values of these attributes, infer the data type of each attribute as one of the following: string, set of strings, duration, number, set of durations, date, other.
5. Find the various units that can be used to express the value of a numeric attribute. E.g., for the “height” attribute of “person” entities, the units could be “cms, inches”.
6. For numeric attributes, find typical ranges (using the most popular unit). E.g., for person entities, the age attribute should have the range 0-150 years.
7. For attributes which are semantically similar but have different names across different entities of the same type, merge them. E.g., automatically identify that the attribute “birthdate” is the same as “bdate”.
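A minimal sketch of how steps 1-3 might look, assuming a page's wikitext has already been read out of the dump; the third-party `mwparserfromhell` parser is an assumption here, not necessarily what the project used:

```python
# Sketch: pull the first infobox template out of a page's wikitext and record
# the entity type (from the infobox name) and its attribute names/values.
import mwparserfromhell


def extract_infobox(wikitext):
    """Return (entity_type, {attribute: raw_value}) for the first infobox, or None."""
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox"):
            # "Infobox person" -> entity type "person"
            entity_type = name[len("infobox"):].strip() or "unknown"
            attributes = {
                str(p.name).strip(): str(p.value).strip()
                for p in template.params
            }
            return entity_type, attributes
    return None


sample = "{{Infobox person | name = Ada Lovelace | birth_date = 1815 | height = 165 cm}}"
print(extract_infobox(sample))
# ('person', {'name': 'Ada Lovelace', 'birth_date': '1815', 'height': '165 cm'})
```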

extract_emailids_unstructured_webpages

Goal: To understand basic crawling, and to use simple heuristics to handle real-world, unclean web data to extract email IDs.

Input: 2000 business webpages crawled from Yelp. Each webpage is an HTML page containing details about a business. It does not contain the email ID, but it does have the business's website address, which can be used to find the contact-us page of the site and thereby extract its email ID.

Task: Obtain structured data for each business: business name, business phone number, business home page URL, contact-us URL, and email ID.
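A hedged sketch of the contact-us crawl, assuming `requests` and `BeautifulSoup` are available; the "contact" link heuristic and the email regex are illustrative, not the project's exact rules:

```python
# Sketch: given a business home page URL, find a link that mentions "contact",
# fetch that page, and pull the first email-looking string out of it.
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_contact_email(home_url):
    home = requests.get(home_url, timeout=10)
    soup = BeautifulSoup(home.text, "html.parser")
    # Heuristic: any anchor whose href or text mentions "contact" is a candidate.
    for a in soup.find_all("a", href=True):
        if "contact" in a["href"].lower() or "contact" in a.get_text().lower():
            contact_url = requests.compat.urljoin(home_url, a["href"])
            contact = requests.get(contact_url, timeout=10)
            match = EMAIL_RE.search(contact.text)
            if match:
                return contact_url, match.group(0)
    return None, None
```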

guidance

A guidance language for controlling large language models.

handwritten-digit-recognition

Handwriting recognition is the problem of automatically interpreting intelligible handwritten input. It is of great interest to the pattern recognition research community because of its applicability to many fields, enabling more convenient input devices and more efficient data organization and processing. We have to code a complete digit recognizer and test it on the MNIST digit dataset. As a benchmark for testing classification algorithms, the MNIST dataset has been widely used to design novel handwritten digit recognition systems. The dataset consists of 70,000 grayscale images, each of size 28x28 pixels (784 values). The recognizer is supposed to read the image data, extract features from it, and use a k-nearest-neighbor classifier to recognize any test image. To carry out the experiments, we randomly divide the dataset into two partitions, training and testing. The training set is used to create the classifier, and the test set is used to determine its accuracy.
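A minimal sketch of that pipeline, assuming the 70,000 x 784 pixel matrix has already been loaded (random arrays stand in for the real data here); scikit-learn is used purely for illustration:

```python
# Sketch: random train/test split followed by a k-nearest-neighbor classifier.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 784))          # stand-in for the real MNIST pixels
y = rng.integers(0, 10, size=1000)   # stand-in for the digit labels

# Random partition into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```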

implementingeigenfaces

The goal of this mini project is to get familiarized with the ideas of image representation, PCA and LDA, and face recognition. It also aims to convey the practical difficulties in developing real-world systems that work with acceptable accuracy.
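A short sketch of the eigenfaces idea, assuming a matrix with one flattened grayscale face per row (random data stands in for real images); PCA is done via SVD on the mean-centred data, and recognition compares projections with a nearest neighbour:

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))   # stand-in for real face images

mean_face = faces.mean(axis=0)
centred = faces - mean_face

# Rows of Vt are the principal directions ("eigenfaces").
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
k = 50
eigenfaces = Vt[:k]


def project(face):
    """Project a face into the k-dimensional eigenface space."""
    return eigenfaces @ (face - mean_face)


query = project(faces[0])
distances = np.linalg.norm((centred @ eigenfaces.T) - query, axis=1)
print("closest training face:", distances.argmin())
```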

phrase-translation

Words may not always be the best atomic unit of a sentence. One word in the source language often corresponds to multiple words in the target language, and a word-based model breaks down in these cases. This is the motivation for building a phrase-based translation model.
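A toy illustration of the point (not the project's actual model): with a hypothetical phrase table, a multi-word source phrase such as "zum Beispiel" is translated as a unit rather than word by word.

```python
# Hypothetical phrase table; entries are illustrative only.
phrase_table = {
    ("zum", "beispiel"): "for example",
    ("ich",): "I",
    ("gehe",): "go",
}


def greedy_translate(words, max_len=2):
    out, i = [], 0
    while i < len(words):
        # Prefer the longest matching source phrase at position i.
        for n in range(max_len, 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += n
                break
        else:
            out.append(words[i])  # pass unknown words through
            i += 1
    return " ".join(out)


print(greedy_translate(["ich", "gehe", "zum", "beispiel"]))  # "I go for example"
```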

rerankurl

This project is based on the Personalized Web Search Challenge organized by Kaggle. The aim of this challenge is to re-rank the URLs of each SERP (search engine results page) returned by the search engine according to the personal preferences of the user.
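A hedged sketch of one possible personalization heuristic (not the challenge-winning approach): blend the engine's original rank with the user's historical click rate on each URL's domain. The scoring weights and domain parsing below are illustrative assumptions.

```python
def rerank(serp, click_rate, alpha=0.7):
    """serp: list of (url, original_rank); click_rate: {domain: rate in [0, 1]}."""
    def domain(url):
        return url.split("/")[2] if "//" in url else url

    def score(item):
        url, rank = item
        # Lower original rank is better; higher personal click rate is better.
        return alpha * (1.0 / rank) + (1 - alpha) * click_rate.get(domain(url), 0.0)

    return sorted(serp, key=score, reverse=True)


serp = [("https://a.com/x", 1), ("https://b.com/y", 2), ("https://c.com/z", 3)]
print(rerank(serp, {"c.com": 0.9}))  # c.com is promoted above b.com
```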

searchengineforwikipedia

Given a query, search the Wikipedia corpus (46 GB) and return the titles of the top ten retrieved documents, in ranked order. Queries can be either phrase queries or field-based queries. Multi-level indexes were built to improve retrieval speed. Evaluation is done primarily on the basis of the quality of results and the time taken for retrieval (less than 1 sec). Keeping the size of the index small was also a challenge; compression techniques were used for that purpose.
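A compact sketch of the two index-side ideas mentioned above, under simplifying assumptions (whitespace tokenization, no multi-level structure): an inverted index from term to document IDs, with postings stored as gap-encoded, variable-byte-compressed integers to keep the index small.

```python
from collections import defaultdict


def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # continuation bit marks the final byte
        out.extend(reversed(chunk))
    return bytes(out)


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: compressed gap-encoded postings}."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings[term].append(doc_id)
    index = {}
    for term, ids in postings.items():
        gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        index[term] = vbyte_encode(gaps)
    return index


index = build_index({1: "search the wikipedia corpus", 5: "wikipedia search engine"})
print(len(index["wikipedia"]), "bytes for the 'wikipedia' postings list")
```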

top-k-influentials-in-temporal-graph

Given a social network graph, our objective is to find the top-k influential nodes such that, if these k nodes are made seeds of information, the information spreads to the maximal number of nodes within a certain number of timestamps. We also wish to optimise k so that there is a reasonable trade-off between cost and time.
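A hedged sketch of the standard greedy approach to this kind of problem (one of several possibilities, not necessarily the project's exact method): repeatedly add the node whose inclusion gives the largest estimated spread under a simple independent-cascade simulation with a fixed number of time steps. The propagation probability `p` and trial counts are illustrative.

```python
import random


def simulate_spread(graph, seeds, p=0.1, steps=5, trials=100):
    """graph: {node: [neighbours]}. Average number of nodes activated from `seeds`."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), set(seeds)
        for _ in range(steps):
            new = set()
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and random.random() < p:
                        new.add(v)
            active |= new
            frontier = new
        total += len(active)
    return total / trials


def greedy_top_k(graph, k):
    """Greedily pick k seeds that maximise estimated spread."""
    seeds = []
    for _ in range(k):
        best = max((n for n in graph if n not in seeds),
                   key=lambda n: simulate_spread(graph, seeds + [n]))
        seeds.append(best)
    return seeds
```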

transformers

🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

try_sentiment

This is an attempt to implement NRC-Canada's sentiment module for SemEval'14.
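A rough sketch of the lexicon-feature idea behind lexicon-based sentiment systems such as NRC-Canada's (the real system uses many more feature groups and an SVM classifier); the tiny lexicon below is illustrative only.

```python
# Hypothetical sentiment lexicon; real systems use large lexicons such as NRC's.
lexicon = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.8}


def lexicon_features(text):
    """Aggregate per-token lexicon scores into a small feature dictionary."""
    scores = [lexicon.get(tok, 0.0) for tok in text.lower().split()]
    positives = [s for s in scores if s > 0]
    negatives = [s for s in scores if s < 0]
    return {
        "sum": sum(scores),
        "max": max(scores, default=0.0),
        "num_pos": len(positives),
        "num_neg": len(negatives),
    }


print(lexicon_features("The food was great but the service was terrible"))
```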
