
plagiarism-checker's Introduction

Plagiarism-Checker

A utility to check if a document's contents are plagiarised.

How it works

  • It searches the web through the Google Search API. The queries are n-grams extracted from the source text file.
  • The content matched at each resulting URL is checked for similarity against the query text.
  • The average similarity across all URLs is written to the output text file.
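The query-extraction and averaging steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names and the n-gram size are assumptions, and it is written for Python 3 even though the instructions below target Python 2.7.

```python
def extract_ngrams(text, n=9):
    """Return overlapping word n-grams from text, each joined into a query string.

    The default n is an assumed value chosen for illustration only.
    """
    words = text.split()
    if len(words) <= n:
        # Short inputs yield a single query made of all the words.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def average_similarity(scores):
    """Average the per-URL similarity scores; 0.0 when no URL was scored."""
    return sum(scores) / len(scores) if scores else 0.0


queries = extract_ngrams("the quick brown fox jumps over the lazy dog", n=4)
print(queries[0])  # the quick brown fox
print(average_similarity([0.5, 1.0]))  # 0.75
```

Each query string would then be sent to the search API, and the score for each returned URL fed into the average.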

Required Libraries

The project uses the python-docx module to decode docx files. python-docx has its own dependencies. The required libraries are:

  • PIL
  • lxml
  • python-dateutil
  • python-docx

GETTING LIBRARIES ON LINUX

  • Get easy_install
sudo apt-get install python-setuptools
  • Install PIP
sudo easy_install pip
  • Install dependent libraries
sudo pip install PIL

sudo pip install lxml

sudo pip install python-dateutil
  • Install python-docx
sudo pip install docx
  • Install pdftotext for pdf support (sketchy at the moment)
sudo apt-get install poppler-utils
  • Get ppt and doc support
sudo apt-get install catdoc

GETTING LIBRARIES ON WINDOWS

These steps assume you already have Python installed and that Python is in your Windows environment variables.

Download the setuptools installer that matches your Python version (Python 2.7 in most cases).

Run the .exe file. The installer automatically finds your Python installation from the registry and installs easy_install into the Scripts directory of that installation.

Once the installer has run, add easy_install to the Windows PATH environment variable.

  • Open a command window
  • Run the following command:
easy_install pip
  • Then install the required libraries for docx support
pip install PIL

pip install lxml

pip install python-dateutil

pip install docx
  • EXEs for pdf, ppt and doc support are included in the package. Nothing else needs to be installed.

Folder Structure

  • assets/

Holds Twitter Bootstrap CSS and Javascript files and images/glyphicons

  • config/

Stores configuration data (Path to Python on Windows)

  • scripts/

Contains python scripts to perform plagiarism checks

  • temp/

Contains uploaded files

Python Scripts

The backend is written in Python. There are three scripts in total.

  • scripts/main.py

Main script, which produces the plagiarism results

  • scripts/htmlstrip.py

Strips HTML tags from text

  • scripts/cosineSim.py

Helper module that computes the cosine similarity between strings
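For readers unfamiliar with the two helper steps, here is a minimal sketch of tag stripping and of cosine similarity over word counts. It is not the code in htmlstrip.py or cosineSim.py: the names strip_tags and cosine_similarity are assumptions, the tokenisation is deliberately simple, and it uses the Python 3 standard library only.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    vec_a = Counter(re.findall(r"\w+", a.lower()))
    vec_b = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


page_text = strip_tags("<p>Hello <b>world</b></p><script>var x;</script>")
print(page_text)  # Hello world
print(cosine_similarity(page_text, "goodbye moon"))  # 0.0
```

A score near 1 means the two strings use almost the same words in the same proportions; 0 means they share no words.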

Usage of Python Script (Standalone)

python main.py sampleText.txt sampleOut.txt

plagiarism-checker's People

Contributors

architshukla, askmrsinh, shashankrao


plagiarism-checker's Issues

Scripts Download: Automated?

When the user asks to download the scripts, why not tar the latest copy of the scripts on the fly and make it available for download, instead of always statically generating scripts.zip?
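One way this could look, as a sketch only: the archive is built in memory with the standard tarfile module each time it is requested. The function name, the "scripts" top-level folder name inside the archive, and the choice to leave out .pyc files are all assumptions, and the code targets Python 3.

```python
import io
import os
import tarfile


def _skip_bytecode(member):
    """tarfile filter: drop compiled .pyc files, keep everything else."""
    return None if member.name.endswith(".pyc") else member


def build_scripts_archive(scripts_dir):
    """Return the bytes of a gzip-compressed tar of the files under scripts_dir."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # arcname stores paths relative to a top-level "scripts" folder.
        archive.add(scripts_dir, arcname="scripts", filter=_skip_bytecode)
    return buffer.getvalue()


if __name__ == "__main__" and os.path.isdir("scripts"):
    payload = build_scripts_archive("scripts")
    print("archive size in bytes:", len(payload))
```

The returned bytes could be streamed straight to the browser, so the download always reflects the current contents of the scripts folder.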

sampleOut.txt not showing result

I copied some content from Wikipedia into the sampleText.txt file and executed the Python script as described (python main.py sampleText.txt sampleOut.txt), but sampleOut.txt was empty when I opened it. Please help.

DOCX

Docx support can be implemented by using the python-docx module and then dumping the docx text contents into a txt file.

I shall do this in a separate branch.

Importance: Medium
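To make the "dump the docx text into a txt file" step concrete, here is a rough sketch. Note that it does not use python-docx as proposed above: it swaps in the Python 3 standard library (zipfile and ElementTree) so that it runs without extra installs, and it only reads plain paragraph text. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t run elements.
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def dump_docx(docx_path, txt_path):
    """Write the extracted text to txt_path so the existing txt flow can read it."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path))
```

With python-docx itself, the equivalent would be to open the file with its Document class and join the text of each paragraph.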

Other Metrics?

Should we look into classifying documents using other metrics?

If so, send me wiki links for the metric and the math behind it.

We can provide different metric results.

log_xxxx file is empty

Hi, I am getting this error:
"Warning: fread(): Length parameter must be greater than 0 in readLogs.php on line 8"

I can see that the log_xxxxxx file is empty, while the input_xxxxxx file contains text, which is fine. Can you please help with the empty log file?

Thanks,

Usage in Mac

Hello,
Will this program work on a Mac?

Thanks

Log Data / Log View

It would be nice to view the logs on the front end, showing what happened while the script executed.

Can this be done?

Wiki Page

  • Make a wiki page for this later.
  • Set up a Git Page, if possible

PIL on windows

When I run the new Python script on Windows (after installing all modules), it tells me that the Image module was not found. I looked it up and found that it has something to do with the Python Imaging Library (PIL) not being installed properly, and that it's a pain to install on Windows. Can we work around this or find a tutorial for installing this module on Windows?

Asynchronous execution on windows

The current inability to execute the Python scripts asynchronously on Windows leads to a different user experience for Linux and Windows users:

  • The growth of the progress bar is not visible on Windows.
  • Results are displayed only once the entire script ends, directly displaying statistics.
  • This produces a sense of latency and is undesirable for users.

Update UI to Bootstrap 3

This needs changes to some classes as well as to how some buttons are themed.

This will also allow us to experiment with "themes", possibly letting the user select the theme they want for the front end. (Most themes are based on Bootstrap 3.0.)

.pyc Files

Shashank, what are the .pyc files for? Should they also be included in the .zip file?

Navbar

There is currently no way to go back to the main page after uploading a document.

This needs to be fixed. Other things that could be shifted to the navbar could be learn more or download scripts.

OR

Implement a back to main page button in the last page.

Ideas?

PDF support on windows

PDF support can be easily implemented on Windows.

The tool needed is pdftotext, which is open source under the GPL.

Can you figure out how to install it on Windows and add it to the PATH? Then we can extend PDF support to Windows.

Python - exceptions not in log file

As things are now, when Python throws an error, it writes to stderr. Can this be changed to stdout (by handling those exceptions and echoing a common string along with a description of the exception)? This is because the log file doesn't contain any exceptions (at least on Windows), and so the user will not know what caused the problem.

Is this feasible?
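A minimal sketch of what this could look like, assuming the script's entry point can be wrapped in a single function call. The wrapper name and the "PYTHON_ERROR" marker string are hypothetical, and the code targets Python 3.

```python
import sys
import traceback


def run_and_report(task, *args):
    """Run task(*args); on failure print a tagged message to stdout and return 1."""
    try:
        task(*args)
        return 0
    except Exception as exc:
        # "PYTHON_ERROR" is a hypothetical marker the log reader could search for.
        print("PYTHON_ERROR: %s: %s" % (type(exc).__name__, exc))
        # Send the traceback to stdout as well, so it lands in the same log.
        traceback.print_exc(file=sys.stdout)
        sys.stdout.flush()
        return 1


def failing_task():
    raise ValueError("bad input")


exit_code = run_and_report(failing_task)
print("exit code:", exit_code)  # exit code: 1
```

Because everything goes through stdout, whatever captures the script's output into the log file would also capture the exception text.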

Referer Bug

After a lot of searches using the Google Search API, the API stops working.

This is most likely due to the referer reaching the maximum number of searches allowed.

I shall look into workarounds.

Importance: High

Statistics Log

Could the log_time on logs.php be linked to the actual log file?

Since the log file is a text file, it should be possible to link to it and display it as a link (preferably opening in a new tab).

This would help in debugging and finding out what's wrong if the progress bar stops midway.
