Reuse analyzer

This is a fairly simple Awk script to analyze one or more text files and report on matches or near-matches. It uses the Levenshtein distance algorithm to perform fuzzy matches on lines of text.

It generates a report showing each match and its score (the fuzzy matching ratio). The minimum ratio to report can be set on the command line with the minratio option. You can use the report to find strings to reuse, or to catch inconsistencies.
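
To make the scoring concrete, here is a minimal sketch of a Levenshtein distance and a ratio derived from it. It is not taken from analyzer.awk: the function name lev_ratio and the formula (1 minus the distance divided by the longer length) are assumptions for illustration, and the real script may compute its ratio differently.

```shell
# lev_ratio A B: print the Levenshtein distance and a similarity ratio.
# ASSUMPTION: ratio = 1 - distance / longer length. analyzer.awk may differ.
# Note: awk -v interprets backslash escapes in the arguments.
lev_ratio() {
	awk -v a="$1" -v b="$2" '
	function min3(x, y, z) { if (y < x) x = y; if (z < x) x = z; return x }
	# Dynamic-programming Levenshtein distance, keeping two rows at a time.
	function levenshtein(s, t,    m, n, i, j, cost, prev, cur) {
		m = length(s); n = length(t)
		for (j = 0; j <= n; j++) prev[j] = j
		for (i = 1; i <= m; i++) {
			cur[0] = i
			for (j = 1; j <= n; j++) {
				cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
				# deletion, insertion, substitution
				cur[j] = min3(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
			}
			for (j = 0; j <= n; j++) prev[j] = cur[j]
		}
		return prev[n]
	}
	BEGIN {
		d = levenshtein(a, b)
		longer = (length(a) > length(b)) ? length(a) : length(b)
		ratio = (longer == 0) ? 1 : 1 - d / longer
		printf "%d %.2f\n", d, ratio
	}'
}

lev_ratio "Click the OK button." "Click the Ok button."
```

Under this assumed formula, two 20-character sentences that differ in one character score 0.95, which would just meet the default minratio.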

Preparing the text

Use plain text, stripped of all markup. Each line should be a single sentence or block (paragraph). For DITA, use a script like this to prepare your text (requires the org.lwdita plugin):

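# Convert the bookmap to GitHub-flavored Markdown (needs the org.lwdita plugin).
dita --format=markdown_github --input=my.bookmap --args.rellinks=none
cd out
rm index.md
# Convert each Markdown topic to plain text, one block per line.
for i in *.md; do
	f=$(basename "$i" .md)
	pandoc -t plain --wrap=none -o "$f.txt" "$i"
done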

This puts each block element on a single line.

For word processor documents, save as text.

Running the script

Use this command to run the script:

awk -f analyzer.awk [-v option=value]... files...

You can specify multiple options. Available options are:

minratio (default: 0.95) : The minimum fuzzy matching ratio to report on. A value of 1 turns off fuzzy matching, reporting only exact matches.

minlength (default: 0) : The minimum string length to compare.

ignorecase (default: 0) : Set to 1 to make matching case-insensitive.

quiet (default: 0) : Set to 1 to suppress status and progress messages.

progress (default: 100) : Prints a status/progress message after comparing this many blocks.

Performance

Fuzzy matching is processor-intensive, especially in a script. The most effective way to improve performance is to limit the number of fuzzy comparisons. The script uses these techniques to avoid them:

  • Blocks (lines of text) are not compared to themselves.
  • After comparing a block to everything else, the analyzer throws out the block to avoid duplicate comparisons. This by itself cuts runtime in half.
  • If two blocks differ in length so much that the result could never reach minratio, they are not compared. This makes a big difference in the time needed.
  • The script checks for an exact match before attempting the fuzzy match.
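
The checks above can be sketched as a loop that only counts which pairs would still need a fuzzy comparison. This is an illustration, not code from analyzer.awk. The function name plan_comparisons is hypothetical, and the length test assumes the ratio is 1 minus the distance divided by the longer length, so it can never exceed the shorter length divided by the longer one.

```shell
# plan_comparisons MINRATIO: read one block per line on stdin and report how
# many pairs still need a fuzzy comparison after the cheap checks.
# ASSUMPTION: ratio = 1 - distance / longer length, so the best possible
# ratio for a pair is shorter length / longer length.
plan_comparisons() {
	awk -v minratio="$1" '
	{ block[NR] = $0 }
	END {
		for (i = 1; i <= NR; i++) {
			# Start at i + 1: no self-comparison, and each pair is visited once,
			# which has the same effect as discarding a block once it is done.
			for (j = i + 1; j <= NR; j++) {
				la = length(block[i]); lb = length(block[j])
				shorter = (la < lb) ? la : lb
				longer = (la < lb) ? lb : la
				# Appending "" forces a string comparison, even for numeric-looking text.
				if ((block[i] "") == (block[j] "")) exact++
				else if (shorter < minratio * longer) skipped++  # can never reach minratio
				else fuzzy++                                     # only these need Levenshtein
			}
		}
		printf "exact=%d skipped=%d fuzzy=%d\n", exact, skipped, fuzzy
	}'
}

printf '%s\n' "Click OK." "Click OK." "Click Ok." \
	"Click the Cancel button to discard your changes." | plan_comparisons 0.95
```

In this small sample, only two of the six pairs would reach the expensive fuzzy comparison. The rest are settled by the exact test or ruled out by length.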

A small to mid-size document set (about 2200 paragraphs) takes just under 5 minutes to analyze on a late-2013 iMac. You can compile the script with awka for a performance boost: the same document set takes about 1.5 minutes with the compiled version. To compile the script, use:

awka -X -f analyzer.awk
mv awka.out analyzer

Limitations vs. commercial offerings

Commercial reuse analyzers tend to have nicer interfaces, and some meant for DITA can build a collection (or library) file of strings and automatically apply them to your topics.

On the other hand, if your budget is very limited and you have a small document set to analyze, this one might be exactly what you need.

Contributors

larrykollar

reuse_analyzer's Issues

The report has redundancies for groups of 3+

If there is a group of three or more matches, like this:

Matches for file1.txt block 3:
foo bar baz
    file2.txt block 7 (ratio 1):
    foo bar baz
    file3.txt block 12 (ratio 1):
    foo bar baz

Then the report also has a redundant group containing all but the first block:

Matches for file2.txt block 7:
foo bar baz
    file3.txt block 12 (ratio 1):
    foo bar baz
    foo bar baz

A second script that reads the report from a pipe would be a good way to weed out redundant groups.
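
One possible shape for such a filter is sketched below. It is not part of the project. The function name dedupe_report is hypothetical, and the parsing assumes the report looks exactly like the samples above: a "Matches for FILE block N:" header, one line with the block's text, then pairs of lines indented by four spaces. Any other lines, including blank ones, are dropped. A match is treated as redundant when its block and the header block already appeared together in an earlier group.

```shell
# dedupe_report: filter a report on stdin, dropping matches whose two blocks
# were already listed together in an earlier group.
# ASSUMPTION: the report format is exactly the one shown in the samples.
dedupe_report() {
	awk '
	# True if blocks x and y were both members of one earlier group.
	function together(x, y,    g) {
		for (g = 1; g <= ngroups; g++)
			if (((g, x) in member) && ((g, y) in member)) return 1
		return 0
	}
	# Print the buffered group without its redundant matches, then record it.
	function emit_group(    k, out) {
		if (head == "") return
		out = ""
		for (k = 1; k <= n; k++)
			if (!together(head_id, id[k])) out = out entry[k] "\n"
		if (out != "") printf "%s\n%s\n%s", head, text, out
		ngroups++
		member[ngroups, head_id] = 1
		for (k = 1; k <= n; k++) member[ngroups, id[k]] = 1
		head = ""; n = 0
	}
	state == 1 { text = $0; state = 2; next }                    # text of the header block
	state == 3 { entry[n] = entry[n] "\n" $0; state = 2; next }  # text of a matched block
	/^Matches for .* block [0-9]+:$/ {
		emit_group()
		head = $0; head_id = $0
		sub(/^Matches for /, "", head_id); sub(/:$/, "", head_id)
		state = 1; next
	}
	/^    / && state == 2 {                                      # "    FILE block N (ratio R):"
		n++; entry[n] = $0; id[n] = $0
		sub(/^ +/, "", id[n]); sub(/ \(ratio [^)]*\):$/, "", id[n])
		state = 3; next
	}
	END { emit_group() }'
}
```

Assuming the analyzer writes its report to standard output, usage might look like: awk -f analyzer.awk -v quiet=1 files... | dedupe_report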
