text-statistics's People

Contributors

aaron3, bryant1410, davechild, dominicvonk, drdub, garymarkfuller, jrfnl, migurski, mryand, repat, richtom80

text-statistics's Issues

Grade Levels capped at 12

Most reading level scoring tools I have worked with (Hemingway, etc.) do not cap the grade level, but this tool seems to have a hard-coded cap of 12.

I work at a higher education institution where much text is drafted at a Flesch-Kincaid 14-18 level and where we are happy if we can work with the author to bring it down to 12. In its current form this tool does not distinguish between these levels.

Ideally there would be a way to specify a max grade level rather than having 12 be hardcoded, e.g.:

$textStatistics = new DaveChild\TextStatistics\TextStatistics;
$textStatistics->setMaxGradeLevel(18);
$grade_level = $textStatistics->fleschKincaidGradeLevel( $string );
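
A configurable cap along these lines might look like the following minimal sketch. It applies the standard Flesch-Kincaid grade-level formula to counts supplied by the caller. The function name `fleschKincaidGradeSketch` and the `$maxGrade` parameter are hypothetical and not part of the library, and the lower bound of 0 is an assumption.

```php
<?php
/**
 * Hypothetical sketch: Flesch-Kincaid grade level with a configurable cap.
 * Not the library's implementation; word, sentence and syllable counts
 * are passed in by the caller.
 */
function fleschKincaidGradeSketch(int $words, int $sentences, int $syllables, ?float $maxGrade = 12.0): float
{
    if ($words === 0 || $sentences === 0) {
        return 0.0; // nothing to score
    }

    // Standard formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    $raw = 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;

    $grade = max(0.0, $raw);          // assumed lower bound
    if ($maxGrade !== null) {
        $grade = min($maxGrade, $grade); // cap only when one is requested
    }

    return round($grade, 1);
}

echo fleschKincaidGradeSketch(100, 4, 170), "\n";       // default cap of 12 applies
echo fleschKincaidGradeSketch(100, 4, 170, 18.0), "\n"; // cap raised to 18
echo fleschKincaidGradeSketch(100, 4, 170, null), "\n"; // no cap at all
```

With a cap of 18 or no cap, text in the 12 to 18 range keeps its real score instead of collapsing to 12.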

Numbers are not handled correctly

From Google Code:

Numbers written as digits in the text (1, 20, 100, etc.) may not be handled
correctly.

Currently an unknown - should "20" be counted as two syllables ("twen-ty")
or as one syllable? Or should it be excluded from the calculations?

Tagged Versions

Hi there! Thanks for a really useful package, we really appreciate it. Would you be willing to tag a first version (even if it's a beta 0.1.0) on the project? I'd like to use this in production, but pulling it in through Composer using dev-master is a bit risky. If I were to run a composer update and pull in a breaking change without noticing it, I would have some very unhappy customers. That'd be bad.

Thanks again for the package and your consideration :)

word_count() is not accurate when counting sentences with quotes

Issue transferred from Google Code:

Here's the test case:

public function testWordCountWithQuotes() {
    $textStats = new TextStatistics();
    // Single-quoted so the embedded double quotes are part of the string
    $text = '"There should be seven words," said Joe';

    $expected = 7;
    $actual = $textStats->word_count($text); // value is 8

    $this->assertEqual($actual, $expected);
}

Here's a possible fix:

In clean_text(), replace:

$strText = preg_replace('/[,:;()-]/', ' ', $strText); // Replace commas, hyphens etc (count them as spaces)

with:

$strText = preg_replace('/[",:;()-]/', ' ', $strText); // Replace double quotes, commas, hyphens etc (count them as spaces)
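
To see why the extra character matters, here is a minimal standalone sketch. `countWordsSketch` is a hypothetical, much simplified stand-in for clean_text() plus word_count() (it only replaces the given punctuation with spaces, collapses whitespace, and counts spaces plus one), so it illustrates the effect rather than reproducing the library's behaviour.

```php
<?php
/**
 * Hypothetical, simplified stand-in for clean_text() + word_count().
 * $punctuationClass is the body of a regex character class.
 */
function countWordsSketch(string $text, string $punctuationClass): int
{
    // Replace the listed punctuation with spaces
    $text = preg_replace('/[' . $punctuationClass . ']/', ' ', $text);
    // Collapse runs of whitespace and trim the ends
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Space count + 1 is the word count
    return $text === '' ? 0 : substr_count($text, ' ') + 1;
}

$text = '"There should be seven words," said Joe';

echo countWordsSketch($text, ',:;()-'), "\n";  // 8: the stray closing quote counts as a word
echo countWordsSketch($text, '",:;()-'), "\n"; // 7: quotes are treated as spaces
```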

Empty text returning word/sentence/syllable counts of 1

When empty text is passed, the following functions are returning 1 instead of 0: word_count, syllable_count and sentence_count.

This is causing some errors when statistics are being calculated for empty texts.

A fix for this problem would be applying the following changes to the code:

TextStatistics.php word_count method

         /**
         * Returns word count for text.
         * @param   strText      Text to be measured
         */
        public function word_count($strText) {
            if(strlen(trim($strText)) == 0){
                return 0;
            }

            $strText = $this->clean_text($strText);

            // Will be tripped up by em dashes with spaces either side, among other similar characters
            $intWords = 1 + $this->text_length(preg_replace('/[^ ]/', '', $strText)); // Space count + 1 is word count
            return $intWords;
        }

TextStatistics.php syllable_count method

        /**
         * Returns the number of syllables in the word.
         * Based in part on Greg Fast's Perl module Lingua::EN::Syllables
         * @param   strWord      Word to be measured
         */
        public function syllable_count($strWord) {
            if(strlen(trim($strWord)) == 0){
                return 0;
            }

            // Should be no non-alpha characters
            $strWord = preg_replace('/[^A-Za-z]/', '', $strWord);

            $intSyllableCount = 0;
            $strWord = $this->lower_case($strWord);

            // Specific common exceptions that don't follow the rule set below are handled individually
            // Array of problem words (with word as key, syllable count as value)
            $arrProblemWords = Array(
                 'simile' => 3
                ,'forever' => 3
                ,'shoreline' => 2
            );
            if (isset($arrProblemWords[$strWord])) {
                return $arrProblemWords[$strWord];
            }

            // These syllables would be counted as two but should be one
            $arrSubSyllables = Array(
                 'cial'
                ,'tia'
                ,'cius'
                ,'cious'
                ,'giu'
                ,'ion'
                ,'iou'
                ,'sia$'
                ,'[^aeiuoyt]{2,}ed$'
                ,'.ely$'
                ,'[cg]h?e[rsd]?$'
                ,'rved?$'
                ,'[aeiouy][dt]es?$'
                ,'[aeiouy][^aeiouydt]e[rsd]?$'
                //,'^[dr]e[aeiou][^aeiou]+$' // Sorts out deal, deign etc
                ,'[aeiouy]rse$' // Purse, hearse
            );

            // These syllables would be counted as one but should be two
            $arrAddSyllables = Array(
                 'ia'
                ,'riet'
                ,'dien'
                ,'iu'
                ,'io'
                ,'ii'
                ,'[aeiouym]bl$'
                ,'[aeiou]{3}'
                ,'^mc'
                ,'ism$'
                ,'([^aeiouy])\1l$'
                ,'[^l]lien'
                ,'^coa[dglx].'
                ,'[^gq]ua[^auieo]'
                ,'dnt$'
                ,'uity$'
                ,'ie(r|st)$'
            );

            // Single syllable prefixes and suffixes
            $arrPrefixSuffix = Array(
                 '/^un/'
                ,'/^fore/'
                ,'/ly$/'
                ,'/less$/'
                ,'/ful$/'
                ,'/ers?$/'
                ,'/ings?$/'
            );

            // Remove prefixes and suffixes and count how many were taken
            $strWord = preg_replace($arrPrefixSuffix, '', $strWord, -1, $intPrefixSuffixCount);

            // Remove non-word characters from word
            $strWord = preg_replace('/[^a-z]/is', '', $strWord);
            $arrWordParts = preg_split('/[^aeiouy]+/', $strWord);
            $intWordPartCount = 0;
            foreach ($arrWordParts as $strWordPart) {
                if ($strWordPart <> '') {
                    $intWordPartCount++;
                }
            }

            // Some syllables do not follow normal rules - check for them
            // Thanks to Joe Kovar for correcting a bug in the following lines
            $intSyllableCount = $intWordPartCount + $intPrefixSuffixCount;
            foreach ($arrSubSyllables as $strSyllable) {
                $intSyllableCount -= preg_match('/' . $strSyllable . '/', $strWord);
            }
            foreach ($arrAddSyllables as $strSyllable) {
                $intSyllableCount += preg_match('/' . $strSyllable . '/', $strWord);
            }
            $intSyllableCount = ($intSyllableCount == 0) ? 1 : $intSyllableCount;
            return $intSyllableCount;
        }

TextStatistics.php sentence_count method

        /**
         * Returns sentence count for text.
         * @param   strText      Text to be measured
         */
        public function sentence_count($strText) {
            if(strlen(trim($strText)) == 0){
                return 0;
            }
            $strText = $this->clean_text($strText);
            // Will be tripped up by "Mr." or "U.K.". Not a major concern at this point.
            $intSentences = max(1, $this->text_length(preg_replace('/[^\.!?]/', '', $strText)));
            return $intSentences;
        }

Keeps throwing error?

PHP Notice: Undefined offset: #### in /vendor/davechild/textstatistics/src/DaveChild/TextStatistics/Syllables.php on line 380

Is this project still alive?

There are a couple of things I found that could do with improving or fixing, but I'm wondering whether to spend the time on it, as the project seems dormant.

Can someone please let me know the status of the project and/or the policy for contributing?

Addition of Läsbarhetsindex

I've written code for calculating the Läsbarhetsindex (Björnsson, 1968).

LIX = 100 * RLW + ASL

Where:
RLW (ratio of long words to all words) = number of long words / number of words
ASL (average sentence length) = number of words / number of sentences

and long words have more than six characters.

This readability formula was developed for Swedish, but also works for German and English. More information:

Ott, N. (2009). Information retrieval for language learning: An exploration of text difficulty measures. ISCL master's thesis, Universität Tübingen, Seminar für Sprachwissenschaft, Tübingen, Germany, p. 19.

May I commit these additions directly to the master branch?
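
For reference, the formula above can be sketched in a few lines. This is not the code from the proposed commit: `lixSketch` is a hypothetical name, and the word splitting (runs of letters and digits) and sentence counting (runs of ., ! or ?) are simplifying assumptions.

```php
<?php
/**
 * Hypothetical sketch of LIX = 100 * RLW + ASL.
 * Word and sentence detection are deliberately naive.
 */
function lixSketch(string $text): float
{
    // Words: runs of Unicode letters or digits
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $wordCount = count($words);
    if ($wordCount === 0) {
        return 0.0;
    }

    // Long words have more than six characters (counted as Unicode code points)
    $longWords = 0;
    foreach ($words as $word) {
        if (preg_match('/^.{7,}$/u', $word) === 1) {
            $longWords++;
        }
    }

    // Sentences: runs of terminators, with a minimum of one sentence
    $sentenceCount = max(1, preg_match_all('/[.!?]+/', $text));

    return 100 * $longWords / $wordCount + $wordCount / $sentenceCount;
}

// 6 words, 3 of them long, 2 sentences: 100 * 3/6 + 6/2 = 53
echo lixSketch('The cat sat. Extraordinary circumstances happened.'), "\n";
```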

Can't create 2 instances

If you create a second instance of the statistics module, the first one has already loaded all the words with include_once so the second instance doesn't get the words.

I suggest making a TextStatistics::instance() method which returns an instance, and stores it in a static variable in the class, so if you run it again you get the same object not a new one.
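
The behaviour described can be reproduced in isolation. The sketch below uses a hypothetical `WordListDemo` class and a temporary file as a stand-in for the library's word list; it assumes the word file defines a local variable when included, which may differ from the actual file layout.

```php
<?php
class WordListDemo
{
    public $words;

    public function __construct(string $file, bool $once = true)
    {
        $words = []; // stays empty if the include below is skipped
        if ($once) {
            include_once $file; // skipped when the file was already included
        } else {
            include $file;      // runs on every construction
        }
        $this->words = $words;
    }
}

// Hypothetical stand-in for the word-list file: defines $words in the including scope
$file = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($file, '<?php $words = ["simile" => 3];');

$first  = new WordListDemo($file);        // include_once loads the list
$second = new WordListDemo($file);        // include_once is a no-op, list is empty
$third  = new WordListDemo($file, false); // plain include loads it again

echo count($first->words), ' ', count($second->words), ' ', count($third->words), "\n"; // 1 0 1
unlink($file);
```

Besides a shared instance, switching to a plain `include` or caching the list in a static property would also let a second instance see the words.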

Doesn't work for Cyrillic symbols

Sample text:

Лондон — город и столица Соединённого Королевства Великобритании и Северной Ирландии. Административно образует регион Англии Большой Лондон, разделённый на 32 самоуправляемых района и Сити.
Население — 8,3 млн человек (2012 год), второй по величине город Европы и крупнейший в Евросоюзе. Образует агломерацию «Большой Лондон» и более обширный метрополитенский район. Расположен на юго-востоке острова Великобритания, на равнине Лондонского бассейна, в устье Темзы вблизи Северного моря.
Главный политический, экономический и культурный центр Великобритании. Экономика города занимает пятую часть экономики страны. Относится к глобальным городам высшего ранга, ведущим мировым финансовым центрам (наряду с Нью-Йорком).

This text returns values of "-" on http://www.readability-score.com/

Question: Text Statistics for Other Languages (Korean)

Slightly tangential so apologies in advance.

Do any of you know any text statistics that work with other languages? I'm looking for modifications of these metrics that would work with Korean.

Thanks for any pointers you could give me.

Syllable Counting Error

Issue transferred from Google Code:

Hmm, I was testing it out on random text and noticed that "the reading
kitten" gave an output of 4 syllables but "the kitten reading" gives 5.
Why does it give two different results?

Combined words are not handled

From Google Code:

Words which combine letters and numbers are not handled correctly. For
example, "3a" in text should be counted as two separate words, each of one
syllable.
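
One possible pre-processing step, sketched below, is to insert a space wherever a digit meets a letter so that "3a" becomes two tokens. `splitLettersAndDigitsSketch` is a hypothetical helper, not a library function, and it only handles ASCII letters.

```php
<?php
/**
 * Hypothetical helper: separate adjacent letters and digits with a space
 * so that each part is later counted as its own word.
 */
function splitLettersAndDigitsSketch(string $text): string
{
    // Zero-width match at every digit/letter or letter/digit boundary
    return preg_replace('/(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)/', ' ', $text);
}

echo splitLettersAndDigitsSketch('Take exit 3a, then the A40.'), "\n";
// Take exit 3 a, then the A 40.
```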

Flesch-Kincaid score always returns 0

I am trying to use the library, but the Flesch-Kincaid result is always 0:

$this->textStatistics = new DaveChild\TextStatistics\TextStatistics();
print $this->textStatistics->fleschKincaidReadingEase("Hello, my name is Mika"); // 0
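
As a sanity check on what a non-zero result should look like, the standard Flesch Reading Ease formula can be applied by hand. This is a sketch, not the library's code: `fleschReadingEaseSketch` is a hypothetical name, and the counts for the sample sentence (5 words, 1 sentence, 7 syllables) were counted manually, so the library's syllable counter may give slightly different numbers.

```php
<?php
/**
 * Hypothetical sketch of the standard Flesch Reading Ease formula:
 * 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
 */
function fleschReadingEaseSketch(int $words, int $sentences, int $syllables): float
{
    if ($words === 0 || $sentences === 0) {
        return 0.0; // nothing to score
    }

    return round(206.835 - 1.015 * ($words / $sentences) - 84.6 * ($syllables / $words), 1);
}

// "Hello, my name is Mika": 5 words, 1 sentence, 7 syllables (hand-counted)
echo fleschReadingEaseSketch(5, 1, 7), "\n"; // 83.3
```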

If I enter the same text in the demo page, I get a different value.

Composer app keeps uninstalling itself

After a few usages, I keep getting this error:

Fatal error: Uncaught Error: Class "DaveChild\TextStatistics\TextStatistics" not found

I have to reinstall the app for it to work again.

It'll work a few times, and then I'm getting that error again.

The app folder is in the vendor folder.

Semver?

It would be great if this project's version adhered to some kind of semantic versioning.

I had projects that were requiring ^1.0.2, that suddenly broke because there was a breaking change in a patch release (1.0.2 to 1.0.3).

I see now that you have a note in the readme to specify 1.0.2 explicitly if you need < 7.2 support, but that wouldn't even be necessary to call out if semver were followed more strictly.

Flesch-Kincaid statistics are both in error

Both the flesch_kincaid_reading_ease() and flesch_kincaid_grade_level() methods are maxing out: the first at 100 and the latter at 19.

Every text block we try has the same issue. And the stats don't tally with those found on readability-score.com

Just FYI - maybe a recent commit has caused a bug to creep in?

SMOG calculation discrepancies

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but with $textStatistics->smogIndex($input) only 9.4. This is a big difference. Am I doing something wrong?
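
One thing that may be worth checking is how each tool counts sentences, since the commonly cited SMOG formula scales the polysyllable count by 30 / sentences. The sketch below is not the library's smogIndex(); `smogSketch` is a hypothetical function taking counts directly, and the counts used are made up purely to show the sensitivity.

```php
<?php
/**
 * Hypothetical sketch of the commonly cited SMOG formula:
 * 1.043 * sqrt(polysyllables * (30 / sentences)) + 3.1291
 */
function smogSketch(int $polysyllableWords, int $sentences): float
{
    if ($sentences === 0) {
        return 0.0; // nothing to score
    }

    return round(1.043 * sqrt($polysyllableWords * (30 / $sentences)) + 3.1291, 1);
}

// Same (made-up) polysyllable count, different sentence counts:
echo smogSketch(20, 2), "\n"; // fewer detected sentences gives a higher score
echo smogSketch(20, 4), "\n"; // more detected sentences gives a lower score
```

A sample with abbreviations, a date, and "$184-a-share" gives a naive sentence splitter plenty of chances to disagree with another tool.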

Increase performance

Processing the same text over-and-over is not a good idea. Just process it once and store it.

 class TextStatistics {

    protected $text;
    protected $strEncoding = ''; // Used to hold character encoding to be used by object, if set

    /**
     * Constructor.
     *
     * @param string  $text    Optional text to parse.
     * @return void
     */
    public function __construct($text = NULL) {
        if($text) {
            $this->setText($text);
        }
    }

    /**
     * Set the text to parse
     */
    public function setText($text) {
        $this->strEncoding = mb_detect_encoding($text);
        $this->text = $this->clean_text($text);
    }

    /**
     * Fetch the current object text
     */
    public function getText() {
        return $this->text;
    }

    /**
     * Fetch the current object encoding
     */
    public function getEncoding() {
        return $this->strEncoding;
    }

    /**
     * Gives the Flesch-Kincaid Reading Ease of the stored text, rounded to one digit.
     * Uses the text previously passed to setText() or the constructor.
     */
    public function flesch_kincaid_reading_ease() {
        $strText = $this->text;

....

After making this simple change, the processing time for me on a small document dropped from 0.16 seconds to 0.11 seconds due to the reduced clean_text calls.
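
The same idea in a self-contained form, for anyone who wants to try it outside the library. `CachedTextSketch` is a hypothetical class with a deliberately trivial cleaning step and two naive metrics; the `$cleanCalls` counter exists only to show that cleaning happens once.

```php
<?php
class CachedTextSketch
{
    public $cleanCalls = 0; // instrumentation for the demo only
    private $text = '';

    public function setText(string $text): void
    {
        // Clean once and keep the result for every later metric
        $this->text = $this->cleanText($text);
    }

    private function cleanText(string $text): string
    {
        $this->cleanCalls++;
        return trim(preg_replace('/\s+/', ' ', $text));
    }

    public function wordCount(): int
    {
        return $this->text === '' ? 0 : substr_count($this->text, ' ') + 1;
    }

    public function sentenceCount(): int
    {
        return $this->text === '' ? 0 : max(1, preg_match_all('/[.!?]+/', $this->text));
    }
}

$stats = new CachedTextSketch();
$stats->setText("One  two three.   Four five!");

// 5 words, 2 sentences, and cleanText() ran only once
echo $stats->wordCount(), ' ', $stats->sentenceCount(), ' ', $stats->cleanCalls, "\n";
```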

What license is it listed under?

I'd love to see a LICENSE.txt file in here so that it's clear what license it should be distributed under. Would be great if it is GPL!

Search within div functionality not working

I can't give a particular URL publicly because of confidentiality (I can't publicly say I'm working on it, but I have emailed), but the # specification isn't working. This is annoying, since it works fine on read-able (but I can't do bulk there, which is what I paid for the premium version of the site for!)

Sentence count suggestion

In the inline code comments you already note:

// Will be tripped up by "Mr." or "U.K.". Not a major concern at this point.

I found it is also tripped up by ... or ?!

Just wanted to show you the suggestion below for consideration, as it will at least provide a better count in ... and ?! situations:

$intSentences = max( 1, preg_match_all( '`[^\.!?]+[\.!?]+([\s]+|$)`u', $strText, $matches ) );
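
A quick standalone comparison of the two approaches on a short sample. Here `strlen` stands in for the library's text_length(), which is an assumption; the two patterns are the ones quoted in this issue.

```php
<?php
$text = 'Wait... What?! Yes.';

// Current approach: count every terminator character
$oldCount = max(1, strlen(preg_replace('/[^\.!?]/', '', $text)));

// Suggested approach: count runs of terminators followed by whitespace or end of text
$newCount = max(1, preg_match_all('`[^\.!?]+[\.!?]+([\s]+|$)`u', $text, $matches));

echo $oldCount, ' ', $newCount, "\n"; // 6 3
```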
