davechild / text-statistics

Generate information about text including syllable counts and Flesch-Kincaid, Gunning-Fog, Coleman-Liau, SMOG and Automated Readability scores.

Home Page: https://readable.com/

License: BSD 2-Clause "Simplified" License

PHP 100.00%

text-statistics's Issues

Grade Levels capped at 12

Most reading level scoring tools I have worked with (Hemingway, etc.) do not cap the grade level, but this tool seems to have a hard-coded cap of 12.

I work at a higher education institution where much text is drafted at a Flesch-Kincaid 14-18 level and where we are happy if we can work with the author to bring it down to 12. In its current form this tool does not distinguish between these levels.

Ideally there would be a way to specify a max grade level rather than having 12 be hardcoded, e.g.:

$textStatistics = new DaveChild\TextStatistics\TextStatistics;
$textStatistics->setMaxGradeLevel(18);
$grade_level = $textStatistics->fleschKincaidGradeLevel( $string );
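
A minimal sketch of how a configurable cap could work, assuming the standard Flesch-Kincaid grade level formula. The function name `fk_grade_level` and its `$maxGrade` parameter are hypothetical and not part of the library; `setMaxGradeLevel()` above is likewise only a proposal.

```php
<?php
// Hypothetical helper for illustration; not part of TextStatistics.
// Computes the Flesch-Kincaid grade level from raw counts and applies an
// optional upper cap instead of a hard-coded 12.
function fk_grade_level(int $words, int $sentences, int $syllables, ?float $maxGrade = 12.0): float
{
    if ($words < 1 || $sentences < 1) {
        return 0.0; // nothing to score
    }
    // Standard Flesch-Kincaid grade level formula.
    $grade = 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;
    if ($maxGrade !== null) {
        $grade = min($grade, $maxGrade); // null means "no cap"
    }
    return round($grade, 1);
}

echo fk_grade_level(300, 5, 600), "\n";       // capped at the default of 12
echo fk_grade_level(300, 5, 600, 18.0), "\n"; // capped at 18
echo fk_grade_level(300, 5, 600, null), "\n"; // uncapped
```

Passing the cap as a parameter (or storing it via a setter) keeps the default behaviour unchanged for existing callers.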

Extending support for FK Reading Ease for other languages

Would it be possible to add support for other languages?

Best thing would be if we could add other languages ourselves.

Example: Yoast/YoastSEO.js#267

Numbers are not handled correctly

From Google Code:

Numbers within text numerically (1, 20, 100 etc) may not be handled
correctly.

Currently an unknown - should "20" be counted as two syllables ("twen-ty")
or as one syllable? Or should it be excluded from the calculations?

Tagged Versions

Hi there! Thanks for a really useful package, we really appreciate it. Would you be willing to tag a first version (even if it's a beta 0.1.0) on the project? I'd like to use this in production, but pulling it in through composer using dev-master is a bit risky. If I were to run a composer update and pull in a breaking change without noticing it, I would have some very unhappy customers. That'd be bad.

Thanks again for the package and your consideration :)

word_count() is not accurate when counting sentences with quotes

Issue transferred from Google Code:

Here's the test case:

public function testWordCountWithQuotes() {
    $textStats = new TextStatistics();
    // Single-quoted so the double quotes inside the sentence stay literal
    $text = '"There should be seven words," said Joe';

    $expected = 7;
    $actual = $textStats->word_count($text); // value is 8

    $this->assertEqual($actual, $expected);
}

Here's a possible fix:

In clean_text(), replace:

$strText = preg_replace('/[,:;()-]/', ' ', $strText); // Replace commas, hyphens etc (count them as spaces)

with:

$strText = preg_replace('/[",:;()-]/', ' ', $strText); // Replace double quotes, commas, hyphens etc (count them as spaces)

Empty text returning word/sentence/syllable counts of 1

When empty text is passed, the following functions are returning 1 instead of 0: word_count, syllable_count and sentence_count.

This is causing some errors when statistics are being calculated for empty texts.

A fix for this problem would be applying the following changes to the code:

TextStatistics.php word_count method

         /**
         * Returns word count for text.
         * @param   strText      Text to be measured
         */
        public function word_count($strText) {
            if(strlen(trim($strText)) == 0){
                return 0;
            }

            $strText = $this->clean_text($strText);

            // Will be tripped by em dashes with spaces either side, among other similar characters
            $intWords = 1 + $this->text_length(preg_replace('/[^ ]/', '', $strText)); // Space count + 1 is word count
            return $intWords;
        }

TextStatistics.php syllable_count method

        /**
         * Returns the number of syllables in the word.
         * Based in part on Greg Fast's Perl module Lingua::EN::Syllables
         * @param   strWord      Word to be measured
         */
        public function syllable_count($strWord) {
            if(strlen(trim($strWord)) == 0){
                return 0;
            }

            // Should be no non-alpha characters
            $strWord = preg_replace('/[^A-Za-z]/' , '', $strWord);

            $intSyllableCount = 0;
            $strWord = $this->lower_case($strWord);

            // Specific common exceptions that don't follow the rule set below are handled individually
            // Array of problem words (with word as key, syllable count as value)
            $arrProblemWords = Array(
                 'simile' => 3
                ,'forever' => 3
                ,'shoreline' => 2
            );
            if (isset($arrProblemWords[$strWord])) {
                return $arrProblemWords[$strWord];
            }

            // These syllables would be counted as two but should be one
            $arrSubSyllables = Array(
                 'cial'
                ,'tia'
                ,'cius'
                ,'cious'
                ,'giu'
                ,'ion'
                ,'iou'
                ,'sia$'
                ,'[^aeiuoyt]{2,}ed$'
                ,'.ely$'
                ,'[cg]h?e[rsd]?$'
                ,'rved?$'
                ,'[aeiouy][dt]es?$'
                ,'[aeiouy][^aeiouydt]e[rsd]?$'
                //,'^[dr]e[aeiou][^aeiou]+$' // Sorts out deal, deign etc
                ,'[aeiouy]rse$' // Purse, hearse
            );

            // These syllables would be counted as one but should be two
            $arrAddSyllables = Array(
                 'ia'
                ,'riet'
                ,'dien'
                ,'iu'
                ,'io'
                ,'ii'
                ,'[aeiouym]bl$'
                ,'[aeiou]{3}'
                ,'^mc'
                ,'ism$'
                ,'([^aeiouy])\1l$'
                ,'[^l]lien'
                ,'^coa[dglx].'
                ,'[^gq]ua[^auieo]'
                ,'dnt$'
                ,'uity$'
                ,'ie(r|st)$'
            );

            // Single syllable prefixes and suffixes
            $arrPrefixSuffix = Array(
                 '/^un/'
                ,'/^fore/'
                ,'/ly$/'
                ,'/less$/'
                ,'/ful$/'
                ,'/ers?$/'
                ,'/ings?$/'
            );

            // Remove prefixes and suffixes and count how many were taken
            $strWord = preg_replace($arrPrefixSuffix, '', $strWord, -1, $intPrefixSuffixCount);

            // Removed non-word characters from word
            $strWord = preg_replace('/[^a-z]/is', '', $strWord);
            $arrWordParts = preg_split('/[^aeiouy]+/', $strWord);
            $intWordPartCount = 0;
            foreach ($arrWordParts as $strWordPart) {
                if ($strWordPart <> '') {
                    $intWordPartCount++;
                }
            }

            // Some syllables do not follow normal rules - check for them
            // Thanks to Joe Kovar for correcting a bug in the following lines
            $intSyllableCount = $intWordPartCount + $intPrefixSuffixCount;
            foreach ($arrSubSyllables as $strSyllable) {
                $intSyllableCount -= preg_match('/' . $strSyllable . '/', $strWord);
            }
            foreach ($arrAddSyllables as $strSyllable) {
                $intSyllableCount += preg_match('/' . $strSyllable . '/', $strWord);
            }
            $intSyllableCount = ($intSyllableCount == 0) ? 1 : $intSyllableCount;
            return $intSyllableCount;
        }

TextStatistics.php sentence_count method

        /**
         * Returns sentence count for text.
         * @param   strText      Text to be measured
         */
        public function sentence_count($strText) {
            if(strlen(trim($strText)) == 0){
                return 0;
            }
            $strText = $this->clean_text($strText);
            // Will be tripped up by "Mr." or "U.K.". Not a major concern at this point.
            $intSentences = max(1, $this->text_length(preg_replace('/[^\.!?]/', '', $strText)));
            return $intSentences;
        }

Keeps throwing error?

PHP Notice: Undefined offset: #### in /vendor/davechild/textstatistics/src/DaveChild/TextStatistics/Syllables.php on line 380

Is this project still alive ?

There are a couple of things I found which could do with improving/fixing, but I'm wondering whether to spend the time on it as the project seems dormant.

Can someone please let me know the status of the project and/or the policy for contributing ?

Addition of Läsbarhetsindex

I've written code for calculating the Läsbarhetsindex (Björnsson, 1968).

LIX = 100 * RLW + ASL

where:
RLW (ratio of long words to all words) = number of long words / number of words
ASL (average sentence length) = number of words / number of sentences

and long words have more than six characters.

This readability formula was developed for Swedish, but also works for German and English. More information:

Ott, N. (2009). Information retrieval for language learning: An exploration of text difficulty measures. ISCL master's thesis, Universität Tübingen, Seminar für Sprachwissenschaft, Tübingen, Germany, p. 19.

May I commit these additions directly to the master branch?
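
A minimal sketch of the LIX formula described in this issue, not the code the author wrote. The function name `lix_score` is hypothetical, and the word and sentence splitting is a deliberate simplification (runs of letters or digits count as words, runs of `.`, `!` or `?` end sentences), so contractions and abbreviations will skew the result slightly.

```php
<?php
// Hypothetical helper for illustration; not part of TextStatistics.
function lix_score(string $text): float
{
    // Words are approximated as runs of Unicode letters or digits.
    $words = preg_match_all('/[\p{L}\p{N}]+/u', $text);
    if (!$words) {
        return 0.0; // empty text (or invalid UTF-8)
    }
    // Long words have more than six characters, i.e. seven or more.
    $longWords = preg_match_all('/[\p{L}\p{N}]{7,}/u', $text);
    // Sentences are approximated by runs of terminal punctuation.
    $sentences = max(1, preg_match_all('/[.!?]+/', $text));

    // LIX = 100 * (long words / words) + (words / sentences)
    return 100 * ($longWords / $words) + ($words / $sentences);
}

echo round(lix_score('The cat sat. Extraordinary developments happened yesterday.'), 1), "\n";
```

The `u` flag makes the character counts Unicode-aware, which matters for the Swedish and German text the formula was designed for.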

Create release for the PHP 7.2 fix in #45

Can you please push out a new release number for the fix that was merged in with #45 ?

Can't create 2 instances

If you create a second instance of the statistics module, the first one has already loaded all the words with include_once so the second instance doesn't get the words.

I suggest making a TextStatistics::instance() method which returns an instance, and stores it in a static variable in the class, so if you run it again you get the same object not a new one.
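
An alternative to a singleton is to cache the word list itself, so every instance can read it. The sketch below assumes the word list file returns an array via `return`, which may not match the library's actual resource files; the class `WordListLoader` is hypothetical.

```php
<?php
// Hypothetical class for illustration; the library's real loader may differ.
class WordListLoader
{
    /** Cache shared by every instance, keyed by file path. */
    private static $cache = array();

    public function load(string $path): array
    {
        if (!isset(self::$cache[$path])) {
            // A plain include yields the file's return value. include_once
            // would yield true on a repeat call, so the data would be lost.
            self::$cache[$path] = include $path;
        }
        return self::$cache[$path];
    }
}

// Demo with a throwaway word list file.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "<?php return array('a', 'able', 'about');");
$first = (new WordListLoader())->load($path);
$second = (new WordListLoader())->load($path);
unlink($path);

var_dump($first === $second); // both instances see the same list
```

This keeps `new TextStatistics` usable more than once while still reading each file from disk only one time.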

Doesn't work for cyrillic symbols

Sample text:

Лондон — город и столица Соединённого Королевства Великобритании и Северной Ирландии. Административно образует регион Англии Большой Лондон, разделённый на 32 самоуправляемых района и Сити.
Население — 8,3 млн человек (2012 год), второй по величине город Европы и крупнейший в Евросоюзе. Образует агломерацию «Большой Лондон» и более обширный метрополитенский район. Расположен на юго-востоке острова Великобритания, на равнине Лондонского бассейна, в устье Темзы вблизи Северного моря.
Главный политический, экономический и культурный центр Великобритании. Экономика города занимает пятую часть экономики страны. Относится к глобальным городам высшего ранга, ведущим мировым финансовым центрам (наряду с Нью-Йорком).

This text returns values of "-" on http://www.readability-score.com/

Question: Text Statistics for Other Languages (Korean)

Slightly tangential so apologies in advance.

Do any of you know any text statistics that work with other languages? I'm looking for modifications of these metrics that would work with Korean.

Thanks for any pointers you could give me.

Floating point calculations should use bcmath()

... as floating point calculations are notoriously unreliable if you don't use bcmath()

More info: http://floating-point-gui.de/ and http://www.php.net/manual/en/language.types.float.php

SMOG calculated incorrectly

Hello. SMOG is calculated incorrectly. Check the formula against https://en.wikipedia.org/wiki/SMOG.
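
For comparison, a sketch of the commonly published SMOG grade formula, taking pre-computed counts as input. The function name `smog_grade` is hypothetical, and this is not a statement of what the library currently does; differences in how polysyllabic words (three or more syllables) and sentences are counted will still change the result.

```php
<?php
// Hypothetical helper for illustration; not part of TextStatistics.
// SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
function smog_grade(int $polysyllableCount, int $sentenceCount): float
{
    if ($sentenceCount < 1) {
        return 0.0; // avoid division by zero on empty text
    }
    return 1.0430 * sqrt($polysyllableCount * (30 / $sentenceCount)) + 3.1291;
}

echo round(smog_grade(30, 30), 1), "\n";
```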

Syllable Counting Error

Issue transferred from Google Code:

Hmm, I was testing it out on random text and noticed that "the reading
kitten" gave an output of 4 syllables but "the kitten reading" gives 5.
Why does it give two different results?

Access Token for Scrutinizer needs to be updated

Scrutinizer stats appear to be three or four years out of date, on account of needing a new GitHub API token.

typo in fetchSpracheWordList

https://github.com/DaveChild/Text-Statistics/blob/master/TextStatistics.php#L598

you have "resources/SpachelWordList.php" (an extra "l" before "W").

Dale-Chall Problem Words

Issue reported in Google Code:

http://code.google.com/p/php-text-statistics/issues/detail?id=3

Combined words are not handled

From Google Code:

Words which combine letters and numbers are not handled correctly. For
example, "3a" in text should be counted as two separate words, each of one
syllable.
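
One way to get that behaviour is to split tokens wherever a digit meets a letter before counting. This is a sketch only; `split_alnum_tokens` is a hypothetical name, and it handles ASCII letters only.

```php
<?php
// Hypothetical helper for illustration; not part of TextStatistics.
function split_alnum_tokens(string $text): array
{
    // Insert a space at every boundary between a digit and a letter,
    // in either direction, using zero-width lookarounds.
    $spaced = preg_replace('/(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)/', ' ', $text);
    // Split on whitespace and drop empty pieces.
    return preg_split('/\s+/', trim($spaced), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_alnum_tokens('Flat 3a is free'));
```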

Syllable count is incorrect on accented vowels

Words such as canapé and ajouré give one fewer syllable than they should. The accented vowel should count as a distinct syllable.
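
A sketch of the idea using a naive vowel-group counter, which is an assumption for illustration and much simpler than the library's rule set. The function name `count_vowel_groups` and the list of accented vowels are hypothetical, and the source file is assumed to be UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of TextStatistics.
// Counts runs of vowels, optionally treating accented vowels as vowels too.
function count_vowel_groups(string $word, bool $includeAccented = true): int
{
    $pattern = $includeAccented
        ? '/[aeiouyáéíóúàèìòùâêîôûäëïöü]+/iu' // Unicode-aware, case-insensitive
        : '/[aeiouy]+/i';                      // ASCII vowels only
    return (int) preg_match_all($pattern, $word);
}

echo count_vowel_groups('canapé', false), ' vs ', count_vowel_groups('canapé'), "\n";
echo count_vowel_groups('ajouré', false), ' vs ', count_vowel_groups('ajouré'), "\n";
```

Because the accented letter is kept as-is rather than folded to a plain "e", a silent-e rule would not mistake it for a silent ending.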

Flesch-Kincaid score always returns 0

I am trying to use the library, but the Flesch-Kincaid result is always 0:

$this->textStatistics = new DaveChild\TextStatistics\TextStatistics();
print $this->textStatistics->fleschKincaidReadingEase("Hello, my name is Mika"); // 0

If I enter the same text in the demo page, I get a different value.

Composer app keeps uninstalling itself

After a few usages, I keep getting this error:

Fatal error: Uncaught Error: Class "DaveChild\TextStatistics\TextStatistics" not found

I have to reinstall the app for it to work again.

It'll work a few times, and then I'm getting that error again.

The app folder is in the vendor folder.

Semver?

It would be great if this project's version adhered to some kind of semantic versioning.

I had projects that were requiring ^1.0.2, that suddenly broke because there was a breaking change in a patch release (1.0.2 to 1.0.3).

I see now that you have a note in the readme to specify 1.0.2 explicitly if you need < 7.2 support, but that wouldn't even be necessary to call out if semver were more strictly followed.

Undefined offset warning in "words_with_three_syllables"

I sometimes get the above warning at:

Line 499: if ($this->syllable_count($arrWords[$i]) > 2) {
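
One plausible cause (an assumption, not confirmed in this issue) is that the loop bound comes from a separate word count while `$arrWords` comes from its own split, so the two can disagree. A sketch that iterates over the array itself avoids any out-of-range index; the function names here are hypothetical and the syllable counter is a naive stand-in.

```php
<?php
// Hypothetical helpers for illustration; not part of TextStatistics.
function naive_syllables(string $word): int
{
    // Stand-in counter: one syllable per run of vowels, minimum 1.
    return max(1, (int) preg_match_all('/[aeiouy]+/i', $word));
}

function count_words_over_two_syllables(string $text, callable $syllableCount): int
{
    // Split once, dropping empty pieces so no blank "words" are counted.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = 0;
    foreach ($words as $word) { // foreach never reads a missing offset
        if ($syllableCount($word) > 2) {
            $count++;
        }
    }
    return $count;
}

echo count_words_over_two_syllables('a beautiful banana tree', 'naive_syllables'), "\n";
```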

flesch kincaid statistics are both in error

Both the flesch_kincaid_reading_ease() and flesch_kincaid_grade_level() methods are maxing out. The first at 100 and the latter at 19.

Every text block we try has the same issue. And the stats don't tally with those found on readability-score.com

Just FYI - maybe a recent commit has caused a bug to creep in?

Syllables of words ending in "sses" are not counted correctly

I have removed "ss" from $arrSubSyllables to resolve this issue.

P.S. Thank you very much for this library, you have saved me a lot of time.

SMOG calculation discrepancies

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but with $textStatistics->smogIndex($input) only 9.4. This is a big difference. Am I doing something wrong?

Increase performance

Processing the same text over-and-over is not a good idea. Just process it once and store it.

 class TextStatistics {

    protected $text;
    protected $strEncoding = ''; // Used to hold character encoding to be used by object, if set

    /**
     * Constructor.
     *
     * @param string|null  $text    Optional text to parse.
     * @return void
     */
    public function __construct($text = NULL) {
        if($text) {
            $this->setText($text);
        }
    }

    /**
     * Set the text to parse
     */
    public function setText($text) {
        $this->strEncoding = mb_detect_encoding($text);
        $this->text = $this->clean_text($text);
    }

    /**
     * Fetch the current object text
     */
    public function getText() {
        return $this->text;
    }

    /**
     * Fetch the current object encoding
     */
    public function getEncoding() {
        return $this->strEncoding;
    }

    /**
     * Gives the Flesch-Kincaid Reading Ease of text entered rounded to one digit
     * @param   strText         Text to be checked
     */
    public function flesch_kincaid_reading_ease() {
        $strText = $this->text;

....

After making this simple change, the processing time for me on a small document dropped from 0.16 seconds to 0.11 seconds due to the reduced clean_text calls.

What license is it listed under?

I'd love to see a LICENSE.txt file in here so that it's clear what license it should be distributed under. Would be great if it is GPL!

Search within div functionality not working

I can't give a particular URL publicly because of confidentiality (can't publicly say I'm working on it, but have emailed) but # specification isn't working. Annoying since this works fine on read-able (but I can't do bulk on that, which is what I paid for the premium version of the site for!)

Incorrect Syllable Count for 'Meteor'

Noticed that 'Meteor' produces a count of 2 instead of 3.

Sentence count suggestion

In the inline code comments you already note:

// Will be tripped up by "Mr." or "U.K.". Not a major concern at this point.

I found it is also tripped up by ... or ?!

Just wanted to show you the suggestion below for consideration, as it will at least provide a better count in ... and ?! situations:

$intSentences = max( 1, preg_match_all( '`[^\.!?]+[\.!?]+([\s]+|$)`u', $strText, $matches ) );
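
To see the difference, the suggested pattern can be wrapped in a small function and compared with counting punctuation characters. The wrapper names below are hypothetical; only the regular expression comes from the suggestion above.

```php
<?php
// Hypothetical wrappers for illustration; not part of TextStatistics.
function count_sentences_by_match(string $text): int
{
    // Suggested pattern: text, then a run of terminators, then whitespace or end.
    $n = preg_match_all('`[^\.!?]+[\.!?]+([\s]+|$)`u', $text, $matches);
    return max(1, (int) $n);
}

function count_sentences_by_chars(string $text): int
{
    // Character-counting approach: every terminator counts as a sentence.
    return max(1, strlen(preg_replace('/[^\.!?]/', '', $text)));
}

$sample = 'Wait... Really?! Yes.';
echo count_sentences_by_chars($sample), ' vs ', count_sentences_by_match($sample), "\n";
```

Note that `max(1, ...)` still reports one sentence for empty input, so the empty-text check discussed in the earlier issue would be needed as well.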