polari's Introduction

Polari 🌈

Polari serves two purposes:

  1. Detect the language of natural language text.
  2. Detect the sentiment of English language text with a basic pre-trained algorithm.

Python's simplicity with Rust's speed and scale. 🚀

What's in a name?

"Polari (from Italian parlare 'to talk') is a form of slang or cant historically used in Britain." (Wikipedia)

Polari was spoken by "mostly camp gay men. They were a class of people who lived on the margins of society. Many of them broke the law - a law which is now seen... as being unfair and cruel - and so they were at risk of arrest, shaming, blackmail, and attack. They were not seen as important or interesting. Their stories were not told." (Fabulosa!: The Story of Polari, Britain's Secret Gay Language, pp. 10-11)

The polari library:

  • Performs language & sentiment detection.
  • Is a plugin for a library named polars.
  • Was, coincidentally, first released during Pride Month (June 2024).

If you have fun with this library, please consider donating to a charity which supports LGBTQIA+ folks.

Perhaps:

Pull requests with further charity & organisation suggestions are welcome. [1]

Language Detection 🔎

Load the data quickly with Hugging Face & DuckDB

For quick setup with sample data, install the requirements in examples/example_requirements.txt

# Linux/MacOS
python -m venv .venv && source .venv/bin/activate && python -m pip install polari duckdb==0.10.3 polars==0.20.30 pyarrow==16.1.0

Load some sample data:

import polari
import duckdb
from time import time
from polars import Config, col

# With row limits below the millions, most of the time is spent setting up the LazyFrame with duckdb.
rows = 5

# Here are the languages that whichlang supports.
languages = (
    # Modern Standard Arabic and Simplified Chinese have less precise names in polari.
    "Modern Standard Arabic",
    "Simplified Chinese",
    "German",
    "English",
    "French",
    "Hindi",
    "Italian",
    "Japanese",
    "Korean",
    "Dutch",
    "Portuguese",
    "Russian",
    "Spanish",
    "Swedish",
    "Turkish",
    "Vietnamese",
)
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"SELECT inputs, language FROM 'hf://datasets/CohereForAI/aya_dataset/data/train-00000-of-00001.parquet' WHERE language in {languages} LIMIT {str(rows)};"
    )
    .pl()
    .lazy()
)
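The query above interpolates the Python tuple directly into the SQL text, which works because the repr of a tuple of plain strings happens to look like a SQL IN list. The sketch below shows what that interpolation produces and where it stops matching. The helper `sql_in_list` is hypothetical and not part of polari or duckdb.

```python
def sql_in_list(values):
    """Render strings as a SQL IN (...) list, doubling any single quotes.

    Hypothetical helper for illustration; not part of polari or duckdb.
    """
    quoted = ("'" + value.replace("'", "''") + "'" for value in values)
    return "(" + ", ".join(quoted) + ")"


languages = ("German", "English", "French")

# For plain names, the tuple repr and the explicit rendering are identical.
print(f"WHERE language in {languages}")
print(f"WHERE language in {sql_in_list(languages)}")

# A one-element tuple keeps Python's trailing comma in its repr,
# while the explicit rendering does not.
print(f"WHERE language in {('German',)}")
print(f"WHERE language in {sql_in_list(('German',))}")
```

With the sixteen names listed above, the two forms are identical, so the original query is unaffected. The explicit rendering only matters for a single language or for names containing quotes.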

Detect the language 🌍 🔎

Config.set_tbl_hide_dataframe_shape(True)

df = lf.select(
    "inputs",
    polari.detect_lang("inputs", algorithm="which_lang").alias("detected_lang"),
    col("language").alias("true_lang"),
).collect()

print(df)
┌─────────────────────────────────┬───────────────┬────────────┐
│ inputs                          ┆ detected_lang ┆ true_lang  │
│ ---                             ┆ ---           ┆ ---        │
│ str                             ┆ str           ┆ str        │
╞═════════════════════════════════╪═══════════════╪════════════╡
│ Hãy tiếp tục đoạn văn sau:      ┆ Vietnamese    ┆ Vietnamese │
│ "T…                             ┆               ┆            │
│ Bu paragrafın devamını yazın: … ┆ Turkish       ┆ Turkish    │
│ ¿Cuál es la respuesta correcta… ┆ Spanish       ┆ Spanish    │
│ 中押(ちゅうお)し勝ちといえば、  ┆ Japanese      ┆ Japanese   │
│ どんなゲームの勝負の決まり方…   ┆               ┆            │
│ Em que ano os filmes deixaram … ┆ Portuguese    ┆ Portuguese │
└─────────────────────────────────┴───────────────┴────────────┘
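Once a frame holds both a detected and a true label, the two columns can be scored against each other. The following is a minimal pure-Python sketch of per-language precision and recall. The function name and the sample labels are made up for illustration, and none of it is polari API.

```python
from collections import Counter


def per_label_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for two equally long label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted this label, but it was wrong
            fn[truth] += 1      # missed the true label
    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        n_predicted = tp[label] + fp[label]
        n_actual = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_actual if n_actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels for illustration, not results from polari.
true_lang = ["Spanish", "Spanish", "Portuguese", "Turkish"]
detected_lang = ["Spanish", "Portuguese", "Portuguese", "Turkish"]

scores = per_label_precision_recall(true_lang, detected_lang)
for label, (precision, recall) in sorted(scores.items()):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

With the frame from the example, the two lists could be taken from `df["true_lang"].to_list()` and `df["detected_lang"].to_list()`.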

Algorithms

The example above uses only the whichlang algorithm, which is the quickest and simplest.

Two of the algorithms, what_lang and lingua, can output a confidence score with detect_lang_confidence.

Supported algorithms: [2]

  • what_lang
  • lingua
  • whichlang

what_lang and lingua support language subsets and language exclusion. lingua supports high and low accuracy mode.

Detect the script 📜

It is also possible to detect the script of each text with what_lang and lingua.

df = lf.select(
    "inputs",
    "language",
    polari.detect_script("inputs").alias("detected_script"),
).collect()

print(df.head(3))
┌────────────────────────────────────────────────────────┬─────────────────┬─────────────────┐
│ inputs                                                 ┆ language        ┆ detected_script │
│ ---                                                    ┆ ---             ┆ ---             │
│ str                                                    ┆ str             ┆ str             │
╞════════════════════════════════════════════════════════╪═════════════════╪═════════════════╡
│ Heestan waxaa qada Khalid Haref Ahmed                  ┆ Somali          ┆ Latin           │
│ OO ku Jiray Kooxdii Dur Dur!                           ┆                 ┆                 │
│ Quels président des États-Unis ne s’est jamais marié ? ┆ French          ┆ Latin           │
│ كم عدد الخلفاء الراشدين ؟ أجب على السؤال السابق.       ┆ Standard Arabic ┆ Arabic          │
└────────────────────────────────────────────────────────┴─────────────────┴─────────────────┘
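To give a feel for what script detection involves, here is a rough standard-library sketch that takes a majority vote over Unicode character names. This is a stand-in technique for illustration only. It is not how what_lang or lingua detect scripts, and the function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most common script among the letters of `text`.

    Uses the first word of each letter's Unicode name (for example
    "LATIN" or "ARABIC") as a crude script label. Returns None when
    the text contains no letters.
    """
    counts = Counter()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Quels président des États-Unis ne s’est jamais marié ?"))
print(dominant_script("كم عدد الخلفاء الراشدين ؟"))
```

A name-prefix vote like this is coarse. It splits Japanese text into separate hiragana, katakana, and CJK labels, for instance, which a dedicated detector handles properly.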
    

Sentiment Detection 😀😠

polari can detect the sentiment of English language text via a Rust port of VADER.

The pre-trained model was originally trained for sentiment detection on social media posts, but performs reasonably well on other opinionated text. The example below analyses Amazon reviews.

Sample Data

import polari
import duckdb
from time import time
from polars import Config

# With row limits below the millions, most of the time is spent setting up the LazyFrame with duckdb.
# This loads `rows` reviews for each of the 1-star, 3-star, and 5-star ratings.
rows = 1
subset = "Beauty_and_Personal_Care"
dataset = f"hf://datasets/McAuley-Lab/Amazon-Reviews-2023/raw/review_categories/{subset}.jsonl"
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"""
        WITH positive AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 5 LIMIT {rows}
        ),
        neutral AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 3 LIMIT {rows}
        ),
        negative AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 1 LIMIT {rows}
        )
        SELECT * FROM positive
        UNION ALL
        SELECT * FROM negative
        UNION ALL
        SELECT * FROM neutral;
        """
    )
    .pl()
    .lazy()
)

Detect Sentiment 😀😠🔎

df = lf.select(
    "text",
    polari.get_sentiment("text", output_type="compound").alias("sentiment"),
    polari.get_sentiment("text", output_type="pos").alias("pos"),
    polari.get_sentiment("text", output_type="neu").alias("neu"),
    polari.get_sentiment("text", output_type="neg").alias("neg"),
    "rating",
).collect()

df.head()
┌──────────────────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┬────────┐
│ text                                                 ┆ sentiment ┆ pos      ┆ neu      ┆ neg      ┆ rating │
│ ---                                                  ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---    │
│ str                                                  ┆ f64       ┆ f64      ┆ f64      ┆ f64      ┆ f64    │
╞══════════════════════════════════════════════════════╪═══════════╪══════════╪══════════╪══════════╪════════╡
│ Bought this for my granddaughter.  Her entire family…┆ 0.63695   ┆ 0.21875  ┆ 0.78125  ┆ 0.0      ┆ 5.0    │
│ This is a good product but it doesn't last very long…┆ 0.238227  ┆ 0.130435 ┆ 0.869565 ┆ 0.0      ┆ 3.0    │
│ Tops the list for worst purchase. Tried these for al…┆ -0.939365 ┆ 0.094854 ┆ 0.735183 ┆ 0.169963 ┆ 1.0    │
└──────────────────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┴────────┘

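Output types include:

  • compound
  • neutral (passed as neu in the example above)
  • positive (passed as pos in the example above)
  • negative (passed as neg in the example above)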
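For readers new to VADER: in the reference Python implementation, the compound score is the summed word valences squashed into the range -1 to 1, while pos, neu, and neg are proportions of the text that add up to 1. The sketch below shows that normalisation and the cut-offs of ±0.05 that the VADER authors suggest for turning a compound score into a label. It assumes polari's Rust port behaves the same way, which has not been checked here, and both function names are hypothetical.

```python
import math


def normalize_compound(raw_score, alpha=15.0):
    """Squash a summed valence score into (-1, 1), as reference VADER does."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label using the conventional cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the table above, paired with their star ratings.
for compound, rating in [(0.63695, 5.0), (0.238227, 3.0), (-0.939365, 1.0)]:
    print(rating, label_from_compound(compound))
```

By these cut-offs the 3-star review counts as positive, which illustrates why the compound score is better read as a continuous signal than as a substitute for the rating.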

Credits

Language detection:

Sentiment:

Polars:

Footnotes

  1. In the extremely unlikely scenario that this project becomes popular, and therefore a library that needs to sustain itself, users could also be invited to donate to the project in a separate section of the README.

  2. Benchmarking algorithm prediction precision/recall can be done with polari. Difference in detection speed by algorithm may be due to the implementation in polari, rather than the original Rust crate.

polari's People

Contributors

tomburdge, lyndonfan

polari's Issues

Inconsistent country names

Country name conversion is inconsistent across APIs. This causes surprising behaviour with certain languages, particularly Arabic and Mandarin; these are very widely used languages that are referred to by various names.

The best course of action is probably to:

  • Add a rust language formatting iso crate dependency.
  • Refactor to receive an output format argument. This will be a string argument that is: name | ISO.
    • Name example: English
    • ISO example: ENG.
  • ISO is default, and recommended.

Of the algorithms:

  • This would likely slow down whichlang, which outputs a name and does not have an ISO option.
  • This would likely speed up lingua, which didn't appear to have a public method to output a non ISO string.

A possible future feature is a native-language name output format with what_lang. Example value: Français.
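The proposed output format argument could look roughly like the sketch below, written in Python for brevity even though the issue suggests a Rust crate. The function `format_language`, its `output_format` parameter, and the small lookup table are all hypothetical. Only the two format names (name | ISO) and the upper-case ISO style come from the proposal above.

```python
# A few ISO 639-3 codes, keyed by lower-cased names that different
# detectors might emit for the same language. Illustrative, not exhaustive.
ISO_639_3 = {
    "english": "eng",
    "french": "fra",
    "german": "deu",
    "spanish": "spa",
    "standard arabic": "arb",
    "modern standard arabic": "arb",
    "mandarin": "cmn",
    "mandarin chinese": "cmn",
}

# One preferred display name per code, so aliases collapse to a single spelling.
CANONICAL_NAME = {
    "eng": "English",
    "fra": "French",
    "deu": "German",
    "spa": "Spanish",
    "arb": "Standard Arabic",
    "cmn": "Mandarin",
}


def format_language(detected_name, output_format="ISO"):
    """Normalise a detector's language name to a consistent name or ISO code.

    Returns None for names missing from the lookup table.
    """
    code = ISO_639_3.get(detected_name.strip().lower())
    if code is None:
        return None
    if output_format == "ISO":
        return code.upper()
    if output_format == "name":
        return CANONICAL_NAME[code]
    raise ValueError("output_format must be 'name' or 'ISO'")


print(format_language("Modern Standard Arabic"))          # ARB
print(format_language("Standard Arabic", "name"))         # Standard Arabic
print(format_language("modern standard arabic", "name"))  # Standard Arabic
```

Routing every name through the code first is what makes the two Arabic spellings converge, which is the inconsistency the issue describes.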

Possible mistake in Example

In the detect a language example, the third row. I don't think the true language is English. Could you take another look and confirm?