stop-words's Introduction

Stop Words

List of common stop words in various languages.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Gujarati
  • Hindi
  • Hebrew
  • Hungarian
  • Indonesian
  • Malaysian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian
  • Vietnamese
  • Persian/Farsi

Contributing

You know how ;)

Programming languages support

License

Attribution 4.0 International (CC BY 4.0)

stop-words's People

Contributors

ali0saeedi, alir3z4, babupriyavrat, cimox, cmccomb, cr0wg4n, dhruval10, dksie09, dmanole, elfrasco, grzegorzme, hklemp, hothanhluan, kissarat, macbre, mayankpi, norkans7, sortiz


stop-words's Issues

Fix German stop words

german.txt contains incorrect inflected forms of "uns".

Lines 189 through 193 (inclusive) should be:

unsere
unserem
unseren
unserer
unseres

In addition, "unser" must be added back, because the change above replaces it with "unserer".

Source: German is my mother tongue.

Fix Romanian stop words

Here is the updated list of stop words for Romanian. In Romanian, some words are written with "a" instead of "i" (for example, "cand" alongside "cind"), so both spellings are included.

vreo acelea cata cita degraba lor alta tot ai dat despre peste bine dar foarte avea multi cit cat alt mai sa fie tu multe orice dintr dintre dintr-o dintr-un se intr intr-o intr-un niste multa insa il fost a abia nimic sub acel in altceva si avem altfel c ea acest li parca fi dintre unele m acestei mare cel este pe atitia atatia uneori acela iti astazi acestui o imi ele ceilalti pai fata noua sa-ti altul au i prin conform aceste anume azi k unul ala unei fara ei la aceeasi u inapoi acestea acesta aceasta catre sale asupra as aceea ba ale da le apoi aia suntem cum isi inainte s de cind cand cumva chiar acestia daca sunt care al numai cui sus tocmai prea cu mi eu doar niciodata exact putini aiurea tuturor celor astfel atunci citeva cateva cat sau fel intre acolo nostri ma mult una ceea iar sintem suntem ati din geaba sai caruia adica inca are aici ca ia nici d oricum asta carora face citiva cativa voi unor f atat toata alaturi cea nu totusi ce altii acum sint sunt capat mod deasupra cam vom b toate careia aceasta atit atat nimeni ii ci unde ul plus era sa-mi l spre dupa nou cele acea un incit incat n cei or va deci acelasi atatea h vor decit decat noi cineva desi ceva j ului atitea atatea avut ar pina pana t atata unui el citi asa totul pentru atita v alti asemenea atatia te ne deja unii p atare cite cate cine cand toti vreun ori r alte lui ti ni aceia am

License

I see that this project uses CC-BY-4.0. Wouldn't it be better to use the Open Database License, since it is more suitable for datasets like this? Unlike CC-BY, the ODC-ODbL is a copyleft license, which ensures sharing on equal terms. See also https://opendatacommons.org/licenses/odbl/1.0/

English stop words have an additional character

Somehow the English list is printed with the letter u in front of each word.

from stop_words import get_stop_words
stop_words = get_stop_words('english')
print(stop_words)
[u'a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and', u'any', u'are', u"aren't", u'as', u'at', u'be', u'because', u'been', u'before', u'being', u'below', u'between', u'both', u'but', u'by', u"can't", u'cannot', u'could', u"couldn't", u'did', u"didn't", u'do', u'does', u"doesn't", u'doing', u"don't", u'down', u'during', u'each', u'few', u'for', u'from', u'further', u'had', u"hadn't", u'has', u"hasn't", u'have', u"haven't", u'having', u'he', u"he'd", u"he'll", u"he's", u'her', u'here', u"here's", u'hers', u'herself', u'him', u'himself', u'his', u'how', u"how's", u'i', u"i'd", u"i'll", u"i'm", u"i've", u'if', u'in', u'into', u'is', u"isn't", u'it', u"it's", u'its', u'itself', u"let's", u'me', u'more', u'most', u"mustn't", u'my', u'myself', u'no', u'nor', u'not', u'of', u'off', u'on', u'once', u'only', u'or', u'other', u'ought', u'our', u'ours', u'ourselves', u'out', u'over', u'own', u'same', u"shan't", u'she', u"she'd", u"she'll", u"she's", u'should', u"shouldn't", u'so', u'some', u'such', u'than', u'that', u"that's", u'the', u'their', u'theirs', u'them', u'themselves', u'then', u'there', u"there's", u'these', u'they', u"they'd", u"they'll", u"they're", u"they've", u'this', u'those', u'through', u'to', u'too', u'under', u'until', u'up', u'very', u'was', u"wasn't", u'we', u"we'd", u"we'll", u"we're", u"we've", u'were', u"weren't", u'what', u"what's", u'when', u"when's", u'where', u"where's", u'which', u'while', u'who', u"who's", u'whom', u'why', u"why's", u'with', u"won't", u'would', u"wouldn't", u'you', u"you'd", u"you'll", u"you're", u"you've", u'your', u'yours', u'yourself', u'yourselves']
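
The u prefix in the output above is how Python 2 displays unicode strings when a whole list is printed; it is part of the printed representation, not of the words themselves. A minimal sketch of this, using a small hypothetical sample in place of the real list (sample_stop_words is an illustrative name, not part of the library):

```python
# Hypothetical sample standing in for the list returned by get_stop_words('english').
sample_stop_words = [u"a", u"about", u"aren't"]

# The u prefix belongs to the literal syntax, not to the string contents:
# the first character of every word is still an ordinary letter.
first_chars = [word[0] for word in sample_stop_words]

# Joining the words into one string prints them without any list decoration.
joined = ", ".join(sample_stop_words)
print(joined)
```

Printing individual words, or a joined string as here, avoids the prefix on Python 2. On Python 3 all strings are unicode, and the prefix does not appear when a list is printed.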

Additional stopword lists

Are you aware of the stopwords-json project which already has freely-licensed json-format lists of stopwords for 50 different languages? Importing these would be an easy way to expand the list of languages supported here without recreating the lists for each language from scratch.
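
As a rough sketch of what such an import could look like: the snippet below merges a JSON array of words into an existing one-word-per-line list. It assumes that each imported list is a plain JSON array of strings and that the lists here are text files with one word per line; the function name merge_stop_words and the sample data are hypothetical.

```python
import json


def merge_stop_words(existing_lines, json_text):
    """Merge a JSON array of words into an existing one-word-per-line list."""
    # Assumption: the imported list is a JSON array of strings.
    imported = json.loads(json_text)
    # Strip newlines and skip blank lines from the existing list.
    merged = {line.strip() for line in existing_lines if line.strip()}
    # A set drops words that are already present.
    merged.update(word.strip() for word in imported if word.strip())
    return sorted(merged)


# Hypothetical sample data, not taken from either project.
existing = ["aber\n", "alle\n"]
incoming = '["alle", "als", "also"]'
result = merge_stop_words(existing, incoming)
print("\n".join(result))
```

Sorting the merged set keeps the output stable between runs, which makes later diffs of the list files easier to review.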

I have also created some scripts to manage stoplist creation as part of the more-stoplists project (which has now been merged into stoplists-json). Some of these may be useful for your work. In addition there is a stopword-trainer by Espen Klem that can create stoplists based on a set of documents.

Other stopword projects / sources of freely-licensed stopwords that you might want to check out if you haven't already:

Turkish stop words decoded incorrectly

The Turkish stop words are decoded incorrectly.

These characters need special handling (uppercase -> lowercase):
I -> ı
İ -> i
Ö -> ö
Ç -> ç
Ş -> ş
Ü -> ü
Ğ -> ğ
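
If the problem comes from case conversion rather than from the file encoding, the mapping above can be sketched in Python as follows. The default lowercasing turns "I" into "i" and "İ" into "i" followed by a combining dot, so the two i-variants are replaced first. The helper name turkish_lower is hypothetical and is not part of this project.

```python
def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless i rules."""
    # Replace the two letters whose default Unicode lowercase mapping
    # differs from Turkish, then lowercase everything else as usual.
    return text.replace("İ", "i").replace("I", "ı").lower()


print(turkish_lower("IŞIK"))
print(turkish_lower("İÇİN"))
```

The remaining letters in the list (Ö, Ç, Ş, Ü, Ğ) already lowercase correctly with the default str.lower(), so only the i-variants need explicit handling.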
