charcoal-se / smokedetector Goto Github PK

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.

Home Page: https://metasmoke.erwaysoftware.com

License: Apache License 2.0

Python 99.66% Shell 0.27% Dockerfile 0.07%

python spam regex hacktoberfest

smokedetector's Introduction

SmokeDetector

Headless chatbot that detects spam and posts it to chatrooms. Uses ChatExchange, takes questions from the Stack Exchange realtime tab, and accesses answers via the Stack Exchange API.

Example chat post:

Documentation

User documentation is in the wiki.

Detailed documentation for setting up and running SmokeDetector is in the wiki.

Basic setup

To set up SmokeDetector, please use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git checkout deploy
sudo pip3 install -r requirements.txt --upgrade
pip3 install --user -r user_requirements.txt --upgrade

Next, copy config.sample to a new file called config, and edit the values required.

To run, use python3 nocrash.py (preferably in a daemon-able mode, like a screen session.) You can also use python3 ws.py, but then SmokeDetector will be shut down after 6 hours; when running from nocrash.py, it will be restarted. (This is to be sure that closed websockets, if any, are reopened.)

Virtual environment setup

Running in a virtual environment is a good way to isolate dependency packages from your local system. To set up SmokeDetector in a virtual environment, you can use

git clone https://github.com/Charcoal-SE/SmokeDetector.git
cd SmokeDetector
git config user.email "[email protected]"
git config user.name "SmokeDetector"
git checkout deploy

python3 -m venv env
env/bin/pip3 install -r requirements.txt --upgrade
env/bin/pip3 install --user -r user_requirements.txt --upgrade

Next, copy the config file and edit as said above. To run SmokeDetector in this virtual environment, use env/bin/python3 nocrash.py.

[Note: On some systems (e.g. macOS and Linux), some circumstances may require the --user option to be removed from the last pip3 command line in the above instructions. However, the --user option is known to be necessary in other circumstances. Further testing is necessary to resolve the discrepancy.]

Docker setup

Running in a Docker container is an even better way to isolate dependency packages from your local system. To set up SmokeDetector in a Docker container, follow the steps below.

Grab the Dockerfile and build an image of SmokeDetector:

DATE=$(date +%F)
mkdir temp
cd temp
wget https://raw.githubusercontent.com/Charcoal-SE/SmokeDetector/master/Dockerfile
docker build -t smokey:$DATE .

Create a container from the image you just built

docker create --name=mysmokedetector smokey:$DATE

Start the container. Don't worry, SmokeDetector won't run until it's ready, so you have the chance to edit the configuration file before SmokeDetector runs.

Copy config.sample to a new file named config and edit the values required, then copy the file into the container with this command:
```
docker cp config mysmokedetector:/home/smokey/SmokeDetector/config
```
If you would like to set up additional stuff (SSH, Git etc.), you can do so with a Bash shell in the container:
```
docker exec -it mysmokedetector bash
```
After you're ready, put a file named ready under /home/smokey:
```
touch ~smokey/ready
```

Automate Docker deployment with Docker Compose

I'll assume you are familiar with the basics of Docker and Docker Compose.

The first thing you need is a properly filled config file. You can start with the sample.

Create a directory (name it whatever you like) and place the config file and the docker-compose.yml file in it. Run docker-compose up -d and your SmokeDetector instance is up.

If you want additional control, such as memory and CPU constraints, you can edit docker-compose.yml and add the following keys to smokey. The example values are the recommended values.

restart: always  # when your host reboots Smokey can autostart
mem_limit: 512M
cpus: 0.5  # Recommend 2.0 or more for spam waves

Requirements

SmokeDetector only supports Stack Exchange logins, and runs on Python 3.7 or higher, for now.

To allow committing blacklist and watchlist modifications back to GitHub, your system also needs Git 1.8 or higher, although we recommend Git 2.11+.

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)

at your option.

Contribution Licensing

By submitting your contribution for inclusion in the work as defined in the Apache-2.0 license, you agree that it be dual licensed as above, without any additional terms or conditions.

smokedetector's People

Contributors

honnza theguywiththehat thomas-daniels braiam ndrewh astrocb unihedro jc3 me-azharahmed tvam apnorton rschrieken durron597 michaelb958 sentientmachine visualwaves kevin-brown artofcode- brocka k-guan ferrybig seth-johnson jalsnipe arasharchor magisch undo1 nickvolynkin aralun rivy mitizhi double-fault tripleee bwdraco ioffl djmcmayhem chrismarshallsf glorfindel83 davidpostill j-f1 raulsebastian mego cerbrus floern superplane39 sohaeb angussidney elvishfiend dheerajbhaskar nathanoliver1 surajrao quartata yemenr paulroub flyingitalianman nobodynada rjrudman awegnergithub jake-symons wangwuhen ibug videonauth micsthepick danbopes gparyani mehrdad-shokri bytecommander makyen longphui jdrew1303 fzql amnaaltawil krishnagarg37 gunr2171 martijnbraam gayathrivenkatesh malshagamage kishan-1234 tejasmit grahams58 ayushi-universe clairverbot akhillllldev szam25 42b 2alin naveenp13 amanbhandula wildbeez thesecretmaster chuck0518 bertiebaggio nareshnesh lunarwatcher alexanderthenotgreat sri-shree dipu11 azharmithani dineshkrishnareddy sgh304 jesstar

smokedetector's Issues

Trips on posts in a foreign language

Whenever a post shows up that is posted in a different language, a message like this is posted:

[ SmokeDetector ] All-caps title: יוֹשְׁבֵי דְּבִיר

Could we make it not do that?

Add blacklist commands for keywords

Prompted by this: http://chat.stackexchange.com/transcript/message/33655049#33655049

I (and bwDraco) think that it would be great if we could implement a !!/blacklist-keyword command or similar.

Since we already have the !!/blacklist command implemented for websites, and the bad keyword regexes have already been separated into a text file, it should be fairly easy. All the infrastructure is there; we just need to copy/paste some code and change a few things.

While we're doing this, it might also be a good idea to rename the existing !!/blacklist command to !!/blacklist-website. I've noticed that a few people recently have misunderstood the use of the blacklist command, and changing its name should make that clear.

What are your thoughts?

If we reach an agreement that this would be a good idea, I could do this on the weekend (as long as no-one beats me to it).

Website (metasmoke.erwaysoftware.com) gives a 503 service unavailable Error.

The website (https://metasmoke.erwaysoftware.com/), linked from the description of this GitHub repo, is broken. It gives me a "503 Service Unavailable" error page. Would be nice if it worked :-)

Reasons of locking every single issue..

What is the reason you're locking every issue? I'm trying to help you, even if you don't want that help, OK.

Creating a write-only team for Charcoal projects

I'm leaving this here because it seems like more people see our GitHub issues than they do chat messages.

This was an idea I had a couple days ago. What if we created another team on the Charcoal-SE organization, with write access to all of Charcoal's projects (as opposed to the admin access that the Core team have)? That allows us to give people who commonly contribute to the projects write access, instead of forcing them to rely on pull requests - which, let's be honest, aren't much fun when you're just waiting around.

I'm thinking of people like Kyll, Ashish, and angussidney, who have sent us the majority of our recent pull requests. There are probably more people I've missed out there, but that's the general idea.

Obviously, they'd need to be people (a) who we can trust, and (b) with privileges for Smokey. I'm not proposing adding just anyone who submits a PR; rather, people who have a track record of good contributions. Those people are a big resource for Charcoal; this would reduce the friction to them helping out.

Thoughts?

Detect links to fiverr.com

Fiverr.com is a site for selling freelance work. Answers that link to that site are very likely spam.

An example post at http://stackoverflow.com/posts/38404091/revisions

Alias 'tpa' with 'tpu'

... I'm asking for this because I seem to always reply with "tpa" by habit (from using Phamhilator) instead of tpu.

Blacklist functionality is broken, due to Git checks

Git currently checks against "Master", not "Deploy" which isn't updated as frequently. This prevents HEAD checks from being done. Ideally, we'd be updating "Master" right before the check, or we'd switch all the checks to "Deploy".

We'll have to decide which of these two approaches we want to go forward with.

FP feedback from the API does not delete chat message

If possible, smokey should attempt to delete the chat message of a report if false positive feedback is given via the API.

!!false command to indicate false positives

As an admin I want to be able to respond to a spam post with the !!false command to instruct smokey that the reported spam is a false positive so that smokey doesn't keep posting that same post in the chat room.

Possible steps to implement:
start with an in memory storage for the current running instance
Later add persistent storage

Commands not all returning Response Objects

At this point, this is an issue to remind me what I've investigated.

Problem: Smokey is returning something other than a response object in a few areas.

A bool here:

AttributeError: 'bool' object has no attribute 'message'
2016-07-18 10:37:33.673173 UTC
  File "/media/sda2/Smokey2/excepthook.py", line 46, in run_with_except_hook
    run_old(*args, **kw)

  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/media/sda2/Smokey2/ChatExchange/chatexchange/browser.py", line 696, in _runner
    self.on_activity(json.loads(a))

  File "/media/sda2/Smokey2/ChatExchange/chatexchange/rooms.py", line 81, in on_activity
    event_callback(event, self._client)

  File "/media/sda2/Smokey2/chatcommunicate.py", line 164, in watcher
    if result.message:

A NoneType here:

AttributeError: 'NoneType' object has no attribute 'command_status'
2016-07-18 10:15:45.964939 UTC
  File "/media/sda2/Smokey2/excepthook.py", line 46, in run_with_except_hook
    run_old(*args, **kw)

  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/media/sda2/Smokey2/ChatExchange/chatexchange/browser.py", line 696, in _runner
    self.on_activity(json.loads(a))

  File "/media/sda2/Smokey2/ChatExchange/chatexchange/rooms.py", line 81, in on_activity
    event_callback(event, self._client)

  File "/media/sda2/Smokey2/chatcommunicate.py", line 133, in watcher
    if result.command_status and result.message:

Initial oddness:

Issuing the command !!/addblu google.com returns the following:

Invalid format. Valid format: !!/addblu profileurl or !!/addblu userid sitename.

Expected to get the above plus <unrecognized command> because command_status is False and it should fall through to that check here

bool may be from a failed permissions check. check_permissions returns False on failure. Change this to a response object.

Multiple responses to same command

Issuing sd abc dbf sdc responds with

1. [:31107584] <processed without return value>
<unrecognized command>
2. [:31107579] <processed without return value>
<unrecognized command>
3. [:31107416] <processed without return value>
<unrecognized command>

This needs to be adjusted to respond only with "unrecognized command". The duplicate output happens because these checks are independent if statements rather than branches of a single if...else chain.
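The difference between the two styles can be shown with a small sketch. This is not the actual SmokeDetector code: the function names and the `recognized` flag are hypothetical, and only the two placeholder strings are taken from the output above.

```python
def reply_lines_independent(result, recognized):
    """Independent `if` checks: more than one of them can fire."""
    lines = []
    if result is None:
        lines.append("<processed without return value>")
    if not recognized:
        lines.append("<unrecognized command>")
    return lines


def reply_lines_chained(result, recognized):
    """A single `if...elif...else` chain: exactly one branch fires."""
    if not recognized:
        return ["<unrecognized command>"]
    elif result is None:
        return ["<processed without return value>"]
    else:
        return [result]
```

With the chained version, an unrecognized command that also produced no return value yields a single line instead of two.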

Migrate config to globalvariables

There are currently only a few values in config. I propose we remove "config" completely and migrate these values to globalvariables.

This reduces the two locations (config and globalvariables) that developers have to modify when starting SmokeDetector to one, and allows us to remove a ConfigParser import that is only used at startup.

Repeated letter in question may need a little tweaking:

It seems like this question ought to have been caught:

http://stackoverflow.com/questions/31643388/no-gateway-trying-to-portfoward

i do ipconfig /all, it gives me a blank gateway.
The gateway it gives me is not in the standard form
the first LETTER of the gateway is F

HEEEEELLLP

Limit "notify" to users that have been active in the room in the past <X> minutes.

At this moment, in the SO Close Vote Reviewers chatroom, SD is notifying 6 different users for every single report.

Example:

[ SmokeDetector | MS ] Few unique characters in body: SolvedSOLVEDSOLVED by Furkan Ayık on stackoverflow.com
(@PraveenKumar @AndrasDeak @πάνταῥεῖ @FrankerZ @tripleee @dorukayhan)

The number of users getting notified has been growing steadily. IMO, it's getting a little annoying. Only a portion of the users actually respond to these notifications, and they get notified even when they haven't been active for hours.

I'd like to request that these notifications be filtered on user activity.
I.e., don't bother notifying a user who hasn't been active in the last hour.

Just to be clear: I have no issue with these users. Just that the list of notifications is getting close to the length of the actual report.
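A minimal sketch of such a filter, assuming the bot keeps a mapping from each user to the timestamp of their last chat message. The function name, the `last_active` mapping, and the default cutoff are all hypothetical and not part of SmokeDetector.

```python
import time


def users_to_notify(subscribed, last_active, max_idle_minutes=60, now=None):
    """Return the subscribed users who were active within the idle window.

    subscribed       -- iterable of user names signed up for notifications
    last_active      -- dict mapping user name to a Unix timestamp (assumed)
    max_idle_minutes -- users idle for longer than this are skipped
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_minutes * 60
    # Users with no recorded activity are treated as inactive.
    return [user for user in subscribed
            if last_active.get(user, float("-inf")) >= cutoff]
```

The report would then ping only the names returned by `users_to_notify`, so the notification list shrinks as users go idle.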

Relicense under dual MIT/Apache 2.0

We currently don't have a specified license for SmokeDetector - and thus, by default, it is under full copyright. This is very restrictive, and we'd like to change it to something more permissive (in this case, dual licensed under MIT/Apache 2.0). We'll need consent from all contributors to this repository to do so:

To agree to relicensing, just leave this comment below or otherwise indicate consent:

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to choose either at their option.

Some more info:

This involves adding the following to the README and including the full text of both licenses in the repository:

## License

Licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.

MIT is fairly permissive, so it's preferred by most, however it requires you to include the license in everything using the code. On the other hand, Apache doesn't have this issue, but is incompatible with GPLv2. A dual license gives users the freedom to choose a license of their choice.

Submitting nonexistent sites seems to crash the bot

I am afraid I succeeded in bringing the bot offline repeatedly just now by attempting to !!/report posts which had already been deleted.

The first attempt was an experiment where perhaps I should have been more careful; the second, reports which originally didn't succeed, and when I retried, some of them had been deleted in the meantime.

This could easily happen by mistake at any time, so the bot should be able to cope with this.

Chat link for context: http://codereview.stackexchange.com/a/146113/63322

Smarter tp/fp comments from metasmoke on blacklist request

See here for an example.

A user requested that a username should be blacklisted - which prompted the usual metasmoke comment. However - the result only checks for true/false positives in post bodies, rather than the username.

Fix comment
Fix link in PR

Patterns for catching known spam domains

I know that some of these are already in the system, but it was easier to include them here as some can be improved. These were compiled based on the master spam list by @FieryDragonLord

Here are a few patterns that catch most of the "Repair Toolbox" domains

filefix(er)?\.com
fix(.*)(file(s)?|tool(box)?)\.com
(recovery|repair)kit\.com
(repair|recovery|fix)tool(box)?\.com
\.repair

And some patterns for most of the "Wise Recovery" domains

fix1\.org
easyfix\.org
errorsfixer\.org
regeasypro\.com
recoverypro\.(com|net)
smart(pc)?fixer\.(com|net|org)
wiserecovery\.com
drivertuner\.(com|net)
official-drivers\.com
wisefixer\.(com|net|org)

These patterns can match most of the "Tenorshare" domains

-recovery\.(com|net)
passwordcracker\.(com|net)
-password\.net
(windows(7-)?)?password(s)?(-)?(recovery|reset)\.(com|net)
lost(windows)?password\.com
tenorshare\.com
(downloader|pdf)converter\.(com|net)

Some for the "iSpire/Wasel" domains

i-spire\.net
iwasl\.com
qobul\.com
unblockingtwitter\.com
bestcheapvpnservice\.com
openingblockedsite\.com
arabic(soft)?download(s)?\.com
(vpn|internet)?wasel(pro)?\.com
vpn(faqs|answers)\.com
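Before adding patterns like these to a blacklist, it can help to check which of them match a given domain. The sketch below uses the "Repair Toolbox" patterns from above with Python's standard `re` module; the helper name is hypothetical, and the domains passed to it are expected to be illustrative test inputs rather than entries from the spam list.

```python
import re

# Patterns copied from the "Repair Toolbox" list above.
REPAIR_TOOLBOX_PATTERNS = [
    r"filefix(er)?\.com",
    r"fix(.*)(file(s)?|tool(box)?)\.com",
    r"(recovery|repair)kit\.com",
    r"(repair|recovery|fix)tool(box)?\.com",
    r"\.repair",
]


def matching_patterns(domain, patterns=REPAIR_TOOLBOX_PATTERNS):
    """Return every pattern that matches somewhere in the given domain."""
    return [pattern for pattern in patterns
            if re.search(pattern, domain, re.IGNORECASE)]
```

An empty result for a legitimate domain and a non-empty result for a known spam domain is a quick sanity check that a pattern is neither too broad nor too narrow.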

Whitelist known good users who have names with blacklisted keywords

Some valid users have names containing blacklisted keywords, and Smokey reports them each time their post gets bumped. Add ability to whitelist such valid users.

Sample fp reports by Smokey:

As a room owner I want newly privileged users to receive a link to Smokey's guidance/wiki

... so that they don't screw up things on their first actions and need to be told trivial things.

This might just be implemented as a humanoid procedure but if doable it would be nice if it could be automated.

Add !!/block and !!/unblock commands

With our recent chat trolls with SmokeDetector, we've found problems with !!/stappit. Since Undo has added an auto-restarting feature, !!/stappit doesn't work. Thus, Smokey keeps posting...

I think an easy solution to this would be to create !!/block {time} and !!/unblock commands. !!/block {time} would stop Smokey from posting until this given time (in minutes) is up. It could just be a simple check when posting, i.e. if isBlocked: return else: post. !!/unblock would just make it eligible to post again.

Is this a good solution to the problem? Feedback welcome.
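To make the idea concrete, here is a rough sketch of the blocking check, assuming the block is tracked as an expiry timestamp. The class and method names are hypothetical; this is only an illustration of the proposal, not existing SmokeDetector code.

```python
import time


class PostingBlock:
    """Tracks whether the bot is currently blocked from posting."""

    def __init__(self):
        self._blocked_until = 0.0  # Unix timestamp; 0 means not blocked

    def block(self, minutes, now=None):
        """Handle `!!/block {time}`: stop posting for the given minutes."""
        now = time.time() if now is None else now
        self._blocked_until = now + minutes * 60

    def unblock(self):
        """Handle `!!/unblock`: make the bot eligible to post again."""
        self._blocked_until = 0.0

    def is_blocked(self, now=None):
        """The check to run before each post."""
        now = time.time() if now is None else now
        return now < self._blocked_until
```

Because the block is only a timestamp comparison, it expires on its own. Whether it should also survive an automatic restart would depend on where this state is stored.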

Remove backoff messages

Is there any reason we can't remove these messages?

I can't think of any real action we take on them any more. Any actual violations of the backoff should be reported, but just receiving a backoff is a common thing.

Thoughts? Any reason we shouldn't stop posting these?

Please document how to add a new room

I was unable to find documentation for how to add a new room to the bot's configuration. This seems to be a recurring thing, so having a few sentences about how to request an addition could be a welcome addition to e.g. https://github.com/Charcoal-SE/SmokeDetector/wiki/Chat-Rooms

I can guess, but I don't think that qualifies me to actually write documentation.

A rough transcript of my guesses will be visible around http://chat.stackoverflow.com/transcript/message/32434109#32434109

Improvements to auto-blacklist PRs

I'd like to see if I can make the following improvements to the PRs created by non-code-privileged users when they use the !!/blacklist command:

Link to a MS search for the URL
Link the username to their chat profile
(if possible) automatically search MS and show the number of tps and fps for the site.

Submit edits to convert to lowercase

It should be programmed to convert those titles to lowercase... Suggest edits.

From the Tavern by nicael:
http://chat.meta.stackexchange.com/transcript/message/2631345#2631345

Different alert for apparent vandalism

I have repeatedly tripped on SmokeDetector reports which looked like spam or rude/abusive but which were self-vandalism, which is easily reverted and should not be flagged as spam.

Could the alert from the bot look different when it triggers because of an edit to a post which was previously fine?

For example, a chat alert message like

[ SmokeDetector ] Few unique characters in body, repeating characters in body, repeating characters in title, title has only one unique char: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa by Kot on stackoverflow.com (@tripleee)

... could have been more easily categorized correctly if it had a different chat message, perhaps something like

[ SmokeDetector ] Few unique characters in body, repeating characters in body, repeating characters in title, title has only one unique char: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa edited by Kot on stackoverflow.com (@tripleee)

Notice how the question's title is no longer a link, and the "edited" link instead links to the actual revision which introduced this vandalism.

Make SmokeDetector easier to run on an alternate account - Chat message prefix variables

I've been testing SmokeDetector more on an alternate account and would like to propose a modest change to make this easier for other developers.

I'd like to put this information in globalvars.py to make it easier for developers to quickly change this data.

Pull [SmokeDetector](https://github.com/Charcoal-SE/SmokeDetector)

By pulling this out of the multiple places it appears in the code, we can allow users to link to their personal repository and change the name from the official "SmokeDetector". This makes a test instance easier to identify, and if someone is curious, they can follow the link to the appropriate GitHub profile. It also removes the "authentic SmokeDetector" look from a test version that may be running.

I'd suggest either pulling this into a single variable like chatmessage_prefix that contains the entire string, or two config variables, one for bot_name and one for bot_repository.

This change would affect both ws.py and chatcommands.py

A similar variable may be needed for https://github.com/Charcoal-SE/SmokeDetector/commit/ links (in ws.py), but to me this isn't as important as the two listed above.
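As a sketch of the second option, the prefix could be built from the two variables in one place. The variable names come from the suggestion above; the helper function and the exact bracket layout of the prefix are assumptions.

```python
# Proposed settings, with the official values as defaults.
bot_name = "SmokeDetector"
bot_repository = "https://github.com/Charcoal-SE/SmokeDetector"


def chatmessage_prefix(name=bot_name, repository=bot_repository):
    """Build the chat message prefix as a Markdown link to the repository."""
    return "[ [{}]({}) ]".format(name, repository)
```

A developer running a test instance would then change only these two values instead of editing every place the prefix is written out.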

Regarding the blocking issue..

So what do you think about my suggestion? we can elaborate on a better approach if we bring the discussion further. I can code in C, C#, Python and so on.. there's no problem for me.

Invalidated feedback on MS does not remove the user from blacklist

Apparently, if someone blacklists a user using tpu-, and the feedback gets invalidated, the user is not removed from the blacklist.

This has caused false positives. Example: http://chat.stackoverflow.com/transcript/message/34103819#34103819

IMO, the user should be removed from the blacklist in case the feedback is invalidated.

Detect proper links at the end of a body, not partial ones

http://english.stackexchange.com/a/344768/71877 was reported for link at end of answer but at the end the post only contained

(Dictionary.com)

That looks like a domain but I don't know if that should qualify as a proper link. Otherwise consider excluding certain domains from that check?

Smokey report: http://chat.stackexchange.com/transcript/message/32525630#32525630

test issue for testing stuff

hodor

Clean up handle_commands discussion

Summary

The handle_commands function is a 500-line monstrosity. I'd like to discuss cleaning it up to do the following:

Standardize what is returned from handle_commands
Have individual functions for each command.

If we agree that this work is needed, I'll take on the work load over the next several weeks (translation: after my summer vacation) to do this work.

Clean up details

This section describes my plan for the above changes. I'd like to do both changes at the same time, but if opposition to that exists one change at a time is acceptable to me.

Standardize what is returned from `handle_commands`

Currently, the function has multiple return paths and multiple ways of returning data.

One example is in the !!/block command:

return False, "Invalid duration."

Another example is in !!/errorlogs

return "The !!/errorlogs command requires 1 argument."

These two methods lead to weird handling in the two locations that handle_commands is called from. (Location 1, Location 2)

Method of handling 1:

if type(result) != tuple:
    result = (True, result)
if result[1] is not None:
    if wrap2.host + str(message_id) in GlobalVars.listen_to_these_if_edited:
        GlobalVars.listen_to_these_if_edited.remove(wrap2.host + str(message_id))
    message_with_reply = u":{} {}".format(message_id, result[1])
    if len(message_with_reply) <= 500 or "\n" in result[1]:
        ev.message.reply(result[1], False)
if result[0] is False:
    add_to_listen_if_edited(wrap2.host, message_id)

Method of handling 2:

r = result
if type(result) == tuple:
    result = result[1]
if result is not None and result is not False:
    reply += result + os.linesep
elif result is None:
    reply += "<processed without return value>" + os.linesep
    amount_none += 1
elif result is False or r[0] is False:
    reply += "<unrecognized command>" + os.linesep
    amount_unrecognized += 1

These types of checks are unintuitive and difficult to follow. They also add two sections of unnecessary code that behave in a similar, but not identical, way.

Proposal

I propose that all return paths in handle_commands return the expected tuple of (boolean, string) and the two locations that call handle_commands do so like this, with appropriate variable names:

bool_result, string_result = handle_commands(...)

The block of code afterward can then be reduced and simplified by using descriptive variables instead of elements of a tuple.
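As a transitional sketch, a small helper could coerce both existing return styles into the proposed (boolean, string) tuple, mirroring the `(True, result)` wrapping in "Method of handling 1". The helper name is hypothetical; the end goal would still be for every return path to produce the tuple directly.

```python
def normalize_result(result):
    """Coerce a handle_commands return value into a (bool, str-or-None) tuple."""
    if isinstance(result, tuple):
        return result
    # A bare string (or None) is treated as a successful command.
    return (True, result)


# Callers can then unpack into descriptive names in both call sites.
bool_result, string_result = normalize_result(
    "The !!/errorlogs command requires 1 argument.")
```

Once both callers unpack this way, the tuple and `None` checks in the two handling blocks collapse into plain tests on `bool_result` and `string_result`.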

Have individual functions for each command.

As handle_commands is currently written, all command processing is occurring in this function. Much of this logic should be moved to their own individual functions and handle_commands should be used to determine which function to call. Each block that looks like the following should have their own function:

if content_lower.startswith("!!/addblu")

Proposal

Move the logic of each command to its own function. This keeps handle_commands cleaner and allows certain areas of code to be reused.
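One possible shape for this is a dispatch table that maps command names to functions. The sketch below is an assumption about how it could look, not the planned implementation: the handler bodies are placeholders, the simplified signatures are hypothetical, and looking up the first word of the message differs slightly from the current `startswith` checks.

```python
def command_add_blacklist_user(content_lower):
    # Placeholder body; the real logic would move here from handle_commands.
    return True, "addblu handler called"


def command_block(content_lower):
    # Placeholder body for the !!/block command.
    return True, "block handler called"


# Maps the command word to the function that implements it.
COMMANDS = {
    "!!/addblu": command_add_blacklist_user,
    "!!/block": command_block,
}


def handle_commands(content_lower):
    """Decide which command function to call and return its result."""
    parts = content_lower.split()
    handler = COMMANDS.get(parts[0]) if parts else None
    if handler is None:
        return False, None  # unrecognized command
    return handler(content_lower)
```

Adding a command then means writing one function and adding one entry to the table, and handle_commands itself stays a few lines long.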

Should reports be edited on FP instead of deletion?

Currently, if a report is marked as False Positive, Smokey will try to delete the report to prevent accidental flagging.

However, on more than one occasion I have been wanting to see the post itself so I can make up my own mind/investigate/check for hidden spam etc. To do this, I have to go to metasmoke, click the post tab, find the post, go to the post page and click once more to view on the original site. Too much effort for a lazy person like me :)

Instead, I propose that Smokey should try to edit the report to the following:

[SmokeDetector | MS] (false positive - report deleted)

or something similar, so that the MetaSmoke link is still clickable.

This could be easily done by modifying line 983 of chatcommands.py. Should be low-hanging fruit.

Inconsistent environments in CI and readme.md

Issue

In .travis.yml:

pip install coverage phonenumbers regex==2015.11.22 beautifulsoup4 requests websocket-client pytest flake8 termcolor --upgrade

in circleci.yml:

pip install beautifulsoup4 requests websocket-client coverage pytest phonenumbers flake8 regex==2015.11.22 termcolor --upgrade

In readme.md:

sudo pip install pip --upgrade
sudo pip install beautifulsoup4
sudo pip install requests --upgrade
sudo pip install websocket-client --upgrade
sudo pip install phonenumbers
sudo pip install regex
sudo pip install termcolor

Diff:

pip, requests, websocket-client are not upgraded in travis and circleci build
termcolor is not upgraded in readme.md.
regex version is frozen in travis build, but not in the instruction
circleci and travis use the same list with different ordering. It's hard to check if the environments are actually equal.

Proposal

This can be solved by making requirements.txt and reusing it both in CI and manual installation.

Make SmokeDetector's code pass Flake8

Kevin Brown suggested using something like Flake8 for the source checks. But in the current state, there are too many warnings to use it for actual tests. Hence, we should make SmokeDetector's code pass Flake8 (except warning E501, "line too long", which should not always be fixed). /cc @kevin-brown

testing, please ignore some more

still works 👎

Send Magento alerts to dedicated room

As requested by the Magento user Raphael at Digital Pianism in this Charcoal discussion, please send Smokey reports for the Magento site to the dedicated chat room 47869.

Add check_permissions decorator

Before I create a pull request for a feature I've already written, I want to make sure everyone is on board with the change.

The full change is available here (with a successful Travis CI build).

This change pulls the is_privileged check out of each of the commands and instead uses a decorator on the functions that should be restricted to privileged users.

Previously:

def command_add_blacklist_user(message_parts, content_lower, message_url, ev_room, ev_user_id, wrap2, *args, **kwargs):
    quiet_action = any([part.endswith('-') for part in message_parts])
    if is_privileged(ev_room, ev_user_id, wrap2):
        uid, val = get_user_from_list_command(content_lower)
        if uid > -1 and val != "":
            add_blacklisted_user((uid, val), message_url, "")
            return None if quiet_action else "User blacklisted (`{}` on `{}`).".format(uid, val)
        elif uid == -2:
            return "Error: {}".format(val)
        else:
            return "Invalid format. Valid format: `!!/addblu profileurl` *or* `!!/addblu userid sitename`."

New:

@check_permissions
def command_add_blacklist_user(message_parts, content_lower, message_url, ev_room, ev_user_id, wrap2, *args, **kwargs):
    quiet_action = any([part.endswith('-') for part in message_parts])
    uid, val = get_user_from_list_command(content_lower)
    if uid > -1 and val != "":
        add_blacklisted_user((uid, val), message_url, "")
        return None if quiet_action else "User blacklisted (`{}` on `{}`).".format(uid, val)
    elif uid == -2:
        return "Error: {}".format(val)
    else:
        return "Invalid format. Valid format: `!!/addblu profileurl` *or* `!!/addblu userid sitename`."

Using this decorator, we can write the functions to work without caring about whether the user has permission to access it. If we want to protect the function, add the decorator. If we do not, don't add the decorator.
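For readers unfamiliar with the pattern, a decorator along these lines could look roughly like the sketch below. It is not the code from the linked change: the `is_privileged` stub, the `PRIVILEGED_USERS` set, the keyword-only arguments, and returning False for unprivileged users are all assumptions made to keep the example self-contained.

```python
import functools

# Hypothetical stand-in for the real privilege lookup.
PRIVILEGED_USERS = {("11540", 1)}


def is_privileged(ev_room, ev_user_id, wrap2):
    return (ev_room, ev_user_id) in PRIVILEGED_USERS


def check_permissions(function):
    """Run the wrapped command only if the calling user is privileged."""
    @functools.wraps(function)
    def run_command(*args, ev_room, ev_user_id, wrap2, **kwargs):
        if not is_privileged(ev_room, ev_user_id, wrap2):
            return False  # assumed failure value
        return function(*args, ev_room=ev_room, ev_user_id=ev_user_id,
                        wrap2=wrap2, **kwargs)
    return run_command


@check_permissions
def command_example(message_parts, ev_room, ev_user_id, wrap2, **kwargs):
    # The body no longer needs its own is_privileged check.
    return "ran for user {}".format(ev_user_id)
```

Here `command_example(message_parts=["!!/x"], ev_room="11540", ev_user_id=1, wrap2=None)` runs the command body, while the same call with an unprivileged `ev_user_id` returns False without entering the function.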

Suppress reposting of all-caps posts

Report an all-caps title only once, and then automatically ignore it. If no-one edits it after the first report, it is unlikely to be edited by Tavern folks on subsequent reports.

Original request by Behaviour in The Tavern on the Meta: http://chat.meta.stackexchange.com/transcript/89?m=2858969#2858969

Add a command for tpu and delete

It looks like we don't have a command that can run tpu and delete together yet. If I need to mark a message as tpu and delete it, I need:

sd tpu
sd deleted

What about adding a command like tp[u]d[-] or tp[u]del[-]?
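The proposed tp[u]d[-] / tp[u]del[-] syntax could be parsed along these lines. This is only a sketch: the function name and the returned dictionary are hypothetical, and it assumes the optional `u` selects the blacklist-user variant (as in `tpu`) and a trailing `-` requests a quiet action.

```python
import re

# tp, optional "u", then "d" or "del", then an optional trailing "-".
COMBINED_RE = re.compile(r"tp(?P<user>u?)(?:del|d)(?P<quiet>-?)")


def parse_tp_delete(command):
    """Return the parsed flags, or None if this is not a combined command."""
    match = COMBINED_RE.fullmatch(command.lower())
    if match is None:
        return None
    return {
        "blacklist_user": bool(match.group("user")),  # assumption: "u" as in tpu
        "delete": True,
        "quiet": bool(match.group("quiet")),
    }
```

Plain `tpu` does not match, so the existing commands would keep their current meaning.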

Add command to show flagged posts that have not been deleted yet (or post them automatically)

Sometimes spam reports about smaller sites get less attention than they need, either because they're followed by many other reports or because only a few people are online at the time.

I would suggest that Smokey keep a list of all reported posts that have received positive feedback or no feedback at all and have not yet been removed from the site. A command like !!/pending would then show a list of all the reports that still need more flags or feedback. Example:

"Skin care tips" by "SpamUser" on webmasters.stackexchange.com [MS] (reported 12 minutes ago, 1 tp, 0 naa, 0 fp, post score -3)
"Best essay writing service" by "Writer" on graphicdesign.stackexchange.com [MS] (reported 6 minutes ago, no feedback yet, post score -1)

This would help make sure no reports slip through, and would let us check whether anything needs more flags after a burst of reports without having to walk through the links manually.

Additionally, it might be useful to post this list not only on demand but also automatically, for posts in the list that were reported more than e.g. 10 minutes ago.
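The bookkeeping behind such a !!/pending command might look roughly like the sketch below. The `Report` fields, the function names, and the `min_age_minutes` parameter are all hypothetical; a real version would also need to poll the sites to learn whether a post has been deleted, and the [MS] link is left out here.

```python
from dataclasses import dataclass


@dataclass
class Report:
    title: str
    user: str
    site: str
    minutes_ago: int
    tp: int = 0
    naa: int = 0
    fp: int = 0
    score: int = 0
    deleted: bool = False


def is_pending(report):
    # Pending: still on the site, with positive feedback or none at all.
    no_feedback = report.tp + report.naa + report.fp == 0
    return not report.deleted and (report.tp > 0 or no_feedback)


def format_report(report):
    if report.tp + report.naa + report.fp == 0:
        feedback = "no feedback yet"
    else:
        feedback = "{} tp, {} naa, {} fp".format(report.tp, report.naa, report.fp)
    return '"{}" by "{}" on {} (reported {} minutes ago, {}, post score {})'.format(
        report.title, report.user, report.site, report.minutes_ago,
        feedback, report.score)


def pending_reports(reports, min_age_minutes=0):
    # min_age_minutes lets the same function drive the automatic reminder.
    return [format_report(r) for r in reports
            if is_pending(r) and r.minutes_ago >= min_age_minutes]
```

Calling `pending_reports(reports, min_age_minutes=10)` would give the list for the automatic variant, restricted to reports older than ten minutes.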

Create unit tests for ws.py

We inadvertently break ws.py from time to time. With Travis, we should be able to catch most of these errors by having unit tests for ws.py.

Escape special characters from inputs (like post titles)

http://chat.meta.stackexchange.com/transcript/message/2225453#2225453

The ` in the post title screws things up.
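One possible approach is to backslash-escape the characters that chat Markdown treats specially before the title is put into a message. The sketch below assumes the chat renderer honors backslash escapes; the function name and the exact character set are illustrative choices, not something stated in this issue.

```python
import re

# Characters escaped here: underscore, asterisk, backtick, square brackets, backslash.
SPECIAL_CHARS_RE = re.compile(r"([_*`\[\]\\])")


def escape_title(title):
    """Prefix each special character with a backslash."""
    return SPECIAL_CHARS_RE.sub(r"\\\1", title)
```

A title containing a stray backtick would then no longer open a code span in the report message.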

Should we consider automatically flagging posts which hit more than one filter?

As we all know, the role of the SmokeDetector project is to detect and delete spam as fast as possible. Of course, the fastest way to delete spam is to have as many people flag it as fast as possible.

So, what if we made Smokey automatically flag posts which hit more than one filter, so that spam gets deleted faster? According to Undo, reports which hit more than one filter are pretty damn accurate:

At least two reasons: 22136 TPs, 424 FPs (~2% false positives)
At least three reasons: 13087 TPs, 24 FPs (~0.2% false positives)

When you compare that to the helpful flag percentage of a human (like me) of ~95%, Smokey is definitely accurate enough for us to consider automated flagging.

Of course, we can do more to make the auto-flags more accurate:

Increase the number of filters required before an auto-flag is cast
Exclude less accurate reasons from the count of filters hit, such as Link at End of Answer, Repeating characters in body etc.
Revert flags if FP feedback is provided (is this possible via the API? Maybe a Meta FR?)
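The first two ideas could be combined in a check like the following sketch. The function name and the default threshold are hypothetical; the two excluded reasons are the ones named above, and the other reason names used in the usage note are only illustrative.

```python
# Reasons considered too inaccurate to count toward an auto-flag.
LOW_ACCURACY_REASONS = {
    "link at end of answer",
    "repeating characters in body",
}


def should_autoflag(reasons, min_reasons=3):
    """Return True if enough sufficiently accurate filters were hit."""
    counted = {reason.lower() for reason in reasons} - LOW_ACCURACY_REASONS
    return len(counted) >= min_reasons
```

For instance, a report with the reasons "Bad keyword in title", "Blacklisted website", and "Link at End of Answer" counts as two reasons, so it would be flagged with `min_reasons=2` but not with the default of three.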

However, there will be some things we need to think about before we do anything:

NAA posts need to be separated from red-flaggable posts
Do we need to separate spam and offensive posts (well, yes... but will one incorrect type of red flag by Smokey make much of a difference compared to five correct flags from humans)?
Would SE be happy with us doing this? (I assume the answer would be yes, since they are already considering integration with us at our current accuracy, and with a little tweaking we can make auto-flags based on the number of filters more accurate than Andy's auto-comment-flagging, which everyone seems to be happy about)

But Angus, isn't that a huge amount of effort for just one extra flag? Surely it can't make much of a difference compared to our current flagger userbase?

Well, yes, in some ways, you are right. Most of the time, it won't make much of a difference, but it certainly will make spam get deleted faster, especially on smaller sites, where every flag counts.

Also, now that SE is starting to talk to us about the possibility of integration, I think it is important that we have a tried-and-tested way of identifying which posts are almost definitely spam, which we can present to them. Our overall false positive rate of 17% isn't good enough to be put into production on one of the biggest sites on the internet, so I think we should start work on a more accurate identification system so that it is ready for when SE needs it.

So, what does everyone else think? Please share your thoughts, ideas, and criticisms.

Of course, this is only an initial basic idea. Please don't nitpick at the specifics - if (mostly) everyone agrees to this, we can iron out any details in another discussion. For now, let's just discuss whether we want to put some serious effort into this.

New filter: website resembles username

E.g. for these kinds of spam posts, which quite often go undetected:
https://metasmoke.erwaysoftware.com/post/52946
https://metasmoke.erwaysoftware.com/post/52841
https://metasmoke.erwaysoftware.com/post/51936

Procedure: replace spaces in username by \W? and check if there's a link in the post which contains that string.
There are some users with 3 character usernames which have a chance of accidentally triggering the filter. Maybe this should only work for usernames above a certain length.
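The procedure could be sketched as follows. The function name, the URL pattern, and the `min_length` default of 6 are assumptions made for illustration; the right cutoff would have to be tuned against real reports.

```python
import re

# Rough URL matcher, for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def username_in_link(username, body, min_length=6):
    """Check whether any link in the body contains the username,
    allowing one optional non-word character where the name has spaces."""
    name = username.strip()
    if len(name) < min_length:
        # Short usernames are skipped to avoid accidental matches.
        return False
    parts = [re.escape(part) for part in name.split()]
    pattern = re.compile(r"\W?".join(parts), re.IGNORECASE)
    return any(pattern.search(url) for url in URL_RE.findall(body))
```

With this sketch, a user named "Best Essay Help" posting a link to a domain such as best-essay-help.example would trigger the filter, while a three-character username never would.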

Use named arguments for data-passing

As Normal so wisely suggests:

seeing these on separate lines suggests further improvement (for the future): naming these arguments. It's a long list with some "False, False" in the middle, and the only way to know what these do is to read the code in another module.

At some point in the future, we should use named arguments or something similar; especially as this grows and more data is flying around.
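One way to get there in Python is keyword-only parameters, which force every call site to name each value. The function and parameter names below are hypothetical and only illustrate the idea.

```python
def describe_post(*, title, site, is_answer=False, is_deleted=False):
    """Everything after the bare * must be passed by name."""
    kind = "answer" if is_answer else "question"
    state = "deleted" if is_deleted else "live"
    return "{} {} on {}: {}".format(state, kind, site, title)


# The call site now documents itself; no bare "False, False" in the middle.
summary = describe_post(
    title="Skin care tips",
    site="webmasters.stackexchange.com",
    is_answer=True,
)
```

Passing the same values positionally raises a `TypeError`, so an unnamed call cannot slip back in unnoticed.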

https://jsfiddle.net/TC2006/vcfjhg4m/[][1]

    Just something I made in JSFiddle.

    Hope it helps. :)

  [1]: https://jsfiddle.net/TC2006/vcfjhg4m/

Can this be fixed or explained why it qualifies for being reported?


Standardize what is returned from `handle_commands`