pamaxie / pamaxie.scan_api

This repository contains our Scanning API.

License: Other

Rust 43.92% Dockerfile 0.03% Makefile 48.54% LLVM 0.52% C 6.99%

api content-detection image-classification image-recognition ml public-api rust safety

pamaxie.scan_api's Introduction

We have started development again. This is a long process and will take a while to get rolling. For now we will focus mostly on our web presence to drive more traffic to the project.

Documentation for this project can be found at Pamaxie's wiki. API credentials can be created on Pamaxie's website. Please let us know if the API misbehaves in any way. We will assist you as soon as possible.

Pamaxie was developed to verify the safety of content and media on the internet, and to let developers of chat applications moderate content automatically. For example, a neural network scans images for certain properties, and we plan to support other types of media as well. By combining machine learning with hand-crafted algorithms, we aim to create a service that makes the internet more secure. Our goal is to make the web safer and more fun to browse, and to keep users from seeing content that they may want to avoid.

We are developing this API with content hosts in mind. We will never share any data or sell it to third parties. We will always treat all user data that we keep on our servers with the highest respect for privacy. This is one of the reasons we chose to make this an open source project. Our API will be developed to be easy to interact with and to stay as responsive as possible.

The end goal of the project is to make the internet a more secure place to browse. Our API can be trained with your own data. If you don't want to train it yourself, you can access our API for free by creating an account on our website.

Contribution

If you'd like to contribute to Pamaxie, feel free to check out our wiki article on how to do so! We are always looking for people to help us, no matter what your current skill level is.

Please make sure to read our contribution guidelines and code of conduct so that you know what is expected of contributors.

We will release further updates on the wiki once the API is available for public signup.

Possible thanks to funding by:

Thanks to these partners helping us keep this project alive:

pamaxie.scan_api's People

prototypefund

pamaxie.scan_api's Issues

Validate current media detection

Validate that the library we are using for media detection works as intended and can detect most of the image types we will be working with.

Let users decide what kind of result they want

We currently deliver all results back.
This is really annoying if a user is only looking for one of the many things we can scan for, and it increases both their scan time and our load.
We should build a system that allows the user to "query" for the specific scan properties they are looking for, so that we only scan for those. For example, we would only run a scan to detect whether something contains porn, gore, or racy content, and skip all the other properties we offer (like Nazi symbolism or realism checks).
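One way such a query could be parsed is sketched below. This is only an illustration, not the project's implementation: the `ScanKind` variants mirror the examples named in this issue, and the comma-separated query format and the `parse_scan_query` function are assumptions.

```rust
use std::collections::BTreeSet;

/// Scan properties a caller may request.
/// Hypothetical names, taken from the examples in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ScanKind {
    Porn,
    Gore,
    Racy,
    NaziSymbolism,
    Realism,
}

impl ScanKind {
    /// Every scan we know about; used when the caller does not narrow the query.
    pub const ALL: [ScanKind; 5] = [
        ScanKind::Porn,
        ScanKind::Gore,
        ScanKind::Racy,
        ScanKind::NaziSymbolism,
        ScanKind::Realism,
    ];

    /// Maps a case-insensitive name to a scan kind.
    pub fn parse(name: &str) -> Option<ScanKind> {
        match name.trim().to_ascii_lowercase().as_str() {
            "porn" => Some(ScanKind::Porn),
            "gore" => Some(ScanKind::Gore),
            "racy" => Some(ScanKind::Racy),
            "nazi_symbolism" => Some(ScanKind::NaziSymbolism),
            "realism" => Some(ScanKind::Realism),
            _ => None,
        }
    }
}

/// Parses a comma-separated list such as "porn,gore" (assumed format).
/// An empty query keeps the current behaviour and selects every scan.
pub fn parse_scan_query(query: &str) -> Result<BTreeSet<ScanKind>, String> {
    if query.trim().is_empty() {
        return Ok(ScanKind::ALL.iter().copied().collect());
    }
    query
        .split(',')
        .map(|name| {
            ScanKind::parse(name)
                .ok_or_else(|| format!("unknown scan property: {}", name.trim()))
        })
        .collect()
}
```

The scanner would then run only the models whose kind is in the returned set. Rejecting unknown names outright, instead of ignoring them, keeps typos from silently producing an incomplete scan.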

Create an authentication flow for the API

Create an authentication flow that verifies users are allowed to connect to our API endpoints by validating the submitted token against the database.
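A minimal sketch of what that check could look like follows. Everything here is an assumption made for illustration: the `TokenStore` trait stands in for the real database, the `Authorization: Bearer <token>` header format is a guess, and a production flow would also verify token signatures and expiry.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over the credential database.
pub trait TokenStore {
    /// Returns the id of the user owning `token`, if the token is known.
    fn lookup(&self, token: &str) -> Option<u64>;
}

/// In-memory stand-in for the database, useful for tests.
#[derive(Default)]
pub struct InMemoryStore {
    tokens: HashMap<String, u64>,
}

impl InMemoryStore {
    pub fn insert(&mut self, token: &str, user_id: u64) {
        self.tokens.insert(token.to_string(), user_id);
    }
}

impl TokenStore for InMemoryStore {
    fn lookup(&self, token: &str) -> Option<u64> {
        self.tokens.get(token).copied()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum AuthError {
    MissingHeader,
    MalformedHeader,
    UnknownToken,
}

/// Validates an `Authorization` header value (assumed to be
/// `Bearer <token>`) against the store and returns the user id.
pub fn authenticate(header: Option<&str>, store: &dyn TokenStore) -> Result<u64, AuthError> {
    let value = header.ok_or(AuthError::MissingHeader)?;
    let token = value
        .strip_prefix("Bearer ")
        .ok_or(AuthError::MalformedHeader)?
        .trim();
    if token.is_empty() {
        return Err(AuthError::MalformedHeader);
    }
    store.lookup(token).ok_or(AuthError::UnknownToken)
}
```

Keeping the store behind a trait lets the endpoint code be tested without a running database.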

Implement perceptual hashing into scanning API

We require perceptual hashing and perceptual hash detection for our scanning API to be reliable and reasonably fast.
This means each image should be hashed with a perceptual hash, which is then used to search the keys in our database for the closest match and return it.
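The idea can be sketched with an average hash (aHash), one common perceptual hash. The issue does not say which algorithm will be used, so this choice is an assumption, as are the function names, the pre-scaled 8x8 grayscale input, and the linear search standing in for a real database lookup.

```rust
/// Computes an average hash (aHash) from an 8x8 grayscale thumbnail.
/// Bit `i` is set when pixel `i` is brighter than the mean brightness.
pub fn average_hash(pixels: &[u8; 64]) -> u64 {
    let sum: u32 = pixels.iter().map(|&p| u32::from(p)).sum();
    let mean = sum / 64;
    pixels.iter().enumerate().fold(0u64, |hash, (i, &p)| {
        if u32::from(p) > mean {
            hash | (1u64 << i)
        } else {
            hash
        }
    })
}

/// Number of differing bits between two hashes.
pub fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Linear scan for the stored hash closest to `query`.
/// Returns the key and its distance, or `None` when nothing is
/// within `max_distance` bits.
pub fn closest_match<'a>(
    query: u64,
    stored: &'a [(String, u64)],
    max_distance: u32,
) -> Option<(&'a str, u32)> {
    stored
        .iter()
        .map(|(key, hash)| (key.as_str(), hamming_distance(query, *hash)))
        .filter(|&(_, distance)| distance <= max_distance)
        .min_by_key(|&(_, distance)| distance)
}
```

Similar images yield hashes that differ in only a few bits, which is why the lookup compares Hamming distance instead of testing for equality. A linear scan will not scale to a large database; an index built for Hamming-distance search would be needed there.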

Scan API becomes unavailable after JWT Token Refresh

Like the title says. The behavior is a bit strange, and it likely has to do with the refresh routine.

Rework Hashing to Utilize Blake

Currently we are using MD5, which is vulnerable to collision attacks and should be avoided. Please switch the hashing algorithm to Blake.

Create dedicated content detection endpoint

We currently check for media types in our content detection.
While this is nice, a dedicated endpoint would make much more sense than scanning the data directly after detecting its content. Adding a separate endpoint would also allow users to decide if they want to continue scanning with our API or if they were just interested in the content type of the media they provided.
Adding this endpoint would also allow us to remove some of the current return data (such as whether an image is a PNG, or whether the media is an image at all), since this would be redundant.
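The core of such an endpoint could be a check of the file's leading "magic" bytes, sketched below. The `MediaType` enum and `detect_media_type` function are hypothetical names, only three formats are covered, and the project's actual detection library may work differently.

```rust
/// Media types recognised by this sketch (deliberately incomplete).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaType {
    Png,
    Jpeg,
    Gif,
    Unknown,
}

impl MediaType {
    /// MIME type string the endpoint could return to the caller.
    pub fn mime(self) -> &'static str {
        match self {
            MediaType::Png => "image/png",
            MediaType::Jpeg => "image/jpeg",
            MediaType::Gif => "image/gif",
            MediaType::Unknown => "application/octet-stream",
        }
    }
}

/// Detects the media type from the leading signature bytes of `data`.
pub fn detect_media_type(data: &[u8]) -> MediaType {
    const PNG: &[u8] = &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG: &[u8] = &[0xFF, 0xD8, 0xFF];

    if data.starts_with(PNG) {
        MediaType::Png
    } else if data.starts_with(JPEG) {
        MediaType::Jpeg
    } else if data.starts_with(b"GIF87a") || data.starts_with(b"GIF89a") {
        MediaType::Gif
    } else {
        MediaType::Unknown
    }
}
```

Because only the first few bytes are inspected, a caller who just wants the content type never pays for a full scan.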

Implement a CDN

For our production environment we need a CDN. We are probably going to use Google's CDN system (Cloud CDN), since implementing our own would take too much time.
See here: https://cloud.google.com/cdn

Implement Media Detection

We require media detection that works reliably:

Changeable data specifications would be nice to have, but they are not strictly required.

Implement worker collection

We need collections for a server/worker system so that we can distribute compute easily and let customers run predictions on their own image data if our servers are too slow for them. This requires a worker collection that can be queried for specific work (e.g. something submitted by a customer) and that works like a ring list, so we can take the latest item out of it and start prediction on it.
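A minimal in-memory sketch of such a collection is shown below, using the standard library's `VecDeque`, which is backed by a ring buffer. The `Job` fields and method names are hypothetical, and handing out the oldest job first is an assumption; a real deployment would need shared, persistent storage instead of a local queue.

```rust
use std::collections::VecDeque;

/// A unit of work waiting for a worker (hypothetical fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Job {
    pub id: u64,
    pub customer: String,
}

/// Queue of pending jobs backed by a ring buffer.
#[derive(Default)]
pub struct WorkQueue {
    jobs: VecDeque<Job>,
}

impl WorkQueue {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds a job at the back of the queue.
    pub fn push(&mut self, job: Job) {
        self.jobs.push_back(job);
    }

    /// Takes the oldest pending job, regardless of customer.
    pub fn take_next(&mut self) -> Option<Job> {
        self.jobs.pop_front()
    }

    /// Takes the oldest pending job submitted by `customer`, so a
    /// customer's own worker can process only that customer's data.
    pub fn take_for_customer(&mut self, customer: &str) -> Option<Job> {
        let index = self.jobs.iter().position(|job| job.customer == customer)?;
        self.jobs.remove(index)
    }

    pub fn len(&self) -> usize {
        self.jobs.len()
    }

    pub fn is_empty(&self) -> bool {
        self.jobs.is_empty()
    }
}
```

Taking a job removes it from the queue, so the caller becomes responsible for reporting a result or putting the job back.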

Check if images are actually scanned after "taking" a job

We require a system that checks whether a client that has taken a job actually returns a result at some point. Currently we just "assume" that once a client takes a job, the image is scanned and a result is posted to our API at some point.
I'm considering using a dead letter queue or something similar: if an image has been in the dead letter queue for a certain time, it is simply reposted to the processing queue. The timeout could be around 1 minute, since we should handle scans within a minute of taking them.

This is a more complex issue and we require help in solving this. I feel like there has to be a better approach to handling clients failing to scan an image / potentially reporting why they couldn't to the API.
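The timeout idea from this issue could be sketched as a small in-flight tracker. This is an illustration under assumptions, not a proposed final design: `InFlightJobs` and its methods are hypothetical, state is kept in memory only, and a caller is assumed to poll `take_expired` periodically and repost the returned ids to the processing queue.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks jobs that a client has taken but not yet completed.
pub struct InFlightJobs {
    taken_at: HashMap<u64, Instant>,
    timeout: Duration,
}

impl InFlightJobs {
    pub fn new(timeout: Duration) -> Self {
        Self {
            taken_at: HashMap::new(),
            timeout,
        }
    }

    /// Records that a client took `job_id` at time `now`.
    pub fn mark_taken(&mut self, job_id: u64, now: Instant) {
        self.taken_at.insert(job_id, now);
    }

    /// Records that a result arrived. Returns false when the job was
    /// unknown or had already timed out.
    pub fn mark_done(&mut self, job_id: u64) -> bool {
        self.taken_at.remove(&job_id).is_some()
    }

    /// Removes and returns, in ascending id order, every job that has
    /// been in flight for at least the timeout.
    pub fn take_expired(&mut self, now: Instant) -> Vec<u64> {
        let timeout = self.timeout;
        let mut expired: Vec<u64> = self
            .taken_at
            .iter()
            .filter(|(_, taken)| now.saturating_duration_since(**taken) >= timeout)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.taken_at.remove(id);
        }
        expired
    }
}
```

Passing `now` in explicitly keeps the tracker testable without sleeping. When `mark_done` returns false, a late result has arrived for a job that may already have been reposted, so the API would need to decide whether to accept or discard it.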

Migrate to other Real Time Communication Platform

We want to move away from Amazon's SQS system because it locks our users into using AWS.
The best solution I could find so far is RabbitMQ. We do not want to use Kafka because its clients heavily favor Java, which is a no-go for us.
If anyone has recommendations for other systems besides RabbitMQ, we are very much willing to try out multiple solutions to find the best one for our needs.