doxing-on-twitter's Introduction

Prevention and Anonymization of Dox Content on Twitter

This work was done as part of the course IST-597 Fundamentals of Privacy at Penn State.

This repository contains the Python code, the trained model, and demos of the tool I developed.

Demo 1 - Classification of a Tweet to identify doxing
Demo 2 - Nudging the user to not post the Tweet that contains sensitive content
Demo 3 - Anonymizing the Tweet content if the user ignores the nudge.

Summary:

  1. Trained a Machine Learning model for automatic detection of doxed data on Twitter.
  2. Developed a nudge-based technology to alert the doxer that the tweet contains private information.
  3. Created a prototype that adds noise to tweets containing doxed information in real time by applying a hashing algorithm.

Part I: Detection of doxed content

Using Twitter's streaming API, we collected 2000 tweets. We supplied the following keywords to the search API: IP address, SSN, Social Security Number, and SSA. We then manually annotated the dataset for doxing. We excluded tweets with invalid IP addresses (e.g., 8.780.255.255) and tweets with local or well-known public IP addresses (e.g., 127.0.0.1, 192.168.x.x, 8.8.8.8). We modeled the problem as a binary classification task with two classes: doxed tweets and non-doxed (benign) tweets. We used Hugging Face's implementation of DistilBERT for sequence classification for this task.
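The IP address exclusion rule described above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the project's actual filtering code: the function name `is_candidate_ip` and the `WELL_KNOWN_PUBLIC` set are assumptions introduced here, and the set contains only the one public address the text mentions.

```python
import ipaddress

# Hypothetical allow-list of well-known public addresses to ignore.
# The README only names 8.8.8.8 as an example.
WELL_KNOWN_PUBLIC = {"8.8.8.8"}


def is_candidate_ip(text: str) -> bool:
    """Return True if `text` is a valid IPv4 address worth annotating.

    Invalid, loopback, private, and well-known public addresses are rejected.
    """
    try:
        addr = ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        # Not a valid IPv4 address, e.g. "8.780.255.255" (octet above 255).
        return False
    if addr.is_loopback or addr.is_private:
        # e.g. 127.0.0.1 or 192.168.x.x
        return False
    return str(addr) not in WELL_KNOWN_PUBLIC
```

A tweet whose only IP-like strings fail this check would be dropped before annotation under these assumptions.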

Table 1: Results of the binary classification task using DistilBERT-based embeddings.

Part II: Real-time Nudging

We present a prototype of the nudging mechanism in the figures below. Figure 1 shows the initial prompt from the system, which emulates the Twitter platform and contains the text box for composing the tweet. If a tweet is classified as non-doxing, the sequence terminates. If our machine learning model identifies it as a doxed tweet, however, the author is nudged. Figure 2 displays the nudge message and the text box where the author enters their choice on whether to proceed. Figure 3 shows the system's response if the author answers N.
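The decision flow described above can be sketched as a single function. This is an illustration under stated assumptions rather than the prototype's code: `handle_draft`, `NUDGE_MESSAGE`, and the three callables are hypothetical names. In the project, `is_doxing` would wrap the trained DistilBERT classifier and `anonymize` would be the step from Part III. The sketch also assumes that a benign tweet is posted unchanged and that any answer other than Y discards the draft.

```python
from typing import Callable, Optional

# Hypothetical wording; the actual nudge text appears only in Figure 2.
NUDGE_MESSAGE = (
    "This tweet appears to contain private information. "
    "Do you still want to post it? (Y/N)"
)


def handle_draft(
    tweet: str,
    is_doxing: Callable[[str], bool],
    ask_user: Callable[[str], str],
    anonymize: Callable[[str], str],
) -> Optional[str]:
    """Return the text to post, or None if the draft is discarded."""
    if not is_doxing(tweet):
        # Benign tweet: the sequence terminates (assumed: posted unchanged).
        return tweet
    # Doxed tweet: nudge the author and read their choice.
    choice = ask_user(NUDGE_MESSAGE).strip().upper()
    if choice != "Y":
        # Author answered N (or anything else): discard the draft.
        return None
    # Author ignored the nudge: anonymize before posting.
    return anonymize(tweet)
```

In an interactive prototype, `ask_user` could simply be Python's built-in `input`.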

Figure 1: The assumed landing page of Twitter.

Figure 2: The nudge shared with the user to reconsider the content of the tweet.

Figure 3: A response from the system if the user discards the draft.

Part III: Data Anonymization

If the author decides to post the tweet anyway, we anonymize it. We leverage regular expressions and spaCy's Named Entity Recognition (NER) API. We wrote two regexes, one to extract IP addresses and one to extract URLs. We observed that in some cases the sensitive information is not written explicitly in the tweet but is accessible through a URL, so we add noise to URLs as well. We also observed that, along with IP addresses and SSNs, tweets contain other types of personal information (e.g., full names, location coordinates). Although our model is not trained to detect such PII, we attempt to anonymize it by using a pre-trained NER model provided by spaCy to extract entities, and we mask every entity the NER model identifies.
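A rough sketch of this anonymization step follows, using only the standard library. The two patterns, the choice of SHA-256 with a truncated digest, and the names `add_noise` and `mask_spans` are assumptions made for illustration; the README does not show the project's actual regexes or hashing algorithm. The NER step is represented by character spans, such as the `start_char` and `end_char` offsets of spaCy entities, so that the example runs without a spaCy model installed.

```python
import hashlib
import re

# Illustrative patterns only; the project's real regexes are not shown.
URL_RE = re.compile(r"https?://\S+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _hash_token(match: re.Match) -> str:
    """Replace the matched text with a short hex digest of itself."""
    digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
    return digest[:12]


def add_noise(text: str) -> str:
    """Hash URLs first, then any remaining IPv4-looking strings."""
    text = URL_RE.sub(_hash_token, text)
    return IPV4_RE.sub(_hash_token, text)


def mask_spans(text: str, spans, mask: str = "[MASKED]") -> str:
    """Mask non-overlapping (start, end) character spans, e.g. NER entities.

    Spans are processed right to left so earlier offsets stay valid.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text
```

Because entity offsets refer to the original text, `mask_spans` would be applied before `add_noise` in a combined pipeline. Note that an unsalted hash of an IPv4 address can be reversed by enumerating the address space, so a keyed hash or a fixed mask may be the safer choice in practice.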

Figure 4: Anonymized response from the system if the user ignores the nudge.

Note:

  1. Due to privacy concerns, I cannot share the dataset, which contains users' SSNs and IP addresses.
  2. I acknowledge that the inter-annotator reliability of our annotation process is difficult to establish, and in future work, every tweet should receive at least three annotations.

Acknowledgement: I thank Younes Karimi, a PhD candidate in the Information Systems and Technology Department of Penn State, for his input.
