
meGPT - upload an author's content into an LLM

I have 20 years of public content I've produced and presented over my career, and I'd like to have an LLM that is trained to answer questions and generate summaries of my opinions, in my "voice". At this point, this LLM doesn't exist, but to encourage development I have organized my public content and references to sources in this repo. For now, if you want to know my opinions on things, you have to watch all my videos and listen to my podcasts yourself!

My own content is stored or linked to in authors/virtual_adrianco and is indexed by the published_content.csv table in that directory.

If another author wants to use this repo as a starting point, clone it and add your own directory of content under authors. If you want to contribute it freely for other people to use as a training data set, then send a pull request and I'll include it here.

Creative Commons - attribution share-alike. Permission is explicitly granted for anyone to use this as a training set to develop the meGPT concept. It is free for use by any author/speaker/expert, resulting in a chatbot that can answer questions as if it were the author, with reference to published content. I have called my own build of this virtual_adrianco - with opinions on cloud computing, sustainability, performance tools, microservices, speeding up innovation, Wardley mapping, open source, chaos engineering, resilience, Sun Microsystems, Netflix, AWS etc. I'm happy to share any models that are developed. I don't need to monetize this; I'm semi-retired and have managed to monetize this content well enough already, and I don't work for a big corporation any more.

I am not a Python programmer

All the code in this repo has been written by the free version of ChatGPT 4 based on short prompts, with no subsequent edits, in a few minutes of my time here and there. I can read Python and mostly make sense of it, but I'm not an experienced Python programmer. Look in the relevant issue for a public link to the chat thread that generated the code. This is a ridiculously low-friction and easy way to write code.

Building an Author

To use this repo, clone it to a local disk, set up the Python environment, and run the build.py script for an author. It will walk through the published content table for that author, processing each line in turn. The build script creates a downloads/ directory and a state.json file in it that records successful processing steps, so incremental runs of build.py will not re-run the same downloads. Each kind of data needs a corresponding script in the processors directory.

git clone https://github.com/adrianco/megpt.git
cd megpt
python -m venv venv

Windows:

venv\Scripts\activate

macOS/Linux:

source venv/bin/activate
pip install -r requirements.txt

Run the build script

Usage: build.py <author>
python build.py virtual_adrianco

For test purposes, process a single kind of data from an arbitrary URL; output goes to downloads/ without updating the state.json file.

Usage: python process.py <author> <Kind> <SubKind> <URL>

Current functional status

build.py and process.py appear to be operating correctly, and book_processor.py correctly downloads PDFs of books. Any raw file download can clone this processor. Each website download is going to need customized extraction; the correct div name for The New Stack (thenewstack.io) has been added as a SubKind, and correct text content download is working for stories.
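
For context, here is a minimal sketch of what a raw-file download processor along the lines of book_processor.py might look like. The process() signature and the downloads/ directory layout are assumptions for illustration, not the repo's actual interface.

# Hypothetical sketch of a raw-file download processor, modeled on the
# description above. The process() signature and downloads/ layout are
# assumptions, not the actual code in the processors directory.
import os
import requests

def process(author: str, kind: str, subkind: str, url: str) -> str:
    """Download a raw file (e.g. a PDF) into downloads/<author>/<kind>/."""
    target_dir = os.path.join("downloads", author, kind)
    os.makedirs(target_dir, exist_ok=True)

    filename = os.path.basename(url.split("?")[0]) or "download.bin"
    target_path = os.path.join(target_dir, filename)

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(target_path, "wb") as f:
        f.write(response.content)

    return target_path  # the build script would record this step in state.json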

Notes

I have been assembling my content for a while, and will update the references table now and again: https://github.com/adrianco/meGPT/blob/main/authors/virtual_adrianco/published_content.csv

YouTube videos have transcripts with index offsets into the video itself, but the transcript quality isn't good, and they can only be read via the API by the owner of the video. It's easier to download videos with pytube and process them with Whisper to generate more curated transcripts that identify when the author is talking if there is more than one speaker.

Twitter archive - the raw archive files were over 100MB and too big for GitHub. The extract_conversations script was used to pull out only the tweets that were part of a conversation, so they can be further analyzed to find questions and answers. The code to do this was written by ChatGPT and worked the first time, but if there are any problems with the output I'm happy to share the raw tweets. File an issue.

Mastodon archive - available as an RSS feed. Medium blog platform - available as an RSS feed. Need a processor to import an RSS feed. It would also be good for this to be incremental, so that the training material can be updated efficiently as new blog posts and toots appear.

Issues have been created to track development of ingestion processing code.


meGPT's Issues

Setup python and instructions for running code

Proposed method below seems like the simple option (after asking ChatGPT for advice)

Create a requirements.txt File

The requirements.txt file lists all the dependencies needed for your project. This file is used by pip to install the required packages. Here's an example:

# requirements.txt

numpy==1.21.2
pandas==1.3.3
matplotlib==3.4.3
requests==2.26.0
# Add more dependencies as needed

To generate this file automatically, you can use pip freeze if your environment already has the dependencies installed:

pip freeze > requirements.txt

Using a virtual environment is a best practice for Python projects, as it isolates your project’s dependencies from the global Python installation. Here’s how to create and use a virtual environment:

Create a Virtual Environment:

python -m venv venv
This command creates a virtual environment named venv in your project directory.
Activate the Virtual Environment:
Windows:
venv\Scripts\activate

macOS/Linux:
source venv/bin/activate

Once activated, your terminal prompt will change, indicating that you’re now using the virtual environment.
Install Dependencies:
With the virtual environment active, install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Process podcast for ingestion

Podcasts are usually conversations, so voice recognition is needed to identify the author and to extract question and answer pairs from the transcript. This is similar to video ingestion.
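
As a starting point, a hedged sketch of the transcription step: it transcribes an episode with Whisper and pairs segments that end in a question mark with the following segment. Real speaker identification would need a diarization step on top of this; the "base" model size and the question/answer heuristic are illustrative assumptions.

# Sketch only: transcribe a podcast episode with Whisper and pull out naive
# question/answer pairs. A segment ending in "?" is treated as a question and
# the next segment as the answer, which is only a rough heuristic; speaker
# diarization is not handled here.
import whisper

def extract_qa_pairs(audio_path: str, model_name: str = "base"):
    model = whisper.load_model(model_name)   # model size is an assumption
    result = model.transcribe(audio_path)
    segments = result["segments"]

    pairs = []
    for i, seg in enumerate(segments[:-1]):
        text = seg["text"].strip()
        if text.endswith("?"):
            pairs.append({
                "question": text,
                "answer": segments[i + 1]["text"].strip(),
                "start": seg["start"],        # offset into the audio, in seconds
            })
    return pairs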

Process Twitter archive for ingestion

Since we are looking for answers to questions, the starting point is to filter the tweet archive to only include conversations, tweets that are linked to other tweets. In my case this reduced the size of the file to about 20% of the raw feed. Code to do this was developed by asking ChatGPT to generate it, and it seemed to work. Full chat sequence follows... The final code for the script has been saved to the repo along with the output conversations file.

Prompt: I'd like a script that processes a twitter archive directory structure to remove private content like DMs and leave the public content intact

ChatGPT: Sure, I can help you with that. Here’s a Python script that processes a Twitter archive directory to remove private content like Direct Messages (DMs) while keeping public content intact. This script assumes that the Twitter archive is in the standard directory structure as provided by Twitter's data export tool.

Requirements:

  1. Python installed on your system.
  2. os and json modules (these are standard Python libraries).

Script:

import os
import json

def remove_private_content(archive_dir):
    # Define the paths for the private content files to be removed
    dm_paths = [
        os.path.join(archive_dir, 'data', 'direct-messages.js'),
        os.path.join(archive_dir, 'data', 'direct-messages-group.js')
    ]

    # Remove the Direct Messages files
    for dm_path in dm_paths:
        if os.path.exists(dm_path):
            os.remove(dm_path)
            print(f"Removed {dm_path}")
        else:
            print(f"{dm_path} does not exist")

    # Handle other potential private data files if necessary
    other_private_files = [
        os.path.join(archive_dir, 'data', 'account.js'),  # Contains account-related information
        os.path.join(archive_dir, 'data', 'contact.js'),  # Contains contacts data
        os.path.join(archive_dir, 'data', 'mobile.js')    # Contains mobile data
    ]

    for file_path in other_private_files:
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                data = f.read()
            
            if 'window.YTD.account' in data:
                data = data.replace('window.YTD.account', 'window.YTD.account_public')
            elif 'window.YTD.contact' in data:
                data = data.replace('window.YTD.contact', 'window.YTD.contact_public')
            elif 'window.YTD.mobile' in data:
                data = data.replace('window.YTD.mobile', 'window.YTD.mobile_public')

            with open(file_path, 'w') as f:
                f.write(data)
                print(f"Processed {file_path} for public data only")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        remove_private_content(archive_directory)
    else:
        print("Invalid directory path")

Explanation:

  1. Removing DM Files: The script first attempts to remove the Direct Messages files (direct-messages.js and direct-messages-group.js). These are typically where private conversations are stored.
  2. Processing Other Files: The script looks for other files that may contain private data (like account.js, contact.js, and mobile.js) and modifies their content to indicate they now only contain public data by renaming the JavaScript variable inside the files. This is a simplistic approach and assumes that renaming the variable is enough to "publicize" the content. Depending on the actual content, further processing might be needed.

How to Run:

  1. Save the script as clean_twitter_archive.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python clean_twitter_archive.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will help you remove private content from your Twitter archive while keeping your public data intact. Adjust the script as needed based on the actual structure and contents of your Twitter archive.

Prompt: remove deleted tweets as well

ChatGPT: To extend the script to remove deleted tweets as well, we'll need to identify where the deleted tweets are stored and ensure they're removed from the archive. Typically, deleted tweets might still appear in the tweets.js file but be marked in a way that indicates they were deleted.

Enhanced Script:

import os
import json

def remove_private_content_and_deleted_tweets(archive_dir):
    # Define the paths for the private content files to be removed
    dm_paths = [
        os.path.join(archive_dir, 'data', 'direct-messages.js'),
        os.path.join(archive_dir, 'data', 'direct-messages-group.js')
    ]

    # Remove the Direct Messages files
    for dm_path in dm_paths:
        if os.path.exists(dm_path):
            os.remove(dm_path)
            print(f"Removed {dm_path}")
        else:
            print(f"{dm_path} does not exist")

    # Handle other potential private data files if necessary
    other_private_files = [
        os.path.join(archive_dir, 'data', 'account.js'),  # Contains account-related information
        os.path.join(archive_dir, 'data', 'contact.js'),  # Contains contacts data
        os.path.join(archive_dir, 'data', 'mobile.js')    # Contains mobile data
    ]

    for file_path in other_private_files:
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                data = f.read()
            
            if 'window.YTD.account' in data:
                data = data.replace('window.YTD.account', 'window.YTD.account_public')
            elif 'window.YTD.contact' in data:
                data = data.replace('window.YTD.contact', 'window.YTD.contact_public')
            elif 'window.YTD.mobile' in data:
                data = data.replace('window.YTD.mobile', 'window.YTD.mobile_public')

            with open(file_path, 'w') as f:
                f.write(data)
                print(f"Processed {file_path} for public data only")

    # Process tweets.js to remove deleted tweets
    tweets_path = os.path.join(archive_dir, 'data', 'tweets.js')
    if os.path.exists(tweets_path):
        with open(tweets_path, 'r') as f:
            content = f.read()

        # Strip off the JS variable assignment to parse JSON
        json_content = content[content.index('['): content.rindex(']') + 1]
        tweets = json.loads(json_content)

        # Filter out deleted tweets
        public_tweets = [tweet for tweet in tweets if not tweet.get('tweet', {}).get('is_deleted')]

        # Convert back to JSON and re-add the JS variable assignment
        new_content = 'window.YTD.tweets.part0 = ' + json.dumps(public_tweets, indent=2) + ';'

        with open(tweets_path, 'w') as f:
            f.write(new_content)
            print(f"Processed {tweets_path} to remove deleted tweets")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        remove_private_content_and_deleted_tweets(archive_directory)
    else:
        print("Invalid directory path")

Explanation:

  1. Removing DM Files: This part remains unchanged and removes files containing direct messages.
  2. Processing Other Files: This part remains unchanged and handles other private data files by renaming their variables.
  3. Removing Deleted Tweets:
    • The script opens the tweets.js file and reads its content.
    • It extracts the JSON part from the file content, which is enclosed within square brackets.
    • The script then filters out tweets marked as deleted (is_deleted flag).
    • It writes the filtered tweets back to tweets.js while preserving the JavaScript variable assignment at the beginning.

How to Run:

  1. Save the script as clean_twitter_archive.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python clean_twitter_archive.py.
  5. Enter the path to your Twitter archive directory when prompted.

This enhanced script will help you remove both private content and deleted tweets from your Twitter archive, ensuring only your public and active tweets remain. Adjust the script further based on the exact structure of your Twitter archive if necessary.

Prompt: I'd like to process the tweets.js file from a twitter archive to extract all the conversations, and ignore standalone tweets and retweets.

ChatGPT: To process the tweets.js file from a Twitter archive and extract all the conversations while ignoring standalone tweets and retweets, you can follow these steps:

  1. Parse the tweets.js file to extract the tweet data.
  2. Identify tweets that are part of conversations. These tweets typically have a in_reply_to_status_id field.
  3. Ignore standalone tweets and retweets.
  4. Save the conversation tweets to a new file or print them out.

Here’s a Python script to achieve this:

import os
import json

def extract_conversations(tweets_path, output_path):
    if os.path.exists(tweets_path):
        with open(tweets_path, 'r') as f:
            content = f.read()

        # Strip off the JS variable assignment to parse JSON
        json_content = content[content.index('['): content.rindex(']') + 1]
        tweets = json.loads(json_content)

        # Extract tweets that are part of conversations
        conversations = [tweet for tweet in tweets if tweet.get('tweet', {}).get('in_reply_to_status_id')]

        # Save conversations to a new file
        with open(output_path, 'w') as f:
            json.dump(conversations, f, indent=2)
            print(f"Extracted {len(conversations)} conversation tweets to {output_path}")
    else:
        print(f"{tweets_path} does not exist")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    tweets_file = os.path.join(archive_directory, 'data', 'tweets.js')
    output_file = os.path.join(archive_directory, 'data', 'conversations.json')
    
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        extract_conversations(tweets_file, output_file)
    else:
        print("Invalid directory path")

Explanation:

  1. Reading the Tweets File: The script opens tweets.js and reads its content.
  2. Parsing the JSON Data: It extracts the JSON array from the JavaScript file by removing the surrounding JavaScript code.
  3. Filtering Conversations: The script filters tweets to find those that are part of conversations by checking the in_reply_to_status_id field.
  4. Saving the Results: The filtered conversation tweets are saved to a new file named conversations.json.

How to Run:

  1. Save the script as extract_conversations.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python extract_conversations.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will extract all the conversation tweets from your Twitter archive and save them to a new file, ignoring standalone tweets and retweets. Adjust the script as necessary based on the exact structure of your Twitter archive.

Prompt: My Twitter archive also has a tweets-part1.js file. How is this related to the tweets.js file?

ChatGPT: In a Twitter archive, the presence of files like tweets.js, tweets-part1.js, tweets-part2.js, etc., indicates that your tweet data has been split into multiple parts due to the large number of tweets. Twitter splits these files to make it easier to manage and process the data without hitting file size limits.

Here's a brief explanation of these files:

  • tweets.js: Contains the first portion of your tweet data. This file is usually named tweets-part0.js in the context of multiple parts but is sometimes simply called tweets.js if there's only one part or if it is the first file.

  • tweets-part1.js, tweets-part2.js, etc.: These are additional parts containing subsequent tweets. They follow the same structure as tweets.js but contain different tweet data.

Understanding File Structure

Each file typically has the following structure:

window.YTD.tweets.part0 = [ ... ];
window.YTD.tweets.part1 = [ ... ];
window.YTD.tweets.part2 = [ ... ];

The variable window.YTD.tweets.partX indicates which part of the tweet data is being represented. The JSON array following the assignment contains tweet objects.

Updated Script to Handle Multiple Files

Here’s an updated Python script to process all tweet parts (tweets.js, tweets-part1.js, tweets-part2.js, etc.) and extract only conversation tweets:

import os
import json
import glob

def extract_conversations(archive_dir, output_file):
    tweet_files = glob.glob(os.path.join(archive_dir, 'data', 'tweets*.js'))
    conversations = []

    for tweet_file in tweet_files:
        if os.path.exists(tweet_file):
            with open(tweet_file, 'r') as f:
                content = f.read()
            
            # Find the JSON array within the JavaScript file
            json_content = content[content.index('['): content.rindex(']') + 1]
            tweets = json.loads(json_content)

            # Extract tweets that are part of conversations
            conversation_tweets = [
                tweet for tweet in tweets 
                if tweet.get('tweet', {}).get('in_reply_to_status_id') or tweet.get('tweet', {}).get('in_reply_to_user_id')
            ]
            conversations.extend(conversation_tweets)
            print(f"Processed {tweet_file}: Found {len(conversation_tweets)} conversation tweets")
    
    # Save conversations to a new file
    with open(output_file, 'w') as f:
        json.dump(conversations, f, indent=2)
        print(f"Extracted a total of {len(conversations)} conversation tweets to {output_file}")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    output_filename = "conversations.json"
    
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        output_path = os.path.join(archive_directory, 'data', output_filename)
        extract_conversations(archive_directory, output_path)
    else:
        print("Invalid directory path")

Explanation

  1. Using glob: The script uses the glob module to find all files matching the pattern tweets*.js in the archive's data directory. This ensures it processes all tweet parts, including tweets.js, tweets-part1.js, tweets-part2.js, etc.

  2. Parsing JSON Content: The script reads each file's content, strips out the JavaScript variable assignment, and extracts the JSON array.

  3. Identifying Conversations: It filters tweets that are part of conversations. This includes:

    • Tweets with a non-null in_reply_to_status_id (indicating they are replies).
    • Tweets with a non-null in_reply_to_user_id (indicating they are replies to another user's tweet).
  4. Storing Conversations: The filtered conversation tweets are appended to a list, which is then saved to conversations.json.

How to Run

  1. Save the script as extract_conversations.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python extract_conversations.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will effectively extract all conversation tweets from multiple parts of your Twitter archive, ensuring you have a consolidated list of conversations across all your tweet data files.

Process blog for ingestion

Blogging platforms like Medium.com support RSS feeds, so a generic RSS feed extractor should be OK. Since blogs get added to over time, the RSS feed reader should remember its current point and process incrementally when re-run for updates. A sketch of this approach follows.
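
A minimal sketch of an incremental RSS processor using the feedparser library. The rss_state.json file name and the per-post JSON output layout are assumptions; a real processor would likely record its progress in the shared state.json instead.

# Sketch of an incremental RSS feed processor using feedparser. File names
# and output layout are illustrative assumptions.
import json
import os
import feedparser

def process_rss(feed_url: str, output_dir: str, state_path: str = "rss_state.json"):
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))

    os.makedirs(output_dir, exist_ok=True)
    feed = feedparser.parse(feed_url)

    for entry in feed.entries:
        entry_id = entry.get("id", entry.get("link"))
        if entry_id in seen:
            continue  # already downloaded on a previous run
        post = {
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
        }
        safe_name = "".join(c if c.isalnum() else "_" for c in post["title"])[:80]
        with open(os.path.join(output_dir, f"{safe_name}.json"), "w") as f:
            json.dump(post, f, indent=2)
        seen.add(entry_id)

    # Persist the set of processed entries so the next run is incremental
    with open(state_path, "w") as f:
        json.dump(sorted(seen), f, indent=2)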

Process books for ingestion

I've included the PDF source of one of the books I wrote. It had a co-author, so it might be good to tag which chapters were written by each author. This would also take care of forewords, which have a different author.
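
One possible shape for this, sketched with the pypdf library: extract text per chapter and attach an author tag to each. The chapter page ranges and the author map are illustrative assumptions that would be entered by hand for each book.

# Sketch of per-chapter text extraction from a book PDF using pypdf.
# chapter_pages and chapter_authors are assumptions supplied by hand.
from pypdf import PdfReader

def extract_chapters(pdf_path, chapter_pages, chapter_authors):
    """chapter_pages maps a chapter name to (first_page, last_page), 0-indexed."""
    reader = PdfReader(pdf_path)
    chapters = []
    for name, (first, last) in chapter_pages.items():
        text_parts = []
        for i in range(first, last + 1):
            text_parts.append(reader.pages[i].extract_text() or "")
        chapters.append({
            "chapter": name,
            "author": chapter_authors.get(name, "unknown"),  # tag foreword/co-author chapters
            "text": "\n".join(text_parts),
        })
    return chapters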

Use published_content.csv to incrementally download and process content

Each author has an authors/&lt;author&gt;/published_content.csv file that indicates the type of content and the URLs where it can be obtained. Need a script that walks through this table, creates a subdirectory in downloads/ if it doesn't exist, and incrementally pulls down and processes the raw content. Maintain a state.json file that records what has been downloaded and processed, and when. A sketch of this walker follows.
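
A minimal sketch of that walker, under stated assumptions: the CSV column names (Kind, SubKind, URL) and the convention of dispatching to processors/&lt;kind&gt;_processor.py are guesses about how build.py might be structured, not its actual code.

# Hypothetical sketch of the published_content.csv walker described above.
import csv
import importlib
import json
import os
from datetime import datetime, timezone

def build(author: str):
    csv_path = os.path.join("authors", author, "published_content.csv")
    downloads = os.path.join("downloads", author)
    os.makedirs(downloads, exist_ok=True)

    state_path = os.path.join(downloads, "state.json")
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)

    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):   # column names are an assumption
            key = row["URL"]
            if key in state:
                continue  # already processed on an earlier run
            # e.g. processors/book_processor.py for Kind == "book" (assumed convention)
            module = importlib.import_module(f"processors.{row['Kind'].lower()}_processor")
            module.process(author, row["Kind"], row.get("SubKind", ""), row["URL"])
            state[key] = datetime.now(timezone.utc).isoformat()
            with open(state_path, "w") as out:
                json.dump(state, out, indent=2)  # record progress after each item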

Process YouTube playlist for ingestion

YouTube has transcripts, but they aren't very good, and it's not possible to download them from YouTube's API unless you uploaded the video yourself. ChatGPT was used to build some code to do this, but the pytube library comes with a command line tool that downloads a whole playlist to a directory. Then Whisper can be used to generate a transcript. Ideally, the author's voice would be recognized and labeled in the transcript, for cases where the video is an interview or there are multiple speakers.
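
A hedged sketch of that playlist-to-transcript pipeline using pytube and Whisper. Speaker labeling is not handled here; the Whisper model size and the output file layout are assumptions.

# Sketch only: download the audio of every video in a playlist with pytube
# and write a timestamped Whisper transcript alongside each file.
import os
from pytube import Playlist
import whisper

def transcribe_playlist(playlist_url: str, output_dir: str = "downloads/videos"):
    os.makedirs(output_dir, exist_ok=True)
    model = whisper.load_model("base")   # model size is an assumption

    for video in Playlist(playlist_url).videos:
        audio = video.streams.filter(only_audio=True).first()
        audio_path = audio.download(output_path=output_dir)

        result = model.transcribe(audio_path)
        transcript_path = os.path.splitext(audio_path)[0] + ".txt"
        with open(transcript_path, "w") as f:
            for seg in result["segments"]:
                # keep the index offset into the video alongside the text
                f.write(f"[{seg['start']:.1f}s] {seg['text'].strip()}\n")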
