commit 88d7e9fcf3572e9ff931811fa069af23652a62a9 · ewancroft.uk/bluesky-markov

+1

.python-version

···
+3.12.0

+65 -95

README.md

···
 
 This repository contains a set of scripts to create and post Markov chain-generated content to a Bluesky account using data retrieved from another Bluesky account.
 
-## Files
-
-### `bsky_api.py`
-
-Provides functions to log in to a Bluesky account and resolve DID (Decentralized Identifier).
-
-- **Functions:**
-  - `login(handle_env_var, app_pass_env_var)`: Logs in to a Bluesky account using environment variables for handle and app password.
-  - `DID_resolve(handle)`: Resolves and retrieves the DID document for a given handle.
-
-### `clean.py`
-
-Contains a function to clean HTML content from posts.
-
-- **Functions:**
-  - `clean_content(content)`: Removes HTML tags, HTML entities, usernames, special characters, and colon-enclosed words from content.
-
-### `markov_gen.py`
-
-Includes functions to retrieve posts, generate Markov chain text, and refresh the Markov dataset.
-
-- **Functions:**
-  - `retrieve_posts(client, client_did)`: Retrieves posts from a Bluesky account.
-  - `generate(markov, char_limit)`: Generates text using the Markov model within a character limit.
-  - `refresh_dataset(markov, source_posts)`: Refreshes the Markov dataset with new posts.
-  - `get_account_posts(client, client_did)`: Gets and cleans posts from a Bluesky account.
-
-### `time_utils.py`
-
-Provides utility functions for time calculations and sleep management.
-
-- **Functions:**
-  - `calculate_refresh_interval()`: Calculates a random refresh interval between 30 minutes to 3 hours.
-  - `calculate_next_refresh(current_time, refresh_interval)`: Calculates the next refresh time.
-  - `format_time_remaining(time_remaining)`: Formats the time remaining until the next refresh.
-  - `sleep_until_next_refresh(next_refresh)`: Sleeps until the next refresh time.
+## Table of Contents
 
-### `main.py`
+- [Features](#features)
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Logging](#logging)
+- [File Structure](#file-structure)
+- [Contributing](#contributing)
+- [License](#license)
 
-The main script to run the bot. It sets up environment variables, logs in to accounts, retrieves and processes posts, and generates and posts new content.
+## Features
 
-- **Workflow:**
-  1. Load environment variables.
-  2. Log in to source and destination Bluesky accounts.
-  3. Retrieve and clean posts from the source account.
-  4. Generate new content using the Markov model.
-  5. Post generated content to the destination account.
-  6. Repeat the process at calculated intervals.
+- Fetches posts from a specified source account.
+- Cleans and processes the retrieved content to ensure quality.
+- Utilises Markov chain text generation to create new posts.
+- Automatically posts generated content to a designated destination account.
+- Logs all significant events and errors for debugging purposes.
 
-### `requirements.txt`
+## Requirements
 
-Lists the Python dependencies required to run the scripts.
+To run this project, you will need the following:
 
-- **Dependencies:**
-  - `atproto`
+- Python 3.x
+- Required libraries (install via `pip`):
   - `dotenv`
   - `markovchain`
-
-### `example.env.txt`
-
-Example environment variables file. Copy this to `.env` and fill in your credentials.
-
-- **Environment Variables:**
-  - `SOURCE_HANDLE`: The handle of the source Bluesky account.
-  - `SRC_APP_PASS`: The app password for the source Bluesky account.
-  - `DST_APP_PASS`: The app password for the destination Bluesky account.
-  - `DESTINATION_HANDLE`: The handle of the destination Bluesky account.
-  - `CHAR_LIMIT`: The character limit for generated posts.
+  - `atproto`
 
-## Setup and Usage
+You can install the required libraries using:
+```bash
+pip install python-dotenv markovchain atproto
+```
 
-1. **Clone the repository:**
+## Installation
 
-   ```sh
+1. Clone the repository:
+   ```bash
    git clone https://github.com/ewanc26/bluesky-markov.git
    cd bluesky-markov
    ```
 
-2. **Install the dependencies:**
-
-   ```sh
-   pip3 install -r requirements.txt
+2. Create a `.env` file in the root directory of the project and add your environment variables:
+   ```plaintext
+   SOURCE_HANDLE=your_source_handle
+   DESTINATION_HANDLE=your_destination_handle
+   CHAR_LIMIT=280
+   SRC_APP_PASS=your_source_app_password
+   DST_APP_PASS=your_destination_app_password
    ```
 
-3. **Set up environment variables:**
-
-   - Copy `example.env.txt` to `.env`:
+## Usage
 
-     ```sh
-     cp example.env.txt .env
-     ```
+1. Navigate to the project directory.
+2. Run the main script:
+   ```bash
+   python src/main.py
+   ```
 
-   - Edit `.env` and fill in your Bluesky handles and app passwords.
+The application will log into the source and destination accounts, retrieve posts, generate new content based on the retrieved posts, and post the generated content to the destination account at random intervals.
 
-4. **Run the bot:**
+## Logging
 
-   ```sh
-   python3 -u 'src/main.py'
-   ```
+All logs are stored in the `log` directory. The logs are written to `general.log`, where you can find details about the application's execution, including:
 
-## Notes
+- Successful logins
+- Retrieved posts
+- Generated content
+- Errors and exceptions
 
-- Ensure that you have valid Bluesky handles and app passwords set in the `.env` file.
-- The bot continuously runs and generates new posts at intervals between 30 minutes and 3 hours.
-- Press `Ctrl+C` to stop the bot.
+## File Structure
 
-## Project Structure
+The project has the following structure:
 
-```plaintext
-bluesky-markov/
+```
+project-root/
+│
+├── log/
+│   └── general.log  # Log file for application events
 │
 ├── src/
-│   ├── .env  # Environment configuration file
-│   ├── bsky_api.py  # Module for Bluesky API interactions
-│   ├── clean.py  # Module for cleaning content
-│   ├── markov_gen.py  # Module for Markov chain text generation
-│   ├── main.py  # Main script for the bot
-│   ├── time_utils.py  # Module for time-related utilities
-│   ├── example.env.txt  # Example environment configuration file
+│   ├── clean.py  # Contains functions to clean retrieved content
+│   ├── markov_gen.py  # Handles Markov chain text generation
+│   ├── time_utils.py  # Utilities for time management and scheduling
+│   ├── bsky_api.py  # Functions for interacting with the Bluesky API
+│   └── main.py  # Main application logic
 │
-├── requirements.txt  # Python dependencies
-├── LICENSE  # Licensing file
-└── README.md  # Project README file
+├── .env  # Environment variables for authentication
+└── README.md  # Project documentation
 ```
+
+## Contributing
+
+Contributions are welcome! If you have suggestions for improvements or find bugs, feel free to open an issue or submit a pull request.
 
 ## License
 
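The new README lists five variables for `.env`, and `main.py` falls back to 280 when `CHAR_LIMIT` is unset. A minimal sketch of checking them in one place at startup follows. The helper name `load_settings` and the idea of passing in a mapping such as `os.environ` are assumptions for illustration and are not part of this commit.

```python
def load_settings(env):
    """Read the bot's variables from a mapping such as os.environ.

    Hypothetical helper: the commit reads these values inline instead.
    """
    required = ("SOURCE_HANDLE", "DESTINATION_HANDLE", "SRC_APP_PASS", "DST_APP_PASS")
    # Treat unset and empty values the same way
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in required}
    # CHAR_LIMIT is optional and defaults to 280, mirroring main.py
    settings["CHAR_LIMIT"] = int(env.get("CHAR_LIMIT", 280))
    return settings
```

Calling `load_settings(os.environ)` after `load_dotenv()` would report every missing name in a single error.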

+1 -1

requirements.txt

···
 atproto
-dotenv
+python-dotenv
 markovchain

+25 -3

src/bsky_api.py

···
 from atproto import IdResolver, Client
 import os
 import dotenv
+import logging
+
+# Ensure the log directory exists
+log_directory = 'log'
+if not os.path.exists(log_directory):
+    os.makedirs(log_directory)
+
+# Set up logging to a file in the log directory
+logging.basicConfig(
+    filename=os.path.join(log_directory, 'general.log'),
+    level=logging.DEBUG,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
 
 def login(handle_env_var, app_pass_env_var):
     try:
···
         app_pass = os.getenv(app_pass_env_var)
 
         if not handle or not app_pass:
+            logging.error("Handle or app password missing in environment variables.")
             raise ValueError("Handle or app password missing in environment variables")
 
+        logging.debug("Attempting to log in with handle: %s", handle)
         client = Client()
         client.login(handle, app_pass) # Access credentials securely
 
+        logging.info("Login successful for handle: %s", handle)
         return client
 
     except Exception as e:
-        print(e)
+        logging.exception("An error occurred during login: %s", e)
         quit(1)
 
 def DID_resolve(handle):
     try:
+        logging.debug("Resolving DID for handle: %s", handle)
         resolver = IdResolver()
         did = resolver.handle.resolve(handle)
+        logging.debug("Resolved DID: %s", did)
+
         did_doc = resolver.did.resolve(did)
+        logging.debug("Resolved DID Document: %s", did_doc)
 
         package = {"did": did, "did_doc": did_doc}
+        logging.info("Successfully resolved DID and DID Document.")
 
         return package
+
     except Exception as e:
-        print(e)
-        return None
+        logging.exception("An error occurred while resolving DID: %s", e)
+        return None
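This commit adds the same directory check and `logging.basicConfig` call to every module. `basicConfig` does nothing once the root logger already has handlers, so only the first module imported actually configures logging. A minimal sketch of a single shared helper follows. The name `configure_logging` and the idea of moving the setup into one place are assumptions for illustration, not something the commit does.

```python
import logging
import os


def configure_logging(log_directory="log", filename="general.log"):
    """Create the log directory if needed and configure the root logger.

    Hypothetical helper: the commit repeats this setup inline in each module.
    """
    # exist_ok=True replaces the separate os.path.exists() check
    os.makedirs(log_directory, exist_ok=True)
    log_path = os.path.join(log_directory, filename)
    # No effect if the root logger already has handlers
    logging.basicConfig(
        filename=log_path,
        level=logging.DEBUG,
        format="%(asctime)s - %(levelname)s - %(message)s",
    )
    return log_path
```

Each module could then call `logging.getLogger(__name__)` and leave configuration to the entry point.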

+30 -3

src/clean.py

···
 from html import unescape
 import re
+import logging
+import os
+
+# Ensure the log directory exists
+log_directory = 'log'
+if not os.path.exists(log_directory):
+    os.makedirs(log_directory)
+
+# Set up logging to a file in the log directory
+logging.basicConfig(
+    filename=os.path.join(log_directory, 'general.log'),
+    level=logging.DEBUG,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
 
 def clean_content(content):
+    logging.debug("Original content: %s", content) # Log the original content
+
     cleaned_content = re.sub('<[^<]+?>', '', content) # Remove HTML tags
+    logging.debug("After removing HTML tags: %s", cleaned_content)
+
     cleaned_content = unescape(cleaned_content) # Decode HTML entities
-    cleaned_content = re.sub(r'@\w+', '', cleaned_content) # Remove usernames
+    logging.debug("After decoding HTML entities: %s", cleaned_content)
+
+    # Updated regex to remove usernames based on domain patterns
+    domain_regex = r'@\w+\.([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
+    cleaned_content = re.sub(domain_regex, '', cleaned_content) # Remove usernames based on domain
+    logging.debug("After removing usernames: %s", cleaned_content)
+
     cleaned_content = re.sub(r'[^\w\s.,!?;:]', '', cleaned_content) # Remove special characters
+    logging.debug("After removing special characters: %s", cleaned_content)
+
     cleaned_content = re.sub(r':\w+:', '', cleaned_content) # Remove words enclosed with colons
-
-    return cleaned_content
+    logging.debug("After removing words enclosed with colons: %s", cleaned_content)
+
+    return cleaned_content
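The new `domain_regex` requires a first label, at least one further label followed by a dot, and a final label. A quick check suggests it matches three-part handles such as `@alice.bsky.social` but leaves two-part handles such as `@ewancroft.uk` in place. The sketch below isolates that one substitution; the wrapper name `strip_handles` and the sample strings are illustrative assumptions.

```python
import re

# Pattern copied verbatim from the commit's clean_content()
DOMAIN_REGEX = r'@\w+\.([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'


def strip_handles(text):
    """Apply only the handle-removal step (hypothetical wrapper)."""
    return re.sub(DOMAIN_REGEX, '', text)


# Three labels: the whole handle is removed, leaving a double space
three_part = strip_handles("thanks @alice.bsky.social for this")

# Two labels: the pattern finds no match and the text is unchanged
two_part = strip_handles("thanks @ewancroft.uk for this")
```

In the full function, the later special-character pass still strips the `@` from an unmatched handle, so the remainder of a two-part handle would stay in the training text.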

+46 -12

src/main.py

···
 import os
 from datetime import datetime
-import dotenv
+import logging
+from dotenv import load_dotenv
 from bsky_api import login, DID_resolve
 from markov_gen import generate, refresh_dataset, get_account_posts
 from time_utils import calculate_refresh_interval, calculate_next_refresh, sleep_until_next_refresh
 from markovchain.text import MarkovText
 
-dotenv.load_dotenv()
+# Ensure the log directory exists
+log_directory = 'log'
+if not os.path.exists(log_directory):
+    os.makedirs(log_directory)
+
+# Set up logging to a file in the log directory
+logging.basicConfig(
+    filename=os.path.join(log_directory, 'general.log'),
+    level=logging.DEBUG,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+
+logging.info("NEW EXECUTION OF APPLICATION\n\n\n")
+
+# Load environment variables
+load_dotenv()
 
 source_handle = os.getenv("SOURCE_HANDLE")
 destination_handle = os.getenv("DESTINATION_HANDLE")
 char_limit = int(os.getenv("CHAR_LIMIT", 280))
 
-source_client = login("SOURCE_HANDLE", "SRC_APP_PASS")
-source_did_package = DID_resolve(source_handle)
-source_did = source_did_package['did']
+# Log environment variable loading
+logging.debug("Loaded environment variables: SOURCE_HANDLE=%s, DESTINATION_HANDLE=%s, CHAR_LIMIT=%d",
+              source_handle, destination_handle, char_limit)
 
-destination_client = login("DESTINATION_HANDLE", "DST_APP_PASS")
+# Login to source client and resolve DID
+try:
+    source_client = login("SOURCE_HANDLE", "SRC_APP_PASS")
+    logging.info("Successfully logged in to source account.")
+
+    source_did_package = DID_resolve(source_handle)
+    source_did = source_did_package['did']
+    logging.info("Resolved source DID: %s", source_did)
+
+    destination_client = login("DESTINATION_HANDLE", "DST_APP_PASS")
+    logging.info("Successfully logged in to destination account.")
+
+except Exception as e:
+    logging.exception("An error occurred during setup: %s", e)
+    quit(1)
 
 markov = MarkovText()
 
···
     global markov, char_limit, destination_client
 
     generated_text = ' '.join(generate(markov, char_limit))
-
-    print("Generated Text for Post:", generated_text)
+    logging.debug("Generated text for post: %s", generated_text)
 
     try:
         response = destination_client.send_post(
···
             langs=['en']
         )
         post_link = response['uri']
-        print(f"Posted to destination Bluesky account successfully: {post_link}")
+        logging.info("Posted to destination Bluesky account successfully: %s", post_link)
     except Exception as e:
-        print(f"Error posting to destination Bluesky account: {e}")
+        logging.error("Error posting to destination Bluesky account: %s", e)
 
 try:
     while True:
         current_time = datetime.now()
         refresh_interval = calculate_refresh_interval()
         next_refresh = calculate_next_refresh(current_time, refresh_interval)
+
+        logging.debug("Current time: %s, Refresh interval: %s, Next refresh: %s", current_time, refresh_interval, next_refresh)
 
         source_posts = get_account_posts(source_client, source_did)
+        logging.debug("Fetched source posts for DID: %s", source_did)
+
         markov = refresh_dataset(markov, source_posts)
+        logging.info("Markov dataset refreshed.")
+
         generate_and_post_example()
-
         sleep_until_next_refresh(next_refresh)
 
 except KeyboardInterrupt:
-    print("\nExiting...")
+    logging.info("Exiting on user interrupt.")

+93 -32

src/markov_gen.py

···
+import logging
+import os
 from markovchain.text import MarkovText
-from clean import clean_content
+from clean import clean_content # Assuming clean_content is in clean.py
+
+# Ensure the log directory exists
+log_directory = 'log'
+if not os.path.exists(log_directory):
+    os.makedirs(log_directory)
+
+# Set up logging to a file in the log directory
+logging.basicConfig(
+    filename=os.path.join(log_directory, 'general.log'),
+    level=logging.DEBUG,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
 
 def retrieve_posts(client, client_did):
     post_list = []
     has_more = True
-    cursor = None # Initialize cursor for pagination
+    cursor = None # Initialise cursor for pagination
+
+    logging.info(f"Starting to retrieve posts for client ID: {client_did}")
 
     while has_more:
         try:
+            # Use cursor for pagination
             if cursor:
-                posts = client.app.bsky.feed.post.list(client_did, limit=100, cursor=cursor)
+                data = client.app.bsky.feed.post.list(client_did, limit=100, cursor=cursor)
             else:
-                posts = client.app.bsky.feed.post.list(client_did, limit=100)
-        except Exception as e:
-            print(f"Error fetching posts: {e}")
-            break
+                data = client.app.bsky.feed.post.list(client_did, limit=100)
+
+            logging.debug(f"Fetched data with cursor: {cursor}")
+
+            # Log the structure of the returned data
+            logging.debug("Data response structure: %s", data)
 
-        if not posts.records.items():
-            print("No more posts found.")
-            break
+            # Check for the correct attribute containing the posts
+            if not hasattr(data, 'records') or not data.records:
+                logging.info("No more posts found or 'records' attribute is missing.")
+                break
 
-        for post in posts.records.items():
-            post_list.append(post)
+            # Fetch posts from the 'records' attribute
+            posts = data.records
 
-        # Check if the response has a cursor for pagination
-        cursor = getattr(posts, 'next_cursor', None)
-        has_more = bool(cursor) # Continue if there is a next_cursor
+            # Add the posts to the post_list
+            post_list.extend(posts.values()) # Access values directly from the dictionary
 
-    return post_list
+            logging.info(f"Retrieved {len(posts)} posts.")
 
-def generate(markov, char_limit):
-    generated_text = markov()
+            # Get the next cursor for pagination
+            cursor = data.cursor # Use the 'cursor' returned from the data for the next page
+            has_more = bool(cursor) # Continue if there is a next_cursor
 
-    if len(generated_text) > char_limit:
-        generated_text = generated_text[:char_limit]
+        except Exception as e:
+            logging.error(f"Error fetching posts: {e}")
+            break
 
-    print("Generated Text:", generated_text)
+    logging.info(f"Completed retrieval of posts for client ID: {client_did}. Total posts retrieved: {len(post_list)}")
 
-    words = generated_text.split()
+    # Log the structure of the first few posts for debugging
+    for post in post_list[:5]: # Show only the first 5 posts for brevity
+        logging.debug("Post structure: %s", post)
 
-    return words
+    return post_list
 
 def refresh_dataset(markov, source_posts):
-    markov = MarkovText()
+    if not markov:
+        markov = MarkovText() # Initialise if not passed as an argument
 
     if source_posts:
-        print(f"Fetched {len(source_posts)} original posts and replies from the source account.")
+        logging.info(f"Fetched {len(source_posts)} original posts and replies from the source account.")
 
         for post in source_posts:
-            markov.data(post, part=True)
+            if isinstance(post, str):
+                markov.data(post, part=True) # Directly add the string if it's a simple string
+            elif hasattr(post, 'text'):
+                markov.data(post.text, part=True) # Add if it's an object with text
+            elif hasattr(post, 'value') and hasattr(post.value, 'text'):
+                markov.data(post.value.text, part=True) # Access text correctly
+            else:
+                logging.warning("Post does not have 'text' attribute, skipping: %s", post)
 
-        markov.data('', part=False)
+        markov.data('', part=False) # Ensuring a proper closure of data input
+
+        # Log the current dataset state for debugging
+        logging.debug("Current length of posts: %s\n\n\n", len(source_posts))
+        logging.debug("Current Markov dataset: %s", markov.storage)
 
     return markov
 
+def generate(markov, char_limit):
+    try:
+        generated_text = markov()
+    except KeyError as e:
+        logging.error("KeyError occurred during generation: %s", e)
+        return []
+
+    if len(generated_text) > char_limit:
+        generated_text = generated_text[:char_limit]
+
+    logging.debug("Generated Text: %s", generated_text)
+
+    words = generated_text.split()
+    return words
+
 def get_account_posts(client, client_did):
     posts = retrieve_posts(client, client_did)
-
+
     # Debugging: Print structure of the first post
     if posts:
-        print("First post structure:", posts[0])
-
-    # Adjusted list comprehension to handle tuples and access text field from Record instance
-    return [clean_content(post[1].text) for post in posts if hasattr(post[1], 'text')]
+        logging.debug("First post structure: %s", posts[0])
+
+    # Ensure we are accessing the text correctly
+    cleaned_posts = []
+    for post in posts:
+        if hasattr(post, 'value') and hasattr(post.value, 'text'):
+            cleaned_text = clean_content(post.value.text)
+            cleaned_posts.append(cleaned_text)
+        elif hasattr(post, 'text'): # Fall back to checking for a direct 'text' attribute
+            cleaned_text = clean_content(post.text)
+            cleaned_posts.append(cleaned_text)
+        else:
+            logging.warning("Post does not have 'value' or 'text', skipping: %s", post)
+
+    return cleaned_posts
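The rewritten `retrieve_posts` follows a cursor loop: fetch a page, stop if it has no records, collect the values of the `records` dictionary, then continue only while a cursor is returned. A minimal sketch of that loop, separated from the Bluesky client, follows. The `fetch_page` callable and the stand-in pages are assumptions made so the example runs without network access.

```python
from types import SimpleNamespace


def collect_records(fetch_page):
    """Drain a cursor-paginated listing.

    fetch_page is a hypothetical callable that takes a cursor (or None)
    and returns an object with `.records` (a dict) and `.cursor`.
    """
    collected = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        if not page.records:
            break  # empty page: nothing more to read
        collected.extend(page.records.values())
        cursor = page.cursor
        if not cursor:
            break  # no cursor returned: this was the last page
    return collected


# Stand-in for the API: two pages, the second with no further cursor
_pages = {
    None: SimpleNamespace(records={"a": "first", "b": "second"}, cursor="p2"),
    "p2": SimpleNamespace(records={"c": "third"}, cursor=None),
}

all_records = collect_records(lambda cursor: _pages[cursor])
```

Passing the cursor unconditionally, even when it is `None`, removes the need for the two near-identical `list(...)` calls, provided the underlying client accepts `cursor=None`.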

+26 -3

src/time_utils.py

···
 import random
 from datetime import datetime, timedelta
 import time
+import logging
+import os
+
+# Ensure the log directory exists
+log_directory = 'log'
+if not os.path.exists(log_directory):
+    os.makedirs(log_directory)
+
+# Set up logging to a file in the log directory
+logging.basicConfig(
+    filename=os.path.join(log_directory, 'general.log'),
+    level=logging.DEBUG,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
 
 def calculate_refresh_interval():
     # Calculate a random refresh interval between 30 minutes to 3 hours
-    return random.randint(1800, 10800)
+    refresh_interval = random.randint(1800, 10800)
+    logging.debug("Calculated refresh interval: %d seconds", refresh_interval)
+    return refresh_interval
 
 def calculate_next_refresh(current_time, refresh_interval):
     # Calculate the next refresh time based on the current time and refresh interval
-    return current_time + timedelta(seconds=refresh_interval)
+    next_refresh = current_time + timedelta(seconds=refresh_interval)
+    logging.debug("Calculated next refresh time: %s", next_refresh)
+    return next_refresh
 
 def format_time_remaining(time_remaining):
     hours = int(time_remaining.total_seconds()) // 3600
     minutes = int((time_remaining.total_seconds() % 3600) // 60)
     seconds = int(time_remaining.total_seconds() % 60)
-    return f"{hours} hour{'s' if hours != 1 else ''}, {minutes} minute{'s' if minutes != 1 else ''}, {seconds} second{'s' if seconds != 1 else ''}"
+    formatted_time = f"{hours} hour{'s' if hours != 1 else ''}, {minutes} minute{'s' if minutes != 1 else ''}, {seconds} second{'s' if seconds != 1 else ''}"
+    logging.debug("Formatted time remaining: %s", formatted_time)
+    return formatted_time
 
 def sleep_until_next_refresh(next_refresh):
     current_time = datetime.now()
···
 
     if time_remaining.total_seconds() > 0:
         print(f"Time until next post refresh: {format_time_remaining(time_remaining)}")
+        logging.info("Sleeping for %d seconds until next refresh.", time_remaining.total_seconds())
         time.sleep(time_remaining.total_seconds())
+    else:
+        logging.warning("Next refresh time is in the past. No sleep required.")
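`calculate_refresh_interval` draws from 1800 to 10800 seconds, so the messages produced by `format_time_remaining` range from 30 minutes to 3 hours. The sketch below reproduces the formatting arithmetic on its own, without the logging call, to show both ends of that range. It truncates to whole seconds first, which is assumed to give the same result as the commit's version for non-negative durations.

```python
from datetime import timedelta


def format_time_remaining(time_remaining):
    # Equivalent arithmetic to the commit's function, minus logging
    total = int(time_remaining.total_seconds())
    hours = total // 3600
    minutes = (total % 3600) // 60
    seconds = total % 60
    return (
        f"{hours} hour{'s' if hours != 1 else ''}, "
        f"{minutes} minute{'s' if minutes != 1 else ''}, "
        f"{seconds} second{'s' if seconds != 1 else ''}"
    )


shortest = format_time_remaining(timedelta(seconds=1800))
# -> "0 hours, 30 minutes, 0 seconds"
longest = format_time_remaining(timedelta(seconds=10800))
# -> "3 hours, 0 minutes, 0 seconds"
```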