ChatGPT is going to take my jerb!


There’s a lot of talk about Large Language Model AIs like ChatGPT these days. People say the days of computer programming as a profession are numbered. I’ve seen stories of ChatGPT doing a pretty amazing job of writing and debugging code.

I wanted to try it for myself. Is ChatGPT really going to take my jerb?

They Took Our Jobs

One of the hardest parts of programming is deciding exactly what you want the program to do. You have to think about all kinds of weird corner cases. 20% of the effort goes into what the program does when things go right; 80% goes into what it does, or should do, when things go wrong. How computer systems fail is at least as important as how they work.

My feeling is that crafting a prompt to an AI so that it can write your program for you is itself an act of computer programming, and always will be. It’s just working at a higher level of abstraction. It will always be a skill that must be learned through practice.

That said, let’s give it a try. I will pretend I’m not a computer programmer, and have no interest in ever becoming a computer programmer. I will get ChatGPT to write a program for me.

The Mission

There’s a progressive-rock radio program on the University of Guelph radio station that I quite enjoy: The Sentinel’s Marvelous Kaleidoscope. But I can’t usually listen at its scheduled time, and the signal is poor in Waterloo. I would prefer it as a podcast, that I could download onto my phone and listen to while walking the dog.

The CFRU website hosts an archive of the show. Every week’s new episode appears there. I thought it would be cool if I could scrape the site and maintain a podcast RSS feed on my PC. Then subscribe to that RSS file in my media player, and sync to my phone automatically.

So, that’s the general gist of the problem to be solved. For reference, here’s a snippet of the HTML from the website we’re trying to scrape; the important bits are the <a class="play_archive"> links and the archive-title <div>s:

    <div id="main-content">
        <div class="container">
            <div id="content-area" class="clearfix">
                <div id="left-area" class="archive-filter">
                    <h1>Program Archives</h1>

                    <div class="grid-search">
                                                    <div class="archiveList-post" data-dayitem="true">
                                <a href="https://archive.cfru.ca/archive/2023/04/24/The Sentinel’s Marvellous Kaleidoscope - April 24, 2023 at 15:00 - CFRU 93.3.mp3" class="playnow play_archive button">play</a>
                                <div class="single_line">
                                    <div class="byline inline">3:00 pm</div>
                                    <div class="dash inline">
                                        –
                                    </div>
                                    <div class="show inline archive-title">The Sentinel’s Marvellous Kaleidoscope – April 24, 2023 at 15:00</div>
                                </div>

                            </div>

                                                        <div class="archiveList-post" data-dayitem="true">
                                <a href="https://archive.cfru.ca/archive/2023/04/17/The Sentinel’s Marvellous Kaleidoscope - April 17, 2023 at 15:00 - CFRU 93.3.mp3" class="playnow play_archive button">play</a>
                                <div class="single_line">
                                    <div class="byline inline">3:00 pm</div>
                                    <div class="dash inline">
                                        –
                                    </div>
                                    <div class="show inline archive-title">The Sentinel’s Marvellous Kaleidoscope – April 17, 2023 at 15:00</div>
                                </div>

                            </div>

This stuff is buried in 1700 lines of HTML boilerplate, but everything needed for this task is in those few lines.
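
To make the structure concrete, here’s a minimal BeautifulSoup sketch of how those pieces could be pulled out. The class names come straight from the snippet above; the filename archive.html is a hypothetical local copy of the page:

from bs4 import BeautifulSoup

# 'archive.html' is a hypothetical saved copy of the archive page.
with open('archive.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Each episode lives in a <div class="archiveList-post">: the <a> inside it
# carries the MP3 URL, and the "archive-title" <div> carries the track name.
for post in soup.find_all('div', class_='archiveList-post'):
    url = post.find('a', class_='play_archive')['href']
    title = post.find('div', class_='archive-title').text.strip()
    print(url, title)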

And here’s what the typical RSS feed XML file will look like, with one item already in it:

<?xml version='1.0' encoding='us-ascii'?>
<rss version="2.0">
    <channel>
        <title>The Sentinel's Marvelous Kaleidoscope</title>
        <description>Through this lens, you will be guided on audible explorations, gazing across the ever-changing horizon of music. The fantastical soundscape is filled with landmarks, relics and anomalies in progressive, fusion and alternative styles, near and far. Each observational session is constellatory, pulverizing, mystifying… and simply marvellous.</description>
        <link>C:\Users\Ron Harding\Documents\git\KaleidoscopePodcaster\SentinelsMarvelousKaleidoscope.rss</link>
        <item>
            <title>April 24, 2023</title>
            <link>https://archive.cfru.ca/archive/2023/04/24/The%20Sentinel%E2%80%99s%20Marvellous%20Kaleidoscope%20-%20April%2024%2C%202023%20at%2015%3A00%20-%20CFRU%2093.3.mp3</link>
            <guid isPermaLink="true">https://archive.cfru.ca/archive/2023/04/24/The%20Sentinel%E2%80%99s%20Marvellous%20Kaleidoscope%20-%20April%2024%2C%202023%20at%2015%3A00%20-%20CFRU%2093.3.mp3</guid>
            <pubDate>Mon, 24 Apr 2023 15:00:00 -0500</pubDate>
        </item>
    </channel>
</rss>
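
Reading a feed like that back in is easy with Python’s standard library. A quick sketch, assuming the file above is saved as SentinelsMarvelousKaleidoscope.rss:

import xml.etree.ElementTree as ET

tree = ET.parse('SentinelsMarvelousKaleidoscope.rss')
channel = tree.find('channel')

# Collect the <link> of every <item> already in the feed; a weekly update
# only needs to add episodes whose links aren't in this set yet.
existing_episodes = {item.find('link').text for item in channel.findall('item')}
print(existing_episodes)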

Attempt #1

Here’s my prompt to ChatGPT, written as if I was somebody who didn’t know anything about programming. Just a simple statement of what I want the program to do:

This web page contains a list of audio tracks for a weekly music program: http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope Every week, a new episode is added at the start of the list. Write a Python script to read the web-site and update a podcast RSS file in XML format. I will run the script weekly. Each time the script is run, it should read the web page, and add any new episodes to the RSS file.

And ChatGPT’s response:

Something went wrong. If this issue persists please contact us through our help center at help.openai.com.

Off to a really good start, ChatGPT. But apparently that’s a common problem… refresh the page and try again. The real response:

Sure! Here’s a Python script that reads the website and updates a podcast RSS file in XML format:

import requests
from bs4 import BeautifulSoup
import datetime
import xml.etree.ElementTree as ET

# URL of the website containing the audio tracks
url = 'http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope'

# Path of the RSS file
rss_file_path = '/path/to/podcast.rss'

# Read the RSS file
tree = ET.parse(rss_file_path)
root = tree.getroot()

# Get the latest pubDate from the existing RSS file
last_pub_date = root.findall('.//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}date')[-1].text

# Fetch the website and parse it using BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the audio tracks on the website
audio_tracks = soup.find_all('div', class_='su-column-inner su-clearfix')

# Loop through the audio tracks and add any new ones to the RSS file
for audio_track in audio_tracks:
    track_date_str = audio_track.find('span', class_='su-note').text.strip()
    track_date = datetime.datetime.strptime(track_date_str, '%B %d, %Y')
    track_title = audio_track.find('a').text.strip()
    track_url = audio_track.find('a')['href']
    if track_date > datetime.datetime.strptime(last_pub_date, '%Y-%m-%dT%H:%M:%SZ'):
        item = ET.SubElement(root, 'item')
        title = ET.SubElement(item, 'title')
        title.text = track_title
        link = ET.SubElement(item, 'link')
        link.text = track_url
        pub_date = ET.SubElement(item, 'pubDate')
        pub_date.text = track_date.strftime('%a, %d %b %Y %H:%M:%S GMT')
        description = ET.SubElement(item, 'description')
        description.text = track_title
        enclosure = ET.SubElement(item, 'enclosure')
        enclosure.set('url', track_url)
        enclosure.set('length', '0')
        enclosure.set('type', 'audio/mpeg')

# Write the updated RSS file
tree.write(rss_file_path)

To use this script, replace /path/to/podcast.rss with the path to your RSS file, and run it once a week using a scheduler like cron. The script will read the website and add any new episodes to the RSS file.

Well, f*#&## me, that actually looks plausible! At a quick glance, anyway. Way better than I expected. The Python code looks syntactically correct. The comments suggest it’s trying to do the right things. It seems to have some idea of which XML elements are supposed to be in an RSS feed file.

But does it work?

Traceback (most recent call last):
  File "chatgpt.py", line 17, in 
    last_pub_date = root.findall('.//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}date')[-1].text
IndexError: list index out of range

Nope.

If I wasn’t a computer programmer, well-versed in Python, and with some understanding of HTML and XML, I’d be dead in the water at this point. All I could do would be to tell ChatGPT about the error, and see if it can fix it. But I know a bit more about why that line didn’t work: The .//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}date element it’s searching for doesn’t exist in a typical RSS file. It should be <lastPubDate>, without any namespace specifier. So I can help ChatGPT along. But I don’t want to help it too much… I’m supposed to be a novice at this, after all.
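
For the curious: in ElementTree, that {...} prefix is XML namespace syntax. A plain RSS 2.0 feed doesn’t use namespaces at all, so that search can never match anything; elements have to be looked up by their bare tag names. A quick illustration, using the feed file from earlier:

import xml.etree.ElementTree as ET

root = ET.parse('SentinelsMarvelousKaleidoscope.rss').getroot()

# ChatGPT's search only matches <date> elements in the RDF namespace, which
# a plain RSS 2.0 file doesn't contain: this prints an empty list.
print(root.findall('.//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}date'))

# Bare tag names are what actually match in an un-namespaced feed.
print(root.findall('.//pubDate'))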

Debugging…

Prompt: That doesn’t work. Line 17 gets a “list index out of range” error. The <lastPubDate> element of an RSS file doesn’t look anything like that.

The response (with the deltas highlighted by me):

I apologize for the mistake in my previous response. Here is an updated Python script that should work:

import requests
from bs4 import BeautifulSoup
import datetime
import xml.etree.ElementTree as ET

# URL of the website containing the audio tracks
url = 'http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope'

# Path of the RSS file
rss_file_path = '/path/to/podcast.rss'

# Read the RSS file
tree = ET.parse(rss_file_path)
root = tree.getroot()

# Get the latest pubDate from the existing RSS file
last_pub_date = root.findall('.//{http://purl.org/dc/elements/1.1/}date')[-1].text

# Fetch the website and parse it using BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the audio tracks on the website
audio_tracks = soup.find_all('div', class_='su-column-inner su-clearfix')

# Loop through the audio tracks and add any new ones to the RSS file
for audio_track in audio_tracks:
    track_date_str = audio_track.find('span', class_='su-note').text.strip()
    track_date = datetime.datetime.strptime(track_date_str, '%B %d, %Y')
    track_title = audio_track.find('a').text.strip()
    track_url = audio_track.find('a')['href']
    if track_date > datetime.datetime.strptime(last_pub_date, '%a, %d %b %Y %H:%M:%S %Z'):
        item = ET.SubElement(root, 'item')
        title = ET.SubElement(item, 'title')
        title.text = track_title
        link = ET.SubElement(item, 'link')
        link.text = track_url
        pub_date = ET.SubElement(item, 'pubDate')
        pub_date.text = track_date.strftime('%a, %d %b %Y %H:%M:%S GMT')
        description = ET.SubElement(item, 'description')
        description.text = track_title
        enclosure = ET.SubElement(item, 'enclosure')
        enclosure.set('url', track_url)
        enclosure.set('length', '0')
        enclosure.set('type', 'audio/mpeg')

# Write the updated RSS file
tree.write(rss_file_path)

This script assumes that the pubDate element in your RSS file is in the format of %a, %d %b %Y %H:%M:%S %Z (e.g. Tue, 20 Apr 2021 18:46:00 GMT). If the format is different, you will need to modify line 17 accordingly.

Again, I apologize for any confusion my previous response may have caused.

It was at this point that I realized the code wasn’t actually looking for the <lastPubDate> element at all. Which is good, because that’s actually an optional element, and an RSS feed is not required to have it. I think it was trying to find the <pubDate> subelement in the last <item> of the feed… which is probably why it was indexing with [-1] (the last element). That makes a lot of sense.

But it also highlights one of those weird corner cases I was talking about: what if the podcast RSS file doesn’t have any <item>s in it yet, because the script is being run for the first time? The script will fail in that case.
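
Guarding against that first-run case is easy enough; a sketch of a defensive version of that lookup:

# On a brand-new feed there are no <item>s yet, so findall() returns an
# empty list and [-1] raises IndexError. Treat "no items" as "all new".
pub_dates = root.findall('.//pubDate')
last_pub_date = pub_dates[-1].text if pub_dates else None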

But this is a moot point, because the script is still using a really bizarre search string in the findall(), which will not find anything at all.

And, sure enough:

Traceback (most recent call last):
  File "chatgpt.py", line 17, in 
    last_pub_date = root.findall('.//{http://purl.org/dc/elements/1.1/}date')[-1].text
IndexError: list index out of range

So, this time I edited that line of code to something I thought would work, and it did: it correctly found the last <pubDate>. The script then just exited without doing anything. Some debugging on my part found that audio_tracks = soup.find_all('div', class_='su-column-inner su-clearfix') was finding no audio tracks. Again, the search criteria it’s using for the find_all() call make no sense. All these “su-?????” classes it’s looking for seem to be related to a WordPress plugin called Shortcodes Ultimate, which is not used by the site we’re trying to parse (it’s not a WordPress site at all).
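
As an aside: when a find_all() comes back empty like that, one quick sanity check is to tally the class names the page actually contains (soup here being the BeautifulSoup object from the script):

from collections import Counter

# Count every CSS class on every tag in the fetched page. If 'su-column-inner'
# doesn't show up in this tally, no selector for it can ever match.
class_counts = Counter(
    cls for tag in soup.find_all(True) for cls in tag.get('class', [])
)
print(class_counts.most_common(20))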

Being a helpful sort, I gave that hint to ChatGPT. And confessed to my previous misunderstanding as well:

Sorry, I just realized line 17 wasn’t supposed to be finding the <lastPubDate> element for the channel, it was actually trying to find the <pubDate> element of the last <item> in the channel. Which is quite sensible. But line 17 still doesn’t work. I tried changing it to "last_pub_date = root.findall('.//pubDate')[-1].text", and that works. It correctly finds the last <pubDate> element in the file.

However, the script then exits without doing anything. Line 24 does not find any audio tracks at all. Your search criteria for the find_all() seems strange. The web page has no <div> elements with class “su-column-inner” or “su-clearfix”.

ChatGPT came back with this:

I apologize for the confusion. Here is an updated Python script that should work:

import requests
from bs4 import BeautifulSoup
import datetime
import xml.etree.ElementTree as ET

# URL of the website containing the audio tracks
url = 'http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope'

# Path of the RSS file
rss_file_path = '/path/to/podcast.rss'

# Read the RSS file
tree = ET.parse(rss_file_path)
root = tree.getroot()

# Get the latest pubDate from the existing RSS file
last_pub_date = root.findall('.//pubDate')[-1].text

# Fetch the website and parse it using BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the audio tracks on the website
audio_tracks = soup.find_all('article', class_='su-post')

# Loop through the audio tracks and add any new ones to the RSS file
for audio_track in audio_tracks:
    track_date_str = audio_track.find('span', class_='su-note').text.strip()
    track_date = datetime.datetime.strptime(track_date_str, '%B %d, %Y')
    track_title = audio_track.find('a').text.strip()
    track_url = audio_track.find('a')['href']
    if track_date > datetime.datetime.strptime(last_pub_date, '%a, %d %b %Y %H:%M:%S %Z'):
        item = ET.SubElement(root, 'item')
        title = ET.SubElement(item, 'title')
        title.text = track_title
        link = ET.SubElement(item, 'link')
        link.text = track_url
        pub_date = ET.SubElement(item, 'pubDate')
        pub_date.text = track_date.strftime('%a, %d %b %Y %H:%M:%S GMT')
        description = ET.SubElement(item, 'description')
        description.text = track_title
        enclosure = ET.SubElement(item, 'enclosure')
        enclosure.set('url', track_url)
        enclosure.set('length', '0')
        enclosure.set('type', 'audio/mpeg')

# Write the updated RSS file
tree.write(rss_file_path)

In this updated script, I changed the search criteria for find_all() to look for all <article> elements with a class of su-post, which is the class used by the website for its audio track entries. I apologize for the confusion and any inconvenience caused by my previous responses. Please let me know if you have any further questions or issues.

ChatGPT changed the find_all() to look for <article> elements instead of <div> elements. It was obvious, without even trying it, that this wouldn’t work either.

No, I’m afraid that still won’t work. The web page contains no <article> elements at all. I suspect maybe you haven’t looked at the page itself to understand the HTML it returns.

I suggest looking for <a> elements with class “play_archive”. Those elements contain the actual links to the tracks.

Thank you for bringing this to my attention. I apologize for the confusion and for not properly examining the website. Here is the updated Python script that should work with the new search criteria:

import requests
from bs4 import BeautifulSoup
import datetime
import xml.etree.ElementTree as ET

# URL of the website containing the audio tracks
url = 'http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope'

# Path of the RSS file
rss_file_path = '/path/to/podcast.rss'

# Read the RSS file
tree = ET.parse(rss_file_path)
root = tree.getroot()

# Get the latest pubDate from the existing RSS file
last_pub_date = root.findall('.//pubDate')[-1].text

# Fetch the website and parse it using BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the audio tracks on the website
audio_tracks = soup.find_all('a', class_='play_archive')

# Loop through the audio tracks and add any new ones to the RSS file
for audio_track in audio_tracks:
    track_url = audio_track['href']
    track_date_str = audio_track.find('span').text.strip()
    track_date = datetime.datetime.strptime(track_date_str, '%B %d, %Y')
    track_title = audio_track.find_previous('h2').text.strip()
    if track_date > datetime.datetime.strptime(last_pub_date, '%a, %d %b %Y %H:%M:%S %Z'):
        item = ET.SubElement(root, 'item')
        title = ET.SubElement(item, 'title')
        title.text = track_title
        link = ET.SubElement(item, 'link')
        link.text = track_url
        pub_date = ET.SubElement(item, 'pubDate')
        pub_date.text = track_date.strftime('%a, %d %b %Y %H:%M:%S GMT')
        description = ET.SubElement(item, 'description')
        description.text = track_title
        enclosure = ET.SubElement(item, 'enclosure')
        enclosure.set('url', track_url)
        enclosure.set('length', '0')
        enclosure.set('type', 'audio/mpeg')

# Write the updated RSS file
tree.write(rss_file_path)

In this updated script, I changed the search criteria for find_all() to look for all <a> elements with a class of play_archive, which is the class used by the website for its audio track links. I also updated the code to extract the title and publication date from the appropriate elements on the page.

I hope this updated script works correctly for you. Let me know if you have any further issues or questions.

Alright… let’s see what we get now. Nope… still fails.

No good. Fails on line 29 now:

Traceback (most recent call last):
  File "chatgpt.py", line 29, in <module>
    track_date_str = audio_track.find('span').text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

It finds the <a> elements correctly, but you’re then looking for a <span> subelement under the <a>, but that subelement does not exist… which you would realize if you actually looked at the HTML.

I’m thinking now you might be better off if instead of searching for the <a> elements, you go up a level to search for <div> elements, with a class “archiveList-post”. You can then find inside that both the <a> element, with the track’s URL, and a <div> element that contains the track’s name. Unfortunately, there is no element that contains the track’s date by itself… you’ll have to extract the date out of the track name.

And, one last response from ChatGPT:

I apologize for the mistakes in my previous responses. You are right, I should have looked more closely at the HTML of the page. Thank you for your patience.

Here is a revised version of the script that should work better. It searches for <div> elements with class “archiveList-post”, extracts the track URL and name from the <a> and <div> elements inside each one, and extracts the track date from the name using a regular expression.

import requests
import datetime
import re
import xml.etree.ElementTree as ET

# URL of the web page with the audio tracks
url = 'http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope'

# Regular expression to extract the track date from the track name
date_regex = re.compile(r'\d{4}-\d{2}-\d{2}')

# Read the web page
response = requests.get(url)
html = response.text

# Parse the HTML into an ElementTree
root = ET.fromstring(html)

# Find the channel element of the RSS file
channel = root.find('channel')

# Find the last pubDate in the RSS file
last_pub_date_str = channel.find('lastBuildDate').text
last_pub_date = datetime.datetime.strptime(last_pub_date_str, '%a, %d %b %Y %H:%M:%S %z')

# Find all the audio tracks on the web page
audio_tracks = root.findall('.//div[@class="archiveList-post"]')

# Create a new RSS item for each audio track and add it to the channel
for audio_track in audio_tracks:
    # Extract the track URL and name
    track_url = audio_track.find('a').get('href')
    track_name = audio_track.find('div').text.strip()
    
    # Extract the track date from the name using the regular expression
    date_match = date_regex.search(track_name)
    if date_match is None:
        continue
    track_date_str = date_match.group(0)
    track_date = datetime.datetime.strptime(track_date_str, '%Y-%m-%d')
    
    # Skip tracks that were already in the RSS file
    if track_date <= last_pub_date:
        continue
    
    # Create a new RSS item for the track
    item = ET.Element('item')
    channel.append(item)
    
    # Set the title, link, and pubDate of the item
    item_title = ET.SubElement(item, 'title')
    item_title.text = track_name
    
    item_link = ET.SubElement(item, 'link')
    item_link.text = track_url
    
    item_pub_date = ET.SubElement(item, 'pubDate')
    item_pub_date.text = track_date.strftime('%a, %d %b %Y %H:%M:%S %z')

# Write the updated RSS file to disk
tree = ET.ElementTree(root)
tree.write('podcast.xml', encoding='UTF-8', xml_declaration=True)

I hope this version works better for you!

And… it has completely borked the entire thing. It no longer reads the RSS file at all; it’s trying to read the RSS file content from the web page’s HTML.

It’s using a regular expression to pull the date out of the track name… that’s good; it’s how I would do it. But this regex will not match the dates as they appear in the HTML.
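
In the actual HTML, the dates read like “April 24, 2023 at 15:00”, sandwiched between the two “ - ” separators in the track name, so nothing in YYYY-MM-DD form will ever match. Something along these lines is needed instead (essentially what my own version below ends up doing):

import re
from datetime import datetime

# Example track name, as it appears in the site's <a href> links:
episode_name = "The Sentinel's Marvellous Kaleidoscope - April 24, 2023 at 15:00 - CFRU 93.3.mp3"

# The date sits between the two " - " separators, not in ISO form.
date_time = re.match(r'.* - (.*) - .*', episode_name)[1]
timestamp = datetime.strptime(date_time, '%B %d, %Y at %H:%M')
print(timestamp)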

Giving up

At this point, I’m giving up on it. I can’t get it to generate working code. It’s always close, but never quite there. It’s quite obvious that it never fetched the HTML from the website and tried to understand its structure, so that it could parse it correctly. Perhaps that’s a technical limitation of ChatGPT (it just isn’t able to fetch content from external sources on its own). And expecting it to look at that HTML and pick the important bits out of 1700 lines of fluff is probably a very tall order for a Large Language Model AI, which is basically just pattern recognition on a grand scale.

My own solution

Here’s my own initial version of the code. It took me one morning to learn the format of an RSS feed XML file and get it basically functional. I was lazy with the HTML parsing: I knew everything I needed for each episode could be found in a single <a> element, so I just used a regular expression to find them.

import urllib.request
import urllib.parse
import html
import re
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone, timedelta
import email.utils

rss_feed = 'SentinelsMarvelousKaleidoscope.rss'
archive_url = "http://www.cfru.ca/?s=The%20Sentinel%26%238217%3Bs%20Marvellous%20Kaleidoscope"

# Read in existing RSS feed
rss = ET.parse(rss_feed)
existing_episodes = set()
first_item_index = -1
for i, child in enumerate(rss.find("channel")):
    if child.tag == 'item':
        if first_item_index == -1:
            first_item_index = i
        existing_episodes.add( child.find('link').text )
if first_item_index == -1:
    # There were no items... first one will go at the end.
    first_item_index = len(rss.find("channel"))

def fix_indentation(current, parent=None, index=-1, depth=0, indentation='    '):
    for i, node in enumerate(current):
        fix_indentation( node, current, i, depth+1 )
    if parent is not None:
        if index == 0:
            parent.text = '\n' + (indentation*depth)
        else:
            parent[index-1].tail = '\n' + (indentation*depth)
        if index == len(parent)-1:
            current.tail = '\n' + (indentation*(depth-1))
        
with urllib.request.urlopen( archive_url ) as response:
    archived_shows_html = response.read().decode( 'UTF-8' )

# The archive is always returned in reverse chronological order, newest episodes first.
changes = False
for show in re.findall( r'<a href="(.*)" class="playnow play_archive button">play</a>', archived_shows_html ):
    show = html.unescape( show )
    parsed = urllib.parse.urlparse( show )
    assert parsed.params == ""
    assert parsed.query == ""
    assert parsed.fragment == ""

    episode_name = os.path.basename( parsed.path )
    quoted_path = urllib.parse.quote(parsed.path)
    new_url = urllib.parse.urlunparse( (parsed.scheme, parsed.netloc, quoted_path, "", "", "") )

    if new_url in existing_episodes:
        # We already have this one.  Skip it.
        print( "Skipping existing episode '%s'" % episode_name )
        continue

    print( "Found new episode '%s'" % episode_name )
    # Pull out the date and time.
    date_time = re.match( r'.* - (.*) - .*', episode_name )[1]
    print( date_time )
    timestamp = datetime.strptime( date_time, "%B %d, %Y at %H:%M" )
    timestamp = timestamp.replace( tzinfo=timezone(timedelta(hours=-5), "EST") )

    # Create a new <item>
    item = ET.Element( 'item' )
    title = ET.SubElement( item, 'title' )
    title.text = timestamp.strftime( "%B %d, %Y" )
    link = ET.SubElement( item, 'link' )
    link.text = new_url
    guid = ET.SubElement( item, 'guid', isPermaLink="true" )
    guid.text = new_url
    pubDate = ET.SubElement( item, 'pubDate' )
    pubDate.text = email.utils.format_datetime( timestamp )
    
    # Add it to the RSS
    channel = rss.find( "./channel" )
    channel.insert( first_item_index, item )
    first_item_index = first_item_index + 1
    changes = True

if changes:
    fix_indentation( rss.getroot() )
    rss.write( rss_feed, xml_declaration=True )
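
From there, it’s just a matter of scheduling the script to run weekly (Task Scheduler on Windows, cron elsewhere) and subscribing to the resulting .rss file in a media player, exactly as planned at the outset.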
