A simple script for checking whether a webpage has changed or not

Suppose that you want to keep track of whether a webpage (say, a blog) has changed since the last time you visited it. Instead of visiting the blog, you simply run a command and get a YES/NO answer. If you think this is a great idea, read on. If you are only looking for the Alfred extension, scroll down to the bottom of the post.

The idea is to save the MD5 hash of the contents of a webpage (we don't want to save the whole page, since that would not be very space efficient!):

\textnormal{MEMORY}[f_{\textnormal{MD5}}(\textnormal{URL})]\leftarrow f_{\textnormal{MD5}}(\textnormal{contents})

The next time we call the command we check if the value has changed:

\mathbf{if} (\textnormal{MEMORY}[f_{\textnormal{MD5}}(\textnormal{URL})] = f_{\textnormal{MD5}}(\textnormal{contents})) \ \mathbf{then}
   \textnormal{Not changed}
\mathbf{else}
   \textnormal{Has changed}
   \textnormal{MEMORY}[f_{\textnormal{MD5}}(\textnormal{URL})]\leftarrow f_{\textnormal{MD5}}(\textnormal{contents})

Really simple, eh? Below follows a simple shell script which performs the task. It is written for macOS, since it relies on the BSD versions of the md5 and stat commands:

#!/bin/sh

url='google.com'
tmp_dir=$HOME'/.tmp/'

# make sure that the temporary directory exists
mkdir -p "$tmp_dir"

# create file name based on hash of url
file=$tmp_dir$(md5 -q -s "$url")

# does file exist?
if [ ! -f "$file" ]; then
    touch "$file"
fi

# hash the contents that url points to and put the hash in a temporary file
curl "$url" 2>/dev/null | md5 > "$file"'tmp'

echo "$url:"

# check if the hash of the contents has changed
if diff "$file" "$file"'tmp' >/dev/null ; then
	stat=$(stat -f "%m" "$file");
	echo Unchanged since at least $((($(date +%s) - ${stat%% *})/86400)) days ago.
	rm "$file"'tmp'
else
	echo Has changed.
	mv "$file"'tmp' "$file"
fi
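
On systems without the BSD md5 and stat commands, the same check could be sketched as a function that reads the page contents from standard input. This is only a sketch under assumptions: the helper names hash_stdin and check_changed are made up for illustration, and it uses md5sum when that tool is installed and falls back to md5 otherwise.

```shell
# Print the MD5 hex digest of stdin, using whichever tool is installed.
hash_stdin() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum | cut -d ' ' -f 1    # GNU coreutils prints "<hash>  -"
    else
        md5 -q                      # BSD / macOS prints only the hash
    fi
}

# check_changed KEY CACHE_DIR < contents
# Compares the hash of stdin with the hash stored for KEY and
# updates the stored hash when they differ.
check_changed() {
    key=$1
    cache_dir=$2
    mkdir -p "$cache_dir"

    # The cache file is named after the hash of the key (the URL).
    file="$cache_dir/$(printf '%s' "$key" | hash_stdin)"

    new=$(hash_stdin)
    old=''
    if [ -f "$file" ]; then
        old=$(cat "$file")
    fi

    if [ "$new" = "$old" ]; then
        echo "Unchanged."
    else
        echo "Has changed."
        printf '%s\n' "$new" > "$file"
    fi
}

# Possible usage (not run here, since it needs network access):
#   curl -s "$url" | check_changed "$url" "$HOME/.tmp"
```

As in the script above, the very first check of a URL reports "Has changed.", because no hash has been stored yet.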

If you would like a finer time resolution, you can change the line

echo Unchanged since at least $((($(date +%s) - ${stat%% *})/86400)) days ago.

to one of the following:

  1. Seconds:
    echo Unchanged since at least $((($(date +%s) - ${stat%% *}))) seconds ago.
  2. Minutes:
    echo Unchanged since at least $((($(date +%s) - ${stat%% *})/60)) minutes ago.
  3. Hours:
    echo Unchanged since at least $((($(date +%s) - ${stat%% *})/3600)) hours ago.
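
Instead of hard-coding one unit, the script could pick the largest unit that fits. The following is a minimal sketch of that idea; format_age is a hypothetical helper name, not part of the original script, and it takes the elapsed time in seconds as its only argument.

```shell
# format_age SECONDS
# Prints the elapsed time in the largest unit that fits
# (integer division, so the result is rounded down).
format_age() {
    secs=$1
    if [ "$secs" -ge 86400 ]; then
        echo "$((secs / 86400)) days"
    elif [ "$secs" -ge 3600 ]; then
        echo "$((secs / 3600)) hours"
    elif [ "$secs" -ge 60 ]; then
        echo "$((secs / 60)) minutes"
    else
        echo "$secs seconds"
    fi
}

# Possible usage inside the script, where $stat holds the modification time:
#   echo "Unchanged since at least $(format_age $(($(date +%s) - stat))) ago."
```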

There are some drawbacks:

  1. The script uses a primitive way of identifying pages. For instance, it distinguishes between www.google.com, google.com, http://google.com and http://www.google.com, although these addresses point to the very same destination. This means that it will keep independent records of each variation of the URL. In order to remedy this, one has to normalize the URL (for instance, strip the scheme and a leading "www.") and hash the normalized form instead of the raw URL. Hashing only the domain would go too far, since different pages on the same site would then share one record.
  2. Many pages have session IDs and dynamic content related to ads. This will cause the script to think that the page has changed. Well, in fact it has, but maybe not the content that we are interested in. A fix for this is to extract the information that is neither ads nor session IDs and hash that instead. In other words, the hash of the relevant information will be stored in the temporary cache file, instead of the hash of the entire page.
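
Both fixes could be sketched roughly as below. The helper names normalize_url and strip_session_ids are hypothetical, and so are the parameter names sid, sessionid and PHPSESSID; real pages will need their own patterns. The sketch also assumes that the http and https variants, and the variants with and without "www.", should count as the same page.

```shell
# normalize_url URL
# Strips the scheme, a leading "www." and a trailing slash,
# so that common spellings of an address map to the same key.
normalize_url() {
    u=$1
    u=${u#http://}
    u=${u#https://}
    u=${u#www.}
    u=${u%/}
    printf '%s\n' "$u"
}

# strip_session_ids < contents
# Removes session-id parameters (assumed names) from the page
# so that they do not influence the hash.
strip_session_ids() {
    sed -E 's/(sid|sessionid|PHPSESSID)=[A-Za-z0-9]+//g'
}

# Possible usage in the script above:
#   file=$tmp_dir$(md5 -q -s "$(normalize_url "$url")")
#   curl "$url" 2>/dev/null | strip_session_ids | md5 > "$file"'tmp'
```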

Also note that the script creates a hidden folder called ‘.tmp’ in your home folder. This folder will contain the cached hash values. You can remove it at any time by typing:

rm -r ~/.tmp

in the Terminal.

Alfred extension

Of course, I made an Alfred extension of the script. In order to use it, you will need the Powerpack.

The extension can be found here! As always, I take no responsibility for whatever damage it may cause.
