When the BBC recently redesigned the BBC Sport website, a screen-scraping script that I wrote a year or so ago suddenly stopped working. While I was rewriting it, I thought I’d take the opportunity to improve it a little, release the code via GitHub, and write a blog post about it.
A little history
Feel free to skip this section if you just want to get to the code.
Up until not so long ago, it was impossible to find a decent online football fixture calendar — that is, something that you could point your iPhone at, or add to your Google Calendar, and track all the upcoming fixtures for your chosen team.
This is because the “data” in question, the actual fixture list, has been deemed copyright by the Premier League (and the Football League). Many small websites have fallen foul of this ruling, and while the legality of the case has been questioned, any website publishing this information has nonetheless found itself in potentially hot water with the authorities.
(Of course, an official channel could have made this information available in the necessary format, but large media agencies are not noted for their forward-thinking approach to technology. Either way, attempting to copyright and license factual, widely published data about which football team is playing which, and when, is clearly ludicrous. However, these are blog posts for another time.)
It occurred to me that it wouldn’t be too difficult to convert the fixture lists published on the BBC Sport website into iCal format, suitable for most calendar clients. This information is highly accurate, and updated promptly and automatically (in the event of a fixture being rescheduled, for example). The code discussed below does just that.
Addendum: Google have since published fixture lists for pretty much all worldwide sporting leagues; quite how these are generated, and what their legal position is, I don’t know. (Any insight into this would be appreciated.) They’re not ideal, though: you can only subscribe at competition level, such as “All Manchester United’s Premier League matches”, and you need to add FA Cup games (etc.) separately. Also, each fixture is listed as “Away Team @ Home Team”, which is counterintuitive to British readers. So the following code does still have some purpose.
The code itself is actually straightforward. (In fact, compared to the long background piece above, this part of the blog will be very brief.)
Initially, I used cURL to load the contents of a fixture list from the BBC site (such as http://www.bbc.co.uk/sport/football/teams/manchester-united/fixtures), then wrote a regular expression to extract the relevant data (teams, dates and so on) from the HTML source.
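To illustrate the regular-expression approach, and why it proved fragile, here is a minimal sketch in Python rather than the PHP used in the original script. The markup, class names and team names are all invented for the example (the real BBC page structure is not reproduced here), and the download step is left out so that the snippet runs offline.

```python
import re

# Hypothetical markup: invented for illustration, not the real BBC page source.
SAMPLE_HTML = """
<tr class="fixture">
  <td class="date">Sat 5 Jan 2030</td>
  <td class="home">Example United</td>
  <td class="away">Sample Town</td>
  <td class="kickoff">15:00</td>
</tr>
<tr class="fixture">
  <td class="date">Sat 12 Jan 2030</td>
  <td class="home">Demo Rovers</td>
  <td class="away">Example United</td>
  <td class="kickoff">12:30</td>
</tr>
"""

# One pattern describing an entire table row. Every tag and attribute
# must appear exactly as written here for a row to match.
FIXTURE_RE = re.compile(
    r'<tr class="fixture">\s*'
    r'<td class="date">(?P<date>[^<]+)</td>\s*'
    r'<td class="home">(?P<home>[^<]+)</td>\s*'
    r'<td class="away">(?P<away>[^<]+)</td>\s*'
    r'<td class="kickoff">(?P<kickoff>[^<]+)</td>\s*'
    r'</tr>'
)


def extract_fixtures(html):
    """Return one dict (date, home, away, kickoff) per matching row."""
    return [match.groupdict() for match in FIXTURE_RE.finditer(html)]


fixtures = extract_fixtures(SAMPLE_HTML)

# Simulate a small redesign: the row gains one extra class name.
redesigned = SAMPLE_HTML.replace('class="fixture"', 'class="fixture odd"')
fixtures_after_redesign = extract_fixtures(redesigned)
```

In this sketch a single extra class name on the row is enough to stop the pattern matching at all, which is the kind of breakage a site redesign can cause.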
I then compiled this into an iCal data export manually, writing out code along the lines of:
… and so on.
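To give a flavour of what assembling an iCal export by hand involves, here is a rough sketch in Python (not the original PHP). Several details are assumptions made for the example: the two-hour event length, the use of "floating" local times with no time zone, the `example.invalid` domain in the UID, and the PRODID string. Long lines are not folded at 75 octets as the iCalendar specification asks, which is one of the details a dedicated library takes care of.

```python
from datetime import datetime, timedelta, timezone

MATCH_LENGTH = timedelta(hours=2)   # assumed event length, not taken from the BBC data
LOCAL_FORMAT = "%Y%m%dT%H%M%S"      # "floating" local time, no time zone suffix


def ical_escape(text):
    """Escape the characters that are special inside an iCalendar TEXT value."""
    return (
        text.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def build_calendar(fixtures, stamp=None):
    """Build an iCalendar document from dicts with 'home', 'away' and 'start' keys."""
    stamp = stamp or datetime.now(timezone.utc)
    dtstamp = stamp.strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//fixture-sketch//EN",  # hypothetical product identifier
    ]
    for fixture in fixtures:
        start = fixture["start"]
        end = start + MATCH_LENGTH
        # "Home v Away", the ordering British readers expect.
        summary = fixture["home"] + " v " + fixture["away"]
        uid = "{}-{}-{}@example.invalid".format(
            start.strftime("%Y%m%dT%H%M"), fixture["home"], fixture["away"]
        ).replace(" ", "")
        lines += [
            "BEGIN:VEVENT",
            "UID:" + uid,
            "DTSTAMP:" + dtstamp,
            "DTSTART:" + start.strftime(LOCAL_FORMAT),
            "DTEND:" + end.strftime(LOCAL_FORMAT),
            "SUMMARY:" + ical_escape(summary),
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


calendar_text = build_calendar(
    [{"home": "Example United", "away": "Sample Town",
      "start": datetime(2030, 1, 5, 15, 0)}],
    stamp=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
```

Every required property (UID, DTSTAMP, the BEGIN/END pairs) has to be remembered and formatted by hand here, which is easy to get subtly wrong.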
This worked well, but when the layout of the BBC site changed (thus changing the HTML source, and thus breaking the regular expression) I took the opportunity to make everything more robust. This meant:
- Loading content via
- Parsing the resulting content using PHP Simple HTML DOM Parser
- Building the iCal file using iCalCreator
The DOM parser is more flexible than a simple regular expression (and less likely to break if the HTML changes slightly), and iCalCreator is a more reliable way of building the iCal files. Any slight overhead is of no consequence, given the relatively low frequency at which the script will be called (see below).
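As a rough illustration of why parser-based extraction copes better with small markup changes, here is a sketch using Python's standard-library `html.parser`. This is a stand-in, not PHP Simple HTML DOM Parser: it is event-based rather than a true DOM, and the markup and class names are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup, with the kind of incidental extras a redesign might add:
# a second class on the row, a data attribute, and a link around a team name.
SAMPLE_HTML = """
<table>
<tr class="fixture odd" data-id="1">
  <td class="home"><a href="#">Example United</a></td>
  <td class="away">Sample Town</td>
</tr>
</table>
"""


class FixtureParser(HTMLParser):
    """Collect the text of <td> cells in rows whose class list includes 'fixture'."""

    def __init__(self):
        super().__init__()
        self.fixtures = []
        self._row = None    # dict being filled in, or None outside a fixture row
        self._field = None  # first class name of the cell currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "fixture" in classes:
            self._row = {}
        elif tag == "td" and self._row is not None and classes:
            self._field = classes[0]
            self._row[self._field] = ""

    def handle_data(self, data):
        # Text inside nested tags (such as the <a> above) still arrives here.
        if self._row is not None and self._field is not None:
            self._row[self._field] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.fixtures.append({k: v.strip() for k, v in self._row.items()})
            self._row = None


def parse_fixtures(html):
    parser = FixtureParser()
    parser.feed(html)
    parser.close()
    return parser.fixtures


fixtures = parse_fixtures(SAMPLE_HTML)
```

Because the parser looks at tags and class lists rather than exact character sequences, the extra class, the extra attribute and the nested link make no difference to what is extracted.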
Obviously, it would be trivial to host this script, allow users to specify their favourite team, and thus publish calendars to thousands of individuals. However, this could fall foul of the legal situation described above. Screen-scraping itself is also a legal grey area, and chances are that the BBC website’s T&Cs forbid it.
However, given that the code is publicly available, there is nothing stopping you from hosting a copy of this script yourself, if you so wish. Simply download all files, modify the main PHP script to include the URL of your team’s fixtures, and upload everything to a web server of your choosing.
You can then point your iPhone (or other calendar client) to your server, and everything should work nicely. For example, if you have uploaded everything to a folder on
fixtures, your calendar URL will be:
All code can be downloaded from GitHub. Please note that the PHP Simple HTML DOM Parser and iCalCreator are entirely the work of two separate third parties, and they retain full copyright of and credit for their work.
This is the first time we’ve actually “released” code, and although it is relatively tidy, it could perhaps be better documented, or follow accepted standards more closely. Any feedback or suggestions on how any aspect of the code could be improved will be gratefully received.