There’s been lots of political chatter about metadata in the media of late. As someone who has worked with web-based metadata — categorisation, collection, specifications — over a number of years, let me bring some points to the table.
The simplest description of metadata is: data about data. Some metadata is determined or created by the web page creator. There are metadata tags contained within web page code, such as keywords and a page description. This may include a key image representing the page, such as those designed to be picked up when an item is shared on Facebook.
It’s worth noting that these can all be easily faked and obfuscated to at least partially disguise the true nature of the website contents. One could tag a page as being about, let’s say, bike riding — when the page contents might actually be about something more sinister. A few actual paragraphs about bikes, interspersed with the sinister content, could be all that is needed to make the page show up in metadata and searches as being innocuous and unworthy of further investigation.
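To make the point concrete, here is a minimal Python sketch of how creator-defined tags could be read out of a page. The sample page, the `MetaTagParser` class and the `extract_meta` helper are all made up for illustration. Notice that nothing in it looks at the body text, so the tags report only what the page author chose to declare.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/property -> content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Plain tags use name="..."; Facebook-style sharing tags use property="og:..."
        key = attributes.get("name") or attributes.get("property")
        if key and attributes.get("content") is not None:
            self.meta[key.lower()] = attributes["content"]


# Hypothetical page: the tags say "bike riding", whatever the body contains.
SAMPLE_PAGE = """
<html><head>
<title>Weekend rides</title>
<meta name="keywords" content="bike riding, cycling, trails">
<meta name="description" content="A guide to weekend bike rides.">
<meta property="og:image" content="https://example.com/bike.jpg">
</head><body><p>Page text goes here.</p></body></html>
"""


def extract_meta(html):
    """Return the declared metadata of an HTML document as a dict."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


page_meta = extract_meta(SAMPLE_PAGE)
for key, value in page_meta.items():
    print(f"{key}: {value}")
```

Any system that relies on these declared values alone would file this page under cycling, regardless of what the paragraphs say.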
Then there’s metadata not defined by the website creator, such as the page address/URL (which stands for uniform resource locator), the date and time the page was accessed, the web browser and operating system used to access the page, the ISP (internet service provider) used to access the web, and the IP (internet protocol) address of the user. All easily captured. None really indicative of the true contents of a page.
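One familiar place where most of these fields sit side by side is a web server access log. The Python sketch below parses a single line in the widely used Combined Log Format. The sample line, the address in it and the `parse_access_line` helper are invented for illustration, and this is a view from the web server's side, not a description of what any ISP actually records.

```python
import re
from datetime import datetime

# Combined Log Format: host, identity, user, [time], "request", status, size,
# "referer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_line(line):
    """Split one access log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Turn the bracketed timestamp into a timezone-aware datetime.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Made-up example line; 203.0.113.7 is a documentation-only address.
SAMPLE_LINE = (
    '203.0.113.7 - - [12/Aug/2014:09:15:02 +1000] '
    '"GET /rides/weekend.html HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0"'
)

record = parse_access_line(SAMPLE_LINE)
for field in ("ip", "time", "path", "user_agent"):
    print(f"{field}: {record[field]}")
```

Every field here describes the visit: who, when, with what software, to which address. None of them describes what the page actually said.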
What happens when a virtual ISP is used to hide a user’s IP address? The number of people who use these to troll on Twitter suggests that this method of hiding one’s IP address is far more widespread here than the government may realise. How much of this activity — if any — would the ISP be able to detect and record? And wouldn’t any potential terrorist go straight to a virtual ISP (which can be set up via a US provider in just a few minutes) if they wanted to conduct any illicit activity?
There’s also private browsing. Private browsers can be downloaded by anyone and used to view basic web page content — excluding multimedia — without revealing the user’s IP address. Again, how much of this activity would the ISP be able to detect and record?
What about browser caching? Beloved of online advertisers and used constantly by Facebook, your browser’s cache may be read by the websites you visit, which is another reason to keep cleaning it out. Have you ever wondered why you visit a website on a particular topic, and then the next time you visit Facebook there’s an ad for that same website, or on that same topic? Will that sort of data call register as user activity?
And let’s say an ad is served up to you on a slightly off-colour topic, without your permission or request, while you are browsing another unrelated web page; the ad or content is simply called into the page from another address. Will this perhaps register as you calling up that web address yourself? Will the ISP separately store, and be able to identify, URLs/addresses you visited directly and those called in to your browser by other means?
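One clue that could separate the two cases is the HTTP `Referer` header, which a browser usually attaches to requests triggered from within a page. The Python sketch below is a rough heuristic under stated assumptions: the observer can see the header at all (which is not the case for the contents of encrypted traffic), and the `classify_request` function and the sample requests are hypothetical. It cannot tell an ad pulled into a page from a link the user deliberately clicked, since both carry a referer from another host.

```python
from urllib.parse import urlsplit


def classify_request(url, referer):
    """Label a request by comparing its host with the host of the referring page."""
    if not referer:
        # No referring page: typed address, bookmark, or a stripped header.
        return "direct"
    if urlsplit(url).hostname == urlsplit(referer).hostname:
        return "same-host"
    # Requested from a page on another host: an embedded ad or a followed link.
    return "cross-host"


# Hypothetical requests observed while one news page loads.
SAMPLE_REQUESTS = [
    ("http://news.example.com/story.html", None),
    ("http://news.example.com/style.css", "http://news.example.com/story.html"),
    ("http://ads.example.net/banner.js", "http://news.example.com/story.html"),
]

labels = [classify_request(url, referer) for url, referer in SAMPLE_REQUESTS]
for (url, _), label in zip(SAMPLE_REQUESTS, labels):
    print(f"{label:10s} {url}")
```

A record that stored only the address and the time would lose even this distinction, and the ad server would look like one more site the user chose to visit.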
There’s also posting of data: open form submission, etc. What metadata would be stored here? And then there’s HTTPS browsing, credit cards, online stores, online banking … I’m getting tired just thinking about it. I’m also worried about what information Google and Facebook are collecting, something we have little chance of monitoring or regulating.
In this day and age, perhaps we do need to take steps to store metadata, with a high level of security placed around its access. However, it’s essential that an appropriate metadata scheme be put together — defining what can be collected, how, in what format, and the parameters for collection and storage — by those who truly know what they’re talking about.
*Liz Van Dort is a content and information management consultant who has been working with metadata for over 12 years.