Semantic Embed: Part 3

Librarians are all about categorizing things. So is RDF. Are they a good match? That’s what I was thinking about at my third New York Semantic Web Meetup event…

The Librarian and RDF (Barbara McGlamery)

Barbara McGlamery, the first librarian of the evening, is actually an ex-librarian and, recently, an ex-ontologist for Time Inc (she just left Time for Martha Stewart). She talked about the application of the Semantic Web to Time’s online content, which basically follows an ontology-lite approach that consists of 1) setting up ontologies to define some rules and properties, 2) importing them into taxonomies, where resources (like ‘Will Smith’) are described, and 3) creating ‘navigational taxonomies’ so that editors and other people can access the information in whatever ways they want (for example, by using alternate names). Whenever an editor publishes a new article, he or she manually tags it with all the relevant resources, which makes it possible for machines to do basic inferences (like noticing that you were reading an article about Will Smith and recommending articles to you about movies Will Smith starred in, based on its awareness of the RDF triple: ‘Will Smith is a leadPerformerIn Hancock’). Which sounds great, except that the inferencing part didn’t work that well. McGlamery explained that the information just ended up being too heavy, which meant that inferencing was slow and couldn’t be very complex.
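
To make the inference step concrete, here is a minimal sketch in plain Python of how manually tagged articles plus one RDF-style triple could drive a recommendation. It is not Time's system: the article IDs, the `about` predicate, and the helper names are all made up for illustration, and only the Will Smith / Hancock triple comes from the talk.

```python
# Each fact is a (subject, predicate, object) triple.
# Only the first triple is from the talk; everything else is hypothetical.
TRIPLES = [
    ("Will Smith", "leadPerformerIn", "Hancock"),
    ("article-1", "about", "Will Smith"),       # an editor's manual tag
    ("article-2", "about", "Hancock"),
    ("article-3", "about", "Martha Stewart"),
]


def objects(subject, predicate):
    """Everything the subject points to through the given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """Everything that points to obj through the given predicate."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def recommend(article):
    """Articles about movies starring anyone this article is tagged with."""
    recommendations = set()
    for person in objects(article, "about"):
        for movie in objects(person, "leadPerformerIn"):
            recommendations |= subjects("about", movie)
    recommendations.discard(article)  # never recommend the article itself
    return sorted(recommendations)


print(recommend("article-1"))  # ['article-2']
```

Even in this toy version, every recommendation means chaining several lookups across the whole set of triples, which hints at why inferencing over a large, heavily tagged archive could get slow.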

I thought Time’s attempt to plunge into the Semantic Web was admirable (they were apparently very early adopters of the technology), but I couldn’t quite understand their reasons for it until it became clear that it was just-another-Old-Media story. Sure, Time was adopting innovative technology, but it was for decidedly non-innovative ends: as another means of control over their content. After her talk, I asked McGlamery why Time had even bothered with all this Semantic Web inferencing for their article-recommendation feature – why not just recommend articles that were popular with other readers like you? That’s how most recommendation engines work. McGlamery’s reply was that Time is a hundred-year-old company and therefore favors the ‘curatorial’ approach over the crowdsourcing one, which I think explains why ontologies looked so good to them. I’ve talked before about how ontologies are in some sense a form of control – though I think they can be used for great things, especially in the news business. The question is just whether Time is going about using them in the right way…

…which is something I won’t answer. Instead, I’ll briefly describe:

Jon Phipps’ Rant about RDA

Actually, this story isn’t much different from the first one. Jon Phipps’ rant was also about old control-systems adjusting (or failing to adjust) to the new landscape of data and metadata. RDA stands for ‘Resource Description and Access’ and at this point consists of 1300+ pages intended to represent the collective wisdom of generations of catalogers. Phipps still thinks cataloging is worth doing (especially the informal kind that everyone does when they tag a photo in Flickr or bookmark a site on Delicious), but was mostly frustrated by the inflexibility of legions of catalogers in transitioning from their old rules to new ones.

Quote of the evening (from an audience member): “You can’t even get people to use Excel in the public library system. RDF? Forget about it.”

Semantic Embed: Part 2

This is my second posting on an event by the New York Semantic Web Meetup, which covers all aspects of the W3C recommended Semantic Web from technology to business. An offshoot Meetup, which will focus more on natural language processing, computational linguistics, and machine learning, is supposed to start having meetings in January, and I plan to be there. See my first Meetup post here.

Semantic Web Programming – the book (John Hebeler)
The first slide in John Hebeler’s presentation last night had just one sentence: “Our ability to create information far exceeds our ability to manage it,” which is actually the best and most succinct argument for the Semantic Web that I’ve heard thus far. Hebeler made his point more visceral by asking us to guess how many files there were on his MacBook (the answer is over a million, about twice as many as most of us guessed). Imagining that many files on every computer hooked up to the Internet (there were over 1.5 billion Internet users as of June 30) is already overwhelming. And the bigger this mass of information gets, the stronger its pull toward entropy and the more we lose control. It’s something that should scare us, Hebeler said, because all that information is only as useful to us as our tools to sort through it; if we can’t find what we want, it’s the same as having lost it.

Luckily, Hebeler sees our salvation in the Semantic Web – or more specifically, in a highly flexible knowledge base that can handle both complex and simple types of data – and he’s co-authored the book to guide us there. It looks like it’s pretty easy to use: I’m not much of a programmer, but even I could follow the examples, all of which are demonstrated using Java code in the book. In trying to integrate data from, for example, Facebook and Gmail, which represent it in totally different formats, Hebeler gave us seven basic steps, or areas of code:

1) Knowledge-base creation

2) How to query it – just a simple search

3) Setting up your ontologies

4) Ontology/instance alignment – combine two ontologies, for example by teaching your program that what one ontology calls an “individual” is the same thing as what the other calls a “person,” or that “Kathryn” is equivalent to “Kate”

5) Reasoner – your program won’t incorporate its new understanding of equivalencies until you apply the reasoner

6) OWL restriction – allows you to apply constraints

7) Rules – allows you to apply rules
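
Steps 4 and 5 are the ones that made the idea click for me, so here is a rough sketch of them. The book's own examples are in Java; this is plain Python with no Semantic Web library, and the `equivalentTo` predicate, the email address, and the function names are my own placeholders rather than anything from the book. The point it tries to show is that alignment statements do nothing on their own until a reasoner is applied.

```python
# Facts from two hypothetical sources that use different vocabularies.
facts = {
    ("Kathryn", "type", "individual"),       # one source's wording
    ("Kate", "email", "kate@example.org"),   # the other source's wording
}

# Step 4: alignment statements saying which names mean the same thing.
alignments = {
    ("individual", "equivalentTo", "person"),
    ("Kathryn", "equivalentTo", "Kate"),
}


def query(kb, predicate, obj):
    """Step 2: a simple search for subjects with the given predicate/object."""
    return sorted(s for s, p, o in kb if p == predicate and o == obj)


def apply_reasoner(kb, alignments):
    """Step 5: copy facts across equivalent names until nothing new appears."""
    swaps = set()
    for a, _, b in alignments:
        swaps.add((a, b))
        swaps.add((b, a))  # equivalence works in both directions
    inferred = set(kb)
    while True:
        new = set()
        for s, p, o in inferred:
            for old, alias in swaps:
                if s == old:
                    new.add((alias, p, o))
                if o == old:
                    new.add((s, p, alias))
        if new <= inferred:  # fixed point reached
            return inferred
        inferred |= new


print(query(facts, "type", "person"))     # [] (no reasoning applied yet)
inferred = apply_reasoner(facts, alignments)
print(query(inferred, "type", "person"))  # ['Kate', 'Kathryn']
```

A real reasoner does far more than substitute names, but the before-and-after queries show the basic effect: the knowledge base only "knows" that Kathryn is a person once the equivalences have been applied.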

He and the other co-authors also maintain a website where they field questions and add updates about the book.

Lucene (Otis Gospodnetic)

The Lucene presentation by Otis Gospodnetic was aimed primarily at programmers who might want to use the Lucene software for indexing and searching of text. Lucene is actually just one piece of Apache Lucene, an Apache Software Foundation open-source project that includes other sub-projects like Nutch (a framework for building web-crawlers) and Solr (a search server). All of it, of course, is free, and since I’m not expert enough to vouch for any of it, I’d suggest checking out the Apache Lucene website where everything is available for download.

Semantic Embed: Part 1

In her quest to bring you the most authentic, up-to-date news about the evolution of the web, this reporter is venturing where few go: straight to the heart of NYC’s little-known Semantic Web community. It is there, buried in rule interchange formats and Unicode, hidden behind coke-bottle glasses and tablet PCs, that she hopes to find the people who are actually building the web.

Last night was my second New York Semantic Web Meetup event, so I knew a little more about what to expect (free pizza and liberal use of PowerPoint, unusually high Y to X chromosome ratio). The night was divided between two speakers: Mike Cataldo (CEO of Cambridge Semantics, which uses semantic web technology to solve businesses’ problems) and Lee Feigenbaum (a VP at Cambridge Semantics and co-chair of W3C’s SPARQL working group – which I’ll explain later). It alternated between pretty heavy business-talk (“…and that’s game-changing!”) and tech-talk (“Supplant the mystifying OPTIONAL/!bound method of negation with a dedicated construct”), but here’s what I was able to scrape together:

Cambridge Semantics
So Cambridge Semantics provides “practical solutions for today’s business problems using the most advanced semantic technology” – what does that mean? Essentially, they make it easier to get the data a company needs out of the applications that keep it locked up. A cool feature is that they have a plug-in to use Microsoft Excel as both a source for the data and as an interface for looking at it.


Apparently, there are a couple of companies using Cambridge Semantics technology now, including a biopharmaceutical firm in Belgium and a startup called Book Of Odds that calculates the odds of various everyday activities.

SPARQL

As Lee Feigenbaum told me later, if SPARQL is working the way it should, most people shouldn’t even know that it’s there. That said, it’s probably useful to know a little about this core Semantic Web technology, if only to get a better idea of what the Semantic Web might be capable of.


SPARQL is a query language – it’s built for asking questions (and getting back the right answers). It’s got to be able to ask questions that pull together data from a lot of different sources in new, complex ways. The example query Feigenbaum gave in his talk was: “What are the names of all landlocked countries with a population greater than 15 million?” To answer that question, SPARQL first has to know about words like “country” and “population” (that “population” is a property of “country,” for example, and that “population” should refer to a number) and then combine information from different databases to get the right answer. What SPARQL does, then, is a whole lot more powerful than what Google does (Google just matches words in your question to popular pages where the same words show up). Try typing the example question into Google: when I did, the first hit was an entreaty for the world to help “the landlocked heart of Africa” and the second was actually a reference to Feigenbaum’s lecture. I’d tell you how long it took to get the right answer, except that I got sick of looking through the irrelevant documents somewhere on the sixth page of results.
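
For the curious, here is a toy version of what answering that question involves. This is plain Python imitating the idea, not SPARQL itself and not anything from Feigenbaum's slides. The two "sources", the predicate names, and the countries (with their populations) are placeholders I invented, so don't read the numbers as real data.

```python
# Two hypothetical data sources, each a list of (subject, predicate, object).
# Country names and population figures are placeholders, not real data.
geography = [
    ("Country A", "type", "Country"), ("Country A", "landlocked", True),
    ("Country B", "type", "Country"), ("Country B", "landlocked", False),
    ("Country C", "type", "Country"), ("Country C", "landlocked", True),
]
demographics = [
    ("Country A", "population", 20_000_000),
    ("Country B", "population", 50_000_000),
    ("Country C", "population", 3_000_000),
]


def landlocked_over(threshold, *sources):
    """Names of landlocked countries whose population exceeds threshold."""
    # Pull the sources together into one pool of triples.
    triples = [t for source in sources for t in source]
    countries = {s for s, p, o in triples if p == "type" and o == "Country"}
    landlocked = {s for s, p, o in triples if p == "landlocked" and o is True}
    # "population" is treated as a number, so it can be compared.
    populous = {s for s, p, o in triples if p == "population" and o > threshold}
    # A country must satisfy all three patterns at once.
    return sorted(countries & landlocked & populous)


print(landlocked_over(15_000_000, geography, demographics))  # ['Country A']
```

The answer only comes out right because both sources describe countries in a shared, structured way, which is the part a keyword search can't do.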


So it’s easy to see how SPARQL can be a great piece of technology. It’s also easy to see why SPARQL is a semantic web technology – it can only come up with answers if the information it’s looking at is written in a computer-understandable language – RDF in this case. One of the main things that gets people excited about the Semantic Web is its question-answering ability, and SPARQL is what’s going to make that possible.*

*Note: actually what Feigenbaum was talking about last night was SPARQL 2 – the next version of SPARQL that he’s helping to develop at the W3C. In the interests of space and your waning interest, I’m not going to outline the differences between SPARQL and SPARQL 2 – if you’re really concerned about it, take a look at Feigenbaum’s presentation slides yourself.