Semantic Embed: Part 3

Librarians are all about categorizing things. So is RDF. Are they a good match? That’s what I was thinking about at my third New York Semantic Web Meetup event…

The Librarian and RDF (Barbara McGlamery)

Barbara McGlamery, the first librarian of the evening, is actually an ex-librarian and, recently, an ex-ontologist for Time Inc (she just left Time for Martha Stewart). She talked about the application of the Semantic Web to Time’s online content, which follows an ontology-lite approach: 1) set up ontologies to define some rules and properties, 2) import them into taxonomies, where resources (like ‘Will Smith’) are described, and 3) create ‘navigational taxonomies’ so that editors and other people can access the information in whatever ways they want (for example, by alternate names). Whenever an editor publishes a new article, he or she manually tags it with all the relevant resources, which makes it possible for machines to do basic inferences – like noticing that you’re reading an article about Will Smith and recommending articles about movies he starred in, based on the RDF triple ‘Will Smith – leadPerformerIn – Hancock’. Which sounds great, except that the inferencing part didn’t work that well: McGlamery explained that the knowledge base just ended up being too heavy, so inferencing was slow and couldn’t be very complex.
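
To make the inference concrete: here’s a minimal sketch of the kind of graph-walking involved, written with the open-source Apache Jena library. This is my own illustration – I have no idea what Time’s actual stack looks like, and the ex: vocabulary (leadPerformerIn, taggedWith) is made up for the example.

```java
import org.apache.jena.rdf.model.*;

public class RecommendDemo {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ex = "http://example.org/";

        // Hypothetical vocabulary, loosely modeled on the talk's example
        Property leadPerformerIn = model.createProperty(ex, "leadPerformerIn");
        Property taggedWith = model.createProperty(ex, "taggedWith");

        Resource willSmith = model.createResource(ex + "WillSmith");
        Resource hancock = model.createResource(ex + "Hancock");
        Resource article = model.createResource(ex + "HancockReviewArticle");

        // The triple from the talk: Will Smith leadPerformerIn Hancock
        model.add(willSmith, leadPerformerIn, hancock);
        // An editor has manually tagged an article with the movie
        model.add(article, taggedWith, hancock);

        // You just read about Will Smith; find articles about movies
        // he starred in by walking the graph
        StmtIterator movies = model.listStatements(willSmith, leadPerformerIn, (RDFNode) null);
        while (movies.hasNext()) {
            Resource movie = movies.next().getObject().asResource();
            StmtIterator articles = model.listStatements(null, taggedWith, movie);
            while (articles.hasNext()) {
                System.out.println("Recommended: " + articles.next().getSubject());
            }
        }
    }
}
```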

I thought Time’s attempt to plunge into the Semantic Web was admirable (they were apparently very early adopters of the technology), but I couldn’t quite understand their reasons for it until it became clear that it was just-another-Old-Media story. Sure, Time was adopting innovative technology, but it was for decidedly non-innovative ends: as another means of control over their content. After her talk, I asked McGlamery why Time had even bothered with all this Semantic Web inferencing for their article-recommendation feature – why not just recommend articles that were popular with other readers like you? That’s how most recommendation engines work. McGlamery’s reply was that Time is a hundred-year-old company and therefore favors the ‘curatorial’ approach over the crowdsourcing one, which I think explains why ontologies looked so good to them. I’ve talked before about how ontologies are in some sense a form of control – though I think they can be used for great things, especially in the news business. The question is just whether Time is going about using them in the right way…

…which is something I won’t answer, but instead briefly describe:

Jon Phipps’ Rant about RDA

Actually, this story isn’t much different from the first one. Jon Phipps’ rant was also about old control-systems adjusting (or failing to adjust) to the new landscape of data and metadata. RDA stands for ‘Resource Description and Access’ and at this point consists of 1300+ pages intended to represent the collective wisdom of generations of catalogers. Phipps still thinks cataloging is worth doing (especially the informal kind that everyone does when they tag a photo on Flickr or bookmark a site on Delicious), but he was mostly frustrated by the inflexibility of legions of catalogers in transitioning from their old rules to new ones.

Quote of the evening (from an audience member): “You can’t even get people to use Excel in the public library system. RDF? Forget about it.”

Frag men tation

This is my last piece directly about the 8th International Semantic Web Conference, though I’ll continue to be inspired by the things I picked up there. My other two pieces were about ontologies and data visualization.

I was sorry to leave ISWC today – and not just because of my fondness for suburban Virginia. As a humanities-oriented undergrad in a crowd of expert scientists and researchers, I learned an astonishing amount in the past few days – certainly enough to spark my interest in continuing to learn and discuss more, if only I could find a counterpart to the community I just met somewhere online.

The problem is that the great discussions I witnessed or joined at ISWC are – online – fragmented into such specialized subgroups that they have no place for a beginner like me. What’s more, the subjects addressed are so narrow that they turn into conversations restricted to a few participants and many uninvolved witnesses (or, when they take place on listservs, many unread emails). Experts talking to experts – great for solving specific technical problems, horrible for sharing the general knowledge and thoughts that spark real insight.

What I thought was greatest about ISWC was that nearly everyone was an expert at something different – which meant that any discussion with a largish audience, or any happenstance encounter between two specialists (say, a natural language researcher and an RDF programmer), had to avoid the technical jargon of either specialty and instead frame everything with the best precision afforded by regular English (believe me, this is much more difficult than dipping into a pre-made vocabulary). Not only is that a good exercise for anyone who wants to understand her own subject area more clearly, it’s also the best way to discover parallels between disparate fields. Whereas a psychologist might have nothing to say about “distributed computing,” when talking about “parallel processing” he may actually turn out to be quite the expert. It’s almost a test of ontology-matching – suddenly finding out that the concept I represent as X in my specialty’s categorization structure is practically the same as what you happened to call Y in yours. What’s the significance of that? As David Karger and Jim Hendler stressed at Wednesday’s mentoring lunch, that discovery is usually one of the best conditions for creative insight. Science is well accustomed to seeing a person with a problem solve it by happening upon a person in another field who has the solution. All they needed was to run into each other.

But where is that uncontrolled, randomly-matching social space online? Online, communications function more as structured discussions than freewheeling conversations. In shedding their haphazardness, they lose a lot of creativity.

I should probably point out that this isn’t particularly a problem of the Semantic Web; fragmentation and increasingly self-selected groups are a byproduct of the web in general. It may become intensified by the Semantic Web – which decreases the randomness and facilitates intentional self-selection on the web – but it’s a problem for everyone. Only maybe a little more so for scientists, who have always tended to break into autonomous, self-referential subgroups of experts rather easily.

Specialized, self-selected communities reinforce their own ideas and biases and make insightful leaps more difficult, to the detriment of all fields. The web facilitates self-selection and thus fragmentation. What is to be done?

A Call for Data Visualization

This is my second post about the themes/problems with the Semantic Web, inspired by the three-day International Semantic Web Conference, which I’m attending. My first post was about Ontology Alignment. This post is about the critical need for ways to visualize the Semantic Web. I have no idea what my next post will be about.

Until people have some visual way of understanding the Semantic Web, it’s going to have an extremely difficult time breaking out of its self-selected, academic bubble.

That’s a problem, because even though technologists here understand it and have some great ideas about applying it to various industries, it won’t be until random people with unrelated, unforeseen problems really get the Semantic Web that there’s even a chance of that coveted aha! moment where it suddenly rises to prominence. (Note: whether that moment is going to happen – or should – is a matter of dispute.) I don’t mean that people need to be able to squeeze their eyes shut and imagine some tangled, shifting blob-thing that is the Semantic Web, though that would be nice – I’ve tried asking around for good metaphors for the Semantic Web (it’s just a giant database – well, more like a multi-dimensional database, if you can imagine that – or like that big computer they use in Minority Report – a ‘web of data,’ sure – but what does data look like?). It isn’t easy. But what we can see is semantic data, and if it can be presented in a clear, flexible, and – importantly – intuitive way (no SPARQL querying required for interaction!), then I think people would start to get what it’s all about.

I’ve seen lots of applications of Semantic Web data, but what I keep looking for are visualizations. The difference: applications show you data about some particular problem or in some particular context, whereas a good visualization would provide a much broader look at many different kinds of data that can be easily manipulated and viewed in as many creative, useful ways as possible. Tools like AllegroGraph and Simile come much closer to what I’m imagining, but they’re still not as broad and flexible as I’d like. I want something that I can point a friend to and say, “You’re looking at the Semantic Web! This is the data of the web, and these are the ways we have of linking it up so far. If it looks like a mess to you, try playing around until you’re looking at something important to you in a way that makes sense to you.”

Why aren’t applications good enough for this? Because:

1) they don’t look that different from regular mashups, so it’s hard for users to grasp the significance of the technology. From then on, they’ll think of the Semantic Web as something that solved some minor data integration problem, and won’t be able to imagine it in different contexts or solving different sorts of problems.

2) the people here aren’t going to come up with every potential application of Semantic Web data and – quite possibly – they won’t come up with the best applications. Someone else – less tech-savvy but more plugged into marketing or social networking or whatever – might be able to leverage the technology in a much better way, if only they understood its full power.

I’ve heard rumors (won’t say where) that a panel/workshop focusing on data visualization was turned down by the ISWC (apparently, this topic was also brought up at the Town Hall meeting, which I didn’t attend). At any rate, “interface” was among the terms most frequently associated with papers that were turned down, as they told us during Tuesday’s opening ceremony. I accept that, as a primarily academic conference, ISWC caters more to Semantic technologists/scholars than to industry-oriented people. But I feel strongly enough about the need for better visualization of the Semantic Web to argue that this is a mistake. It reflects the internally-oriented nature of the Semantic Web academic community, which could benefit greatly from outside perspectives (Tom Mitchell – who’s not a traditional Semantic Web guy – gave such an intriguing keynote this morning in part because he’s able to bring foreign ideas and solutions to the community). The Semantic Web movement is past ready to open itself up to the rest of the world, and making the Semantic Web into something everyone can see and understand is the first step.

Ontology Alignment (is not the SameAs but is CloselyRelatedTo) Reconciling Worldviews

For the next three days, I’ll be reporting from the 8th International Semantic Web Conference (ISWC), taking place near Washington DC. A lot of what’s going on here is very technical, so rather than repeat everything I’m hearing, I’m going to talk about the broader themes that I see emerging. After this conference, I may try to tie them together into one comprehensive post.

This is my first theme. It’s about ontology alignment but is nevertheless very interesting. Yes, actually, it really is.

An ontology is basically a taxonomy of concepts and categories and the relationships between them – it’s sort of like a network, but with inheritance (if I specify a property of some group, like “dogs can bark,” it carries down to everything within that group, so we know that Shih Tzus can bark). Ontologies are pretty key to the Semantic Web because expressing relationships between concepts is essentially defining those concepts – I could turn philosopher and argue that the meaning of something can only be found in the way it relates to other things. Or I could not, and just argue that defining things in terms of their relationships is a really useful way to do it, especially if the point is to make machines understand those things and be able to reason about them. That’s why a large percentage of the people here are obsessed with building ontologies about certain things (like jet engines).
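
Since inheritance is what makes an ontology more than a plain network, here’s a minimal sketch of it in code, written with the open-source Apache Jena library – my own toy example (the ex: URIs are made up), not anything from the conference:

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class InheritanceDemo {
    public static void main(String[] args) {
        Model base = ModelFactory.createDefaultModel();
        String ex = "http://example.org/";

        Resource dog = base.createResource(ex + "Dog");
        Resource shihTzu = base.createResource(ex + "ShihTzu");
        Resource gizmo = base.createResource(ex + "Gizmo");

        base.add(shihTzu, RDFS.subClassOf, dog);   // every Shih Tzu is a dog
        base.add(gizmo, RDF.type, shihTzu);        // Gizmo is a Shih Tzu

        // Wrap the raw triples in an RDFS reasoner so class membership
        // "carries down" the hierarchy automatically
        InfModel inf = ModelFactory.createRDFSModel(base);

        // Asserted nowhere, but inferred: Gizmo is a Dog (and so can bark)
        System.out.println("Gizmo is a dog: " + inf.contains(gizmo, RDF.type, dog));
    }
}
```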

But ontologies are personal. What if I think of “Shih Tzu” as a sub-category of “pets” but you think it belongs under “dinner proteins?” Or how about if a liberal defines a homosexual relationship as a type of family and a conservative thinks it belongs under sexual perversion? There’s no way the world would ever be able to agree on one definitive ontology. Nor should it. The way we categorize things, the way we cut up and connect up everything in the world is key to who we are, how we think, and what we do. I – an atheist and cognitive psychology nerd – would go so far as to say that the human soul exists in our subjective, idiosyncratic ways of linking up information. So to impose a single ontology on the whole world – no matter how well thought out and exhaustive it is – would be tantamount to mind control or soul stealing.

To their credit, most semantic technologists I’ve talked to think this way also. That’s why they’re encouraging ontologies to be fruitful and multiply and represent as many worldviews as there are ontology-builders (though ideally there would be more than 15 – I’m joking, I’m sure there are over 22 people who can build ontologies). But having a bunch of rival ontologies out there that define and categorize things in unique ways doesn’t sound like much of an organized system of data, right? That’s true, and that’s why a lot of other people are involved in aligning ontologies – matching up the instances of a concept that shows up in different ontologies.

But…they’re still not doing it that well. That’s something Pat Hayes brought up during his keynote this morning. His topic was “blogic” – the new form of (formal) logic that the web requires. One of his problems with using traditional logic for the web is that people are mapping instances between different ontologies using the relationship “SameAs” – even though the fact that they come from different ontologies means they’re clearly not the same as each other. People are usually aware of that, but there’s still not much they can do, because traditional logic gives them no “SortOfSameAs” or “SameAsInThisOneParticularWay” relationship to use instead.
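
To make the complaint concrete: in RDF terms, “SameAs” is the standard owl:sameAs property, and a mapping looks like the tiny Apache Jena sketch below (the URIs are just for illustration). Note how blunt the assertion is – all or nothing:

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.OWL;

public class SameAsDemo {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // The same person, described in two independent ontologies
        Resource dbpediaSmith = model.createResource("http://dbpedia.org/resource/Will_Smith");
        Resource otherSmith = model.createResource("http://example.org/people/WillSmith");

        // owl:sameAs asserts the two URIs denote exactly the same thing,
        // so every property of one is inherited by the other --
        // there is no "SortOfSameAs" to fall back on
        model.add(dbpediaSmith, OWL.sameAs, otherSmith);

        model.write(System.out, "TURTLE");
    }
}
```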

Ontology alignment is still a Big Problem, and it’s acknowledged as such by much of the Semantic Web community. If anyone knows of good solutions in the works, I’d love to hear about them – leave a comment and I’ll add to this post.

Semantic Embed: Part 2

This is my second post on an event by the New York Semantic Web Meetup, which covers all aspects of the W3C-recommended Semantic Web, from technology to business. An offshoot Meetup, which will focus more on natural language processing, computational linguistics, and machine learning, is supposed to start having meetings in January, and I plan to be there. See my first Meetup post here.

Semantic Web Programming – the book (John Hebeler)
The first slide in John Hebeler’s presentation last night had just one sentence: “Our ability to create information far exceeds our ability to manage it.” That’s actually the best and most succinct argument for the Semantic Web that I’ve heard thus far. Hebeler made his point more visceral by asking us to guess how many files there were on his MacBook (the answer: over a million, about twice as many as most of us guessed). Imagining that many files on every computer hooked up to the Internet (there were over 1.5 billion Internet users as of June 30) is already overwhelming. And the bigger this mass of information gets, the stronger its pull toward entropy and the more we lose control. It’s something that should scare us, Hebeler said, because all that information is only as useful to us as our tools for sorting through it; if we can’t find what we want, it’s the same as having lost it.

Luckily, Hebeler sees our salvation in the Semantic Web – or more specifically, in a highly flexible knowledge base that can handle both complex and simple types of data – and he’s co-authored the book to guide us there. It looks like it’s pretty easy to use: I’m not much of a programmer, but even I could follow the examples, all of which are demonstrated in the book using Java code. In trying to integrate data from, for example, Facebook and Gmail, which represent it in totally different formats, Hebeler gave us seven basic steps, or areas of code (a rough sketch of how they fit together follows the list):

1) Knowledge-base creation

2) How to query it – just a simple search

3) Setting up your ontologies

4) Ontology/instance alignment – combine two ontologies, for example by teaching your program that what one ontology calls an “individual” is the same thing as what the other calls a “person,” or that “Kathryn” is equivalent to “Kate”

5) Reasoner – your program won’t incorporate its new understanding of equivalencies until you apply the reasoner

6) OWL restriction – allows you to apply constraints

7) Rules – allows you to define your own inference rules
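
Here’s that rough sketch – my own compressed toy example of how steps 1–5 might look using the open-source Apache Jena toolkit, not code from the book (the URIs and names are hypothetical):

```java
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.OWL;

public class SevenStepsSketch {
    public static void main(String[] args) {
        String ex = "http://example.org/";

        // 1) Knowledge-base creation
        Model base = ModelFactory.createDefaultModel();

        // 3) Setting up your ontologies (two tiny, hypothetical vocabularies)
        Resource individual = base.createResource(ex + "facebook/Individual");
        Resource person = base.createResource(ex + "gmail/Person");
        Resource kathryn = base.createResource(ex + "Kathryn");
        Resource kate = base.createResource(ex + "Kate");

        // 4) Ontology/instance alignment: one ontology's "Individual" is
        //    the other's "Person", and "Kathryn" is the same as "Kate"
        base.add(individual, OWL.equivalentClass, person);
        base.add(kathryn, OWL.sameAs, kate);

        // 5) Reasoner: the equivalencies take effect only once it's applied
        OntModel kb = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, base);

        // 2) Query it -- just a simple search over the merged picture
        //    (steps 6 and 7, OWL restrictions and rules, are omitted here)
        String q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        try (QueryExecution qe = QueryExecutionFactory.create(q, kb)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
```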

He and the other co-authors also maintain a website where they field questions and add updates about the book.

Lucene (Otis Gospodnetic)

The Lucene presentation by Otis Gospodnetic was aimed primarily at programmers who might want to use Lucene for indexing and searching text. Lucene is actually just one piece of Apache Lucene, an Apache Software Foundation open-source project that includes other sub-projects like Nutch (a framework for building web-crawlers) and Solr (a search server). All of it, of course, is free, and since I’m not expert enough to vouch for any of it, I’d suggest checking out the Apache Lucene website where everything is available for download.
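
Since I can’t vouch for it myself, treat this as a rough sketch of what indexing and then searching look like in Lucene (the exact class names have shifted between versions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // an in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: tokenize a document's text and add it to the index
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "The Semantic Web Meetup served free pizza", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: parse a query and print the matching documents
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("pizza");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```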

Review of Jeffrey Stibel’s Wired For Thought: The Internet is a brain…kind of

Not “the Internet is like a brain” but “the Internet is a brain” – that’s the argument of scientist/entrepreneur Jeffrey Stibel in his recent book Wired For Thought: How the Brain is Shaping the Future of the Internet. Working off that premise, Stibel compares the Internet’s network of web-pages to the brain’s neural network, distributed computing software to mental parallel processing, and the development and future of the Net to the maturation process of the human brain. A fairly successful businessman with interesting, if somewhat sci-fi, credentials (he’s the chairman of BrainGate, a company that implants computer chips in people’s brains so that they can control electrical devices with their minds), Stibel provides a provocative and intelligent angle on the Internet – plus good background material on the brain – though his attempt to market the book as a business self-help resource wasn’t totally convincing. Still, just looking at the Internet from the perspective of brain science – which could be considered the earliest and most mature investigation of information architecture – could be so valuable that it’s in our best interest to listen to what Stibel has to say.

So is the Internet a brain?

Well…kind of. It’s not a carbon-based organ at the center of an animal’s nervous system. It’s not, as Stibel admits, going to become conscious or pursue its own desires, and it won’t have to interpret sensory perceptions and form an internal representation of the outside world. Which, really, is a lot of what the brain does. But whether or not the Internet actually is a brain is, I think, irrelevant. Stibel’s main argument for how the Internet is a brain is really about how the Internet grows like a brain. Like the brain, the Internet is the product of blind evolution, not intelligent design. And while that may seem like a small and relatively obvious point, it’s the most compelling one in his book, because it’s got huge implications for how we attempt to make sense of the Internet.

We’re All Blind Watchmakers

Who designs the Internet? All of us, which is to say, none of us. There are lots of people involved in designing apps, services, websites, and in coming up with new uses for the Internet, but the way these bits and pieces become integrated into it is through primitive natural selection. If a service or site works well, then – ideally, but also quite typically – people start using it and telling their friends about it, it shows up on Digg, then Google, others copy it, and poof! – it’s become a new part of the Internet. If the site doesn’t work well for any reason, it fades into obscurity and basically disappears. Of course, this ideal meritocracy is the same argument people have been using for free enterprise for a century – but on the Internet, we can also add the democratic nature of URLs (all sites are equally easy to access), a massive, extremely reactive audience, and incredible speed to the equation. In fact, everything happens so fast and there’s such immediate feedback on the Net that it makes less sense for technologists or advertisers to think seriously about their consumers and design their products accordingly than to just try something, or everything, and see what works. Stibel asks us to “imagine hundreds of thousands of variables and thousands of ad campaigns, all competing with one another to survive and flourish…The whole process goes so fast that often we don’t even know what is working or why.” Which sounds a lot more like natural selection than rational design.

So you could say that technologists or web entrepreneurs are designing the Internet, in that they design the pieces of it that sometimes work. Or you could say that everyone is designing the Internet through our behavior on it: every time we visit a site, tell our friends about it, or link between two sites. But if none of us understands completely how the Internet works, what will flourish or fail, or can predict what it will become, then you could also say that no one is designing the Internet. The Internet is out of our control.

Sound scary? Back in 1994, Kevin Kelly had already identified the network as “the icon of the 21st century” in his book Out of Control. He called it “the only organization capable of unprejudiced growth, or unguided learning…the banner of noncontrol…it conveys the logic both of Computer and of Nature – which in turn convey a power beyond understanding.” Kelly is excited, rather than scared, about the prospects of a new collective intelligence. But however you feel about it, the point seems to be that it’s no longer our choice; having set the grinding process of evolution upon an organism of our own creation, we can’t now reverse it. I wouldn’t want to anyway.

Why Messy Works

Reconceptualizing the Internet as a product of blind evolution is a surprisingly useful way to think about it. We can stop thinking of super-successful web CEOs as geniuses (though some of them are) and of their creations as perfect. The viral technologies that overwhelm the web can actually be less well-designed and perform more poorly than their mainstream competitors, as long as they fill some unique need of their users adequately well. Often, that need is one that the designers didn’t anticipate – as is the case with MySpace and Twitter, points out blogger Cody Brown. Studying a successful site’s overall design probably won’t help you succeed, because the reason for its success may have little to do with its design or performance. What’s important to remember, says Clayton M. Christensen in his book The Innovator’s Dilemma, is that technological evolution, like biological evolution, follows an S-curve: development consists of a lot of incremental improvements disrupted occasionally by sudden leaps that can come in the form of a lower-quality product that is either significantly cheaper (low-end disruption) or addresses a brand new need (new-market disruption). Christensen predicted what Wired reporter Robert Capps recently called The Good Enough Revolution – the explosion in cheap, low-end technology like the Flip camera or the MP3.

What this means is something that Web 2.0 champions already know: the most successful things on the web are usually messy. Take Google – it’s not smart enough to answer your questions, but does a good enough job by matching your queries to well-linked sites. Or Wikipedia – its open editing system means that we can never be sure that everything is true, but we can expect a high probability that it will be. Or Craigslist – an article in Wired, aptly titled Why Craigslist Is Such a Mess, decried its messy design and functionality, but had to admit that it’s still really, really successful. Stibel explains this triumph-of-the-messy as yet another parallel between the Internet and the brain – the brain also relies on messy mental shortcuts (like stereotypes, rules-of-thumb, and intuition) that are efficient and right about 95% of the time – but it’s just the natural outcome of evolution. The things that compete best, in the jungle or on the Internet, are not the ones that do a few things perfectly but the ones that do almost everything pretty well.

Reversing Entropy’s Arrow

Let’s look at our new Internet-organism in terms of the Semantic Web. To me, the success of a web based on messy shortcuts seems to run counter to the Semantic Web vision of a well-organized database-web. Our minds aren’t well-organized; we don’t have every relevant memory or piece of knowledge linked meaningfully together. They act much more like the Web 2.0 examples: full of idiosyncratic, personal, sometimes weird links that become reinforced the more often they’re used. The tangle that results from all that is… actually pretty organized. Or at least organized well enough. We usually find what we need because the neural links we use the most are the strongest and, more importantly, we’re free to come up with odd, human-like associations that sometimes lead to revelatory insights. Stibel explains that it’s the “loopy, iterative process…found in the brain” and the surprising connections we come up with during daydreams that make it as powerful as it is. The best mimics of the human brain so far don’t depend on machine computing but on learning from human behavior; a budding website called Hunch claims to help make decisions for you by correlating your personal data with that of hundreds of people like you.

Stibel praises the Semantic Web movement without addressing whether or not it’s a departure from the naturally evolving brain-like web. And it’s possible that the Semantic Web could add another layer of meaning to the web without disrupting the accumulation of all the non-semantic, nonsensical links that give rise to new ways of thinking. But it’s also possible that the web is headed toward a type of self-organization, based entirely on the behavior of its users and the patterns of links they unwittingly build with every click.

Semantic Embed: Part 1

In her quest to bring you the most authentic, up-to-date news about the evolution of the web, this reporter is venturing where few go: straight to the heart of NYC’s little-known Semantic Web community. It is there, buried in rule interchange formats and Unicode, hidden behind coke-bottle glasses and tablet PCs, that she hopes to find the people who are actually building the web.

Last night was my second New York Semantic Web Meetup event, so I knew a little more about what to expect (free pizza and liberal use of PowerPoint, unusually high Y to X chromosome ratio). The night was divided between two speakers: Mike Cataldo (CEO of Cambridge Semantics, which uses semantic web technology to solve businesses’ problems) and Lee Feigenbaum (a VP at Cambridge Semantics and co-chair of the W3C’s SPARQL working group – which I’ll explain later). It alternated between pretty heavy business-talk (“…and that’s game-changing!”) and tech-talk (“Supplant the mystifying OPTIONAL/!bound method of negation with a dedicated construct”), but here’s what I was able to scrape together:

Cambridge Semantics
So Cambridge Semantics provides “practical solutions for today’s business problems using the most advanced semantic technology” – what does that mean? Essentially, they make it easier to get the data a company needs out of the applications that keep it locked up. A cool feature is that they have a plug-in to use Microsoft Excel as both a source for the data and as an interface for looking at it.


Apparently, there are a couple companies using Cambridge Semantics technology now, including a biopharmaceutical firm in Belgium and a startup called Book Of Odds that calculates the odds of various everyday activities.

SPARQL

As Lee Feigenbaum told me later, if SPARQL is working the way it should, most people shouldn’t even know that it’s there. That said, it’s probably useful to know a little about this core Semantic Web technology, if only to get a better idea of what the Semantic Web might be capable of.


SPARQL is a query language – it’s built for asking questions (and getting back the right answers). It’s got to be able to ask questions that pull together data from a lot of different sources in new, complex ways. The example query Feigenbaum gave in his talk was: “What are the names of all landlocked countries with a population greater than 15 million?” To answer that question, SPARQL first has to know about words like “country” and “population” (that “population” is a property of “country,” for example, and that “population” should refer to a number) and then combine information from different databases to get the right answer. What SPARQL does, then, is a whole lot more powerful than what Google does (Google just matches words in your question to popular pages where the same words show up). Try typing the example question into Google: when I did, the first hit was an entreaty for the world to help “the landlocked heart of Africa” and the second was actually a reference to Feigenbaum’s lecture. I’d tell you how long it took to get the right answer, except that I got sick of looking through the irrelevant documents somewhere on the sixth page of results.
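
For the curious, here’s roughly what that question looks like when you hand it to a SPARQL engine – sketched with Apache Jena against a made-up geography vocabulary (real datasets name these properties differently, so treat everything here as illustrative):

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class LandlockedQuery {
    public static void main(String[] args) {
        // Assume countries.ttl holds RDF data in our hypothetical vocabulary
        Model data = ModelFactory.createDefaultModel().read("countries.ttl", "TURTLE");

        String q =
            "PREFIX geo: <http://example.org/geo#> " +
            "SELECT ?name WHERE { " +
            "  ?country a geo:Country ; " +          // match things typed as countries
            "           geo:landlocked true ; " +    // keep only landlocked ones
            "           geo:population ?pop ; " +
            "           geo:name ?name . " +
            "  FILTER (?pop > 15000000) " +          // population greater than 15 million
            "}";

        try (QueryExecution qe = QueryExecutionFactory.create(q, data)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
```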


So it’s easy to see how SPARQL can be a great piece of technology. It’s also easy to see why SPARQL is a semantic web technology – it can only come up with answers if the information it’s looking at is written in a computer-understandable language, RDF in this case. One of the main things that gets people excited about the Semantic Web is its question-answering ability, and SPARQL is what’s going to make that possible.*

*Note: actually what Feigenbaum was talking about last night was SPARQL 2 – the next version of SPARQL that he’s helping to develop at the W3C. In the interests of space and your waning interest, I’m not going to outline the differences between SPARQL and SPARQL 2 – if you’re really concerned about it, take a look at Feigenbaum’s presentation slides yourself.