07 September 2011

How things are changing

A student in the 461 class did a starter dealing with translations. It led me to recall this item, and I thought it was worth sharing. It comes from a special report on Managing information in the Economist of 25 Feb 2010. I have highlighted the parts that the starter prompted me to recall.

How internet companies profit from data on the web

PSST! Amazon.com does not want you to know what it knows about you. It not only tracks the books you purchase, but also keeps a record of the ones you browse but do not buy to help it recommend other books to you. Information from its e-book, the Kindle, is probably even richer: how long a user spends reading each page, whether he takes notes and so on. But Amazon refuses to disclose what data it collects or how it uses them.

It is not alone. Across the internet economy, companies are compiling masses of data on people, their activities, their likes and dislikes, their relationships with others and even where they are at any particular moment—and keeping mum. For example, Facebook, a social-networking site, tracks the activities of its 400m users, half of whom spend an average of almost an hour on the site every day, but does not talk about what it finds. Google reveals a little but holds back a lot. Even eBay, the online auctioneer, keeps quiet.

“They are uncomfortable bringing so much attention to this because it is at the heart of their competitive advantage,” says Tim O’Reilly, a technology insider and publisher. “Data are the coin of the realm. They have a big lead over other companies that do not ‘get’ this.” As the communications director of one of the web’s biggest sites admits, “we’re not in a position to have an in-depth conversation. It has less to do with sensitive considerations like privacy. Instead, we’re just not ready to tip our hand.” In other words, the firm does not want to reveal valuable trade secrets.

The reticence partly reflects fears about consumer unease and unwelcome attention from regulators. But this is short-sighted, for two reasons. First, politicians and the public are already anxious. The chairman of America’s Federal Trade Commission, Jon Leibowitz, has publicly grumbled that the industry has not been sufficiently forthcoming. Second, if users knew how the data were used, they would probably be more impressed than alarmed.

Where traditional businesses generally collect information about customers from their purchases or from surveys, internet companies have the luxury of being able to gather data from everything that happens on their sites. The biggest websites have long recognised that information itself is their biggest treasure. And it can immediately be put to use in a way that traditional firms cannot match.

Some of the techniques have become widespread. Before deploying a new feature, big sites run controlled experiments to see what works best. Amazon and Netflix, a site that offers films for hire, use a statistical technique called collaborative filtering to make recommendations to users based on what other users like.

The technique they came up with has produced millions of dollars of additional sales. Nearly two-thirds of the film selections by Netflix’s customers come from the referrals made by computer.
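
As an illustration (not part of the Economist report), here is a minimal sketch of user-based collaborative filtering. The viewing data, the function names, and the choice of cosine similarity over sets of liked titles are all assumptions made for the example; nothing here describes how Amazon or Netflix actually compute recommendations.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical data: user -> set of titles that user liked.
likes = {
    "ann": {"Alien", "Blade Runner", "Brazil"},
    "bob": {"Alien", "Blade Runner", "Solaris"},
    "cat": {"Amelie", "Brazil"},
}

def similarity(a, b):
    """Cosine similarity between two sets of liked titles."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, likes):
    """Rank titles the user has not liked yet, weighting each title
    by how similar the users who liked it are to this user."""
    scores = defaultdict(float)
    for other, titles in likes.items():
        if other == user:
            continue
        weight = similarity(likes[user], titles)
        for title in titles - likes[user]:
            scores[title] += weight
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))

print(recommend("ann", likes))
```

In this toy data, "bob" shares two titles with "ann" and "cat" shares only one, so the title that only "bob" liked is ranked ahead of the title that only "cat" liked.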

EBay, which at first sight looks like nothing more than a neutral platform for commercial exchanges, makes myriad adjustments based on information culled from listing activity, bidding behaviour, pricing trends, search terms and the length of time users look at a page. Every product category is treated as a micro-economy that is actively managed. Lots of searches but few sales for an expensive item may signal unmet demand, so eBay will find a partner to offer sellers insurance to increase listings.

The company that gets the most out of its data is Google. Creating new economic value from unthinkably large amounts of information is its lifeblood. That helps explain why, on inspection, the market capitalisation of the 11-year-old firm, of around $170 billion, is not so outlandish. Google exploits information that is a by-product of user interactions, or data exhaust, which is automatically recycled to improve the service or create an entirely new product.

Vote with your mouse
Until 1998, when Larry Page, one of Google’s founders, devised the PageRank algorithm for search, search engines counted the number of times that a word appeared on a web page to determine its relevance, a system wide open to manipulation. Google’s innovation was to count the number of inbound links from other web pages. Such links act as “votes” on what internet users at large believe to be good content. More links suggest a webpage is more useful, just as more citations of a book suggest it is better.
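
As an illustration (not part of the Economist report), the "links as votes" idea can be sketched with the textbook power-iteration form of PageRank. The tiny link graph, the damping factor of 0.85, and the iteration count are assumptions for the example; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "c" collects the most inbound links and ends up ranked highest, and "a" benefits from being the one page that "c" votes for.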

But although Google’s system was an improvement, it too was open to abuse from “link spam”, created only to dupe the system. The firm’s engineers realised that the solution was staring them in the face: the search results on which users actually clicked and stayed. A Google search might yield 2m pages of results in a quarter of a second, but users often want just one page, and by choosing it they “tell” Google what they are looking for. So the algorithm was rejigged to feed that information back into the service automatically.

From then on Google realised it was in the data-mining business. To put the model in simple economic terms, its search results give away, say, $1 in value, and in return (thanks to the user’s clicks) it gets 1 cent back. When the next user visits, he gets $1.01 of value, and so on. As one employee puts it: “We like learning from large, ‘noisy’ data sets.”

Making improvements on the back of a big data set is not a Google monopoly, nor is the technique new. One of the most striking examples dates from the mid-1800s, when Matthew Fontaine Maury of the American navy had the idea of aggregating nautical logs from ships crossing the Pacific to find the routes that offered the best winds and currents. He created an early variant of a “viral” social network, rewarding captains who submitted their logbooks with a copy of his maps. But the process was slow and laborious.

Wizard spelling
Google applies this principle of recursively learning from the data to many of its services, including the humble spell-check, for which it used a pioneering method that produced perhaps the world’s best spell-checker in almost every language. Microsoft says it spent several million dollars over 20 years to develop a robust spell-checker for its word-processing program. But Google got its raw material free: its program is based on all the misspellings that users type into a search window and then “correct” by clicking on the right result. With almost 3 billion queries a day, those results soon mount up. Other search engines in the 1990s had the chance to do the same, but did not pursue it. Around 2000 Yahoo! saw the potential, but nothing came of the idea. It was Google that recognised the gold dust in the detritus of its interactions with its users and took the trouble to collect it up.
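
As an illustration (not part of the Economist report), the spell-check idea of learning corrections from what users type and then click can be sketched as simple counting over a query log. The log format, the `min_count` threshold, and the function names are hypothetical; the report does not describe Google's actual method in this detail.

```python
from collections import Counter, defaultdict

# Hypothetical log: (what the user typed, the query whose result they accepted).
log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "recipe"),
    ("teh", "the"),
    ("receive", "receive"),
]

def build_corrections(log, min_count=2):
    """Map each typed query to its most frequently accepted rewrite,
    keeping only rewrites seen at least min_count times."""
    counts = defaultdict(Counter)
    for typed, accepted in log:
        counts[typed][accepted] += 1
    corrections = {}
    for typed, accepted_counts in counts.items():
        best, n = accepted_counts.most_common(1)[0]
        if best != typed and n >= min_count:
            corrections[typed] = best
    return corrections

def suggest(word, corrections):
    """Return the learned correction, or the word unchanged if none is known."""
    return corrections.get(word, word)

corrections = build_corrections(log)
print(suggest("recieve", corrections))
```

With more log entries, rare or accidental rewrites stay below the threshold while common misspellings accumulate enough evidence to be corrected, which is the sense in which the results "soon mount up".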

Two newer Google services take the same approach: translation and voice recognition. Both have been big stumbling blocks for computer scientists working on artificial intelligence. For over four decades the boffins tried to program computers to “understand” the structure and phonetics of language. This meant defining rules such as where nouns and verbs go in a sentence, which are the correct tenses and so on. All the exceptions to the rules needed to be programmed in too. Google, by contrast, saw it as a big maths problem that could be solved with a lot of data and processing power—and came up with something very useful.

For translation, the company was able to draw on its other services. Its search system had copies of European Commission documents, which are translated into around 20 languages. Its book-scanning project has thousands of titles that have been translated into many languages. All these translations are very good, done by experts to exacting standards. So instead of trying to teach its computers the rules of a language, Google turned them loose on the texts to make statistical inferences. Google Translate now covers more than 50 languages, according to Franz Och, one of the company’s engineers. The system identifies which word or phrase in one language is the most likely equivalent in a second language. If direct translations are not available (say, Hindi to Catalan), then English is used as a bridge.
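
As an illustration (not part of the Economist report), the "most likely equivalent" lookup and the English bridge can be sketched with small phrase tables. The tables, their probabilities, and the function names are invented for the example, and picking the single best phrase at each hop is a simplification; a real statistical system would estimate such tables from large bodies of translated text.

```python
# Hypothetical phrase tables: source phrase -> {candidate: estimated probability}.
hindi_to_catalan = {}  # assume no direct data for this language pair
hindi_to_english = {"paani": {"water": 0.9, "rain": 0.1}}
english_to_catalan = {
    "water": {"aigua": 0.95, "mar": 0.05},
    "rain": {"pluja": 1.0},
}

def most_likely(table, phrase):
    """Return the highest-probability candidate for phrase, or None."""
    candidates = table.get(phrase, {})
    return max(candidates, key=candidates.get) if candidates else None

def translate(phrase, direct, to_english, from_english):
    """Use the direct table when it has an entry; otherwise bridge via English."""
    result = most_likely(direct, phrase)
    if result is not None:
        return result
    bridge = most_likely(to_english, phrase)
    if bridge is None:
        return None
    return most_likely(from_english, bridge)

print(translate("paani", hindi_to_catalan, hindi_to_english, english_to_catalan))
```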

Google was not the first to try this method. In the early 1990s IBM tried to build a French-English program using translations from Canada’s Parliament. But the system did not work well and the project was abandoned. IBM had only a few million documents at its disposal, says Mr Och dismissively. Google has billions. The system was first developed by processing almost 2 trillion words. But although it learns from a big body of data, it lacks the recursive qualities of spell-check and search.

The design of the feedback loop is critical. Google asks users for their opinions, but not much else. A translation start-up in Germany called Linguee is trying something different: it presents users with snippets of possible translations and asks them to click on the best. That provides feedback on which version is the most accurate.

Voice recognition highlights the importance of making use of data exhaust. To use Google’s telephone directory or audio car navigation service, customers dial the relevant number and say what they are looking for. The system repeats the information; when the customer confirms it, or repeats the query, the system develops a record of the different ways the target word can be spoken. It does not learn to understand voice; it computes probabilities.

To launch the service Google needed an existing voice-recognition system, so it licensed software from Nuance, a leader in the field. But Google itself keeps the data from voice queries, and its voice-recognition system may end up performing better than Nuance’s—which is now trying to get access to lots more data by partnering with everyone in sight.

Re-using data represents a new model for how computing is done, says Edward Felten of Princeton University. “Looking at large data sets and making inferences about what goes together is advancing more rapidly than expected. ‘Understanding’ turns out to be overrated, and statistical analysis goes a lot of the way.” Many internet companies now see things the same way. Facebook regularly examines its huge databases to boost usage. It found that the best single predictor of whether members would contribute to the site was seeing that their friends had been active on it, so it took to sending members information about what their friends had been up to online. Zynga, an online games company, tracks its 100m unique players each month to improve its games.

“If there are user-generated data to be had, then we can build much better systems than just trying to improve the algorithms,” says Andreas Weigend, a former chief scientist at Amazon who is now at Stanford University. Marc Andreessen, a venture capitalist who sits on numerous boards and was one of the founders of Netscape, the web’s first commercial browser, thinks that “these new companies have built a culture, and the processes and the technology to deal with large amounts of data, that traditional companies simply don’t have.”

Recycling data exhaust is a common theme in the myriad projects going on in Google’s empire and helps explain why almost all of them are labelled as a “beta” or early test version: they truly are in continuous development. A service that lets Google users store medical records might also allow the company to spot valuable patterns about diseases and treatments. A service where users can monitor their use of electricity, device by device, provides rich information on energy consumption. It could become the world’s best database of household appliances and consumer electronics—and even foresee breakdowns. The aggregated search queries, which the company makes available free, are used as remarkably accurate predictors for everything from retail sales to flu outbreaks.

Together, all this is in line with the company’s audacious mission to “organise the world’s information”. Yet the words are carefully chosen: Google does not need to own the data. Usually all it wants is to have access to them (and see that its rivals do not). In an initiative called “Data Liberation Front” that quietly began last September, Google is planning to rejig all its services so that users can discontinue them very easily and take their data with them. In an industry built on locking in the customer, the company says it wants to reduce the “barriers to exit”. That should help save its engineers from complacency, the curse of many a tech champion. The project might stall if it started to hurt the business. But perhaps Google reckons that users will be more inclined to share their information with it if they know that they can easily take it back.

04 September 2011

Just in case you hadn't seen this

CHAPEL HILL - Dr. Gary Marchionini, dean and Cary C. Boshamer Distinguished Professor at the School of Information and Library Science (SILS) at the University of North Carolina at Chapel Hill, has been selected to receive the Award of Merit, the highest honor presented by the American Society for Information Science and Technology (ASIS&T).

The award is "bestowed annually to an individual who has made a noteworthy contribution to the field of information science, including the expression of new ideas, the creation of new devices, the development of better techniques and outstanding service to the profession of information science."

The award will be presented by the president of ASIS&T during the annual meeting in October.

"Dr. Marchionini is more than deserving of this award," said Dr. Ben Shneiderman, professor, Computer Science and founding director of the Human-Computer Interaction Laboratory at the University of Maryland. "He has always thoughtfully provided intellectual leadership with broad theories and followed through by implementing working systems that provided inspiration for others. His work and his personal style are inspirational. He chooses meaningful paths for groundbreaking research that has impact. He works very hard, while engaging with people on a personal and human basis, a rare skill among academic superstars."

The award, which consists of an engraved Revere bowl and a certificate, includes an inscription that reads:

"Dr. Gary Marchionini is an internationally renowned distinguished professor who has contributed a lifetime of extraordinary accomplishments to the field of information science. He excels in a number of research areas including digital libraries; information seeking in electronic environments and interactive information retrieval; human-computer interaction and design; health information technologies; information policy; and, more recently, social media such as YouTube. His contributions have resulted in further development of thought, better techniques, and outstanding service to the field of information science through sharing the results of his substantial research throughout the world.

"Gary has published more than 200 articles, book chapters and technical reports on these research topics as well as publishing results of his research on the usability of personal health records, multimedia browsing strategies, personal identity in cyberspace and other areas of research. Several of his publications have been cited hundreds of times. He continuously shares the results of his research at home and around the world, most recently as an invited presenter of the prestigious Ranganathan Lectures in Bangalore, India (three lectures). Earlier this year, Gary was appointed to serve on the President's Council of Advisors on Science and Technology (PCAST) Health Information Technology (HIT) Report Workgroup.

"Through a combination of research, teaching, and service to the community, Gary has demonstrated his passion for improving the ways in which people use computers to find and use the information they need. At every step, he has demonstrated that he is an expert in this field of information science, standing above others by envisioning a need, and then attacking problems with fervor and an enthusiasm unlike most researchers. He focuses on the impact of his work and reaches for the ultimate benefit to users of the projects and products of his efforts, changing the world for the better."

Marchionini has served as dean of SILS since April 1, 2010. A faculty member since 1998, he heads the school's Interaction Design Laboratory. He serves on the Campus Research Computing Committee and has led or assisted in leading numerous campus initiatives since arriving at Carolina. He was nominated by his students and selected as the school's Outstanding Teacher of the Year in 2009, and he received the prestigious Faculty Award for Excellence in Doctoral Mentoring for the UNC at Chapel Hill campus in 2010.

Marchionini served as president of ASIS&T in 2009. He chaired the National Institutes of Health/National Library of Medicine's Biomedical Library and Informatics Review Committee, and he previously was editor-in-chief of the Association for Computing Machinery's (ACM) "Transactions on Information Systems" from 2002 to 2008. He has served on more than a dozen editorial boards and is editor of the Morgan-Claypool book series, "Information Concepts, Retrieval and Services." He is also the author of "Information Seeking in Electronic Environments," and "Information Concepts: From Books To Cyberspace Identities."

In addition to his impressive publishing record, Marchionini has been awarded numerous grants from the National Science Foundation and other foundations, as well as research awards from companies including Microsoft, IBM and Google. His book "Information Seeking in Electronic Environments" is part of a Cambridge University Press series.

Marchionini earned a doctorate in curriculum development, focusing on mathematics education, in 1981, and a master's degree in secondary mathematics education from Wayne State University in 1974. He graduated with a bachelor's degree in mathematics and English from Western Michigan University in 1971.

Before arriving at UNC, he was a faculty member at the University of Maryland for 15 years. He served on the faculty and as a researcher at Wayne State from 1978 to 1983 and taught mathematics at the East Detroit Public Schools for seven years.

Seen in my email


You're invited to a talk at Duke University on October 4 by Michael Nielsen, who will be speaking on Doing Science in the Open. Nielsen is one of the pioneers of quantum computation, and recently has been working on a book called Reinventing Discovery and advocating for a more open scientific culture. In this talk, he will be discussing the history of scholarly communication and collaboration, some possibilities enabled by new technologies, and ideas on how to change the culture of science and scholarship to make them more open and collaborative.
 
The talk will be in Love Auditorium in the LSRC (Levine Science Research Center) on Duke's West Campus at 4pm on Tuesday, October 4, and will be followed by a reception just outside the auditorium. You can find more information at http://bit.ly/nielsen-oct4 and in the longer abstract/bio below. This talk is open to the public and we have a large venue, so please share information about it with anyone you think may be interested.
 
 
TITLE
Doing Science in the Open
 
ABSTRACT
The net is transforming many aspects of our society, from finance to friendship.  And yet scientists, who helped create the net, are extremely conservative in how they use it.  Although the net has great potential to transform science, most scientists remain stuck in a centuries-old system for the construction of knowledge.
 
The talk is in two parts.  In the first part, I describe some striking leading-edge projects that show how online tools can radically change and improve science.  And in the second part I discuss why these tools haven't spread to all corners of science, and how we can change that.
 
In the first part, we'll see how mass online collaboration is being used by some of the world's top mathematicians to solve challenging mathematical problems.  These collaborations use online tools to dramatically amplify a group's collective intelligence, and so expand our capacity to solve problems at the limit of human problem-solving ability.
 
I'll also describe how online citizen science projects are enabling amateurs to make scientific discoveries.  There were early attempts to do this in the 1990s and 2000s, with projects such as SETI@Home and Clickworkers.  But while intriguing, these projects produced limited scientific outcomes.  I'll describe a second wave of citizen science projects that live up to the early promise, and which are producing a stream of important scientific discoveries.
 
These examples illustrate some of the ways the net can change science. In the second part of the talk I discuss the major cultural barriers that inhibit scientists from using or developing new tools.  We'll see that scientists have strong incentives to keep their best ideas and data secret, hoarding them against the possibility of future journal publication.  I'll describe how we can create a much more open scientific culture, one that will truly make the net work for science.
 
BIO
Michael Nielsen is an author and an advocate of open science. His book about open science, Reinventing Discovery, will be published by Princeton University Press in 2011.  Prior to his book, Michael was an internationally known scientist who helped pioneer the field of quantum computation.  He co-authored the standard text in the field, and wrote more than 50 scientific papers, including invited contributions to Nature and Scientific American.  His work on quantum teleportation was recognized in Science Magazine's list of the Top Ten Breakthroughs of 1998. Michael was educated at the University of Queensland, and as a Fulbright Scholar at the University of New Mexico. He worked at Los Alamos National Laboratory, as the Richard Chace Tolman Prize Fellow at Caltech, was Foundation Professor of Quantum Information Science and a Federation Fellow at the University of Queensland, and a Senior Faculty Member at the Perimeter Institute for Theoretical Physics. In 2008, he gave up his tenured position to work fulltime on open science.
 
 
Gary Marchionini, PhD
Dean and Cary C. Boshamer Professor
100 Manning Hall
School of Information and Library Science
Chapel Hill, NC  27599
"esse quam videri"....To be rather than to seem.  North Carolina State motto

02 September 2011

Finding ways to use human computation power

Jessica's starter on Thursday reminded me of something. This is a Google talk from five years ago, but the ideas in it are still valid. I'll add the discussion of the talk and embed the video. If it interests you, or the concept of Google Talks interests you, you might want to look at some of the other talks as well.

If you look around at other talks, you might also find one by one of our alums, given at a time when he was still in the PhD program here.

Google TechTalks July 26, 2006

Luis von Ahn is an assistant professor in the Computer Science Department at Carnegie Mellon University, where he also received his Ph.D. in 2005. Previously, Luis obtained a B.S. in mathematics from Duke University in 2000. He is the recipient of a Microsoft Research Fellowship.

Tasks like image recognition are trivial for humans, but continue to challenge even the most sophisticated computer programs. This talk introduces a paradigm for utilizing human processing power to solve problems that computers cannot yet solve. Traditional approaches to solving such problems focus on improving software. I advocate a novel approach: constructively channel human brainpower using computer games. For example, the ESP Game, described in this talk, is an enjoyable online game -- many people play over 40 hours a week -- and when people play, they help label images on the Web with descriptive keywords. These keywords can be used to significantly improve the accuracy of image search. People play the game not because they want to help, but because they enjoy it. I describe other examples of "games with a purpose": Peekaboom, which helps determine the location of objects in images, and Verbosity, which collects common-sense knowledge. I also explain a general approach for constructing games with a purpose.
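
As an illustration (not part of the talk abstract), the labeling step of a game like the ESP Game can be sketched as an agreement check: two players who cannot see each other's guesses describe the same image, and a word becomes a label only when both have typed it. The pairing of two players and the list of "taboo" words already known for an image are assumptions drawn from how the game is usually described, and the function name and the rule of scanning player A's guesses in order are simplifications.

```python
def agreed_label(guesses_a, guesses_b, taboo=()):
    """Return the first word from player A's guesses that player B also
    guessed and that is not a taboo word, or None if there is no match.
    Comparison ignores letter case."""
    seen_b = {guess.lower() for guess in guesses_b}
    banned = {word.lower() for word in taboo}
    for guess in guesses_a:
        word = guess.lower()
        if word in seen_b and word not in banned:
            return word
    return None

# Hypothetical round: "dog" is already a known label for this image, so it is taboo.
label = agreed_label(
    ["Dog", "puppy", "grass"],
    ["animal", "puppy", "dog"],
    taboo=["dog"],
)
print(label)
```

Requiring agreement between independent players is what makes the collected keywords trustworthy, and banning words that are already known pushes players toward new, more specific labels.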

01 September 2011

Saw this up in the Davis stacks...


Have a good weekend everybody...


Starter from 09/01/2011

http://video.news.com.au/2098053756/App-turns-chocolate-bars-into-games

New way of smart phone scanning.

-Brian Z

developerWorks: a web development resource

If you're interested in any aspect of web development, IBM's developerWorks site is a great resource. Most of the content is targeted to experienced IT professionals, but there are also a variety of articles and tutorials for beginners.

For example, if today's discussion on Linux made you want to try it for yourself, you could check out the Linux technical topic or the series that helps new Linux users learn basic tasks.

Here are a few other topics on dW that might be interesting and relevant to this course: