The psychologists call it “deindividuation”. It’s what happens when social norms are withdrawn because identities are concealed. The classic deindividuation experiment concerned American children at Halloween. Trick-or-treaters were invited to take sweets left in the hall of a house on a table on which there was also a sum of money. When children arrived singly, and not wearing masks, only 8% of them stole any of the money. When they were in larger groups, with their identities concealed by fancy dress, that number rose to 80%. The combination of a faceless crowd and personal anonymity provoked individuals into breaking rules that under “normal” circumstances they would not have considered.
Deindividuation is what happens when we get behind the wheel of a car and feel moved to scream abuse at the woman in front who is slow in turning right. It is what motivates a responsible father in a football crowd to yell crude sexual hatred at the opposition or the referee. And it’s why under the cover of an alias or an avatar on a website or a blog – surrounded by virtual strangers – conventionally restrained individuals might be moved to suggest a comedian should suffer all manner of violent torture because they don’t like his jokes, or his face. Digital media allow almost unlimited opportunity for wilful deindividuation. They almost require it. The implications of those liberties, of the ubiquity of anonymity and the language of the crowd, are only beginning to be felt.
In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand-new search engine. We highlight techniques that improve the relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We finish with some thoughts on information organization for newly emerging data forms.