Considering Web Classification

For those with a more traditional background in library science, or simply with experience in cataloging departments, I think it may be too easy to feel that cataloging has to be a manual process, controlled by the human cataloger. This may be the case with books, because they have physical dimensions and cataloging-in-publication data that need to be entered into a cataloging system, through either copy cataloging or original cataloging. Moreover, some libraries may take the liberty of adding subject headings to cataloging records that meet the criteria of their own hand-selected collections. However, web resources are a different beast. Classifying web resources can seem like a daunting task because there is such a proliferation of content on the Web, including not just static webpages, but blogs, wikis, and videos. The discussion of cataloging web resources once revolved around deciding how to classify just webpages, but now it is a question of classifying web content, which relies increasingly on metadata standards like Dublin Core. The Dublin Core Metadata Initiative defines elements that capture not only standard bibliographic attributes, but also those suited to describing web resources, such as creator(s), format, type of resource, etc.
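To make the Dublin Core reference concrete, here is a minimal sketch in Python of how a web resource might be described with it. The fifteen element names are those of the Dublin Core Metadata Element Set; the sample record, the author name, and the to_meta_tags helper are invented for illustration, and the "DC." prefix in HTML meta tags is a common convention rather than the only way to embed the elements.

```python
from html import escape

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_meta_tags(record):
    """Render a record as HTML <meta> tags, one tag per element value."""
    tags = []
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError("not a Dublin Core element: " + element)
        for value in values:
            # Every element is repeatable, so each one holds a list of values.
            tags.append('<meta name="DC.%s" content="%s">' % (element, escape(value)))
    return tags


# Hypothetical record for a blog post; all values are made up.
blog_post = {
    "title": ["Considering Web Classification"],
    "creator": ["Example Author"],
    "type": ["Text"],
    "format": ["text/html"],
    "language": ["en"],
}

for tag in to_meta_tags(blog_post):
    print(tag)
```

Because every element is optional and repeatable, a record like this can be as sparse or as rich as the resource allows, which is part of why the scheme suits heterogeneous web content.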

I think for a while now we have seen a move away from Library of Congress Classification (LCC) and Dewey Decimal Classification (DDC), especially with regard to classifying the semantic web. In fact, I have not seen any earnest discussion of applying these classification schemes to web resources. The two projects that earnestly tried to apply LCC or DDC were the CyberStacks Project out of Iowa State University and OCLC’s NetFirst. These projects seem all but dead now. I think the reason is that applying the alphanumeric codes of LCC and DDC relies on human matching of subject disciplines, which is just too much of a Sisyphean task when it comes to Web resources. Nor can the task simply be handed to software: it is still too difficult for artificial intelligence and machine learning to pin down subject disciplines based on keyword analysis. That being said, we are not without commercial software to aid in the classification of web resources. There are automated tools which index just about anything they are programmed to index, like web-based keywords or metatags.

These tools make the bibliographic management of the web possible. Bibliometric mapping of the Web can produce large databases of indexed material, which puts the Internet in the cross-hairs of catalogers. So ideally, the best “system” to classify Web materials is to use the many tools that are available to digital librarians which allow for taking bibliographic snapshots of the Web, such as webcrawlers designed for the purpose.
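As a rough illustration of what such a "bibliographic snapshot" could involve, the sketch below uses Python's standard html.parser module to pull the title, keywords, and description out of a page's markup. The MetaTagIndexer class, the snapshot function, and the sample page are all hypothetical; a real crawler would also need to fetch pages, follow links, and respect robots.txt, none of which is shown here.

```python
from html.parser import HTMLParser


class MetaTagIndexer(HTMLParser):
    """Collect the <title> text and named <meta> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if tag == "meta" and name and content:
            self.meta[name.lower()] = content
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def snapshot(html):
    """Return a small descriptive record for a page."""
    parser = MetaTagIndexer()
    parser.feed(html)
    parser.close()
    keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "description": parser.meta.get("description", ""),
    }


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<title>Notes on Participatory Archives</title>
<meta name="keywords" content="archives, memory, Web 2.0">
<meta name="description" content="Reading notes on archives as social spaces.">
</head><body><p>Body text.</p></body></html>"""

print(snapshot(SAMPLE_PAGE))
```

The record this produces is only as good as the metadata the page author supplied, which is where the cataloger's judgment still comes in.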

As far as the ephemeral nature of the Web goes, I do not think LIS professionals need to concern themselves too much with cataloging Web material that eventually disappears due to link rot. Canonical webpages, or webpages whose content is sponsored, will provide enough material for catalogers to work on. I see this as being no different from cataloging books that have gone through the publication process. There has always been a certain authority that measures bibliographic worth. Of course, I am aware of the implications of leaving out self-created folk content. But the original purpose of cataloging was to capture the whole of knowledge as nearly as possible, and there is enough information out there to catalog, in print form and on the Web, in order to accomplish this objective.

At any rate, indexing the semantic web through the use of automated products produces large and numerous digital libraries. My ideal system for classifying web resources would begin with a greater emphasis on this endeavor, and would also apply useful digital tools to aid the cataloger in matching content to a knowledge base.

The Infinite Archive

Eric Ketelaar’s paper, Archives as Spaces of Memory, struck me as an important contribution to the paradigmatic postmodern literature on archives. Ketelaar’s paper is divided into two main sections. In the first section, he discusses the differences between legal records and archival records. This discussion is framed by an interesting contextual history of the Nuremberg trials. The second section of Ketelaar’s paper focuses on the concept of Archives 2.0, in which the use of Web 2.0 technologies such as “annotation systems, wikis, clusters of blogs, social network visualisations, social recommender systems, and new ways of visualising conversations…” (18) can enliven the use and impact of archives on society. Throughout the paper, Ketelaar’s thesis remains clear. He argues that archival records – when opened up to a community for participatory interaction – can strengthen communal bonds which invariably heal societies that have undergone a traumatic experience or sequence of traumas.

When discussing the Nuremberg trials, Ketelaar argues that the law itself, even the successful service of justice through and by the law, is not enough to bring closure to the victims of an atrocity. He quotes Dutch psychologist Nico Frijda, who says: “the past for victims and survivors, and their families, is ‘unfinished business’: they go on searching for meaning how the humiliations, the cruelties, the systematic destruction could have come about” (13). In other words, when the trial is over, the perpetrators of a crime are dealt with accordingly by the justice system, but the memory of what happened, the trauma, continues to affect the victims. The courts, however, are impartial and unemotional, and as far as they are concerned, when guilt has been proven and criminals are sentenced, there is nothing left for them to do. Indeed, legal records in a trial are meant to be used by the prosecutors to serve an objective, finite end. Once the case is closed, the records are sealed away. As Ketelaar writes, “[t]he law aspires to a degree of finality, that neither History nor Memory does” (11).

Ketelaar’s conception of the “infinite archive” suggests that records are meant to be used ad infinitum for purposes that are restorative and creative. He says that “[a] record is never finished, never complete, the record is ‘always in a process of becoming’” (12). This is the main difference between the two record groups as discussed by Ketelaar. He would likely maintain that legal records are stale things which, while very important in their own right and certainly capable of being archived, are not infinitely archival. According to Ketelaar, archives can heal trauma(s) because the records contained within have the power to serve what he refers to as “memory-justice” (13). Indeed, archival records, unlike legal records, can be used or “activated” by the victims of history. They can be tapped for their healing powers by victimized or marginalized groups of people. Legal records cannot.

I think this is an important consideration. Knowing that archival records can be used as therapeutic resources, it becomes imperative to discover new and effective ways of providing access to archives. This is why Ketelaar shifts in his discussion to talk about Archives 2.0. By now, it is obvious that new media and social networking have produced novel ways of engaging in cultural modes of thought and creation. Ketelaar brings up some important concepts in this section such as “parallel provenance” and “co-creatorship.” In terms of archives, these concepts support the Records Continuum Model of Frank Upward. Ketelaar writes, “the social and cultural phenomenon of co-creatorship entails a shift of the traditional paradigm of the organic nature of records and the principle of provenance” (15). Participatory archives are important, then, for the reasons mentioned above. Releasing the fixity of archives allows for the process of re-creation and reconciliation, which is vital for the health of society. As emotional fixity can result in depression and dissociation from society, participatory archives can only be a good thing. Still, there are problems inherent in releasing archives for public use and activation. For instance, Archives 2.0 makes it harder to ensure data protection, consent, and privacy. Ketelaar does admit that “[t]his needs a new generation of access policies, tools and practices, less collection driven, but directed towards archives as social spaces and records as social entities” (18). So despite the altruism Ketelaar exhibits in his call to release the archives, one can sense that new traumas could emerge in these social spaces.

Beginning thoughts on IR systems

Following the logic of Zavalina and Vassilieva in Understanding the Information Needs of Large-Scale Digital Library Users (2014), I think information retrieval (IR) systems should be informed by the information-seeking behaviors of the user community. This ensures that the IR system is designed with the users in mind and that the main purpose of the system is to help users meet their information needs. As a principle of design, this is also necessary if the system is to have a democratizing effect. You want to have an IR system that empowers users, allowing them to easily navigate the interface and satisfy their needs through an intuitive and smart system. This seems pretty much like the ideal.

But saying an IR system should be “informed” by user behavior is different from saying that an IR system should “adapt” to user behavior. The former presupposes that the IR system designers understand and can predict the searching habits of individuals. They would then try to accommodate a wide range of user search styles through the implementation of useful tools, like relevance rankings or context help. Adapting a system around users, however, means that the IR system you would get would look like something akin to Google, where popularity and site traffic dictate what will be optimized.

Of course, it is no secret among LIS professionals that search skills among the general population suffer from a lack of information literacy and of specific knowledge about IR systems and how they process user-entered keywords. Khapre and Basha, in A Theoretical Paradigm of Information Retrieval in Information Science and Computer Science (2012), mention the principle of least effort. While the idea inherent in the principle of least effort is, from the design perspective, meant to optimize retrieval based on limited user knowledge, the phenomenon of least effort in information-seeking behavior is still problematic. In a matching program, where a user comes up with a query which is analyzed and matched to documents by organized keywords, broad and unfocused keywords will yield fuzzy search results.
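The effect of broad keywords on a matching program can be shown with a toy example. The sketch below is not how any particular IR system works; the four-document index, the match function, and the choice of which document counts as relevant are all assumptions made for illustration. It matches a query against keyword sets and reports precision, the share of retrieved documents that are actually relevant.

```python
# Toy index: document id -> set of index keywords (all invented).
INDEX = {
    "doc1": {"archives", "memory", "trauma"},
    "doc2": {"archives", "digital", "preservation"},
    "doc3": {"cataloging", "marc", "records"},
    "doc4": {"archives", "records", "provenance"},
}


def match(query_terms):
    """Return ids of documents whose keywords contain every query term."""
    terms = {t.lower() for t in query_terms}
    return sorted(doc for doc, keywords in INDEX.items() if terms <= keywords)


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)


# Suppose the user actually wants material on archival provenance.
relevant = {"doc4"}

broad = match(["archives"])
focused = match(["archives", "provenance"])

print(broad, precision(broad, relevant))
print(focused, precision(focused, relevant))
```

The one-word query retrieves three documents of which only one is wanted, while the two-word query retrieves just that one. The least-effort searcher gets the first result set and has to do the filtering by eye.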

Therefore an IR system cannot adapt to users without sacrificing its capacity for precision. An IR system must be able to handle very specific intellectual queries at a very granular level. I think this tension poses a central dilemma in the field of information retrieval and access. Indeed, there is a lot of cognitive dissonance between “man and machine,” as it were. User expectations are way too high. People have become spoiled by the ease of performing Google searches and obtaining instant results for whatever research requirements they have. But I think it is important to realize that IR systems are sophisticated tools that require a sophisticated understanding of how to use them. In their article, Khapre and Basha point out that technology can change our thoughts and, importantly, that “technology is making it difficult for users to recognize that it is external, known only to the simple ‘interface value.’” This concept of interface value is an important one in human-computer interaction, because users have expectations of the IR system which they take at “interface value.” But they are completely ignorant of the internal coding of the IR system, which is considerably complex and based on algorithmic science that usually escapes the end user’s interest or opportunity for study.

The Organization of Information

Taylor and Joudrey (2012) concluded their book, The Organization of Information, by stating that there is much work to be done in both information organization and the development of retrieval systems. With the diffusion of information in today’s world, the effort to analyze, arrange, classify, and make readily available millions of resources is a task that requires sophisticated programming of bibliographic networks, as well as endless hours of critical and analytical work from trained catalogers or indexers. Taylor and Joudrey showed that, despite advances in technology, the human mind is still needed to interpret a myriad of information resources by providing subject analysis, controlled vocabulary, and classification schemes to the descriptive practice of knowledge management.

We have now witnessed almost two centuries of bibliographic control, with many of the foundational principles of cataloging and description still in use today. For example, collocation (the intellectual and proximal ordering of bibliographic materials) was an invention of nineteenth-century founders such as Anthony Panizzi, Charles Ammi Cutter, and Melvil Dewey. These individuals saw the importance of creating subject headings and classification rules, which libraries adopted shortly thereafter in the form of dictionary catalogues, indexes, thesauri, and subject lists. The goal of these systems was to classify the entirety of all knowledge. This all started with the Dewey Decimal Classification system, which had ten main discipline classes with 10,000 subdivisions in which books could be classified. Cutter built on this approach with his Expansive Classification system, which used letters to represent subject classes. Cutter’s system ultimately found its way into the Library of Congress Classification system, rather to the chagrin of Dewey.

The development of computerized systems to aid in the structuring and retrieval of knowledge occurred in the late 1960s. Machine-Readable Cataloging (MARC) was introduced in 1968. MARC formatting allowed computers to read and encode bibliographic records by utilizing a numeric coding system that corresponded to the areas of description in a written catalog record. These codes contained “variable fields” for areas of variable length (such as a book title or author name); “control fields” for numeric data points (call numbers, ISBNs, LCCNs, etc.); and “fixed fields” for bibliographic data of a predetermined length and format, such as a three-letter language abbreviation.
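A simplified sketch may help show the difference between variable-length and fixed-length data in a MARC record. The tags used below (008, 020, 100, 245) are standard MARC 21 field tags, and in the 008 field for books the language code occupies character positions 35 to 37. Everything else is an assumption for illustration: the dictionary layout is not the actual MARC exchange format, the 008 string is made up, and the ISBN is a placeholder.

```python
# A made-up 40-character 008 field, assembled piece by piece so the
# character positions are easy to verify.
FIELD_008 = (
    "120101"      # 00-05: date entered on file (placeholder)
    + "s"         # 06:    type of date (single date)
    + "2012"      # 07-10: date 1
    + " " * 4     # 11-14: date 2 (blank)
    + "xxu"       # 15-17: place of publication
    + "|" * 17    # 18-34: material-specific positions, not coded here
    + "eng"       # 35-37: language code
    + " "         # 38:    modified record
    + "d"         # 39:    cataloging source
)
assert len(FIELD_008) == 40

# Simplified record: tag -> fixed string, or tag -> {subfield code: value}.
record = {
    "008": FIELD_008,
    "020": {"a": "0000000000"},                       # placeholder ISBN
    "100": {"a": "Taylor, Arlene G."},                # main entry, personal name
    "245": {"a": "The organization of information"},  # title statement
}


def language_code(rec):
    """Read the fixed-position, three-letter language code from field 008."""
    return rec["008"][35:38]


def subfield(rec, tag, code):
    """Read one subfield from a variable-length field, or None if absent."""
    field = rec.get(tag)
    if isinstance(field, dict):
        return field.get(code)
    return None


print(language_code(record))
print(subfield(record, "245", "a"))
```

The fixed field is read by position, so it only works because every record agrees on where the language code sits; the title, by contrast, can be any length because it is addressed by tag and subfield code.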

Bibliographic networks were built to accommodate the MARC format. The first major network to emerge was the Ohio College Library Center, which morphed into OCLC (Online Computer Library Center), still in use today. OCLC allows catalogers to import bibliographic records from a shared network of libraries and information resource centers. This importing is referred to as copy cataloging. A cataloger will add an already-cataloged record to their system, engaging in authority work by ensuring the record was copied from a reliable source like the Library of Congress authority files. Almost all public and academic libraries use OCLC, and this system has streamlined the work of cataloging in technical service departments. But it is important to note that this technology is almost fifty years old now. There are nascent trends in the world of information science that go beyond the reach of time-honored bibliographic networks.

The classical arrangement of knowledge mentioned above was based on a narrow set of information resources, primarily books. But not all resources that users need to be able to search and retrieve are biblio-centric. For example, an information seeker may need to find an artifact. Knowledge artifacts are as varied as the name implies. They can include sound recordings, historical objects, websites, performance art pieces, even concepts. This last example of “concepts” perhaps best illustrates the point. Indeed, a knowledge artifact can be purely conceptual or abstract in nature. Yet, as an artifact, it still needs to be described and collocated for information retrieval. This is done through a “technical reading” of the artifact: a process of critical analysis whereby the cataloger or indexer attempts to define the aboutness of a work.

The process of defining aboutness, referred to as subject analysis by Taylor and Joudrey, is at the heart of information organization. Subject analysis is arguably the most important part of cataloging work, and it is certainly the trickiest. In order to determine the aboutness of a work, the cataloger must be able to accurately represent a knowledge artifact. But the artifact in question might not contain any lexical content. In other words, it may be a nontextual information resource, and thus intellectually inaccessible without the creator’s original insight. Yet, as a cultural information resource, the knowledge artifact still has meaning, which requires it to be abstracted and indexed. How is this to be done? Well, there is still debate among LIS professionals regarding the best practices for subject analysis. The common practice is to isolate subject keywords in an aboutness statement. However, aboutness statements impose the cataloger’s perceptions onto a work, classifying the artifact in a hierarchical manner which may not be culturally precise. Herein lies the danger of subject analysis.

This creates a dilemma for the classification of knowledge artifacts. For instance, in order to make an information resource readily retrievable, controlled vocabulary is required. A controlled vocabulary is a set of specific terms used for describing all “like” resources. But, as we have seen, describing knowledge artifacts can be difficult. Indeed, sometimes during subject analysis, the cataloger can only describe the of-ness of an artifact (Taylor & Joudrey, 309). As a general rule, controlled vocabulary makes it easier to find resources in an information system. But if an original cataloger incorrectly represents a knowledge artifact, any surrogate record for that artifact will inevitably misrepresent it as well. Surrogate records can number into the hundreds of thousands. So if the goal of bibliographic networks is to create standardized subject headings in an interoperable system, then hundreds of thousands of inaccurate records could be created. Conversely, if controlled vocabulary is not used in the representation of a knowledge artifact, then that artifact will be made all but impossible to retrieve in an information system. This is the dilemma of subject analysis.
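Both sides of this dilemma can be seen in a small sketch. The mapping below sends variant terms to a single authorized heading so that like resources are collocated; the headings shown are in the style of Library of Congress Subject Headings, but the table, the function names, and the sample records are invented for illustration and are not drawn from any real authority file.

```python
# Variant term -> authorized heading (illustrative table only).
AUTHORIZED = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
    "movies": "Motion pictures",
    "films": "Motion pictures",
}


def authorized_heading(term):
    """Return the authorized heading for a term, or None if uncontrolled."""
    return AUTHORIZED.get(term.strip().lower())


def collocate(records):
    """Group (title, term) pairs under their authorized headings."""
    shelves = {}
    for title, term in records:
        shelves.setdefault(authorized_heading(term), []).append(title)
    return shelves


# Hypothetical records as three different catalogers might describe them.
records = [
    ("Title A", "cars"),
    ("Title B", "Autos"),
    ("Title C", "flicks"),
]

print(collocate(records))
```

Titles A and B end up together under one heading even though they were described with different words, while Title C, described with a term outside the vocabulary, falls under no heading at all and would be missed by a subject search. Equally, had the cataloger of Title A chosen the wrong term, the system would file it confidently in the wrong place.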

Another argument against classification schemes of the past is that they contain restrictive rules which hinder knowledge discovery. Knowledge discovery is the ability to make connections between wide-ranging subjects that otherwise would not be related in a traditional classification system. For example, we have entered an era where almost all data can be linked together in novel and entertaining ways. This is the basis for the Semantic Web. Internet users can link and categorize anything they want by creating tags or folksonomies that showcase niche interests and new subject matter. By analyzing the content of the semantic web, information scientists are working to harness these folksonomies to improve search engine functionality and retrieval tools. It is an exciting time, but it is also a daunting time. Intellectual mastery of the semantic web is necessary to preserve entrenched disciplines that contain thousands of years of knowledge.

In the future, newer forms of information systems will be tried and tested. These will include natural language processors and artificial intelligence systems. But bibliographic data will still be inputted by humans through the process of cataloging and resource description. This task may become easier for catalogers and indexers as information systems improve their ability to offer suggestions or provide prepopulated subject headings. But just the same, the work will continue. Taylor and Joudrey illustrated that knowledge management is not perfect. There are flaws and implicit biases in subject analysis. But where data integrity for abstract and philosophical content is concerned, human intervention is still required. Indeed, knowledge is still the province of human beings, not machines.