Considering Web Classification

For those with a more traditional background in library science, or simply with experience in cataloging departments, I think it may be too easy to feel that cataloging has to be a manual process, controlled by the human cataloger. This may be the case with books, because they have physical dimensions and cataloging-in-publication data which need to be entered into a cataloging system, either through copy cataloging or original cataloging. Moreover, some libraries may take the liberty of adding subject headings to cataloging records that meet the criteria of their own hand-selected collections. However, web resources are a different beast. Classifying web resources can seem like a daunting task because there is such a proliferation of content on the Web, including not just static webpages, but blogs, wikis, and videos. The discussion of cataloging web resources once revolved around deciding how to classify just webpages, but now it is a question of classifying web content, which relies increasingly on metadata standards like Dublin Core. The Dublin Core element set describes not only standard bibliographic attributes, but also those suited to Web resources, such as creator(s), format, type of resource, etc.

I think for a while now we have seen a move away from Library of Congress Classification (LCC) and Dewey Decimal Classification (DDC), especially with regard to classifying the semantic web. In fact, I have not seen any earnest discussion of applying these classification schemes to web resources. The two projects that earnestly tried to apply LCC or DDC were the CyberStacks Project out of Iowa State University and OCLC’s NetFirst. These projects seem all but dead now. I think the reason is that applying the alphanumeric codes of LCC and DDC is a process which relies on human matching of subject disciplines, which is just too much of a Sisyphean task when it comes to Web resources. At the same time, it is still too difficult for artificial intelligence and machine learning to pin down subject disciplines based on keyword analysis. That being said, we are not without commercial computing tools to aid in the classification of web resources. There are automated tools which index just about anything they are programmed to index, such as web-based keywords or metatags.

These tools make the bibliographic management of the web possible. Bibliometric mapping of the Web can produce large databases of indexed material, which puts the Internet in the cross-hairs of catalogers. So ideally, the best “system” to classify Web materials is to use the many tools that are available to digital librarians which allow for taking bibliographic snapshots of the Web, such as webcrawlers designed for the purpose.
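To make the idea of an automated "bibliographic snapshot" a little more concrete, here is a minimal sketch of the indexing step, assuming the page has already been fetched. It uses only Python's standard-library HTML parser to collect metatags, including Dublin Core style names such as DC.creator. The class name, the sample page, and its values are all invented for illustration; a real crawler would also need fetching, politeness rules, and far more error handling.

```python
from html.parser import HTMLParser


class MetaTagIndexer(HTMLParser):
    """Collects name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip tags such as <meta charset="..."> that carry no name/content pair.
        if name and content:
            self.metadata.setdefault(name.lower(), []).append(content)


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """
<html><head>
<title>Sample page</title>
<meta name="keywords" content="cataloging, classification, metadata">
<meta name="DC.creator" content="Example Author">
<meta name="DC.format" content="text/html">
<meta name="DC.type" content="Text">
</head><body></body></html>
"""

indexer = MetaTagIndexer()
indexer.feed(SAMPLE_PAGE)
for name, values in sorted(indexer.metadata.items()):
    print(name, "->", values)
```

The output is only as good as the metatags the page author chose to supply, which is exactly why a cataloger still has to review what the tool harvests.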

As far as the ephemeral nature of the Web goes, I do not think LIS professionals need to concern themselves too much with cataloging Web material that eventually disappears due to link rot. Canonical webpages, or webpages of content that are sponsored, will provide enough material for catalogers to work on. I see this as being no different than cataloging books that have gone through the publication process. There has always been a certain authority that measures bibliographic worth. Of course, I am aware of the implications of leaving out self-created folk content. But the original purpose of cataloging was to capture the whole of knowledge as nearly as possible, and there is enough information out there to catalog, in print form and on the Web, in order to accomplish this objective.

At any rate, indexing the semantic web through the use of automated products produces large and numerous digital libraries. My ideal system for classifying web resources would, for starters, place a greater emphasis on this endeavor. It would also apply useful digital tools to aid the cataloger in matching content to a knowledge base.

The Organization of Information

Taylor and Joudrey (2012) concluded their book, The Organization of Information, by stating that there is much work to be done in both information organization and the development of retrieval systems. With the diffusion of information in today’s world, the effort to analyze, arrange, classify, and make readily available millions of resources is a task that requires sophisticated programming of bibliographic networks, as well as endless hours of critical and analytical work from trained catalogers or indexers. Taylor and Joudrey showed that, despite advances in technology, the human mind is still needed to interpret a myriad of information resources by providing subject analysis, controlled vocabulary, and classification schemes to the descriptive practice of knowledge management.

We have now witnessed almost two centuries of bibliographic control, with many of the foundational principles of cataloging and description still in use today. For example, collocation (the intellectual and proximal ordering of bibliographic materials) was an invention of nineteenth-century founders such as Anthony Panizzi, Charles Ammi Cutter, and Melvil Dewey. These individuals saw the importance of creating subject headings and classification rules, which libraries adopted shortly thereafter in the form of dictionary catalogues, indexes, thesauri, and subject lists. The goal of these systems was to classify the entirety of all knowledge. This all started with the Dewey Decimal Classification system, which had ten main discipline classes with 10,000 subdivisions in which books could be classified. This system was expanded by Cutter in his Expansive Classification system, which used letters to represent subject classes. Cutter’s system ultimately found its way into the Library of Congress Classification system, rather to the chagrin of Dewey.

The development of computerized systems to aid in the structuring and retrieval of knowledge occurred in the late 1960s. Machine-Readable Cataloging (MARC) was introduced in 1968. MARC formatting allowed computers to read and encode bibliographic records by utilizing a numeric coding system that corresponded to the areas of description in a written catalog record. These codes contained “variable fields” for areas of variable length (such as a book title or author name); “control fields” for numeric data points (call numbers, ISBNs, LCCNs, etc.); and “fixed fields” for bibliographic data of a predetermined length and format, such as a three-letter language abbreviation.
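The difference between a variable field and a fixed field is easier to see in a small example. The sketch below holds a MARC-like record in a plain Python dictionary; it is not a real MARC parser, and the helper names are hypothetical. The tag numbers (245 for the title statement, 008 for fixed-length data) and the language code sitting at character positions 35-37 of the 008 field follow the MARC 21 bibliographic format as I understand it; every other position is simply filled with a placeholder character.

```python
FIXED_FIELD_LENGTH = 40  # field 008 holds 40 character positions (00-39)


def make_008(language_code):
    """Build a placeholder 008 field with only the language positions coded."""
    if len(language_code) != 3:
        raise ValueError("language code must be three letters")
    # "|" is used here as a fill character meaning "no attempt to code".
    positions = ["|"] * FIXED_FIELD_LENGTH
    positions[35:38] = list(language_code)
    return "".join(positions)


def language_of(record):
    """Read the language code back out by position, not by label."""
    return record["008"][35:38]


# Illustrative record: the title comes from the book discussed in this essay.
record = {
    # Fixed field: every value lives at a predetermined position.
    "008": make_008("eng"),
    # Variable field: subfields of whatever length the data requires.
    "245": {"a": "The organization of information", "c": "Taylor and Joudrey"},
}

print(language_of(record))
print(record["245"]["a"])
```

Because the language code is found by position rather than by a label, a machine can read it without understanding anything else in the record, which is the whole point of a fixed field.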

Bibliographic networks were built to accommodate the MARC format. The first major network to emerge was the Ohio College Library Center, which became OCLC (Online Computer Library Center) and is still in use today. OCLC allows catalogers to import bibliographic records from a shared network of libraries and information resource centers. This importing is referred to as copy cataloging. A cataloger will add an already-cataloged record to their system, engaging in authority work by ensuring the record was copied from a reliable source like the Library of Congress authority files. Almost all public and academic libraries use OCLC, and this system has streamlined the work of cataloging in technical service departments. But it is important to note that this technology is almost fifty years old now. There are nascent trends in the world of information science that go beyond the reach of time-honored bibliographic networks.

The classical arrangement of knowledge mentioned above was based on a narrow set of information resources, primarily books. But not all resources that users need to be able to search and retrieve are biblio-centric. For example, an information seeker may need to find an artifact. Knowledge artifacts are as varied as the name implies. They can include sound recordings, historical objects, websites, performance art pieces, even concepts. This last example of “concepts” perhaps best illustrates the point. Indeed, a knowledge artifact can be purely conceptual or abstract in nature. Yet, as an artifact, it still needs to be described and collocated for information retrieval. This is done through a “technical reading” of the artifact, a process of critical analysis whereby the cataloger or indexer attempts to define the aboutness of a work.

The process of defining aboutness, referred to as subject analysis by Taylor and Joudrey, is at the heart of information organization. Subject analysis is arguably the most important part of cataloging work, and it is certainly the trickiest. In order to determine the aboutness of a work, the cataloger must be able to accurately represent a knowledge artifact. But the artifact in question might not contain any lexical content. In other words, it may be a nontextual information resource, and thus difficult to grasp intellectually without the creator’s original insight. Yet, as a cultural information resource, the knowledge artifact still has meaning, which requires it to be abstracted and indexed. How is this to be done? Well, there is still debate among LIS professionals regarding the best practices for subject analysis. The common practice is to isolate subject keywords in an aboutness statement. However, aboutness statements impose the cataloger’s perceptions onto a work, classifying the artifact in a hierarchical manner which may not be culturally precise. Herein lies the danger of subject analysis.

This creates a dilemma for classification of knowledge artifacts. For instance, in order to make an information resource readily retrievable, controlled vocabulary is required. A controlled vocabulary is a set of specific terms which are used for describing all “like” resources. But, as we have seen, describing knowledge artifacts can be difficult. Indeed, sometimes during subject analysis, the cataloger can only describe the of-ness of an artifact (Taylor & Joudrey, p. 309). As a general rule, controlled vocabulary makes it easier to find resources in an information system. But if an original cataloger incorrectly represents a knowledge artifact, any surrogate record copied from that description will carry the same misrepresentation. Surrogate records can number into the hundreds of thousands. So if the goal of bibliographic networks is to create standardized subject headings in an interoperable system, then hundreds of thousands of inaccurate records could be created. Conversely, if controlled vocabulary is not used in the representation of a knowledge artifact, then that artifact will be all but impossible to retrieve in an information system. This is the dilemma of subject analysis.
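Both sides of this dilemma can be shown with a small sketch. The mapping below is a hypothetical controlled vocabulary, invented for illustration and not drawn from any real subject heading list. Variant terms are normalized to one preferred heading so that like resources collocate, while a term the vocabulary does not know is left stranded under its own entry.

```python
# Hypothetical vocabulary: variant terms mapped to one preferred heading.
PREFERRED_TERMS = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
    "motorcars": "Automobiles",
}


def normalize(term):
    """Return the preferred heading for a term, or the term itself if unknown."""
    cleaned = term.strip()
    return PREFERRED_TERMS.get(cleaned.lower(), cleaned)


def build_index(records):
    """Map each heading to the set of resource titles described by it."""
    index = {}
    for title, terms in records.items():
        for term in terms:
            index.setdefault(normalize(term), set()).add(title)
    return index


# Invented surrogate records, each described with a different variant term.
records = {
    "Resource A": ["cars"],
    "Resource B": ["Autos"],
    "Resource C": ["Automobiles"],
    "Resource D": ["horseless carriages"],  # not in the vocabulary
}

index = build_index(records)
print(sorted(index["Automobiles"]))   # the three collocated resources
print(sorted(index))                  # the stray heading sits apart
```

Resources A through C are retrieved together under one heading, while Resource D can only be found by a searcher who guesses its exact wording. The same mechanism also spreads mistakes: a wrong entry in the mapping would misfile every record that relies on it.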

Another argument against classification schemes of the past is that they contain restrictive rules which hinder knowledge discovery. Knowledge discovery is the ability to make connections between wide-ranging subjects that otherwise would not be related in a traditional classification system. For example, we have entered an era where almost all data can be linked together in novel and entertaining ways. This is the basis for the Semantic Web. Internet users can link and categorize anything they want by creating tags or folksonomies that showcase niche interests and new subject matter. By analyzing the content of the semantic web, information scientists are working to harness these folksonomies to improve search engine functionality and retrieval tools. It is an exciting time, but it is also a daunting time. Intellectual mastery of the semantic web is necessary to preserve entrenched disciplines that contain thousands of years of knowledge.

In the future, newer forms of information systems will be tried and tested. These will include natural language processors and artificial intelligence systems. But bibliographic data will still be entered by humans through the process of cataloging and resource description. This task may become easier for catalogers and indexers as information systems improve their ability to offer suggestions or provide prepopulated subject headings. But just the same, the work will continue. Taylor and Joudrey illustrated that knowledge management is not perfect. There are flaws and implicit biases in subject analysis. But where data integrity for abstract and philosophical content is concerned, human intervention is still required. Indeed, knowledge is still the province of human beings, not machines.