Knowledge Discovery – Digital Téchnē

Miller_Metadata What jumped out at me after reading the first chapter of Steven J. Miller’s text on metadata was the wide applicability and the different typologies of metadata. It is easy to assume that metadata is a bibliographic concept; that only librarians, archivists, and data scientists are interested in metadata. But metadata is important in just about all sectors of private and public employment. Not only that, but metadata is all around us, and we engage with it – sometimes unknowingly – in our daily routines and hobbies. Cultural heritage institutions primarily deal with descriptive metadata in order to describe their resources better for searching, browsing, retrieval, and use. Descriptive metadata, as only one type of metadata in a sea of other metadatas, is laden with complexity and has a rich history.

Metadata, quite simply, is the detailed description of the granular levels or properties of objects, resources, or artistic works. It is commonly defined as “data about data,” but this is far too simplistic a definition. In Setting the Stage, Anne J. Gilliland defines metadata as “the sum total of what one can say about any information object at any level of aggregation.” Almost all metadata is intellectual, allowing users the ability to pinpoint specific items for retrieval in bibliographic databases, or to make informed judgments about the provenance or inherent properties of an item. This explanation is not entirely specific, but neither is the definition for subject analysis, which is determining the “aboutness” of an item; a task that relies heavily on metadata. As Miller writes, “[Metadata] allows users to identify the content, context, and meaning of digital resources, both individually and in relation to one another, and it allows users to retrieve individual resources and sets of related resources based on any number of shared characteristics” (Miller, 10). In other words, without metadata, there would be no order to information. The bibliographic universe would be fraught with entropy. Indeed, without metadata, it would be impossible to find relevant sources of information or even distill meaning from information.

Gilliland, however, shows that the application of a standard metadata schema is not the “Be-all, End-all” of information organization. Gilliland writes: “the distinctiveness of the various professional and object-based approaches (e.g., widely differing notions of provenance and collectivity as well as of structure), different institutional cultures, and divergent cultural approaches (e.g., those exemplified in indigenous protocols for archival and library materials) have left many professionals, and the communities they represent, feeling that their practices and needs have been shoehorned into structures that were developed by another community with quite different epistemologies, practices, and users.” This is a very important point. There can easily be resistance among institutions when adopting metadata standards. Therefore, metadata sharing, harvesting, and aggregating might not be appropriate depending on the cultural context of the institution and the clientele they serve.

Dublin Core (DC) metadata is the most common standard in the cultural heritage domain. I was previously unaware that there were so many more options within DC to qualify data in terms of refinements and encoding schemes. I have used DC before, but I did not really understand its structure, because I was not explicitly dealing with the underlying logic of the standard. My experience with digital content management software is confined to records creation in DSpace. Without taking a class on descriptive metadata before or being required to read any technical manuals on DC metadata as an undergraduate work-study student, using out-of-the-box software can obscure the practice of metadata creation. For instance, simply entering rote data values into input fields on a program with minimal DC support does not really get to the crux of the standard.

DC is designed to be flexible. So flexible, in fact, that application profiles are encouraged for local use. Therefore, there really is no wrong way to create metadata. In terms of interoperability, you can either create metadata that is compatible with the standard (which is encouraged), or you can create metadata that is not compatible, but is still adequate for localized use. In this sense, there is both “good” and “bad” metadata in terms of interoperability, but there really is no such thing as wrong metadata as long as it is operational for information search and retrieval purposes.

On that subject, Miller makes an important point at the end of Chapter 2 in his text. He says that “many implementers have found unqualified Dublin Core too simple for fully effective description of digital cultural heritage resources. Even Qualified Dublin Core does not always have the processing power that some implementers need in order to accomplish the search, browse, navigation, and identification functions they desire to implement for their users” (Miller, 54). In my experience, it seems like most users of information resources rarely see or desire DC elements when searching catalogs. In fact, most users generally have no need for DC metadata. Instead, what they require to find the information objects they are looking for are the local elements. DC elements are essential for machine readability, which in turn is necessary for cross-platform resource discovery. But as far as I can tell, the justification for DC remains the interoperability of linked data and convergent catalogs, which is still very much a “future perfect,” as most catalogs in the cultural heritage domain remain localized. And that may be how they remain indefinitely. For example, as Lynne C. Howarth points out: “[T]here remain legitimate reasons as to why a standards-in-common solution poses its own enduring problems, even within the compelling context of interoperability. Competition for market share, anti-trust laws within commercial applications, the inconvenience (and expense) of implementing a standard that is of primary benefit to others, the need for flexibility in innovation and testing, the inherent dynamism (instability and continuous upgrade) of information technology, and the considerable length of time required for developing, implementing, and evaluating a new standard (the standards development life-cycle), may conspire against full collaboration–and particularly international collaboration–among diverse interests and organization imperatives.”

Clearly, problems remain.

Taylor and Joudrey (2012) concluded their book, The Organization of Information, by stating that there is much work to be done in both information organization and the development of retrieval systems. With the diffusion of information in today’s world, the effort to analyze, arrange, classify, and make readily available millions of resources is a task that requires sophisticated programming of bibliographic networks, as well as endless hours of critical and analytical work from trained catalogers or indexers. Taylor and Joudrey showed that, despite advances in technology, the human mind is still needed to interpret a myriad of information resources by providing subject analysis, controlled vocabulary, and classification schemes to the descriptive practice of knowledge management.

We have now witnessed almost two centuries of bibliographic control, with many of the foundational principles of cataloging and description still in use today. For example, collocation – the intellectual and proximal ordering of bibliographic materials – was an invention of nineteenth-century founders such as Anthony Panizzi, Charles Ammi Cutter, and Melvil Dewey. These individuals saw the importance of creating subject headings and classification rules, which libraries shortly adopted thereafter in the form of dictionary catalogues, indexes, thesauri, and subject lists. The goal of these systems was to classify the entirety of all knowledge. This all started with the Dewey Decimal Classification system, which had ten main discipline classes with 10,000 subdivisions in which books could be classified. This system was expanded by Cutter in the use of his Expanded Classification system, which included letters to represent subject classes. Cutter’s system ultimately found its way into the Library of Congress Classification system, rather to the chagrin of Dewey.

The development of computerized systems to aid in the structuring and retrieval of knowledge occurred in the late 1960s. Machine-readable Cataloging (MARC) was introduced in 1968. MARC formatting allowed computers to read and encode bibliographic records by utilizing a numeric coding system that corresponded to the areas of description in a written catalog record. These codes contained “variable fields” for areas of variable length (such as a book title or author name); “control fields” for numeric data points (call numbers, ISBNS, LCCN, etc.); and “fixed fields” for bibliographic data of a predetermined length and format, such as a three-letter language abbreviation.

Bibliographic networks were built to accommodate the MARC format. The first major network to emerge was the Ohio College Library Center, which morphed into the OCLC (Online Computer Library Center), still in use today. OCLC allows catalogers the ability to import bibliographic records from a shared network of libraries and information resource centers. Where importing occurs, this is referred to as copy cataloging. A cataloger will add an already-cataloged record to their system, engaging in authority work by ensuring their record was copied from a reliable source like the Library of Congress authority files. Almost all public and academic libraries use OCLC, and this system has streamlined the work of cataloging in technical service departments. But it is important to note that this technology is almost fifty years old now. There are nascent trends in the world of information science that go beyond the reach of time-honored bibliographic networks.

The classical arrangement of knowledge mentioned above was based on a narrow set of information resources; primarily books. But not all resources that users need to be able to search and retrieve are biblio-centric. For example, an information seeker may need to find an artifact. Knowledge artifacts are as varied as the name implies. They can include sound recordings, historical objects, websites, performance art pieces, even concepts. This last example of “concepts” perhaps best illustrates the point. Indeed, a knowledge artifact can be purely conceptual or abstract in nature. Yet, as an artifact, it still needs to be described and collocated for information retrieval. This is done though a “technical reading” of the artifact; a process of critical analysis whereby the cataloger or indexer attempts to define the aboutness of a work.

The process of defining aboutness, referred to as subject analysis by Taylor and Joudrey, is at the heart of information organization. Subject analysis is arguably the most important part of cataloging work, and it is certainly the trickiest. In order to determine the aboutness of a work, the cataloger must be able to accurately represent a knowledge artifact. But the artifact in question might possibly not contain any lexical content. In other words, it may be a nontextual information resource, and thus completely intangible intellectually without the creator’s original insight. Yet, as a cultural information resource, the knowledge artifact still has meaning, which requires it to be abstracted and indexed. How is this to be done? Well, there is still debate among LIS professionals regarding the best practices for subject analysis. The common practice is to isolate subject keywords in an aboutness statement. However, aboutness statements impose the cataloger’s perceptions onto a work, classifying the artifact in a hierarchical manner which may not be culturally precise. Herein lies the danger of subject analysis.

This creates a dilemma for classification of knowledge artifacts. For instance, in order to make an information resource readily retrievable, controlled vocabulary is required. Controlled vocabulary are specific terms which are used for describing all “like” resources. But, as we have seen, describing knowledge artifacts can be difficult. Indeed, sometimes during subject analysis, the cataloger can only describe the of-ness of an artifact (Taylor & Jourdrey, 309). As a general rule, controlled vocabulary makes it easier to find resources in an information system. But if an original cataloger incorrectly represents a knowledge artifact, any surrogate record for that artifact will invariably be misrepresented. Surrogate records can number into the hundreds of thousands. So if the goal of bibliographic networks is to create standardized subject headings in an interoperable system, then hundreds of thousands of inaccurate records could be created. Conversely, if controlled vocabulary is not used in the representation of a knowledge artifact, then that artifact will be made all but impossible to retrieve in an information system. This is the dilemma of subject analysis.

Another argument against classification schemes of the past is that they contain restrictive rules which hinder knowledge discovery. Knowledge discovery is the ability to make connections between wide-ranging subjects that otherwise would not be related in a traditional classification system. For example, we have entered an era where almost all data can be linked together in novel and entertaining ways. This is the basis for the Semantic Web. Internet users can link and categorize anything they want by creating tags or folksonomies that showcase niche interests and new subject matter. By analyzing the content of the semantic web, information scientists are working to harness these folksonomies to improve search engine functionality and retrieval tools. It is an exciting time, but it is also a daunting time. Intellectual mastery of the semantic web is necessary to preserve entrenched disciplines that contain thousands of years of knowledge.

In the future, newer forms of information systems will be tried and tested. These will include natural language processors and artificial intelligence systems. But bibliographic data will still be inputted by humans through the process or cataloging and resource description. This task may become easier for catalogers and indexers as information systems may improve on their ability to offer suggestions or provide prepopulated subject headings. But just the same, the work will continue. Taylor and Joudrey illustrated that knowledge management is not perfect. There are flaws and implicit biases in subject analysis. But where data integrity for abstract and philosophical content is concerned, human intervention is still required. Indeed, knowledge is still the province of human beings, not machines.

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Digital Téchnē

UW-Milwaukee SOIS – LIS and Philosophy

Category Archives: Knowledge Discovery

Initial Observations of Metadata

The Organization of Information