Academic Architect

A conversation with Mike Giarlo

In an era of digital scholarship, academic libraries around the world have been tackling challenging questions related to the storage, maintenance, and curation of digital data. To meet this need at Penn State, librarians and technologists have begun working together to create a digital repository, called ScholarSphere, that will better house electronic resources, guard them against deterioration, and keep them accessible for decades to come. As a technical leader in the stewardship of Penn State's digital resources, ITS Digital Library Technologies will have a key role in this initiative.

Mike Giarlo is a digital library architect at DLT, tasked not only with designing the repository’s technical architecture, but also with educating and fostering a community of technologists around these important curatorial issues. Mike sat down with us to reflect on his journey to Penn State and envisage the exciting road ahead.

Mike, could you tell us a little bit about your role in DLT?

My primary role is to design and develop a technical architecture for Penn State’s and University Libraries’ digital assets, including research data, electronic records, and digital library collections. The nature of my work is not radically different from that of other technical architects, such as my colleagues in Penn State’s ITANA (Information Technology Architects in Academia) group. What sets a digital library architect apart, though, is that IT architecture is applied to the problems of long-term information management (digital preservation) in order to enhance discovery and access. Providing enduring access to research and cultural heritage has been, and continues to be, the mission of libraries and archives.

How did you get involved in this field?

I’ve been working in academic and research libraries since 1999, during which time I’ve had the opportunity to do many things, including technical and workstation support, systems and storage administration, web and software development, as well as project management. From 2001 until 2005, I was employed at Rutgers University, and worked closely with their digital library architect to build a digital preservation architecture for the institution. I found this work, and the great challenge of preserving access to our digital cultural heritage, to be fascinating and fulfilling.

Around this time, I decided it would be beneficial to learn more about libraries and information management, so I started taking classes at Rutgers’ School of Communication and Information. I ultimately earned a master’s degree in library and information science. Between my time at Rutgers and Penn State I developed and honed my software development and project management skills, both of which I’ve learned are extraordinarily useful in this line of work.

Immediately before coming aboard at Penn State, I was a software developer at the Repository Development Center at the Library of Congress, where we tackled large-scale digital preservation and data management issues at the world’s largest library.

What is a typical day for you like?

My daily activities are varied and include planning and developing technology strategy, managing projects, meeting with stakeholders and sponsors to discuss requirements and strengthen relationships, serving on related committees and working groups, planning regional and national events for building community around the practice of digital curation, and evangelizing to generate interest and excitement about what we do.

How do you manage to marry the library-focused portion of your job with the IT?

My role, particularly with regard to DLT’s Applications and Repository Services team, is to translate between the two very different worlds of libraries and IT. My hybrid background in libraries and IT serves me well in this role.

In addition to helping guide the trajectory of our applications and repository efforts, I also keep my software development skills sharp by writing code for our repository projects, specifically for unit testing and developing application programming interfaces. Information modeling is a technical aspect of my job where my knowledge of libraries and my knowledge of IT converge. The practice of information modeling entails analyzing a particular domain (in this case the work of libraries), identifying the entities within that domain, characterizing those entities, and enumerating the relationships between them. My experience in libraries and my library degree help me apply the practice of information modeling to the domain of libraries and the work we do in the repository space.

What is information modeling?

Modeling as a practice is employed throughout IT, and more broadly, as a way to better understand a domain before attempting to solve problems within that domain. The practice of modeling produces artifacts that may be used as touchstones throughout the course of a project, such as entity-relationship diagrams for databases, architectural diagrams for network and systems integration, or functional requirements and roadmaps for higher-level consumption.
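To make the idea concrete, here is a minimal sketch of what a small information model might look like once it is written down as code. It is an illustration only: the entity names (Researcher, ScholarlyWork, DepositedFile), their attributes, and the sample values are assumptions chosen for this example and do not describe ScholarSphere's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical entities for a repository domain. Names and attributes are
# illustrative assumptions, not ScholarSphere's actual model.
@dataclass
class Researcher:
    name: str
    department: str


@dataclass
class DepositedFile:
    filename: str
    file_format: str  # e.g. a MIME type recorded at deposit time


@dataclass
class ScholarlyWork:
    title: str
    creator: Researcher  # relationship: a work has one creator
    files: List[DepositedFile] = field(default_factory=list)  # a work has many files

    def formats(self) -> List[str]:
        """Return the distinct file formats in this work, sorted."""
        return sorted({f.file_format for f in self.files})


def build_example() -> ScholarlyWork:
    """Assemble one work with its creator and two files (sample data)."""
    author = Researcher(name="A. Researcher", department="Example Department")
    work = ScholarlyWork(title="Example dataset", creator=author)
    work.files.append(DepositedFile("readings.csv", "text/csv"))
    work.files.append(DepositedFile("report.pdf", "application/pdf"))
    return work


if __name__ == "__main__":
    example = build_example()
    print(example.creator.name, example.formats())
```

In this sketch the classes stand for the entities, their fields characterize each entity, and the creator and files fields record the relationships between them. A real model would more likely be captured first in a diagram or requirements document and only later in code.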

How is DLT involved in the University Libraries’ Content Stewardship initiative?

The Content Stewardship program is a joint initiative between the Libraries and ITS. DLT and the Libraries are equal partners in the project and we work to determine the direction of the initiative and develop program activities. Most recently, we have been working together on the ScholarSphere project, which is a repository that will provide durable access to scholarly works and data at Penn State.

ScholarSphere allows members of the Penn State research community—faculty, staff, graduate students, and undergraduates alike—to deposit, search, share, and provide access to their scholarly works. The repository also provides preservation functions such as scheduled and on-demand audits of deposited works, the ability to upload multiple versions of files, characterization of files to enable future format migrations, regular file backups, and replication to disaster recovery sites.

Could you tell us a little about ScholarSphere?

ScholarSphere has been developed atop a flexible, format-agnostic repository architecture that will collect, store, and manage research data and documents across a broad spectrum of disciplines and data types. ScholarSphere’s initial services will support the direct deposit of scholarly and research materials, such as datasets, working papers, research reports, image collections, and other digital objects which the researcher wishes to share, search, and preserve.

The functionality provided by ScholarSphere, and the strong support it has received from both the University Libraries and ITS, positions it as an important element in the creation of researchers’ data management plans—especially for those seeking funding from granting agencies that require such plans. The Libraries and ITS already provide consulting services to researchers about how to write their data management plans, and thus ScholarSphere and these services will interlock like puzzle pieces to help researchers with their practical research and data management needs.

What are some of the benefits that ScholarSphere will provide Penn State researchers?

The beta release of ScholarSphere will provide researchers with the ability to store their publications and data on a platform that simplifies sharing, citing, and discovering these materials. Researchers will be able to share their data and documents with the Penn State community, either by sharing directly with specified individuals or with established groups. A researcher will also be able to share each file at different access levels including read-only and edit modes, allowing her full control over who sees and edits her files. She may also restrict access to her digital assets so that they are accessible only to her.

Every file stored in ScholarSphere will have a globally unique identifier (URL) assigned to it, which makes it easier for a researcher to cite and publicize her research. The ScholarSphere service is committed to the long-term persistence of these URLs so that citations in fixed media (print journals for example) do not break over time. When a researcher uploads a file to ScholarSphere, she will have the opportunity to describe it so that it becomes more findable by other researchers. In addition, every file is run through a characterization service that extracts embedded metadata from the file and stores information about the file format, all of which makes the file even more easily discoverable.

And this is just the tip of the iceberg; ScholarSphere will continue to add widely requested features beyond its beta release, so there will be even more advanced functionality to look forward to in the near future.

This sounds like a huge undertaking. Is Penn State collaborating with any other institutions to make it happen?

Absolutely. We couldn’t have delivered ScholarSphere as quickly as we did without collaboration from other institutions. We’ve taken a community-based approach to the technology underlying ScholarSphere by engaging the community around the Hydra Project. The Hydra Project is a collaborative effort to develop repository systems across institutions using common components. Hydra was started in 2008, and has grown to include many peer institutions such as Stanford University, Northwestern University, University of Virginia, University of Notre Dame, Indiana University, Columbia University, and others. The Hydra Project now includes more than a dozen active institutions, with even more lined up to join the community.

Can you explain Hydra’s technological approach for building repositories?

The Hydra technology stack is built on a mature, open-source platform consisting of Fedora Commons for digital asset management and Apache Solr for search and indexing. Developers from institutions in the Hydra community have built components integrating these systems and provided hooks and helpers for higher-level systems. These components are written in the Ruby programming language, and the web applications within this community are built using the Rails web framework (which is also written in Ruby).

The upshot of using these Hydra technologies is that they help organizations create web-based user interfaces for their repository services that feel very much like modern web applications and less like the monolithic and rigid repository systems of the past.

My role in ScholarSphere has been to align our architectural roadmap with that of the Hydra community, assist with software development (with a side benefit of keeping my Ruby and Rails skills sharp), and engage the Hydra community.

What has this collaborative experience been like so far?

The Hydra community has welcomed Penn State into the project, and we have already contributed back substantially. Not only has Penn State gained so much from leveraging the technologies and approach of this community—allowing us to meet our aggressive deadlines—but the Hydra community has already benefited from our work as well: a mutual “big win.”

Who is involved with the development of ScholarSphere at Penn State?

The ScholarSphere project is highly collaborative, with representation across ITS and the University Libraries. Collaborative work between ITS and the University Libraries set the groundwork for ScholarSphere. This work also includes a number of projects that I’ve co-led with Patricia Hswe, the Libraries’ digital collections curator.

Patricia’s and my positions are very much counterparts to one another, different sides of the same coin; in fact, Patricia and I started at Penn State on the same day, so even our hiring was coordinated between ITS and the Libraries.

How interesting! What have you two accomplished together so far?

My first project, which was also Patricia’s first project, was to conduct a systematic assessment of how the Libraries provide access to digital content. For the assessment, we interviewed a few dozen individuals at University Park—including folks from the Libraries, ITS, and outside organizations—in order to gain an understanding of the varied perspectives people have about our digital delivery systems. This turned out to be an invaluable project, not only because of what we learned about these systems, but also because it got us out in front of folks from all over campus, which also really helped to ground me socially at Penn State.

Patricia and I have worked on a number of projects following the assessment, including gathering curatorial use cases and producing a prototype repository application. ScholarSphere was informed by the incredible work that a large number of Penn Staters have been doing over the past two years. I doubt ScholarSphere would have as much support as it does without all of the groundwork we’ve done collaboratively with the Libraries.

What’s trending in the digital libraries arena right now?

As digital research has grown and its products have multiplied over the past few years, research data has become a hot topic in academia, and there are now hundreds of repositories for depositing research data. Yet, digital libraries have typically not been designed with this use case in mind, and many institutions are finding they have to grow or adapt their digital library service offerings. However, there are many unanswered questions related to this growing field. For example: How does a researcher know where to deposit her data? How does a librarian know where to find the most relevant research data for a particular reference question? How does a scientist find relevant research data, so that her experiments can build on existing bodies of research?

To help answer some of these questions, I recently collaborated on a project with Purdue University Libraries called Databib.

Databib is a crowdsourced web-based registry of research data repositories that integrates with social networks and the semantic web, placing information about research data repositories into larger web ecosystems. Work on Databib was funded by a grant from the Institute of Museum and Library Services, which was awarded to Purdue and Penn State last year. The Databib beta is now live and is growing every week.

How are you building a community around digital curation?

A significant aspect of my work at Penn State is community engagement, particularly among the practitioners of digital curation. As the community of digital curation practitioners has grown over the years, so too has the need for increased collaboration among technologists. A number of communities have sprung up, many of which focus on a very specific technology or approach. In 2010, a colleague at University of California-San Diego and I sought to bridge these disparate communities by creating a new program called CURATEcamp.

CURATEcamp is a series of unconference-style events focused on connecting practitioners and technologists interested in the practice of digital curation. Conversations tend to revolve around the sharing of best practices, common tools, and technologies. Presentation topics vary and include persistent identifiers, versioning, data transfer, packaging approaches, object structure, file system usage, archiving and storage, metadata standards, semantic ontologies, web discovery, and interoperability.

Why the “unconference” format?

The unconference format encourages all participants to be actively engaged in the workshop and gives everyone an opportunity to contribute across institutional, technical, and social boundaries. There are no spectators at CURATEcamp, only contributors.

Since CURATEcamp’s inception in 2010, we have run, sponsored, and organized six events across the country, attracting hundreds of participants from several dozen institutions. Our model seems to be going strong and is expanding; we are currently planning the next four CURATEcamp events for 2012.

What initially attracted you to Penn State back in 2009?

The digital library architect position posted by DLT offered a laundry list of nearly everything I wanted out of a job: a foundation in IT, a strong relationship with University Libraries, connections to the research community, and a position in ITS.

The opportunity represented what I had been working towards: a position in which I could serve as a leader in digital preservation and access initiatives for an academic institution.

And what do you love about your work today?

I love that the problems confronting me as a digital library architect are so diverse: there are technical, social, and organizational aspects of my job. The opportunity to tackle this spectrum of issues, in pursuit of helping academia to preserve and make available its digital output, helps keep both sides of my brain firing rather than just the analytical side. I appreciate that the practices of digital preservation and curation are still in their infancy and there are few easy answers. I’ve learned that those of us working in this domain can be more successful by staying aware of each other’s work, and thus I enjoy the opportunity to engage peers nationally and build community across institutional and disciplinary boundaries. Since these practices are so new (as are positions like mine), I appreciate the opportunity to try different approaches and apply creative problem solving. But perhaps best of all is the freedom I have to be innovative.
