Starting a DH Project #
Since our lab spends a significant proportion of our time working on digital humanities projects, I have wanted for a while to write a guide on how you should think about “onboarding” a new DH project. Would you like to come work with us to examine some data that you already have? Do you have a cool question that you think would be amenable to DH analysis, if only you had the right data and tools? This is the page for you.
What’s a Good Project? #
What kinds of things should you have in mind when deciding whether a project is even a good idea in the first place? Here are a few hopefully relevant questions worth thinking about.
What kind of data do you have, and how much?
A project is going to need at least enough text or material to be useful. How much is that? Hard to say! I’ve done projects with as few as a couple hundred journal articles (though then your conclusions will have to be correspondingly more modest, and the data will have to be of much higher quality, perhaps involving manual clean-up), and with as many as a few tens of thousands.
Each kind of analysis you might be thinking of doing (see below) will have different extents to which it is “data-hungry,” and so one piece of early discussion that we should have will center around just what you want to do and, by extension, how much data you’ll need to get it done.
A “good project,” then, needs at least enough data to pull off the analyses you want.
What role do you want the digital analysis to play?
If you want a digital analysis to be the result of your work, that’s very different from wanting a digital analysis that supports or backstops your other work. In the former case, you’ll need a much higher quality analysis, more data, and much more thought about the connection between the conclusions you want to draw and what the output of a particular analysis can actually tell you.
A “good project,” then, needs to know why it needs digital analysis, and how those analyses will fit into the landscape of other things that you want to do in a particular context.
How will you defend your choices to reviewers?
I know this is a weird thing to put in a “preliminary” document, but to be honest I think one of the best ways to approach these kinds of preparatory choices is to imagine defending yourself to a recalcitrant reviewer. What journals did you include? What did you leave out? What questions would an intuitive reader think you should have asked that your data can’t answer?
Answering these kinds of questions also raises disciplinary concerns. What a philosopher will see as the “weaknesses” of a digital analysis won’t be the same as what a historian will see, and both of those will be very different from what a digital humanist reviewing for a DH journal will look for.
A “good project,” then, should know where it’s going to be published and how it hopes to defend itself from reviewers.
Getting Ready #
First and foremost, it’s extremely important to emphasize the amount of time that it will take just to assemble the data that you need to perform your analysis and to get it into working order with our tools. If you’re thinking about a short-term visit to work on a DH project, that will make it absolutely necessary that we assemble the data that you need well before you get here – think six months to a year, not a few weeks.
Our group is adept at analyzing scientific journal articles. With a bit more effort, we’ve also got some experience looking at books, and I imagine (though I haven’t yet tested!) that the kinds of tools and workflows that we have built will also be effective for studying other reasonably short-format things like grant proposals, correspondence, or other archival material.
What do you need to put together, then, to build a data project? We need (1) plain text of the content that you want, in the highest quality that you can find it, and (2) detailed, accurate metadata. For (1), the most important thing to say is that native-digital content available in full machine-readable text (such as the PLoS journals, which are available natively in XML) is best, native-digital content in PDF is second-best but still pretty good, and OCR is worst. OCR degrades in quality radically depending on the age of material, and this will introduce various kinds of systematic error into your analyses that are hard to control for and hard to understand.
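To make the first tier concrete: native-digital XML (as in the PLoS journals) gives you clean, machine-readable text that can be reduced to plain text with a few lines of code. Here is a minimal sketch using Python’s standard library; the article fragment is a hypothetical, trimmed-down JATS-style structure, not real PLoS markup, but the extraction idea is the same.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical JATS-style article fragment. Real PLoS XML is
# far richer, but the basic extraction strategy carries over.
SAMPLE_XML = """
<article>
  <front><article-meta>
    <title-group><article-title>A Toy Article</article-title></title-group>
  </article-meta></front>
  <body>
    <sec><p>First paragraph of the body text.</p></sec>
    <sec><p>Second paragraph, with <italic>inline</italic> markup.</p></sec>
  </body>
</article>
"""

def extract_plain_text(xml_string):
    """Pull plain text from every <p> in the article, flattening
    inline markup like <italic> along the way."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() yields text from the element and all of its
        # descendants, so inline tags simply disappear.
        paragraphs.append("".join(p.itertext()).strip())
    return "\n\n".join(paragraphs)

print(extract_plain_text(SAMPLE_XML))
```

With PDF, by contrast, you are reconstructing this text from layout information, and with OCR you are reconstructing it from an image — each step adds error.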
Getting access to the plain text is a question of varying difficulty. Some publishers (Springer-Nature, e.g.) have a very open text and data mining license, and we can pretty much do whatever we want. Some are very closed off and we might have to negotiate legal agreements (which will be more complex if this is an inter-university collaboration). Some are somewhere in the middle (such as JSTOR, which requires that we fill in a form and wait for them to approve us, or Wiley, who have a sort of API for downloading full-texts that’s clunky and a bit slow).
Metadata is probably no big deal for journal articles – we can usually use CrossRef (the people who dispense DOIs) or the journal’s own website for this. For archival material, you will have to think very hard about what kinds of metadata you will need to properly perform and contextualize your analyses. Hopefully it’s available in an extant, machine-readable form somewhere.
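CrossRef exposes article metadata as JSON over a public REST API, and the usual first step is to flatten those records into the handful of fields an analysis needs. The sketch below works offline against an illustrative, made-up record in roughly the shape CrossRef returns (the real response nests this under a `message` key and carries many more fields); the DOI, title, and authors here are invented for the example.

```python
import json

# A trimmed, invented record in roughly the shape CrossRef's REST API
# returns. The real API response has many more fields.
CROSSREF_RECORD = json.loads("""
{
  "DOI": "10.1234/example.5678",
  "title": ["A Hypothetical Article About Digital Humanities"],
  "container-title": ["Journal of Examples"],
  "issued": {"date-parts": [[2020, 3, 15]]},
  "author": [
    {"given": "Ada", "family": "Lovelace"},
    {"given": "Charles", "family": "Babbage"}
  ]
}
""")

def simplify_metadata(record):
    """Flatten a CrossRef-style record into the fields a
    text-analysis pipeline usually needs."""
    return {
        "doi": record["DOI"],
        "title": record["title"][0],
        "journal": record["container-title"][0],
        "year": record["issued"]["date-parts"][0][0],
        "authors": [f'{a["family"]}, {a["given"]}' for a in record["author"]],
    }

print(simplify_metadata(CROSSREF_RECORD))
```

Archival metadata rarely arrives in anything this tidy, which is exactly why it deserves the extra thought.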
Starting Out #
Of course, the simplest thing for us to do is to load your content into Sciveyor, our platform for text analysis. If we’re not in a position to do that (licensing, data formats, etc.), then we might be able to build a local version of Sciveyor, running just inside my office at UCLouvain, containing the data that you’re interested in. If we’re not able to do that either (say, your data is too complex to be able to be loaded into Sciveyor, or you want to perform a kind of analysis that isn’t currently available to us), you’ll have to work a little bit harder.
What kinds of things can Sciveyor do? The easiest way to see a quick list of the kinds of questions that you can ask is to read through the help site, which briefly summarizes the kinds of broad questions that the system enables you to ask.
It’s here that other tools make an appearance. At various points in the lab, we’ve had experience with a variety of network analysis systems, and with DH code itself written in programming languages as diverse as Python, Ruby, and Go. We’ve also used some generally available DH tools, such as Voyant and Mallet. Some of these systems (Voyant in particular) are point-and-click, and take little time to learn. Building custom code in Python, for instance, will vary wildly in difficulty, depending on what’s already available in terms of sample code from which you could work as a base.
What kinds of other external tools have we already used? Here’s a brief list:
- Topic modeling, through either Mallet or the Python toolkit Gensim
- Word embedding (such as the word2vec toolkit), to measure relationships between words or similarity between documents
- Network analysis, including citation networks, which we plan to integrate with Sciveyor but have not yet
- And, in general, if you find a tool online (even better if it’s open source and freely available), we can probably find a way to integrate it with our data, at least privately for purposes of one project!
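To give a flavor of what the word-embedding tools in the list above actually produce: they map each word to a vector of numbers, and “relatedness” becomes the cosine similarity between vectors. The vectors below are entirely made up for illustration (real word2vec embeddings have hundreds of dimensions learned from a corpus), but the computation is the standard one.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean
    the vectors point the same way; near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Entirely made-up 4-dimensional "embeddings" for illustration only.
vectors = {
    "gene":   [0.9, 0.8, 0.1, 0.0],
    "genome": [0.8, 0.9, 0.2, 0.1],
    "treaty": [0.1, 0.0, 0.9, 0.8],
}

print(cosine_similarity(vectors["gene"], vectors["genome"]))  # high
print(cosine_similarity(vectors["gene"], vectors["treaty"]))  # low
```

The same measure, applied to vectors built for whole documents rather than words, is one common way of quantifying document similarity.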
The Deep End #
In large part, what will be open to you in more complex cases depends on what you already know about programming (and the console, scripting, and so forth) and how much time you would like to dedicate to training. As a rule, I have no problem teaching specific DH techniques to people who want to come learn them with the lab, up to and including working on writing new analysis code. But I’m not very good at teaching the fundamentals of command-line usage, basic programming languages, and those other kinds of “prerequisite” skills. If you’re interested in really going deep into programming but don’t have those skills yet, you should think about where else you could obtain them – for example, a number of online courses can get you up to speed quickly. (If you’ve used some that you can recommend, let me know and I can put some links here!)