August 5, 2013

Archaeology and GitHub

When I was an undergraduate, I had an awesome class in archaeological statistics at UVa.  We were tasked with doing a final project using actual archaeological data and manipulating the data using statistics.  I wanted to do something osteology-related and, in 1998, this involved going to the library and poring through archaeological site reports until I found one, from Egypt, that provided some metric data of skull size and shape. The data were hardly contextualized, languishing in an appendix. This was cumbersome, to say the least, and I learned as much about data entry as I did about archaeological statistics with this project.

It may be surprising, then, that osteological data access isn't much better in the new millennium.  Many osteological data are still relegated to appendices of site reports, which makes them difficult to find and use. New policies such as the NSF's data management plan, implemented in 2010, should mean that archaeological and osteological data are brought to the fore; in reality, though, this doesn't seem to be the case.  Even when osteological data are published, it's usually as static charts or tables, not in any sort of digital, database-friendly format that could be imported into Excel or SPSS.  And unless you have a student willing to type in all the data for you, this presents a barrier for the busy scholar who wants to do cross-site and cross-cultural research.

[Image caption: For some reason, the Octodex doesn't have an ArchaeOctocat. The Octocat de los Muertos will have to do instead...]

Fortunately, in the past few years archaeologists have been leading the way in opening up their excavations and the resulting data to their colleagues and even the public.  This embrace of openness seems to come from the broader open-source movement that started in computer science in the late 90s and expanded into academia as a whole, most notably through open-access publishing, which also took off in the late 90s with the rise of the Internet.  Plenty of other archaeologists have written about opening up their research and their rationale behind it, so I won't repeat those excellent arguments.  Although I'm very open with the products of my research -- with giving out my dissertation and articles -- and often use this blog as something of an "open notebook" to work through ideas and results, I actually haven't been great at sharing the data files themselves.  So I'm starting to remedy that.

Currently, I'm putting up information on GitHub, and you can click through to see all the things that I've posted so far.  GitHub is mainly geared towards software developers, to aid them in working on collaborative projects, but it made some headlines recently when the White House decided to post a bunch of policy documents on GitHub. It's free for open-source projects and fairly cheap if you want to keep your data totally or partially private. There are a bunch of archaeologists and digital humanists there, posting a variety of interesting stuff. But I was convinced to join when I learned that GitHub will let you visualize .stl files (3D models) in your browser.

(Full disclosure: My husband works for GitHub, so he's tasked me with proselytizing the benefits of it to other archaeologists.  I'm exaggerating... but only a little.)

Some stuff I am posting:
  • Syllabi.  Don't just use my syllabi for ideas... fork them, and post your own!  I posted these spiffed-up syllabi that I created with the hope that people will use them as they see fit and post their own syllabi.  I love reading others' class syllabi; it makes for good ideas for my own classes, as I can pick and choose from a variety of activities, lecture topics, and bibliography entries.  
  • 3D models.  Just one posted so far, from the Medieval Berliners project.  I have been slow to learn the 3D scanning and modelling software, but I did scan and photograph all the teeth from this project before drilling into them. Also check out hacky486, one of our grad students, who is doing more with modelling than I am.
  • Osteology Database.  I posted a blank version of the osteology database I designed in 2007 to collect data in Rome.  It has a few updates from 2010.  For the osteologists among you, it's based largely on Standards but more user-friendly (I think) than the Smithsonian's free Osteoware. Mostly, I posted this database here to have somewhere for people to download a big file.
  • Data from Published Articles.  Want to snag my Sr/O/C/N/Pb isotope data from my articles, but don't want to type it all in?  Check out this repository of all the raw data from my 2010 dissertation (and some data from an article that wasn't published in the diss).  I'll probably be updating this file with more contextual information as I go.
That last one is definitely a sticking point.  I haven't published all of these data yet.  That is, although most of these data can be found in my dissertation, there is an entire Access database chock full of information that will go in a couple articles I'm still working on.  I do want to post the entire database for comparative research purposes (since what Roman bioarchaeology needs is a good data set from Rome!), but I also want to keep my job.  So I'm trying to strike a balance by publishing on GitHub those data that are already out there -- in the diss or in articles.  I feel like a bad open-access enthusiast for embargoing the data like this, but I have multiple reasons, some of which I could explain here and others I don't feel comfortable discussing in a public forum.

So while I'm not exactly using GitHub for its intended purpose, I hope my opening up of data and ideas will be useful or inspirational to others.  If GitHub's not for you, though, go check out OpenContext, the brain-child of Eric Kansa, which is fantastic and might be a bit more social-scientist-friendly. (While OpenContext is awesome, it feels more like a platform for dissemination rather than collaboration, another reason I'm trying out GitHub first.)

Finally, for more help in using GitHub and what it all means, check out the great posts at ProfHacker at the Chronicle of Higher Education -- the "Fork the Academy" essay has links to all posts in the series.

If you're on GitHub, let me know in the comments!  I'm interested in seeing how others are using the site to share data...

4 comments:

Stefano Costa said...

Dear Kristina,
I read your post with a lot of interest. I have been using GitHub and similar services for years, primarily to share software rather than data, but I really like the approach taken by Open Context and now the one you are proposing. Even though I am a proponent of more structured opening of data, I realise that the barriers are really high (in terms of infrastructure and effort needed to transform data into truly open formats). So I hope the following criticism will be taken as a (collective) encouragement towards good practices.

1. Formats. Publisher files, Access databases and even Excel spreadsheets are unpleasant to work with, not only because they require a specific software program, but also because they are opaque to web crawlers and make it more difficult even to quickly skim through the data. On the other hand, lots of web apps (such as CKAN and GitHub itself) natively provide previews for CSV and JSON data. Open formats are really more enjoyable and easier to manipulate, e.g. to create linked data. After all, there's a reason why the tooth is in STL and not in the native scanner format.

2. Licensing. GitHub does not enforce the use of a specific license, but it is good practice to have one in place. Data is not software, so I suggest taking a look at the open flavours of Creative Commons licenses. For raw data (such as the isotope data) you may just want to choose Creative Commons Zero, a public-domain-like statement. Licensing is crucial for scaling: if I wanted to collect all the cool and crazy data that archaeologists have been pushing to GitHub to make a curated collection, without a license I would have to contact every single author: doable, but not much in the spirit of open data.

There are more detailed guides on how to use git repositories for open data, e.g. this one http://blog.okfn.org/2013/07/02/git-and-github-for-data/ by Rufus Pollock of the Open Knowledge Foundation. I think we should try to find common ground and provide some good examples for others to follow. I put some data from my dissertation online in a similar fashion years ago; the format is open but perhaps too obscure for a start: https://bitbucket.org/steko/thesis-app/src/3d4de887d0737c1bfd05fe83642b1fbc0848a42b/tesi/fixtures/dump.json?at=default. It is also worth noting that there are services like figshare where you can upload your data in a more formal (yet easy to use) way.

Thank you for making such an inspiring move. I hope others will follow in your footsteps.

Eric Kansa said...

Dear Kristina,

Thanks for this post! It's exactly the kind of direction that we need to promote in archaeology.

I think GitHub is an excellent near-term platform for data-sharing and collaboration. It would be even better if it interfaced with university libraries or "memory institutions" so that the data and version history can be archived by institutions dedicated to that kind of mission. As much as I like GitHub, it is a for-profit company, and some of those have been known to "turn evil" if they sense a need to change their business model.

I think it's the digital library community that needs to get its act together with respect to GitHub, since GitHub already offers all the APIs needed for libraries to integrate and provide longevity for the data people are sharing and versioning in GitHub.

As you pointed out, Open Context is mainly a dissemination venue. We emphasize APIs more than social software. That's not because we don't value collaboration; it's just that we feel like we don't need to reinvent that wheel. GitHub does it so well already. With Open Context, we've been using GitHub now for over a year as a secondary channel to share (and fork) data.

Right now, we only put XML data in it. That's a barrier to many people who aren't uber-geeks. But we're revising our approach to (CSV) table outputs and will soon have lots of tabular data in GitHub also. Including lots of data describing bones.

Last, our use of GitHub is sorta infamous (see their Robots.txt file, a file that directs search engine bots). It'll be interesting to see where this goes in the future as more researchers make more use of GitHub for more and bigger datasets.

Kristina Killgrove said...

Stefano - Thanks for these suggestions! I am quite new at both GitHub and at thinking about how to open up my data, so your comments are really helpful. In particular, I didn't know that GitHub could show .csv files. It's easy enough for me to post those instead of Excel files.

I'll have to think about other ways to post the syllabi and database -- part of the reason I'm proud of them is their appearance, which rests in the program they were written in.

I'll also think about licenses -- I kind of assumed that anything I posted was basically CC zero unless otherwise noted, but apparently that assumption was wrong.

Eric - Thanks for your comment. That's great that you plan to post tabular data to GitHub! I'm definitely not an uber-geek (that title I leave to my better half), but I'm slowly getting used to working with new programs and file types. I'll be looking forward to seeing how you do it.

And I really hope GitHub doesn't turn evil! It would make my husband so sad... he loves working there.

Shawn Graham said...

Hi Kristina,

Thought you'd like the link to my github too - https://github.com/shawngraham

I'm also using figshare to share datasets, drafts etc. You might want to explore figshare too- http://figshare.com/authors/Shawn_Graham/97736
