Equipment, tools, and applications
Below are some of the resources and equipment at the CDHU, used for example in projects and workshops. You are also welcome to visit our GitHub page for further resources we have been working on at the CDHU and with our collaborators. Feel free to contact us regarding your project or research needs, and keep an eye on our workshop events!
Virtual reality headsets
The CDHU has three Oculus Quest 2 headsets which can be used for custom 3D virtual reality visualisations as well as for exploring VR representations in popular culture (VR games and experiences).
NodeGoat GO and NodeGoat server
The NodeGoat server hosted at the CDHU runs the installed NodeGoat software for multiple parallel projects, each accessible to multiple users. The CDHU administers these research environments, in which datasets can be modelled and configured collaboratively or individually. NodeGoat provides geographic (spatial) and temporal visualisations as well as a built-in network analysis tool.
Recogito and Recogito server
Semantic annotation without the pointy brackets! Work on texts and images, identify and mark named entities, use your data in other tools, or connect it to other data on the Web. Recogito offers semantic annotation and connections to online data without the need to learn to code, as well as a customised connection to gazetteers for Mediterranean archaeology.
Computation and storage
Local computation and storage capabilities at the CDHU are housed on three servers: “Beast” and a second (yet unnamed!) workstation, both used internally for fast computation and information processing, and “Beauty”, used for network-attached storage.
- 1 Dell workstation/server (“Beast”), used internally for computation in the CDHU technical infrastructure. CPU: 2x Intel(R) Xeon(R) Gold 6240R @ 2.40GHz, 96 threads. GPU: 2x Nvidia Turing T4, 16 GB VRAM. RAM: 64GB. Storage: Disk array set up in ZFS mirror mode, providing around 12TB of usable storage.
- 1 Dell workstation/server (“Beauty”), used for network-attached storage in the CDHU technical infrastructure. CPU: Intel(R) Xeon(R) Gold 6226R @ 2.90GHz, 32 threads. RAM: 64GB. Storage: Disk array set up in ZFS triple parity (RAID-Z3), providing 50TB of usable storage.
- 1 Dell workstation/server (still unnamed), used internally for computation at the CDHU (not yet set up). CPU: 2x Intel Xeon Platinum 8260 @ 2.40GHz, 96 threads. RAM: 1TB. GPU: 3x Nvidia RTX A5000, 24GB VRAM. Storage: 15TB in total.
Document scanner and paper guillotine
Research being conducted at the CDHU in collaboration with the Department of History of Science and Ideas presently makes use of a Canon G2090 document scanner for fast mass scanning of stacks of documents up to A3 size. The machine scans at 300-600 dpi, roughly 200 pages per minute, with an automatic document feeder. Used in conjunction with the guillotine to remove spines from books and bound volumes, this enables fast, though destructive, digitisation of cultural heritage materials.
Software, scripts and models
Attention HTR model
AttentionHTR is an attention-based sequence-to-sequence model for handwritten word recognition. To overcome training data scarcity, this work leverages models pre-trained on scene text images as a starting point for tailoring the handwriting recognition models. Source code and pre-trained models are available on GitHub.
Marginalia and machine learning (Pytorch)
For the detection of text written in document margins or handwritten notes, we have a PyTorch implementation of a Handwritten Text Recognition (HTR) system that focuses on automatic detection and recognition of handwritten marginalia texts. A Faster R-CNN network is used for detecting marginalia, and AttentionHTR for word recognition. The data comes from early (printed) book collections held at the Uppsala University Library that contain handwritten marginalia. Source code and pre-trained models are available on GitHub. This is work in progress.
Word rain: Semantically motivated word clouds
This software library for text visualisation builds on shallow and deep neural networks. It is a work in progress led by the CDHU, in collaboration with Språkbanken Sam.
See the DHNB2023 presentation by Skeppstedt & Ahltorp, “The words of climate change: TF-IDF-based word clouds derived from climate change reports”.
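The TF-IDF weighting behind such word clouds can be sketched in a few lines of Python. This is a generic illustration of the technique, not the Word rain library itself; the toy documents are invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for every word in every document.

    docs: list of tokenised documents (lists of lowercase words).
    Returns one dict of word -> weight per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Term frequency scaled by inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

docs = [["climate", "change", "report"],
        ["climate", "policy"],
        ["weather", "report"]]
w = tfidf(docs)
```

In a word cloud the resulting weight would drive the font size: in the second toy document, “policy” (which appears in only one document) outweighs the ubiquitous “climate”.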
Libralinked: Modelling of Scandinavian library data
These scripts generate interactive graphics and display them as HTML, following web scraping of the National Library of Sweden. The graphic generation can also be generalised to other sufficiently structured data, supplied for instance via a CSV file.
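The CSV-to-HTML idea can be illustrated with a minimal stdlib-only sketch that turns one CSV column into an inline SVG bar chart. This is a stand-in for the general pattern, not the Libralinked scripts; the column names and sample data are hypothetical.

```python
import csv
import io
from collections import Counter
from html import escape

def csv_to_html(csv_text, column):
    """Render the value counts of one CSV column as an HTML bar chart.

    Counts each distinct value in `column` and emits one SVG bar per
    value, labelled with the value and its count.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row[column] for row in rows)
    bars = []
    for i, (value, n) in enumerate(sorted(counts.items())):
        bars.append(
            f'<rect x="0" y="{i * 22}" width="{n * 40}" height="18" fill="steelblue"/>'
            f'<text x="{n * 40 + 4}" y="{i * 22 + 14}">{escape(value)} ({n})</text>'
        )
    svg = f'<svg width="400" height="{len(counts) * 22}">{"".join(bars)}</svg>'
    return f"<html><body><h1>Records by {escape(column)}</h1>{svg}</body></html>"

# Toy bibliographic data: two records from 1901, one from 1902.
sample = "title,year\nA,1901\nB,1901\nC,1902\n"
html_page = csv_to_html(sample, "year")
```

Writing `html_page` to a file and opening it in a browser shows one bar per year; a real script would swap the hand-rolled SVG for a charting library.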
Epub text-extraction tool
This is a tool for extracting textual data from EPUB books. The scripts convert EPUB files into plain-text (txt) files and compute basic statistics such as the number of words and the most frequent words.
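Since an EPUB is essentially a ZIP archive of XHTML files, the extraction step can be sketched with the standard library alone. This is a simplified illustration of the approach (tags are stripped with a regex rather than a full XHTML parser), not the CDHU tool itself; the in-memory one-chapter "book" exists only for demonstration.

```python
import io
import re
import zipfile
from collections import Counter

def epub_to_text(epub_bytes):
    """Extract plain text from an EPUB (a ZIP of XHTML files)."""
    chunks = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                markup = zf.read(name).decode("utf-8", errors="ignore")
                # Crude tag stripping; adequate for word statistics.
                chunks.append(re.sub(r"<[^>]+>", " ", markup))
    return " ".join(chunks)

def word_stats(text, top=5):
    """Return the total word count and the `top` most frequent words."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return len(words), Counter(words).most_common(top)

# Build a toy one-chapter "EPUB" in memory for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ch1.xhtml", "<html><body><p>the cat and the hat</p></body></html>")
total, common = word_stats(epub_to_text(buf.getvalue()))
```

A real EPUB would additionally have its reading order taken from the package manifest; for frequency statistics, iterating over all XHTML members as above is usually sufficient.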
Scripts and notebooks for scraping SOU PDFs
This code scrapes all URLs to PDFs from Kungliga biblioteket (the National Library of Sweden) and outputs a CSV file. The repository includes a notebook that turns the CSV file into a download script; it also sanitises and normalises filenames.
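The two post-scraping steps, sanitising filenames and turning the CSV into a download script, can be sketched as follows. This is a generic stdlib illustration under assumed column names (`url`, `title`) and an invented sample row; the repository's own rules may differ.

```python
import csv
import io
import re
import unicodedata

def sanitise_filename(name):
    """Normalise a filename: fold to ASCII, keep safe characters, lowercase."""
    # NFKD decomposition lets accented letters drop to their base letter.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Collapse every run of unsafe characters into a single underscore.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_").lower()

def csv_to_download_script(csv_text, url_col="url", name_col="title"):
    """Turn a CSV of PDF URLs into a shell script of curl commands."""
    lines = ["#!/bin/sh"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        fname = sanitise_filename(row[name_col]) + ".pdf"
        lines.append(f'curl -L -o "{fname}" "{row[url_col]}"')
    return "\n".join(lines)

# Hypothetical scraped row; the URL and title are placeholders.
sample = "url,title\nhttps://example.org/a.pdf,SOU 1999:39 Översyn\n"
script = csv_to_download_script(sample)
```

Emitting a shell script rather than downloading directly keeps the scraping and the (potentially long-running) downloading decoupled, and the script can be re-run or resumed independently.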
Scripts for Optical Character Recognition in batches
This repository contains various scripts and tools for preparing (bursting, converting, renaming) PDFs and running OCR on them with Tesseract OCR. We also have an OCR program based on Pytesseract, a Python wrapper for Tesseract, which includes language models to enhance OCR performance.
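The batch step of such a pipeline can be sketched by building one Tesseract command line per page image. This is a minimal illustration, not the repository's scripts; the `swe` language code, paths, and PNG input format are assumptions, and the actual invocation is guarded so the sketch degrades gracefully when Tesseract is not installed.

```python
import shutil
import subprocess
from pathlib import Path

def tesseract_cmd(image, out_base, lang="swe"):
    """Build the command line for one Tesseract invocation.

    Tesseract writes its result to <out_base>.txt by default, so only
    the output base name (no extension) is passed.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang]

def ocr_batch(image_dir, out_dir, lang="swe"):
    """Run Tesseract over every PNG page image in a directory."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob("*.png")):
        cmd = tesseract_cmd(image, out_dir / image.stem, lang)
        if shutil.which("tesseract"):  # only invoke when the binary exists
            subprocess.run(cmd, check=True)

cmd = tesseract_cmd(Path("page_001.png"), Path("out/page_001"))
```

Bursting the PDFs into page images first (e.g. with `pdftoppm`) keeps each OCR job small and makes the batch trivially parallelisable.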
BERT text classification for Finnish OCR texts
This is a BERT text classifier for Finnish OCR texts, originally used for research on the commodification of wild lingonberries. The work is part of the Centre for Digital Humanities' Pilot Projects 2021-2022, within a project titled "Text Mining Commodification: The Geography Of the Nordic Lingonberry Rush, 1860-1910". Source code is available on GitHub.