Data Science Tutorials for Beginners:

A Digital Humanities Project for History Scholars: How to Analyze & Visualize Research Trends Using Academic Journal Data in Korean History

About

This website offers a basic online tutorial on data analysis and metadata research for librarians and for scholars in history and digital humanities. It is intended especially for beginning researchers who have no prior experience in data science and want to create their own data analysis projects. In this tutorial, we use an online academic journal database for Korean history, Hanguksa Yeongu Hwibo (The Bulletin of Korean Historical Research). This database is available in Korean.

Below you will find a brief overview of data analysis and visualization that can help you plan how to conduct your analysis. In the step-by-step tutorials, you will learn how to find, clean, and explore your data with various data analysis tools. Although most tools used in this tutorial require no knowledge of code-writing, some basic coding skills are useful for working with the relevant packages in R, Python, and p5.js. The goal of this project is to introduce basic data analysis and visualization using various free online toolkits. By working through this sample analysis, you will learn which resources can be used in which formats and how, as well as how to choose appropriate sources and tools for your research. After this tutorial, you can try more advanced data analysis tools such as scikit-learn, Stylo, and Gephi for other types of data analysis and visualization.

What is Hanguksa Yeongu Hwibo (The Bulletin of Korean Historical Research)?


Hanguksa Yeongu Hwibo (The Bulletin of Korean Historical Research) is a Korean history database and biannual periodical that compiles a list of publications on Korean history research. It provides the most extensive and detailed information on Korean history research topics by offering the original text of important historical materials, surveying trends in academic Korean history research, and cataloguing literature titles.

It is a collection of books and papers classified by period (e.g., general theories, prehistoric times, ancient times, Goryeo, Joseon, the modern period, contemporary history). It provides up-to-date bibliographical information on almost all research output concerning Korean history, both inside and outside Korea. As of 2019, the database contains 45,230 books and 188,214 theses, all available online (http://db.history.go.kr/item/level.do?itemId=hb).

It is maintained by the National Institute of Korean History (NIKH). The NIKH is a South Korean national organization in charge of researching, collecting, compiling, and promoting the study of historical materials on Korean history. It was established in March 1946, one year after the liberation of Korea, as Guksagwan, a national organization that systematically researches, collects, preserves, compiles, and distributes the historical materials that record important events in Korean history and are necessary for Korean history research. In 1949, the name was changed to the current one. To date, the NIKH publishes studies and sourcebooks on Korean history to encourage research in the field.

Download a Sample Dataset

Download this sample dataset and follow the step-by-step tutorials below to practice data analysis and visualization.
This test dataset includes 11,352 publication records and their metadata, created between April 2017 and April 2020.
Our preprocessed dataset can help you start your analysis more easily.

You can also try the tutorials with your own dataset to explore your data.

Before Analysis

Below you will find a list of tutorials for using different toolkits. 

Before you move on to the tutorials, we will walk you through the three basic steps you need to take before starting your analysis.

1. Right Questions

The first step in starting your analysis is to ask the right question. Before diving into the analysis, take time to think about what you need to ask. What are the right questions for a proper data-driven decision-making process? Let's find out.

What do you want to know? What exactly do you want to find out?
What problems are you trying to solve?
What are you actually trying to do with your dataset?

What questions can you investigate with your dataset?
What do you expect to achieve by analyzing your dataset? 
How exactly are you going to measure it?

Make your question as specific as possible and clarify the exact parameters to be tested.

In this project, our analysis aims to answer these questions:
What are the most common research topics in each time period?
Who are the main publishers and where are they located? 
2. Data Collection

Once you have decided on the right question, it becomes much easier to find the right dataset for your analysis. But you still need to think about the following questions about your dataset.

Where can you obtain your data? Where will it come from?
How can you be sure that your dataset is reliable?

For this exercise, the original dataset has been downloaded from http://db.history.go.kr/item/level.do?itemId=hb.
3. Data Cleaning

The next step is to roll up your sleeves and start cleaning your data to ensure that your dataset is consistent, reliable, and ready to be used. To do that, you need to identify incorrect, inaccurate, or irrelevant parts of your data and fix them. This process can include spell checking, finding and replacing values, removing duplicates, extra spaces, or other unnecessary parts, merging or splitting columns, rearranging data, et cetera. For these tasks, you can use Excel, Python, OpenRefine, Data Wrangler, and other data cleaning tools; a few of these operations are sketched in pandas below.
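If you prefer to script this step, here is a minimal pandas sketch of a few of these operations. The file name (publications.csv) and column names (title, publisher) are hypothetical, not the actual fields of our dataset:

```python
import pandas as pd

# Hypothetical input file with 'title' and 'publisher' columns.
df = pd.read_csv("publications.csv")

df = df.drop_duplicates()                 # remove duplicate rows
df["title"] = df["title"].str.strip()     # trim leading/trailing spaces
df["publisher"] = df["publisher"].str.replace(
    "Univ.", "University", regex=False)   # find/replace a value
```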

These are the questions you can ask at this step.
Does your dataset make sense? Are there any data quality issues?
What kinds of cleaning or manipulating will your dataset require?
What tools will you need to clean and modify your dataset? 

For our analysis, we extracted only each publication's material ID, classification number, period, author(s), title, periodical (with volume), publisher, publication date, and URL from the original dataset. The resulting sample dataset contains 11,352 publication records and their metadata.

Data Cleaning (Example)


First, we downloaded the raw .txt file from the Hanguksa Yeongu Hwibo website and imported it into Excel.
Removing irrelevant parts

Since the original dataset is a continuous text string in a single column of the worksheet, we need to separate it into individual columns by category. To do that, we split the text on a delimiter character (e.g., a comma, tab, space, or semicolon); in our case, a comma. In Excel, go to the Data tab on the ribbon and click the Text to Columns icon in the Data Tools group, then select the Delimited option to split the text into separate columns.
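The same split can be scripted in Python. Here is a minimal pandas sketch, assuming a comma-delimited raw file; the file name (hwibo_raw.txt) and column names are hypothetical stand-ins for the fields listed above:

```python
import pandas as pd

# Scripted equivalent of Excel's Text to Columns with a comma delimiter:
# read the raw text and split each line into named columns.
columns = ["material_id", "classification", "period", "authors", "title",
           "periodical", "publisher", "pub_date", "url"]
df = pd.read_csv("hwibo_raw.txt", sep=",", header=None,
                 names=columns, encoding="utf-8")
```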

Arranging data order & checking errors

We extracted the data needed for our analysis (material ID, classification number, period, author(s), title, periodical (with volume), publisher, publication date, and URL).

After getting rid of unnecessary data values, we sorted the list by material ID and corrected entries that were in an inaccurate format.
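In pandas, the sorting and some basic error checks might look like this (continuing the hypothetical column names from the sketch above):

```python
# Sort the records by material ID and look for obvious problems:
# missing values and duplicated IDs.
df = df.sort_values("material_id").reset_index(drop=True)
print(df.isna().sum())                        # missing values per column
print(df["material_id"].duplicated().sum())   # number of repeated IDs
```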

Simplifying data

For our analysis, we only need the publication years. Thus, we converted the publication dates to a year format.
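A short pandas version of this conversion, again using the hypothetical pub_date column; errors="coerce" turns unparseable dates into missing values so they can be inspected later:

```python
import pandas as pd

# Convert the full publication date to a four-digit year;
# unparseable dates become NaT/NaN rather than raising an error.
df["pub_year"] = pd.to_datetime(df["pub_date"], errors="coerce").dt.year
print(df["pub_year"].value_counts().sort_index())  # publications per year
```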

Well done, it looks much better! Now our dataset is ready to be analyzed.


WordCloud

A word cloud can be generated with various open-source toolkits (e.g., MonkeyLearn WordCloud Generator, WordArt.com, Wordclouds.com, TagCrowd, Tagxedo, and Python). This word cloud was generated using Jason Davies's Wordle-inspired word cloud generator, which is written in JavaScript and available on GitHub under an open-source license as d3-cloud.

The layout algorithm itself is incredibly simple.
For each word, starting with the most important: the algorithm attempts to place the word at some starting point, usually near the middle or somewhere on a central horizontal line. If the word intersects any previously placed words, it moves the word one step along an increasing spiral, and it repeats this until no intersections are found.
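To make the idea concrete, here is a toy Python sketch of the same greedy spiral placement. It is not Davies's implementation (d3-cloud uses real glyph metrics and pixel-level collision masks); word sizes here are crude bounding-box estimates:

```python
import math

def place_words(words, width=800, height=600):
    """Greedy spiral layout in the spirit of d3-cloud: place words in
    decreasing order of weight, starting near the center and walking
    outward along an Archimedean spiral until there is no collision."""
    placed = []  # (x, y, w, h, text) for every word placed so far

    def overlaps(x, y, w, h):
        return any(x < px + pw and px < x + w and
                   y < py + ph and py < y + h
                   for px, py, pw, ph, _ in placed)

    for text, weight in sorted(words, key=lambda wv: -wv[1]):
        # Crude bounding-box estimate standing in for real text metrics.
        w, h = 0.6 * weight * len(text), float(weight)
        t = 0.0
        while 2 * t <= max(width, height):    # give up beyond the canvas
            r = 2 * t                         # spiral radius grows with t
            x = width / 2 + r * math.cos(t) - w / 2
            y = height / 2 + r * math.sin(t) - h / 2
            if not overlaps(x, y, w, h):
                placed.append((x, y, w, h, text))
                break
            t += 0.1                          # one step along the spiral
    return placed

print(place_words([("조선", 40), ("한국", 30), ("연구", 25)]))
```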

From this word cloud visualization, we can see that the most frequently mentioned words are:
Chosun (조선), Korea (한국), Research (연구), Investigation (검토), Baekje (백제)*, Silla (신라)**, Excavation (출토), Goguryeo (고구려), Liberation (해방), Movement (운동), and Japanese Colonial Period (일제강점기).

*One of Korea's so-called "Three Kingdoms," along with Goguryeo to the north and Silla to the east. It ruled over the southwestern part of the Korean Peninsula from 18 BCE to 660 CE.
**Silla, or Shilla (57 BCE – 935 CE), was a Korean kingdom located in the southern and central parts of the Korean Peninsula.


Mapping

Tableau can be a free and easy mapping tool. Using Tableau, you can also make an interactive map. Tableau divides your data into dimensions (independent variables) and measures (numeric values), and classifies them into data types (text, number, date, etc.). Depending on the data source, Tableau can automatically geocode the geographic dimensions, but sometimes it fails to recognize a data field correctly.

For this exercise, I first extracted the latitude and longitude of each publisher's city from Google Maps and then assigned a Geographic Role to these values. I created the map by putting the longitude data on Columns and the latitude data on Rows. To create a map that shows a point for the location of each publisher, I dragged the Location dimension onto the Detail button in the Marks pane. To show how many publishers are located in each city, I dragged the Publisher dimension onto the Size button in the Marks pane.

Since there are many publishers in the dataset, I distinguished them by color by dragging the Publisher dimension onto the Color button in the Marks pane. I also made a legend to show the color of each publisher.
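As an alternative to copying coordinates from Google Maps by hand, you could geocode the city names in Python with the geopy package. This was not part of our original workflow; a minimal sketch:

```python
from geopy.geocoders import Nominatim

# Nominatim is OpenStreetMap's free geocoding service; the user_agent
# string just identifies your project to the service.
geolocator = Nominatim(user_agent="hwibo-mapping-tutorial")

for city in ["Seoul, South Korea", "Busan, South Korea"]:
    location = geolocator.geocode(city)
    print(city, location.latitude, location.longitude)
```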

From this map, we can see that most publishers, including International Journal of Korean History, 건지인문학, and Acta Koreana, are located in Seoul, South Korea.

We can also see that most publications are from 건축역사연구, The Review of Korean Studies, Sungkyun Journal of East Asian Studies, 강원사학, and 강좌미술사.



Voyant Tools


Voyant Tools is an open-source, web-based application for performing text analysis.
The platform offers various text analysis functions for digital texts and corpora.
You can simply type in URLs, paste a full text, or upload a file.
Term
This is the term in the corpus.

Chosun (조선) and Late Chosun period (조선후기) were the most frequent words in our dataset.

The other keywords were mentioned most frequently in the following order:
Korea (한국), Chosun period (조선시대), Silla (신라), Goryeo (고려), 19, Change (변화), Baekje (백제), Japanese Colonial Period (일제강점기), and Goguryeo (고구려).

Count
This is the frequency of the term in the corpus.

The keywords Chosun and Late Chosun period were mentioned 512 times and 301 times, respectively, in our dataset.

The other keywords, Korea (한국), Chosun period (조선시대), Silla (신라), Goryeo (고려), 19, Change (변화), Baekje (백제), Japanese Colonial Period (일제강점기), and Goguryeo (고구려), were mentioned 287, 276, 227, 223, 208, 180, 159, 158, and 153 times, respectively.

Relative & Trend
Relative is the relative frequency of the term in the corpus, per one million words. (Sorting by count and by relative should produce the same results; relative frequencies are most useful when comparing against another corpus.)

Trend is a sparkline graph that shows the distribution of relative frequencies across documents in the corpus (if the corpus contains more than one document); you can hover over the sparkline to see finer-grained results.
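Voyant computes these columns for you, but they are easy to reproduce yourself. A minimal Python sketch on a toy token list, where Relative = occurrences of the term per one million corpus words:

```python
from collections import Counter

# Toy token list standing in for the tokenized corpus.
tokens = "조선 후기 조선 한국 신라 조선 고려 한국".split()
counts = Counter(tokens)
total = len(tokens)

for term, count in counts.most_common():
    relative = count / total * 1_000_000   # per one million words
    print(f"{term}\t{count}\t{relative:,.0f}")
```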



Bar Chart (Periods)
Chosun > Contemporary > Modern > Ancient > Goryeo > Prehistory

Bar Chart (Example)

This example chart was created in the p5.js web editor (https://editor.p5js.org/). You can check this reference for more details.
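The chart on this page was made with p5.js, but an equivalent figure is easy to sketch in Python with matplotlib. The counts below are placeholders, not the actual totals from our dataset:

```python
import matplotlib.pyplot as plt

# Placeholder counts in the order shown above; substitute the real
# per-period totals from your cleaned dataset.
periods = ["Chosun", "Contemporary", "Modern", "Ancient", "Goryeo", "Prehistory"]
counts = [3200, 2500, 2100, 1400, 1200, 300]   # hypothetical values

plt.bar(periods, counts)
plt.ylabel("Number of publications")
plt.title("Publications by Period")
plt.tight_layout()
plt.show()
```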

Cytoscape

You can check the interactive version here.

Insights

If at any point you want virtual or in-person help, please feel free to contact us. See the Contacts section below for our contact information.

Contacts

Share your thoughts

Email

seul@g.ucla.edu

