For those interested in the subject of data mining, here is an article written by Diane Cabell and Yours Truly dealing with the issue of data mining in UK higher education. The introduction is reproduced below so you can decide whether you want to read further. Feel free to disseminate.
Data or text mining (hereafter called “content mining”) is a process that uses software to look for interesting or important patterns in data that might otherwise go unobserved. An example might be combining a database of journal articles about ground water pollution with one of hospital admissions to detect a pollution-related pattern of disease outbreaks.
It is also a useful tool in commerce. A credit card company might detect a correlation between purchases of tickets from a particular airline and purchases of certain types of automobiles, and develop a marketing program uniting the appropriate vendors. One McKinsey report states that the use of ‘big data’ in the sphere of public data alone could create €250 billion in annual value for Europe’s economy.
Content mining is increasingly accomplished by machine. Databases, particularly those produced by scientific research, are far too large to be scanned by the human eye. However, the right to mine data is not assured by law in most jurisdictions, and even where it is, the terms of access to the majority of research publication databases deny permission to do so. One recent study indicated that obtaining permission to mine the thousands of articles appearing on a single subject from a myriad of different publishers would require 62% of a researcher’s time. Many content owners, including research institutions, have yet to develop any policy on content mining.
This report will identify the main legal barriers to data mining and data reuse and make policy suggestions to guide governments, funding agencies, and research institutions. As the title suggests, the emphasis of the study is on legal issues that are specific to higher education institutions (HEIs).
The first challenge for this report is to attempt to delimit the subject matter, as various types of content are subject to automated analysis. HEIs can hold and share content in various formats; here are just a few examples:
- Text: published articles, book chapters, preparatory notes, working papers, reports, teaching materials, conference papers, presentations, theses.
- Datasets: statistical data, geolocation data, survey results, maps, figures, time series, genetic information, health records, computer logs.
- Multimedia: pictures, sound recordings, interviews, presentations, video.
Each of the above may be subject to a separate legal regime. In the interest of convenience and simplicity, whenever the report talks about database contents, it will make no distinction between text, data and multimedia unless this is clearly specified.