

      CHAPTER 2

       Principles of Data Visualization

      Information visualization aims to visually represent different types of data (e.g., geographic, numerical, textual, network) in order to enable and reinforce cognition. It offers intuitive ways of perceiving and manipulating information that amplify the overall cognitive performance of information processing, especially for non-expert users. Visual analytics combines information visualization with data exploration capabilities. It enables users to explore and analyze sets of information that are unknown in terms of semantics and structure, discover hidden correlations and causalities, and make sense of data in ways that are not always possible with traditional quantitative data analysis and mining techniques. This is of great importance given the massive volumes of digital information, concerning nearly every aspect of human activity, that are currently being produced and collected. The so-called Big Data era refers to this tremendous volume of information collected by digital means and analyzed to produce new knowledge in a plethora of scientific domains.

      The LOD cloud is one of the main pillars of the Big Data era. The number of datasets published on the Web, the amount of information they contain, and the interconnections established between disparate sources available as LOD are constantly expanding, making traditional ways of analyzing them insufficient and posing new challenges to how humans can explore, visualize, and gain insight from them.

      In this chapter, we present the basic principles and tasks of data visualization. We first describe the tasks involved in preparing and visualizing data. We then present the most popular ways of graphically representing data according to data type, and provide an overview of the main techniques that allow users to interact with the data. Finally, we present the main techniques used for visualizing and interacting with Big Data.

      Information visualization requires a set of preprocessing tasks, so that data can first be extracted from data sources, transformed, enriched with additional metadata, and properly modeled before being visually analyzed. An abstract view of this process, along with its constituent tasks, is shown in Figure 2.1. It presents a generic, user-driven, end-to-end roadmap of the tasks, problems, and challenges related to the visualization of data in general.


      Figure 2.1: Process for the visualization of data.

      The information is usually present in various formats depending on the source; the appropriate data extraction and analysis technique must be selected to transform the raw information into a more semantically rich, structured format. A set of data processing techniques is then applied to enhance the quality of the collected data; these include cleaning up data inconsistencies, filling in missing values, and detecting and eliminating duplicates. After that, the data is enriched and customized with visual characteristics, meaningful aggregations, and summaries that facilitate user-friendly data exploration; finally, proper indexing is performed to enable efficient searching and retrieval. The details of each step are presented below.
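      To make the data-quality step concrete, the following is a minimal sketch in Python, assuming tabular input handled with the pandas library; the column names and values are purely illustrative and not taken from any particular dataset.

import pandas as pd

# Hypothetical tabular input; column names and values are illustrative only.
raw = pd.DataFrame({
    "country": ["Greece", "greece ", "Italy", None],
    "year":    ["2019", "2019", "2019", "2020"],
    "value":   [7.5, 7.5, 9.1, None],
})

# Clean inconsistencies: normalize the coded attribute and the date attribute.
raw["country"] = raw["country"].str.strip().str.title()
raw["year"] = pd.to_numeric(raw["year"], errors="coerce").astype("Int64")

# Handle missing values (here, simply drop incomplete observations).
cleaned = raw.dropna(subset=["country", "value"])

# Detect and eliminate duplicate observations.
deduplicated = cleaned.drop_duplicates(subset=["country", "year"])
print(deduplicated)

      In a real pipeline, missing values might instead be imputed, and the normalization rules would be driven by the code lists of the target model.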

      Data Retrieval, Extraction. The first step concerns the retrieval and extraction of the data to be visualized. Raw data exists in various formats: books and reports describe phenomena in plain text (unstructured information); websites and social networks contain annotated text and semi-structured data; open data sources and corporate databases provide structured information. Data must first be retrieved in a digital format that is appropriate for processing (e.g., digital text files from newspapers). The core modeling concepts and observations are then extracted. Especially when the source data is plain text, this is usually performed in an iterative, human-curated way that progressively refines the quality of the extracted data. For structured data sources, the process involves extracting the source concepts and mapping them to the target modeling.
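      As a rough illustration of extracting an observation from plain text, the sketch below applies a single regular expression; both the example sentence and the pattern are hypothetical, and a real extraction step would be iterative and human-curated, as noted above.

import re

# Illustrative sentence; a real corpus would come from the retrieved documents.
text = "The unemployment rate in Greece was 17.3 percent in 2019."

# A hypothetical pattern capturing a measure, a reference area, a value, and a year.
pattern = re.compile(r"(?P<measure>[\w\s]+ rate) in (?P<area>\w+) was "
                     r"(?P<value>[\d.]+) percent in (?P<year>\d{4})")

match = pattern.search(text)
if match:
    observation = {
        "measure": match.group("measure").strip(),
        "area": match.group("area"),
        "value": float(match.group("value")),
        "year": int(match.group("year")),
    }
    print(observation)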

      Linked Data usually comes in structured formats, i.e., it conforms to well-defined ontologies and schemas. Linked Data is typically the result of a data extraction and transformation task that has already turned raw data into semantically rich representations. Therefore, LD visualization usually starts from the next step, that of preparing already structured data for visualization.

      Data Preparation. Input data is provided either in databases or in data files (e.g., .csv, .xml). This step involves identifying all the concepts within the input datasets and representing them in a uniform data model that supports their proper visualization and visual exploration. For example, the multidimensional model is widely employed in the social and statistical domains and represents concepts as observations and dimensions: observations are measures of phenomena (e.g., indices), while dimensions are properties of these observations (e.g., reference period, reference area). Thus, a first step is to analyze the datasets and identify the different types of attributes they contain (date, geolocation, numeric, coded list, literal). Each attribute is then mapped to the corresponding concept of the multidimensional model, such as dimension, observation, or coded list. In addition, data processing requires a set of quality-improvement activities that eliminate inconsistencies and violations in the source data. For example, missing or inconsistent codes are filled in for coded-list attributes, date and time attributes are transformed to the appropriate format, and numerical values are validated so that wrong values can be corrected. Moreover, input data from multiple sources usually contains duplicate facts, so the dataset must be deduplicated; deduplication is the process of identifying duplicate concepts within the input dataset based on a set of distinctive characteristics. A final task concerns enriching or interlinking the data with information from external sources. For example, places and locations are usually extracted as text; they can subsequently be annotated and enriched with spatial information (e.g., coordinates, boundaries) from external web services, or interlinked with geospatial linked data sources (e.g., geonames.org), for their proper representation on maps.
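      The attribute-analysis part of this step can be sketched as follows; the type-detection heuristics, the column names, and the mapping to multidimensional-model roles are simplified assumptions rather than a complete implementation.

import pandas as pd

# Hypothetical input dataset; in practice it would be loaded from a .csv or .xml file.
df = pd.DataFrame({
    "ref_period": ["2019-01", "2019-02"],
    "ref_area":   ["EL", "IT"],
    "unemployment": [17.3, 9.8],
})

def classify_attribute(series: pd.Series) -> str:
    """Very rough heuristics for mapping a column to a multidimensional-model role."""
    if pd.api.types.is_numeric_dtype(series):
        return "observation (measure)"
    if pd.to_datetime(series, errors="coerce").notna().all():
        return "dimension (reference period)"
    if series.nunique() < len(series) or series.str.len().max() <= 3:
        return "dimension (coded list)"
    return "literal"

for column in df.columns:
    print(column, "->", classify_attribute(df[column]))

      In practice, the detected roles would then drive the subsequent quality checks, for instance code-list validation for dimensions and range checks for measures.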

      Visual Preparation. This set of tasks involves the enrichment and customization of the data with characteristics that enable the proper visualization of the underlying information. These characteristics extend the underlying data model with visual information. For example, colors can be assigned to coded values, and different types of diagrams can be bound to different types of data: timelines to date attributes and maps to geographical values. Thus, customization and the building of the visual model are necessary tasks before data visualization. In addition, the production of visual summaries and highlights is a common task, especially in the visualization of very large datasets. Summaries provide the user with overviews of the visualized data, and visual
