Dataset Summarization Research Papers

Extracting core knowledge from Linked Data

2025, files.ifi.uzh.ch

Abstract. Recent research has shown the Linked Data cloud to be a potentially ideal basis for improving user experience when interacting with Web content across different applications and domains. Using the explicit knowledge of datasets,... more

descriptionView Paper arrow_downwardDownload

Thesis

by Wissem Labbadi

2024, PhD Thesis manuscript

descriptionView Paper arrow_downwardDownload

A knowledge pattern-based method for linked data analysis

by Aldo Gangemi

2023, Proceedings of the sixth international conference on Knowledge capture - K-CAP '11

We present a Linked Data analysis method which relies on knowledge patterns for constructing a logical architecture of the knowledge in a dataset. This can then be exploited to compare heterogeneous datasets, enhance interoperability... more

descriptionView Paper arrow_downwardDownload

Minimal Algorithmic Information Loss Methods for Dimension Reduction, Feature Selection and Network Sparsification

by Felipe S Abrahão

2022, arXiv Preprints

We introduce a family of unsupervised, domain-free, and asymptotically optimal model-independent algorithms based on the principles of algorithmic probability and information theory designed to minimize the loss of algorithmic... more

Thus, from Equation (1), we have that these bounds for algorithmic information loss or gain from edge perturbations have a linearly dom- inant dependence on the number of perturbations |F'|, except for a multiplicative logarithmic term depending on the edge density (and, if information is lost, also on the number of perturbations). In case the graph is (algorithmica ly) random, i.e. incompressible, then we get a linearly dominant dependence on the number |F'| of edges being perturbed and, unlike in t he latter general case of | Equation (1), the multiplicative logarithmic term depends only on the network size (see Corollary B.2). Moreover, turbations (i.e., |F'| = 1), t as one might expected, fo tight in general (see Corollary B.4). r single edge per- hese upper and lower bounds become really

FIGURE 1. In (A) and (B) we show a feature selection and image reduction by application of MILS, starting from the original and second step, highlighting the re- gions that are earmarked to be omitted (in grey) versus the features that are kept along the way, thereby opti- mally preserving the main properties of these objects, properties whose persistence enables a ranking of such features. Here can be seen how boundaries are favoured as key features. In (C) we see the image reconstruction by preservation of extracted features from the image with the highest algorithmic probability (indicated by red sig- natures in the first two image processes). The MILS algorithm selects features which are considered most im- portant in the compressed image reconstruction. the preservation of the features that contribute the most to the algo- rithmic description of the objects. In general, the extracted features will not be as clear as in these examples as they may pick more com- plicated patterns even not statistical based on algorithmic probability. Unlike statistical approaches, the algorithm can also approximate (and thus preserve/extract) features that are of an algorithmic nature and which are not statistically apparent as it was in this case (see [35, 41]) and next examples.

FIGURE 2. Minimal Information Loss Reduction, a MILS-based lossy compression algorithm. A: Row com- pression preserves features on even highly coarse-grained versions (B starts from A at 0% but sampled to make it a 100 x 100 image from the original 600 x 600) of the same image producing a cartoonish representation. The image in (B) reduced by 23% is 10% the size of the image at 0% in (A). C: Compressed formula by MILS vs ran- dom row deletion (D) with resulting images of the same size but MILS preserving text proportion and minimiz- ing information loss while random row deletion distorts the text. Fig. 2 shows the MILS algorithm applied to images, which we call MILR. Fig. 2.(A) shows how vertical and horizontal compression pre- serves features even if distorting the image hence showing a different mechanisms and goals than those from popular image compression al- gorithms. The purpose of image compression algorithms such as JPEG is to maximise storage compressibility and require a decoder, here the application is directly to the image itself both at storage and visualisa- tion stages. Also, our purpose is not to maximise storage compression but to minimize the loss of algorithmic content in the reduction pro- cess. Fig. 2.(C) shows how the algorithm preserves the main features of the image leaving almost intact the formulae.

FIGURE 3. MILS or neutral edge deletion (blue) out- performs random edge deletion (red) at preserving both edge degree distribution (top, showing removed edges) and edge betweenness distribution (bottom) on an Erdés- Rényi random graph of node size 100 and low edge den- sity (~ 4%) after up to 60 edges were removed (degree distribution comparison) and 150 edges were removed (edge betweenness) out of a total of 200 edges (notice also the scale differences on the x-axis).

FIGURE 5. MILS mean clustering coefficient preserva- tion against two other sophisticated graph sparsification methods based on graph spectral and transitive reduc- tion techniques on biological, electric and social networks taken from [22]. The transitive method does not allow se- ection of edges to be deleted, and in some cases it either ails to significantly reduce the network size if no cycles are present (such as, generally, in electric and genetic net- works) and/or takes the clustering coefficient to 0 (e.g. or protein networks) if cycles are only local. Compar- isons with other methods are unnecessary because they destroy local or global properties by design, such as clus- tering coefficients for the spanning tree algorithm.

FIGURE 6. Stacked histograms showing edge between- ness preservation of MILS versus spectral sparsification across different amilies of networks. The similarity in height of each segment is an indication of the preserva- tion of such pro yellow (original) perties. Blue bars (MILS) approximate bars better than spectral sparsification. On average MILS was 1.5 times the edge betweenness distribution of t hese representative graphs measured by the area similarity of the respective bars.

FIGURE 7. Stacked histograms showing the preservation of degree centrality after application of MILS versus spec- tral sparsification across different families of networks: bars with height closest to the original graph signify better preservation. Blue bars (MILS) approximate yel- low (original) bars compared with spectral sparsification. MILS only slightly outperformed spectral sparsification in this test but never did worse.

FIGURE 8. Stacked histograms showing eigenvector cen- trality preservation of MILS versus spectral sparsification across the different families of networks: bars with height closest to the graph’s original bar signify better preser- vation. Blue bars (MILS) approximate yellow (original) bars better than spectral sparsification both in distribu- tion shape and individual bar height. On average MILS preserved the eigenvector centrality distribution of these representative networks 1.5 times better.

FIGURE 9. The Datasaurus Dozen datasets [21] visu alised as binary arrays.

FIGURE 10. Scatterplots showing the initial algorith- mic complexity as approximated by BDM (top) and the change in complexity after pymils compressed each im- age to a 50% the original image size. mic complexity as approximated by BDM (top) and the

FIGURE 11. Example image of a face pasted above blank space before (top) and after (bottom) pymils, with a 50% size reduction.

FIGURE 12. Boxplots comparing pymils with random row/column deletion and PCA over various test images (top) and ECA images (bottom).

FIGURE 13. Boxplots comparing pymils with random row/column deletion and PCA over various test image ainadrante (ton) and BRCOA imace quadrante (hottom).

bits. From that, if r= Equation 27, we also analogously have as in |] F|, then Equation 19 If one wants tighter lower and upper bounds for multiple edge dele tions as a function of the edge density (i.e., the size of |E(G)| in com- N?—-N 2 parison to ), then one can re-write the statement of Theorem B.1

approximating algorithmic complexity in the preservation of any possi- ble feature of interest that contributes to the (algorithmic) information content of a network such as, evidently, its degree distribution and other graph-theoretic, algebraic or topological features, even those not neces- sarily captured by any graph theoretic measure or classical information approach [35, 36]. In Section 3.3 we will show that Algorithm 2 is deterministic, poly- nomially time bounded, and describes a criterion to select and remove the most neutral elements of an object. We employ this more efficient version in all our experiments, and even in this limited form the pro- cedure excels at preserving important characteristics of the networks under study. See Theorem 3.3.

which may not necessarily be the case. Algorithm 5 solves this prob- em by performing simultaneous perturbations on all edges with an information contribution of mznLoss. In Algorithm 5 we also introduce NFORANK, a method that produces a ranking of €1,...,ejz(q@) from east informative to most informative edge, i.e., a list of edges sorted in increasing order by their information contribution to G. This ranking acilitates the search for the most neutral elements of the system (see Section B.8), which in turn helps MILS preserve the components that maximise the information content of the resulting object. — As described in [32], one can easily modify these algorithms in order to deal with multiple edge deletions at once, as in I(G, F’), but possibly at the expense of much more computational resources. In addition, note that algorithms 4 and 5 may be applied, mutatis mutandis, to nodes or

descriptionView Paper arrow_downwardDownload

Extracting Core Knowledge from Linked Data

by Alessandro Adamou

2022, International Semantic Web Conference

Recent research has shown the Linked Data cloud to be a potentially ideal basis for improving user experience when interacting with Web content across different applications and domains. Using the explicit knowledge of datasets, however,... more

descriptionView Paper arrow_downwardDownload

Extracting Core Knowledge from Linked Data

by Alessandro Adamou

2021

Recent research has shown the Linked Data cloud to be a potentially ideal basis for improving user experience when interacting with Web content across different applications and domains. Using the explicit knowledge of datasets, however,... more

descriptionView Paper arrow_downwardDownload

Extracting core knowledge from linked data

by Aldo Gangemi

2018

Recent research has shown the Linked Data cloud to be a potentially ideal basis for improving user experience when interacting with Web content across different applications and domains. Using the explicit knowledge of datasets, however,... more

descriptionView Paper arrow_downwardDownload

An information services algorithm to heuristically summarize IP addresses for a distributed, hierarchical directory service

by Marcos Portnoi

2013, Grid Computing (GRID), 2010 11th IEEE/ACM International Conference on

A distributed, hierarchical information service for computer networks might use several service instances, located in different layers. A distributed directory service, for example, might be comprised of upper level listings, and local... more

descriptionView Paper arrow_downwardDownload

Extracting core knowledge from linked data

by Valentina Presutti

2011

Recent research has shown the Linked Data cloud to be a potentially ideal basis for improving user experience when interacting with Web content across different applications and domains. Using the explicit knowledge of datasets, however,... more

descriptionView Paper arrow_downwardDownload

Dataset Summarization

Related Topics