Ontology-Driven Data Semantics Discovery for Cyber-Security

https://doi.org/10.1007/978-3-319-19686-2_1

Abstract

We present an architecture for data semantics discovery that extracts semantically rich content from human-readable files without prior specification of the file format. The architecture, which builds on work at the intersection of knowledge representation and machine learning, includes machine learning modules for automatic file format identification, tokenization, and entity identification. The process is driven by an ontology of domain-specific concepts, which also provides an abstraction layer for querying the extracted data. We give a general description of the architecture and details of the current implementation. Although the architecture can be applied in a variety of domains, we focus on cyber-forensics applications. The aim is to parse data sources, such as log files, for which no parsing and analysis tools are readily available, and to aggregate and query data from multiple, diverse systems across large networks. The key contributions of our work are: (1) an architecture that constitutes a substantial step toward solving a highly practical open problem; (2) one of the first comprehensive ontologies of cyber assets; and (3) the development and demonstration of an innovative, non-trivial combination of declarative knowledge specification and machine learning.
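
To make the flow described above concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a tiny hypothetical ontology (the concept names, parent concepts, and sample log lines are invented for illustration) and substitutes fixed regular expressions and whitespace splitting for the learned format identification, tokenization, and entity identification modules. It shows only the general idea that concepts from an ontology can drive entity identification and then serve as the vocabulary for queries.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-ontology: concept -> (parent concept, recognizer).
# ASSUMPTION: fixed regexes stand in for the learned recognizers.
ONTOLOGY = {
    "IPv4Address": ("NetworkAddress",
                    re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")),
    "Timestamp": ("TemporalEntity",
                  re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")),
    "UserName": ("Account", re.compile(r"^user=(\w+)$")),
}


@dataclass(frozen=True)
class Entity:
    concept: str  # most specific ontology concept that matched
    parent: str   # its parent concept in the toy ontology
    value: str    # extracted text


def tokenize(line):
    # ASSUMPTION: whitespace splitting replaces learned tokenization.
    return line.split()


def identify_entities(line):
    """Label each token with the first ontology concept that matches it."""
    entities = []
    for token in tokenize(line):
        for concept, (parent, pattern) in ONTOLOGY.items():
            match = pattern.match(token)
            if match:
                # Use the captured group if the pattern defines one.
                value = match.group(1) if match.groups() else token
                entities.append(Entity(concept, parent, value))
                break
    return entities


def query(lines, concept):
    """Return values of entities whose concept or parent equals `concept`."""
    return [
        entity.value
        for line in lines
        for entity in identify_entities(line)
        if concept in (entity.concept, entity.parent)
    ]


# Invented sample log lines, for illustration only.
LOG_LINES = [
    "2015-03-01T12:00:05 user=alice 10.0.0.7 login ok",
    "2015-03-01T12:01:10 user=bob 192.168.1.20 login failed",
]

if __name__ == "__main__":
    print(query(LOG_LINES, "NetworkAddress"))
    print(query(LOG_LINES, "UserName"))
```

Because the query names a concept rather than a column or byte offset, the same call would work on a differently formatted log, provided the recognizers still fire. That is the sense in which an ontology can act as an abstraction layer over heterogeneous sources.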
