Ontology-Driven Data Semantics Discovery for Cyber-Security
https://doi.org/10.1007/978-3-319-19686-2_1
Abstract
We present an architecture for data semantics discovery that extracts semantically rich content from human-readable files without prior specification of the file format. The architecture, based on work at the intersection of knowledge representation and machine learning, includes machine learning modules for automatic file format identification, tokenization, and entity identification. The process is driven by an ontology of domain-specific concepts, which also provides an abstraction layer for querying the extracted data. We provide a general description of the architecture as well as details of the current implementation. Although the architecture can be applied in a variety of domains, we focus on cyber-forensics applications. The aim is to parse data sources, such as log files, for which no parsing and analysis tools are readily available, and to aggregate and query data from multiple, diverse systems across large networks. The key contributions of our work are: (1) an architecture that constitutes a substantial step toward solving a highly practical open problem; (2) one of the first comprehensive ontologies of cyber assets; and (3) the development and demonstration of an innovative, non-trivial combination of declarative knowledge specification and machine learning.
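To make the pipeline described above more concrete, the following minimal Python sketch shows one way tokenization, entity identification, and concept-level querying could fit together. It is an illustration under stated assumptions, not the authors' implementation: the abstract describes machine learning modules, whereas this sketch substitutes hand-written regular expressions, and the concept names, patterns, and sample log lines are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-ontology: concept name -> recognizer pattern.
# ASSUMPTION: the paper's ontology and its learned recognizers are not given
# in the abstract; these hand-written regexes are stand-ins for illustration.
ONTOLOGY = {
    "IPv4Address": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "Timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$"),
    "UserName": re.compile(r"^user=(\w+)$"),
}


def tokenize(line):
    """Split a log line on whitespace (a stand-in for a learned tokenizer)."""
    return line.split()


def identify_entities(tokens):
    """Tag each token with the first ontology concept whose pattern matches."""
    entities = []
    for token in tokens:
        for concept, pattern in ONTOLOGY.items():
            if pattern.match(token):
                entities.append((concept, token))
                break
    return entities


def extract(lines):
    """Build a concept -> [(line number, value)] index over all lines."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for concept, value in identify_entities(tokenize(line)):
            index[concept].append((line_no, value))
    return index


# Made-up log lines in an arbitrary format, used only to exercise the sketch.
LOG = [
    "2015-03-01T12:00:05 sshd accepted user=alice from 10.0.0.7",
    "2015-03-01T12:01:10 sshd failed user=bob from 192.168.1.20",
]

index = extract(LOG)
# Query by concept rather than by file format or column position.
for line_no, value in index["IPv4Address"]:
    print(line_no, value)
```

Querying `index["IPv4Address"]` returns addresses regardless of where they appear in a line, which is the kind of format-independent abstraction the ontology is meant to provide.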