This article contains promotional content. (January 2019) |
ELKI (Environment for Developing KDD-Applications Supported by Index-Structures) is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.
Developer(s) | Technical University of Dortmund; initially Ludwig Maximilian University of Munich |
---|---|
Stable release | 0.8.0
/ 5 October 2022 |
Repository | |
Written in | Java |
Operating system | Microsoft Windows, Linux, Mac OS |
Platform | Java platform |
Type | Data mining |
License | AGPL (since version 0.4.0) |
Website | elki-project |
Description
editThe ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms perform clustering, outlier detection,[1] and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types, distance functions, indexes, and evaluation measures. The Java just-in-time compiler optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the type safety of Java detects many programming errors at compile time.
ELKI is a free tool for analyzing data, mainly focusing on finding patterns and unusual data points without needing labels. It's written in Java and aims to be fast and able to handle big datasets by using special structures. It's made for researchers and students to add their own methods and compare different algorithms easily.[2]
ELKI has been used in data science to cluster sperm whale codas,[3] for phoneme clustering,[4] for anomaly detection in spaceflight operations,[5] for bike sharing redistribution,[6] and traffic prediction.[7]
Objectives
editThe university project is developed for use in teaching and research. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime.[8] ELKI aims at providing a shared codebase with comparable implementations of many algorithms.
As research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft (AGPL) license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience is students, researchers, data scientists, and software engineers.
Architecture
editELKI is modeled around a database-inspired core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases). This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm, local outlier factor and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists.
ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration.
ELKI uses a service loader architecture to allow publishing extensions as separate jar files.
ELKI uses optimized collections for performance rather than the standard Java API.[9] For loops for example are written similar to C++ iterators:
for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
relation.get(iter); // E.g., get the referenced object
idcollection.add(iter); // E.g., add the reference to a DBID collection
}
In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced garbage collection improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g., nearest neighbor search) using such optimizations.
Visualization
editThe visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX. Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets (for larger data sets, only a subsample of the data is visualized by default).
Awards
editVersion 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection,[10] won the conference's "best demonstration paper award".
Included algorithms
editSelect included algorithms:[11]
- Cluster analysis:
- K-means clustering (including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-Means, and robust variants such as k-means--)
- K-medians clustering
- K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS)
- Expectation-maximization algorithm for Gaussian mixture modeling
- Hierarchical clustering (including the fast SLINK, CLINK, NNChain and Anderberg algorithms)
- Single-linkage clustering
- Leader clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions)
- OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
- HDBSCAN
- Mean-shift clustering
- BIRCH clustering
- SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
- CLIQUE clustering
- ORCLUS and PROCLUS clustering
- COPAC, ERiC and 4C clustering
- CASH clustering
- DOC and FastDOC subspace clustering
- P3C clustering
- Canopy clustering algorithm
- Anomaly detection:
- k-Nearest-Neighbor outlier detection
- LOF (Local outlier factor)
- LoOP (Local Outlier Probabilities)
- OPTICS-OF
- DB-Outlier (Distance-Based Outliers)
- LOCI (Local Correlation Integral)
- LDOF (Local Distance-Based Outlier Factor)
- EM-Outlier
- SOD (Subspace Outlier Degree)
- COP (Correlation Outlier Probabilities)
- Frequent Itemset Mining and association rule learning
- Apriori algorithm
- Eclat
- FP-growth
- Dimensionality reduction
- Spatial index structures and other search indexes:
- Evaluation:
- Precision and recall, F1 score, Average Precision
- Receiver operating characteristic (ROC curve)
- Discounted cumulative gain (including NDCG)
- Silhouette index
- Davies–Bouldin index
- Dunn index
- Density-based cluster validation (DBCV)
- Visualization
- Scatter plots
- Histograms
- Parallel coordinates (also in 3D, using OpenGL)
- Other:
- Statistical distributions and many parameter estimators, including robust MAD based and L-moment based estimators
- Dynamic time warping
- Change point detection in time series
- Intrinsic dimensionality estimators
Version history
editVersion 0.1 (July 2008) contained several Algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms.[12]
Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series.[13]
Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules.[14]
Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures.[10]
Version 0.5 (April 2012) focuses on the evaluation of cluster analysis results, adding new visualizations and some new algorithms.[15]
Version 0.6 (June 2013) introduces a new 3D adaption of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures.[16]
Version 0.7 (August 2015) adds support for uncertain data types, and algorithms for the analysis of uncertain data.[17]
Version 0.7.5 (February 2019) adds additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures.[18]
Version 0.8 (October 2022) adds automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as BIRCH.[19]
Similar applications
edit- scikit-learn: machine learning library in Python
- Weka: A similar project by the University of Waikato, with a focus on classification algorithms
- RapidMiner: An application available commercially (a restricted version is available as open source)
- KNIME: An open source platform which integrates various components for machine learning and data mining
See also
editReferences
edit- ^ Hans-Peter Kriegel, Peer Kröger, Arthur Zimek (2009). "Outlier Detection Techniques (Tutorial)" (PDF). 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009). Bangkok, Thailand. Retrieved 2010-03-26.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ^ "ELKI Data Mining Framework". elki-project.github.io. Retrieved 2024-05-30.
- ^ Gero, Shane; Whitehead, Hal; Rendell, Luke (2016). "Individual, unit and vocal clan level identity cues in sperm whale codas". Royal Society Open Science. 3 (1): 150372. Bibcode:2016RSOS....350372G. doi:10.1098/rsos.150372. ISSN 2054-5703. PMC 4736920. PMID 26909165.
- ^ Stahlberg, Felix; Schlippe, Tim; Vogel, Stephan; Schultz, Tanja (2013). "Pronunciation Extraction from Phoneme Sequences through Cross-Lingual Word-to-Phoneme Alignment". Statistical Language and Speech Processing. Lecture Notes in Computer Science. Vol. 7978. pp. 260–272. doi:10.1007/978-3-642-39593-2_23. ISBN 978-3-642-39592-5. ISSN 0302-9743.
- ^ Verzola, Ivano; Donati, Alessandro; Martinez, Jose; Schubert, Matthias; Somodi, Laszlo (2016). "Project Sibyl: A Novelty Detection System for Human Spaceflight Operations". Space Ops 2016 Conference. doi:10.2514/6.2016-2405. ISBN 978-1-62410-426-8.
- ^ Adham, Manal T.; Bentley, Peter J. (2016). "Evaluating clustering methods within the Artificial Ecosystem Algorithm and their application to bike redistribution in London". Biosystems. 146: 43–59. doi:10.1016/j.biosystems.2016.04.008. ISSN 0303-2647. PMID 27178785.
- ^ Wisely, Michael; Hurson, Ali; Sarvestani, Sahra Sedigh (2015). "An extensible simulation framework for evaluating centralized traffic prediction algorithms". 2015 International Conference on Connected Vehicles and Expo (ICCVE). pp. 391–396. doi:10.1109/ICCVE.2015.86. ISBN 978-1-5090-0264-1. S2CID 1297145.
- ^ Kriegel, Hans-Peter; Schubert, Erich; Zimek, Arthur (2016). "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?". Knowledge and Information Systems. 52 (2): 341–378. doi:10.1007/s10115-016-1004-2. ISSN 0219-1377. S2CID 40772241.
- ^ "DBIDs". ELKI homepage. Retrieved 13 December 2016.
- ^ a b Elke Achtert, Achmed Hettab, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2011). Spatial Outlier Detection: Data, Algorithms, Visualizations. 12th International Symposium on Spatial and Temporal Databases (SSTD 2011). Minneapolis, MN: Springer. doi:10.1007/978-3-642-22922-0_41.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ excerpt from "Data Mining Algorithms in ELKI". Retrieved 17 October 2019.
- ^ Elke Achtert, Hans-Peter Kriegel, Arthur Zimek (2008). ELKI: A Software System for Evaluation of Subspace Clustering Algorithms (PDF). Proceedings of the 20th international conference on Scientific and Statistical Database Management (SSDBM 08). Hong Kong, China: Springer. doi:10.1007/978-3-540-69497-7_41.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ Elke Achtert, Thomas Bernecker, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2009). ELKI in time: ELKI 0.2 for the performance evaluation of distance measures for time series (PDF). Proceedings of the 11th International Symposium on Advances in Spatial and Temporal Databases (SSTD 2010). Aalborg, Dänemark: Springer. doi:10.1007/978-3-642-02982-0_35.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ Elke Achtert, Hans-Peter Kriegel, Lisa Reichert, Erich Schubert, Remigius Wojdanowski, Arthur Zimek (2010). Visual Evaluation of Outlier Detection Models. 15th International Conference on Database Systems for Advanced Applications (DASFAA 2010). Tsukuba, Japan: Springer. doi:10.1007/978-3-642-12098-5_34.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2012). Evaluation of Clusterings Metrics and Visual Support. 28th International Conference on Data Engineering (ICDE). Washington, DC. doi:10.1109/ICDE.2012.128.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2013). Interactive Data Mining with 3D-Parallel-Coordinate-Trees. Proceedings of the ACM International Conference on Management of Data (SIGMOD). New York City, NY. doi:10.1145/2463676.2463696.
{{cite conference}}
: CS1 maint: multiple names: authors list (link) - ^ Erich Schubert; Alexander Koos; Tobias Emrich; Andreas Züfle; Klaus Arthur Schmid; Arthur Zimek (2015). "A Framework for Clustering Uncertain Data" (PDF). Proceedings of the VLDB Endowment. 8 (12): 1976–1987. doi:10.14778/2824032.2824115.
- ^ Schubert, Erich; Zimek, Arthur (2019-02-10). "ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"". arXiv:1902.03616 [cs.LG].
- ^ Schubert, Erich (2022). Automatic Indexing for Similarity Search in ELKI. Similarity Search and Applications. pp. 205–213. doi:10.1007/978-3-031-17849-8_16.
External links
edit- Official website of ELKI with download and documentation.