Data Mining FAQ

What is Data Mining?

Data Mining is a class of applications that look for hidden patterns in data. These applications can identify clusters of related records, flag records that do not fit normal patterns, and predict future results. For example, data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation; it discovers previously unknown relationships among the data.
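
As a rough illustration of "records that do not fit normal patterns," the sketch below flags values that lie far from the mean of a small data set using a z-score test. It is only a toy example of the idea, not how Weka or Pentaho Data Mining work internally. The class name OutlierSketch, the method findOutliers, the sample basket totals, and the threshold of 2.0 are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: flags values that sit far from the mean of a data set. */
final class OutlierSketch {

    /** Returns the indexes of values whose absolute z-score exceeds the threshold. */
    static List<Integer> findOutliers(double[] values, double threshold) {
        List<Integer> outliers = new ArrayList<>();
        if (values.length == 0) {
            return outliers;
        }

        // Mean of all values.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        // Population standard deviation.
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        double stdDev = Math.sqrt(squaredDeviations / values.length);

        // If every value is identical, nothing can be an outlier.
        if (stdDev == 0.0) {
            return outliers;
        }

        for (int i = 0; i < values.length; i++) {
            if (Math.abs(values[i] - mean) / stdDev > threshold) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Hypothetical basket totals; the last one is far larger than the rest.
        double[] basketTotals = {20, 22, 19, 21, 23, 20, 180};
        System.out.println(findOutliers(basketTotals, 2.0)); // prints [6]
    }
}
```

A real data mining tool would typically look at many attributes at once and use more robust techniques, but the principle is the same: learn what "normal" looks like from the data, then report what deviates from it.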

What are the key pieces of the Pentaho Data Mining architecture?

The core module is Weka, another Open Source project, created by researchers and professors at the University of Waikato. Other key components are the integration with Pentaho Data Integration, Pentaho Analysis Services and Pentaho Reporting.

What is the Weka project?

Weka is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. Weka is the most popular open source data mining project. The Weka team has incorporated several standard ML techniques into a software "workbench" called WEKA, for Waikato Environment for Knowledge Analysis. With it, a specialist in a particular field is able to derive useful knowledge from databases that are far too large to be analyzed by hand. WEKA's users include researchers, analysts, and industrial scientists, and it is also widely used for teaching.

How do I use Weka in a non-GPL application?

Commercial licenses for Weka are available from Pentaho Corporation. For more information, please fill out this form and a sales representative will contact you.

Where can I find detailed information about Weka?

Here you can find FAQs, API guides, User Guides, a demonstration of all the graphical interfaces, and more. A good book for learning data mining concepts is "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.

Can I use Pentaho Data Mining out-of-the-box?

Yes. It was designed to be used as both an out-of-the-box analysis application and as an analysis application that can be called from other applications.

Can I embed Pentaho Data Mining into my applications?

Pentaho Data Mining is meant to work either stand-alone or as an application that can be called as part of a Business Intelligence process. Note that Weka is distributed under the terms of the GNU General Public License (GPL). Under the GPL, if you intend to distribute GPL-licensed code to your customers as part of other software you have created, you may, depending on the software you have created, be required to release that software under the GPL as well. Companies that wish to distribute Weka have the option of purchasing a commercial license from Pentaho Corporation. A commercial license would exempt you from GPL obligations.

Who designs the Data Mining models?

Business analysts first determine the business problems that need to be solved, and then define the required source data. From there they choose data mining techniques, create analytical models, and tune the models' accuracy. Data mining is a collection of powerful techniques, and it may be best used after a short training course. Pentaho partners specializing in Data Mining training and consulting can provide this service.
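
To make "tuning the accuracy" of a model concrete, here is a minimal sketch, assuming a deliberately simple one-rule model: predict that a customer renews when monthly spend is at or above a threshold. The code tries a few candidate thresholds and keeps the one that matches the known outcomes most often. The class name ThresholdTuningSketch, its methods, and the sample data are hypothetical and are not part of Weka or Pentaho Data Mining. The input arrays are assumed to be non-empty and of equal length.

```java
/** Illustrative only: picks the threshold that best separates two labeled groups. */
final class ThresholdTuningSketch {

    /** Fraction of records where the rule (value >= threshold) matches the known label. */
    static double accuracy(double[] values, boolean[] labels, double threshold) {
        int correct = 0;
        for (int i = 0; i < values.length; i++) {
            boolean predicted = values[i] >= threshold;
            if (predicted == labels[i]) {
                correct++;
            }
        }
        return (double) correct / values.length;
    }

    /** Tries each candidate and returns the most accurate one (the first wins on ties). */
    static double bestThreshold(double[] values, boolean[] labels, double[] candidates) {
        double best = candidates[0];
        double bestAccuracy = accuracy(values, labels, best);
        for (int i = 1; i < candidates.length; i++) {
            double candidateAccuracy = accuracy(values, labels, candidates[i]);
            if (candidateAccuracy > bestAccuracy) {
                bestAccuracy = candidateAccuracy;
                best = candidates[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical history: monthly spend and whether the customer renewed.
        double[] monthlySpend = {10, 25, 40, 55, 70, 85};
        boolean[] renewed = {false, false, false, true, true, true};
        double[] candidates = {20, 50, 80};

        double threshold = bestThreshold(monthlySpend, renewed, candidates);
        // prints threshold=50.0 accuracy=1.0
        System.out.println("threshold=" + threshold
                + " accuracy=" + accuracy(monthlySpend, renewed, threshold));
    }
}
```

In practice an analyst would measure accuracy on records that were held back from model building, since a model tuned and scored on the same data tends to look better than it really is.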

Can I schedule the running of a Data Mining activity?

Yes. Data mining activities can be scheduled just like any other platform activity.

Can I call a web service to run a Data Mining activity?

Yes. Data mining activities can be accessed as a web service just like any other platform activity.

What is Pentaho Data Mining written in?

The data mining component, console, and application are written in Java.

What data sources can I get to?

Through the Pentaho Platform, you can get to data sets from any relational database that is accessible through JDBC.

How is the Pentaho BI Project different from other Data Mining projects?

The Pentaho BI Project offers an entire BI Platform complete with reporting, analysis, dashboards, data mining, workflow, and the infrastructure necessary for true production deployment. Many other projects address specific data mining functions, but not the entire BI spectrum. Most also lack necessary infrastructure such as scheduling, web services, security, administration, auditing, fail-over, scalability features, a portal, and other key framework functionality. Customers can start with something simple like Reporting from Pentaho and know that they'll be able to add data mining to their solution when they're ready. They'll know that everything will be integrated, supported, and getting better by the day. The Pentaho BI Project gives users peace of mind via longevity, support, and continued innovation.