Talk:Data mining/2011
This is an archive of past discussions about Data mining. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
suggestion for new section
This article covers many applications of data mining, but not data mining in meteorology. I'm interested in writing an article related to that, and I would like suggestions about it. — Preceding unsigned comment added by Inoshika (talk • contribs) 14:18, 16 April 2011 (UTC)
- In my opinion, all the "notable" uses of data mining should be moved into a separate article. The current article is a huge mess IMHO. May I suggest just starting a new article somewhere, e.g. "User:Inoshika/Data mining in meteorology", and then having it moved and linked at some point? The other "uses" paragraphs could be cloned into appropriate articles, too. So we can leave all those business-marketing-analysis people to their own article mess... --87.174.61.197 (talk) 17:53, 16 April 2011 (UTC)
- Ya. That's what I thought too. Write a separate article as "data mining in meteorology" and link it under "Applications". --112.135.1.146 (talk) 00:38, 17 April 2011 (UTC)
- I added that article: Data mining in meteorology. It is linked under Applications. --112.135.57.93 (talk) 02:19, 23 April 2011 (UTC)
data mining
Data mining is the analysis of data that was collected for a different purpose. Commercial transactions, for example, are recorded for traceability, but mined to find that on-line buyers frequently order different colors of the same garment in the same size. By contrast, clinical trials collect data on a new drug directly for the purpose of assessing efficacy and safety.
In the first case, the analyst does not control what data is collected, in what form, or how much of it there is. He or she does not know a priori what can be learned from it. In the second case, the analyst defines the data collection protocols and knows what to look for. It is a designed experiment.
Everything else about data mining -- such as which tools are used -- flows from this fundamental distinction. --Mbaudin53 (talk) 13:14, 7 October 2011 (UTC)
Data mining is the process of extracting patterns from data, and it is becoming an increasingly important tool for transforming raw data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
Data mining can be used to uncover patterns in data, but it is often carried out only on samples of the data. The process will be ineffective if the samples are not a good representation of the larger body of data: patterns present in the larger body cannot be discovered if they are absent from the sample being "mined". This inability to find patterns can become a source of dispute between customers and service providers. Data mining is therefore not foolproof, but it can be useful if sufficiently representative samples are collected. Conversely, the discovery of a pattern in a particular sample does not necessarily mean that the pattern holds in the larger body of data from which the sample was drawn. An important part of the process is therefore the verification and validation of patterns on other samples of data.
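For illustration, a minimal Python sketch of this validation step, using scikit-learn on synthetic data (the data construction and all names here are assumptions made for the example, not taken from the article):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                        # synthetic features
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # only feature 0 carries signal

    # "Mine" a pattern on one sample of the data ...
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

    # ... then verify it on a second sample that was not used for mining.
    print("mining-sample accuracy:", accuracy_score(y_train, model.predict(X_train)))
    print("holdout-sample accuracy:", accuracy_score(y_test, model.predict(X_test)))

If the pattern generalizes, the holdout accuracy stays close to the mining-sample accuracy; a large gap is a sign the "pattern" was specific to the sample mined.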
The term data mining has also been used to describe data dredging and data snooping. However, dredging and snooping can be (and sometimes are) used as exploratory tools when developing and clarifying hypotheses.
Data mining commonly involves four classes of tasks:[10]
* Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam (a sketch of this appears after this list). Common algorithms include decision tree learning, nearest neighbor, naive Bayes classification and neural networks.
* Clustering - Like classification, but the groups are not predefined; the algorithm tries to group similar items together.
* Regression - Attempts to find a function which models the data with the least error.
* Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
* See also structured data analysis. —Preceding unsigned comment added by 203.223.189.2 (talk) 11:19, 20 February 2010 (UTC)
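As a minimal sketch of the classification task above, here is a naive Bayes spam classifier built with scikit-learn; the tiny corpus, labels and expected outputs are made-up assumptions for illustration only:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny made-up corpus: 1 = spam, 0 = legitimate.
    emails = ["win money now", "meeting at noon", "cheap meds win prize",
              "lunch tomorrow?", "claim your free prize now"]
    labels = [1, 0, 1, 0, 1]

    vec = CountVectorizer()
    X = vec.fit_transform(emails)          # word-count features
    clf = MultinomialNB().fit(X, labels)   # naive Bayes classifier

    print(clf.predict(vec.transform(["free money prize"])))        # expected: [1] (spam)
    print(clf.predict(vec.transform(["see you at the meeting"])))  # expected: [0]

Note that the same fitted vectorizer must be reused for new emails, so they are mapped into the vocabulary learned from the training sample.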
Focus
I think the majority of this article focused on the negative uses of data mining, so I added examples of data mining where it was not negative. —Preceding unsigned comment added by 98.243.168.193 (talk) 02:51, 17 January 2011 (UTC)
Isn't data mining also a part of statistics?
Should I look for references for that? Talgalili (talk) 21:33, 25 January 2011 (UTC)
- What do you propose exactly? The data analysis category is already in statistics, so this article is also in category statistics. Obviously, statistics is used in data mining a lot (as are math in general, databases, ...). Many data mining methods have moved away from a sound statistical model and essentially define their output by what the algorithm finds, not by statistical reasoning. Data mining of course has its roots in statistics, though. --Chire (talk) 10:51, 27 January 2011 (UTC)
- Hi Chire. I'm asking in order to find out whether I can add "statistics" to the first sentence in the article, which states:
Data mining, a branch of computer science[1] and artificial intelligence,[2] (and I think we should add - "statistics")
- Statistics definitely should be mentioned in the first paragraph. I like the definition of Encyclopedia Britannica ([1] in the article): "data mining ... combines tools from statistics and artificial intelligence ... with database management ..." [1]. Because there are many aspects that go beyond statistics. We should rewrite the paragraph to convey this. If I have time, I'll give it a go. --Chire (talk) 15:48, 31 January 2011 (UTC)
- Thanks for the rewrite :) Talgalili (talk) 17:07, 31 January 2011 (UTC)
Rexer DM survey notability dispute
I would like to dispute the notability tag added by User:Melcombe to the Rexer's Annual Data Miner Survey article. Unlike statistics, data mining is a relatively young and interdisciplinary field, and not as much has been written about it. I should think it significant that 735 participants from 60 countries took part in the most recent 2010 survey. Moreover, more people become involved with the survey each year. Compare this with other surveys that have articles on Wikipedia. --Luke145 (talk) 20:16, 22 April 2011 (UTC)
Further reading section - what to keep?
I am going through the "Further reading" section and see it is very bloated. I would have felt better about it if the external links section had more links (which it doesn't).
My question is, what books should be removed from this section?
Talgalili (talk) 07:54, 14 May 2011 (UTC)
- Whatever makes sense to read next. ;-) IMHO the section should be stripped down to a bare minimum. It is also biased towards machine learning. As a rule of thumb I'd say: 1-3 general textbooks, plus 1 for each key area of data mining (cluster analysis, outlier detection, machine learning) and maybe 1 for each big speciality topic (text mining, web mining, time series, graph mining, ...). Some books may fall into more than one of these categories; for example, the Weka book can be seen as general, but it clearly has a focus on machine learning (also in the title: "Data Mining: Practical Machine Learning Tools ..." - but it's the #1 for machine learning, e.g. on Amazon [2]). But do we then need another machine-learning-oriented book? At the same time, cluster analysis is really thin in that book, and I don't think it even covers outlier detection... --Chire (talk) 11:28, 14 May 2011 (UTC)
Evolve into summary article?
This discussion was prompted by my introduction of the "Knowledge Grid" section. That section was the result of Wikipedia:Articles for deletion/Knowledge Grid, which pointed out that an article should cover a "topic" and not a neologism or buzzword that one group uses. It does seem that the subsection does not belong in the "Process" section, but where does it belong? For that matter, there are enough overlapping articles that it might be time to evolve this one into just an overview with summaries of the other articles that are linked. It would be a lot of work, but probably the way to go eventually. One suggestion was Data mining in grids, since there are some other efforts in that area. Actually, I consider "grids" a somewhat dated term already; the trendy term might be "cloud", which might be different in some ways but is the same general idea. Maybe something like Distributed data mining would be a more general title, where the various approaches could be given in a historical context, e.g. Computer cluster, Grid computing and cloud computing. There already are articles on both distributed database and distributed database management system which are very thin, so maybe just put it there? Then there is Parallel database, which sounds similar to me (but seems more like an overview article pointing to the various techniques such as NUMA etc.). Not sure what to do here, help please. W Nowicki (talk) 18:25, 18 October 2011 (UTC)
- Distributed data mining might be an appropriate article for the distributed efforts. Parallel, grid and cloud are all somewhat specialized versions of this IMHO. --Chire (talk) 22:42, 18 October 2011 (UTC)
Criticism, difficulties, limitations, etc.
The term "data mining" is sometimes used as a derogatory term in statistics. It involves constructing a model based on observing past data. Such a model is exploratory and tentative, and its results have to be treated with caution. The model may turn out not to explain the future, because it may rely on chance variation in the particular sample that won't be repeated in the future.
If many variables are examined to identify potentially significant factors, for example, then on average 1 in 20 will appear to be significant at the 5% level "by chance", even if the data are actually completely random. An impressive-looking model can be constructed by homing in on the "significant" set of variables. But if a new sample is collected and the modeling process repeated, chance variation will cause a completely different set of variables to appear "significant". A completely different, similarly impressive-looking model will be constructed, which will turn out to have no more value than the first.
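This effect is easy to demonstrate with a quick Python simulation (the sample sizes and names are arbitrary choices for illustration): with purely random data and a 5% threshold, roughly 1 in 20 candidate variables will nevertheless look "significant".

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_samples, n_vars = 200, 100
    X = rng.normal(size=(n_samples, n_vars))   # 100 candidate variables, pure noise
    y = rng.normal(size=n_samples)             # target, also pure noise

    # Test each candidate variable against the target and count false "discoveries".
    p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(n_vars)]
    hits = sum(p < 0.05 for p in p_values)
    print(f"{hits} of {n_vars} variables 'significant' at the 5% level")  # about 5 expected

Rerunning with a different seed yields a different set of "significant" variables, which is exactly the instability described above.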
An approach that incorporates some of the cautions statisticians have traditionally voiced would seem warranted. As the financial crisis illustrated, relying on such models too naively may be hazardous to one's wealth. --207.245.44.238 (talk) 15:44, 22 December 2011 (UTC)
- Well, others call any kind of data collection or data analysis "data mining"... Anyway, what you refer to is also known as data dredging and already has its own article. And it's already mentioned in the introduction of this article, too - so what exactly do you propose that isn't here already? --Chire (talk) 20:57, 22 December 2011 (UTC)