Talk:Record linkage
This article is rated Start-class on Wikipedia's content assessment scale.
Untitled
Record linkage and deduplication are NOT the same thing. The first is linking two or more datasets, the second is removing duplicate entries in a single dataset. It's helpful (often very important) to DEDUPLICATE before attempting a RECORD LINKAGE.
Using your definition, deduplication is a simpler instance of record linkage. In deduplication the "two" datasets are the same and have the same structure in terms of fields, which is not always the case with record linkage. The two terms are often used interchangeably (and many other terms are also used to refer to the same concept, which is kind of ironic if you think about it). Ipeirotis 04:46, 1 February 2007 (UTC)
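To make the "deduplication is self-linkage" point above concrete, here is a minimal Python sketch. It assumes exact matching on a lightly normalized name field, which is far simpler than real record linkage; the function names (normalize, link, deduplicate), the field names, and the sample records are all hypothetical and chosen only for illustration.

```python
from itertools import product

def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())

def link(left, right, field_left, field_right):
    """Return index pairs (i, j) where left[i] and right[j] agree on the normalized field.

    The two datasets may use different field names, as is common in record linkage.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if normalize(a[field_left]) == normalize(b[field_right])
    ]

def deduplicate(records, field):
    """Deduplication as self-linkage: link the dataset to itself.

    Keeping only i < j drops self-matches and reports each duplicate pair once.
    """
    return [(i, j) for i, j in link(records, records, field, field) if i < j]

# Hypothetical sample data: two datasets with different field names.
patients = [{"name": "Ann Lee"}, {"name": "ann  lee"}, {"name": "Bob Roy"}]
claims = [{"customer": "BOB ROY"}, {"customer": "Cy Dunn"}]

duplicates = deduplicate(patients, "name")          # pairs within one dataset
links = link(patients, claims, "name", "customer")  # pairs across two datasets
```

The all-pairs comparison is quadratic and only suitable for tiny inputs; it is used here because it keeps the relationship between the two operations visible.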
Separately, "deduplication" has taken on a life of its own, with a different meaning, as an optimization for storage technology in backup and archiving. The article may need a note that discusses this usage and points to the definition that is currently called Capacity Optimization. Examples of this usage:
http://searchsecurity.techtarget.com/tip/0,289483,sid5_gci1187934,00.html
http://www.networkworld.com/news/2006/091806-storage-deduplication.html
http://enterprisestorageforum.webopedia.com/TERM/d/data_deduplication.html
- I agree that "deduplication" is also used in the storage world. Big storage vendors (such as NetApp) use the term "deduplication" when referring to some of their storage optimization products. See [1], for example. In fact, I found this very WP page while searching for storage deduplication technologies. Gigglesworth 22:12, 26 July 2007 (UTC)
Why was the deduplication entry overwritten by something that could easily have been linked as a reference to the actual subject matter? What's been changed here does not serve as a reference; it force-feeds just one of many ways of approaching this issue. I'd prefer it changed back to what we had previously, and for some of the slightly over-zealous editing to cease. Going from a focused summary of the subject as a whole, with links and related terms, to something that is needlessly over-complicated is not an improvement. If people would consider the user base as a whole when updating, it would be appreciated.
Software implementations
List of software implementations: this should be added, as there is precedent in other articles related to software concepts, especially since most entries are related to educational projects. There is precedent in articles on other technology-related products, such as spreadsheet software. — Preceding unsigned comment added by 174.78.142.161 (talk) 15:02, 7 September 2012 (UTC)
- Sorry, but that list is utterly worthless. Way too many entries. What are our readers supposed to take away from this? -- O.Koslowski (talk) 15:05, 7 September 2012 (UTC)
I've removed the list again per WP:NOTDIRECTORY. We're building an encyclopedia, not a link directory. Dawnseeker2000 23:55, 6 October 2012 (UTC)
Record Matching vs. Entity Matching & Deep Learning Approaches
I feel the article is a bit outdated and does not reflect the current state of the art; for example, it does not reference any deep learning techniques. More recent literature also uses "entity matching" instead of "record linkage", as we no longer live in a purely relational world. One indication that the article is outdated is that its latest reference is from 2012. I therefore propose adding the "outdated" tag to this article.
For example, this paper could be cited instead: http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-tr.pdf (I am not an expert in entity matching / record linkage, so I don't feel qualified to start rewriting or adding to the article.)