A zero intent sample is a sample which will only satisfy our validation goal if no positive examples are found in it. If we have a population (in e-discovery, typically a document set) where one in R instances are positive (one in R documents relevant), and we only want a one in Q probability of sampling a positive instance, then our sample size can be no more than R / Q.
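As a rough check of the R / Q bound, the sketch below compares it with the exact probability of drawing at least one positive instance. It assumes each sampled instance is independently positive with probability 1 / R, which ignores finite-population effects. The function names and the example figures are illustrative and not from the original post.

```python
import math


def max_zero_intent_sample_size(r: float, q: float) -> int:
    """Largest sample size allowed by the approximate bound n <= R / Q."""
    return math.floor(r / q)


def prob_at_least_one_positive(n: int, r: float) -> float:
    """Probability of one or more positives in n independent draws,
    each positive with probability 1 / R."""
    return 1.0 - (1.0 - 1.0 / r) ** n


# Hypothetical figures: 1 in 1,000 documents relevant, and at most a
# 1 in 20 chance of the sample containing a relevant document.
n = max_zero_intent_sample_size(1000, 20)
p = prob_at_least_one_positive(n, 1000)
print(n, round(p, 4))
```

Because 1 - (1 - 1/R)^n never exceeds n / R, a sample of size R / Q keeps the chance of seeing a positive at or slightly below 1 in Q under this independence assumption.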
Read the rest of this entry »
January 18th, 2015
Tomorrow I'm starting a new, full-time position as data scientist at FTI's lab here in Melbourne. I'm excited to have the opportunity to contribute to the e-discovery community from another angle, as a builder-of-product. Unfortunately, this means the end of this blog, at least in its current form and at least for now. Thanks to all my readers, commenters, and draft-post-reviewers. It's been an entertaining experience!
January 4th, 2015
There is an ongoing discussion about how to estimate the recall of a production, and how to estimate a confidence interval on that recall. One approach is to use the control set sample, drawn at the start of production to estimate collection richness and guide the predictive coding process, to also estimate the final confidence interval. This requires some care, however, to avoid contaminating the control with the training set. Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.
Read the rest of this entry »
October 20th, 2014
The reason the control set can be used to estimate the effectiveness of the PC system on the collection is that it is a random sample of that collection. As training proceeds, however, the relevance of some of the documents in the collection will become known through human assessment---even more so if review begins before training is complete (as is often the case). Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.
Read the rest of this entry »
October 16th, 2014
In my previous post, I found that relevance and uncertainty selection needed similar numbers of document relevance assessments to achieve a given level of recall. I summarized this by saying the two methods had similar cost. The number of documents assessed, however, is only a very approximate measure of the cost of a review process, and richer cost models might lead to a different conclusion.
One distinction that is sometimes made is between the cost of training a document, and the cost of reviewing it. It is often assumed that training is performed by a subject-matter expert, whereas review is done by more junior reviewers. The subject-matter expert costs more than the junior reviewers---let's say, five times as much. Therefore, assessing a document for relevance during training will cost more than doing so during review.
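The distinction above can be sketched as a simple two-rate cost model. This is only an illustration under stated assumptions: the function name, the unit reviewer rate, and the document counts are hypothetical, and the five-times multiplier is the figure supposed in the text.

```python
def assessment_cost(n_training: int, n_review: int,
                    sme_multiplier: float = 5.0,
                    reviewer_rate: float = 1.0) -> float:
    """Total cost when training documents are assessed by an SME and
    review documents by junior reviewers.

    sme_multiplier is how many times more the SME costs per document
    than a junior reviewer (assumed to be 5, as in the text).
    """
    training_cost = n_training * sme_multiplier * reviewer_rate
    review_cost = n_review * reviewer_rate
    return training_cost + review_cost


# Two hypothetical protocols that each assess 10,000 documents in total.
heavy_training = assessment_cost(n_training=2000, n_review=8000)
light_training = assessment_cost(n_training=500, n_review=9500)
print(heavy_training, light_training)
```

With these made-up figures, both protocols assess the same number of documents, yet the one that spends more of them on training costs 18,000 units against 12,000, which is why a count of assessments alone can hide a real difference in cost.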
Read the rest of this entry »
September 27th, 2014
My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in the review stage is that of having human experts train the classifier, and of having human reviewers review the documents that the classifier predicts as responsive. Each relevant document found in training is one fewer that must be looked at in review. Therefore, training example selection methods such as relevance selection that prioritize relevant documents are likely to have a lower total cost than the abstract measure of classifier effectiveness might suggest.
Read the rest of this entry »
In a previous post, I compared three methods of selecting training examples for predictive coding: random, uncertainty, and relevance. The methods were compared on their efficiency in improving the accuracy of a text classifier; that is, the number of training documents required to achieve a certain level of accuracy (or, conversely, the level of accuracy achieved for a given number of training documents). The study found that uncertainty selection was consistently the most efficient, though there was no great difference between it and relevance selection on very low richness topics. Random sampling, in contrast, performed very poorly on low richness topics.
In e-discovery, however, classifier accuracy is not an end in itself (though many widely-used protocols treat it as such). What we care about, rather, is the total amount of effort required to achieve an acceptable level of recall; that is, to find some proportion of the relevant documents in the collection. (We also care about determining to our satisfaction, and demonstrating to others, that that level of recall has been achieved, but that is beyond the scope of the current post.) A more accurate classifier means a higher precision in the candidate production for a given level of recall (or, equivalently, a lesser cutoff depth in the predictive ranking), which in turn saves cost in post-predictive first-pass review. But training the classifier itself takes effort, and after some point, the incremental saving in review effort may be outweighed by the incremental cost in training.
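The trade-off between training effort and review effort can be illustrated with a small calculation. This is a simplified sketch, not the model used in the post: it counts every assessment as one unit of effort, ignores relevant documents already found during training, and uses hypothetical precision figures. The function names are likewise illustrative.

```python
def review_depth(target_recall: float, n_relevant: int,
                 precision: float) -> float:
    """Documents that must be reviewed to reach the target recall,
    given the precision of the candidate production at that recall."""
    return target_recall * n_relevant / precision


def total_effort(n_training: int, target_recall: float,
                 n_relevant: int, precision: float) -> float:
    """Training assessments plus first-pass review assessments."""
    return n_training + review_depth(target_recall, n_relevant, precision)


# Hypothetical collection with 1,000 relevant documents, target recall 80%.
# Assumption: tripling the training set lifts precision from 0.4 to 0.5.
modest_training = total_effort(1000, 0.8, 1000, precision=0.4)
extended_training = total_effort(3000, 0.8, 1000, precision=0.5)
print(modest_training, extended_training)
```

In this made-up case the extra 2,000 training documents save only about 400 review documents, so total effort rises from roughly 3,000 to roughly 4,600 assessments even though the classifier is more accurate.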
Read the rest of this entry »
August 8th, 2014
Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I'd capture some of these topics here.
Read the rest of this entry »
Random vs active selection of training examples in e-discovery
July 17th, 2014
The problem with agreeing to teach is that you have less time for blogging, and the problem with a hiatus in blogging is that the topic you were in the middle of discussing gets overtaken by questions of more immediate interest. I hope to return to the question of simulating assessor error in a later post, but first I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system.
Read the rest of this entry »