Wikipedia:Search engine test
From Wikipedia, the free encyclopedia
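On Wikipedia, a "Google test" means any test that uses Google or another search engine. Some kinds of information can be gathered reliably this way. It should be stressed, however, that no search engine gives a definitive answer; the results are only a first-pass heuristic or rule of thumb.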
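- Unsuitable titles. Google keyword searches and hit counts are a good way to detect some unsuitable titles that have been added to Wikipedia. The method can help weed out some hoaxes, fabrications, and personal conjecture. It can also be used to check whether a title fully covers the content of the article, although the method remains subject to bias (see below). See Wikipedia:不适合维基百科的文章 for a fuller list of unsuitable titles.
- Copyrighted works. Large amounts of text suddenly submitted to Wikipedia by new or anonymous users are often simple copy-and-paste from external sources, and some of it violates copyright (see 發現可能的侵權). Searching for an excerpt will usually find the web source of such copied text.
- Variant usage. A single concept, especially a regional one, often has several names or spellings in English. A series of searches on the different forms of a name shows which of them come close to the most common form. For a quick comparison of relative usage, Google can be used, for example by comparing deoxyribose nucleic acid and deoxyribonucleic acid. Note that in some situations the Google test cannot be used, for example when an international standard has already been established, as with aluminium (鋁).
- Related websites. For a high-quality article (see 特色條目), Google can be used to find related websites, which may be linked from Wikipedia once they have been checked.
- Supplementary material. Search engines are, of course, also good for finding additional source material.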
Tips
Google Web search is not all there is to Google search. When running a Google test, also try searching Groups (Usenet). This is a very different sample, representing, for the most part, conversations in English conducted by people who are not deliberately trying to sell products or reach a mass audience. Other things being equal, a "Groups" search will typically return very roughly 1/5 as many hits as a "Web" search. Because Groups and Web searches have very different "systemic biases," hit numbers are not comparable. Nevertheless, Groups searches are particularly helpful in identifying entities whose Web presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 20 Groups hits.
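The comparison between Groups and Web hit counts can be sketched as a small check. This is only an illustration: the function names and the default cutoff of 0.001 are assumptions, not values given on this page. The only figures taken from the text are the rough 1/5 baseline and the 100,000 versus 20 example.

```python
def groups_to_web_ratio(web_hits: int, groups_hits: int) -> float:
    """Return Groups hits divided by Web hits (0.0 if there are no Web hits)."""
    if web_hits <= 0:
        return 0.0
    return groups_hits / web_hits


def looks_promoted(web_hits: int, groups_hits: int, cutoff: float = 0.001) -> bool:
    """Flag a term whose Groups presence is tiny relative to its Web presence.

    The cutoff is a hypothetical value chosen for illustration; the page only
    says that roughly 1/5 is typical and that 20 vs. 100,000 is suspicious.
    """
    return web_hits > 0 and groups_to_web_ratio(web_hits, groups_hits) < cutoff


# The example from the text: 100,000 Web hits but only 20 Groups hits.
print(looks_promoted(100_000, 20))      # flagged as suspicious
print(looks_promoted(100_000, 20_000))  # close to the typical 1/5 ratio
```

A flag from a check like this is a prompt to look more closely, not a verdict, since the two hit counts are not directly comparable.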
USENET postings are date-stamped and have been archived for over twenty years, making them more useful than Web searches as a record of recent history. Using a Groups "advanced search", it is possible to restrict a search by date, which can help in identifying how recent the widespread use of a term is.
Google News searches can assess whether something is currently newsworthy. One characteristic of Google News is that whereas it is easy and inexpensive to create websites or post to USENET, it is harder to convince a Google News source to run a story. Thus Google News, in comparison to Web or Groups, is less susceptible to manipulation by self-promoters. Note that Google News indexes many "news" sources that reflect specific points of view, and many news sources that are only of local interest.
Depending on the subject, advanced search functions may be useful. For example, adding "site:gov" or "site:edu" will restrict your search to U.S. government sites or U.S. college and university sites.
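A restricted query of this kind can be assembled programmatically. The sketch below is a minimal illustration: the helper names `build_query` and `search_url` are hypothetical, and the URL pattern `https://www.google.com/search?q=...` is assumed rather than taken from this page.

```python
from typing import Optional
from urllib.parse import urlencode


def build_query(phrase: str, site: Optional[str] = None) -> str:
    """Quote the phrase for an exact match and optionally add a site: restriction."""
    terms = [f'"{phrase}"']
    if site:
        terms.append(f"site:{site}")
    return " ".join(terms)


def search_url(query: str) -> str:
    """Build a search URL for the query (the URL pattern is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("deoxyribonucleic acid", site="edu")
print(query)
print(search_url(query))
```

Quoting the phrase keeps multi-word terms together, so that hit counts for two variants are compared on the same footing.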
Other tools that may be useful for research include Google Scholar, which searches academic literature, and Google Print, which searches the contents of books.
Alexa test
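Although Wikipedia is not a web directory, it does include articles about websites that meet its inclusion criteria.
If you are interested in writing a Wikipedia article about a particular website, check Alexa (http://www.alexa.com) to see whether the site is important enough. Most people agree that Wikipedia should cover sites ranked in the top 100, and possibly the top 1,000. For sites that are not even in the top 100,000, the general view is that an article would be very hard to verify and so should not be included. There is little agreement about the gray area in between.
For some sites within the top 1,000 (such as microsoft.com), it may be appropriate to redirect to a related article, such as Microsoft. (This is still somewhat disputed.)
Note also that Alexa rankings are themselves disputed, for several reasons. For example, the Alexa software works only for users of the Microsoft Windows operating system and Microsoft Internet Explorer, so sites devoted to Apple Macintosh topics, for instance, may not receive a ranking that accurately reflects their traffic. Conversely, some webmasters install the Alexa toolbar and visit their own sites solely to raise their ranking. Because the Alexa toolbar user base is very small, frequent visits by a single user can noticeably affect the overall result.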
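The rank bands mentioned in this section can be written down as a simple lookup. This is a sketch only: the function name and the wording of the returned labels are assumptions, and the result is no more reliable than the Alexa rank it is given.

```python
def alexa_guidance(rank: int) -> str:
    """Map an Alexa rank to the rough guidance described in this section."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    if rank <= 100:
        return "generally agreed to merit an article"
    if rank <= 1_000:
        return "possibly merits an article"
    if rank <= 100_000:
        return "gray area, no consensus"
    return "likely too hard to verify"


for rank in (50, 800, 25_000, 250_000):
    print(rank, "->", alexa_guidance(rank))
```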
See here for more information about web comics.
Google bias
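When using Google to test importance or existence, keep in mind the possibility of bias: the tool favors contemporary topics of interest to people in developed countries with Internet access, so the tester must exercise judgment. For example, a contemporary American pop group may need several thousand Google hits before most Wikipedians consider it worth including, whereas an equally important group from a country with little Internet access would need far fewer. A great musician of the 14th century may not be findable on Google at all.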
Q. What is the minimum number of matches you should see if a term is not made up? (3? 27? 81?)
A. Perhaps hundreds! It depends on the following factors:
- The article's point of view: If narrow, fewer references are required. Try to categorize the point of view (whether it is NPOV or other); for example, notice the difference between Ontology (philosophy) and Ontology (computer science).
- The subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet neologism, it may be on 100 pages and might still not be considered 'existing' for Wikipedia's purposes.
- The type of sites you find: Pay attention to how open the sites are about accepting submissions. The Urban Dictionary, for example, accepts submissions freely. This is especially important if you suspect an author is self-promoting, or is promoting an idiosyncratic viewpoint. A single Internet user can submit the same ideas to message boards and open-submission sites all over the Internet.
Further judgment: the Google test checks popular usage, not correctness. For example, a search for the incorrect Charles Windsor gives 10 times more results than the correct Charles Mountbatten-Windsor.
Also, some topics may not be on the Web because of low Internet use in certain areas and cultures of the world.
Reliability of the Google test
Given that the results of a Google test are interpreted subjectively, its implementation is not always consistent. This reflects the nature of the test being used on a case by case basis.
In some cases, articles have been kept with Google hit counts as low as 15, and some claim that this undermines the validity of the Google test in its entirety. However, this in fact reflects the rather uneven and subjective nature of the Wikipedia:Votes for deletion process more than the usefulness of the Google test. The Google test has always been, and very likely always will remain, an imperfect tool used to produce a general gauge of notability. It is not and should never be considered definitive.
Major factors which may affect the Google hit count include subjects from countries where the Internet is not prevalent and topics of a historical nature that have not yet been well documented on the Internet. In other cases, it is entirely a matter of speculation why one subject merits inclusion with a hit count below 100 while other such articles are frequently deleted.
Also note that the number of hits that Google reports is an estimate, not an exact figure (sometimes or perhaps always; the details are secret). The reported number has little meaning until one navigates to the last page of the results, since only then does Google apply all of its criteria to the query (such as eliminating duplicates and applying spam control). Often the hit count drops by a factor of 10 (or much more) after doing this. Jumping to the end of the results (or as far as is practical) also reveals whether the hits actually relate to the intended meaning of the search term. Queries are further improved by setting the results per page to the maximum value (which reduces duplicate results) and by excluding any domain belonging to a biased party. For instance, "JoesRockBand.com" should be excluded when searching for references to "Joe's Rock Band". For longer-lasting articles, excluding the term "wikipedia" itself may be needed to avoid counting all the mirrors and language versions of a Wikipedia article. In fact, the VfD discussion itself, once archived and indexed by Google, may add to the Google hit count used the next time the item is discussed. Finally, some human labor has to be involved: a manageable sample of the sites found must be opened individually to verify the relevance of the hit count.
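The exclusions described above can be expressed as a query string. The following is a minimal sketch: `refine_query` is a hypothetical helper, and it assumes the search engine accepts `-site:domain` and `-term` as exclusion operators. It only builds the query; opening a sample of the results by hand is still required.

```python
from typing import Iterable


def refine_query(
    phrase: str,
    excluded_domains: Iterable[str] = (),
    excluded_terms: Iterable[str] = ("wikipedia",),
) -> str:
    """Quote the phrase, then exclude biased domains and unwanted terms."""
    parts = [f'"{phrase}"']
    parts.extend(f"-site:{domain}" for domain in excluded_domains)
    parts.extend(f"-{term}" for term in excluded_terms)
    return " ".join(parts)


# The example from the text: drop the band's own site and Wikipedia mirrors.
print(refine_query("Joe's Rock Band", excluded_domains=["JoesRockBand.com"]))
```

Excluding the term "wikipedia" is the default here because mirrors and archived deletion discussions would otherwise inflate the count.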
Search engine limitations
Much, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.
The estimated size of the World Wide Web is at least 2 billion pages, but a much deeper (and larger) Web, estimated at over 500 billion pages, exists within databases whose contents the search engines do not index. These "dynamic" pages are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The United States Patent and Trademark Office website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.
Foreign languages and non-Latin scripts
Claims for the non-notability of a topic are occasionally made on the basis of a small number of Google hits, where a considerably larger number would have resulted from searching in the correct script or for various transcriptions. An Arabic name, for instance, needs to be searched for in the original script, which is easily done with Google, provided one knows what to search for. One also has to take into account that English, French, and German webpages, for example, will likely transcribe the name using different conventions.
In addition, different forms of a name used in the original language must be searched for. A Russian personal name has to be searched for both with and without the patronymic. Any search for names and other words in strongly inflected languages should also take into account that arriving at the total number of hits may require searching for forms with varying case endings or other grammatical variations that are not obvious to someone who does not know the language.
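Enumerating the forms to search for can be partly mechanical once someone who knows the language has supplied the variants. The sketch below is illustrative only: `name_queries` is a hypothetical helper, and the transcriptions in the example are supplied by hand as assumptions, not generated.

```python
from itertools import product
from typing import List, Sequence


def name_queries(
    given_forms: Sequence[str],
    family_forms: Sequence[str],
    patronymic_forms: Sequence[str] = (),
) -> List[str]:
    """Return quoted phrases for every combination, with and without a patronymic."""
    queries = []
    for given, family in product(given_forms, family_forms):
        queries.append(f'"{given} {family}"')
        for patronymic in patronymic_forms:
            queries.append(f'"{given} {patronymic} {family}"')
    return queries


# Hand-supplied transcriptions of one Russian name, used purely as an example.
for query in name_queries(["Pyotr", "Peter"], ["Tchaikovsky", "Chaikovsky"], ["Ilyich"]):
    print(query)
```

The same page may match several of these phrases, so hit counts for the variants should not simply be added together, and the original-script and inflected forms still have to come from a competent speaker.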
Doing a search like this requires a certain linguistic competence which not every individual Wikipedian possesses, but the Wikipedia community as a whole includes many bilingual and multilingual people. It is important for nominators and voters on VfD at least to be aware of their own limitations, and not to cite a small number of Google hits for, say, a Serbian poet as conclusive without pointing out the limited validity of a preliminary search that used only one particular transcribed form of the name.