Search engines for scientific and academic information

Lluís Codina

Recommended citation: Lluís Codina. Search engines for scientific and academic information [online]. "Hipertext.net", no. 5, 2007. <http://www.hipertext.net>

  1. Introduction
    1.1. Scirus
      1.1.1. Context
      1.1.2. Inputs
    1.2. Google Scholar
      1.2.1. Context
      1.2.2. Inputs
    1.3. Live Search Academic
      1.3.1. Context
      1.3.2. Inputs
  2. Conclusions
  3. Acknowledgments
  4. References

 

1. Introduction

A contradiction appears when we join the words "Web" and "science." It arises from the academic and professional sector's reasonable suspicion of the Web's contents:

  1. Who controls the information published on the Web?

  2. Up to what point is the information found through search engines reliable?

  3. Are the editorial review processes used in print publications, which have meant so much to the development of science, also applied on the Web?

There is certainly no lack of famous cases of fraud or manipulation on the Web, such as that of the White House website, where search engine manipulation led to unscrupulous web positioning (the most popular technique, known as "Google bombing," was only recently eradicated by Google). Such concerns have led to events like the recent prohibition, in a North American university, of students citing Wikipedia as a source in their academic papers.

Moreover, it is difficult to find academic or scientific results when terms with the same name (but a different meaning) are also used in commercial or popular culture. For example, somebody interested in the physiology of dreaming would find it difficult to find information on the Rapid Eye Movement phase of sleep, since it is internationally known as REM. If this expression is entered in Google, one would find little more than links to the musical group REM. The word "Dolly" is another good example: if somebody interested in cloning searches for information on the famous cloning experiment with the sheep Dolly, search engines like Google will probably bring up mostly information about the singer Dolly Parton.

The same goes for searching with key words that coincide with popular Internet forum discussions. For example, if somebody interested in finding information on graphics cards uses the key words ATI or NVIDIA, the last thing to appear would be technical analyses or scientific articles on the subject; instead, the search would bring up hundreds of typically chaotic messages in forums and discussion groups, along with an endless list of prices on sites like eBay.

Nevertheless, despite all its drawbacks, the Web is not only here to stay, but to make a positive and real impact on the spread of academic and scientific information. For years, more or less since the nineties, one of the solutions to this contradiction that academia strived for was the development and promotion of directories, websites and evaluation services like INTUTE (www.intute.ac.uk).

With the Internet's exponential growth since then, the problem that these "hand-made" information services face is that they can cover only a very small part of the Web's real contents, which leaves the contradiction unresolved.

Academic search engines

Historically, the major publisher Elsevier was the first to detect that there was a new need for academic information on the Web, and that a new class of information system was needed to meet it. Specifically, Elsevier conceived of a system capable of automatically indexing web pages, just like conventional search engines, but with the ability to filter information so as to provide only results that are admissible and reliable by academic criteria.

This product is Scirus (www.scirus.com), which seems to have made Google jealous enough to launch a similar operation, resulting in Google Scholar (scholar.google.com) a few years later.

Soon after came the copy (luckily for the academic world): Microsoft would not be outdone, and since the beginning of 2007 we have a new contender in this exciting field, Live Search Academic (academic.live.com).

The main principle of the three systems is that they only index websites associated with academia. What exactly counts as "academia" changes in each case. The one that best combines rigour and range is definitely Scirus. Live Search Academic attempts to be the most rigorous, but at the cost of cutting its range, and somewhere in the middle lies Google Scholar.

Here we will present a comparison of the three search engines, and to do so we propose the following classification of academic documents:

  1. Type 1: All sorts of web pages and documents (Word, ppt, etc.) published on academic and scientific websites (e.g. sites with .edu domains).

  2. Type 2: Peer-reviewed scientific publications, whether open access or pay publications.

  3. Type 3: Academic work including doctoral, master's or bachelor's theses.

  4. Type 4: Documents stored in scientific repositories (e-prints), whether pre-prints, post-prints, didactic material, etc.

  5. Type 5: Patents

  6. Type 6: Books (monographs)

These six types of documents overlap. For example, some repositories (not all) include doctoral theses; some repositories are run by scientific organisations or governmental agencies, while others are created and maintained by universities and are accessed through their websites, etc. Despite this overlap, the classification will help us place academic search engines into context.

From this classification we can establish a table like the following in order to compare the three previously mentioned search systems in terms of the types of documents they include (that is, their "inputs"):

System                 Type 1  Type 2  Type 3  Type 4  Type 5  Type 6
Scirus                   x       x       x       x       x       .
Live Search Academic     .       x       .       .       .       .
Google Scholar           x       x       x       x       .       x

As we can see, Scirus and Google Scholar each include five of the six types, though not the same five: Scirus does not have books, and Google Scholar does not include patents. Live Search Academic has only one, Type 2 (scientific journals), which, as would be expected, is the only type common to all three engines. Below we look at each of the three engines in further detail.
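The comparison in the table can also be expressed as a small data structure. The following Python sketch is purely illustrative: it encodes the table as sets (the names COVERAGE, type_counts and common_types are our own, not part of any engine) and recomputes how many types each engine covers and which types all three share.

```python
# Coverage table from the comparison above, encoded as sets of document types.
# The names used here are illustrative; they do not come from any search engine's API.
COVERAGE = {
    "Scirus": {1, 2, 3, 4, 5},
    "Live Search Academic": {2},
    "Google Scholar": {1, 2, 3, 4, 6},
}


def type_counts(coverage):
    """Return how many of the six document types each engine covers."""
    return {engine: len(types) for engine, types in coverage.items()}


def common_types(coverage):
    """Return the document types indexed by every engine."""
    return set.intersection(*coverage.values())


if __name__ == "__main__":
    print(type_counts(COVERAGE))
    print(common_types(COVERAGE))
```

Encoding each row as a set makes the overlaps between engines a matter of ordinary set operations, for example COVERAGE["Scirus"] - COVERAGE["Google Scholar"] for the types only Scirus covers.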

1.1. Scirus

Illustration 1: The austere yet powerful and effective Scirus homepage.

1.1.1. Context

The Scirus search engine is, as we have previously mentioned, a project of the major Dutch scientific journal publisher Elsevier (www.elsevier.com), which is in turn part of the giant Anglo-Dutch publishing house Reed-Elsevier (www.reed-elsevier.com), which publishes books, magazines and databases such as Lexis-Nexis.

What we see here is that Elsevier seems to have clearly understood the key role the Web is playing in distributing academic information, while also providing two of the main academic databases: Science Direct (www.sciencedirect.com) and Scopus (www.scopus). In this case, and unlike the other two search engines we analyse here, these products are aimed at use within a university library setting.

Scirus was founded in 2001 and has slowly extended its scope by incorporating new sources, until becoming the most complete of the three systems (the other two being Google Scholar and Live Search Academic). In an analysis carried out at the end of 2006 (Jacsó, 2006) it is reported to have 300 million documents (it began with 50 million in 2001, so its content has multiplied by six since then). Two other analyses (Giustini and Barsky, 2005; Doldi and Bratengeyer, 2005) confirmed at the time that Scirus was much more complete than Google Scholar (Live was not around in 2005) due to how it handled scientific repositories like those of the American Physical Society or PubMed.

1.1.2. Inputs

Scirus's inputs, that is, the origin of the documents included in its indexes, are the following (we follow the categories established by Scirus):

  1. Journal articles: mainly academic publications from Elsevier itself (some 2,000 titles) plus a wide range of open access publications. These are what Scirus categorises as Journal Sources in its results page, and they can also be selected among its search options.

  2. Institutional and academic repositories: this section includes repositories like NASA's in astronomy, or Cornell University Library's for the sciences (physics, computer science, biology and mathematics), totalling (in theory) up to 18 repositories. Besides those already mentioned, it is worth highlighting the doctoral theses from the international NDLTD network and the Lexis-Nexis patent collection, which includes patents from the United States, Japan and Europe. We say "in theory" because tests show that it really uses more repositories; for example, we saw that it also uses E-LIS, a repository on Library and Information Science which did not appear on the "official" list of Scirus sources. Scirus classifies these types of documents as Preferred Web Sources.

  3. Web pages and documents published online: this category deals exclusively with servers belonging to universities, academic institutions, or company R+D departments or institutes. In terms of their domain, these are mainly .edu, ac.uk, .gov, etc. Scirus classifies this group as Other Web Sources.
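To make the idea of selecting pages by domain concrete, here is a minimal Python sketch of a domain-suffix filter. It is an assumption-laden illustration, not Scirus's actual method: the suffix list simply reuses the examples given above (.edu, ac.uk, .gov), and the names ACADEMIC_SUFFIXES and is_academic_url are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allow-list built from the examples in the text (.edu, ac.uk, .gov).
# The real criteria used by any search engine are not documented here.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".gov")


def is_academic_url(url, suffixes=ACADEMIC_SUFFIXES):
    """Return True if the URL's host name ends with one of the given suffixes.

    URLs without a scheme (e.g. "www.example.edu") have no parsed host name
    and are therefore rejected by this sketch.
    """
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(suffixes)


if __name__ == "__main__":
    for candidate in (
        "http://physics.example.edu/paper.pdf",
        "http://www.example.ac.uk/thesis",
        "http://www.example.com/forum/thread",
    ):
        print(candidate, is_academic_url(candidate))
```

A filter of this kind is cheap to apply to every crawled page, but it is only a first approximation: it says nothing about the quality of an individual document hosted on an academic domain.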

1.2. Google Scholar

Illustration 2: The super austere Google Scholar homepage

1.2.1. Context

By now Google hardly needs an introduction. This company has revolutionised the way we search the Web, even affecting the way we browse. For example, the majority of web surfers no longer use a browser's favourites option: they prefer to simply enter the company name in the most famous search engine in history. Most do not even enter the complete URL anymore if it is at all complicated. They prefer to enter a part of the name, knowing that Google will take them there, probably as the top result. It has practically sent general directories like Yahoo or Dmoz into hiding, and it has eliminated the hundreds of national and international directories which were around before 2000. Google has also influenced the way people do business by generating profits from the Web through its advertising systems AdWords and AdSense, which all competitors now imitate.

Finally, it has created (or forced the development of, however you want to look at it) a branch of mathematics: web link analysis. Google's contribution to the Web is immense. However, in its incessant search for new activities (always hoping to reinforce its business model, we must not forget), two years ago Google decided to enter the academic search engine market and launched Google Scholar with some (relatively) new ideas. The most important, without a doubt, is providing the Web with citation analysis (which is why we say relatively new).

1.2.2. Inputs

As stated in its official documentation (which can also be easily checked with a simple test) Google Scholar´s inputs are the following:

  1. Journal articles: here we are dealing with articles from academic publishers that have agreed to become part of the Google Scholar programme. In a secretive fashion (something becoming more and more common with Google) there is no public documentation (at least this analyst has not found it) listing which specific publishers are included. Repeated tests show that there is a wide range of them, but of course this is no substitute for the good practice of periodically publishing the list of publishers in the Google Scholar programme.

  2. Books: just as with journal articles, we are dealing with publishers that have agreed to become part of the Google Scholar programme, but in this case book publishers. Again, we do not have a public list of these publishers. However, this is only one of the two kinds of books included. The other comes from agreements with libraries covering works whose copyright has expired, that is, works that have entered the public domain a set number of years after the author's death, as fixed by each jurisdiction (European, North American, etc.). It is worth noting that in general, if Scholar's result is a book, it takes us to Google Books for viewing it. However, we feel we should include it in this category since it does come up in Scholar searches.

  3. Websites: just like Scirus, it includes documents and pages from websites associated with academia. Scholar's official documentation does not explain how these sites are selected. However, we can deduce that it must use a system similar to that of Scirus, that is, using sites with .edu domains, etc., even though it does not provide a list of URLs (sites) for us to analyse and from which others could be found. In the websites category, Google Scholar also includes e-prints like those mentioned when discussing Scirus.

Google Scholar's main problem is that it does not provide any specific information on its sources. We do not have a list of the publishing houses or repositories, nor the number of sites indexed or the number of documents it holds. On a positive note, we can mention that it has created its own impact index based on the number of documents in which a given work is cited. This would be something like a low-cost alternative to the ISI index (with far fewer features, at least for now).
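The idea of an impact index based on citation counts can be sketched in a few lines of Python. This is only an illustration of the simple count described above; how Google Scholar actually computes or weights its figures is not documented here, and the function name and sample data are hypothetical.

```python
from collections import Counter


def citation_counts(references):
    """Count, for each document, how many distinct documents cite it.

    `references` maps a citing document to the list of documents it cites.
    """
    counts = Counter()
    for citing, cited_list in references.items():
        # A set ensures repeated citations within one document count once.
        for cited in set(cited_list):
            if cited != citing:  # self-citations are ignored in this sketch
                counts[cited] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data: document identifiers are placeholders.
    sample = {
        "A": ["B", "C"],
        "B": ["C"],
        "D": ["C", "C", "B", "D"],
    }
    for document, count in citation_counts(sample).most_common():
        print(document, count)
```

Counting distinct citing documents, rather than raw mentions, is one reasonable reading of "the number of documents in which a work is cited"; other conventions are possible.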

1.3. Live Search Academic

Illustration 3: Live Search presents the only interface in the search engine world that does not try to imitate Google

1.3.1. Context

Microsoft (the owner of Live Search) has a curious history with the Web: it almost always arrives late, but ends up dominating all or at least part of the sector. It happened with browsers, with email and with Web searches. It has happened again with academic web searches, although for now it has only fulfilled the first part: it has arrived late. What we do not know is whether it will end up dominating a large part of the sector, as it has done with internet browsers.

Either way, Microsoft is the only computing company with sufficient technological and financial capacity to pose a credible threat to the current leader in general internet search engines (Google), and to the leader in academic search engines (Scirus). But Microsoft's past of inexplicable failures in this field makes it difficult to see it as a leader anytime soon, despite its financial and technical muscle.

1.3.2. Inputs

For Live Academic, the input list is simple: academic journal articles from various publishers and scientific societies. Which of these journals are involved? Luckily, Live Academic is more transparent than Google in this regard, offering a list of "participating publishers." This list includes publishers like ACM, Blackwell, Elsevier, Nature and Springer-Verlag, reaching a little over fifty "publishers." However, these "publishers" publish some 2000 titles a year, leaving us to wonder which of these titles Live Academic includes. That is, are all publications included or only some? The tests show that for now only a limited part of these titles is included. The tests also show that the list does not include non-Anglophone publishers. Of course a Spanish key word search brings up some results, but these usually come from a non-Spanish publisher, like Elsevier, that has at some time published a Spanish document, almost by chance. This is not the same as including publications from CSIC or any other Spanish publisher (in Spanish or any other language).

If Microsoft hopes to take its new academic search engine seriously, it should definitely expand its list of "publishers" on both fronts: publishers from other countries as well as a greater number of titles from each publisher.

 

2. Conclusions

Evidence shows that the diffusion or, if I may, the promotion of knowledge, a characteristic role of library science, is entering a new era. Until recently, the Web had more than proven its great capacity to act as a direct agent of the diffusion of communication and culture. Only scientific and academic information had been left behind.

The search engines' turn towards academia stands in sharp contrast to their "ignoring" the Semantic Web project carried out by the WWW Consortium with an extensive list of institutional supporters. It is surprising that at this new stage of opening new search engines, none of the three leading players (Google, Elsevier, Microsoft) has considered including some aspects of the Semantic Web, such as the use of ontologies. Maybe both initiatives are too new to start thinking of uniting. They would probably first have to mature separately before they can decide to join forces. Even so, their mutual ignorance of each other (Semantic Web and academic search engines) is a shame.

Either way, these new developments in searching point to a new stage in the way in which we manage and diffuse scientific information. For the moment, the evidence is promising and library scientists must continue their work. But now, their characteristic role as promoters of knowledge is performed through the new Web format.

 

3. Acknowledgments

This project has been financed by the Ministerio de Educación y Ciencia (Spain) as part of the HUM2004-03162/FILO project.

 

4. References

Codina, Lluís. (2006). "Motores de búsqueda para usos académicos: ¿Cambio de Paradigma?" ThinkEPI, January 2006. [Available from: http://www.thinkepi.net/repositorio/motores-de-busqueda-para-usos-academicos- ¿cambio-de-paradigma/]

Doldi, L. M.; Bratengeyer, E. (2005). "The web as a free source for scientific information: a comparison with fee-based databases". Online Information Review, v. 29, n. 4, p. 400-411

Giustini, D.; Barsky, E. (2005). "A look at Google Scholar, PubMed, and Scirus: comparisons and recommendations." JCHLA/JABSC, 26, 2005, p. 85-89. [ http://pubs.nrc-cnrc.gc.ca/jchla/jchla26/c05-030.pdf ]

Grupo Digidoc. Web semántica y Sistemas de Información Documental. http://www.semanticaweb.net/

Jacsó, Peter. Péter's Digital Reference Shelf. December, 2006. [ http://www.gale.com/reference/peter/ ]

Rovira, Cristòfol; Marcos, Mari-Carmen; Codina, Lluís (2007). "Repositorios de publicaciones digitales de libre acceso en Europa: análisis y valoración de la accesibilidad, posicionamiento web y calidad del código." El Profesional de la Información, v. 16, n. 1, Jan-Feb 2007. [ http://eprints.rclis.org/archive/00008668/ ]



Creative Commons License
Last updated 05-06-2012
© Universitat Pompeu Fabra, Barcelona