- Enables utilization of large amounts of hardware
- Enables clusters built from cheap/commodity hardware
- Handles failures during processing
- Speeds up data processing significantly
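These benefits are typically delivered by distributed frameworks such as Hadoop MapReduce. As an illustration only, here is a single-process Python sketch of the map/reduce word-count pattern (the sample documents are made up); in a real cluster the map and reduce phases would run in parallel across many machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Made-up input documents for illustration.
docs = ["big data needs big clusters", "commodity clusters process data"]
word_counts = reduce_phase(map_phase(docs))
```

The same shuffle-and-aggregate idea scales because each mapper and reducer only sees a slice of the data.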
Natural Language Processing
During the last decade we have witnessed a huge increase in the amount of unstructured text that we are not using at all. This increase became possible because new Web technologies brought to web applications a richness and functionality that was not available before. Information hidden within this mass of data may open broad opportunities for a company, if properly identified and extracted. That is why Natural Language Processing (NLP), the scientific discipline that deals with language processing by computers and works closely with computer science fields such as text mining/analysis, artificial intelligence, linguistics, and machine learning, has evolved into a cross-disciplinary applied research area.
In short, Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics devoted to making computers “understand” statements written in human languages.
As a result of the recent expansion of the WWW, an immeasurable amount of data is freely available online in varied formats and structures. The variety of data types and the unstructured nature of the content make direct use impractical and hard. This is why Web mining techniques and applications have become one of the key research areas of data mining. Among the purposes of web mining are retrieving and evaluating content, understanding clients' online behavior, increasing efficiency, and improving the online experience of clients. In short, Web Mining is the application of various data mining techniques and tools for discovering and revealing hidden information on the WWW.
It is essential for any data analysis project to have enough relevant information in order to get reliable results. The most effective and cost-saving way of data collection is the use of Web Crawlers (or Spiders). All Internet giants, including Google, Yahoo, Facebook, etc., use crawlers. The principle of operation of any crawler is based on the hyperlinked nature of the Web. A crawler starts by visiting a seed set of URLs and adds newly discovered ones during the crawl. Depending on the objectives, each visited website may be processed on the fly or stored on local or distributed storage for later processing.
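The crawl loop described above can be sketched as a breadth-first traversal. In this minimal Python illustration, the hypothetical in-memory `LINKS` graph stands in for real HTTP fetching and link extraction, which a production crawler would do with an HTTP client and an HTML parser:

```python
from collections import deque

# Hypothetical link graph standing in for the live Web:
# page URL -> list of URLs it links to.
LINKS = {
    "http://a.az": ["http://b.az", "http://c.az"],
    "http://b.az": ["http://c.az"],
    "http://c.az": ["http://a.az"],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: start from seed URLs, enqueue newly
    discovered links, and never visit the same URL twice."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # a real crawler would fetch/store the page here
        for link in LINKS.get(url, []):
            if link not in seen:  # de-duplicate before enqueueing
                seen.add(link)
                frontier.append(link)
    return visited

pages = crawl(["http://a.az"])
```

A production crawler adds politeness delays, robots.txt handling, and persistent frontier storage on top of this same loop.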
We are going to develop an intelligent and high-performance Web Crawler that will be used for several purposes, including: feeding a search engine for later indexing; collecting Web textual content for the development of an NLP corpus of the Azerbaijani language; and storing masses of unstructured textual data for Data Analytics.
Search Engine Development
The digital universe is estimated to grow by a factor of 10 by 2020 (to approximately 44 zettabytes). This is why tools and technologies that help us transform the time we spend searching into discovering and understanding information will be increasingly important for enhancing productivity and creativity. As a key area of Data Science, Search Engine related topics have emerged into a multidisciplinary body of research ranging from computer science to the humanities, from artificial intelligence to computational linguistics, from information retrieval to data visualization.
The project goal is to develop a High Performance and Scalable Search Engine for intranets and the public Internet. The project is divided into two phases: within the first phase we plan to index websites under the “AZ” zone; in the second phase the global Internet will be indexed.
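The core data structure behind such indexing is the inverted index, which maps each term to the documents that contain it. A minimal Python sketch follows; the sample pages and URLs are made up for illustration, and a real engine would add tokenization, stemming, and ranking:

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: term -> set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """AND-query: return the pages that contain every query term."""
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

# Made-up crawled pages for illustration.
pages = {
    "http://news.az/1": "hava proqnozu Baki",
    "http://news.az/2": "Baki xeberleri idman",
}
index = build_index(pages)
hits = search(index, "Baki idman")
```

Looking a query up in the index costs time proportional to the matching postings rather than to the whole collection, which is what makes full-text search scale.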