Large Scale Data Management
Modern science and industry depend on data. What matters is how much they can do with that data and how much they benefit from it. Between its creation and the point where it delivers value, data is moved, aggregated, selected, visualized and analyzed. Analyzing data effectively and efficiently across very different formats and purposes is a challenging task that requires non-trivial approaches and techniques in Large Scale Data Management. Distributed (cluster) computing built on a distributed file system is one of the key elements of the solution. In general, distributed computing is any computing that involves multiple computers, remote from each other, that each play a role in a computation or information-processing problem.
How the Hadoop ecosystem (including the Hadoop Distributed File System) supports Large Scale Data Management:
  1. Enables the use of large amounts of hardware
  2. Enables clusters built from cheap commodity hardware
  3. Takes care of failures during processing
  4. Speeds up data processing significantly
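
As a rough illustration of why splitting work across many machines speeds up processing, the sketch below mimics the map, shuffle and reduce steps of the MapReduce programming model used in the Hadoop ecosystem. It is plain single-process Python, not the Hadoop API; the function names and the sample chunks are assumptions made for this example. In a real cluster each chunk would be a block of a file stored in the distributed file system, and the map calls would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    """Run the three steps in sequence over a list of text chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one block of a large file.
    sample_chunks = ["big data needs big clusters", "clusters process data"]
    print(word_count(sample_chunks))
```

Because each chunk is mapped independently, a failed map task can simply be rerun on another node without restarting the whole job, which is the idea behind item 3 in the list above.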

Natural Language Processing
During the last decade we have witnessed a huge increase in the amount of unstructured text, most of which is not being used at all. This increase became possible thanks to new Web technologies that brought web applications a richness and functionality that was not available before. Information hidden within this mass of data may bring broad opportunities to a company, if properly identified and extracted. That is why Natural Language Processing (NLP), the scientific discipline that deals with language processing by computer and works closely with fields such as text mining/analysis, artificial intelligence, linguistics and machine learning, has evolved into a cross-disciplinary applied research area.
In short, Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics devoted to making computers “understand” statements written in human languages.

Data/Web/Text mining
As a result of the recent expansion of the WWW, an immeasurable amount of data is freely available online in varied formats and structures. The variety of data types and the unstructured nature of the content make direct use impractical. This is why Web mining techniques and applications have become one of the key research areas of data mining. The purposes of web mining include retrieving and evaluating content, understanding clients’ online behavior, and increasing the efficiency and improving the online experience of clients. In short, Web Mining is the application of various data mining techniques and tools for discovering and revealing hidden information on the WWW.

Web Crawling
It is essential for any data analysis project to have enough relevant information in order to get reliable results. The most effective and cost-saving way to collect data is to use Web crawlers (also called spiders). All Internet giants, including Google, Yahoo and Facebook, use crawlers. A crawler's operation is based on the hyperlinked nature of the Web: it starts by visiting a seed set of URLs and adds new ones as the crawl proceeds. Depending on the objectives, each visited page may be processed on the fly or stored on local or distributed storage for later processing.
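
The crawl loop described above (start from seed URLs, follow hyperlinks, store what is fetched) can be sketched in a few lines of Python. This is a minimal breadth-first sketch under stated assumptions, not the project's crawler: the `crawl` function, the `fetch` callback and the in-memory `FAKE_WEB` dictionary are hypothetical names introduced for illustration, and a real crawler would also need politeness delays, robots.txt handling and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) must return the page's HTML as a string, or None if the
    page cannot be retrieved. Returns a dict mapping each visited URL to
    its HTML, in visit order.
    """
    frontier = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            frontier.append(seed)

    store = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it and keep crawling
        store[url] = html  # "stored for later processing"
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}

if __name__ == "__main__":
    pages = crawl(["http://example.com/"], FAKE_WEB.get)
    print(list(pages))
```

Passing the fetch function in as a parameter keeps the traversal logic separate from the network layer, so the same loop could be driven by a real HTTP client or by a distributed fetcher.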
We are going to develop an intelligent, high-performance Web crawler that will serve several purposes, including: feeding a search engine for later indexing; collecting Web textual content for developing an NLP corpus of the Azerbaijani language; and storing large volumes of unstructured textual data for Data Analytics.

Search Engine Development
It is estimated that the volume of digital data will grow by a factor of 10 by 2020 (to approximately 44 ZB). This is why tools and technologies that help us turn the time we spend searching into time spent discovering and understanding information will be increasingly important for productivity and creativity. As a key area of Data Science, search engine research has grown into a multidisciplinary field ranging from computer science to the humanities, from artificial intelligence to computational linguistics, and from information retrieval to data visualization.
The project goal is to develop a high-performance, scalable search engine for intranets and the public Internet. The project is divided into two phases: in the first phase we plan to index websites under the “AZ” zone; in the second phase the global Internet will be indexed.
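
To make the notion of indexing concrete, the sketch below builds an inverted index, a common data structure behind text search that maps each term to the set of documents containing it. This is only an illustrative sketch, not the project's design: the function names, the simple regular-expression tokenizer and the sample documents are assumptions, and a production engine would add ranking, language-specific normalization and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical crawled pages, keyed by a document id.
    docs = {
        "d1": "Baku is the capital of Azerbaijan",
        "d2": "Azerbaijan web crawler",
        "d3": "web search engine",
    }
    index = build_index(docs)
    print(search(index, "azerbaijan web"))
```

Looking up a query then costs a few set intersections instead of a scan over every stored page, which is what makes indexing worthwhile at the scale of a national domain zone or the global Internet.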