Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields
Processing large volumes of data is a challenging issue, particularly in data-redundant systems. The conditional random fields (CRF) model is one of the most recognized models and has been widely applied in biomedical named entity recognition (Bio-NER). Because the CRF model is inherently sequential, improving its performance is nontrivial and requires new parallelized solutions. By combining and parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, we propose a parallel CRF algorithm called MRCRF (MapReduce CRF) in this paper, which contains two parallel sub-algorithms that handle the two time-consuming steps of the CRF model. The MRLB (MapReduce L-BFGS) algorithm leverages the MapReduce framework to enhance parameter estimation. The MRVtb (MapReduce Viterbi) algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. Experimental results show that the MRCRF algorithm outperforms competing methods, with a significant improvement in time efficiency while preserving a guaranteed level of correctness.
EXISTING SYSTEM:
Current methods for Bio-NER fall into three general classes, i.e., dictionary-based methods, heuristic rule-based methods, and statistical machine learning methods.
To handle large data, Jeong et al. proposed an efficient CRF inference algorithm for large-scale natural language data that unifies the SFB and TP approaches.
Lavergne et al. addressed the issue of training very large CRFs, containing up to hundreds of output labels and several billion features.
DISADVANTAGES OF EXISTING SYSTEM:
Relying on dictionary-based methods can lead to low recall, because new entities continually appear as biological research advances. Biological named entities do not follow any fixed nomenclature, which makes rule-based methods hard to perfect.
In addition, rule-based systems require domain experts and are not portable to other named entity (NE) types and domains.
Machine learning methods are more robust and can identify potential biomedical entities that are not yet included in standard dictionaries.
The conditional random fields (CRF) model, a type of conditional probability model, has been widely applied in biomedical named entity recognition.
PROPOSED SYSTEM:
In this paper, we propose an improved parallel CRF algorithm by combining the parallel L-BFGS and Viterbi algorithms. The algorithm leverages the MapReduce framework to enhance the capability of estimating parameters. Furthermore, it infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job.
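Parameter estimation suits MapReduce because the training objective and its gradient are sums over training examples, so each data split can contribute a partial sum that is later added up. The following is a minimal sketch of that pattern in plain Java, without the Hadoop API. It is not the MRLB algorithm itself: a binary logistic model stands in for the CRF objective, and the names GradientSumSketch, Partial, map and reduce are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: per-split partial gradients added up by a reduce step (no Hadoop). */
final class GradientSumSketch {

    /** Partial negative log-likelihood and gradient for one data split. */
    static final class Partial {
        double loss;
        final double[] grad;
        Partial(int dim) { grad = new double[dim]; }
    }

    /**
     * "Map" step: loss and gradient over one split.
     * Assumption: a binary logistic model replaces the CRF here; both
     * objectives are sums over training examples.
     */
    static Partial map(double[][] xs, int[] ys, double[] w) {
        Partial p = new Partial(w.length);
        for (int i = 0; i < xs.length; i++) {
            double z = 0.0;
            for (int j = 0; j < w.length; j++) z += w[j] * xs[i][j];
            double prob = 1.0 / (1.0 + Math.exp(-z));
            p.loss -= (ys[i] == 1) ? Math.log(prob) : Math.log(1.0 - prob);
            for (int j = 0; j < w.length; j++) p.grad[j] += (prob - ys[i]) * xs[i][j];
        }
        return p;
    }

    /** "Reduce" step: element-wise sum of the partial results. */
    static Partial reduce(List<Partial> parts, int dim) {
        Partial total = new Partial(dim);
        for (Partial p : parts) {
            total.loss += p.loss;
            for (int j = 0; j < dim; j++) total.grad[j] += p.grad[j];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.25};
        // Two hypothetical splits of a toy data set.
        double[][] split1 = {{1.0, 2.0}, {0.0, 1.0}};
        int[] labels1 = {1, 0};
        double[][] split2 = {{2.0, 0.5}};
        int[] labels2 = {1};

        List<Partial> parts = new ArrayList<Partial>();
        parts.add(map(split1, labels1, w));
        parts.add(map(split2, labels2, w));
        Partial total = reduce(parts, w.length);
        System.out.println("loss=" + total.loss
                + " grad=[" + total.grad[0] + ", " + total.grad[1] + "]");
    }
}
```

The summed loss and gradient are what an L-BFGS optimizer would consume at each iteration; the optimizer update itself is omitted from this sketch.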
We propose an efficient method called MRCRF (MapReduce CRF) to partition a large dataset across Hadoop nodes so as to keep the context of each word in each sentence of Bio-NER, balance the workload, and minimize the need for replication. Unlike the sequential CRF method, the proposed MRCRF method requires “partitioning” the data sets.
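One way to picture a partitioning that keeps word context and balances the workload is to assign whole sentences to partitions and even out the token counts. The sketch below uses a greedy largest-first heuristic chosen purely for illustration; the partitioning scheme actually used by MRCRF is not described here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Sketch only: sentence-preserving, load-balanced partitioning (illustrative heuristic). */
final class SentencePartitionSketch {

    /** Assigns whole sentences, longest first, to the partition with the fewest tokens so far. */
    static List<List<String[]>> partition(List<String[]> sentences, int numPartitions) {
        List<List<String[]>> parts = new ArrayList<List<String[]>>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<String[]>());
        int[] load = new int[numPartitions];

        // Sort a copy by descending sentence length; the input list is left untouched.
        List<String[]> sorted = new ArrayList<String[]>(sentences);
        Collections.sort(sorted, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) { return b.length - a.length; }
        });

        for (String[] sentence : sorted) {
            int lightest = 0;
            for (int i = 1; i < numPartitions; i++) {
                if (load[i] < load[lightest]) lightest = i;
            }
            // A sentence is never split, so every token keeps its full context.
            parts.get(lightest).add(sentence);
            load[lightest] += sentence.length;
        }
        return parts;
    }

    /** Total number of tokens in one partition. */
    static int tokenCount(List<String[]> part) {
        int n = 0;
        for (String[] sentence : part) n += sentence.length;
        return n;
    }

    public static void main(String[] args) {
        List<String[]> sentences = new ArrayList<String[]>();
        sentences.add("IL-2 gene expression".split(" "));
        sentences.add("NF-kappa B binds the promoter".split(" "));
        sentences.add("p53 mutation".split(" "));
        sentences.add("T cells express CD28".split(" "));

        List<List<String[]>> parts = partition(sentences, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i).size()
                    + " sentences, " + tokenCount(parts.get(i)) + " tokens");
        }
    }
}
```

Because sentences are kept intact, no token needs to be replicated across partitions to recover its neighbours.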
We develop two efficient parallel algorithms, i.e., the MRLB (MapReduce L-BFGS) algorithm and the MRVtb (MapReduce Viterbi) algorithm to implement the parallel CRF for Bio-NER based on MapReduce. The algorithms have improved performance compared with an existing sequential algorithm.
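For reference, the sequential Viterbi recursion that MRVtb builds on can be sketched as follows. This is the textbook algorithm in log space, not the MapReduce version; the score matrices and the class name ViterbiSketch are illustrative assumptions. In a MapReduce setting, one plausible arrangement is for each mapper to call such a routine on every sentence in its split, since sentences are decoded independently.

```java
import java.util.Arrays;

/** Sketch only: textbook Viterbi decoding for a linear-chain model, in log space. */
final class ViterbiSketch {

    /**
     * @param emit  emit[t][s] is the log-score of state s at position t
     * @param trans trans[p][s] is the log-score of moving from state p to state s
     * @return the highest-scoring state sequence
     */
    static int[] decode(double[][] emit, double[][] trans) {
        int n = emit.length;
        if (n == 0) return new int[0];
        int k = trans.length;
        double[][] best = new double[n][k]; // best score of any path ending in s at t
        int[][] back = new int[n][k];       // predecessor state on that best path

        for (int s = 0; s < k; s++) best[0][s] = emit[0][s];

        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int arg = 0;
                double max = best[t - 1][0] + trans[0][s];
                for (int p = 1; p < k; p++) {
                    double candidate = best[t - 1][p] + trans[p][s];
                    if (candidate > max) { max = candidate; arg = p; }
                }
                best[t][s] = max + emit[t][s];
                back[t][s] = arg;
            }
        }

        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (best[n - 1][s] > best[n - 1][last]) last = s;
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Hypothetical scores: state 0 = outside, state 1 = entity.
        double[][] emit = {{1.0, 0.0}, {0.0, 0.5}, {0.0, 1.0}};
        double[][] trans = {{0.5, -1.0}, {-1.0, 0.5}};
        System.out.println(Arrays.toString(decode(emit, trans))); // prints [1, 1, 1]
    }
}
```

In this toy input the first token on its own prefers state 0, but the transition scores make the all-entity path the best one overall, which is the kind of sequence-level decision Viterbi is meant to capture.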
We conduct a performance evaluation that reveals the benefit of the MRCRF algorithm over its sequential CRF counterpart. Performance is reported as speedup over the sequential CRF under different data set sizes and varying Hadoop configurations.
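Speedup is conventionally computed as the sequential running time divided by the parallel running time. The short sketch below shows that arithmetic; the timings in it are made-up placeholders, not measurements from the paper, and the efficiency figure (speedup divided by node count) is a standard companion metric added here for illustration.

```java
/** Sketch only: speedup and parallel efficiency from two running times. */
final class SpeedupSketch {

    /** Speedup = sequential time / parallel time. */
    static double speedup(double sequentialSeconds, double parallelSeconds) {
        return sequentialSeconds / parallelSeconds;
    }

    /** Efficiency = speedup / number of nodes (1.0 would be ideal linear scaling). */
    static double efficiency(double speedup, int nodes) {
        return speedup / nodes;
    }

    public static void main(String[] args) {
        // Hypothetical timings, for illustration only.
        double sequentialSeconds = 1200.0;
        double parallelSeconds = 200.0;
        int nodes = 8;

        double s = speedup(sequentialSeconds, parallelSeconds);
        System.out.println("speedup = " + s);                        // 6.0
        System.out.println("efficiency = " + efficiency(s, nodes));  // 0.75
    }
}
```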
ADVANTAGES OF PROPOSED SYSTEM:
An advantage of the CRF model is its ability to express long-distance dependencies and overlapping features. CRF has recently shown empirical success in Bio-NER because its global normalization frees it from the so-called label bias problem.
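Global normalization means that a linear-chain CRF divides the score of a label sequence by a single partition function Z(x) summed over all label sequences, instead of normalizing at each position. The sketch below computes log Z(x) with the standard forward recursion and turns a sequence score into a probability; the score matrices and the class name GlobalNormalizationSketch are illustrative assumptions.

```java
/** Sketch only: global normalization in a linear-chain model via the forward recursion. */
final class GlobalNormalizationSketch {

    /** Numerically stable log(sum(exp(v))). */
    static double logSumExp(double[] v) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : v) max = Math.max(max, x);
        double sum = 0.0;
        for (double x : v) sum += Math.exp(x - max);
        return max + Math.log(sum);
    }

    /** log Z(x): log of the summed exponentiated scores of all label sequences. */
    static double logPartition(double[][] emit, double[][] trans) {
        int k = trans.length;
        double[] alpha = emit[0].clone();
        for (int t = 1; t < emit.length; t++) {
            double[] next = new double[k];
            for (int s = 0; s < k; s++) {
                double[] terms = new double[k];
                for (int p = 0; p < k; p++) terms[p] = alpha[p] + trans[p][s];
                next[s] = logSumExp(terms) + emit[t][s];
            }
            alpha = next;
        }
        return logSumExp(alpha);
    }

    /** Unnormalized log-score of one label sequence. */
    static double score(int[] labels, double[][] emit, double[][] trans) {
        double total = emit[0][labels[0]];
        for (int t = 1; t < labels.length; t++) {
            total += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]];
        }
        return total;
    }

    /** p(y | x) = exp(score(y) - log Z(x)). */
    static double probability(int[] labels, double[][] emit, double[][] trans) {
        return Math.exp(score(labels, emit, trans) - logPartition(emit, trans));
    }

    public static void main(String[] args) {
        // Hypothetical scores for a 3-token sentence with 2 labels.
        double[][] emit = {{1.0, 0.0}, {0.0, 0.5}, {0.0, 1.0}};
        double[][] trans = {{0.5, -1.0}, {-1.0, 0.5}};
        double total = 0.0;
        for (int code = 0; code < 8; code++) {
            int[] labels = {code & 1, (code >> 1) & 1, (code >> 2) & 1};
            total += probability(labels, emit, trans);
        }
        // The probabilities of all 8 label sequences should add up to 1 (within rounding).
        System.out.println("sum of probabilities = " + total);
    }
}
```

Because the normalizer is shared by every label sequence of the sentence, states with few outgoing transitions gain no unfair advantage, which is the intuition behind avoiding label bias.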
We empirically show that, while maintaining competitive accuracy on the test data, the algorithm achieves significant speedup compared to the baseline CRF algorithm implemented on a single machine. The proposed algorithms are designed to work in the Hadoop environment, where each mapper on a node processes only a subset of the data.
SYSTEM REQUIREMENTS:
System : Pentium IV 2.4 GHz.
Hard Disk : 40 GB.
Floppy Drive : 1.44 MB.
Monitor : 15-inch VGA Colour.
RAM : 512 MB.
Operating System : Windows 7/Ubuntu.
Coding Language : Java 1.7, Hadoop 0.8.1
IDE : Eclipse
Database : MySQL
Kenli Li, Wei Ai, Zhuo Tang, Fan Zhang, Lingang Jiang, Keqin Li, and Kai Hwang, “Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields”, IEEE Transactions on Parallel and Distributed Systems 2015.