Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

ABSTRACT:

The increase of interest in using social media as a source for research has motivated tackling the challenge of automatically geolocating tweets, given the lack of explicit location information in the majority of tweets. In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet’s country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone – the most widely used feature in previous work – leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20% and 50%. We observe that tweet content, the user’s self-reported location and the user’s real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.

EXISTING SYSTEM:

  • The existing system is the work by Han et al., who conducted a comprehensive study of how Twitter users can be geolocated using different features of tweets. They analysed how location-indicative words from a user’s aggregated tweets can be used to geolocate the user. However, this requires collecting a user’s history of tweets, which is not realistic in our real-time scenario.
  • They also looked at how some metadata from tweets can be leveraged for classification, achieving slight improvements in performance, but again this is for a user’s aggregated history.
  • Finally, they looked at the temporality of tweets, using an old model to classify new tweets, finding that new tweets are more difficult to classify. This is an insightful study, which also motivates some of the settings and selection of classifiers in our own study; however, while an approach based on location indicative words may be very useful when looking at a user’s aggregated tweets, it is rather limited when – as in our case – relying on a single tweet per user.

DISADVANTAGES OF EXISTING SYSTEM:

  • Twitter data lacks reliable demographic details that would enable a representative sample of users to be collected and/or a focus on a specific user subgroup.
  • Most of the previous research in inferring tweet geolocation has classified tweets by location within a limited geographical area or country; these cannot be applied directly to an unfiltered stream where tweets from any location or country will be observed.
  • The few cases that have dealt with a global collection of tweets have used an extensive set of features that cannot realistically be extracted in a real-time, streaming context (e.g., user tweeting history or social networks), and have been limited to a selected set of global cities as well as to English tweets.

PROPOSED SYSTEM:

  • Our methodology enables us to perform a thorough analysis of tweet geolocation, revealing insights into the best approaches for an accurate country-level location classifier for tweets.
  • We find that the use of a single feature like content, the most commonly used feature in previous work, does not suffice for an accurate classification of users by country, and that combining multiple features leads to substantial improvement, outperforming the state-of-the-art real-time tweet geolocation classifier. This improvement is particularly evident when using metadata such as the user’s self-reported location and the user’s real name (a minimal feature-combination sketch follows this list).
  • We also perform a per-country analysis for the top 25 countries in terms of tweet volume, exploring how different features lead to optimal classification for different countries, as well as discussing limitations when dealing with some of the most challenging countries.
  • We show that country-level classification of an unfiltered Twitter stream is challenging and requires careful design of a classifier that uses an appropriate combination of features.
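
As an illustration of combining tweet content with metadata, the sketch below (in Java, the project’s coding language) merges field-prefixed tokens from the tweet text, the user’s self-reported location and the user’s real name into a single sparse feature set that any standard classifier can consume. The class and method names are hypothetical and the tokenisation is an assumption; the paper’s exact feature design is not reproduced here.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: combines several tweet-inherent fields into one sparse
// feature set by prefixing tokens with their field name, so that a standard
// classifier (e.g., Naive Bayes or maximum entropy) can weight each field
// independently.
public class TweetFeatureExtractor {

    public static List<String> extract(String content, String userLocation, String realName) {
        List<String> features = new ArrayList<>();
        addTokens(features, "content", content);
        addTokens(features, "location", userLocation);
        addTokens(features, "name", realName);
        return features;
    }

    private static void addTokens(List<String> out, String field, String text) {
        if (text == null) {
            return;
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}#@]+")) {
            if (!token.isEmpty()) {
                out.add(field + "=" + token);   // e.g. "location=london"
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("Lovely morning at the beach", "Sydney, Australia", "Jane Doe"));
    }
}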

ADVANTAGES OF PROPOSED SYSTEM:

  • To the best of our knowledge, our work is the first to deal with global tweets in any language, using only those features present within the content of a tweet and its associated metadata.
  • We also complement previous work by investigating the extent to which a classifier trained on historical tweets can be used effectively on newly harvested tweets.
  • Our results at the country level are promising enough in the case of numerous countries, encouraging further research into finer grained geolocation of global tweets.
  • Still, our experiments show that we can achieve F1 scores above 80% in many of these cases given the choice of an appropriate combination of features, as well as an overall performance above 80% in terms of both micro-accuracy and macro-accuracy for the top 25 countries.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, Adam Tsakalidis, “Towards Real-Time, Country-Level Location Classification of Worldwide Tweets”, IEEE Transactions on Knowledge and Data Engineering, 2017.

SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

ABSTRACT:

Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and only capture the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data—hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework—SociRank—which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

EXISTING SYSTEM:

  • Two traditional methods for detecting topics are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA). LDA is a generative probabilistic model that can be applied to different tasks, including topic identification.
  • PLSA, similarly, is a statistical technique, which can also be applied to topic modeling. In these approaches, however, temporal information is lost, which is paramount in identifying prevalent topics and is an important characteristic of social media data.
  • Matsuo et al. employed a different approach to achieve the clustering of co-occurrence graphs. They used Newman clustering to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness measure of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between different clusters must go along one of these edges. Consequently, the edges connecting different clusters will have high edge betweenness, and removing them iteratively will yield well-defined clusters.
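
The following minimal Java sketch computes edge betweenness on an unweighted, undirected keyword co-occurrence graph using Brandes-style accumulation; iteratively removing the highest-betweenness edge and recomputing is the essence of the Newman clustering described above. The adjacency-map representation and class name are illustrative assumptions.

import java.util.*;

// Edge betweenness for an undirected, unweighted graph. The adjacency map
// must contain both directions of every edge.
public class EdgeBetweenness {

    public static Map<String, Double> compute(Map<Integer, List<Integer>> graph) {
        Map<String, Double> betweenness = new HashMap<>();
        for (int source : graph.keySet()) {
            // BFS from the source, counting shortest paths (sigma).
            Map<Integer, Integer> dist = new HashMap<>();
            Map<Integer, Double> sigma = new HashMap<>();
            Map<Integer, List<Integer>> preds = new HashMap<>();
            Deque<Integer> stack = new ArrayDeque<>();
            Queue<Integer> queue = new ArrayDeque<>();
            dist.put(source, 0);
            sigma.put(source, 1.0);
            queue.add(source);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                stack.push(v);
                for (int w : graph.getOrDefault(v, Collections.emptyList())) {
                    if (!dist.containsKey(w)) {           // first visit
                        dist.put(w, dist.get(v) + 1);
                        queue.add(w);
                    }
                    if (dist.get(w) == dist.get(v) + 1) { // v lies on a shortest path to w
                        sigma.merge(w, sigma.get(v), Double::sum);
                        preds.computeIfAbsent(w, k -> new ArrayList<>()).add(v);
                    }
                }
            }
            // Accumulate shortest-path dependencies in reverse BFS order.
            Map<Integer, Double> delta = new HashMap<>();
            while (!stack.isEmpty()) {
                int w = stack.pop();
                for (int v : preds.getOrDefault(w, Collections.emptyList())) {
                    double share = sigma.get(v) / sigma.get(w) * (1.0 + delta.getOrDefault(w, 0.0));
                    String edge = Math.min(v, w) + "-" + Math.max(v, w);
                    betweenness.merge(edge, share, Double::sum);
                    delta.merge(v, share, Double::sum);
                }
            }
        }
        // Each undirected edge was counted once from each endpoint's BFS tree.
        betweenness.replaceAll((e, b) -> b / 2.0);
        return betweenness;
    }
}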

DISADVANTAGES OF EXISTING SYSTEM:

  • Even after the removal of unimportant content, there is still information overload in the remaining news-related data, which must be prioritized for consumption.
  • LDA and PLSA only discover topics from text corpora; they do not rank based on popularity or prevalence.
  • The main disadvantage of the Newman clustering algorithm was its high computational demand.
  • The existing work, however, only considers the personal interests of users, and not prevalent topics at a global scale.
  • These methods, however, only use data from microblogs and do not attempt to integrate them with real news. Additionally, the detected topics are not ranked by popularity or prevalence.

PROPOSED SYSTEM:

  • We propose an unsupervised system—SociRank—which effectively identifies news topics that are prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Even though this paper focuses on news topics, it can be easily adapted to a wide variety of fields, from science and technology to culture and sports.
  • To achieve its goal, SociRank uses keywords from news media sources (for a specified period of time) to identify the overlap with social media from that same period.
  • We then build a graph whose nodes represent these keywords and whose edges depict their co-occurrences in social media. The graph is then clustered to clearly identify distinct topics. After obtaining well-separated topic clusters (TCs), the factors that signify their importance are calculated. Finally, the topics are ranked.

ADVANTAGES OF PROPOSED SYSTEM:

  • To the best of our knowledge, no other work employs either the social media interests of users or their social relationships to aid in the ranking of topics.
  • Moreover, SociRank is an empirical framework that comprises and integrates several techniques, such as keyword extraction, similarity measures, graph clustering, and social network analysis.
  • The effectiveness of our system is validated by extensive controlled and uncontrolled experiments.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Derek Davis, Gerardo Figueroa, and Yi-Shin Chen, “SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors”, IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017.

RAPARE: A Generic Strategy for Cold-Start Rating Prediction Problem

ABSTRACT:

In recent years, recommender systems have become indispensable components of many e-commerce websites. One of the major challenges that largely remains open is the cold-start problem, which can be viewed as a barrier that keeps the cold-start users/items away from the existing ones. In this paper, we aim to break through this barrier for cold-start users/items with the assistance of existing ones. In particular, inspired by the classic Elo Rating System, which has been widely adopted in chess tournaments, we propose a novel rating comparison strategy (RAPARE) to learn the latent profiles of cold-start users/items. The centerpiece of our RAPARE is to provide a fine-grained calibration on the latent profiles of cold-start users/items by exploring the differences between cold-start and existing users/items. As a generic strategy, our proposed strategy can be instantiated into existing methods in recommender systems. To reveal the capability of the RAPARE strategy, we instantiate it on two prevalent methods in recommender systems, i.e., matrix factorization based and neighborhood based collaborative filtering. Experimental evaluations on five real data sets validate the superiority of our approach over the existing methods in the cold-start scenario.

EXISTING SYSTEM:

  • Despite the success of existing recommender systems all over the world, the cold-start problem, i.e., how to make proper recommendations for cold-start users or cold-start items, largely remains a daunting dilemma. On one hand, cold-start users (e.g., who have rated no more than 10 items) and cold-start items (e.g., which have received no more than 10 ratings) occupy a large proportion in many real applications such as Netflix.
  • On the other hand, the effectiveness of the existing recommendation approaches (e.g., collaborative filtering) largely depends on the sufficient amount of historical ratings, and hence these approaches might quickly become ineffective for cold-start users/items that only have few ratings.
  • To date, many collaborative filtering methods have been proposed to mitigate the cold-start problem, and these efforts can be divided into three classes. In the first class, a well-designed interview process is introduced for cold-start users. During this interview process, a set of items is provided for the cold-start users to express their opinions on.
  • Methods in the second class resort to side information such as the user/item attributes and social relationships for the cold-start problem.
  • In the third class, the cold-start problem is tackled in a dynamic manner. The intuition is that, compared to existing users/items, ratings for cold-start users/items may be more valuable for improving the accuracy of recommendations for these cold-start users/items; consequently, methods in this class aim to provide fast recommendations for cold-start users/items specifically, and then dynamically and efficiently adjust their latent profiles as they give/receive new ratings.

DISADVANTAGES OF EXISTING SYSTEM:

  • The main disadvantage of methods in the first class is the additional burden incurred by the interview process.
  • Methods in the second class rely on access to such side information. They are inapplicable when the information is not available (e.g., due to privacy issues or the absence of a user’s social network), and they have a higher computational cost than their side-information-free counterparts.
  • Methods in the third class cannot serve users with no rating in the recommender system.

PROPOSED SYSTEM:

  • In particular, we make the following analogy: we view the cold-start problem as a barrier between the cold-start users/items and the existing ones, and such a barrier can be broken with the assistance of existing users/items. To this end, we propose a novel rating comparison strategy (RAPARE) which can calibrate the latent profiles of cold-start users/items. Take a cold-start user as an example: when the user gives a rating on an item, we first compare this rating with the existing ratings (which are from existing users) on this item. Then, we adjust the profile of the cold-start user based on the outcomes of the comparisons.
  • Our rating comparison strategy (RAPARE) is inspired by the Elo Rating System, which has been widely used to calculate players’ ratings in many different types of match systems (an Elo-style update is sketched after this list).
  • We propose a novel and generic rating comparison strategy, RAPARE, to serve the cold-start problem, and formulate the strategy as an optimization problem. The key idea of RAPARE is to exploit the knowledge from existing users/items to help calibrate the latent profiles of cold-start users/items.
  • We instantiate the proposed generic RAPARE strategy on both matrix factorization based (RAPARE-MF) and neighborhood based (RAPARE-KNN) collaborative filtering, together with algorithms to solve them.
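
For intuition, the sketch below shows the classic Elo update that inspires the rating-comparison idea: a newcomer’s score is nudged towards the outcome of each comparison against an established opponent. This is only the motivating formula; RAPARE itself calibrates latent profiles by solving an optimization problem rather than applying this closed-form update.

// The classic Elo update. K and the 400 scale are the usual chess defaults.
public class EloUpdate {

    private static final double K = 32.0;

    /** Returns the new rating of playerA after a game against playerB.
     *  score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss. */
    public static double update(double ratingA, double ratingB, double score) {
        double expected = 1.0 / (1.0 + Math.pow(10.0, (ratingB - ratingA) / 400.0));
        return ratingA + K * (score - expected);
    }

    public static void main(String[] args) {
        // A newcomer at 1200 beats an established 1500-rated opponent.
        System.out.printf("new rating = %.1f%n", update(1200, 1500, 1.0));
    }
}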

ADVANTAGES OF PROPOSED SYSTEM:

  • We present the algorithm analysis for RAPARE strategy and its instantiations on aspects of effectiveness and efficiency.
  • We conduct extensive experimental evaluations on five real data sets, showing that our approach (1) outperforms several benchmark collaborative filtering methods and online updating methods in terms of prediction accuracy in the cold-start scenario; and (2) achieves a better quality-speed balance while enjoying linear scalability.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Jingwei Xu, Yuan Yao, Hanghang Tong, Xianping Tao, Jian Lu, “RAPARE: A Generic Strategy for Cold-Start Rating Prediction Problem”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Query Expansion with Enriched User Profiles for Personalized Search Utilizing Folksonomy Data

ABSTRACT:

Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships and semantic matching approaches are among those favored by previous research. However, user profiles which only contain a user’s past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods.

EXISTING SYSTEM:

  • Researchers have considered tag-tag relationships for personalized QE, by selecting the most related tags from a user’s profile. However, tags might not be precise descriptions of web pages, and as a result the retrieval performance of this QE approach is somewhat disappointing. Local analysis and co-occurrence based user profile representation have also been adopted to expand the query according to a user’s interaction with the system.
  • It is worth noting that folksonomy data are not used as a test bed as in other approaches, but rather as an external source of information from which to extract semantic classes that are added to web search results. Moreover, terms in this approach are still based on co-occurrence statistics rather than semantic relatedness.
  • Zhou et al. proposed a personalized QE framework based on the semantic relatedness of terms inside individual user profiles. A statistical tag-topic model is created to deduce latent topics from the user’s tags and tagged documents. This model is then used to identify the most relevant terms in the user model to the user’s query and then use those terms to expand the query.

DISADVANTAGES OF EXISTING SYSTEM:

  • User profiles which contain only a user’s past annotation information may not be enough to support the effective selection of expansion terms, especially for users who have had limited previous activity with the system.
  • This may “inject” the personality of other users instead of the current user, causing problems like query shift and/or interest shift.
  • Previous personalized QE research either favors tag-tag relationships or relies on the co-occurrence statistics of two terms. Given that tags may not constitute precise descriptions of resources, and that methods based on pure lexical matching may miss important semantic information, the retrieval performance is generally unsatisfactory.

PROPOSED SYSTEM:

  • In this paper, we adopt a different approach to personalized QE utilizing folksonomy data. In our approach, the expansion process is based on an enriched user profile, which contains tags and annotations together with documents retrieved from an external corpus. This corpus can be viewed as a knowledge base to enhance the information stored in the user profile.
  • The whole procedure of query adaptation is hidden from the user. It happens implicitly, based on their choices of tags and the terms used on annotated web pages. We first propose a novel model to build the enriched user profiles. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents between user annotations and documents from the external corpus. We then present two novel QE techniques.
  • The first technique approaches the problem by using topical weights-enhanced word embeddings to select the best possible expansion terms (a scoring sketch follows this list).
  • The second method is based on the topics learned. It calculates the topical relevance between the query and the terms inside a user profile.
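
A hedged illustration of the first technique’s term selection is sketched below: candidate expansion terms from the enriched user profile are ranked by the cosine similarity of their embeddings to the query embedding, scaled by a topical weight. The map names and the multiplicative weighting are assumptions for illustration; the paper’s precise weighting scheme may differ.

import java.util.*;

// Ranks candidate expansion terms by topicalWeight * cosine(query, term).
public class ExpansionTermScorer {

    public static List<String> topK(double[] queryVec,
                                    Map<String, double[]> termVecs,
                                    Map<String, Double> topicalWeight,
                                    int k) {
        return termVecs.entrySet().stream()
                .sorted((a, b) -> Double.compare(score(queryVec, b, topicalWeight),
                                                 score(queryVec, a, topicalWeight)))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }

    private static double score(double[] q, Map.Entry<String, double[]> e,
                                Map<String, Double> topicalWeight) {
        return topicalWeight.getOrDefault(e.getKey(), 1.0) * cosine(q, e.getValue());
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}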

ADVANTAGES OF PROPOSED SYSTEM:

  • We tackle the challenge of personalized QE utilizing folksonomy data in a novel way by integrating latent and deep semantics.
  • We propose a novel model that integrates word embeddings with topic models to construct enriched user profiles with the help of an external corpus.
  • We suggest two novel personalized QE techniques based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile.
  • The techniques demonstrate significantly better results than previously proposed non-personalized and personalized QE methods.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Dong Zhou, Xuan Wu, Wenyu Zhao, Séamus Lawless, and Jianxun Liu, “Query Expansion with Enriched User Profiles for Personalized Search Utilizing Folksonomy Data”, IEEE Transactions on Knowledge and Data Engineering, 2017.

QDA: A Query-Driven Approach to Entity Resolution

ABSTRACT:

This paper addresses the problem of query-aware data cleaning in the context of a user query. In particular, we develop a novel Query-Driven Approach (QDA) that systematically exploits the semantics of the predicates in SQL-like selection queries to reduce the data cleaning overhead. The objective of QDA is to issue the minimum number of cleaning steps that are necessary to answer a given SQL-like selection query correctly. A comprehensive empirical evaluation of QDA demonstrates outstanding results: QDA is significantly better than traditional ER techniques, especially when the query is very selective.

EXISTING SYSTEM:

  • Traditionally, entity resolution is performed in the context of data warehousing as an offline preprocessing step prior to making data available for analysis – an approach that works well under standard settings. Such an offline strategy, however, is not viable in emerging applications that need to analyze only small portions of the entire dataset and produce answers in (near) real-time.
  • While such solutions address query-aware ER, they are limited to mention-matching and/or numerical aggregation queries executed on top of dirty data. Data analysis, however, often requires a different type of query involving SQL-style selections. For instance, a user may be interested only in well-cited papers (e.g., with a citation count above 45) written by “Alon Halevy”.

DISADVANTAGES OF EXISTING SYSTEM:

  • The previous approaches cannot exploit the semantics of such a selection predicate to reduce cleaning.
  • They do not prune cleaning steps based on query predicates.

PROPOSED SYSTEM:

  • To address these new cleaning challenges we proposed a Query-Driven Approach (QDA) to data cleaning.
  • In this paper, we generalize QDA to work with lazy clustering techniques (viz., those techniques that tend to delay their merging decisions until a final clustering step). Note that such a generalization requires a significantly different QDA approach compared to the one we previously proposed.
  • We develop new ideas that optimize the processing of equality and range queries.
  • Finally, we present a more comprehensive experimental evaluation by providing experiments for the new lazy approach and by using another real-world dataset (from a different domain) to test our solutions.

ADVANTAGES OF PROPOSED SYSTEM:

  • First, while we previously introduced the concept of vestigiality for a large class of SQL selection queries and developed techniques to identify vestigial cleaning steps, in this paper we formally develop the concept of vestigiality. In particular, we (i) differentiate vestigiality from minimality and (ii) provide a theoretical study of the conditions under which vestigiality can be tested using cliques.
  • We demonstrate that QDA is generic and can work with different types of clustering algorithms. Specifically, we explore how the eagerness of the chosen clustering algorithm affects the computational efficiency of QDA.

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Hotham Altwaijry, Dmitri V. Kalashnikov, and Sharad Mehrotra, “QDA: A Query-Driven Approach to Entity Resolution”, IEEE Transactions on Knowledge and Data Engineering, 2017.

PPRank: Economically Selecting Initial Users for Influence Maximization in Social Networks

ABSTRACT:

This paper focuses on seeking a new heuristic scheme for an influence maximization problem in social networks: how to economically select a subset of individuals (so-called seeds) to trigger a large cascade of further adoptions of a new behavior based on a contagion process. Most existing works on seed selection assumed that a constant number k of seeds could be selected, irrespective of each individual’s intrinsically different susceptibility of being influenced (e.g., it may be costly to persuade some seeds to adopt a new behavior). In this paper, a price-performance-ratio inspired heuristic scheme, PPRank, is proposed, which investigates how to economically select seeds within a given budget while trying to maximize the diffusion process. Our paper’s contributions are threefold. First, we explicitly characterize each user with two distinct factors: the susceptibility of being influenced (SI) and the influential power (IP) representing the ability to actively influence others, and formulate users’ SIs and IPs according to their social relations; then, a convex price-demand curve-based model is utilized to properly convert each user’s SI into a persuasion cost (PC) representing the cost needed to successfully make the individual adopt a new behavior. Furthermore, a novel cost-effective selection scheme is proposed, which adopts both the price-performance ratio (PC-IP ratio) and the user’s IP as an integrated selection criterion and explicitly takes into account the overlapping effect. Finally, simulations using both artificially generated and real-trace network data illustrate that, under the same budgets, PPRank can achieve a larger diffusion range than other heuristic and brute-force greedy schemes that do not take users’ persuasion costs into account.

EXISTING SYSTEM:

  • Chen et al. have proposed several influence maximization algorithms in social networks. In particular, based on an independent cascade (IC) diffusion model, a heuristic algorithm called Degree Discount was proposed to alleviate the effect of overlapping, which intentionally discounts the degree of each node by removing the neighbors that are already in the seed set. The same authors extended the Degree Discount algorithm to make it fit the weighted cascade (WC) diffusion model.
  • PRDiscount was proposed to alleviate the “overlapping effect” existing in reverse PageRank-like schemes. Interestingly, a greedy-based algorithm and a PageRank-inspired heuristic have also been integrated, running the greedy algorithm on a small set of nodes consisting of the top nodes ranked by the PageRank algorithm on the social network.

DISADVANTAGES OF EXISTING SYSTEM:

  • Their running times are still long.
  • All the aforementioned works ignore one key aspect of influence propagation that we usually experience in real life: users have intrinsically different susceptibilities to being persuaded to adopt a specific behavior that the system designer advertises.

PROPOSED SYSTEM:

  • This paper proposes a new heuristic algorithm, PPRank, for economically selecting seeds to maximize influence. In detail, our main contributions are threefold.
  • First, we explicitly characterize each user with two distinct factors: susceptibility of being influenced (SI) and influential power (IP), and formulate users’ SIs and IPs according to their social relationships.
  • Second, we argue that each user’s SI is an implicit measurement of persuasion cost (PC): qualitatively, the lower a user’s SI, the higher the cost of persuading that user. Therefore, inspired by the properties of the price-demand function in economics, our paper properly converts an individual’s SI into a PC, and then a novel seed selection algorithm is proposed, which utilizes both the price-performance ratio (PC-IP ratio) and the IP as an integrated selection criterion, and explicitly takes into account the overlapping effect (a simplified selection loop is sketched after this list).
  • Finally, simulations using real social network data traces and artificially generated network graphs illustrate that, under the same budget constraints, our scheme, PPRank, can achieve better performance than other heuristic and greedy-based schemes in terms of maximal diffusion range.
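
The simplified loop below illustrates the cost-effective selection idea: within the remaining budget, repeatedly pick the node with the best IP-to-PC ratio and then discount the influential power of its neighbours to account for the overlapping effect. The flat 50% discount and the field names are illustrative assumptions, not PPRank’s exact formulation.

import java.util.*;

// Budget-constrained seed selection by price-performance (IP / PC) ratio.
public class SeedSelection {

    static class Node {
        final int id;
        double ip;        // influential power
        final double pc;  // persuasion cost derived from susceptibility
        final List<Integer> neighbours;
        Node(int id, double ip, double pc, List<Integer> neighbours) {
            this.id = id; this.ip = ip; this.pc = pc; this.neighbours = neighbours;
        }
    }

    public static List<Integer> select(Map<Integer, Node> nodes, double budget) {
        List<Integer> seeds = new ArrayList<>();
        Set<Integer> chosen = new HashSet<>();
        while (true) {
            Node best = null;
            for (Node n : nodes.values()) {
                if (chosen.contains(n.id) || n.pc > budget) continue;
                if (best == null || n.ip / n.pc > best.ip / best.pc) best = n;
            }
            if (best == null) break;               // nothing affordable remains
            seeds.add(best.id);
            chosen.add(best.id);
            budget -= best.pc;
            for (int nb : best.neighbours) {       // overlapping effect: discount neighbours
                Node neighbour = nodes.get(nb);
                if (neighbour != null && !chosen.contains(nb)) neighbour.ip *= 0.5;
            }
        }
        return seeds;
    }
}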

ADVANTAGES OF PROPOSED SYSTEM:

  • Our paper deeply investigates how to economically select seeds, within a specific marketing budget, so as to trigger a large cascade of further adoptions based on a contagion process.
  • In our paper, we utilize the WC diffusion model for the problem of influence maximization.
  • Unlike the aforementioned works, our paper investigates how to select the initial seeds from a cost-effectiveness viewpoint and designs a new heuristic scheme.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : JAVA/J2EE
  • Tool : Netbeans 7.2.1
  • Database : MYSQL

REFERENCE:

Yufeng Wang, Athanasios V. Vasilakos, Qun Jin, and Jianhua Ma, “PPRank: Economically Selecting Initial Users for Influence Maximization in Social Networks”, IEEE Systems Journal, 2017.

Transactional Behavior Verification in Business Process as a Service Configuration

ABSTRACT:

Business Process as a Service (BPaaS) is an emerging type of cloud service that offers configurable and executable business processes to clients over the Internet. As BPaaS is still in its early years of research, many open issues remain. Managing the configuration of BPaaS builds on areas such as software product lines and configurable business processes. The problem has concerns to consider from several perspectives, such as the different types of variable features, constraints between configuration options, and satisfying the requirements provided by the client. In our approach, we use temporal logic templates to elicit transactional requirements from clients that the configured service must adhere to. For formalizing constraints over configuration, feature models are used. To manage all these concerns during BPaaS configuration, we develop a structured process that applies formal methods while directing clients through specifying transactional requirements and selecting configurable features. Binary Decision Diagram (BDD) analysis is then used to verify that the selected configurable features do not violate any constraints. Finally, model checking is applied to verify the configured service against the transactional requirement set. We demonstrate the feasibility of our approach with several validation scenarios and performance evaluations.

EXISTING SYSTEM:

  • Existing approaches in managing business process configuration ensure domain constraints over configuration choices, while allowing basic client requirements such as selected features or control flow variations. One area that has yet to receive research attention is ensuring both domain constraints and client transactional requirements during BPaaS configuration.
  • These requirements can include conditions for acceptable process commit or abortion, required recovery operations for key activities, or valid forms of process compensation, and are difficult to verify in a cloud based scenario where multiple stakeholders are involved.
  • A configuration method that ensures complex requirements within a feasible runtime will be able to provide service clients with increased trust for outsourcing potentially sensitive business operations.

DISADVANTAGES OF EXISTING SYSTEM:

  • The problem has concerns to consider from several perspectives, such as the different types of variable features, constraints between configuration options, and satisfying the requirements provided by the client.

PROPOSED SYSTEM:

  • We propose a three-step configuration and verification process which relies on a modeling paradigm. Such a paradigm allows us to capture transactional requirements and subsequently verify them. Our approach is expressive and relatively easy for stakeholders to use, while at the same time being sufficiently rigorous to allow us to apply formal methods for verification.
  • We propose a BPaaS configuration process that applies formal methods to ensure that i) the configuration is valid with respect to provider domain constraints, and ii) the process satisfies transactional requirements drawn from the business rules of the client.
  • First, we provide an overview of the process which guides clients through BPaaS configuration, then we provide details on how Binary Decision Diagram (BDD) analysis and model checking are used at certain steps.
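
As a toy stand-in for the BDD-based validity check, the sketch below verifies that a client’s feature selection respects simple “requires” and “excludes” constraints from the provider’s feature model by direct enumeration. A real BPaaS configuration encodes the whole feature model as a BDD; this only illustrates what is being verified, and all feature names are hypothetical.

import java.util.*;

// Naive validity check over a feature selection with requires/excludes constraints.
public class FeatureSelectionChecker {

    public static boolean isValid(Set<String> selected,
                                  Map<String, String> requires,
                                  Map<String, String> excludes) {
        for (String feature : selected) {
            String needed = requires.get(feature);
            if (needed != null && !selected.contains(needed)) return false;
            String forbidden = excludes.get(feature);
            if (forbidden != null && selected.contains(forbidden)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> selection = new HashSet<>(Arrays.asList("onlinePayment", "manualApproval"));
        Map<String, String> requires = Collections.singletonMap("onlinePayment", "fraudCheck");
        Map<String, String> excludes = Collections.singletonMap("manualApproval", "autoApproval");
        System.out.println(isValid(selection, requires, excludes)); // false: fraudCheck missing
    }
}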

ADVANTAGES OF PROPOSED SYSTEM:

  • To the best of our knowledge, transactional requirements important to clients, such as those supported by our template set, are not yet supported by any business process configuration method, and this is one of the major contributions of this work compared to existing works.
  • This increases client trust that the service will behave in a manner consistent with internal business policies and requirements, without having to perform their own analysis of the service behavior.
  • Our BPaaS model enables configuration from numerous perspectives important to BPaaS clients, namely, activities, resources, and data objects.
  • Our configuration method aims to elicit and ensure complex transactional requirements from clients, by adapting the temporal logic template set.
  • It has the advantage of a reduced runtime when configuring services with many configuration options and values.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : .NET, C#.NET
  • Tool : Visual Studio 2008
  • Database : SQL SERVER 2005

REFERENCE:

Scott Bourne, Claudia Szabo, and Quan Z. Sheng, “Transactional Behavior Verification in Business Process as a Service Configuration”, IEEE Transactions on Services Computing, 2017.

Flexible Wildcard Searchable Encryption System

ABSTRACT:

Searchable encryption is an important technique for public cloud storage services to provide user data confidentiality protection and at the same time allow users to perform keyword search over their encrypted data. Previous schemes only deal with exact or fuzzy keyword search to correct some spelling errors. In this paper, we propose a new wildcard searchable encryption system to support wildcard keyword queries which has several highly desirable features. First, our system allows multiple keywords search in which any queried keyword may contain zero, one or two wildcards, and a wildcard may appear in any position of a keyword and represent any number of symbols. Second, it supports simultaneous search on multiple data owners’ data using only one trapdoor. Third, it provides flexible user authorization and revocation to effectively manage search and decryption privileges. Fourth, it is constructed based on homomorphic encryption rather than Bloom filter and hence completely eliminates the false probability caused by Bloom filter. Finally, it achieves a high level of privacy protection since matching results are unknown to the cloud server in the test phase. The proposed system is thoroughly analyzed and is proved secure. Extensive experimental results indicate that our system is efficient compared with other existing wildcard searchable encryption schemes in the public key setting.

EXISTING SYSTEM:

  • An existing wildcard searchable encryption scheme is based on the Bloom filter, in which each keyword has its own Bloom filter. The storage overhead grows with the number of keywords extracted from the document. The disadvantage of the scheme is that one wildcard can only represent one character.
  • For instance, if a user desires to search all keywords that begin with “acid”, he has to submit the trapdoors for the wildcard keywords “acid??”, “acid???” and “acid?????????” respectively so that the keywords “acidic”, “acidity” and “acidification” can be matched.
  • To overcome this problem, Hu et al. introduced an improved scheme such that one wildcard can represent any number of characters.
  • Hu’s scheme is constructed based on Suga’s scheme but utilizes a different method to insert a keyword into Bloom filter.

DISADVANTAGES OF EXISTING SYSTEM:

  • The limitation of fuzzy searchable encryption scheme is that only small edit distance errors, such as spelling errors, can be corrected. It is almost useless if the query keyword has a large edit distance from the exact keyword.
  • A serious drawback of Bloom filter based searchable encryption schemes is the inevitability of false probability.

PROPOSED SYSTEM:

  • We propose a flexible wildcard searchable encryption scheme supporting multiple users. It is constructed in the public key setting without relying on Bloom filter, is efficient, and achieves a high security level. Additionally, when any suspicious action is detected, data owners can dynamically update the verification data stored on the cloud server.
  • Our system is the first wildcard SE which allows a data user to use one trapdoor to simultaneously search over multiple data owners’ files. For example, a medical doctor can issue one wildcard keyword query to simultaneously search over multiple patients’ encrypted EHRs.
  • Moreover, in the search algorithm, the user can use multiple keywords to generate one trapdoor. These query keywords may contain zero, one or two wildcards. The user can issue “AND” or “OR” queries on these keywords, and the top-k documents that have the highest relevance scores are returned to the user (the wildcard query semantics are illustrated after this list).
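
The sketch below illustrates only the plaintext semantics of such wildcard queries, where a wildcard may stand for any number of symbols at any position, by translating the pattern to a regular expression. In the actual scheme the matching is performed over encrypted indexes and trapdoors, so this is purely an illustration of which keywords a query is intended to cover.

import java.util.regex.Pattern;

// Plaintext illustration of wildcard keyword semantics: '*' matches any number of symbols.
public class WildcardQuery {

    public static boolean matches(String pattern, String keyword) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), keyword);
    }

    public static void main(String[] args) {
        // One query covers "acidic", "acidity" and "acidification".
        for (String kw : new String[]{"acidic", "acidity", "acidification", "base"}) {
            System.out.println(kw + " -> " + matches("acid*", kw));
        }
    }
}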

ADVANTAGES OF PROPOSED SYSTEM:

  • No false probability.
  • Flexible user authorization and revocation.
  • Flexible search function.
  • Flexible wildcard representation.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : .NET, C#.NET
  • Tool : Visual Studio 2008
  • Database : SQL SERVER 2005

REFERENCE:

Yang Yang, Ximeng Liu, Robert H. Deng, and Jian Weng, “Flexible Wildcard Searchable Encryption System”, IEEE Transactions on Services Computing, 2017.

A Novel Data Hiding Algorithm for High Dynamic Range Images

ABSTRACT:

In this paper, we propose a novel data hiding algorithm for high dynamic range (HDR) images encoded by the OpenEXR file format. The proposed algorithm exploits each of three 10-bit mantissa fields as an embedding unit in order to conceal k bits of a secret message using an optimal base which produces the least pixel variation. An aggressive bit encoding and decomposition scheme is recommended, which offers a high probability to convey (k+1) bits without increasing pixel variation caused by message concealment. In addition, we present a bit inversion embedding strategy to further increase the capacities when the probability of appearance of secret bit “1” is greater than 0.5. Furthermore, we introduce an adaptive data hiding approach for concealing more secret messages in pixels with low luminance, exploiting the features of the human visual system to achieve luminance-aware adaptive data hiding. The stego HDR images produced by our algorithm coincide with the high dynamic range image file format, causing no suspicion from malicious eavesdroppers. The generated stego HDR images and their tone-mapped low dynamic range (LDR) images reveal no perceptual differences when subjected to quantitative testing by Visual Difference Predictor. Our algorithm can resist steganalytic attacks from the HDR and LDR RS and SPAM steganalyzers. We present the first data hiding algorithm for OpenEXR HDR images offering a high embedding rate and producing high visual quality of the stego images. Our algorithm outperforms the current state-of-the-art works.

EXISTING SYSTEM:

  • In the existing system, the authors proposed a distortion-free data embedding scheme for HDR images. Their scheme takes advantage of the Cartesian product of all of the HDR pixels, thus exploiting all of the homogeneous representations.
  • Their method provides an average embedding rate of 0.1355 bpp. They also introduced a new homogeneity index table for homogeneity values of N = 3, 5, 6, 7, which efficiently exploits all homogeneous representations of each pixel.
  • Their scheme offers an average embedding rate of 0.1445 bpp.

DISADVANTAGES OF EXISTING SYSTEM:

There are three drawbacks in the existing data hiding algorithms for HDR images.

  • First, while most algorithms target the 32-bit radiance RGBE or 48-bit TIFF format, none of them is developed for the OpenEXR format.
  • Second, while previously reported works constantly increase the embedding capacity, a stego HDR image generated by these algorithms does not preserve the radiance RGBE encoding format, thus becoming perceptible to eavesdroppers and vulnerable to steganalytic attack.
  • Third, most algorithms do not consider how to minimize the pixel distortion incurred from message concealment, thus producing a tone-mapped stego image with only moderate image quality.

PROPOSED SYSTEM:

  • This paper presents a novel data hiding algorithm using optimal base, abbreviated as DHOB, which employs an optimal base to conceal a serial secret bit stream with the least distortion in a high dynamic range image encoded by the 48-bit OpenEXR file format. This type of HDR image consists of three 16-bit floating-point values in the red, green and blue channels, all of them being “half” data types with a 1-bit sign, 5-bit exponent and 10-bit mantissa field (a generic base-b mantissa embedding is sketched after this list).
  • Considering the variety of luminance levels in an HDR image, we propose an adaptive data hiding scheme using optimal base, abbreviated as ADHOB, which supports luminance-aware message embedding, where more secret messages are carried by pixels with low luminance, and vice versa. This scheme exploits a feature of the human visual system, since human beings are less sensitive to luminance variation when a pixel has low luminance.
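
As a generic illustration of base-b embedding in a 10-bit mantissa, the sketch below replaces the mantissa with the closest value whose remainder modulo the base equals the secret digit, keeping pixel variation small. DHOB additionally selects the base that minimises distortion for each embedding unit; that optimal-base selection is omitted here, and the method names are hypothetical.

// Generic base-b embedding into a 10-bit mantissa field (values 0..1023).
public class MantissaEmbedding {

    /** Embeds secretDigit (0 <= secretDigit < base) into a 10-bit mantissa. */
    public static int embed(int mantissa, int base, int secretDigit) {
        int best = -1;
        for (int candidate = 0; candidate < 1024; candidate++) {     // 10-bit range
            if (candidate % base == secretDigit
                    && (best < 0 || Math.abs(candidate - mantissa) < Math.abs(best - mantissa))) {
                best = candidate;
            }
        }
        return best;
    }

    /** Recovers the secret digit from a stego mantissa. */
    public static int extract(int stegoMantissa, int base) {
        return stegoMantissa % base;
    }

    public static void main(String[] args) {
        int stego = embed(517, 8, 3);                 // hide digit 3 in base 8
        System.out.println(stego + " carries " + extract(stego, 8)); // 515 carries 3
    }
}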

ADVANTAGES OF PROPOSED SYSTEM:

  • The proposed algorithm takes advantage of 10-bit mantissa fields to convey secret messages, while leaving intact the sign and exponent fields.
  • The proposed algorithm works on HDR images encoded in the OpenEXR format and is capable of providing a variety of capacities and producing high-quality stego images feasible for real applications.

SYSTEM ARCHITECTURE:

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : .NET, C#.NET
  • Tool : Visual Studio 2008
  • Database : SQL SERVER 2005

REFERENCE:

Yun-Te Lin, Chung-Ming Wang, Wei-Sung Chen, Fang-Pang Lin, and Woei Lin, “A Novel Data Hiding Algorithm for High Dynamic Range Images”, IEEE Transactions on Multimedia, 2017.

SPFM: Scalable and Privacy-preserving Friend Matching in Mobile Cloud

ABSTRACT:

Profile (e.g., contact list, interest, mobility) matching is important for fostering the wide use of mobile social networks. Social networks such as Facebook, Line or WeChat recommend friends to their users based on personal data such as common contact lists or mobility traces. However, outsourcing users’ personal information to the cloud for friend matching raises a serious privacy concern due to the potential risk of data abuse. In this study, we propose a novel Scalable and Privacy-preserving Friend Matching protocol, or SPFM in short, which aims to provide a scalable friend matching and recommendation solution without revealing users’ personal data to the cloud. Different from previous works, which involve multiple rounds of protocols, SPFM presents a scalable solution which can prevent an honest-but-curious mobile cloud from obtaining the original data and supports the friend matching of multiple users simultaneously. We give a detailed feasibility and security analysis of SPFM, and its accuracy and security have been well demonstrated via extensive simulations. The results show that our scheme works even better when the original data is large.

EXISTING SYSTEM:

  • Existing mobile social network systems pay little heed to the privacy concerns associated with friend matching and recommendation based on users’ personal information. For example, Facebook provides the feature People You May Know, which recommends friends based on education information, the contact lists obtained from users’ smartphones, and other personal information.
  • Li et al. apply additive homomorphic encryption for privacy preservation in a scenario with many intermediate computing parties.
  • Narayanan et al. and Dong et al. compute social proximity to discover potential friends by leveraging both homomorphic cryptography and obfuscation, which is more efficient.

DISADVANTAGES OF EXISTING SYSTEM:

  • Outsourcing users’ personal information to the cloud for friend matching raises a serious privacy concern.
  • Existing research shows that loss of privacy can expose users to unwanted advertisements and spams/scams, cause social reputation or economic damage, and make them victims of blackmail or even physical violence.
  • The existing works may fail to work in practice for two reasons. First, the best practice in industry for friend recommendation is a multiple-user matching problem rather than a two-party matching problem, and pre-shared parameters between users are more likely to leak. Second, most of the existing works involve multiple rounds of protocols, which suffer from a serious performance challenge.

PROPOSED SYSTEM:

  • In this study, we propose a novel Scalable and Privacy-preserving Friend Matching protocol, or SPFM in short, which aims to provide a scalable friend matching and recommendation solution without revealing users’ personal data to the cloud.
  • Our basic motivation is that each user obfuscates every bit of the original personal data (e.g., contact list) before uploading, by performing XOR operations with a masking sequence which is generated with a certain probability (a minimal sketch follows this list).
  • We propose a Scalable and Privacy-preserving Friend Matching scheme (SPFM) to prevent privacy leakage in friend matching and recommendation system.
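
A minimal sketch of the bit-level obfuscation idea follows: every bit of the personal data vector (e.g., a contact-list bitmap) is XOR-ed with a masking bit drawn with probability p, and the cloud can compare obfuscated vectors only statistically. The probability p and the agreement measure are assumptions for illustration, not the exact SPFM construction.

import java.util.Random;

// Bit-level obfuscation by XOR with a Bernoulli(p) masking sequence.
public class BitMasking {

    public static boolean[] obfuscate(boolean[] data, double p, Random rng) {
        boolean[] masked = new boolean[data.length];
        for (int i = 0; i < data.length; i++) {
            boolean maskBit = rng.nextDouble() < p;   // masking sequence bit
            masked[i] = data[i] ^ maskBit;            // XOR obfuscation
        }
        return masked;
    }

    /** Fraction of positions on which two obfuscated vectors agree. */
    public static double agreement(boolean[] a, boolean[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }
}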

ADVANTAGES OF PROPOSED SYSTEM:

  • Our design can ensure that the same data maintain a statistical similarity after obfuscation, while different data can be statistically classified, without leaking the original data.
  • We provide a detailed feasibility and security analysis, as well as a discussion of correctness, the True-Negative rate and the True-Positive rate.
  • Extensive evaluations have been performed on SPFM to demonstrate its feasibility and security. The results show that our scheme works even better when the original data is large.

SYSTEM ARCHITECTURE: 

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating system : Windows 7.
  • Coding Language : .NET, C#.NET
  • Tool : Visual Studio 2008
  • Database : SQL SERVER 2005

REFERENCE:

Mengyuan Li, Ruan Na, QiYang Qian, Haojin Zhu, Xiaohui Liang, and Le Yu, “SPFM: Scalable and Privacy-preserving Friend Matching in Mobile Cloud”, IEEE Internet of Things Journal, 2017.