• Volume 33,Issue 6,2022 Table of Contents
    Select All
    Display Type: |
    • >Special Issue's Articles
    • Counterexample-guided Spatial Flow Model Checking Methods for C Codes

      2022, 33(6):1961-1977. DOI: 10.13328/j.cnki.jos.006563 CSTR:

      Abstract (1380) HTML (2374) PDF 1.90 M (3554) Comment (0) Favorites

      Abstract:Software verification has always been a hot research topic to ensure the correctness and security of software. However, due to the complex semantics and syntax of programming language, applying formal methods to verify the correctness of programs has the problems of low accuracy and low efficiency. Among them, the state change of address space caused by pointer operations makes the verification accuracy of existing model checking methods difficult to be guaranteed. By combining model checking and sparse value-flow analysis, this study designs a spatial flow model to effectively describe the state behavior of C codes at the symbolic variable level and address space level, and proposes a model checking algorithm of counterexample-guided abstraction refinement and sparse value flow strong update (CEGAS), which enables points-to-sensitive formal verification for C codes. This study establishes a C code benchmark containing a variety of pointer operationsand conducts comparative experiments based on this benchmark. These experiments indicate that in the task of analyzing multi-class C code features, the model checking algorithm CEGAS proposed in this study can achieve outstanding results compared with the existing model detection tools. The verification accuracy of CEGAS is 92.9%, and the average verification time of each line of code is 2.58 ms, which are both better than existing testing tools.

    • TaintPoint: Fuzzing Taint Flow Efficiently with Live Trace

      2022, 33(6):1978-1995. DOI: 10.13328/j.cnki.jos.006564 CSTR:

      Abstract (1829) HTML (2883) PDF 1.85 M (3742) Comment (0) Favorites

      Abstract:Coverage-guided fuzzing has become one of the most effective ways of vulnerability detection. The widely used edge coverage is a kind of control flow information. However this feedback information is too coarse when detecting taint-style vulnerabilities. A large number of taint-independent seeds are added to the queue, and the number of taint-related seeds converges prematurely, which leads to the loss of evolutionary direction of fuzzing and unable to efficiently test the information flow between source and sink. Firstly, the reasons why the existing feedback mechanism is not efficient enough in detecting taint style vulnerabilities are analyzed. Secondly, TaintPoint, a fuzzer dedicated to taint style vulnerability detection, is proposed. TaintPoint adds live taint as data flow information on the basis of control flow traces to form the live trace as coverage feedback, and the live trace is used to improve the existing method in the instrumentation, seed filtering, selection, and mutation stages respectively. Experimental results on UAFBench show that the efficiency, output, and speed of TaintPoint in detecting taint-style vulnerabilities surpass the industry-leading general-purpose fuzzer AFL++ and directed fuzzer AFLGO. In addition, four vulnerabilities arefound and confirmed on two open source projects.

    • Compiler Fuzzing Test Case Generation with Feed-forward Neural Network

      2022, 33(6):1996-2011. DOI: 10.13328/j.cnki.jos.006565 CSTR:

      Abstract (1849) HTML (2497) PDF 1.69 M (3783) Comment (0) Favorites

      Abstract:Compiler fuzzing is one of the commonly used techniques to test the functionality and safety of compilers. The fuzzer produces grammatically correct test cases to test the deep parts of the compiler. Recently, recurrent neural networks-based deep learning methods have been introduced to the test case generation process. Aiming at the problems of insufficient grammatical accuracy and low generation efficiency when generating test cases, a method for generating compiler fuzzing test cases is proposed based on feed-forward neural networks, and the prototype tool FAIR is designed and implemented. Different from the method based on token sequence learning, FAIR extracts code fragments from the abstract syntax tree, and uses a self-attention-based feed-forward neural network to capture the grammatical associations between code fragments. After learning a generative model of the programming language, fair automatically produce diverse test cases. Experimental results show that FAIR is superior to its competitors in terms of grammatical accuracy and generation efficiency of generating test cases. The proposed method has significantly improved the ability to detect compiler software defects, and has successfully detected 20 software defects in GCC and LLVM. In addition, the method has soundportability. The simple ported FAIR-JS has detected 2 defects in the JavaScript engine.

    • Memory Optimization System for SGXv2 Trusted Execution Environment

      2022, 33(6):2012-2029. DOI: 10.13328/j.cnki.jos.006566 CSTR:

      Abstract (1676) HTML (3300) PDF 1.76 M (4201) Comment (0) Favorites

      Abstract:Trusted execution environment (TEE) is an architectural solution for secure computing that requires confidentiality and integrity for private data and code. In recent years, TEE has become the research hotspot for machine learning privacy protection, encrypted database, blockchain security, etc. This study addresses the performance problem of the system under this new trusted hardware. The performance of the new trusted hardware, i.e., Intel SGX2, is analyzed. It is found that the paging overhead in SGX1 is no longer the main issue in SGX2 under the premise of configuring large secure memory. However, the setup of large secure memory leads to two new problems. First, the available range of normal memory is narrowed down, which increases the memory pressure of normal applications, especially big data applications. Second, secure memory is usually underutilized, resulting in low overall physical memory utilization. To solve the above issues, this study proposes a new lightweight code migration approach, which dynamically migrates the code of normal applications into secure memory, while leaving the data in place. The migrated code can use secure memory and avoid the drastic performance degradation caused by disk paging. Experimental results show that the proposed approach can reduce the runtime overhead of normal applications by 73.2% to 98.7% without affecting the isolation and the use of secure applications.

    • Refcount Field Identification for Linux Kernel Based on Deep Learning

      2022, 33(6):2030-2046. DOI: 10.13328/j.cnki.jos.006567 CSTR:

      Abstract (1464) HTML (2535) PDF 1.60 M (3486) Comment (0) Favorites

      Abstract:Reference counting (refcount) is a common memory management technique in modern software. Refcount errors can often lead to severe memory errors such as memory leak, use-after-free, etc. Many efforts to harden refcount security rely on known refcount fields as their input. However, due to the complexity of software code, identifying refcount fields in source code is very challenging. Traditional methods of identifying refcount fields are mainly based on code pattern matching and have great limitations such as requiring expert experience to summarize patterns, which is a laborious job. Besides, the manually-summarized patterns do not cover all cases, resulting in a low recall. To address these issues, this studyproposes to characterize a field based on the field name and the code behaviour associated with the field; and designs a multimodal deep learning based approach. The study implements a prototype of the new approach for Linux kernel code. In the evaluation, the precision and recall achieved by the prototype system are 96.98% and 93.54%. In contrast, the traditional code-pattern-based identification method did not report any refcount fields on the testing set. In addition, sixty-one refcount fields are identified which are implemented with insecure data types in the latest Linux kernel. Until now, twenty-one of them are reported to the Linux community, of which six have been confirmed.

    • Repeating Toast Drawing Based Password Inference Attack Technique

      2022, 33(6):2047-2060. DOI: 10.13328/j.cnki.jos.006568 CSTR:

      Abstract (1313) HTML (2552) PDF 1.37 M (3064) Comment (0) Favorites

      Abstract:The mobile platform is rapidly emerging as one of the dominant computing paradigms of the last decades. However, there are also security issues that can work against mobile platforms. Being the first line of defense of various cyber attacks against mobiles, password protection serves an import role in protecting users' sensitive data. The offensive and defensive techniques related to passwords, therefore, gained a lot of attention. This work systematically studied the design flaws existing in the Android Toast mechanism and discovered a new type of vulnerability leveraging on Toast fade-in and fade-out animation, where malware can create a strategy of continuously displaying keyboard-like Toast views to capture the user's inputs stealthily, thereby stealing the user's password. The attackhas implemented, and extensive user experiments are performed to demonstrate its effectiveness, accuracy, and stealthiness. The results show that when the password length is 8, the attack success rate can reach up to 89%. It has also confirmed that the latest Android system has patched this vulnerability.

    • Study of State-of-the-art Open-source C/C++ Static Analysis Tools

      2022, 33(6):2061-2081. DOI: 10.13328/j.cnki.jos.006569 CSTR:

      Abstract (2468) HTML (3693) PDF 1.97 M (4790) Comment (0) Favorites

      Abstract:Static software defect detection is an active research topic in the domain of software engineering and software security. Along with the increase of software complexity and size, static software defect detection has been applied in both industry and academy to take the benefit of finding defects in C/C++ programs without execution. A large amount of static analysis tools (SATs) for C/C++ have been developed in recent years, and have played an important role in automatically finding defects in various kinds of C/C++ software projects. In spite of this, developers are still having less confidence on SATs mainly due to the high false positive rate that has been an unsolved problem for a long time. This research dives deep into state-of-the-art static analysis tools for C/C++ and figures out why false positives are raised through the approach of running them on Juliet Test Suite and 37 open-source real-world software projects. With insight of the design and implementation details of the selected open-source SATs, the exact reasons of which result in the high false positive rateare found. Moreover, the effort is also made to trace the tendency of development and the future of state-of-the-art open-source C/C++ SATs.

    • Exploit-oriented Automated Information Leakage

      2022, 33(6):2082-2096. DOI: 10.13328/j.cnki.jos.006570 CSTR:

      Abstract (1360) HTML (2227) PDF 1.47 M (3461) Comment (0) Favorites

      Abstract:Automated exploit generation (AEG) has become one of the most important ways to demonstrate the exploitability of vulnerabilities. However, state-of-the-art AEG solutions in general assume that the target system has no mitigations deployed, which is not true in modern operating systems since they often deploy mitigations like data execution prevention (DEP) and address space layout randomization (ASLR). This paper presents an automated solution EoLeak, able to exploit heap vulnerabilities to leak sensitive data and bypass ASLR and DEP at the same time. At a high level, EoLeak analyzes the program execution trace of the POC input that triggers the heap vulnerability, characterizes the memory profile from the trace and locates important data (e.g., code pointers), constructs leak primitives that discloses sensitive data, and generates exploits for the entire process when possible. A prototype of EoLeakis implemented and it is evaluated on a set of CTF binary programs and several real-world applications. Evaluation results show that EoLeak is effective in terms of leaking data and generating exploits.

    • Cross-project Defect Prediction Method Using Adversarial Learning

      2022, 33(6):2097-2112. DOI: 10.13328/j.cnki.jos.006571 CSTR:

      Abstract (1948) HTML (2652) PDF 1.57 M (3982) Comment (0) Favorites

      Abstract:Cross-project defect prediction (CPDP) has become an important research direction in data mining of software engineering, which uses the defective codes of other projects to build prediction models and solves the problem of insufficient data in the process of model construction. Nevertheless, there is difference in data distribution between the code files of source and target projects, which leads to poor cross-project prediction results. Based on the adversarial learning idea of generative adversarial network (GAN), under the action of discriminator, the distribution of target project features can be changed to make it close to the distribution of source project features, so as to improve the performance of cross-project defect prediction. Specifically, the process of the proposed abstract continuous GAN (AC-GAN) method consists of two stages:Data processing and model construction. First, the source and target project codes are converted into the form of abstract syntax trees (ASTs), and then the ASTs are traversed in a depth-first manner to derive the token sequences. The continuous bag-of-words model (CBOW) is used to generate word vectors, and the token sequences are transformed into numeric vectors based on the word vector table. Second, the processed numeric vectors are fed into a GAN structure-based model for feature extraction and data migration. Finally, a binary classifier is used to determine whether the target project code files are defective or not. The AC-GAN method conducted comparison experiments on 15 sets of source-target project pairs, and the experimental results demonstrate the effectiveness of this method.

    • Mechanized Verification of Efficient Iterative Data-flow Algorithm

      2022, 33(6):2115-2126. DOI: 10.13328/j.cnki.jos.006573 CSTR:

      Abstract (845) HTML (2277) PDF 1.13 M (2937) Comment (0) Favorites

      Abstract:The common way of performing data-flow analysis of programs is by solving the data-flow equations using an iterative algorithm. Finding dominators, thus identifying natural loops, is central to numerous modern compiler optimization. Basically, mechanized verification of an efficient algorithm of computing dominators is intergral part of a verified compiler. In order to prove an efficient algorithm of computing strict dominators formally, first, a semilattice structure whose domain is a set of all descending lists in which nodes in a control flow graph are represented by its reverse post order (rPO) number is constructed and proved to be a semilattice whose ordering satisfies the ascending chain condition. Then, using the semilattice structure, a worklist-based Kildall's algorithm that computes strict dominators is implemented. Next, a specification of dominators on a control flow graph is defined; key properties of the specification and the iterative algorithm are established, and the correctness and completeness of the algorithm are proved with respect to the definitional specification. Finally, the work is summarized and future research is presented. The whole development is carried out in Isabelle/HOL.

    • Semantics under Step-indexed Model and Formalization

      2022, 33(6):2127-2149. DOI: 10.13328/j.cnki.jos.006574 CSTR:

      Abstract (863) HTML (2305) PDF 2.02 M (2821) Comment (0) Favorites

      Abstract:Hoare logic is the logic base of computer programming. It is used to describe verification of general programs. Separation logic as an extension of Hoare logic, provides supports for high order features used in modern programming languages. Step-indexed model is used to define self-referential predicates. Step-indexed logic is widely used in various program verification tools based on interactive theorem prover, but the reasoning based on step index logic is more complex and complicated than that based on classical logic. On step-indexed model, it is also able to define the non-step-indexed semantics under classical logic system which is more concise and clearer, and independent of the number of steps. Aiming at studying the relationship between stepping index logic and non-stepping index logic, it is found that the two logics are not equivalent. This study summarizes the propositions involved in practical program verification, finds out their common characteristics, and gives the constraint conditions of assertions about program states. The semantics of assertions in step-indexed logic and non-step-indexed logic are defined respectively, and the equivalence of the two semantics is proved under the constraint conditions. All the above definitions and proofs are formalized in Coq. Finally, the future research directions are discussed preliminarily.

    • Coq Formalization of Propulsion Subsystem of Flight Control System for Multicopter

      2022, 33(6):2150-2171. DOI: 10.13328/j.cnki.jos.006575 CSTR:

      Abstract (1037) HTML (2183) PDF 2.14 M (2966) Comment (0) Favorites

      Abstract:A highly reliable flight control system (FCS) is a necessary prerequisite for the reliable operation of an aircraft. Under the traditional development approach, the domain knowledge is first modeled by the human in the form of natural language, and then code is written by humans according to the model, and finally, the software defects are eliminated by using testing technology. The approach fails to build reliable FCS, because of human error, natural language ambiguity, and incompleteness of test techniques. A novel development approach based on formal verification technology could improve the reliability of FCS from many aspects. This paper presents a formal design and verification method for multicopter propulsion subsystem based on Coq and generated a usable and highly reliable functional software library. The main work of this study includes:the domain knowledge is organized into a hierarchical document suitable for formal verification, the basic functions, and composite functions are separated, and the concept of the simplest form of function (SFF) is proposed; formalize the system in Coq according to this document, defining constants, variables, basic functions, composite functions, SFF, axioms, etc.; the correctness of the derivation of all kinds of composite functions is expressed as lemmas and be proved; the algorithm for solving practical problems such as the longest hover time of multicopter is given; and a functional software library is generated using OCaml language by COQ program extraction ability. In the future, more subsystems of FCS will be developed based on formal verification, and finally, a complete and highly reliable FCS with formal verification will be established.

    • Implementation of Theorem Prover for PPTL with Indexed Expressions

      2022, 33(6):2172-2188. DOI: 10.13328/j.cnki.jos.006576 CSTR:

      Abstract (906) HTML (2201) PDF 1.74 M (2836) Comment (0) Favorites

      Abstract:Theorem proving is a mainstream formal verification method, with a strong ability of abstraction and logical expression. It does not suffer from state space explosion and can be used to verify finite and infinite systems. Nevertheless, it cannot be fully automated and requires users to have deep mathematical knowledge. Propositional projection temporal logic with indexed expressions is a temporal logic with full regular expressiveness and subsumes LTL, having strong modeling and property describing ability. At present, a sound and complete axiom system for PPTL with indexed expressions is presented while the theorem proving based on it is not yet well supported by tools, which leads to the low automaticity, redundancy, and fallibility of theorem proving. Therefore, firstly, the implementation framework of the theorem prover for PPTL with indexed expressions is designed, including two parts, the formalization of the PPTL axiom system and interactive theorem proving. Then the formulas, axioms, and inference rules are formally defined in Coq, implementing the axiom system of the framework. Finally, the availability of the theorem prover is proved by the interactive proving of two proof examples.

    • Refinement-based Verification of Memory Isolation Mechanism for Trusted Execution Environment

      2022, 33(6):2189-2207. DOI: 10.13328/j.cnki.jos.006577 CSTR:

      Abstract (1358) HTML (2673) PDF 1.70 M (3298) Comment (0) Favorites

      Abstract:A trusted execution environment (TEE) provides security-sensitive applications with an isolated execution environment based on hardware isolation mechanisms to protect sensitive data. The memory isolation mechanism is one of the key mechanisms of TEE, which is used to isolate the secure memory from the non-secure memory and to control the access to the secure memory. If the security of the memory isolation mechanism can not be guaranteed, sensitive data stored in the secure memory may be leaked. To verify the security of the TEE memory isolation mechanism, a refinement-based security verification method of the memory isolation mechanism is propose for ARM TrustZone-based TEE. An abstract model and a concrete model are established, a refinement relation between these two models is defined, and the information flow security of the concrete model is verified on the premise of proving that the refinement relation is held and the abstract model satisfies the information flow security properties. The proposed concrete model consists of key hardware and software of the TEE memory isolation mechanism, including TrustZone address space controller, MMU, and TrustZone monitor, and using the theorem prover Isabelle/HOL, it is verified that the concrete model satisfies the information flow security properties, such as noninterference, nonleakage, and noninfluence.

    • Formalization of C.T.Yang’s Theorem in Coq

      2022, 33(6):2208-2223. DOI: 10.13328/j.cnki.jos.006578 CSTR:

      Abstract (1166) HTML (2449) PDF 1.58 M (3236) Comment (0) Favorites

      Abstract:It was a wish for Academician Wu Wen-tsun to mechanically prove a class theorem in topology. The C.T.Yang's Theorem includes many basic concepts in general topology, which has great significance for understanding essential content of topological space. The C.T.Yang's Theorem is an important theorem in general topology, which states that in any topological space, if the derived set of a singleton set is closed, then the derived set of any subset is also closed. Based on the interactive theorem prover Coq, this paper presents a formalization of the basic concepts in general topology from mechanized axiomatic set theory, including open sets, closed sets, neighborhoods, condensation point, derived sets, and gives a formal verification of the corresponding properties. Furthermore, a formal framework of topological space is proposed and the formal proof of C.T.Yang's Theorem is realized in general topology. The proof code of all the theorems is given without exception, the formalization process has been verified, which reflects that the formal proof of mathematics theorem has the characteristics of readability and interactivity in Coq. The proof process is standardized, rigorous, and reliable, and the formal proof of C.T.Yang's Theorem is a profound embodiment of general topology formalization.

    • Coq-based Matrix Code Generation Technology

      2022, 33(6):2224-2245. DOI: 10.13328/j.cnki.jos.006579 CSTR:

      Abstract (1006) HTML (2300) PDF 1.95 M (3180) Comment (0) Favorites

      Abstract:Matrix programs are taking increasing important role in the intelligent systems. As the complexity of matrix applications grows, the difficulty of producing correct matrix code does the same. Parallel hardware can greatly increase the speed of matrix operations; nevertheless, using parallel hardware for programming to achieve parallel operations requires programmers to describe functions in the program and to manage how to use hardware resources to deliver results. These programs are usually written in imperative languages that are difficult to reason about and refactor for different parallelization strategies. A matrix expression code generation technology has been implemented from high-level matrix operators to C code in Coq, which can convert functional matrix code with execution strategies into efficient low-level imperative code. In the future, the formal verification of the matrix will be integrated with the automatic generation of matrix code, and formal verification of the matrix code conversion process will be performed to ensure the reliability of the generated matrix code, laying the foundationfor the development of a high-reliability deep learning compiler based on the matrix formal method.

    • Formalization of Collision Detection Method for Robots

      2022, 33(6):2246-2263. DOI: 10.13328/j.cnki.jos.006580 CSTR:

      Abstract (1041) HTML (3157) PDF 1.67 M (3692) Comment (0) Favorites

      Abstract:In order to cope with the demands of more complex tasks, the development of modern robotics industry becomes rapidly. Considering the flexibility, compliance, and intelligence required for coordinated work, multi-arm/multi-robots give full play to the powerful role of robots and become an important research hotspot in the modern robotics industry. In the coordinated operation of the two arms of the robot, collisions between the robot arms and external obstacles are prone to occur, which may cause property damage and even casualties. In this study, a formal verification of the robot collision detection method is carried out. Based on the formal model of sphere and capsule, the shortest distance model of the robot geometric units and the robotic collision model are established. Meanwhile, its related attributes and the collision conditions have been formally verified. Based on the above content, the basic theorem library of robot collision detection has been successfully established, which provides technical support and method reference for further realizing the reliability and stability verification of the collision detection algorithm of the multi-machine system.

    • Tool for Verifying Cloud Block Storage Based on Separation Logic

      2022, 33(6):2264-2287. DOI: 10.13328/j.cnki.jos.006581 CSTR:

      Abstract (967) HTML (2443) PDF 2.23 M (3017) Comment (0) Favorites

      Abstract:Cloud storage is now widely used in production and people's life. Verifying the correctness of hypervisors in cloud storage can effectively improve the reliability of the whole system. Cloud block storage (CBS) has the closest storage architecture to the bottom layer. In this study, a tool is implemented for analyzing and verifying the correctness of hypervisors in CBS, by using the interactive theorem prover Coq. Based on separation logic, the implementation of the proof system in the tool mainly consists:First, a modeling language is defined to abstract the CBS into a two-tier structure, and to formally represent the CBS state and the hypervisor; second, several predicates are defined to describe the state properties of the CBS, and the logical relationships between predicates are illustrated; finally, a separation logic triple for CBS is defined to describe the behavior of a program, and the reasoning rules required to verify the triples are stated. In addition, several proof examples are introduced in this study, to present the tool's ability to represent and reason about the real hypervisor in CBS.

    • >Review Articles
    • Systematic Literature Review of Duplicated Bug Report Detection Methods

      2022, 33(6):2288-2311. DOI: 10.13328/j.cnki.jos.006367 CSTR:

      Abstract (2909) HTML (3106) PDF 2.19 M (5624) Comment (0) Favorites

      Abstract:Software bugs are inevitable in the process of software development and maintenance. Software bug reports are an important bug description documents in the software maintenance process. A high-quality software bug report can effectively improve the efficiency of software bug repair. Nevertheless, due to the existence of many developers, testers, and users interact with the bug tracking system and submit bug reports, the same bug may be reported by different parties, resulting in a large number of duplicate software bug reports. Duplicate software bug reports will inevitably increase the workload of manual detection of duplicate bug reports, cause waste of manpower and material resources, and reduce the efficiency of bug repair. This study systematically analyzes the research work of worldwide scholars in the field of duplicated detection of bug reports in recent years by means of literature research. It mainly analyzes and summarizes the research methods, data set selection, performance evaluation, etc, and puts forward the problems and challenges in the follow-up work in this field, and the correspondent's suggestions.

    • Verification Method of Hierarchical for Safety-critical Memory Management Systems

      2022, 33(6):2312-2330. DOI: 10.13328/j.cnki.jos.006639 CSTR:

      Abstract (927) HTML (1442) PDF 2.04 M (2297) Comment (0) Favorites

      Abstract:The failure of a safety-critical system can cause serious consequences, and it is very important to ensure its correctness. The space embedded operating system is a typical safety-critical system. In the design of its memory management, it must ensure its efficient allocation and deallocation, and the occupancy of system resources is minimizedat the same time. In the traditional software development process, centralized testing and verification are usually carried out after the entire software development is completed, which will inevitably cause uncertaindevelopment. Therefore, this study combines the formal verification method with the three-tier development framework of "demand-design-implementation" in the field of software engineering, and ensures the consistency of each level through the method of layered transfer verification. First, starting from the demand analysis of the demand level, the idea of formal proof is introduced to prove the correctness of the logic of the demand level, which can better guide the design of the program. Second, verification at the design level can greatly reduce the error rate of the development code, and prove the correctness of the call logic between the design algorithm and the function that needs to be implemented. Third, at the code level, it is needed to prove the consistency of the implemented code and the functional design, and prove the correctness of code. Using the interactive theorem proving auxiliary tool Coq, this study takes the memory management module of a domestic space embedded operating system as an example, to prove the correctness of the memory management algorithm and the consistency of demand, design, and code.

    • Multi-attribute Data Indexing for Query Based Entity Resolution

      2022, 33(6):2331-2347. DOI: 10.13328/j.cnki.jos.006284 CSTR:

      Abstract (1631) HTML (1653) PDF 1.72 M (2988) Comment (0) Favorites

      Abstract:Entity resolution is a key aspect of data integration, and also is a necessary preprocessing step of big data analytics and mining. In big data era, more and more query-driven data analytics applications come out, and query-based entity resolution becomes a hot topic. This work studies multi-attribute data indexing technology for entity cache in order to promote query-resolution efficiency. There are two core problems. One is how to design the multi-attributeindex. An R-tree based multi-attributeindex is designed. Entity cache is produced online, so an online index construction method is proposed based on spatial clustering. A filter-verify based multi-dimensional query method is proposed. It filters impossible records by the multi-attributeindex, and then verifies each candidate record with similarity functions or distance functions. The other ishow to insert different string attributes into the tree index. The basic solution is mapping strings into integer spaces. For Jaccard similarity and edit similarity, a q-gram based mapping method is proposed, and is improved by vector dimension reduction and z-order, which achieves high mapping qualities. Finally, the proposed hybrid index is experimentally evaluated on two datasets. Its effectiveness is validated, and moreover, different aspects of the multi-attribute index are also tested.

    • Histogram Publication under Shuffled Differential Privacy

      2022, 33(6):2348-2363. DOI: 10.13328/j.cnki.jos.006363 CSTR:

      Abstract (1136) HTML (1403) PDF 1.57 M (3231) Comment (0) Favorites

      Abstract:Given a distributed set D of categorical data defined on a domain D, this work studies differentially private algorithms for releasing a histogram to approximate the categorical data distribution in D. Existing solutions for this problem mostly use central/local differential privacy models, which are two extreme assumptions of differential privacy. The two models, however, cannot balance the contradiction between the privacy requirement of users and the analysis accuracy of collectors. To remedy the deficiency caused by the current solutions under central/local differential privacy, this study proposes a differentially private method in a shuffling way, called HP-SDP, to release histogram. HP-SDP firstly employs the local hash technology to design the shuffled randomized response mechanism. Based on this mechanism, each user perturbs her/his data in a linear decomposition way of perturbation function, without worrying about the domain size, and reports the perturbed messages to the shuffler. And then, the shuffler in HP-SDP permutes the reported messages by using a uniformly random permutation method, which makes sure the shuffled messages satisfy central differential privacy, and the collector cannot reidentify a target user. Furthermore, HP-SDP adopts the convex programming technology to boost the accuracy of the released histogram. Theoretical analysis and experimental evaluations show that the proposed methods can effectively improve the utility of the histogram, and outperform the existing solutions.

Current Issue


Volume , No.

Table of Contents

Archive

Volume

Issue

联系方式
  • 《Journal of Software 》
  • 主办单位:Institute of Software, CAS, China
  • 邮编:100190
  • 电话:010-62562563
  • 电子邮箱:jos@iscas.ac.cn
  • 网址:https://www.jos.org.cn
  • 刊号:ISSN 1000-9825
  •           CN 11-2560/TP
  • 国内定价:70元
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063