Security Methods for Statistical Databases
Introduction
§ Statistical Databases containing medical information are often used for research
§ Some of the data is protected by laws to help protect the privacy of the patient
§ Proper security precautions must be implemented to comply with laws and respect the sensitivity of the data
Accuracy vs. Confidentiality
§ Accuracy – Researchers want to extract accurate and meaningful data
§ Confidentiality – Patients, laws, and database administrators want to maintain the privacy of patients and the confidentiality of their information
Laws
§ Health Insurance Portability and Accountability Act – HIPAA (Privacy Rule)
§ Covered organizations must comply by April 14, 2003
§ Designed to improve the efficiency of the healthcare system through electronic exchange of data while maintaining security
§ Covered entities (health plans, healthcare clearinghouses, healthcare providers) may not use or disclose protected information except as permitted or required
§ Privacy Rule establishes a “minimum necessary standard” for the purpose of making covered entities evaluate their current regulations and security precautions
HIPAA Compliance
§ Companies offer 3rd Party Certification of covered entities
§ Such companies will check your company and associating companies for compliance with HIPAA
§ Can help with rapid implementation and compliance to HIPAA regulations
Types of Statistical Databases
§ Static – a static database is made once and never changes
§ Example: U.S. Census
§ Dynamic – changes continuously to reflect real-time data
§ Example: most online research databases
Security Methods
§ Access Restriction
§ Query Set Restriction
§ Microaggregation
§ Data Perturbation
§ Output Perturbation
§ Auditing
§ Random Sampling
Access Restriction
§ Databases normally have different access levels for different types of users
§ User IDs and passwords are the most common methods for restricting access
§ In a medical database:
  § Doctors/Healthcare Representatives – full access to information
  § Researchers – access to partial information only (e.g. aggregate information)
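The two access levels in the medical example can be sketched as a simple role check. This is only an illustration: the role names `doctor` and `researcher`, the `fetch` function, and the sample records are all hypothetical and not taken from any particular system.

```python
from statistics import mean

# Hypothetical patient records; all names and values are illustrative only.
RECORDS = [
    {"patient": "A", "age": 34, "diagnosis": "flu"},
    {"patient": "B", "age": 51, "diagnosis": "asthma"},
    {"patient": "C", "age": 47, "diagnosis": "flu"},
]

def fetch(role):
    """Return data according to the caller's access level."""
    if role == "doctor":
        # Full access: individual records are returned.
        return list(RECORDS)
    if role == "researcher":
        # Partial access: aggregate information only.
        return {"count": len(RECORDS), "mean_age": mean(r["age"] for r in RECORDS)}
    raise PermissionError(f"role {role!r} has no access")
```

A real system would authenticate the user (for example with a user ID and password) before mapping them to a role; that step is omitted here.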
Query Set Restriction
§ A query-set size control can limit the number of records that must be in the result set
§ Allows the query results to be displayed only if the size of the query set satisfies the condition
§ Setting a minimum query-set size can help protect against the disclosure of individual data
Query Set Restriction
§ Let K represent the minimum number of records that must be present in the query set
§ Let R represent the size of the query set
§ The query set can only be displayed if K ≤ R
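A minimal sketch of a query-set size control, assuming the display condition is that the query-set size R is at least the minimum K. The function name `restricted_query` and the sample ages are illustrative assumptions.

```python
def restricted_query(records, predicate, k):
    """Return matching records only if the query set has at least k records."""
    query_set = [r for r in records if predicate(r)]
    r_size = len(query_set)  # R: size of the query set
    if k <= r_size:          # K: minimum number of records required
        return query_set
    return None              # query refused: result set too small

ages = [23, 25, 31, 38, 42, 67]
print(restricted_query(ages, lambda a: a < 40, k=3))  # four matches, displayed
print(restricted_query(ages, lambda a: a > 60, k=3))  # one match, refused
```

The second query would isolate a single individual, so nothing is returned.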
Query Set Restriction
[Figure: Query 1 and Query 2 are issued against the Original Database; each set of Query Results is compared with K, giving Query 1 Results and Query 2 Results]
Microaggregation
§ Raw (individual) data is grouped into small aggregates before publication
§ The average value of the group replaces each value of the individual
§ Data with the most similarities are grouped together to maintain data accuracy
§ Helps to prevent disclosure of individual data
Microaggregation
§ National Agricultural Statistics Service (NASS) publishes data about farms
§ To protect against data disclosure, data is only released at the county level
§ Farms in each county are averaged together to maintain as much purity as possible while still protecting against disclosure
Microaggregation
Age    Microaggregated Age
10     11.67
12     11.67
13     11.67
57     56.67
54     56.67
59     56.67
(Each microaggregated age is the average of its group: ages 10, 12, 13 and ages 57, 54, 59.)
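The age example can be reproduced with a short sketch. It assumes fixed-size groups formed by sorting the values, which is one simple way to put similar values together; the function name `microaggregate` and the `group_size` parameter are illustrative.

```python
def microaggregate(values, group_size):
    """Replace each value with the mean of its group of similar values."""
    # Sort indices by value so that similar values land in the same group.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        avg = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = round(avg, 2)  # rounded to two decimals for display
    return result

ages = [10, 12, 13, 57, 54, 59]
print(microaggregate(ages, group_size=3))
# [11.67, 11.67, 11.67, 56.67, 56.67, 56.67]
```

In this sketch a final group may hold fewer than `group_size` values when the data does not divide evenly; a production scheme would usually merge such a remainder into a neighboring group.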
Microaggregation
[Figure: Original Data → Averaged → Microaggregated Data → User]
Data Perturbation
§ Perturbed data is raw data with noise added
§ Pro: With perturbed databases, if unauthorized data is accessed, the true value is not disclosed
§ Con: Data perturbation runs the risk of presenting biased data
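A minimal sketch of data perturbation, assuming zero-mean Gaussian noise is added once to every stored value. The function name `perturb_database`, the `noise_scale` parameter, and the salary figures are illustrative assumptions.

```python
import random

def perturb_database(values, noise_scale, seed=0):
    """Return a perturbed copy of the data; the raw values are not released."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    # Zero-mean Gaussian noise is an assumption; other distributions are possible.
    return [v + rng.gauss(0, noise_scale) for v in values]

salaries = [52000, 61000, 58000, 75000]
perturbed = perturb_database(salaries, noise_scale=1000)

# Every user queries the perturbed copy, never the original.
true_mean = sum(salaries) / len(salaries)
perturbed_mean = sum(perturbed) / len(perturbed)
print(true_mean, perturbed_mean)
```

The gap between the two means is the kind of bias the "Con" bullet refers to: statistics computed on the perturbed copy differ from those of the raw data.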
Data Perturbation
[Figure: Original Database → Noise Added → Perturbed Database, which is queried by User 1 and User 2]
Output Perturbation
§ Instead of the raw data being transformed as in Data Perturbation, only the output or query results are perturbed
§ The bias problem is less severe than with data perturbation
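A minimal sketch of output perturbation, assuming the query is a mean and the noise is zero-mean Gaussian. The function name `noisy_mean`, the `noise_scale` parameter, and the sample ages are illustrative assumptions.

```python
import random

def noisy_mean(values, noise_scale, rng):
    """Compute the true mean on the raw data, then perturb only the result."""
    true_result = sum(values) / len(values)
    return true_result + rng.gauss(0, noise_scale)

ages = [34, 51, 47, 29, 62]  # the stored data stays unmodified
rng = random.Random(0)       # seeded only to make the sketch repeatable
print(noisy_mean(ages, noise_scale=0.5, rng=rng))
```

Because the statistic is computed on unmodified data and noise is applied once to the answer, the error does not accumulate across records.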
Output Perturbation
[Figure: User 1 and User 2 each send a Query to the Original Database; Noise Added to Results before the Results are returned to each user]
Auditing
§ Auditing is the process of keeping track of all queries made by each user
§ Usually done with up-to-date logs
§ Each time a user issues a query, the log is checked to see if the user is querying the database maliciously
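A minimal sketch of log-based auditing. The slides do not say how a malicious query is detected, so this sketch assumes one simple heuristic: refuse a query whose record set differs from an earlier query by the same user in only a few records, since comparing the two answers could isolate those individuals. The class name `QueryAuditor` and the `min_difference` parameter are hypothetical.

```python
from collections import defaultdict

class QueryAuditor:
    """Keep a per-user log of answered query sets and check new queries against it."""

    def __init__(self, min_difference):
        self.min_difference = min_difference
        self.log = defaultdict(list)  # user -> list of frozensets of record ids

    def allow(self, user, record_ids):
        new_set = frozenset(record_ids)
        for old_set in self.log[user]:
            # Symmetric difference: records in exactly one of the two query sets.
            diff = len(new_set ^ old_set)
            if 0 < diff < self.min_difference:
                return False  # too close to an earlier query: could isolate individuals
        self.log[user].append(new_set)
        return True

auditor = QueryAuditor(min_difference=2)
print(auditor.allow("alice", {1, 2, 3, 4}))  # True: first query is logged
print(auditor.allow("alice", {1, 2, 3}))     # False: differs by one record
print(auditor.allow("bob", {1, 2, 3}))       # True: bob has no earlier queries
```

The log grows with every answered query and must be scanned each time, which illustrates why auditing carries a processing overhead per query.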
Random Sampling
§ Only a sample of the records meeting the requirements of the query is shown
§ Must maintain consistency by giving the exact same results to the same query
§ Weakness - Logically equivalent queries can result in a different query set
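A minimal sketch of consistent random sampling, assuming the sample is made repeatable by seeding the sampler from the text of the query. The function name `sampled_query`, the `rate` parameter, and the sample data are illustrative assumptions.

```python
import hashlib
import random

def sampled_query(records, predicate, query_text, rate=0.5):
    """Show only a sample of the matching records, repeatable per query text."""
    matches = [r for r in records if predicate(r)]
    # Seed the sampler from the query text so the same query gets the same sample.
    seed = int.from_bytes(hashlib.sha256(query_text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [r for r in matches if rng.random() < rate]

ages = list(range(20, 60))
first = sampled_query(ages, lambda a: a >= 30, "age >= 30")
again = sampled_query(ages, lambda a: a >= 30, "age >= 30")
other = sampled_query(ages, lambda a: not a < 30, "not age < 30")
print(first == again)  # True: same query text, same sample
```

Keying the sample on the query text also exposes the weakness noted above: `other` asks a logically equivalent question with different text, so it is seeded differently and will generally receive a different sample.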
Comparison Methods
The following criteria are used to determine the most effective methods of statistical database security:
§ Security – possibility of exact disclosure, partial disclosure, robustness
§ Richness of Information – amount of non-confidential information eliminated, bias, precision, consistency
§ Costs – initial implementation cost, processing overhead per query, user education
A Comparison of Methods
Method                   Security        Richness of Information   Costs
Query-set Restriction    Low             Low [1]                   Low
Microaggregation         Moderate        Moderate                  Moderate
Data Perturbation        High            High-Moderate             Low
Output Perturbation      Moderate        Moderate-Low              Low
Auditing                 Moderate-Low    Moderate                  High
Sampling                 Moderate        Moderate-Low              Moderate
[1] Quality is low because a lot of information can be eliminated if the query does not meet the requirements
Sources
§ This presentation is posted on http://www.cs.jmu.edu/users/aboutams
§ Adam, Nabil R.; Wortmann, John C.; Security-Control Methods for Statistical Databases: A Comparative Study; ACM Computing Surveys, Vol. 21, No. 4, December 1989 (http://delivery.acm.org/10.1145/80000/76895/p515-adam.pdf?key1=76895&key2=1947043301&coll=portal&dl=ACM&CFID=4702747&CFTOKEN=83773110)
§ Official HIPAA (http://cms.hhs.gov/hipaa/)
§ Bernstein, Stephen W.; Impact of HIPAA on BioTech/Pharma Research: Rules of the Road (http://www.privacyassociation.org/docs/3-02bernstein.pdf)
§ Service Bureau; 3rd Party Testing (http://hipaatesting.com/service_bureau.html)

