[tor-commits] [tech-reports/master] adding subsection on aggregation techniques
karsten at torproject.org
Wed Jun 17 18:48:07 UTC 2015
commit 941b9516a472ae112a29540878424ae179ef7f86
Author: A. Johnson <aaron.m.johnson at nrl.navy.mil>
Date: Wed Dec 31 15:57:25 2014 -0500
adding subsection on aggregation techniques
---
2015/hidden-service-stats/hidden-service-stats.bib | 35 +++
2015/hidden-service-stats/hidden-service-stats.tex | 295 +++++++++++++++++---
2 files changed, 298 insertions(+), 32 deletions(-)
diff --git a/2015/hidden-service-stats/hidden-service-stats.bib b/2015/hidden-service-stats/hidden-service-stats.bib
index 52daad5..2fbb81a 100644
--- a/2015/hidden-service-stats/hidden-service-stats.bib
+++ b/2015/hidden-service-stats/hidden-service-stats.bib
@@ -4,4 +4,39 @@
booktitle = {Proceedings of the Third Conference on Theory of Cryptography},
series = {TCC'06},
year = {2006}
+}
+
+@inproceedings{nissim-stoc2007,
+ author = {Nissim, Kobbi and Raskhodnikova, Sofya and Smith, Adam},
+ title = {Smooth Sensitivity and Sampling in Private Data Analysis},
+ booktitle = {Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing},
+ series = {STOC '07},
+ year = {2007}
+}
+
+@article{pgen:homer,
+ author = {N.~Homer et al.},
+ title = {Resolving Individuals Contributing Trace Amounts of {DNA} to
+ Highly Complex Mixtures Using High-Density {SNP}
+ Genotyping Microarrays},
+ year = {2008},
+ journal = {PLoS Genet},
+ volume = {4},
+ year = {2008}
+}
+
+@inproceedings{elahi-ccs2014,
+ author = {Elahi, Tariq and Danezis, George and Goldberg, Ian},
+ title = {PrivEx: Private Collection of Traffic Statistics for Anonymous Communication Networks},
+ booktitle = {Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security},
+ series = {CCS '14},
+ year = {2014}
+}
+
+@inproceedings{dwork-stoc2009,
+ author = {Dwork, Cynthia and Lei, Jing},
+ title = {Differential Privacy and Robust Statistics},
+ booktitle = {Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing},
+ series = {STOC '09},
+ year = {2009}
}
\ No newline at end of file
diff --git a/2015/hidden-service-stats/hidden-service-stats.tex b/2015/hidden-service-stats/hidden-service-stats.tex
index 1400268..9aeb64c 100644
--- a/2015/hidden-service-stats/hidden-service-stats.tex
+++ b/2015/hidden-service-stats/hidden-service-stats.tex
@@ -473,6 +473,7 @@ No obvious risks. % only talking about aggregate statistics here, not
% is not the best example.
\subsubsection{Number of circuits built with TAP vs. nTor}
+\label{subsubsec:num_circ_tap}
\textbf{Details:}
%
@@ -510,6 +511,7 @@ Performance-related statistics and failure statistics are covered in a
later section.
\subsubsection{Number of established introduction points}
+\label{subsubsec:num_establish_ip}
\textbf{Details:}
%
@@ -646,7 +648,7 @@ Though it's unclear what we're going to do with this information.
This statistic will also be killed by rend-spec-ng.
\subsubsection{Number of descriptors with encrypted introduction points
-(3.1.5.)}
+(3.1.5.)} \label{subsubsec:num_desc_encrypted_ips}
\textbf{Details:}
%
@@ -665,6 +667,7 @@ performance-related statistics and failure statistics are covered at a
later time.
\subsubsection{Number of descriptor fetch requests (3.2.1.)}
+\label{subsubsec:num_desc_fetches}
\textbf{Details:}
%
@@ -693,6 +696,7 @@ Relays report the distribution of descriptor fetch requests to hidden
service identities.
\subsubsection{Number of established rendezvous points (2.1.1.)}
+\label{subsubsec:num_rps}
\textbf{Details:}
%
@@ -717,6 +721,7 @@ There is no obvious risk from sharing this number if aggregated over a
large enough time period.
\subsubsection{Number of introductions received from clients (1.2.1.)}
+\label{subsubsec:num_intros_from_clients}
\textbf{Details:}
%
@@ -772,6 +777,7 @@ points that never sees a single \verb+INTRODUCE1+ cell.
It's unclear what we'd do with this information, though.
\subsubsection{Number of server rendezvous (2.2.1.)}
+\label{subsubsec:num_server_rendezvous}
\textbf{Details:}
%
@@ -871,7 +877,7 @@ Still, this metric needs further analysis.
% time from RENDEZVOUS1 to first RELAY cell?)
\subsubsection{Number of closed rendezvous circuits without a single data
-cells}
+cell} \label{subsubsec:num_rend_circ_no_data}
\textbf{Details:}
%
@@ -957,7 +963,7 @@ Could be a starting point to look at actual logs from relays.
But is this what statistics are for?
\subsubsection{Number of failed attempts to establish an introduction
-point (1.1.3.)}
+point (1.1.3.)} \label{subsubsec:num_failed_ips}
\textbf{Details:}
%
@@ -989,7 +995,7 @@ or a deliberate action (data mangling, unknown attack, DoS, ...).
No obvious risks.
\subsubsection{Reasons for terminating established introduction points
-(1.1.5.)}
+(1.1.5.)} \label{subsubsec:reasons_end_ips}
\textbf{Details:}
%
@@ -1006,7 +1012,7 @@ things more robust.
No obvious risks.
\subsubsection{Number of descriptors published to the wrong directory
-(3.1.7.)}
+(3.1.7.)} \label{subsubsec:num_descriptors_wrong_hsdir}
\textbf{Details:}
%
@@ -1039,7 +1045,7 @@ might reveal information about specific services.
\subsubsection{Number of discarded client introductions by reason
-(1.2.3.)}
+(1.2.3.)} \label{subsubsec:num_discarded_client_intros}
\textbf{Details:}
%
@@ -1065,7 +1071,7 @@ the risk of reporting the number of received \verb+INTRODUCE1+ cells; if
only fractions are reported, it's not that bad.
%
\subsubsection{Number of server rendezvous with unknown rendezvous cookie
-(2.2.3.)}
+(2.2.3.)} \label{subsubsec:num_server_rend_unknown_cookie}
\textbf{Details:}
%
@@ -1105,13 +1111,13 @@ such as anonymizing reports from individual relays
We will be adding noise in a way that provides differential
privacy~\cite{dwork-tcc2006} for
-``single'' actions. What constitutes a single action will depend on the
+``single actions''. What constitutes a single action will depend on the
specific statistic. For example, when publishing the number of unique
descriptors seen at each HSDir, a single action could be publishing a
descriptor to
six relays. To obtain differential privacy, we will add noise using the
-Laplace distribution, which has a distribution function of
-$\textrm{Lap}(b) = e^{-|x|/b}/(2b)$. We will choose $b$ such that
+Laplace distribution, which has a probability density function
+$\textsf{Lap}(b)$ of $e^{-|x|/b}/(2b)$. We will choose $b$ such that
altering a single action will change the probability of the total output
by a factor of at most $e^{\epsilon}$. Thus more privacy is provided the
smaller that $\epsilon$ is.
@@ -1131,9 +1137,127 @@ periodically)
traffic (possibly leaked by the service itself, e.g. a web forum)
\end{compactitem}
-\subsection{Counts}
+\subsection{Counts} \label{subsec:counts}
+Reporting a basic count is useful in its own right, and it also provides a
+simple setting in which to develop privacy-preserving methodology that
+we can use for more complicated statistics. Counts can be used to
+summarize many types of hidden-service activity, such as the number of
+hidden services, the number of hidden-service clients, and the amount of
+hidden-service traffic. Table~\ref{table:count_stats} lists the statistics
+for which it could be useful to release counts.
+\begin{table}
+\centering
+\caption{Count statistics}
+\label{table:count_stats}
+\begin{tabular}{|l|l|}
+\hline
+\textbf{Description} & \textbf{Section}\\
+\hline
+Number of circuits built with TAP vs. nTor &
+\ref{subsubsec:num_circ_tap}\\
+\hline
+Number of established introduction points &
+\ref{subsubsec:num_establish_ip}\\
+\hline
+Number of descriptor publish requests &
+\ref{subsubsec:num_descriptor_publish}\\
+\hline
+Number of descriptors with encrypted introduction points &
+\ref{subsubsec:num_desc_encrypted_ips}\\
+\hline
+Number of descriptor fetch requests &
+\ref{subsubsec:num_desc_fetches}\\
+\hline
+Number of established rendezvous points &
+\ref{subsubsec:num_rps}\\
+\hline
+Number of introductions received from clients &
+\ref{subsubsec:num_intros_from_clients}\\
+\hline
+Number of server rendezvous &
+\ref{subsubsec:num_server_rendezvous}\\
+\hline
+Number of closed rendezvous circuits without a single data cell &
+\ref{subsubsec:num_rend_circ_no_data}\\
+\hline
+Number of failed attempts to establish an introduction point &
+\ref{subsubsec:num_failed_ips}\\
+\hline
+Reasons for terminating established introduction points &
+\ref{subsubsec:reasons_end_ips}\\
+\hline
+Number of descriptors published to the wrong directory &
+\ref{subsubsec:num_descriptors_wrong_hsdir}\\
+\hline
+Number of discarded client introductions by reason &
+\ref{subsubsec:num_discarded_client_intros}\\
+\hline
+Number of server rendezvous with unknown rendezvous cookie &
+\ref{subsubsec:num_server_rend_unknown_cookie}\\
+\hline
+\end{tabular}
+\end{table}
+
+To release a single count, we will use ideas from differential privacy
+to provide strong protection for a single hidden-service action, such as
+making a small number of descriptor lookups or sending anything short of an
+abnormally large amount of traffic. However, we wish to
+continually publish statistics over time, and as a result differential
+privacy, using a per-user or per-service privacy notion that applies over
+time, does not provide an adequate solution. The reasons are
+(\emph{i}) mechanisms using global sensitivity~\cite{dwork-tcc2006} that
+would apply over time require amounts of noise that grow with the
+reporting period (potentially years in our case), and (\emph{ii})
+improving the accuracy using local mechanisms~\cite{nissim-stoc2007} is
+not feasible because the Tor protocol by design hides which activities
+correspond to the same user or service.
+
+There are several challenges to privately publishing statistics over time.
+One is that, although the effect of a single action may be made difficult
+to determine in any given statistic, the collective set of statistics may
+reveal some level of activity by the same user or service. This problem
+of a single sensitive fact influencing many published statistics
+was effectively exploited by Homer et al.~\cite{pgen:homer} to determine
+whether an individual was a member of a diseased study group based only on
+per-gene statistics (per-gene data there play the role of per-time-period
+data here). Another challenge is that it may be
+possible to remove the noise added to the published values if ``fresh''
+noise (i.e. noise generated using new randomness) is added to each
+statistic. For example, if a count stays the same for a while, and
+the adversary knows that (or can guess it with some confidence), then
+the adversary can get a good estimate of the count by taking the
+mean of the sequence of published noisy counts. Simply reusing the
+same noise isn't adequate, however, because in that case the statistics
+would reveal with certainty all changes in the statistic and thus
+all activity since the fresh noise was chosen.
+
+To handle this problem, we will use simple rounding or ``binning''. This
+will hide changes to a count that keep it in the same bin. Of course,
+this won't hide the effects of an action if the count happens to be near
+the rounding threshold, and also the adversary can in most cases himself
+perform actions that alter the count to attempt to determine the location
+of the count within the bin. However, we will mitigate both of these
+issues by making the bin output itself noisy.
+
+We suggest the following to privately publish a count $c$:
+\begin{compactenum}
+\item Choose a bin granularity $\delta$, which should be larger than
+the amount by which a single action can change the count.
+\item Round $c$ to the nearest multiple of $\delta$, that is, let
+$\hat{c} = \delta[c/\delta]$, where $[\cdot]$ denotes the nearest-integer
+function.
+\item Choose a value $\nu$ from the $\textsf{Lap}(\delta/\epsilon)$
+distribution. $\epsilon$ is the privacy
+parameter discussed at the beginning of Sec.~\ref{sec:obfuscation}.
+$\delta$ appears in the Laplace
+parameter because a single action could cause the bin center to change
+by at most $\delta$.
+\item Let the noisy count be $\tilde{c} = \hat{c} + \nu$. Publish
+$\tilde{c}$.
+\end{compactenum}
-\subsection{Distributions}
+
+\subsection{Distributions} \label{subsec:distributions}
For many statistics, it would be very helpful to understand the
distribution of values. For example, such information about descriptor
fetches could reveal if most hidden services are never used or if
@@ -1141,6 +1265,7 @@ there are a few hidden services that constitute most HS activity.
Table~\ref{table:dist_stats} lists the statistics for which it could
be useful to release information about a distribution.
\begin{table}
+\centering
\caption{Statistics with interesting distributions}
\label{table:dist_stats}
\begin{tabular}{|l|l|}
@@ -1202,40 +1327,146 @@ skew, and kurtosis).
\end{compactitem}
To protect individual privacy when releasing these kinds of data,
we would again like to protect activity over time and also provide
-particularly-strong protection for a ``single'' activity. This is quite
+particularly-strong protection for a single activity. This is quite
straightforward to do for publishing
histograms, simply by applying the techniques that we developed for counts
to each
count in the histogram. Thus we suggest using histograms in this way to
report distribution data, as follows:
\begin{compactenum}
-\item Choose a finite number of \emph{buckets} that cover the possible
+\item Choose a finite number $k$ of \emph{buckets} that cover the possible
values of the statistic (we use the term ``buckets'' to distinguish
these from bins that will limit the granularity of each bucket). Each
extra bucket will result in a certain additional amount of noise being
added, but including more values in a bucket (i.e. increasing its width)
-reduces its accuracy. Therefore, these should be balanced while also
-choosing buckets that capture the most useful distinctions
+reduces its accuracy. Therefore, the number and width of the buckets should
+be balanced while also
+choosing buckets to capture the most useful distinctions
for the statistic under consideration (e.g. deciding between relative and
absolute accuracy).
-\item For each bucket, the count of values in that bucket should be
-rounded to a chosen granularity $\delta$ (e.g. to the nearest multiple of
-10). For simplicity, it is recommended that bins are not split over
-multiple buckets (e.g. there should not be buckets for values 0 and 1 if
-bin granularity is at least 2). A rounded value is used because over time
-the effects of fresh noise can be factored out (e.g. by taking the mean
-of a sequence of published values if the statistic stays the same over
-that time).
-\item Fresh Laplace noise with distribution
-$\textrm{Lap}(2\delta/\epsilon)$ should be added to the center of the bin
-of each bucket, where $\epsilon$ is the privacy parameter discussed at the
-beginning of Sec.~\ref{sec:obfuscation}. $\delta$ appears in $b$ because
-a single input to the histogram could cause the bucket center to change
-by at most $\delta$ (e.g. if the rounding threshold is just crossed).
-The value $2$ appears in $b$ because modifying a single entry in the
+\item For the $i$th bucket, the count $c_i$ of values in that bucket
+should be rounded to a chosen granularity $\delta$:
+$\hat{c_i} = \delta[c_i/\delta]$.
+$\delta$ should be larger than the amount by which a
+single activity could change the bucket count, where again the notion of a
+single activity depends on the context. Also, for simplicity, it is
+recommended that bins
+are not split over multiple buckets (e.g. there should not be buckets for
+values 0 and 1 if $\delta = 2$). The bins here serve the same purpose
+of protecting privacy over time that they did when publishing counts.
+\item Fresh Laplace noise $\nu_i$ with distribution
+$\textsf{Lap}(2\delta/\epsilon)$ should be added to the center of the
+bin of the $i$th bucket. Let the resulting value be
+$\tilde{c_i} = \hat{c_i} + \nu_i$. The value $2$ appears in the Laplace
+parameter because modifying a single entry in the
histogram can change two values: the value of the bucket it was changed
from and the value of the bucket it was changed to.
-\item The noisy bin center of each bucket is published.
+\item Publish each noisy bucket count, $\tilde{c_i}$, $1\le i\le k$.
+\end{compactenum}
+
+\subsection{Statistics aggregation}
+Up to this point it has been suggested that individual Tor relays collect
+and publish these statistics. The main advantage of this approach is that
+it is straightforward to implement. However, collecting all statistics in
+this manner has many drawbacks. Here we describe some of the problems and
+outline potential solutions.
+
+One problem with publishing statistics in a way that
+is attributable to specific Tor relays is that in some cases it is
+inherently insecure. For example, if each relay reported the number of
+client introduction requests it received
+(Sec.~\ref{subsubsec:num_intros_from_clients}), then an adversary that
+knows the \texttt{.onion} address of an HS (and thus can obtain its
+introduction points) could infer how many client connections the HS
+received, especially if few other HSes shared those introduction points. The
+general issue is that for certain types of HS activities different HSes
+or clients will make use of different relays in a way that may be
+known by the adversary. Table~\ref{table:stats_needing_aggregation} lists
+those statistics for which this is an issue.
+\begin{table}
+\centering
+\caption{Per-relay statistics at high risk of leaking private information}
+\label{table:stats_needing_aggregation}
+\begin{tabular}{|l|l|}
+\hline
+\textbf{Description} & \textbf{Section}\\
+\hline
+Number of descriptor fetch requests &
+\ref{subsubsec:num_desc_fetches}\\
+\hline
+Number of introductions received from clients &
+\ref{subsubsec:num_intros_from_clients}\\
+\hline
+Reasons for terminating established introduction points &
+\ref{subsubsec:reasons_end_ips}\\
+\hline
+Number of discarded client introductions by reason &
+\ref{subsubsec:num_discarded_client_intros}\\
+\hline
+Time from establishing introduction point to tearing down circuit &
+\ref{subsubsec:time_intro_to_teardown}\\
+\hline
+Number of descriptor updates per service &
+\ref{subsubsec:num_decriptor_updates}\\
+\hline
+Time between last and first published descriptor with same identifier &
+\ref{subsubsec:time_first_last_descriptor_update}\\
+\hline
+Number of descriptor fetch requests by service identity &
+\ref{subsubsec:num_descriptor_fetches_per_hs}\\
+\hline
+Number of introductions received by established introduction point &
+\ref{subsubsec:num_intros_per_circ}\\
+\hline
+Time from establishing introduction point to receiving first client
+introduction & \ref{subsubsec:time_ip_est_to_introduce1}\\
+\hline
+\end{tabular}
+\end{table}
+
+Another problem is that with per-relay statistics much more total noise
+needs to be added than is necessary if only network-wide totals are
+ultimately desired. Note that the rounding and noise applied to each
+relay's statistics (Secs.~\ref{subsec:counts} and
+\ref{subsec:distributions})
+would provide equivalent protection if applied to the same statistics for
+the entire network. This would reduce the amount of added noise and
+rounding inaccuracy by a factor of $m$, where $m$ is the number of relays.
+
+There are many possible ways to improve both the security and accuracy
+of Tor statistics via aggregation using well-studied cryptographic
+techniques, including
+\begin{compactitem}
+\item Have the relays run a secure multiparty computation (SMC) protocol
+that produces the desired statistics with any privacy-preserving
+modifications included (e.g. added noise).
+\item Take the approach of PrivEx~\cite{elahi-ccs2014} and use a separate
+set of ``tally servers'' that secret-share statistics and use homomorphic
+encryption to aggregate counts.
+\item Anonymize statistics reports, either via Tor itself or via a shuffle
+run over a separate set of servers. For statistics that could be
+attributable to a small set of relays by their values alone (e.g. a large
+number of rendezvous data cells is likely to come from a small set of
+large relays), break up the values into minimum amounts.
+\end{compactitem}
+
+Adoption of these approaches faces two main challenges. These
+challenges and our suggestions for handling them are:
+\begin{compactenum}
+\item Implementation difficulty: Making use of sophisticated
+cryptographic tools, such as non-standard cryptosystems or novel
+SMC protocols, often requires building secure implementations of
+them. This can require significant time and skill. It also
+creates future maintenance obligations. When choosing from the above
+solutions, we will have to consider what reliable software already exists
+for their various cryptographic components.
+\item Manipulation of statistics: Adversarial relays may report incorrect
+statistics in order to affect the aggregate statistic. For example, a
+simple aggregate such as a sum could be trivially destroyed by a malicious
+relay reporting a much larger value than its true input. One way to handle
+this problem is to use ``robust'' statistics~\cite{dwork-stoc2009}, which
+are not excessively influenced by outliers. For example, we could use a
+median instead of a mean as the basis for a sum.
\end{compactenum}
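
The count-publishing steps added in the Counts subsection (round to a
multiple of delta, then add Lap(delta/epsilon) noise), and the per-bucket
variant for histograms with Lap(2*delta/epsilon) noise, could be sketched
roughly as follows. This is only an illustration, not code from the report
or from Tor: the function names are hypothetical, the tie-breaking rule for
rounding (halves round up) is an assumption the text does not specify, and
Python's random module is not a cryptographically secure source of noise.

```python
import math
import random


def round_to_bin(count, delta):
    """Round count to the nearest multiple of delta.

    Assumption: exact halves round up; the report does not specify this.
    """
    return delta * math.floor(count / delta + 0.5)


def laplace_noise(scale, rng=random):
    """Draw one sample from Lap(scale), density e^{-|x|/scale} / (2*scale).

    The difference of two independent exponentials with mean `scale`
    is Laplace-distributed with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def publish_count(count, delta, epsilon, rng=random):
    """Bin the count, then add fresh Lap(delta/epsilon) noise."""
    return round_to_bin(count, delta) + laplace_noise(delta / epsilon, rng)


def publish_histogram(bucket_counts, delta, epsilon, rng=random):
    """Bin each bucket count, then add fresh Lap(2*delta/epsilon) noise.

    The factor 2 reflects that moving one entry changes two buckets.
    """
    scale = 2 * delta / epsilon
    return [round_to_bin(c, delta) + laplace_noise(scale, rng)
            for c in bucket_counts]


if __name__ == "__main__":
    # Illustrative values only; delta and epsilon are not recommendations.
    print(publish_count(1234, delta=10, epsilon=0.3))
    print(publish_histogram([3, 47, 112], delta=10, epsilon=0.3))
```

Because the noise is added to the bin center rather than to the raw count,
averaging many published values of an unchanged statistic recovers at most
the bin center, which is the point of combining binning with fresh noise.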