Revision 1852471

Date:
2019/01/29 19:35:44
Author:
eyal
Revision Log:
Add forgotten svn blog post to static content

Files:

  • datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html

     
    1
    2
    3
    4 <!doctype html>
    5 <html>
    6 <head>
    7 <meta charset="utf-8">
    8
    9 <!-- Always force latest IE rendering engine or request Chrome Frame -->
    10 <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
    11 <meta name="google-site-verification" content="9N7qTOUYyX4kYfXYc0OIomWJku3PVvGrf6oTNWg2CHI" />
    12
    13 <meta name="twitter:card" content="summary" />
    14 <meta name="twitter:site" content="@apachedatafu" />
    15 <meta name="twitter:title" content="A Look into PayPal’s Contributions to Apache DataFu" />
    16 <meta name="twitter:description" content=" Photo by Louis Reed on Unsplash As with many Apache projects with robust communities and growing ecosystems, Apache DataFu has contributions from individual code committers employed by various organizations. Users of Apache projects who contribute..." />
    17 <meta property="og:url" content="http://datafu.apache.org/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html" />
    18 <meta property="og:type" content="article" />
    19 <meta property="og:title" content="A Look into PayPal’s Contributions to Apache DataFu" />
    20 <meta property="og:description" content=" Photo by Louis Reed on Unsplash As with many Apache projects with robust communities and growing ecosystems, Apache DataFu has contributions from individual code committers employed by various organizations. Users of Apache projects who contribute..." />
    21
    22
    23 <!-- Use title if it's in the page YAML frontmatter -->
    24 <title>A Look into PayPal’s Contributions to Apache DataFu</title>
    25
    26 <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
    27 <script src="/javascripts/all.js"></script>
    28
    29 <script type="text/javascript">
    30 var _gaq = _gaq || [];
    31 _gaq.push(['_setAccount', 'UA-30533336-2']);
    32 _gaq.push(['_trackPageview']);
    33
    34 (function() {
    35 var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    36 ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    37 var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    38 })();
    39 </script>
    40 </head>
    41
    42 <body class="blog blog_2019 blog_2019_01 blog_2019_01_29 blog_2019_01_29_a-look-at-paypals-contributions-to-datafu">
    43
    44 <div class="container">
    45
    46
    47 <div class="header">
    48
    49 <ul class="nav nav-pills pull-right">
    50 <li><a href="/blog">Blog</a></li>
    51 </ul>
    52
    53 <h3 class="header-title"><a href="/">Apache DataFu&trade;</a></h3>
    54
    55 </div>
    56
    57
    58 <div class="row">
    59 <article class="col-lg-10">
    60 <h1>A Look into PayPal’s Contributions to Apache DataFu</h1>
    61 <h5 class="text-muted"><time>Jan 29, 2019</time></h5>
    62 <h5 class="text-muted">Eyal Allweil</h5>
    63
    64 <hr>
    65
    66 <p><img alt="1*rzrpfvbz7 idjxy 6teaxq" src="https://cdn-images-1.medium.com/max/1600/1*RZRPFvbZ7_IdJxY-6TeaxQ.jpeg" /></p>
    67
    68 <p>Photo by <a href="https://unsplash.com/photos/pwcKF7L4-no?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Louis Reed</a> on <a href="https://unsplash.com/search/photos/test-tubes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
    69
    70 <p><strong><em>As with many Apache projects with robust communities and growing ecosystems,</em></strong> <a href="http://datafu.apache.org/"><strong><em>Apache DataFu</em></strong></a> <strong><em>has contributions from individual code committers employed by various organizations. Users of Apache projects who contribute code back to the project benefit everyone. This is PayPal&#39;s story.</em></strong></p>
    71
    72 <p>At PayPal, we often work on large datasets in a Hadoop environment — crunching up to petabytes of data and using a variety of sophisticated tools in order to fight fraud. One of the tools we use to do so is <a href="https://pig.apache.org/">Apache Pig</a>. Pig is a simple, high-level programming language that consists of just a few dozen operators, but it allows you to write powerful queries and transformations over Hadoop.</p>
    73
    74 <p>Pig also lets you extend its capabilities by writing macros and UDFs (user-defined functions). At PayPal, we’ve written a variety of both, and contributed many of them to the <a href="http://datafu.apache.org/">Apache DataFu</a> project. In this blog post we’d like to explain what we’ve contributed and present a guide to how we use them.</p>
    75
    76 <hr>
    77
    78 <p><br></p>
    79
    80 <p><strong>1. Finding the most recent update of a given record — the <em>dedup</em> (de-duplication) macro</strong></p>
    81
    82 <p>A common scenario in data sent to the HDFS — the Hadoop Distributed File System — is multiple rows representing updates for the same logical data. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let’s consider the following simplified example.</p>
    83
    84 <p><br>
    85 <script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
    86 <center>Raw customers’ data, with more than one row per customer</center>
    87 <br></p>
    88
    89 <p>We can see that though most of the customers only appear once, <em>julia</em> and <em>quentin</em> have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? For this we can use the <em>dedup</em> macro, as below:</p>
    90 <pre class="highlight pig"><code><span class="k">REGISTER</span> <span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
    91
    92 <span class="k">IMPORT</span> <span class="s1">'datafu/dedup.pig'</span><span class="p">;</span>
    93
    94 <span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'customers.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">purchases</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
    95
    96 <span class="n">dedup_data</span> <span class="o">=</span> <span class="n">dedup</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'id'</span><span class="p">,</span> <span class="s1">'date_updated'</span><span class="p">);</span>
    97
    98 <span class="k">STORE</span> <span class="n">dedup_data</span> <span class="k">INTO</span> <span class="s1">'dedup_out'</span><span class="p">;</span>
    99 </code></pre>
    100
    101 <p>Our result will be as expected — each customer only appears once, as you can see below:</p>
    102
    103 <p><br>
    104 <script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
    105 <center>“Deduplicated” data, with only the most recent record for each customer</center>
    106 <br></p>
    107
    108 <p>One nice thing about this macro is that you can use more than one field to dedup the data. For example, if we wanted to use both the <em>id</em> and <em>name</em> fields, we would change this line:</p>
    109 <pre class="highlight pig"><code><span class="n">dedup_data</span> <span class="o">=</span> <span class="n">dedup</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'id'</span><span class="p">,</span> <span class="s1">'date_updated'</span><span class="p">);</span>
    110 </code></pre>
    111
    112 <p>to this:</p>
    113 <pre class="highlight pig"><code><span class="n">dedup_data</span> <span class="o">=</span> <span class="n">dedup</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'(id, name)'</span><span class="p">,</span> <span class="s1">'date_updated'</span><span class="p">);</span>
    114 </code></pre>
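
    <p>To make the behavior concrete, here is a rough “pure” Pig sketch of what de-duplicating by <em>id</em> on <em>date_updated</em> amounts to. It is an illustration only, not the macro’s actual implementation, and it assumes that <em>date_updated</em> is stored in a format whose string order matches chronological order. The output path <em>dedup_manual_out</em> is a made-up name.</p>
    <pre class="highlight pig"><code>data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);

    -- Collect all rows that share the same id into one bag per customer.
    grouped = GROUP data BY id;

    -- Within each bag, sort newest first and keep only the top row.
    latest = FOREACH grouped {
        sorted = ORDER data BY date_updated DESC;
        newest = LIMIT sorted 1;
        GENERATE FLATTEN(newest);
    };

    -- 'dedup_manual_out' is a hypothetical output path.
    STORE latest INTO 'dedup_manual_out';
    </code></pre>

    <p>The macro saves you from repeating this pattern in every script that needs it.</p>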
    115
    116 <hr>
    117
    118 <p><br></p>
    119
    120 <p><strong>2. Preparing a sample of records based on a list of keys — the sample_by_keys macro.</strong></p>
    121
    122 <p>Another common use case we’ve encountered is the need to prepare a sample based on a small subset of records. DataFu already includes a number of UDFs for sampling purposes, but they are all based on random selection. Sometimes, at PayPal, we needed to create a table representing a manually chosen sample of customers, but with exactly the same fields as the original table. For that we use the <em>sample_by_keys</em> macro. For example, let’s say we want customers 2, 4 and 6 from <em>customers.csv</em>. If we have this list stored on HDFS as <em>sample.csv</em>, we could use the following Pig script:</p>
    123 <pre class="highlight pig"><code><span class="k">REGISTER</span> <span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
    124
    125 <span class="k">IMPORT</span> <span class="s1">'datafu/sample_by_keys.pig'</span><span class="p">;</span>
    126
    127 <span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'customers.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">purchases</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">updated</span><span class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
    128
    129 <span class="n">customers</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'sample.csv'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">cust_id</span><span class="p">:</span> <span class="n">int</span><span class="p">);</span>
    130
    131 <span class="n">sampled</span> <span class="o">=</span> <span class="n">sample_by_keys</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">customers</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">cust_id</span><span class="p">);</span>
    132
    133 <span class="k">STORE</span> <span class="n">sampled</span> <span class="k">INTO</span> <span class="s1">'sample_out'</span><span class="p">;</span>
    134 </code></pre>
    135
    136 <p>The result will be all the records from our original table for customers 2, 4 and 6. Notice that the original row structure is preserved, and that customer 2, <em>julia</em>, has two rows, as was the case in our original data. This is important for making sure that code run on this sample behaves exactly as it would on the original data.</p>
    137
    138 <p><br>
    139 <script src="https://gist.github.com/eyala/28985cc0e3f338d044cc5ebb779f6454.js"></script>
    140 <center>Only customers 2, 4, and 6 appear in our new sample</center>
    141 <br></p>
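
    <p>For comparison, a similar result can be sketched by hand with a join followed by a projection. This is a minimal sketch rather than the macro’s source; it assumes the key list is small enough for a replicated join, and the relation name <em>joined</em> is made up for the example.</p>
    <pre class="highlight pig"><code>data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, updated: chararray);
    customers = LOAD 'sample.csv' AS (cust_id: int);

    -- In a replicated join the small relation goes last; Pig loads it into memory.
    joined = JOIN data BY id, customers BY cust_id USING 'replicated';

    -- Keep only the fields of the original table, dropping the key from the sample side.
    sampled = FOREACH joined GENERATE data::id AS id, data::name AS name, data::purchases AS purchases, data::updated AS updated;

    STORE sampled INTO 'sample_out';
    </code></pre>

    <p>The projection step is what keeps the output schema identical to the original table, which is the part that is easy to forget when writing the join by hand.</p>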
    142
    143 <hr>
    144
    145 <p><br></p>
    146
    147 <p><strong>3. Comparing expected and actual results for regression tests — the diff_macro</strong></p>
    148
    149 <p>After making changes in an application’s logic, we are often interested in the effect they have on our output. One common use case is when we refactor — we don’t expect our output to change. Another is a surgical change which should only affect a very small subset of records. For easily performing such regression tests on actual data, we use the <em>diff_macro</em>, which is based on DataFu’s <em>TupleDiff</em> UDF.</p>
    150
    151 <p>Let’s look at a table which is exactly like <em>dedup_out</em>, but with four changes.</p>
    152
    153 <ol>
    154 <li> We will remove record 1, <em>quentin</em></li>
    155 <li> We will change <em>date_updated</em> for record 2, <em>julia</em></li>
    156 <li> We will change <em>purchases</em> and <em>date_updated</em> for record 4, <em>alice</em></li>
    157 <li> We will add a new row, record 8, <em>amanda</em></li>
    158 </ol>
    159
    160 <p><br>
    161 <script src="https://gist.github.com/eyala/699942d65471f3c305b0dcda09944a95.js"></script>
    162 <br></p>
    163
    164 <p>We’ll run the following Pig script, using DataFu’s <em>diff_macro</em>:</p>
    165 <pre class="highlight pig"><code><span class="k">REGISTER</span> <span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
    166
    167 <span class="k">IMPORT</span> <span class="s1">'datafu/diff_macros.pig'</span><span class="p">;</span>
    168
    169 <span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'dedup_out.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">purchases</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
    170
    171 <span class="n">changed</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'dedup_out_changed.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">purchases</span><span class="p">:</span> <span class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
    172
    173 <span class="n">diffs</span> <span class="o">=</span> <span class="n">diff_macro</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="n">changed</span><span class="p">,</span><span class="n">id</span><span class="p">,</span><span class="s1">''</span><span class="p">);</span>
    174
    175 <span class="k">DUMP</span> <span class="n">diffs</span><span class="p">;</span>
    176 </code></pre>
    177
    178 <p>The results look like this:</p>
    179
    180 <p><br>
    181 <script src="https://gist.github.com/eyala/3d36775faf081daad37a102f25add2a4.js"></script>
    182 <br></p>
    183
    184 <p>Let’s take a moment to look at these results. They have the same general structure. Rows that start with <em>missing</em> indicate records that were in the first relation, but aren’t in the new one. Conversely, rows that start with <em>added</em> indicate records that are in the new relation, but not in the old one. Each of these rows is followed by the relevant tuple from the relations.</p>
    185
    186 <p>The rows that start with <em>changed</em> are more interesting. The word <em>changed</em> is followed by a list of the fields which have changed values in the new table. For the row with <em>id</em> 2, this is the <em>date_updated</em> field. For the row with <em>id</em> 4, this is the <em>purchases</em> and <em>date_updated</em> fields.</p>
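
    <p>The three kinds of rows can be approximated in plain Pig with a full outer join on <em>id</em>. The sketch below only illustrates the idea; it is not how <em>diff_macro</em> or <em>TupleDiff</em> is implemented, and the relation names <em>joined</em>, <em>missing</em>, <em>added</em> and <em>modified</em> are made up for the example.</p>
    <pre class="highlight pig"><code>data = LOAD 'dedup_out.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
    changed = LOAD 'dedup_out_changed.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);

    -- Pair up rows by id, keeping rows that exist on only one side.
    joined = JOIN data BY id FULL OUTER, changed BY id;

    -- Present in the old relation only.
    missing = FILTER joined BY changed::id IS NULL;

    -- Present in the new relation only.
    added = FILTER joined BY data::id IS NULL;

    -- Present in both, with at least one differing non-key field.
    modified = FILTER joined BY data::id IS NOT NULL AND changed::id IS NOT NULL
        AND (data::name != changed::name
             OR data::purchases != changed::purchases
             OR data::date_updated != changed::date_updated);

    DUMP modified;
    </code></pre>

    <p>Note that comparisons involving null fields evaluate to null in Pig, so rows with null values would need extra handling in a hand-written version like this one.</p>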
    187
    188 <p>One field we might want to ignore is <em>date_updated</em>. If the only difference between two records is when they were last updated, we might want to skip them for a more concise diff. To do so, we change the following line in our original Pig script from this:</p>
    189 <pre class="highlight pig"><code><span class="n">diffs</span> <span class="o">=</span> <span class="n">diff_macro</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="n">changed</span><span class="p">,</span><span class="n">id</span><span class="p">,</span><span class="s1">''</span><span class="p">);</span>
    190 </code></pre>
    191
    192 <p>to become this:</p>
    193 <pre class="highlight pig"><code><span class="n">diffs</span> <span class="o">=</span> <span class="n">diff_macro</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="n">changed</span><span class="p">,</span><span class="n">id</span><span class="p">,</span><span class="s1">'date_updated'</span><span class="p">);</span>
    194 </code></pre>
    195
    196 <p>If we run our changed Pig script, we’ll get the following result.</p>
    197
    198 <p><br>
    199 <script src="https://gist.github.com/eyala/d9b0d5c60ad4d8bbccc79c3527f99aca.js"></script>
    200 <br></p>
    201
    202 <p>The row for <em>julia</em> is missing from our diff, because only <em>date_updated</em> has changed, but the row for <em>alice</em> still appears, because the <em>purchases</em> field has also changed.</p>
    203
    204 <p>There’s one implementation detail that’s important to know — the macro uses a replicated join in order to be able to run quickly on very large tables, so the sample table needs to be able to fit in memory.</p>
    205
    206 <hr>
    207
    208 <p><br></p>
    209
    210 <p><strong>4. Counting distinct records, but only up to a limited amount — the <em>CountDistinctUpTo</em> UDF</strong></p>
    211
    212 <p>Sometimes our analytical logic requires us to filter out accounts that don’t have enough data. For example, we might want to look only at customers with a certain small minimum number of transactions. This is not difficult to do in Pig; you can group by the customer’s id, count the number of distinct transactions, and filter out the customers that don’t have enough.</p>
    213
    214 <p>Let’s use the following table as an example:</p>
    215
    216 <p><br>
    217 <script src="https://gist.github.com/eyala/73dc69d0b5f513c53c4dac72c71daf7c.js"></script>
    218 <br></p>
    219
    220 <p>You can use the following “pure” Pig script to get the number of distinct transactions per name:</p>
    221 <pre class="highlight pig"><code><span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'transactions.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">transaction_id</span><span class="p">:</span><span class="n">int</span><span class="p">);</span>
    222
    223 <span class="n">grouped</span> <span class="o">=</span> <span class="k">GROUP</span> <span class="n">data</span> <span class="k">BY</span> <span class="n">name</span><span class="p">;</span>
    224
    225 <span class="n">counts</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="p">{</span>
    226 <span class="n">distincts</span> <span class="o">=</span> <span class="k">DISTINCT</span> <span class="n">data</span><span class="p">.</span><span class="n">transaction_id</span><span class="p">;</span>
    227 <span class="k">GENERATE</span> <span class="k">group</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">distincts</span><span class="p">)</span> <span class="k">AS</span> <span class="n">distinct_count</span><span class="p">;</span>
    228 <span class="p">};</span>
    229
    230 <span class="k">DUMP</span> <span class="n">counts</span><span class="p">;</span>
    231 </code></pre>
    232
    233 <p>This will produce the following output:</p>
    234
    235 <p><br>
    236 <script src="https://gist.github.com/eyala/a9cd0ffb99039758f63b9d08c40b1124.js"></script>
    237 <br></p>
    238
    239 <p>Note that <em>julia</em> has a count of 1, because although she has 2 rows, they have the same transaction id.</p>
    240
    241 <p>However, accounts in PayPal can differ wildly in their scope. For example, a transactions table might have only a few purchases for an individual, but millions for a large company. This is an example of data skew, and the procedure I described above would not work effectively in such cases. This has to do with how Pig translates the nested foreach statement — it will keep all the distinct records in memory while counting.</p>
    242
    243 <p>In order to get the same count with much better performance, you can use the <em>CountDistinctUpTo</em> UDF. Let’s look at the following Pig script, which counts distinct transactions up to 3 and 5:</p>
    244 <pre class="highlight pig"><code><span class="k">REGISTER</span> <span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">5</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
    245
    246 <span class="k">DEFINE</span> <span class="n">CountDistinctUpTo3</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountDistinctUpTo</span><span class="p">(</span><span class="s1">'3'</span><span class="p">);</span>
    247 <span class="k">DEFINE</span> <span class="n">CountDistinctUpTo5</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountDistinctUpTo</span><span class="p">(</span><span class="s1">'5'</span><span class="p">);</span>
    248
    249 <span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'transactions.csv'</span> <span class="k">USING</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span class="p">,</span> <span class="n">transaction_id</span><span class="p">:</span><span class="n">int</span><span class="p">);</span>
    250
    251 <span class="n">grouped</span> <span class="o">=</span> <span class="k">GROUP</span> <span class="n">data</span> <span class="k">BY</span> <span class="n">name</span><span class="p">;</span>
    252
    253 <span class="n">counts</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="k">GENERATE</span> <span class="k">group</span><span class="p">,</span><span class="n">CountDistinctUpTo3</span><span class="p">(</span><span class="n">$1</span><span class="p">)</span> <span class="k">as</span> <span class="n">cnt3</span><span class="p">,</span> <span class="n">CountDistinctUpTo5</span><span class="p">(</span><span class="n">$1</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cnt5</span><span class="p">;</span>
    254
    255 <span class="k">DUMP</span> <span class="n">counts</span><span class="p">;</span>
    256 </code></pre>
    257
    258 <p>This results in the following output:</p>
    259
    260 <p><br>
    261 <script src="https://gist.github.com/eyala/19e22fb251fe2222b3ccea6f78e37a85.js"></script>
    262 <br></p>
    263
    264 <p>Notice that when we ask <em>CountDistinctUpTo</em> to stop at 3, <em>quentin</em> gets a count of 3, even though he has 4 transactions. When we use 5 as a parameter to <em>CountDistinctUpTo</em>, he gets the actual count of 4.</p>
    265
    266 <p>In our example, there’s no real reason to use the <em>CountDistinctUpTo</em> UDF. But in our “real” use case, stopping the count at a small number instead of counting millions saves resources and time. The improvement is because the UDF doesn’t need to keep all the records in memory in order to return the desired result.</p>
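
    <p>To tie this back to the original goal of dropping accounts without enough data, the capped count can feed a filter directly. The following is a minimal sketch, assuming a threshold of 3 distinct transactions; the relation name <em>enough</em> is made up for the example.</p>
    <pre class="highlight pig"><code>REGISTER datafu-pig-1.5.0.jar;

    DEFINE CountDistinctUpTo3 datafu.pig.bags.CountDistinctUpTo('3');

    data = LOAD 'transactions.csv' USING PigStorage(',') AS (name: chararray, transaction_id: int);

    grouped = GROUP data BY name;

    -- The count stops at 3, which is all the filter below needs to know.
    counts = FOREACH grouped GENERATE group AS name, CountDistinctUpTo3($1) AS cnt3;

    -- Keep only the names that reached the threshold.
    enough = FILTER counts BY cnt3 &gt;= 3;

    DUMP enough;
    </code></pre>

    <p>Because the cap equals the threshold, counting beyond it would not change which names pass the filter.</p>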
    267
    268 <hr>
    269
    270 <p><br></p>
    271
    272 <p>I hope that I’ve managed to explain how to use our new contributions to DataFu. You can find all of the files used in this post by clicking the GitHub gists.</p>
    273
    274 <hr>
    275
    276 <p>A version of this post has appeared in the <a href="https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312">PayPal Engineering Blog.</a></p>
    277
    278
    279 </article>
    280 </div>
    281
    282
    283
    284 <div class="footer">
    285
    286 <div class="feather">
    287 <a href="http://www.apache.org/" target="_blank"><img src="/images/feather.png" alt="Apache Feather" title="Apache Feather"/></a>
    288 </div>
    289
    290 <div class="copyright">
    291 Copyright &copy; 2011-2019 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br>
    292 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a> in the United States and other countries.
    293 </div>
    294 </div>
    295
    296 </div>
    297
    298 </body>
    299 </html>