svn commit: r304759 - in vendor/llvm/dist: docs include/llvm/Transforms/Scalar lib/Analysis lib/Target/AArch64 lib/Target/PowerPC lib/Transforms/Scalar lib/Transforms/Utils lib/Transforms/Vectorize...

Dimitry Andric <dim@FreeBSD.org>
Wed Aug 24 17:35:40 UTC 2016


Author: dim
Date: Wed Aug 24 17:35:37 2016
New Revision: 304759
URL: https://svnweb.freebsd.org/changeset/base/304759

Log:
  Vendor import of llvm release_39 branch r279477:
  https://llvm.org/svn/llvm-project/llvm/branches/release_39@279477

Added:
  vendor/llvm/dist/test/CodeGen/AArch64/ldst-paired-aliasing.ll
  vendor/llvm/dist/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
Deleted:
  vendor/llvm/dist/test/Transforms/Reassociate/prev_insts_canonicalized.ll
Modified:
  vendor/llvm/dist/docs/ReleaseNotes.rst
  vendor/llvm/dist/include/llvm/Transforms/Scalar/Reassociate.h
  vendor/llvm/dist/lib/Analysis/ScalarEvolution.cpp
  vendor/llvm/dist/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
  vendor/llvm/dist/lib/Target/PowerPC/PPCISelLowering.cpp
  vendor/llvm/dist/lib/Transforms/Scalar/Reassociate.cpp
  vendor/llvm/dist/lib/Transforms/Utils/CloneFunction.cpp
  vendor/llvm/dist/lib/Transforms/Vectorize/SLPVectorizer.cpp
  vendor/llvm/dist/test/Analysis/ScalarEvolution/flags-from-poison.ll
  vendor/llvm/dist/test/CodeGen/AArch64/ldst-opt.ll
  vendor/llvm/dist/test/CodeGen/PowerPC/ppc64-sibcall.ll
  vendor/llvm/dist/test/Transforms/Inline/inline_constprop.ll
  vendor/llvm/dist/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll
  vendor/llvm/dist/test/Transforms/Reassociate/xor_reassoc.ll

Modified: vendor/llvm/dist/docs/ReleaseNotes.rst
==============================================================================
--- vendor/llvm/dist/docs/ReleaseNotes.rst	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/docs/ReleaseNotes.rst	Wed Aug 24 17:35:37 2016	(r304759)
@@ -5,12 +5,6 @@ LLVM 3.9 Release Notes
 .. contents::
     :local:
 
-.. warning::
-   These are in-progress notes for the upcoming LLVM 3.9 release.  You may
-   prefer the `LLVM 3.8 Release Notes <http://llvm.org/releases/3.8.0/docs
-   /ReleaseNotes.html>`_.
-
-
 Introduction
 ============
 
@@ -26,11 +20,6 @@ have questions or comments, the `LLVM De
 <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ is a good place to send
 them.
 
-Note that if you are reading this file from a Subversion checkout or the main
-LLVM web page, this document applies to the *next* release, not the current
-one.  To see the release notes for a specific release, please see the `releases
-page <http://llvm.org/releases/>`_.
-
 Non-comprehensive list of changes in this release
 =================================================
 * The LLVMContext gains a new runtime check (see
@@ -45,10 +34,10 @@ Non-comprehensive list of changes in thi
   please see the documentation on :doc:`CMake`. For information about the CMake
   language there is also a :doc:`CMakePrimer` document available.
 
-* .. note about C API functions LLVMParseBitcode,
-   LLVMParseBitcodeInContext, LLVMGetBitcodeModuleInContext and
-   LLVMGetBitcodeModule having been removed. LLVMGetTargetMachineData has been
-   removed (use LLVMGetDataLayout instead).
+* C API functions LLVMParseBitcode,
+  LLVMParseBitcodeInContext, LLVMGetBitcodeModuleInContext and
+  LLVMGetBitcodeModule have been removed. LLVMGetTargetMachineData has been
+  removed (use LLVMGetDataLayout instead).
 
 * The C API function LLVMLinkModules has been removed.
 
@@ -68,52 +57,35 @@ Non-comprehensive list of changes in thi
   iterator to the next instruction instead of ``void``. Targets that previously
   did ``MBB.erase(I); return;`` now probably want ``return MBB.erase(I);``.
 
-* ``SelectionDAGISel::Select`` now returns ``void``. Out of tree targets will
+* ``SelectionDAGISel::Select`` now returns ``void``. Out-of-tree targets will
   need to be updated to replace the argument node and remove any dead nodes in
   cases where they currently return an ``SDNode *`` from this interface.
 
-* Raised the minimum required CMake version to 3.4.3.
-
 * Added the MemorySSA analysis, which hopes to replace MemoryDependenceAnalysis.
   It should provide higher-quality results than MemDep, and be algorithmically
   faster than MemDep. Currently, GVNHoist (which is off by default) makes use of
   MemorySSA.
 
-.. NOTE
-   For small 1-3 sentence descriptions, just add an entry at the end of
-   this list. If your description won't fit comfortably in one bullet
-   point (e.g. maybe you would like to give an example of the
-   functionality, or simply have a lot to talk about), see the `NOTE` below
-   for adding a new subsection.
-
-* ... next change ...
-
-.. NOTE
-   If you would like to document a larger change, then you can add a
-   subsection about it right here. You can copy the following boilerplate
-   and un-indent it (the indentation causes it to be inside this comment).
-
-   Special New Feature
-   -------------------
-
-   Makes programs 10x faster by doing Special New Thing.
+* The minimum density for lowering switches with jump tables has been reduced
+  from 40% to 10% for functions which are not marked ``optsize`` (that is,
+  compiled with ``-Os``).
 
 GCC ABI Tag
 -----------
 
-Recently, many of the Linux distributions (ex. `Fedora <http://developerblog.redhat.com/2015/02/10/gcc-5-in-fedora/>`_,
+Recently, many of the Linux distributions (e.g. `Fedora <http://developerblog.redhat.com/2015/02/10/gcc-5-in-fedora/>`_,
 `Debian <https://wiki.debian.org/GCC5>`_, `Ubuntu <https://wiki.ubuntu.com/GCC5>`_)
 have moved on to use the new `GCC ABI <https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Attributes.html>`_
 to work around `C++11 incompatibilities in libstdc++ <https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html>`_.
 This caused `incompatibility problems <https://gcc.gnu.org/ml/gcc-patches/2015-04/msg00153.html>`_
-with other compilers (ex. Clang), which needed to be fixed, but due to the
+with other compilers (e.g. Clang), which needed to be fixed, but due to the
 experimental nature of GCC's own implementation, it took a long time for it to
-land in LLVM (`here <https://reviews.llvm.org/D18035>`_ and
-`here <https://reviews.llvm.org/D17567>`_), not in time for the 3.8 release.
+land in LLVM (`D18035 <https://reviews.llvm.org/D18035>`_ and
+`D17567 <https://reviews.llvm.org/D17567>`_), not in time for the 3.8 release.
 
-Those patches are now present in the 3.9.0 release and should be working on the
+Those patches are now present in the 3.9.0 release and should be working in the
 majority of cases, as they have been tested thoroughly. However, some bugs were
-`filled in GCC <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71712>`_ and have not
+`filed in GCC <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71712>`_ and have not
 yet been fixed, so there may be corner cases not covered by either GCC or Clang.
 Bug fixes to those problems should be reported in Bugzilla (either LLVM or GCC),
 and patches to LLVM's trunk are very likely to be back-ported to future 3.9.x
@@ -131,6 +103,10 @@ Changes to the LLVM IR
   ``llvm.masked.gather`` and ``llvm.masked.scatter`` were introduced to the
   LLVM IR to allow selective memory access for vector data types.
 
+* The new ``notail`` attribute prevents optimization passes from adding ``tail``
+  or ``musttail`` markers to a call. It is used to prevent tail call
+  optimization from being performed on the call.
+
 Changes to LLVM's IPO model
 ---------------------------
 
@@ -145,7 +121,7 @@ Support for ThinLTO
 -------------------
 
 LLVM now supports ThinLTO compilation, which can be invoked by compiling
-and linking with -flto=thin. The gold linker plugin, as well as linkers
+and linking with ``-flto=thin``. The gold linker plugin, as well as linkers
 that use the new ThinLTO API in libLTO (like ld64), will transparently
 execute the ThinLTO backends in parallel threads.
 For more information on ThinLTO and the LLVM implementation, see the
@@ -238,7 +214,7 @@ fixes:**
 Changes to the PowerPC Target
 -----------------------------
 
- Moved some optimizations from O3 to O2 (D18562)
+* Moved some optimizations from O3 to O2 (D18562)
 
 * Enable sibling call optimization on ppc64 ELFv1/ELFv2 abi
 
@@ -266,18 +242,6 @@ Changes to the AMDGPU Target
  * Mesa 11.0.x is no longer supported
 
 
-Changes to the OCaml bindings
------------------------------
-
- During this release ...
-
-Support for attribute 'notail' has been added
----------------------------------------------
-
-This marker prevents optimization passes from adding 'tail' or
-'musttail' markers to a call. It is used to prevent tail call
-optimization from being performed on the call.
-
 External Open Source Projects Using LLVM 3.9
 ============================================
 
@@ -285,8 +249,6 @@ An exciting aspect of LLVM is that it is
 a lot of other language and tools projects. This section lists some of the
 projects that have already been updated to work with LLVM 3.9.
 
-* A project
-
 LDC - the LLVM-based D compiler
 -------------------------------
 

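For context on the notail bullet added above: the marker is spelled directly on the call instruction in LLVM IR. A minimal sketch (callee name and types are made up for illustration, not taken from the commit):

  declare i32 @callee(i32)

  define i32 @caller(i32 %x) {
    %r = notail call i32 @callee(i32 %x)
    ret i32 %r
  }

With the marker present, optimization passes will not turn this call site into a tail or musttail call.
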
Modified: vendor/llvm/dist/include/llvm/Transforms/Scalar/Reassociate.h
==============================================================================
--- vendor/llvm/dist/include/llvm/Transforms/Scalar/Reassociate.h	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/include/llvm/Transforms/Scalar/Reassociate.h	Wed Aug 24 17:35:37 2016	(r304759)
@@ -65,7 +65,7 @@ public:
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &);
 
 private:
-  void BuildRankMap(Function &F, ReversePostOrderTraversal<Function *> &RPOT);
+  void BuildRankMap(Function &F);
   unsigned getRank(Value *V);
   void canonicalizeOperands(Instruction *I);
   void ReassociateExpression(BinaryOperator *I);

Modified: vendor/llvm/dist/lib/Analysis/ScalarEvolution.cpp
==============================================================================
--- vendor/llvm/dist/lib/Analysis/ScalarEvolution.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Analysis/ScalarEvolution.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -4822,6 +4822,10 @@ bool ScalarEvolution::isSCEVExprNeverPoi
   // from different loops, so that we know which loop to prove that I is
   // executed in.
   for (unsigned OpIndex = 0; OpIndex < I->getNumOperands(); ++OpIndex) {
+    // I could be an extractvalue from a call to an overflow intrinsic.
+    // TODO: We can do better here in some cases.
+    if (!isSCEVable(I->getOperand(OpIndex)->getType()))
+      return false;
     const SCEV *Op = getSCEV(I->getOperand(OpIndex));
     if (auto *AddRec = dyn_cast<SCEVAddRecExpr>(Op)) {
       bool AllOtherOpsLoopInvariant = true;

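The new isSCEVable() guard above bails out when an operand's type cannot be modeled by SCEV, for example the { i32, i1 } aggregate returned by an overflow intrinsic. A minimal sketch of the pattern (the full reproducer is the flags-from-poison.ll test added further down):

  %agg = call { i32, i1 } @llvm.ssub.with.overflow.i32(i32 %n, i32 1)
  %val = extractvalue { i32, i1 } %agg, 0   ; operand %agg has struct type, not SCEV-able
  %ovf = extractvalue { i32, i1 } %agg, 1
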
Modified: vendor/llvm/dist/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
==============================================================================
--- vendor/llvm/dist/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -1258,8 +1258,11 @@ AArch64LoadStoreOpt::findMatchingInsn(Ma
         if (MIIsUnscaled) {
           // If the unscaled offset isn't a multiple of the MemSize, we can't
           // pair the operations together: bail and keep looking.
-          if (MIOffset % MemSize)
+          if (MIOffset % MemSize) {
+            trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);
+            MemInsns.push_back(&MI);
             continue;
+          }
           MIOffset /= MemSize;
         } else {
           MIOffset *= MemSize;
@@ -1424,9 +1427,6 @@ bool AArch64LoadStoreOpt::isMatchingUpda
   default:
     break;
   case AArch64::SUBXri:
-    // Negate the offset for a SUB instruction.
-    Offset *= -1;
-  // FALLTHROUGH
   case AArch64::ADDXri:
     // Make sure it's a vanilla immediate operand, not a relocation or
     // anything else we can't handle.
@@ -1444,6 +1444,9 @@ bool AArch64LoadStoreOpt::isMatchingUpda
 
     bool IsPairedInsn = isPairedLdSt(MemMI);
     int UpdateOffset = MI.getOperand(2).getImm();
+    if (MI.getOpcode() == AArch64::SUBXri)
+      UpdateOffset = -UpdateOffset;
+
     // For non-paired load/store instructions, the immediate must fit in a
     // signed 9-bit integer.
     if (!IsPairedInsn && (UpdateOffset > 255 || UpdateOffset < -256))
@@ -1458,13 +1461,13 @@ bool AArch64LoadStoreOpt::isMatchingUpda
         break;
 
       int ScaledOffset = UpdateOffset / Scale;
-      if (ScaledOffset > 64 || ScaledOffset < -64)
+      if (ScaledOffset > 63 || ScaledOffset < -64)
         break;
     }
 
     // If we have a non-zero Offset, we check that it matches the amount
     // we're adding to the register.
-    if (!Offset || Offset == MI.getOperand(2).getImm())
+    if (!Offset || Offset == UpdateOffset)
       return true;
     break;
   }

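Rough arithmetic behind the tightened bound above, assuming the usual post-index encodings (a signed 9-bit immediate for single loads/stores, a scaled signed 7-bit field for paired ones): a non-paired update must fit -256..255, and a paired update divided by the access size must fit -64..63, so for 8-byte pairs the byte offsets are -512..504. The old "> 64" comparison wrongly accepted +512; the ldst-opt.ll tests added below pin the boundary cases at #-256 (single) and #-512 (paired).
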
Modified: vendor/llvm/dist/lib/Target/PowerPC/PPCISelLowering.cpp
==============================================================================
--- vendor/llvm/dist/lib/Target/PowerPC/PPCISelLowering.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Target/PowerPC/PPCISelLowering.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -4033,11 +4033,18 @@ PPCTargetLowering::IsEligibleForTailCall
   if (CalleeCC != CallingConv::Fast && CalleeCC != CallingConv::C)
     return false;
 
-  // Functions containing by val parameters are not supported.
+  // A caller that contains any byval parameter is not supported.
   if (std::any_of(Ins.begin(), Ins.end(),
                   [](const ISD::InputArg& IA) { return IA.Flags.isByVal(); }))
     return false;
 
+  // A callee that contains any byval parameter is not supported either.
+  // Note: This is a quick workaround, because in some cases, e.g.
+  // caller's stack size > callee's stack size, we are still able to apply
+  // sibling call optimization. See: https://reviews.llvm.org/D23441#513574
+  if (any_of(Outs, [](const ISD::OutputArg& OA) { return OA.Flags.isByVal(); }))
+    return false;
+
   // No TCO/SCO on indirect call because Caller have to restore its TOC
   if (!isFunctionGlobalAddress(Callee) &&
       !isa<ExternalSymbolSDNode>(Callee))

Modified: vendor/llvm/dist/lib/Transforms/Scalar/Reassociate.cpp
==============================================================================
--- vendor/llvm/dist/lib/Transforms/Scalar/Reassociate.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Transforms/Scalar/Reassociate.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -145,8 +145,7 @@ static BinaryOperator *isReassociableOp(
   return nullptr;
 }
 
-void ReassociatePass::BuildRankMap(
-    Function &F, ReversePostOrderTraversal<Function *> &RPOT) {
+void ReassociatePass::BuildRankMap(Function &F) {
   unsigned i = 2;
 
   // Assign distinct ranks to function arguments.
@@ -155,6 +154,7 @@ void ReassociatePass::BuildRankMap(
     DEBUG(dbgs() << "Calculated Rank[" << I->getName() << "] = " << i << "\n");
   }
 
+  ReversePostOrderTraversal<Function *> RPOT(&F);
   for (BasicBlock *BB : RPOT) {
     unsigned BBRank = RankMap[BB] = ++i << 16;
 
@@ -2172,28 +2172,13 @@ void ReassociatePass::ReassociateExpress
 }
 
 PreservedAnalyses ReassociatePass::run(Function &F, FunctionAnalysisManager &) {
-  // Reassociate needs for each instruction to have its operands already
-  // processed, so we first perform a RPOT of the basic blocks so that
-  // when we process a basic block, all its dominators have been processed
-  // before.
-  ReversePostOrderTraversal<Function *> RPOT(&F);
-  BuildRankMap(F, RPOT);
+  // Calculate the rank map for F.
+  BuildRankMap(F);
 
   MadeChange = false;
-  for (BasicBlock *BI : RPOT) {
-    // Use a worklist to keep track of which instructions have been processed
-    // (and which insts won't be optimized again) so when redoing insts,
-    // optimize insts rightaway which won't be processed later.
-    SmallSet<Instruction *, 8> Worklist;
-
-    // Insert all instructions in the BB
-    for (Instruction &I : *BI)
-      Worklist.insert(&I);
-
+  for (Function::iterator BI = F.begin(), BE = F.end(); BI != BE; ++BI) {
     // Optimize every instruction in the basic block.
-    for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); II != IE;) {
-      // This instruction has been processed.
-      Worklist.erase(&*II);
+    for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); II != IE;)
       if (isInstructionTriviallyDead(&*II)) {
         EraseInst(&*II++);
       } else {
@@ -2202,22 +2187,27 @@ PreservedAnalyses ReassociatePass::run(F
         ++II;
       }
 
-      // If the above optimizations produced new instructions to optimize or
-      // made modifications which need to be redone, do them now if they won't
-      // be handled later.
-      while (!RedoInsts.empty()) {
-        Instruction *I = RedoInsts.pop_back_val();
-        // Process instructions that won't be processed later, either
-        // inside the block itself or in another basic block (based on rank),
-        // since these will be processed later.
-        if ((I->getParent() != BI || !Worklist.count(I)) &&
-            RankMap[I->getParent()] <= RankMap[BI]) {
-          if (isInstructionTriviallyDead(I))
-            EraseInst(I);
-          else
-            OptimizeInst(I);
-        }
-      }
+    // Make a copy of all the instructions to be redone so we can remove dead
+    // instructions.
+    SetVector<AssertingVH<Instruction>> ToRedo(RedoInsts);
+    // Iterate over all instructions to be reevaluated and remove trivially dead
+    // instructions. If any operand of the trivially dead instruction becomes
+    // dead mark it for deletion as well. Continue this process until all
+    // trivially dead instructions have been removed.
+    while (!ToRedo.empty()) {
+      Instruction *I = ToRedo.pop_back_val();
+      if (isInstructionTriviallyDead(I))
+        RecursivelyEraseDeadInsts(I, ToRedo);
+    }
+
+    // Now that we have removed dead instructions, we can reoptimize the
+    // remaining instructions.
+    while (!RedoInsts.empty()) {
+      Instruction *I = RedoInsts.pop_back_val();
+      if (isInstructionTriviallyDead(I))
+        EraseInst(I);
+      else
+        OptimizeInst(I);
     }
   }
 

Modified: vendor/llvm/dist/lib/Transforms/Utils/CloneFunction.cpp
==============================================================================
--- vendor/llvm/dist/lib/Transforms/Utils/CloneFunction.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Transforms/Utils/CloneFunction.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -566,6 +566,12 @@ void llvm::CloneAndPruneIntoFromInst(Fun
     if (!I)
       continue;
 
+    // Skip over non-intrinsic callsites, we don't want to remove any nodes from
+    // the CGSCC.
+    CallSite CS = CallSite(I);
+    if (CS && CS.getCalledFunction() && !CS.getCalledFunction()->isIntrinsic())
+      continue;
+
     // See if this instruction simplifies.
     Value *SimpleV = SimplifyInstruction(I, DL);
     if (!SimpleV)

Modified: vendor/llvm/dist/lib/Transforms/Vectorize/SLPVectorizer.cpp
==============================================================================
--- vendor/llvm/dist/lib/Transforms/Vectorize/SLPVectorizer.cpp	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/lib/Transforms/Vectorize/SLPVectorizer.cpp	Wed Aug 24 17:35:37 2016	(r304759)
@@ -82,8 +82,13 @@ static cl::opt<int> MinVectorRegSizeOpti
     "slp-min-reg-size", cl::init(128), cl::Hidden,
     cl::desc("Attempt to vectorize for this register size in bits"));
 
-// FIXME: Set this via cl::opt to allow overriding.
-static const unsigned RecursionMaxDepth = 12;
+static cl::opt<unsigned> RecursionMaxDepth(
+    "slp-recursion-max-depth", cl::init(12), cl::Hidden,
+    cl::desc("Limit the recursion depth when building a vectorizable tree"));
+
+static cl::opt<unsigned> MinTreeSize(
+    "slp-min-tree-size", cl::init(3), cl::Hidden,
+    cl::desc("Only vectorize small trees if they are fully vectorizable"));
 
 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
@@ -1842,7 +1847,7 @@ int BoUpSLP::getTreeCost() {
         VectorizableTree.size() << ".\n");
 
   // We only vectorize tiny trees if it is fully vectorizable.
-  if (VectorizableTree.size() < 3 && !isFullyVectorizableTinyTree()) {
+  if (VectorizableTree.size() < MinTreeSize && !isFullyVectorizableTinyTree()) {
     if (VectorizableTree.empty()) {
       assert(!ExternalUses.size() && "We should not have any external users");
     }
@@ -2124,11 +2129,61 @@ void BoUpSLP::reorderInputsAccordingToOp
 }
 
 void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {
-  Instruction *VL0 = cast<Instruction>(VL[0]);
-  BasicBlock::iterator NextInst(VL0);
-  ++NextInst;
-  Builder.SetInsertPoint(VL0->getParent(), NextInst);
-  Builder.SetCurrentDebugLocation(VL0->getDebugLoc());
+
+  // Get the basic block this bundle is in. All instructions in the bundle
+  // should be in this block.
+  auto *Front = cast<Instruction>(VL.front());
+  auto *BB = Front->getParent();
+  assert(all_of(make_range(VL.begin(), VL.end()), [&](Value *V) -> bool {
+    return cast<Instruction>(V)->getParent() == BB;
+  }));
+
+  // The last instruction in the bundle in program order.
+  Instruction *LastInst = nullptr;
+
+  // Find the last instruction. The common case should be that BB has been
+  // scheduled, and the last instruction is VL.back(). So we start with
+  // VL.back() and iterate over schedule data until we reach the end of the
+  // bundle. The end of the bundle is marked by null ScheduleData.
+  if (BlocksSchedules.count(BB)) {
+    auto *Bundle = BlocksSchedules[BB]->getScheduleData(VL.back());
+    if (Bundle && Bundle->isPartOfBundle())
+      for (; Bundle; Bundle = Bundle->NextInBundle)
+        LastInst = Bundle->Inst;
+  }
+
+  // LastInst can still be null at this point if there's either not an entry
+  // for BB in BlocksSchedules or there's no ScheduleData available for
+  // VL.back(). This can be the case if buildTree_rec aborts for various
+  // reasons (e.g., the maximum recursion depth is reached, the maximum region
+  // size is reached, etc.). ScheduleData is initialized in the scheduling
+  // "dry-run".
+  //
+  // If this happens, we can still find the last instruction by brute force. We
+  // iterate forwards from Front (inclusive) until we either see all
+  // instructions in the bundle or reach the end of the block. If Front is the
+  // last instruction in program order, LastInst will be set to Front, and we
+  // will visit all the remaining instructions in the block.
+  //
+  // One of the reasons we exit early from buildTree_rec is to place an upper
+  // bound on compile-time. Thus, taking an additional compile-time hit here is
+  // not ideal. However, this should be exceedingly rare since it requires that
+  // we both exit early from buildTree_rec and that the bundle be out-of-order
+  // (causing us to iterate all the way to the end of the block).
+  if (!LastInst) {
+    SmallPtrSet<Value *, 16> Bundle(VL.begin(), VL.end());
+    for (auto &I : make_range(BasicBlock::iterator(Front), BB->end())) {
+      if (Bundle.erase(&I))
+        LastInst = &I;
+      if (Bundle.empty())
+        break;
+    }
+  }
+
+  // Set the insertion point after the last instruction in the bundle. Set the
+  // debug location to Front.
+  Builder.SetInsertPoint(BB, next(BasicBlock::iterator(LastInst)));
+  Builder.SetCurrentDebugLocation(Front->getDebugLoc());
 }
 
 Value *BoUpSLP::Gather(ArrayRef<Value *> VL, VectorType *Ty) {
@@ -2206,7 +2261,9 @@ Value *BoUpSLP::vectorizeTree(TreeEntry 
 
   if (E->NeedToGather) {
     setInsertPointAfterBundle(E->Scalars);
-    return Gather(E->Scalars, VecTy);
+    auto *V = Gather(E->Scalars, VecTy);
+    E->VectorizedValue = V;
+    return V;
   }
 
   unsigned Opcode = getSameOpcode(E->Scalars);
@@ -2253,7 +2310,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry 
         E->VectorizedValue = V;
         return V;
       }
-      return Gather(E->Scalars, VecTy);
+      setInsertPointAfterBundle(E->Scalars);
+      auto *V = Gather(E->Scalars, VecTy);
+      E->VectorizedValue = V;
+      return V;
     }
     case Instruction::ExtractValue: {
       if (canReuseExtract(E->Scalars, Instruction::ExtractValue)) {
@@ -2265,7 +2325,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry 
         E->VectorizedValue = V;
         return propagateMetadata(V, E->Scalars);
       }
-      return Gather(E->Scalars, VecTy);
+      setInsertPointAfterBundle(E->Scalars);
+      auto *V = Gather(E->Scalars, VecTy);
+      E->VectorizedValue = V;
+      return V;
     }
     case Instruction::ZExt:
     case Instruction::SExt:

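The two cl::opt flags introduced above (slp-recursion-max-depth and slp-min-tree-size) can be overridden from opt; the gather-root.ll test added below already passes -slp-min-tree-size=0. A hypothetical RUN line combining both, with values picked purely for illustration:

  ; RUN: opt < %s -slp-vectorizer -slp-recursion-max-depth=6 -slp-min-tree-size=4 -S | FileCheck %s
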
Modified: vendor/llvm/dist/test/Analysis/ScalarEvolution/flags-from-poison.ll
==============================================================================
--- vendor/llvm/dist/test/Analysis/ScalarEvolution/flags-from-poison.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/Analysis/ScalarEvolution/flags-from-poison.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -688,3 +688,52 @@ outer.be:
 exit:
   ret void
 }
+
+
+; PR28932: Don't assert on non-SCEV-able value %2.
+%struct.anon = type { i8* }
+@a = common global %struct.anon* null, align 8
+@b = common global i32 0, align 4
+declare { i32, i1 } @llvm.ssub.with.overflow.i32(i32, i32)
+declare void @llvm.trap()
+define i32 @pr28932() {
+entry:
+  %.pre = load %struct.anon*, %struct.anon** @a, align 8
+  %.pre7 = load i32, i32* @b, align 4
+  br label %for.cond
+
+for.cond:                                         ; preds = %cont6, %entry
+  %0 = phi i32 [ %3, %cont6 ], [ %.pre7, %entry ]
+  %1 = phi %struct.anon* [ %.ph, %cont6 ], [ %.pre, %entry ]
+  %tobool = icmp eq %struct.anon* %1, null
+  %2 = tail call { i32, i1 } @llvm.ssub.with.overflow.i32(i32 %0, i32 1)
+  %3 = extractvalue { i32, i1 } %2, 0
+  %4 = extractvalue { i32, i1 } %2, 1
+  %idxprom = sext i32 %3 to i64
+  %5 = getelementptr inbounds %struct.anon, %struct.anon* %1, i64 0, i32 0
+  %6 = load i8*, i8** %5, align 8
+  %7 = getelementptr inbounds i8, i8* %6, i64 %idxprom
+  %8 = load i8, i8* %7, align 1
+  br i1 %tobool, label %if.else, label %if.then
+
+if.then:                                          ; preds = %for.cond
+  br i1 %4, label %trap, label %cont6
+
+trap:                                             ; preds = %if.else, %if.then
+  tail call void @llvm.trap()
+  unreachable
+
+if.else:                                          ; preds = %for.cond
+  br i1 %4, label %trap, label %cont1
+
+cont1:                                            ; preds = %if.else
+  %conv5 = sext i8 %8 to i64
+  %9 = inttoptr i64 %conv5 to %struct.anon*
+  store %struct.anon* %9, %struct.anon** @a, align 8
+  br label %cont6
+
+cont6:                                            ; preds = %cont1, %if.then
+  %.ph = phi %struct.anon* [ %9, %cont1 ], [ %1, %if.then ]
+  store i32 %3, i32* @b, align 4
+  br label %for.cond
+}

Modified: vendor/llvm/dist/test/CodeGen/AArch64/ldst-opt.ll
==============================================================================
--- vendor/llvm/dist/test/CodeGen/AArch64/ldst-opt.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/CodeGen/AArch64/ldst-opt.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -1,4 +1,4 @@
-; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-atomic-cfg-tidy=0 -verify-machineinstrs -o - %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -o - %s | FileCheck %s
 
 ; This file contains tests for the AArch64 load/store optimizer.
 
@@ -1232,3 +1232,104 @@ for.body:
 end:
   ret void
 }
+
+define void @post-indexed-sub-doubleword-offset-min(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-sub-doubleword-offset-min
+; CHECK: ldr x{{[0-9]+}}, [x{{[0-9]+}}], #-256
+; CHECK: str x{{[0-9]+}}, [x{{[0-9]+}}], #-256
+  br label %for.body
+for.body:
+  %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+  %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+  %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+  %gep1 = getelementptr i64, i64* %phi1, i64 1
+  %load1 = load i64, i64* %gep1
+  %gep2 = getelementptr i64, i64* %phi2, i64 1
+  store i64 %load1, i64* %gep2
+  %load2 = load i64, i64* %phi1
+  store i64 %load2, i64* %phi2
+  %dec.i = add nsw i64 %i, -1
+  %gep3 = getelementptr i64, i64* %phi2, i64 -32
+  %gep4 = getelementptr i64, i64* %phi1, i64 -32
+  %cond = icmp sgt i64 %dec.i, 0
+  br i1 %cond, label %for.body, label %end
+end:
+  ret void
+}
+
+define void @post-indexed-doubleword-offset-out-of-range(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-doubleword-offset-out-of-range
+; CHECK: ldr x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #256
+; CHECK: str x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #256
+
+  br label %for.body
+for.body:
+  %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+  %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+  %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+  %gep1 = getelementptr i64, i64* %phi1, i64 1
+  %load1 = load i64, i64* %gep1
+  %gep2 = getelementptr i64, i64* %phi2, i64 1
+  store i64 %load1, i64* %gep2
+  %load2 = load i64, i64* %phi1
+  store i64 %load2, i64* %phi2
+  %dec.i = add nsw i64 %i, -1
+  %gep3 = getelementptr i64, i64* %phi2, i64 32
+  %gep4 = getelementptr i64, i64* %phi1, i64 32
+  %cond = icmp sgt i64 %dec.i, 0
+  br i1 %cond, label %for.body, label %end
+end:
+  ret void
+}
+
+define void @post-indexed-paired-min-offset(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-paired-min-offset
+; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
+; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
+  br label %for.body
+for.body:
+  %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+  %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+  %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+  %gep1 = getelementptr i64, i64* %phi1, i64 1
+  %load1 = load i64, i64* %gep1
+  %gep2 = getelementptr i64, i64* %phi2, i64 1
+  %load2 = load i64, i64* %phi1
+  store i64 %load1, i64* %gep2
+  store i64 %load2, i64* %phi2
+  %dec.i = add nsw i64 %i, -1
+  %gep3 = getelementptr i64, i64* %phi2, i64 -64
+  %gep4 = getelementptr i64, i64* %phi1, i64 -64
+  %cond = icmp sgt i64 %dec.i, 0
+  br i1 %cond, label %for.body, label %end
+end:
+  ret void
+}
+
+define void @post-indexed-paired-offset-out-of-range(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-paired-offset-out-of-range
+; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #512
+; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #512
+  br label %for.body
+for.body:
+  %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+  %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+  %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+  %gep1 = getelementptr i64, i64* %phi1, i64 1
+  %load1 = load i64, i64* %phi1
+  %gep2 = getelementptr i64, i64* %phi2, i64 1
+  %load2 = load i64, i64* %gep1
+  store i64 %load1, i64* %gep2
+  store i64 %load2, i64* %phi2
+  %dec.i = add nsw i64 %i, -1
+  %gep3 = getelementptr i64, i64* %phi2, i64 64
+  %gep4 = getelementptr i64, i64* %phi1, i64 64
+  %cond = icmp sgt i64 %dec.i, 0
+  br i1 %cond, label %for.body, label %end
+end:
+  ret void
+}

Added: vendor/llvm/dist/test/CodeGen/AArch64/ldst-paired-aliasing.ll
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ vendor/llvm/dist/test/CodeGen/AArch64/ldst-paired-aliasing.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -0,0 +1,47 @@
+; RUN: llc -mcpu cortex-a53 < %s | FileCheck %s
+target datalayout = "e-m:e-i64:64-i128:128-n8:16:32:64-S128"
+target triple = "aarch64--linux-gnu"
+
+declare void @f(i8*, i8*)
+declare void @f2(i8*, i8*)
+declare void @_Z5setupv()
+declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) #3
+
+define i32 @main() local_unnamed_addr #1 {
+; Make sure the stores happen in the correct order (the exact instructions could change).
+; CHECK-LABEL: main:
+; CHECK: str q0, [sp, #48]
+; CHECK: ldr w8, [sp, #48]
+; CHECK: stur q1, [sp, #72]
+; CHECK: str q0, [sp, #64]
+; CHECK: str w9, [sp, #80]
+
+for.body.lr.ph.i.i.i.i.i.i63:
+  %b1 = alloca [10 x i32], align 16
+  %x0 = bitcast [10 x i32]* %b1 to i8*
+  %b2 = alloca [10 x i32], align 16
+  %x1 = bitcast [10 x i32]* %b2 to i8*
+  tail call void @_Z5setupv()
+  %x2 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 6
+  %x3 = bitcast i32* %x2 to i8*
+  call void @llvm.memset.p0i8.i64(i8* %x3, i8 0, i64 16, i32 8, i1 false)
+  %arraydecay2 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 0
+  %x4 = bitcast [10 x i32]* %b1 to <4 x i32>*
+  store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %x4, align 16
+  %incdec.ptr.i7.i.i.i.i.i.i64.3 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 4
+  %x5 = bitcast i32* %incdec.ptr.i7.i.i.i.i.i.i64.3 to <4 x i32>*
+  store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %x5, align 16
+  %incdec.ptr.i7.i.i.i.i.i.i64.7 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 8
+  store i32 1, i32* %incdec.ptr.i7.i.i.i.i.i.i64.7, align 16
+  %x6 = load i32, i32* %arraydecay2, align 16
+  %cmp6 = icmp eq i32 %x6, 1
+  br i1 %cmp6, label %for.inc, label %if.then
+
+for.inc:
+  call void @f(i8* %x0, i8* %x1)
+  ret i32 0
+
+if.then:
+  call void @f2(i8* %x0, i8* %x1)
+  ret i32 0
+}

Modified: vendor/llvm/dist/test/CodeGen/PowerPC/ppc64-sibcall.ll
==============================================================================
--- vendor/llvm/dist/test/CodeGen/PowerPC/ppc64-sibcall.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/CodeGen/PowerPC/ppc64-sibcall.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -189,3 +189,15 @@ define void @w_caller(i8* %ptr) {
 ; CHECK-SCO-LABEL: w_caller:
 ; CHECK-SCO: bl w_callee
 }
+
+%struct.byvalTest = type { [8 x i8] }
+@byval = common global %struct.byvalTest zeroinitializer
+
+define void @byval_callee(%struct.byvalTest* byval %ptr) { ret void }
+define void @byval_caller() {
+  tail call void @byval_callee(%struct.byvalTest* byval @byval)
+  ret void
+
+; CHECK-SCO-LABEL: bl byval_callee
+; CHECK-SCO: bl byval_callee
+}

Modified: vendor/llvm/dist/test/Transforms/Inline/inline_constprop.ll
==============================================================================
--- vendor/llvm/dist/test/Transforms/Inline/inline_constprop.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/Transforms/Inline/inline_constprop.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -299,8 +299,8 @@ entry:
 }
 
 ; CHECK-LABEL: define i32 @PR28802(
-; CHECK: call i32 @PR28802.external(i32 0)
-; CHECK: ret i32 0
+; CHECK: %[[call:.*]] = call i32 @PR28802.external(i32 0)
+; CHECK: ret i32 %[[call]]
 
 define internal i32 @PR28848.callee(i32 %p2, i1 %c) {
 entry:
@@ -322,3 +322,25 @@ entry:
 }
 ; CHECK-LABEL: define i32 @PR28848(
 ; CHECK: ret i32 0
+
+define internal void @callee7(i16 %param1, i16 %param2) {
+entry:
+  br label %bb
+
+bb:
+  %phi = phi i16 [ %param2, %entry ]
+  %add = add i16 %phi, %param1
+  ret void
+}
+
+declare i16 @caller7.external(i16 returned)
+
+define void @caller7() {
+bb1:
+  %call = call i16 @caller7.external(i16 1)
+  call void @callee7(i16 0, i16 %call)
+  ret void
+}
+; CHECK-LABEL: define void @caller7(
+; CHECK: %call = call i16 @caller7.external(i16 1)
+; CHECK-NEXT: ret void

Modified: vendor/llvm/dist/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll
==============================================================================
--- vendor/llvm/dist/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -1,8 +1,8 @@
 ; RUN: opt < %s -reassociate -S | FileCheck %s
 ; CHECK-LABEL: faddsubAssoc1
-; CHECK: [[TMP1:%.*]] = fsub fast half 0xH8000, %a
-; CHECK: [[TMP2:%.*]] = fadd fast half %b, [[TMP1]]
-; CHECK: fmul fast half [[TMP2]], 0xH4500
+; CHECK: [[TMP1:%tmp.*]] = fmul fast half %a, 0xH4500
+; CHECK: [[TMP2:%tmp.*]] = fmul fast half %b, 0xH4500
+; CHECK: fsub fast half [[TMP2]], [[TMP1]]
 ; CHECK: ret
 ; Input is A op (B op C)
 define half @faddsubAssoc1(half %a, half %b) {

Modified: vendor/llvm/dist/test/Transforms/Reassociate/xor_reassoc.ll
==============================================================================
--- vendor/llvm/dist/test/Transforms/Reassociate/xor_reassoc.ll	Wed Aug 24 17:26:11 2016	(r304758)
+++ vendor/llvm/dist/test/Transforms/Reassociate/xor_reassoc.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -88,8 +88,8 @@ define i32 @xor_special2(i32 %x, i32 %y)
   %xor1 = xor i32 %xor, %and
   ret i32 %xor1
 ; CHECK-LABEL: @xor_special2(
-; CHECK: %xor = xor i32 %y, 123
-; CHECK: %xor1 = xor i32 %xor, %x
+; CHECK: %xor = xor i32 %x, 123
+; CHECK: %xor1 = xor i32 %xor, %y
 ; CHECK: ret i32 %xor1
 }
 

Added: vendor/llvm/dist/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ vendor/llvm/dist/test/Transforms/SLPVectorizer/AArch64/gather-root.ll	Wed Aug 24 17:35:37 2016	(r304759)
@@ -0,0 +1,87 @@
+; RUN: opt < %s -slp-vectorizer -S | FileCheck %s --check-prefix=DEFAULT
+; RUN: opt < %s -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-30 -slp-vectorizer -S | FileCheck %s --check-prefix=GATHER
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64--linux-gnu"
+
+@a = common global [80 x i8] zeroinitializer, align 16
+
+; DEFAULT-LABEL: @PR28330(
+; DEFAULT: %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+; DEFAULT: %[[S0:.+]] = select <8 x i1> %1, <8 x i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80>
+; DEFAULT: %[[R0:.+]] = shufflevector <8 x i32> %[[S0]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R1:.+]] = add <8 x i32> %[[S0]], %[[R0]]
+; DEFAULT: %[[R2:.+]] = shufflevector <8 x i32> %[[R1]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R3:.+]] = add <8 x i32> %[[R1]], %[[R2]]
+; DEFAULT: %[[R4:.+]] = shufflevector <8 x i32> %[[R3]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R5:.+]] = add <8 x i32> %[[R3]], %[[R4]]
+; DEFAULT: %[[R6:.+]] = extractelement <8 x i32> %[[R5]], i32 0
+; DEFAULT: %tmp34 = add i32 %[[R6]], %tmp17
+;
+; GATHER-LABEL: @PR28330(
+; GATHER: %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+; GATHER: %tmp19 = select i1 %tmp1, i32 -720, i32 -80
+; GATHER: %tmp21 = select i1 %tmp3, i32 -720, i32 -80
+; GATHER: %tmp23 = select i1 %tmp5, i32 -720, i32 -80
+; GATHER: %tmp25 = select i1 %tmp7, i32 -720, i32 -80
+; GATHER: %tmp27 = select i1 %tmp9, i32 -720, i32 -80
+; GATHER: %tmp29 = select i1 %tmp11, i32 -720, i32 -80
+; GATHER: %tmp31 = select i1 %tmp13, i32 -720, i32 -80
+; GATHER: %tmp33 = select i1 %tmp15, i32 -720, i32 -80
+; GATHER: %[[I0:.+]] = insertelement <8 x i32> undef, i32 %tmp19, i32 0
+; GATHER: %[[I1:.+]] = insertelement <8 x i32> %[[I0]], i32 %tmp21, i32 1
+; GATHER: %[[I2:.+]] = insertelement <8 x i32> %[[I1]], i32 %tmp23, i32 2
+; GATHER: %[[I3:.+]] = insertelement <8 x i32> %[[I2]], i32 %tmp25, i32 3
+; GATHER: %[[I4:.+]] = insertelement <8 x i32> %[[I3]], i32 %tmp27, i32 4
+; GATHER: %[[I5:.+]] = insertelement <8 x i32> %[[I4]], i32 %tmp29, i32 5
+; GATHER: %[[I6:.+]] = insertelement <8 x i32> %[[I5]], i32 %tmp31, i32 6
+; GATHER: %[[I7:.+]] = insertelement <8 x i32> %[[I6]], i32 %tmp33, i32 7
+; GATHER: %[[R0:.+]] = shufflevector <8 x i32> %[[I7]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R1:.+]] = add <8 x i32> %[[I7]], %[[R0]]
+; GATHER: %[[R2:.+]] = shufflevector <8 x i32> %[[R1]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R3:.+]] = add <8 x i32> %[[R1]], %[[R2]]
+; GATHER: %[[R4:.+]] = shufflevector <8 x i32> %[[R3]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R5:.+]] = add <8 x i32> %[[R3]], %[[R4]]
+; GATHER: %[[R6:.+]] = extractelement <8 x i32> %[[R5]], i32 0
+; GATHER: %tmp34 = add i32 %[[R6]], %tmp17
+
+define void @PR28330(i32 %n) {
+entry:
+  %tmp0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
+  %tmp1 = icmp eq i8 %tmp0, 0
+  %tmp2 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
+  %tmp3 = icmp eq i8 %tmp2, 0
+  %tmp4 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
+  %tmp5 = icmp eq i8 %tmp4, 0
+  %tmp6 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
+  %tmp7 = icmp eq i8 %tmp6, 0
+  %tmp8 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
+  %tmp9 = icmp eq i8 %tmp8, 0
+  %tmp10 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
+  %tmp11 = icmp eq i8 %tmp10, 0
+  %tmp12 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
+  %tmp13 = icmp eq i8 %tmp12, 0
+  %tmp14 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
+  %tmp15 = icmp eq i8 %tmp14, 0
+  br label %for.body
+
+for.body:
+  %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+  %tmp19 = select i1 %tmp1, i32 -720, i32 -80
+  %tmp20 = add i32 %tmp17, %tmp19
+  %tmp21 = select i1 %tmp3, i32 -720, i32 -80
+  %tmp22 = add i32 %tmp20, %tmp21
+  %tmp23 = select i1 %tmp5, i32 -720, i32 -80
+  %tmp24 = add i32 %tmp22, %tmp23
+  %tmp25 = select i1 %tmp7, i32 -720, i32 -80
+  %tmp26 = add i32 %tmp24, %tmp25
+  %tmp27 = select i1 %tmp9, i32 -720, i32 -80
+  %tmp28 = add i32 %tmp26, %tmp27
+  %tmp29 = select i1 %tmp11, i32 -720, i32 -80
+  %tmp30 = add i32 %tmp28, %tmp29
+  %tmp31 = select i1 %tmp13, i32 -720, i32 -80
+  %tmp32 = add i32 %tmp30, %tmp31
+  %tmp33 = select i1 %tmp15, i32 -720, i32 -80
+  %tmp34 = add i32 %tmp32, %tmp33
+  br label %for.body
+}

