Collectives Pattern. Parallel Computing CIS 410/510 Department of Computer and Information Science. Lecture 8 Collective Pattern



Outline
- What are Collectives?
- Reduce Pattern
- Scan Pattern
- Sorting

Collectives
- Collective operations deal with a collection of data as a whole, rather than as separate elements
- Collective patterns include:
  - Reduce
  - Scan
  - Partition
  - Scatter
  - Gather
- Reduce and Scan will be covered in this lecture

Reduce
- Reduce is used to combine a collection of elements into one summary value
- A combiner function combines elements pairwise
- A combiner function only needs to be associative to be parallelizable
- Example combiner functions:
  - Addition
  - Multiplication
  - Maximum / Minimum
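A minimal sketch of the combiners listed above, using Python's `functools.reduce` as a serial stand-in for the pattern. The input list is an assumption chosen for illustration.

```python
from functools import reduce
import operator

data = [1, 2, 5, 4, 9, 7, 0, 1]

# Each combiner is associative, so the grouping of the pairwise
# applications does not change the result (for exact integer arithmetic).
total = reduce(operator.add, data)      # sum of all elements
product = reduce(operator.mul, data)    # product of all elements
largest = reduce(max, data)             # maximum element
smallest = reduce(min, data)            # minimum element
```

`functools.reduce` always combines left to right; a parallel implementation is free to regroup only because the combiner is associative.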

Reduce
[Figure: serial reduction vs. parallel reduction]

Reduce
- Vectorization

Reduce
- Tiling breaks the work into chunks; each worker reduces its chunk serially, and the per-chunk results are then combined
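One way tiling might look in code, as a sketch only: the helper name `tiled_reduce` and the `tile_size` parameter are hypothetical, and the per-tile loop runs serially here where a real implementation would hand each tile to a worker.

```python
from functools import reduce
import operator

def tiled_reduce(combine, data, tile_size):
    """Reduce each tile serially, then reduce the per-tile results."""
    if not data:
        raise ValueError("tiled_reduce needs at least one element")
    # Split the input into contiguous tiles of at most tile_size elements.
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    # Each tile is independent, so these reductions could run on separate workers.
    partials = [reduce(combine, tile) for tile in tiles]
    # Combine the one-value-per-tile results into the final answer.
    return reduce(combine, partials)

result = tiled_reduce(operator.add, [1, 2, 5, 4, 9, 7, 0, 1], tile_size=3)
```

Because tiles are contiguous and combined in order, the combiner needs to be associative but not commutative.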

Reduce Add Example
Input: 1 2 5 4 9 7 0 1
Serial reduction, running sums left to right: 3 8 12 21 28 28 29 (result 29)
Parallel (tree) reduction, pairwise sums by level: 3 9 16 1, then 12 17, then 29 (result 29)
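The tree-shaped order in the add example can be sketched as follows. The function name `tree_reduce_levels` is hypothetical; it records every level so the intermediate sums are visible, and the pairs within a level are independent of one another.

```python
import operator

def tree_reduce_levels(combine, data):
    """Return every level of a pairwise (tree) reduction, input first."""
    levels = [list(data)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Combine neighbours (0,1), (2,3), ...; each pair could run in parallel.
        nxt = [combine(prev[i], prev[i + 1]) for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2 == 1:
            nxt.append(prev[-1])  # an odd element is carried to the next level
        levels.append(nxt)
    return levels

levels = tree_reduce_levels(operator.add, [1, 2, 5, 4, 9, 7, 0, 1])
```

For eight inputs this takes three levels, against seven dependent steps for the serial order.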

Reduce
- We can fuse the map and reduce patterns

Reduce
- Precision can become a problem with reductions on floating point data
- Floating point addition is not exactly associative, so different orderings of the same data can change the reduction value
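A small illustration of the ordering effect, assuming IEEE 754 double precision (which Python's `float` uses on common platforms). The three values are an assumption chosen because the effect is easy to see with them.

```python
# The same three values, summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ in the last bit, so a serial reduction and a
# tree-shaped reduction of the same data need not agree exactly.
same = left_to_right == right_to_left
difference = abs(left_to_right - right_to_left)
```

The difference is tiny here, but it means results of a parallel floating point reduction should be compared with a tolerance rather than with `==`.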

Reduce Example: Dot Product
- 2 vectors of same length
- Map (*) to multiply the components
- Then reduce with (+) to get the final answer
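The map-then-reduce recipe for the dot product can be sketched like this. The function name `dot` is an assumption; the lazy `map` means no intermediate list of products is built, which is a simple form of fusing the map with the reduce.

```python
from functools import reduce
import operator

def dot(a, b):
    """Dot product as map (*) followed by reduce (+)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    products = map(operator.mul, a, b)         # map: multiply components pairwise
    return reduce(operator.add, products, 0)   # reduce: sum the products

example = dot([1, 2, 3], [4, 5, 6])
perpendicular = dot([1, 0], [0, 1])
```

The second call shows that perpendicular vectors give a dot product of 0.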

Dot Product Example Uses
- Essential operation in physics, graphics, and video games
- Gaming analogy: in Mario Kart, there are boost pads on the ground that increase your speed
  - The red vector is your speed (x and y direction)
  - The blue vector is the orientation of the boost pad (x and y direction); larger numbers are more power
- How much boost will you get? For the analogy, imagine the pad multiplies your speed:
  - If you come in going 0, you'll get nothing
  - If you cross the pad perpendicularly, you'll get 0 [just like the banana obliteration, it will give you 0x boost in the perpendicular direction]
Ref: http://betterexplained.com/articles/vector-calculus-understanding-the-dot-product/

Scan
- The scan pattern produces all partial reductions of an input sequence and returns them as a new sequence
- Trickier to parallelize than reduce
- Inclusive scan vs. exclusive scan
  - Inclusive scan: includes the current element in the partial reduction
  - Exclusive scan: excludes the current element; the partial reduction covers only the elements before it
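A short sketch of the two variants using `itertools.accumulate` with addition. The input list is an assumption, and the exclusive scan is formed here by shifting the inclusive result right and starting from the identity of addition, 0.

```python
from itertools import accumulate
import operator

data = [1, 2, 5, 4, 9, 7, 0, 1]

# Inclusive scan: element i is the reduction of data[0..i].
inclusive = list(accumulate(data, operator.add))

# Exclusive scan: element i is the reduction of data[0..i-1],
# so the first element is the identity of the combiner (0 for +).
exclusive = [0] + inclusive[:-1]
```

Note that the last element of the inclusive scan equals the full reduction of the input.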

Scan Example Uses
- Lexical comparison of strings, e.g., determine that "strategy" should appear before "stratification" in a dictionary
- Add multi-precision numbers (those that cannot be represented in a single machine word)
- Evaluate polynomials
- Implement radix sort or quicksort
- Delete marked elements in an array
- Dynamically allocate processors
- Lexical analysis: parsing programs into tokens
- Searching for regular expressions
- Labeling components in 2-D images
- Some tree algorithms, e.g., finding the depth of every vertex in a tree
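As one illustration from the list above, deleting marked elements can be built on an exclusive scan of keep-flags: the scan tells each surviving element where to write itself. This is a sketch under assumptions; the name `compact` and its `keep` argument are hypothetical.

```python
from itertools import accumulate

def compact(items, keep):
    """Return the items whose keep-flag is true, preserving order."""
    flags = [1 if k else 0 for k in keep]
    inclusive = list(accumulate(flags))
    # Exclusive +-scan of the flags: output index of each kept element.
    positions = [s - f for s, f in zip(inclusive, flags)]
    out = [None] * (inclusive[-1] if inclusive else 0)
    for item, flag, pos in zip(items, flags, positions):
        if flag:
            out[pos] = item  # writes go to distinct slots, so they could run in parallel
    return out

kept = compact(list("abcdef"), [True, False, True, True, False, True])
```

Once the positions are known, every write is independent, which is why the scan is the only step that needs coordination.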

Scan
[Figure: serial scan vs. parallel scan]

Scan
- One algorithm for parallelizing scan is to perform an up sweep and a down sweep
- Reduce the input on the up sweep
- The down sweep produces the intermediate results
  - Up sweep: compute reduction
  - Down sweep: compute intermediate values

Scan Maximum Example
Input: 1 4 0 2 7 2 4 3
Inclusive maximum scan: 1 4 4 4 7 7 7 7
[Figure: up sweep and down sweep over the input]
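A sketch of an up-sweep/down-sweep scan applied to the maximum example. This follows the commonly taught work-efficient formulation, which may differ in detail from the one drawn on the slide. The function name is hypothetical, the input length is assumed to be a power of two, and the down sweep yields an exclusive scan that is then combined with the input to get the inclusive result.

```python
def up_down_sweep_scan(combine, identity, data):
    """Inclusive scan via an up sweep (reduce) and a down sweep."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two length")
    tree = list(data)
    # Up sweep: build partial reductions; iterations of each inner loop are independent.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            tree[i] = combine(tree[i - step], tree[i])
        step *= 2
    # Down sweep: push prefixes back down, producing an exclusive scan.
    tree[n - 1] = identity
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] = combine(tree[i], left)
        step //= 2
    # Combine each exclusive prefix with its own element for the inclusive scan.
    return [combine(prefix, x) for prefix, x in zip(tree, data)]

result = up_down_sweep_scan(max, float("-inf"), [1, 4, 0, 2, 7, 2, 4, 3])
```

Each sweep has log2(n) levels, and the work within a level is independent, which is where the parallelism comes from.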

Scan
- Three phase scan with tiling
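A possible reading of the three phases, sketched serially: reduce each tile, scan the per-tile totals, then scan each tile seeded with the total of everything before it. The name `three_phase_scan` and the `tile_size` parameter are assumptions, and the `initial` argument of `itertools.accumulate` requires Python 3.8 or later.

```python
from functools import reduce
from itertools import accumulate
import operator

def three_phase_scan(combine, data, tile_size):
    """Inclusive scan computed tile by tile in three phases."""
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    if not tiles:
        return []
    # Phase 1: reduce every tile (tiles are independent of each other).
    totals = [reduce(combine, tile) for tile in tiles]
    # Phase 2: scan the per-tile totals (one value per tile, so this is small).
    offsets = list(accumulate(totals, combine))
    # Phase 3: scan every tile, seeded with the total of all preceding tiles.
    out = list(accumulate(tiles[0], combine))
    for k in range(1, len(tiles)):
        seeded = accumulate(tiles[k], combine, initial=offsets[k - 1])
        out.extend(list(seeded)[1:])  # drop the seed itself
    return out

scanned = three_phase_scan(operator.add, [1, 2, 5, 4, 9, 7, 0, 1], tile_size=3)
```

Phases 1 and 3 are parallel across tiles; only phase 2 is serial, and it touches one value per tile.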


Scan
- Just like reduce, we can also fuse the map pattern with the scan pattern


Merge Sort as a reduction
- We can sort an array via a pair of a map and a reduce
- Map each element into a vector containing just that element, then reduce the vectors with <>
  - <> is the merge operation: [1,3,5,7] <> [2,6,15] = [1,2,3,5,6,7,15]
  - [] is the empty list
- How fast is this?
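The merge operation <> can be sketched as a standard two-pointer merge of two sorted lists. The function name `merge` is an assumption; it is associative, and the empty list acts as its identity, which is what makes it usable as a combiner.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list (the <> operation)."""
    out = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    out.extend(left[i:])
    out.extend(right[j:])
    return out

merged = merge([1, 3, 5, 7], [2, 6, 15])
```

Each call takes time proportional to the combined length of its inputs.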

Right Biased Sort
Start with [14,3,4,8,7,52,1]
Map to [[14],[3],[4],[8],[7],[52],[1]]
Reduce:
  [14] <> ([3] <> ([4] <> ([8] <> ([7] <> ([52] <> [1])))))
  = [14] <> ([3] <> ([4] <> ([8] <> ([7] <> [1,52]))))
  = [14] <> ([3] <> ([4] <> ([8] <> [1,7,52])))
  = [14] <> ([3] <> ([4] <> [1,7,8,52]))
  = [14] <> ([3] <> [1,4,7,8,52])
  = [14] <> [1,3,4,7,8,52]
  = [1,3,4,7,8,14,52]

Right Biased Sort Cont.
- How long did that take?
- We did O(n) merges, but each one took O(n) time
- Total: O(n^2)
- We wanted merge sort, but instead we got insertion sort!
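To make the quadratic cost concrete, here is a sketch that performs the right-biased reduction and counts how many elements the merges produce in total. The names are hypothetical, `heapq.merge` stands in for <>, and unlike the slide the fold starts from the empty list, which adds one trivial merge.

```python
import heapq

def right_biased_sort(values):
    """Fold from the right: x0 <> (x1 <> (... <> (xn <> [])))."""
    acc = []   # the sorted suffix built so far
    work = 0   # total number of elements written by all merges
    for v in reversed(values):
        # Merging a single element into acc is one insertion-sort step.
        acc = list(heapq.merge([v], acc))
        work += len(acc)
    return acc, work

sorted_values, work = right_biased_sort([14, 3, 4, 8, 7, 52, 1])
```

For n inputs the counter reaches 1 + 2 + ... + n = n(n+1)/2, and every merge depends on the previous one, so nothing can run in parallel.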

Tree Shape Sort
Start with [14,3,4,8,7,52,1]
Map to [[14],[3],[4],[8],[7],[52],[1]]
Reduce:
  (([14] <> [3]) <> ([4] <> [8])) <> (([7] <> [52]) <> [1])
  = ([3,14] <> [4,8]) <> ([7,52] <> [1])
  = [3,4,8,14] <> [1,7,52]
  = [1,3,4,7,8,14,52]

Tree Shaped Sort Performance
- Even if we only had a single processor, this is better
  - There are O(log n) levels of merges
  - Each level does O(n) work in total
  - So O(n*log(n))
- But the opportunity for parallelism is not so great
  - The critical path is still O(n) assuming sequential merge, because the final merge alone touches all n elements
- Takeaway: the shape of the reduction matters!
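The tree-shaped reduction can be sketched as repeated pairwise merging of neighbouring runs, counting the levels as it goes. The function name `tree_sort` is hypothetical and `heapq.merge` stands in for <>.

```python
import heapq

def tree_sort(values):
    """Sort by merging neighbouring runs level by level."""
    runs = [[v] for v in values]   # map: each element becomes a one-element run
    levels = 0
    while len(runs) > 1:
        # Merges within one level are independent and could run in parallel.
        merged = [list(heapq.merge(runs[i], runs[i + 1]))
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2 == 1:
            merged.append(runs[-1])  # an odd run is carried to the next level
        runs = merged
        levels += 1
    return (runs[0] if runs else []), levels

sorted_values, levels = tree_sort([14, 3, 4, 8, 7, 52, 1])
```

Seven inputs need three levels. The last level is a single merge over all n elements, which is why the critical path stays linear unless the merge itself is parallelized.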