Welcome to Parallel Graph AnalytiX (PGX)

Graph analysis lets you reveal latent information that is encoded not as fields in your data, but as direct and indirect relationships (metadata) between elements of your data. This information is not obvious to the naked eye, but it can have tremendous value once uncovered.

What is PGX?

PGX is a toolkit for graph analysis. It supports both running algorithms such as PageRank against graphs and performing SQL-like pattern matching against graphs, where the patterns can use the results of algorithmic analysis. Algorithms are parallelized for performance. The PGX toolkit includes both a single-node in-memory engine and a distributed engine for extremely large graphs. Graphs can be loaded from a variety of sources, including flat files, SQL and NoSQL databases, and Apache Spark and Hadoop; incremental updates are supported.

The PGX distribution includes the following tools:

  • The PGX shell - a REPL that uses the Groovy programming language for interactive analysis (APIs for a variety of languages are also available)
  • A very large collection of built-in algorithms which are part of the Analyst API - covering such domains as community detection, ranking, partitioning, recommendation generation and more
  • The Green-Marl domain-specific language (DSL) for writing graph analysis algorithms in a simple and readable form - the PGX runtime can compile, transparently parallelize and run Green-Marl programs
  • PGQL - Property Graph Query Language - an SQL-like language for graph pattern-matching, which includes both SQL-like value-based constraints and topological constraints
  • An Apache Zeppelin interpreter - as an alternative to using the PGX shell directly, PGX can be embedded in Apache Zeppelin so that analysis can be done collaboratively using online notebooks
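
To make the difference between value-based and topological constraints concrete, here is a minimal plain-Python sketch. It does not use PGX or PGQL: the toy graph, the `age` property and the `match_edges` helper are all hypothetical names invented for illustration. The pattern it matches is "an edge from n to m (topological) where n.age > 30 (value-based)".

```python
# Toy property graph: vertex properties plus a list of directed edges.
# All names and data here are hypothetical; this is not the PGX or PGQL API.
vertices = {
    "alice": {"age": 34},
    "bob": {"age": 27},
    "carol": {"age": 41},
}
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]


def match_edges(vertices, edges, source_predicate):
    """Return every (src, dst) pair where src -> dst is an edge
    (topological constraint) and src's properties satisfy
    source_predicate (value-based constraint)."""
    return [
        (src, dst)
        for src, dst in edges
        if source_predicate(vertices[src])
    ]


# Pattern: (n) -> (m) where n.age > 30
matches = match_edges(vertices, edges, lambda props: props["age"] > 30)
print(matches)
```

A real query language expresses both kinds of constraint declaratively and lets the engine choose how to evaluate them; the sketch only shows what a match means.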

The typical usage pattern in PGX is to:

  • Start the shell or create a new notebook
  • Load a graph from some data source
  • Run one or more algorithms against the data
  • Query the graph using PGQL, referencing properties added by the algorithms run previously

In addition, there are features for filtering graphs, extracting subgraphs and much more, and graphs can be saved for later use.
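
The load, analyze and query steps of the usage pattern above can be sketched in plain Python. This is an illustration under stated assumptions, not the PGX API: the edge-list data, the `load_graph` and `pagerank` helpers, and the final "query" are all hypothetical, and the PageRank shown is a simple sequential power iteration rather than a parallel implementation.

```python
import io

# Hypothetical edge-list data standing in for a flat file: "src dst" per line.
EDGE_LIST = """\
a b
a c
b c
c a
d c
"""


def load_graph(text_stream):
    """Load a directed graph as {vertex: [out-neighbors]}."""
    graph = {}
    for line in text_stream:
        if not line.strip():
            continue
        src, dst = line.split()
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank (sequential, not parallel)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread over all vertices.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in graph}
        for v, neighbors in graph.items():
            for w in neighbors:
                new_rank[w] += damping * rank[v] / len(neighbors)
        rank = new_rank
    return rank


# Load a graph, then run an algorithm that adds a "rank" property.
graph = load_graph(io.StringIO(EDGE_LIST))
rank = pagerank(graph)

# "Query" the graph using the property the algorithm added:
# vertices ranked above the uniform average 1/n, highest first.
above_average = sorted(
    (v for v in graph if rank[v] > 1.0 / len(graph)),
    key=lambda v: rank[v],
    reverse=True,
)
print(above_average)
```

The point of the sketch is the ordering of the steps: the algorithm writes a per-vertex property, and the later query filters and sorts on that property just as it would on loaded data.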

What's new in the PGX Runtime

In the latest PGX version, we have added features such as Apache Spark support and the ability to export compiled Green-Marl programs as Java JAR files. See our what's new page for the latest features.

What can I do with PGX?

  • Loading graphs: PGX can load graphs from a variety of sources such as relational databases, NoSQL databases, Apache Spark / Hadoop, and flat files.

  • Applying graph pattern matching: PGX includes an SQL-like query language for matching subgraphs based on their connections, their properties, or both. Further analytics can be run against the matched subgraphs.

  • Running parallel, high-performance graph algorithms: PGX provides built-in implementations of many popular graph algorithms. Users can apply these algorithms to their graph data sets by invoking the appropriate methods.

  • Running custom graph algorithms: PGX can also execute custom (i.e. user-provided) graph algorithms. Users can write their own graph algorithms in the Green-Marl DSL, then compile and run them using PGX.

  • Mutating graphs: Complicated graph analyses often consist of multiple steps, some of which require graph-mutating operations. For example, one may want to create an undirected version of the graph, renumber the nodes in the graph, or remove repeated edges between nodes. PGX provides fast, parallel built-in implementations of such operations.

  • Browsing and exporting results: Once the analysis is finished, users can browse the results and export them to the file system.
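
The three mutation examples mentioned above (removing repeated edges, creating an undirected version, renumbering nodes) can be sketched in plain Python on an edge list. This is only an illustration of what the operations do: the data and the helper names `remove_repeated_edges`, `undirect` and `renumber` are hypothetical, and the code is sequential, unlike the parallel built-in implementations the text describes.

```python
# Hypothetical directed edge list with a repeated edge (10 -> 20 appears twice).
edges = [(10, 20), (10, 20), (20, 30), (30, 10), (42, 10)]


def remove_repeated_edges(edges):
    """Keep the first occurrence of each (src, dst) pair, preserving order."""
    return list(dict.fromkeys(edges))


def undirect(edges):
    """Undirected version: store each edge once as a sorted (low, high) pair."""
    return remove_repeated_edges([tuple(sorted(edge)) for edge in edges])


def renumber(edges):
    """Map vertex IDs to 0..n-1 in order of first appearance."""
    new_id = {}
    for src, dst in edges:
        for vertex in (src, dst):
            new_id.setdefault(vertex, len(new_id))
    return [(new_id[src], new_id[dst]) for src, dst in edges], new_id


simple = remove_repeated_edges(edges)
undirected = undirect(edges)
renumbered, mapping = renumber(simple)
print(simple)
print(undirected)
print(renumbered, mapping)
```

Each helper returns a new edge list instead of modifying its input, so the intermediate graphs of a multi-step analysis stay available for later steps.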

What are the key benefits of PGX?

  • Fast, parallel, in-memory execution: PGX is a fast, parallel, in-memory graph analytic framework. It uses lightweight in-memory data structures that allow fast execution of graph algorithms, and it exploits the multiple CPUs of modern computer systems by running parallelized graph algorithms. Not only are the built-in algorithms parallelized; custom graph algorithms are also parallelized automatically with the help of a DSL compiler.

  • Rich built-in algorithms: PGX provides built-in implementations of many popular graph algorithms, including computing various centrality measures, finding shortest paths, finding and evaluating clusters and components, and predicting future edges. (Note: The OTN public release contains only a small subset of these algorithms. See the documentation and contact us if you want to remove this limitation.)

  • Easy implementation and efficient execution of custom algorithms: PGX adopts the Green-Marl DSL to make custom algorithms both easy to implement and efficient to execute. Users can program their own graph algorithms intuitively by using the high-level graph-specific data types and operators in Green-Marl. PGX executes a Green-Marl program efficiently by parallelizing it and mapping it onto the PGX-internal API.

  • Interactive shell: PGX provides a shell application with which users can exercise the PGX features interactively. That is, a user can simply start the shell and type commands at the shell command line instead of creating a whole Java application for their analysis.

  • Deployment as a web service: PGX ships with a web application which can be deployed in a container such as WebLogic, Jetty or Tomcat. This allows you to use the interactive shell and other APIs against a remote instance. You can deploy PGX on a server-class machine and have multiple clients share access to the resources of that machine.

  • Hadoop support: You can use PGX to analyze graphs on a Hadoop cluster. You can run PGX as a YARN application and connect to it from the interactive shell or other APIs. PGX also supports loading graphs from, and storing graphs to, HDFS.


How can I use PGX? What does the PGX API look like?

PGX can be used in several ways:

  1. In a Java (or Scala, Groovy or other JVM language) application: Either the entire PGX runtime or the PGX client (talking to a remote PGX server) can be used as a library embedded in a Java application.

  2. Interactively from the shell: Users can also work with PGX as if it were a separate application by using the PGX shell. Once the PGX shell has started, they can load graphs, invoke algorithms, and browse or export results in a very simple manner.

  3. In an Apache Zeppelin notebook: A Zeppelin interpreter is available for download, which embeds the PGX shell in Zeppelin (and can talk to an embedded or remote PGX server instance). Analyses can be developed collaboratively and interactively in a web browser and formatted as reports.

  4. Remote usage: For all of the use cases above, you can use PGX either locally or remotely. In the remote case, you need to start PGX on a web server and provide the client with a hostname and port to connect to. If you use PGX locally, it will simply spin up a local PGX instance on which you can work without any HTTP overhead.

See the tutorials for more information on how to use PGX.

What is the license of PGX?

This version of PGX is released under the OTN license. Please see the documentation for more details about the OTN release and its limitations.

How can I install PGX on my system?

Please see the installation documentation, which explains how to install PGX.