Neo4j
Thesis Topic Opportunity Spring 2025
🌎Malmö

Job Description

About Neo4j:

Neo4j is the leader in Graph Database & Analytics, helping organizations uncover hidden relationships and patterns across billions of data connections deeply, easily and quickly. Customers use Neo4j to gain a deeper understanding and reveal new ways of solving their most pressing problems. Over 75% of Fortune 100 companies use Neo4j, along with a vibrant community of 250,000+ developers, data scientists, and architects across the globe.

At Neo4j, we’re proud to be building the technology that powers breakthrough solutions for our customers, helping them cure diseases, fight fraud, crush pandemics, and accomplish their most ambitious missions, even if it’s getting humans to Mars. Learn more at neo4j.com and follow us @Neo4j.

 
Our Vision: 

At Neo4j, we have always strived to help the world make sense of data.  

As business, society and knowledge become increasingly connected, our technology promotes innovation by helping organizations to find and understand data relationships. We created, drive and lead the graph database category, and we’re disrupting how organizations leverage their data to innovate and stay competitive.

The Role:

Are you at the end of your studies and want to immerse yourself in graph technology? We are now looking for students who want to do their Master’s Thesis alongside us at Neo4j!
As part of Neo4j engineering in Malmö, you will work with a diverse team of talented colleagues worldwide. You will receive advice and continuous support from us - we are experts in graph technology and positioned to help you perform to the best of your ability.

Past Thesis Topics:
Force Directed Drawing Algorithms and Parameter Optimisation:
In my thesis I implemented and compared several graph drawing algorithms, along with methods to speed up the slow parts of these algorithms. The algorithms were then used to test what is, to the best of my knowledge, a novel approach to selecting parameter values for graph drawing algorithms. I use methods similar to those used in Machine Learning to select parameter values, and I measure the utility of any set of parameters with a utility function of my own. I built this function from commonly known objective measures of drawing quality, such as the number of edge crossings, together with the time it took to draw a given graph. The resulting method for parameter optimisation was able to find significant increases in drawing speed for several of my implemented algorithms without compromising drawing quality. Furthermore, the approach is not specific to any parameter set and can, with some modification, be applied to any graph drawing algorithm that depends on some constants.
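To give a feel for what such a utility function might look like, here is a minimal Python sketch. It is not the thesis implementation: the function names, the weights and the linear form of the utility are all assumptions made for illustration. It counts proper edge crossings in a straight-line drawing and combines that count with the drawing time into a single score.

```python
from itertools import combinations


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def segments_cross(a, b, c, d):
    """True if segment ab properly crosses segment cd (touching does not count)."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_edge_crossings(positions, edges):
    """Count pairs of edges that cross in a straight-line drawing."""
    crossings = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        if len({u1, v1, u2, v2}) < 4:
            continue  # edges that share a node cannot properly cross
        if segments_cross(positions[u1], positions[v1],
                          positions[u2], positions[v2]):
            crossings += 1
    return crossings


def utility(positions, edges, seconds, crossing_weight=1.0, time_weight=1.0):
    """Hypothetical utility (higher is better): penalise crossings and drawing time."""
    penalty = (crossing_weight * count_edge_crossings(positions, edges)
               + time_weight * seconds)
    return -penalty


# A unit square: the two diagonals cross once, two opposite sides do not.
square = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
diagonals = [("a", "c"), ("b", "d")]
sides = [("a", "b"), ("c", "d")]
```

A parameter search would then run the drawing algorithm for each candidate parameter set, time it, and keep the set with the highest utility.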

Modeling Profiling Data in a Graph Database for Performance Analysis:
Benchmarking is an important part of the development process for any mission-critical application. By inspecting profiling data, developers can identify bottlenecks and performance regressions before they reach the customers. 
Neo4j runs an extensive benchmarking suite on its database, producing a large volume of profiling data each week. These profiles are commonly visualized individually as flame graphs, which are inspected manually. Finding patterns and differences among multiple profiles by hand is difficult because of the size and complexity of the data. We propose a framework for identifying bottlenecks and regressions by modeling the profiling data as call-stack trees in a graph database. We demonstrate the usefulness of the framework for cross-profile analysis such as time series analysis and aggregation-based methods. We conclude that there is much potential in this approach and that our thesis can be used as a decision basis for organizations wanting to implement a similar framework.
Using a graph database to model profiling data has many advantages and is suitable for the tree-like structure of the data. It makes the data more accessible and facilitates flexible querying in which the user can ask questions about the data and perform non-trivial aggregation. It has already aided Neo4j in the process of pinpointing the cause of some performance issues. The main disadvantage is the complexity involved in importing large quantities of data.
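The idea of treating profiles as call-stack trees and comparing them across runs can be sketched in a few lines of Python. This is only an in-memory illustration, not the thesis framework and not a graph database: the "folded stack" input format, the function names and the 10% threshold are assumptions. Each call path stands for one node of the tree, and two profiles are compared by each node's share of the total samples.

```python
from collections import defaultdict


def build_call_tree(folded_stacks):
    """Turn {"root;child;leaf": samples} into inclusive sample counts per call path.

    Each key of the result is a tuple of frames, i.e. one node of the call-stack tree.
    """
    inclusive = defaultdict(int)
    for stack, samples in folded_stacks.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            inclusive[tuple(frames[:depth])] += samples
    return dict(inclusive)


def regressions(before, after, threshold=0.10):
    """Call paths whose share of all samples grew by more than `threshold`."""
    def shares(tree):
        total = sum(n for path, n in tree.items() if len(path) == 1)
        return {path: n / total for path, n in tree.items()}

    old, new = shares(before), shares(after)
    grown = [(path, new[path] - old.get(path, 0.0)) for path in new]
    return sorted((item for item in grown if item[1] > threshold),
                  key=lambda item: item[1], reverse=True)


# Two made-up profiles of the same benchmark from different weeks.
week_1 = build_call_tree({"main;plan": 20, "main;execute;scan": 60, "main;execute;sort": 20})
week_2 = build_call_tree({"main;plan": 20, "main;execute;scan": 40, "main;execute;sort": 40})
suspects = regressions(week_1, week_2)
```

In a graph database the same tree would be stored as nodes and relationships, so that comparisons like this one become queries instead of ad hoc scripts.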

Navigating Failures in Distributed Systems: A Comparative Study of Failure Detection Algorithms:
Failure detection algorithms are used to identify unhealthy nodes in distributed systems. The goal of this study was to improve Neo4j’s use of failure detection algorithms by exploring two paths: optimising its existing Lighthouse algorithm or implementing a new algorithm. Existing algorithms were surveyed and the SWIM algorithm was implemented. A baseline was established and evaluated against parameter-optimised versions of SWIM and Lighthouse in a simulated network. The results show that the baseline is scalable and reliable but slow, Lighthouse is fast but less accurate, and SWIM is moderately fast and the least accurate but generates the least network load. In conclusion, the chosen parameters of a failure detector are considerably more important than the algorithm itself. Furthermore, to optimise parameters successfully it is crucial to have a scalable simulator and precise system requirements to manage the trade-off between speed, accuracy, and network load.
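The trade-off between speed and accuracy can be illustrated with a much simpler detector than those studied in the thesis. The Python sketch below is neither SWIM nor Lighthouse. It replays a made-up list of heartbeat arrival times against a plain fixed-timeout detector, and the function name and numbers are assumptions for illustration.

```python
def evaluate_timeout(heartbeat_times, crash_time, timeout):
    """Replay heartbeat arrivals against a fixed-timeout failure detector.

    heartbeat_times: sorted arrival times of heartbeats received before the crash.
    Returns (false_suspicions, detection_delay), where
    - false_suspicions counts gaps between heartbeats that exceeded the timeout
      while the node was still healthy, and
    - detection_delay is the time from the crash until the node is suspected.
    """
    false_suspicions = 0
    last = heartbeat_times[0]
    for t in heartbeat_times[1:]:
        if t - last > timeout:
            false_suspicions += 1  # the detector gave up before this heartbeat arrived
        last = t
    detection_delay = max(0.0, last + timeout - crash_time)
    return false_suspicions, detection_delay


# One delayed heartbeat (the gap between 2.0 and 4.5), then a crash at t = 6.5.
arrivals = [0.0, 1.0, 2.0, 4.5, 5.0, 6.0]
aggressive = evaluate_timeout(arrivals, crash_time=6.5, timeout=2.0)
cautious = evaluate_timeout(arrivals, crash_time=6.5, timeout=3.0)
```

With these numbers the shorter timeout detects the crash sooner but wrongly suspects the healthy node once, while the longer timeout avoids the false suspicion at the cost of a slower detection.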

Cache replacement policies and their impact on graph database operations:
In this master’s thesis project, the page caching strategy of the Neo4j database is studied with the aim of improving it. Focusing on the eviction protocol of the page cache, several different algorithms are evaluated both in experimental prototypes written in Python and in the Neo4j database kernel. The measurements of the prototypes and the results of the Neo4j benchmarks lead to the conclusion that the current page replacement policy is hard to beat with a different strategy. However, modifying the current page replacement policy to use a global instead of a thread-local data structure, and tuning its parameters, increased the hit rate and throughput. Furthermore, the measurements on the different implementations showed that the hit rate can be increased at the cost of some overhead, but implementing a complicated algorithm quickly increases the overhead and might decrease the throughput enough to make the algorithm ineffective.
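A Python prototype of this kind can be very small. The sketch below is not the thesis code and does not reproduce Neo4j's page replacement policy. It uses two textbook policies, LRU and FIFO, chosen here purely as examples, and measures their hit rate on a made-up page access trace.

```python
from collections import OrderedDict, deque


def hit_rate_lru(trace, capacity):
    """Hit rate of a least-recently-used cache holding `capacity` pages."""
    cache = OrderedDict()
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return hits / len(trace)


def hit_rate_fifo(trace, capacity):
    """Hit rate of a first-in-first-out cache holding `capacity` pages."""
    cache = set()
    order = deque()
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.discard(order.popleft())  # evict the oldest loaded page
            cache.add(page)
            order.append(page)
    return hits / len(trace)


# Page 1 is "hot": it is accessed again and again between other pages.
trace = [1, 2, 3, 1, 4, 1, 5, 1, 2, 1]
lru = hit_rate_lru(trace, capacity=3)
fifo = hit_rate_fifo(trace, capacity=3)
```

On this trace LRU keeps the hot page cached and reaches a higher hit rate than FIFO. Hit rate alone does not capture the bookkeeping overhead of each policy, which is the other side of the trade-off described above.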

Randomly generating execution plans for bug detection in Neo4j:
In recent years, Graph Database Management Systems (GDBMS) have grown in popularity for many use cases. One of the most popular GDBMS is Neo4j, which uses Cypher as a query language. With the increasing use of GDBMS in many business-critical applications, the need to test Neo4j and its competitors has become critical. One common practice for identifying bugs in a database system is using randomly generated tests, known as fuzz testing. Previously, this has been done by randomly generating queries, and several tools are currently available for this purpose. When a Cypher query is executed, it goes through several processing steps to ensure that a correct result is returned quickly. One of the intermediate structures used in the query processing is the execution plan, which details how the runtime should solve the query. In this thesis, we propose a novel approach to fuzz testing GDBMS by randomly generating execution plans. Our tool utilizes differential testing between different Neo4j runtimes, which allows for identifying incorrect results returned from one or more of the runtimes. These types of bugs are known as logic bugs. We can also identify situations when the Neo4j runtimes throw unexpected exceptions. The testing suite identified 20 bugs in Neo4j, of which 11 were logic bugs. This approach to fuzz testing has proven helpful in identifying errors within the Neo4j runtimes, which previously received insufficient coverage by fuzz testing using queries. Other database management systems that utilize execution plans can benefit from the approach proposed by this thesis. The main drawbacks of this new approach are that it is not easily portable between different GDBMS and that it requires access to the query processing source code.
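Differential testing over randomly generated plans can be sketched with a toy example in Python. Nothing here comes from the thesis tool or from Neo4j: the operators, both "runtimes" and the injected bug are hypothetical, and a plan is simplified to a flat list of steps where a real execution plan is a tree.

```python
import random

OPERATORS = ("filter_even", "distinct", "sort", "limit")


def random_plan(rng, max_ops=4):
    """Generate a random toy plan: a list of (operator, argument) steps."""
    plan = []
    for _ in range(rng.randint(1, max_ops)):
        op = rng.choice(OPERATORS)
        arg = rng.randint(0, 5) if op == "limit" else None
        plan.append((op, arg))
    return plan


def run_reference(plan, rows):
    """Toy reference runtime: apply each step in order."""
    for op, arg in plan:
        if op == "filter_even":
            rows = [r for r in rows if r % 2 == 0]
        elif op == "distinct":
            rows = list(dict.fromkeys(rows))  # drop duplicates, keep order
        elif op == "sort":
            rows = sorted(rows)
        elif op == "limit":
            rows = rows[:arg]
    return rows


def run_buggy(plan, rows):
    """Second toy runtime with a deliberately injected logic bug: LIMIT 0 is ignored."""
    for op, arg in plan:
        if op == "limit" and arg == 0:
            continue  # injected bug: should return no rows
        rows = run_reference([(op, arg)], rows)
    return rows


def differential_test(runs, seed=0):
    """Run random plans on both runtimes and collect disagreements and exceptions."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        plan = random_plan(rng)
        rows = [rng.randint(0, 9) for _ in range(8)]
        try:
            expected, actual = run_reference(plan, rows), run_buggy(plan, rows)
        except Exception as exc:  # an unexpected exception is also a finding
            failures.append((plan, rows, repr(exc)))
            continue
        if expected != actual:
            failures.append((plan, rows, "result mismatch"))
    return failures


failures = differential_test(runs=500)
```

Every reported failure comes with the plan and input rows that triggered it, so a mismatch can be replayed and narrowed down. No oracle for the "correct" result is needed, only a second runtime to compare against.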

We tackle challenges in:
  • Concurrency and parallelism
  • Distributed systems and fault tolerance
  • Language design and type systems
  • Performance tuning and benchmarking
  • Cloud architecture and service design
  • Site Reliability Engineering and cloud automation
  • Continuous Integration and Continuous Delivery
  • Graph algorithms and machine learning
Please send us a description in English of:
  1. Your area of study
  2. Your thesis idea and the area of engineering that it corresponds to
  3. If you are not completely sure, that is okay - please let us know if you would like to find out more information
  4. If you are applying as a group, please apply separately and indicate who you are applying together with in your Cover Letter.
Why Join Neo4j?

Neo4j is, without question, the most popular graph database in the world. We have customers in every industry across the globe, and our products have proven product/market fit. Joining our team is an opportunity to shape the future of data and analytics. Below are just a few exciting facts about Neo4j.

  • Neo4j is one of the fastest-scaling technology companies in this industry, with well over $100M ARR and still growing rapidly.
  • Raised the largest funding round in database history ($325M Series F).
  • Backed by world-class investors like Google Ventures (GV), Neo4j has raised over $582M in funding and is currently valued at $2Bn. This puts it among the most well-funded database companies in history.
  • 75% of Fortune 100 use Neo4j with more than 800 enterprise customers including Comcast, eBay, Adobe, Lyft, UBS, IBM, Volvo Cars and many more.
  • Emil Eifrem (CEO) has built an amazing culture that prides itself on relationships, inclusiveness, innovation and customer success.
  • Countless awards in the industry. Large enterprises, individual developers and data scientists love Neo4j, and a strong community and ecosystem have been built around the platform.
  • A recent Forrester Total Economic Impact Study pegged Neo4j as delivering 417% ROI to customers.

Research shows that members of underrepresented communities are less likely to apply for jobs when they don’t meet all of the qualifications. If this is part of the reason you hesitate to apply, we’d encourage you to reconsider and give us the opportunity to review your application. At Neo4j, we are committed to building awareness and helping to improve these issues. 

One of our central objectives is to provide an inclusive, diverse, and equitable workplace for everyone to develop their potential and have a positive, career-defining experience. We look forward to receiving your application.

Neo4j Values:

Neo4j is a Silicon Valley company with a Swedish soul. We foster collaboration and each of us is empowered to contribute and put our innovative stamp on projects. We hire candidates who reflect the following Neo4j core values:

(we)-[:VALUE]->(relationships)
(we)-[:FOCUS_ON]->(userSuccess)
(we)-[:THRIVE_IN]->(:Culture {type: ['Open', 'Inclusive']})
(we)-[:ASSUME]->(:Intent {direction: 'Positive'})
(we)-[:WELCOME]->(:Discussions {nature: 'IntellectuallyHonest'})
(we)-[:DELIVER_ON]->(ourCommitments) 

Neo4j is committed to protecting and respecting your privacy. Please read the privacy notice regarding Neo4j's recruitment process to understand how we will handle the personal data that you provide. 

More information at www.neo4j.com.