Google Summer of Code 2020
This year I got selected for Google Summer of Code 2020 under The Honeynet Project. This year GSoC was very special for me because I finally got selected for the organization, for which I’ve been trying to get selected from past 2 years.
I got to know about Google Summer of Code back in 2018, when I learned that my elder brother has done it 3 years in a row. He wanted me to try to get selected for any org that I like. So I started looking for projects on GSoC archives and came across the Honeynet Project org. I was interested in it because I really liked all the projects under it, projects like Snare&Tanner, Thug, etc. I think these projects attracted my attention because they have a feel of being related to the
Information Security field.
In GSoC 2018 I submitted 2 proposals, one for Snare & Tanner under Honeynet Project and another one for a tool called Addon-Checker under XBMC Foundation. I wasn’t selected for Snare & Tanner that year but got selected for another project. You can read the blog post about it here
In GSoC 2019, I again applied for a project called Cowrie under Honeynet Project and wasn’t selected. I think it was because the other applicant, who got selected had a slightly better proposal then I did.
In 2020, I decided to start early and make some more contributions to Snare & Tanner. Also, I spent quite some time understanding the exact work required and make a good proposal for it. The mentor of the project was very helpful from the very beginning. She reviewed my proposal twice or thrice before I made a final proposal submission on the GSoC portal
The work that I proposed in my GSoC proposal was:
- Adding support for persistent storage for sessions, in TANNER.
- Improving cloning and storing functionality of SNARE.
- Small works like adding a new injection template, using async FTP library etc.
I started to work before the official
coding begins period, even though I didn’t write any code but I started to discuss everything with my mentor like what we should do first, how we should start etc. Our initial idea was to start with the most important task which was to add support for PostgreSQL so the storage capability of TANNER, can become a bit more stable. In the early discussion, we decided to add PostgreSQL as an optional DB meaning users will have the option to choose between PostgreSQL and Redis DB. But later we came to the conclusion that it would be nice if we can use the combination of both the databases to improve the overall storage functionality.
Currently, in TANNER, the only way to store all the session data (analyzed and unanalyzed) was to keep it within the Redis. In TANNER the unanalyzed sessions data is the unprocessed data that is sent by the SNARE and it contains information like time of the session, user agent, etc. And analyzed sessions are those which TANNER generates after processing the unanalyzed data. The analyzed data contains information like possible owners i.e whether the sessions were created by a normal user or a crawler/tool, type of attacks, etc
Back in 2016 when new features were being added to TANNER, Redis was selected as the best choice because of the speed that it offers. Redis provides its speed functionality by storing data in the memory instead of keeping it in the disk. It’s kind of similar to how a cache of any system works.
As the project grew we started to gather more and more data and storing everything in Redis caused it to consume a lot of memory which resulted in unexpected crashes on low spec systems.
Below is the diagram which explains how the SNARE & TANNER is working with the only Redis:
Disadvantages of this setup
As mentioned above higher consumption of memory on a low spec system would result in an unexpected crash of TANNER server. Also, higher memory consumption would cause other memory issues like slowing down other applications present on the system.
First, we decided that we can add support for Postgres, and then we’ll give users the option to run tanner either with Redis or with Postgres. But later after some discussion, we decided that we’ll use a combination of Redis and Postgres.
The way we decided to use them can be seen in the following diagram:
As it’s clear from the above diagram that after analysis we are storing the data in the Postgres and then we are deleting the analyzed data from Redis.
The advantages of using the combination of both Postgres and Redis are
When TANNER receives a session from SNARE it will be able to store that session data in Redis with much higher speed.
Once the session is analyzed then it can be stored in persistent disk storage like Postgres.
Change in Data format
One of the major advantages of having Postgres as the storage is because it gives us the power to perform queries before presenting that data to the user.
TANNER API is used to access the data stored in the DB. Let’s take an example to understand how the API has changed now, we have an endpoint /snare-stats/
Now in the old (Only Redis) setup, we were extracting all the data from the DB, and then we were selecting only those keys that were required for that endpoint. But now with the new (Redis+Postgres) setup, we don’t extract all the data. What we do is we run SQL queries on the data before actually extracting the data, this saves us from unnecessary processing.
If you’d like to see the queries we are executing then please check out the code here
Detail about the format
We decided to store all the data in 4 tables named sessions, cookies, owners, paths. Below you can see the schema of all the tables:
If you’d like to see the initial discussion about this format then check out this small gist that I made.
Initially, our plan was that I’ll spend around 2 months working on this task, and then in the last month I’ll work on all the other tasks that I proposed but little did we know that adding support for another DB wouldn’t be as easy as we thought.
- Finding and fixing a weird bug: I think this was the most frustrating bug that I came across during the whole GSoC time period.
So we have 4 tables to in our Postgres DB and out of this there is one main tabled called
sessionsand then there are other like cookies, owners, and paths. The Sessions data store a kind of reference to these tables and the bug was that when you looked at the data in the sessions table it showed there was data in the other tables but when you look at those tables separately the data was missing out. Example:
a1894717-bf39-4f67-8493-66debde31e0b | 776edef4-ec5f-4fe0-8485-02cf9ff1f64d | 18.104.22.168 | 43972 | France | FR | Paris | 75001 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36 | 2020-06-23 20:09:01 | 2020-06-23 20:16:23 | 9.010287049751257 | 0.11098427662381329 | 3983 | 0 | 0 |
In this, the number
3983 is the
accepted_path value which basically means how many paths a bot/attacker tried in one session. But if we checked the count in the
paths table it showed
select count(*) from paths where session_id='a1894717-bf39-4f67-8493-66debde31e0b'; count ------- 6
The issue was that while we moved the session data between Redis(storing the initial data) and Postgres(storing analyzed sessions) some sessions were missing an
attack_type. And all this was a fault of a small function called
set_attack_type and it was because we were doing some URL encoding while taking information from SNARE to Redis but wasn’t doing the same after analyzing the data and sending it to POSTGRES.
- Support multiple filters in API: The API as well as the Web application needed to support the multiple filters same time. So say if you want all the sessions from two different IP addresses the API call would look like:
Now this looks easy and a lot of
python devs might see this and be like
oh you can split it on space and then it's all easy but it wasn’t that easy. I had to make sure that no unknown filters were passed and the biggest issue was that I had to generate a SQL query based on all the filters so it(API) can get the exact data out of the DB.
Testing: I think this is one of the most important things that I have learned this GSoC. Testing your code is very important and I kind of already knew this but I wasn’t doing this practically. But a lot of times I felt that I could have caught certain bugs right when they appeared if I had written the test cases for the newly written code right after finishing the code. The problem that I faced during the testing phase exactly “while writing them” but it was while running them. Once Pytest would finish running the tests it would show warnings about “unclosed connection”. Fixing this was a bit difficult because most of the libraries used in this project are async and I was quite confused with the working. The most difficult part was that sometimes when I locally ran the tests there were no warnings but in the Travis logs, we could see ike 5-10 warnings so it was hard testing it locally.
Shift to SQLAlchemy: This wasn’t the problem but I am writing about this because this definitely took some of our time. So in the starting I and my mentor thought that using raw SQL queries wouldn’t be as such of an issue. But once I made some intial changes we both felt that the code was getting a
bit dirtyand hard to understand. So we decided it would be better if we use SQLAlchemy, since this ORM would provide lot of functionality which would keep our code clean and simple.
So out of the suggested task, I was able to finish the task of adding support for persistent storage and the other small task like adding new template injections or updating the FTP support to use the async library. Even though I had completed the PostgreSQL task in around 2.5 months and could have tried to do another big task but then my mentor suggested that instead of trying to have a lot of new features, it would be better to improve the existing features that we have. That is why we decided not to go ahead with “changing the snare storage and cloning functionality” and decided to do other small tasks.
It’s very important to test your code. You can use whatever automated tools possible to try and test them. Also once you feel that you are done adding a new feature, write the tests for the code right away.
During the Proposal making phase I had written things like “Take test coverage to 100%” and “finish PostgreSQL integration in 2 weeks” but in reality, the DB integration took around ~2.5 months and the test coverage is at around 78% now 😅. So after the GSoC I have realized that it’s better to have few tasks in your timeline with clear deliverables, goals, etc rather than trying to add loads of tas without thinking them through.
Sometimes everything is right in front of you you just need to take a good look. This happened with me multiple times, the bug was right in front of me but I still had to ask my mentor about it.
Before this I had never worked on any DB related work. I mean I did studied DBMS in my 3rd year of college but never got a chance to work on a real world project so I learned a lot about PostgreSQL, Redis, SQLAlchemy etc.
I had planned that even after the GSoC ends I’ll continue to work on the project but due to this year being the final year of my degree my life has changed a lot. Campus placement has started in my college so now I have to prepare for the tests and interviews. Along with that, I have to prepare my thesis and a major project. So right now I am not sure when I’ll get back to contributing to the project.
In the end, I just want to thank the Honeynet Project org and my mentor for seeing the potential in me and selecting me for this year GSoC. Also other special thanks to the mentor for being so helpful and patient with me.