- We have a dataset containing information about the buses travelling in Bengaluru. We obtained it from the Bengaluru Metropolitan Transport Corporation (BMTC).
- The region of interest is an approximately 40 km by 40 km square area. See the following figure:
- The data was collected from around two thousand buses over one day, between 7:00am and 7:00pm.
- The buses follow different routes within the city.
- Each bus is identified with a unique ID. A bus carries a device which records the data: latitude, longitude, speed, and timestamp.
Create a model to estimate the travel time, in minutes, between source-destination pairs using the provided dataset.
We are providing the following three files in the dataset (download link):
- BMTC.parquet.gzip: It contains the GPS traces of around two thousand buses.
- Input.csv: It contains the geographical coordinates of various source-destination pairs.
- GroundTruth.csv: It contains the ground truth travel times between the source-destination pairs provided in Input.csv. It is provided to help participants assess their solutions.
The BMTC.parquet.gzip file contains information in five columns, described as follows:
- BusID: The (unique) ID associated with the device present in a bus.
- Latitude: Latitude (geographical coordinate) of a bus, as recorded by the device.
- Longitude: Longitude (geographical coordinate) of the bus, as recorded by the device.
- Speed: Instantaneous speed of the bus in kmph.
- Timestamp: Timestamp in Indian Standard Time (IST). The datetime format is yyyy-mm-dd HH:MM:SS.
For better understanding, the following is a snapshot from the dataset:
| BusID | Latitude | Longitude | Speed | Timestamp |
0 | 150212121 | 13.06593 | 77.45269 | 20 | 2019-08-01 18:59:18 |
1 | 150212121 | 13.06627 | 77.45211 | 27 | 2019-08-01 18:59:28 |
2 | 150212121 | 13.06661 | 77.45152 | 24 | 2019-08-01 18:59:38 |
3 | 150212121 | 13.06697 | 77.45089 | 28 | 2019-08-01 18:59:48 |
4 | 150212121 | 13.06727 | 77.45035 | 26 | 2019-08-01 18:59:58 |
5 | 150218000 | 13.00571 | 77.68619 | 46 | 2019-08-01 07:22:33 |
6 | 150218000 | 13.00525 | 77.68542 | 35 | 2019-08-01 07:22:42 |
7 | 150218000 | 13.00504 | 77.68509 | 0 | 2019-08-01 07:22:51 |
8 | 150218000 | 13.00504 | 77.68509 | 0 | 2019-08-01 07:23:01 |
9 | 150218000 | 13.00498 | 77.68497 | 13 | 2019-08-01 07:23:11 |
Note: The devices may not record the data at a uniform sampling interval. The recordings may also be noisy.
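Because the sampling interval is not guaranteed to be uniform, it can help to measure the gap between consecutive records of each bus before modelling. The following is a minimal sketch using pandas; the relative path `data/BMTC.parquet.gzip` and the helper names `load_traces` and `add_sampling_gaps` are assumptions made for illustration, and the sample rows are taken from the snapshot above.

```python
import pandas as pd


def load_traces(path="data/BMTC.parquet.gzip"):
    """Read the GPS traces (assumed relative path; needs a parquet engine such as pyarrow)."""
    return pd.read_parquet(path)


def add_sampling_gaps(traces):
    """Return a copy sorted per bus, with the seconds since that bus's previous record."""
    out = traces.copy()
    # Parse the yyyy-mm-dd HH:MM:SS strings; already-parsed datetimes pass through.
    out["Timestamp"] = pd.to_datetime(out["Timestamp"])
    out = out.sort_values(["BusID", "Timestamp"]).reset_index(drop=True)
    # The first record of each bus has no predecessor, so its gap is NaN.
    out["GapSeconds"] = out.groupby("BusID")["Timestamp"].diff().dt.total_seconds()
    return out


# Rows copied from the snapshot above, deliberately out of order.
sample = pd.DataFrame({
    "BusID": [150218000, 150212121, 150218000, 150218000, 150212121,
              150218000, 150218000],
    "Latitude": [13.00525, 13.06627, 13.00571, 13.00504, 13.06593,
                 13.00498, 13.00504],
    "Longitude": [77.68542, 77.45211, 77.68619, 77.68509, 77.45269,
                  77.68497, 77.68509],
    "Speed": [35, 27, 46, 0, 20, 13, 0],
    "Timestamp": ["2019-08-01 07:22:42", "2019-08-01 18:59:28",
                  "2019-08-01 07:22:33", "2019-08-01 07:22:51",
                  "2019-08-01 18:59:18", "2019-08-01 07:23:11",
                  "2019-08-01 07:23:01"],
})
gaps = add_sampling_gaps(sample)
print(gaps[["BusID", "Timestamp", "GapSeconds"]])
```

Unusually large or irregular values in `GapSeconds` are one possible signal of missing or noisy recordings; how to treat them is left to the participant.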
The Input.csv file contains four columns, described as follows:
- Source_Lat: The latitude of a source.
- Source_Long: The longitude of a source.
- Dest_Lat: The latitude of a destination.
- Dest_Long: The longitude of a destination.
For better understanding, the following is the format of a typical input file:
| Source_Lat | Source_Long | Dest_Lat | Dest_Long |
0 | 13.067272 | 77.45035 | 13.00525 | 77.68542 |
1 | 13.005042 | 77.68509 | 13.06627 | 77.45211 |
2 | 13.065925 | 77.45269 | 13.00498 | 77.68497 |
3 | 13.005247 | 77.68542 | 13.06661 | 77.45152 |
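A common first feature for a source-destination pair is the straight-line (great-circle) distance between the two coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6371 km, which is an approximation; the function name `haversine_km` is hypothetical, and the straight-line distance is only a lower bound on the distance a bus actually travels by road.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# The first two rows of the example input table above.
pairs = pd.DataFrame({
    "Source_Lat": [13.067272, 13.005042],
    "Source_Long": [77.45035, 77.68509],
    "Dest_Lat": [13.00525, 13.06627],
    "Dest_Long": [77.68542, 77.45211],
})
pairs_km = haversine_km(pairs["Source_Lat"], pairs["Source_Long"],
                        pairs["Dest_Lat"], pairs["Dest_Long"])
print(pairs_km)
```

Because the function works on whole columns at once, it can be applied to every row of Input.csv in a single call.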
The GroundTruth.csv file contains one column, TT, i.e. the actual travel time between a source-destination pair. The value in the i-th row corresponds to the travel time between the i-th source-destination pair in Input.csv.
For better understanding, the following is the format of a typical ground truth file:
| TT |
0 | 1.99 |
1 | 6.21 |
2 | 7.34 |
3 | 5.20 |
You can use the ground truth from the dataset to check if your code is working well.
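One way to run this check is to compute the mean absolute difference between the ground-truth and estimated travel times, which is the error measure named in the evaluation criteria of this document. The following is a minimal sketch; the helper name `mean_absolute_error` is our own, and the two lists reuse the illustrative TT and ETT values from the example tables in this document.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted| over all source-destination pairs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same length")
    return float(np.mean(np.abs(actual - predicted)))


tt = [1.99, 6.21, 7.34, 5.20]   # TT column of the example ground truth file
ett = [2.34, 5.51, 3.72, 5.13]  # ETT column of the example output
mae = mean_absolute_error(tt, ett)
print("MAE (minutes):", round(mae, 3))
```

With the real files, `tt` would come from the TT column of GroundTruth.csv and `ett` from the ETT column produced for the rows of Input.csv, in the same row order.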
Output (Estimated Travel Time)
Your output will be the estimated travel time (ETT), in minutes, between a given source-destination pair. For each source-destination pair, you should fill this value in the ETT column of a pandas dataframe, as illustrated below:
| Source_Lat | Source_Long | Dest_Lat | Dest_Long | ETT |
0 | 13.067272 | 77.45035 | 13.00525 | 77.68542 | 2.34 |
1 | 13.005042 | 77.68509 | 13.06627 | 77.45211 | 5.51 |
2 | 13.065925 | 77.45269 | 13.00498 | 77.68497 | 3.72 |
3 | 13.005247 | 77.68542 | 13.06661 | 77.45152 | 5.13 |
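As a rough sketch of how such a dataframe could be assembled, the code below copies the four input columns and appends an ETT column. The exact signature of EstimatedTravelTime() is not reproduced in this text, so the helper `with_ett` and the `placeholder_estimator` (a constant guess, not a real model) are hypothetical names used only to show the output shape.

```python
import numpy as np
import pandas as pd

INPUT_COLUMNS = ["Source_Lat", "Source_Long", "Dest_Lat", "Dest_Long"]


def with_ett(pairs, estimator):
    """Return the input pairs with an added ETT column (minutes)."""
    missing = [c for c in INPUT_COLUMNS if c not in pairs.columns]
    if missing:
        raise KeyError("missing input columns: {}".format(missing))
    out = pairs[INPUT_COLUMNS].copy()
    ett = np.asarray(estimator(out), dtype=float)
    if ett.shape != (len(out),):
        raise ValueError("estimator must return one value per row")
    out["ETT"] = ett
    return out


def placeholder_estimator(pairs):
    """Hypothetical stand-in: the same guess (5 minutes) for every pair."""
    return np.full(len(pairs), 5.0)


# The first two rows of the example input table.
pairs = pd.DataFrame({
    "Source_Lat": [13.067272, 13.005042],
    "Source_Long": [77.45035, 77.68509],
    "Dest_Lat": [13.00525, 13.06627],
    "Dest_Long": [77.68542, 77.45211],
})
result = with_ett(pairs, placeholder_estimator)
print(result)
```

Keeping the estimator as a separate function makes it straightforward to swap the placeholder for a model trained on the GPS traces without touching the code that builds the output.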
- Team size: At most two individuals.
- Programming language: Python (3.5 and beyond)
- Packages: Participants may use commonly available Python packages such as pandas, geopandas, numpy, scikit-learn, pyarrow, matplotlib, scipy, math, string, random, datetime, etc. If a submission uses other packages, we reserve the right not to consider or evaluate it for the Hackathon.
- Step 1: Download the dataset (download link). Create a folder named data and copy the three files from the dataset into it.
- Step 2: Create a GitHub repository as per the folder structure illustrated in the example below (see image):
where, for the purpose of illustration, srika_DS_456AB is the GitHub repository, and data contains the dataset files. (We will provide the repository name to each participating individual or team.)
We require the participants to keep their respective repositories private for the duration of the hackathon.
- Step 3: Build EstimatedTravelTime() within
Ensure that the following holds (refer to the above example for the folder structure):
- data folder contains the data file (BMTC.parquet.gzip), the input file (Input.csv), and the ground truth file (GroundTruth.csv). We are providing all these files. These files remain unchanged while a team works on the problem.
- The GitHub repository (srika_DS_456AB folder, in the above example) contains the following:
  - Python code: Name the Python code file as and build the function EstimatedTravelTime() that predicts the travel time between source-destination pairs. Use relative paths as per the template given below:
  - Instruction file: If there are any special instructions that are required for us to run your code, please add them in an Instructions.txt file.
- Step 4 (before the hackathon deadline): Add cnihackathon22 as a collaborator to your GitHub repository, and grant it Read access.
Evaluation criteria
- Teams should build models to estimate the travel time between a source and a destination, and the code should work beyond the provided inputs. The model should try to minimize the mean absolute difference between the actual and predicted values (\(L1\) error).
- We will fetch the last GitHub commit made before the submission deadline and evaluate it.
- We will have our own test input files against which the submissions will be evaluated.
- The Jury: We will shortlist the 5-6 best-performing teams and ask them to present and explain their submissions before a jury.
The leaderboard lists the valid submissions and the final results. Many congratulations to the winners!
Useful links
- Dataset: download link